How To: Read a web page and get all the urls assigned to an HREF attribute via Regular Expression

A friend of mine IM'ed me today asking for help with a specific task that was assigned to him by his project manager. He is currently working on a project where the client is furious because they discovered that most of the links on their site were either broken (the vicious 404 errors) or not pointing to the right pages (misplaced links). His PM wasn't happy at all, so he asked me to help him create a program that would parse a website, get all the URLs found inside a page, and dump the result into a text file.

I had a little bit of free time, so I decided to help him by building this small application to show him how he can accomplish the task in C#.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;

namespace KeithRull.GiveMeUrls
{
   class Program
   {
      static void Main(string[] args)
      {
         //the url to scrape
         Uri urlToScrape = new Uri("http://www.devpinoy.org");
         //the list that would contain the urls recovered from the specified uri
         List<string> listOfUrls = GetAllUrlsFromUri(urlToScrape);

         string fileName = SaveToFile(listOfUrls);

         Console.WriteLine("Parsing completed! Urls saved to file: {0}", fileName);

         Console.ReadLine();
      }

      public static List<string> GetAllUrlsFromUri(Uri urlToScrape)
      {
         //the list that will hold the urls
         List<string> listOfUrls = new List<string>();
         //the search pattern: href="...", href='...' or an unquoted href value
         string searchPattern = "href\\s*=\\s*(?:\"(?<url>[^\"]*)\"|'(?<url>[^']*)'|(?<url>[^\\s>\"']+))";

         //get the contents of the page and put it into a string
         string pageContents = GetPageContents(urlToScrape);

         //our regular expression should ignore case
         Regex regEx = new Regex(searchPattern, RegexOptions.IgnoreCase);

         //get the first match generated by our regular expression
         Match match = regEx.Match(pageContents);

         //loop through all the matches
         while (match.Success)
         {
            //read only the captured url, not the whole href="..." text
            string urlFound = match.Groups["url"].Value.Trim();
            string urlToAdd = urlFound;

            //javascript: links are kept as is; everything else is resolved
            //against the scraped page, so relative urls (e.g: default.aspx)
            //become absolute and full urls stay untouched
            if (!urlFound.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
            {
               Uri resolvedUri;
               if (Uri.TryCreate(urlToScrape, urlFound, out resolvedUri))
               {
                  urlToAdd = resolvedUri.AbsoluteUri;
               }
            }

            //add the url only if it is not in our list yet
            if (!listOfUrls.Contains(urlToAdd))
            {
               listOfUrls.Add(urlToAdd);
            }
            //move to the next match result
            match = match.NextMatch();
         }

         //return the list of urls that we have recovered from the site
         return listOfUrls;
      }

      /// <summary>
      /// Reads a webpage and captures its html representation into a string
      /// </summary>
      /// <param name="urlToScrape">the website you want to read</param>
      /// <returns>the html representation of the site</returns>
      private static string GetPageContents(Uri urlToScrape)
      {
         //create a webrequest object for the url and treat it as an httpwebrequest
         HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(urlToScrape);
         //assign a timeout value (in milliseconds) for the request
         httpWebRequest.Timeout = 100000;

         //the using blocks close the response and the reader even when an error occurs;
         //any exception bubbles up to the caller with its stack trace intact
         using (HttpWebResponse httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse())
         using (StreamReader streamReader = new StreamReader(httpWebResponse.GetResponseStream()))
         {
            //read the contents of the stream and return the page contents
            return streamReader.ReadToEnd();
         }
      }

      /// <summary>
      /// Saves our list of urls to a text file
      /// </summary>
      /// <param name="listOfUrls">the list containing the urls</param>
      /// <returns>the filename created for the file</returns>
      public static string SaveToFile(List<string> listOfUrls)
      {
         //the file name
         string fileName = String.Format("{0}.{1}", Guid.NewGuid(), "txt");

         //create a streamwriter for our file; the using block closes it even if a write fails
         using (StreamWriter sw = File.CreateText(fileName))
         {
            //loop through each string in our collection
            foreach (string url in listOfUrls)
            {
               //write the string to our file
               sw.WriteLine(url);
            }
         }

         //return our filename
         return fileName;
      }
   }
}

Basically, the code accepts a URL, downloads that page, and uses a regular expression to find every string that matches our search pattern. Once it finishes processing the page, it dumps all those URLs into a text file.
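
To make the matching step concrete, here's a minimal sketch that runs the same kind of pattern against an in-memory string instead of a live page. It reads the named url group rather than the whole match, and it leans on Uri.TryCreate to turn relative links into absolute ones. HrefSketch, ExtractUrls and the sample HTML are names I made up for illustration, and the pattern here only handles quoted href values, so treat it as a sketch rather than a full parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefSketch
{
    // Hypothetical helper: returns each distinct href value found in html,
    // resolved against baseUri when the value is relative.
    public static List<string> ExtractUrls(Uri baseUri, string html)
    {
        // Verbatim string: "" stands for a single double quote.
        Regex hrefPattern = new Regex(
            @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
            RegexOptions.IgnoreCase);

        List<string> urls = new List<string>();
        foreach (Match match in hrefPattern.Matches(html))
        {
            // Only the captured value, without the href= prefix and quotes.
            string found = match.Groups["url"].Value;

            // Uri.TryCreate leaves absolute urls alone and resolves relative ones.
            Uri resolved;
            string url = Uri.TryCreate(baseUri, found, out resolved)
                ? resolved.AbsoluteUri
                : found;

            // Skip urls we have already seen.
            if (!urls.Contains(url))
            {
                urls.Add(url);
            }
        }
        return urls;
    }

    public static void Main()
    {
        // Sample markup, assumed for this sketch only.
        string html = "<a href=\"/forums\">Forums</a> <a HREF='default.aspx'>Home</a>"
                    + " <a href=\"http://example.com/a\">A</a> <a href=\"/forums\">Again</a>";

        foreach (string url in ExtractUrls(new Uri("http://www.devpinoy.org/blogs/"), html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Run against the sample string, this should list three absolute URLs, with the repeated /forums link reported only once.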

I sent the code to him and he was very happy with the result. Sweet!

Tuesday, November 06, 2007 8:21:58 PM (GMT Standard Time, UTC+00:00)

 

A Comparison of Open Source Bug Tracking Software in ASP.NET
We are currently using Gemini for our issue tracking, and we have loved it ever since we installed it on our server. Still, that hasn't stopped us from searching for a better alternative, because there are things we don't like about it (especially the cost of the software). That led me to scour the web for alternatives that we might consider in the future as a viable replacement for our long-trusted Gemini. To my surprise, I only found 4 open source ASP.NET bug tracking solutions, compared to the gargantuan list that I saw for PHP. Below are the 4 applications that I found and my comments about each project.
Friday, November 02, 2007 7:41:15 PM (GMT Standard Time, UTC+00:00)

 

ASP.NET was originally written in Java. What?

Sounds strange but it's true. Just ask Mark Anders and he'll tell you the complete story. ;)

Anders:
"... The original prototype was written in Java. I loved Java as a language and Scott (Guthrie) did too. So it was done in Java, and we took that around to lots of different groups. The first group that we took it to was the tools team. The VB and the InterDev teams were in a feud, and when they saw our demo they liked it. They said, 'If you build that, we will target it with our tools.'"

Thursday, November 01, 2007 10:07:10 PM (GMT Standard Time, UTC+00:00)

 

How To: Read the values of a GridViewRow and assign them to a control
A few days ago, a question was posted in the DevPinoy.org forums on how to read the values inside a GridViewRow and assign them to a Label control (or TextBox) when that row is selected. I wasn't able to reply to that thread early due to time constraints with a project I had at the time, but I told myself that I would answer it as soon as my schedule freed up. This article is a bit late (about 3 days, to be exact), but I still hope that it answers that person's question.
Thursday, November 01, 2007 9:33:42 PM (GMT Standard Time, UTC+00:00)

 

HowTo: Add custom events to your classes in C#
A colleague of mine asked me today about adding custom event handlers to a class. I explained the whole process to him and ended up doing a demo on how to accomplish this task. After we were done talking, it dawned on me that I haven't had the chance to blog code in weeks, so I think it's a good time to show something that is pretty helpful once you understand how to use it.
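
For the impatient, here's a minimal sketch of the usual pattern: an EventArgs subclass for the event data, an event declared with EventHandler<T>, and a protected On... method that raises it. The Thermostat class, its Temperature property and TemperatureChangedEventArgs are names I made up for illustration; they are assumptions, not the classes from the demo.

```csharp
using System;

// Hypothetical event data: carries the new temperature to subscribers.
public class TemperatureChangedEventArgs : EventArgs
{
    private readonly double _newValue;

    public TemperatureChangedEventArgs(double newValue)
    {
        _newValue = newValue;
    }

    public double NewValue
    {
        get { return _newValue; }
    }
}

// Hypothetical class that exposes a custom event.
public class Thermostat
{
    private double _temperature;

    // The custom event that other classes can subscribe to.
    public event EventHandler<TemperatureChangedEventArgs> TemperatureChanged;

    public double Temperature
    {
        get { return _temperature; }
        set
        {
            // Only raise the event when the value really changes.
            if (value == _temperature) return;
            _temperature = value;
            OnTemperatureChanged(new TemperatureChangedEventArgs(value));
        }
    }

    // Raises the event. Copying the delegate first avoids a null reference
    // if the last subscriber detaches between the check and the call.
    protected virtual void OnTemperatureChanged(TemperatureChangedEventArgs e)
    {
        EventHandler<TemperatureChangedEventArgs> handler = TemperatureChanged;
        if (handler != null)
        {
            handler(this, e);
        }
    }
}

public static class EventSketch
{
    public static void Main()
    {
        Thermostat thermostat = new Thermostat();

        // Subscribe with an anonymous method.
        thermostat.TemperatureChanged += delegate(object sender, TemperatureChangedEventArgs e)
        {
            Console.WriteLine("Temperature is now {0}", e.NewValue);
        };

        thermostat.Temperature = 21.5; // raises the event
        thermostat.Temperature = 21.5; // same value, so no event is raised
    }
}
```

The On... method is virtual so that a derived class can change how the event is raised without touching the subscribers.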
Wednesday, October 31, 2007 8:39:22 PM (GMT Standard Time, UTC+00:00)

 

A lesson in SQL Injection

Thanks Dave for making me laugh today!

Monday, October 29, 2007 11:34:52 PM (GMT Standard Time, UTC+00:00)

 

ASP.NET for ASM Developers

Modchip is going to love this!

ASP.NET: ASM to IL compiler

[Via Joe Stagner]

Monday, October 29, 2007 11:14:01 PM (GMT Standard Time, UTC+00:00)

 

How to: Get a list of fixed drives and their free space inside SQL Server

I didn't know that I could do this inside SQL Server:

EXEC master..xp_fixeddrives

Executing the procedure on my server gave me this result set:

drive  MB free
C       5897
E       33334

Man, I think it's about time I upgraded and brushed up on my SQL skills. I don't know why you would do this inside SQL Server, but then again it's pretty cool to know that something like this exists. You never know, you might need it someday.

SQL
Friday, September 07, 2007 7:36:46 PM (GMT Daylight Time, UTC+01:00)

 

How to: Truncate Multiple Tables In SQL Server and the magic of sp_MSforeachtable

I've been working a lot with SQL Server lately because of the project I'm currently assigned to, and this morning I found myself needing a query that would truncate all the tables in one of my staging databases. My initial thought was that I could do this with a cursor that holds all the TRUNCATE statements and executes them one at a time, so within 5 minutes I was able to build a query that looks like this:

--declare a variable that would hold the query to be executed
DECLARE @TruncateQuery nvarchar(4000)

-- create a cursor that would hold our truncate statements
DECLARE TruncateQuerries CURSOR LOCAL FAST_FORWARD
FOR SELECT
    N'TRUNCATE TABLE ' +
    QUOTENAME(TABLE_SCHEMA) +
    N'.' + QUOTENAME(TABLE_NAME)
FROM INFORMATION_SCHEMA.TABLES
WHERE
        TABLE_TYPE = 'BASE TABLE'
    AND    OBJECTPROPERTY    (
                            OBJECT_ID(QUOTENAME(TABLE_SCHEMA) +
                            N'.' + QUOTENAME(TABLE_NAME)
                        ), 'IsMSShipped') = 0

-- read our truncate statements
OPEN TruncateQuerries
-- loop thru each statement in our truncate statement cursor
FETCH NEXT FROM TruncateQuerries INTO @TruncateQuery
WHILE @@FETCH_STATUS = 0
BEGIN
    --execute the statement
    EXEC(@TruncateQuery)
    --fetch the next truncate statement into our @TruncateQuery variable
    FETCH NEXT FROM TruncateQuerries INTO @TruncateQuery 

END
-- close our cursor
CLOSE TruncateQuerries
-- and free up the resources
DEALLOCATE TruncateQuerries

Looks great, right? Then I remembered what my good friend Jon Galloway once told me: there are hidden stored procedures in SQL Server, and one of them is sp_MSforeachtable. It's an undocumented stored procedure, so you won't find anything about it in SQL Server Books Online. Basically, it lets you execute a command or a set of commands against every table in a database. Before we go into further detail, let's look at the parameters that sp_MSforeachtable expects.

exec @RETURN_VALUE=sp_MSforeachtable @command1, @replacechar, @command2,
@command3, @whereand, @precommand, @postcommand

Where (descriptions taken from [LINK]):

  • @RETURN_VALUE - is the return value which will be set by "sp_MSforeachtable"
  • @command1 - is the first command to be executed by "sp_MSforeachtable" and is defined as a nvarchar(2000)
  • @replacechar - is a character in the command string that will be replaced with the table name being processed (default replacechar is a "?")
  • @command2 and @command3 are two additional commands that can be run for each table, where @command2 runs after @command1, and @command3 will be run after @command2
  • @whereand - this parameter can be used to add additional constraints to help identify the rows in the sysobjects table that will be selected, this parameter is also a nvarchar(2000)
  • @precommand - is a nvarchar(2000) parameter that specifies a command to be run prior to processing any table
  • @postcommand - is also a nvarchar(2000) field used to identify a command to be run after all commands have been processed against all tables

Every parameter is optional except @command1, which is the first statement to be executed.

Now, let's rewrite our truncate table query from above to use the sp_MSforeachtable procedure:
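
If the order of those commands is hard to picture, here's a toy model in C# that simply expands them into a list of statements. To be loud about it: this is not how SQL Server implements the procedure and it never talks to a database. ForEachTableSketch, Expand and the table names are hypothetical, and the only things it models are the replacement character and the pre/per-table/post ordering described in the list above.

```csharp
using System;
using System.Collections.Generic;

public static class ForEachTableSketch
{
    // Toy model only: expands the commands in the order the parameter list
    // describes. It does not connect to SQL Server or mirror its internals.
    public static List<string> Expand(
        IEnumerable<string> tableNames, string command1, string command2,
        string command3, string preCommand, string postCommand, char replaceChar)
    {
        List<string> statements = new List<string>();

        // the pre-command runs once, before any table is processed
        if (preCommand != null) statements.Add(preCommand);

        foreach (string tableName in tableNames)
        {
            // command1, then command2, then command3 for each table
            foreach (string command in new string[] { command1, command2, command3 })
            {
                if (command == null) continue;
                // every replaceChar in the command becomes the current table name
                statements.Add(command.Replace(replaceChar.ToString(), tableName));
            }
        }

        // the post-command runs once, after all tables are done
        if (postCommand != null) statements.Add(postCommand);
        return statements;
    }

    public static void Main()
    {
        // Hypothetical table names, assumed for this sketch only.
        string[] tables = { "[dbo].[Orders]", "[dbo].[Customers]" };

        List<string> statements = Expand(
            tables, "SELECT COUNT(*) FROM ?", "PRINT 'done with ?'", null,
            "PRINT 'starting'", "PRINT 'finished'", '?');

        foreach (string statement in statements)
        {
            Console.WriteLine(statement);
        }
    }
}
```

With the two sample tables, the sketch lists the pre-command first, then both commands for each table in turn, and the post-command last.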

EXEC [sp_MSforeachtable] @command1="TRUNCATE TABLE ?"

What??? That's it? 1 line? Crazy, huh? I didn't realize it was that easy until I tried it. Man, if I had known this when I started writing my query above, I wouldn't have wasted 5 minutes of my life on something that can be done in 10 seconds. Thanks for hiding this feature, Microsoft!

But wait! There's more. Now that I've truncated my tables, I need a way to reseed all the identity columns in my database:

EXEC [sp_MSforeachtable] @command1="DBCC CHECKIDENT ('?', RESEED, 100)"

What??? Another one-liner? What's even better is that I can also REINDEX all the tables in my DB with one line of beautiful SQL code:

EXEC [sp_MSforeachtable] @command1="DBCC DBREINDEX('?')"

Wanna show a progress message on each execution? Try this version:

EXEC [sp_MSforeachtable] @command1="RAISERROR('DBCC DBREINDEX(''?'') ...',10,1) WITH NOWAIT DBCC DBREINDEX('?')"

Oh, Ma! I can't believe being a programmer could be this easy. ;) I hope I can save someone's precious time by providing this example, because I know it will save mine in the future.

SQL
Friday, September 07, 2007 7:07:58 PM (GMT Daylight Time, UTC+01:00)

 

C# 3.0 Language Specification

New to C#? Need to know what's in C# 3.0? Then download this 500-page book, courtesy of Microsoft. It's the most complete C# reference you can find, and it is primarily written by the engineers of the C# language.

Go download it here!

Thank you Charlie for the link!

Friday, September 07, 2007 5:58:37 PM (GMT Daylight Time, UTC+01:00)

 

All content © 2010, Keith Rull