
Grab the content of a (GZIP) webpage using C#

This code snippet demonstrates how to grab the content of a webpage and put it in a string variable. It can be used in applications such as a web crawler or spider. Since bandwidth is usually an issue, I have also added code that requests and handles GZIP/DEFLATE compressed content. Because most web content is text based, compression can save up to 80% of the bandwidth needed. This feature can easily be disabled by removing the 'Accept-Encoding' line in the first snippet. Now let's start! First we need a routine that grabs content from a valid URL:
public string GrabURL(string in_URL)
{
  try
  {
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(in_URL);
    webRequest.Timeout = 6000;
    webRequest.ReadWriteTimeout = 8000;
   
    //Accept GZIP and DEFLATE compressed content.
    //You can decide to disable this part. Then the decompression functions
    //are not needed.
    webRequest.Headers.Add("Accept-Encoding: deflate, gzip");
   
    //Define the user agent
    webRequest.UserAgent = "MyUserAgent";
   
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (Stream responseStream = webResponse.GetResponseStream())
    {
      //Content-Encoding names the compression used (if any), not the character set.
      string contentEncoding = (webResponse.ContentEncoding ?? "").ToLowerInvariant();

      //Decompress the content when GZIP compression is used.
      if (contentEncoding.IndexOf("gzip") > -1)
      {
         return DecompressGzip(responseStream);
      }
      //Decompress the content when DEFLATE compression is used.
      else if (contentEncoding.IndexOf("deflate") > -1)
      {
         return DecompressDeflate(responseStream);
      }
      //Otherwise read the uncompressed content directly.
      else
      {
        using (StreamReader responseReader = new StreamReader(responseStream))
        {
          return responseReader.ReadToEnd();
        }
      }
    }
  }
  catch
  {
    return "error";
  }
}
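As an alternative to decompressing by hand, HttpWebRequest can handle compression itself through its AutomaticDecompression property. The sketch below is not part of the original snippet: the class name RequestSketch and the method name CreateRequest are assumptions made for illustration. With this property set, the framework sends the Accept-Encoding header and unpacks the response stream, so the manual header line and the decompression functions would not be needed.

```csharp
using System;
using System.Net;

public static class RequestSketch
{
    // Hypothetical helper: builds a request that lets the framework
    // negotiate and undo GZIP/DEFLATE compression on its own.
    public static HttpWebRequest CreateRequest(string in_URL)
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(in_URL);
        webRequest.Timeout = 6000;
        webRequest.ReadWriteTimeout = 8000;
        webRequest.UserAgent = "MyUserAgent";

        // Replaces the manual "Accept-Encoding" header line.
        webRequest.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return webRequest;
    }
}
```

The response of such a request can then be read with a plain StreamReader, exactly as in the uncompressed branch of GrabURL.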
If you want to remove the HTML formatting from the webpage, you might be interested in the snippet "How to remove html tags from web content". If you want to extract all URL/anchor combinations from a webpage, you might be interested in the snippet "How to extract url and anchor from html content". Of course, when you have enabled GZIP and/or DEFLATE encoding, the decompression routines for GZIP and DEFLATE are needed. Both rely on the System.IO.Compression namespace:
using System.IO.Compression;
First we need a function that handles GZIP-encoded content:
private string DecompressGzip(Stream in_InputStream)
{
    Stream lv_OutputStream = new MemoryStream();

    try
    {
        byte[] lv_Buffer = new byte[4096];

        using (GZipStream lv_gzip = new GZipStream(
               in_InputStream, CompressionMode.Decompress))
        {
            int i;
            while ((i = lv_gzip.Read(lv_Buffer, 0, lv_Buffer.Length)) != 0)
            {
                lv_OutputStream.Write(lv_Buffer, 0, i);
            }
        }
    }
    catch(Exception ex)
    {
        WriteEventLog(ex.Message);
    }

    return Stream2String(lv_OutputStream);
}
Then we need the routine to decompress DEFLATE encoded content:
private string DecompressDeflate(Stream in_InputStream)
{
    Stream lv_OutputStream = new MemoryStream();

    try
    {
        byte[] lv_Buffer = new byte[4096];

        using (DeflateStream lv_Deflate = new DeflateStream(
               in_InputStream, CompressionMode.Decompress))
        {
            int i;
            while ((i = lv_Deflate.Read(lv_Buffer, 0, lv_Buffer.Length)) != 0)
            {
                lv_OutputStream.Write(lv_Buffer, 0, i);
            }
        }
    }
    catch (Exception ex)
    {
        WriteEventLog(ex.Message);
    }

    return Stream2String(lv_OutputStream);
}
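Both routines call Stream2String and WriteEventLog, which are not shown in this article. A minimal sketch of what they might look like follows. The bodies are assumptions, not the author's code: the sketch assumes the content is UTF-8 text, and it writes log messages to the error console instead of a real event log. The class name HelperSketch is only there to make the example self-contained.

```csharp
using System;
using System.IO;
using System.Text;

public static class HelperSketch
{
    // Assumed helper: rewinds the stream and reads it as UTF-8 text.
    public static string Stream2String(Stream in_Stream)
    {
        // The decompression loops leave the position at the end of the data.
        in_Stream.Position = 0;
        using (StreamReader lv_Reader = new StreamReader(in_Stream, Encoding.UTF8))
        {
            return lv_Reader.ReadToEnd();
        }
    }

    // Assumed helper: stands in for real event-log writing.
    public static void WriteEventLog(string in_Message)
    {
        Console.Error.WriteLine(in_Message);
    }
}
```

The position reset matters: without it, ReadToEnd would start at the end of the MemoryStream and return an empty string.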
After implementing the code above, the function GrabURL can be used in the following way:
string lv_HTML;

lv_HTML = GrabURL("http://www.dotnet4all.com");



© 2008-2009 webbgirrl.com