Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
527 views
in Technique[技术] by (71.8m points)

python - C# WebClient Strange Characters

I am trying to download this webpage using the C# WebClient. It works perfectly with Python's urllib2, but with the C# WebClient the output file contains strange characters.

I have also tried setting the Encoding property on the WebClient, but that makes no difference.

public static string GetWebURL()
{
    string url = "http://bet.hkjc.com";
    WebClient webClient = new WebClient();
    webClient.Encoding = Encoding.UTF8;
    // Download the page as a string and save it to disk
    string html = webClient.DownloadString(url);
    File.WriteAllText("page.html", html);
    return html;
}

This is the output, with the strange characters:

a€1?¢?¥?2Qt?±wa€°pU?°?±?μQu?2?±tVP?’???—7v?–?—w q?H???¨*a€?%?|ga€“d?|?§%?|?¨?????o)???±r?(N.??,(Q(??,H?μU*I?-(?‘?J,Ka€???*??q)((a€U*T?’ea€°E ??ya€°I9?????‰?‰???…?…???1y%E?19 ??ia€°9?…???– %a?¢i Xa€h"(?‰-P?°U(???K?‰/?—???‰ON?1H/?£(5M?ˉ??4?????¤H??SlHu?°kP??kP?????£?ˉ+PP/La€????4&?μ???MCI_IS??+%?713?/17?¨   ?‰??fd!??   zJ????a€ P??S?2a€?KsS?3J?′ &MA  V?¨??K?2?′a€?RKa€?s2??? a???a?′2a€1}?2?“?3?3445?????=?-Wa€Z?a€????“ t|zj^jQbN<??1za€°?…??9a€°y??????yJ_?P-???”???“ch??e?|a€? ?μH&[?—r???¨Ca€?a??0?J%? a€? ?·a€?????P9Ud?|M???”???????–M?—???25?2 ?·?′?3V?·a€ (??M-JOM 

What should I do to see the HTML that is being sent?

1 Answer

0 votes
by (71.8m points)

You're looking at a compressed byte stream. You can tell by inspecting the headers of the HTTP response, for example with curl:

curl -I http://bet.hkjc.com/

The developer console of your browser will reveal the same headers:

HTTP/1.1 200 OK
Cache-Control: public, max-age=120, must-revalidate
Content-Length: 3615
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Expires: Wed, 29 Jun 2016 08:01:06 GMT
Vary: Accept-Encoding
Server: Microsoft-IIS/7.0
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Date: Wed, 29 Jun 2016 08:00:14 GMT
Via: 1.1 stjbwbwa52
Accept-Ranges: bytes

Notice that the Content-Encoding header says gzip. This means the response body you received is compressed with the gzip algorithm. The standard WebClient does not decompress it by default, but with a simple subclass the WebClient can learn new tricks:

public class DecompressWebClient : WebClient
{
    // moved common logic here
    public DecompressWebClient()
    {
        this.Encoding = Encoding.UTF8;
    }

    // This is the factory to create the webrequest
    protected override WebRequest GetWebRequest(Uri address)
    {
        // get the default one
        var request = base.GetWebRequest(address);
        // see if it is a HttpWebRequest
        var httpReq = request as HttpWebRequest;
        if (httpReq != null)
        {
            // add extra capabilities, like decompression
            httpReq.AutomaticDecompression = DecompressionMethods.GZip;
        }
        return request;
    }
}

The HttpWebRequest has a property AutomaticDecompression that, when set to DecompressionMethods.GZip, takes care of the decompression for us.

When you put the subclassed WebClient to use, your code will look like this:

string url = "http://bet.hkjc.com";
using(WebClient webClient = new DecompressWebClient())
{
    string html = webClient.DownloadString(url);
    File.WriteAllText("page.html", html);
}

The UTF-8 encoding is correct, as you can also see from the charset in the Content-Type header.

The top of the HTML file will look like this:

<html>
<head>
  <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7; IE=EmulateIE10"/>
  <meta name="application-name" content="香港賽馬會"/>
  <title>香港賽馬會</title>
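
If subclassing WebClient is not an option, a similar result can probably be had by downloading the raw bytes and inflating them yourself. The following is only a sketch under a few assumptions: GzipHelper, IsGzip and DecompressToString are made-up names for illustration (they are not part of the framework), and the body is assumed to be gzip-compressed UTF-8, as the headers above indicate. The check for the two leading bytes 0x1F 0x8B relies on the gzip file signature.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of the original answer or of .NET itself.
public static class GzipHelper
{
    // A gzip stream starts with the two signature bytes 0x1F 0x8B.
    public static bool IsGzip(byte[] data)
    {
        return data != null
            && data.Length >= 2
            && data[0] == 0x1F
            && data[1] == 0x8B;
    }

    // Inflates a gzip-compressed buffer and decodes the result as UTF-8
    // (assumption: the charset matches the Content-Type header shown above).
    public static string DecompressToString(byte[] gzipped)
    {
        using (var input = new MemoryStream(gzipped))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

To use it, fetch the bytes with webClient.DownloadData(url) instead of DownloadString, then call GzipHelper.DecompressToString on them when GzipHelper.IsGzip returns true, and fall back to Encoding.UTF8.GetString otherwise. Setting AutomaticDecompression as shown earlier is less code; this variant mainly helps when you already have the raw bytes on disk.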
