An extremely common error is failing to convert an HTTP response from bytes to characters correctly. To do this, you have to know the character encoding of the response. Ideally, this is specified as the "charset" parameter of the "Content-Type" header. But declaring it in the body itself, as an "http-equiv" attribute in a meta
tag, is also an option.
So it is surprisingly complicated to load a page into a String
correctly, and even third-party libraries like HttpClient don't offer a general solution.
Here's a simple implementation that will handle the most common case:
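import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
Pattern p = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
/* getContentType() returns null when the header is missing. */
String contentType = con.getContentType();
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
StringBuilder buf = new StringBuilder();
/* try-with-resources closes the reader and the underlying stream. */
try (Reader r = new InputStreamReader(con.getInputStream(), charset)) {
    while (true) {
        int ch = r.read();
        if (ch < 0)
            break;
        buf.append((char) ch);
    }
}
String str = buf.toString();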
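The snippet above only looks at the Content-Type header. For the other case mentioned, where the encoding is declared in a meta tag, something like the following rough sketch might serve as a fallback. It is not a complete solution: the class name `MetaCharsetSniffer`, its methods, and the 1024-byte scan window are all made up for illustration, and it assumes the page's encoding is ASCII-compatible (so it will not work for UTF-16, for example). It also assumes you have already read the whole body into a byte array.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class MetaCharsetSniffer {
    // Finds charset=... inside a <meta> tag. Covers both
    // <meta http-equiv="Content-Type" content="text/html; charset=x">
    // and the shorter <meta charset="x"> form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in a meta tag near the start of the
    // body, or the given fallback if none is found or the name is unknown.
    static Charset sniff(byte[] body, Charset fallback) {
        // Assumption: the declaration appears within the first 1024 bytes.
        int len = Math.min(body.length, 1024);
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
        // is readable here whatever the real (ASCII-compatible) encoding is.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    // Decodes the whole body using the sniffed charset.
    static String decode(byte[] body, Charset fallback) {
        return new String(body, sniff(body, fallback));
    }
}
```

You would typically call `MetaCharsetSniffer.decode(bytes, StandardCharsets.ISO_8859_1)` only when the Content-Type header carries no charset, since a charset given in the header normally takes precedence over one declared in the body.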