WinXP-x32, R-2.13.0
Dear list,
I have a problem that (I think) relates to the interaction between Windows and R.
I am trying to scrape a table with data on the Hawai'ian Islands. This is my R code:
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
The output is (first set of columns):
> Islands
  Island          Nickname             Location
1 Hawaiê?i[7]     The Big Island       19?°34a€2N 155?°30a€2W??? / ???19.567?°N 155.5?°W??? / 19.567; -155.5
2 Maui[8]         The Valley Isle      20?°48a€2N 156?°20a€2W??? / ???20.8?°N 156.333?°W??? / 20.8; -156.333
3 Kahoê?olawe[9]  The Target Isle      20?°33a€2N 156?°36a€2W??? / ???20.55?°N 156.6?°W??? / 20.55; -156.6
4 L?naê?i[10]     The Pineapple Isle   20?°50a€2N 156?°56a€2W??? / ???20.833?°N 156.933?°W??? / 20.833; -156.933
5 Molokaê?i[11]   The Friendly Isle    21?°08a€2N 157?°02a€2W??? / ???21.133?°N 157.033?°W??? / 21.133; -157.033
6 Oê?ahu[12]      The Gathering Place  21?°28a€2N 157?°59a€2W??? / ???21.467?°N 157.983?°W??? / 21.467; -157.983
7 Kauaê?i[13]     The Garden Isle      22?°05a€2N 159?°30a€2W??? / ???22.083?°N 159.5?°W??? / 22.083; -159.5
8 Niê?ihau[14]    The Forbidden Isle   21?°54a€2N 160?°10a€2W??? / ???21.9?°N 160.167?°W??? / 21.9; -160.167
As you can see, there are "weird" characters in the output. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8"), but neither helped.
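One variation that may be worth trying is to parse the document first, with the encoding stated explicitly, and only then extract the tables. The sketch below is not a confirmed fix. It uses a small inline HTML string as a stand-in for the Wikipedia page (an assumption, so that it runs without network access), and the names html, doc and islands are illustrative only.

```r
library(XML)

# A tiny stand-in for the real page, written with \u escapes so the script
# itself stays ASCII. U+02BB is the okina, U+00B0 the degree sign.
html <- enc2utf8(paste0(
  "<html><body><table>",
  "<tr><th>Island</th><th>Location</th></tr>",
  "<tr><td>Hawai\u02bbi</td><td>19.567\u00b0N</td></tr>",
  "</table></body></html>"
))

# Parse with the encoding declared up front, then read the tables from the
# parsed document rather than from the URL or raw text.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
islands <- tables[[1]]

# Inspect how R has marked the strings, independent of console display.
Encoding(islands[1, 1])
```

For the real page, htmlParse(u, encoding = "UTF-8") would take the place of the inline string. Even when the strings are stored correctly, the console may still fail to display characters that the current code page cannot represent.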
It seems to me that there may be an issue with the interaction of the Windows settings of the character set and R.
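A minimal sketch of the suspected mechanism, assuming the page is served as UTF-8 and its bytes are then interpreted as Windows-1252 (the code page shown in the locale). The sample string and the variable names are illustrative, not taken from the page.

```r
# The okina (U+02BB) and the degree sign (U+00B0), written as escapes so
# this script stays pure ASCII.
original <- enc2utf8("Hawai\u02bbi 19.567\u00b0N")

# The UTF-8 bytes as they would arrive from the web page.
bytes <- charToRaw(original)

# Misread those bytes as Windows-1252, one byte per character. Each
# two-byte UTF-8 sequence turns into two unrelated Latin characters.
misread <- iconv(rawToChar(bytes), from = "CP1252", to = "UTF-8")

# Undo the damage: write the characters back out as CP1252 bytes, then
# declare that those bytes are really UTF-8.
repaired <- iconv(misread, from = "UTF-8", to = "CP1252")
Encoding(repaired) <- "UTF-8"

identical(repaired, original)
```

If this is what happens, the okina would turn into a capital E with circumflex followed by a guillemet, which resembles the pattern in the output above. The round trip only works when every misread byte has a CP1252 equivalent, so it is a diagnostic rather than a general repair.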
sessionInfo()
gives
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.2-0.2
I have also attempted to make R use another locale by entering Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
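The warning is consistent with Windows not recognising POSIX-style locale names such as "en_US.UTF-8"; Windows names look like "Dutch_Netherlands.1252", as in the sessionInfo() output above. A small sketch for checking what the current session actually uses (the variable names are illustrative):

```r
# Ask R how it sees the current locale's character handling.
info <- l10n_info()

# TRUE only when the session runs in a UTF-8 locale.
info[["UTF-8"]]

# The character-type locale currently in effect.
ctype <- Sys.getlocale("LC_CTYPE")
ctype
```

On the systems described here, this would be expected to report a non-UTF-8 locale with code page 1252, though that is an inference from the sessionInfo() output rather than something verified.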
In addition, I have attempted to change the code page directly from the Windows command prompt, using chcp 65001 and variations of that, but that didn't change anything.
I noticed from searching the web that others have the same issue, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs under both WinXP-x32 and Win7-x86.
Is there a way to make R override the Windows settings, or can the issue be solved some other way?
I have also tried other websites, and the issue occurs whenever the text to be scraped contains an é, ü, ?, ?, et cetera.
Thank you,
Roger