Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

Force character vector encoding from "unknown" to "UTF-8" in R

I have a problem with inconsistent encoding of a character vector in R.

The text file from which I read a table is encoded (via Notepad++) in UTF-8 (I also tried UTF-8 without BOM).

I want to read a table from this text file, convert it to a data.table, set a key, and use binary search. When I tried to do so, the following warning appeared:

Warning message: In [.data.table(poli.dt, "??onymi", mult = "first") : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.

and binary search does not work.

I realised that my data.table key column contains strings with both "unknown" and "UTF-8" encoding marks:

> table(Encoding(poli.dt$word))
unknown   UTF-8 
2061312 2739122 

I tried to convert this column (before creating the data.table object) using:

  • Encoding(word) <- "UTF-8"
  • word<- enc2utf8(word)

but with no effect.

I also tried a few different ways of reading a file into R (setting all helpful parameters, e.g. encoding = "UTF-8"):

  • data.table::fread
  • utils::read.table
  • base::scan
  • colbycol::cbc.read.table

but with no effect.

==================================================

My R.version:

> R.version
           _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          0.3                         
year           2014                        
month          03                          
day            06                          
svn rev        65126                       
language       R                           
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy  

My session info:

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250                LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.2 colbycol_0.8     filehash_2.2-2   rJava_0.9-6     

loaded via a namespace (and not attached):
[1] plyr_1.8.1     Rcpp_0.11.1    reshape2_1.2.2 stringr_0.6.2  tools_3.0.3   

1 Answer


The Encoding function returns "unknown" if a character string carries the "native encoding" mark (CP-1250 in your case) or if it is pure ASCII. To discriminate between these two cases, call:

library(stringi)
stri_enc_mark(poli.dt$word)
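As a rough base-R illustration of these two cases (no stringi needed), the sketch below shows that an ASCII string stays "unknown" even after an explicit attempt to mark it, while a non-ASCII string keeps its mark. The is_ascii helper and the sample vector are made up for this example.

```r
# Hypothetical helper (not part of base R or stringi):
# TRUE when every byte of a string is below 0x80, i.e. pure ASCII.
is_ascii <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

# Sample data: one ASCII word and one word written with \u escapes,
# which R marks as UTF-8.
words <- c("abc", "\u017c\u00f3\u0142w")

marks_before <- Encoding(words)

# Try to force the mark; ASCII elements are never given a declared encoding.
forced <- words
Encoding(forced) <- "UTF-8"
marks_after <- Encoding(forced)

print(marks_before)
print(marks_after)
print(is_ascii(words))
```

An "unknown" mark on an element for which is_ascii is TRUE is harmless; the elements worth inspecting are those that are "unknown" and not ASCII.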

To check whether each string is a valid UTF-8 byte sequence, call:

all(stri_enc_isutf8(poli.dt$word))

If it's not the case, your file is definitely not in UTF-8.
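If stringi is not available, a similar check can be sketched with validUTF8() from base R. Note the assumption: that function exists only in R 3.3.0 and later, so it would not run on the R 3.0.3 session shown in the question. The sample strings are invented for the example.

```r
# Sample data: ASCII, valid UTF-8, and a lone 0xF3 byte
# (the letter "ó" in CP-1250, but not a valid UTF-8 sequence).
ok_ascii <- "abc"
ok_utf8  <- "\u00f3"
bad      <- rawToChar(as.raw(0xf3))

# One logical per element: can these bytes be read as UTF-8?
valid <- validUTF8(c(ok_ascii, ok_utf8, bad))
print(valid)

# Positions of the elements that cannot be UTF-8.
invalid_idx <- which(!valid)
print(invalid_idx)
```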

I suspect that you haven't forced UTF-8 mode in the function that reads the data (try inspecting the contents of poli.dt$word to verify this). If my guess is right, try:

read.csv2(file("filename", encoding="UTF-8"))

or

poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
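The re-marking step can also be sketched in base R with enc2utf8(). This is only an illustration: enc2utf8() converts each element using its current mark (or the native encoding for unmarked elements), so it helps only when those marks and the locale describe the bytes correctly. The sample vector is made up; it mixes an ASCII string, a UTF-8-marked string and a latin1-marked copy of the same character.

```r
utf8_o   <- "\u00f3"                                       # marked UTF-8
latin1_o <- iconv(utf8_o, from = "UTF-8", to = "latin1")   # marked latin1
mixed    <- c("abc", utf8_o, latin1_o)

marks_mixed <- Encoding(mixed)
print(marks_mixed)

# Convert every element to UTF-8; ASCII elements stay "unknown".
fixed <- enc2utf8(mixed)
marks_fixed <- Encoding(fixed)
print(marks_fixed)

# After conversion the two copies of the character have identical bytes,
# which is what matters for a byte-wise comparison.
same_bytes <- identical(charToRaw(fixed[2]), charToRaw(fixed[3]))
print(same_bytes)
```

Since ASCII elements remain "unknown" after the conversion, a table(Encoding(...)) result showing both "unknown" and "UTF-8" is not by itself a sign that the conversion failed.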

If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
