Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Different results from dplyr filter starting with identical data

When I tried to answer this question, I came across some very strange behavior. Below I define the same data twice, once directly as a data.frame and once using mutate, and check that the two results are identical. Then I apply the same filtering operation to both. For the first data set this works, but for the second (identical) data set it fails. Can anybody figure out why?

It seems that part of the reason for this difference is the use of ñ. But I don't understand why that is a problem for the second data set, but not for the first.

library(dplyr)
# define the same data twice
datos1 <- data.frame(año = 2001:2005, gedad = c(letters[1:5]), año2 = 2001:2005)
datos2 <- data.frame(año = 2001:2005, gedad = c(letters[1:5])) %>% mutate(año2 = año)
# check that they are identical
identical(datos1, datos2)
# do same operation
datos1 %>% filter(año2 >= 2003)
##    año gedad año2
## 1 2003     c 2003
## 2 2004     d 2004
## 3 2005     e 2005
datos2 %>% filter(año2 >= 2003)
## Error in filter_impl(.data, dots) : object 'año2' not found

Note: I don't believe that this is a duplicate of the original question because I ask why this difference occurs and the original post asked how to fix it.

EDIT: Since @Khashaa could not reproduce the error, here is my sessionInfo() output:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252    LC_MONETARY=German_Switzerland.1252
## [4] LC_NUMERIC=C                        LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.4.1
## 
## loaded via a namespace (and not attached):
## [1] assertthat_0.1  DBI_0.3.1       lazyeval_0.1.10 magrittr_1.5    parallel_3.1.2  Rcpp_0.11.4     tools_3.1.2  


1 Answer


I was able to reproduce the error on my machine, which has a Greek system locale, by switching R's locale to German_Switzerland.1252. I also noticed that, in the second case, both the error message and the variable name now show aρo2.

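It seems that mutate uses the system locale when creating the name of the new column, resulting in a conversion if that isn't the same as the locale used by the console. I was able to run a filter on datos2 that refers to the modified column name, although note that quoting the name turns it into a string constant, so every row is returned: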

library(dplyr)
Sys.setlocale("LC_ALL", "German_Switzerland.1252")
datos1 <- data.frame(año = 2001:2005, gedad = c(letters[1:5]), año2 = 2001:2005)
datos2 <- data.frame(año = 2001:2005, gedad = c(letters[1:5])) %>% mutate(año2 = año)

datos1 %>% filter(año2 >= 2003)
##    aρo gedad aρo2
## 1 2003     c 2003
## 2 2004     d 2004
## 3 2005     e 2005
datos2 %>% filter(año2 >= 2003)
##  Error in filter_impl(.data, dots) : object 'aρo2' not found
# the quoted name is a string constant, so the comparison is TRUE for every row
datos2 %>% filter("aρo2" >= 2003)
##    aρo gedad aρo2
## 1 2001     a 2001
## 2 2002     b 2002
## 3 2003     c 2003
## 4 2004     d 2004
## 5 2005     e 2005
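
As a rough sketch of how the comparison could be made against the column's values instead of against a quoted name, the column can be picked by position, so its name never has to be retyped in the console. This uses base R only and is not the original author's method; the data frame is a stand-in built with an escape sequence, and `col` and `res` are names introduced here for illustration.

```r
# Stand-in data: the third column has a non-ASCII name ("año2"),
# written with an escape so this snippet itself stays ASCII.
datos2 <- data.frame(x = 2001:2005, gedad = letters[1:5])
datos2[["a\u00f1o2"]] <- datos2$x

# Take the name from the data frame itself instead of typing it.
col <- names(datos2)[3]

# Compare the column's values, not a string constant.
res <- datos2[datos2[[col]] >= 2003, ]
res
```

Because the name is read back from `names()`, whatever bytes the data frame actually stores are the bytes used for the lookup.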

That a substituted character appeared in place of ñ in both cases in the original question probably means that the machine's system locale is set to code page 850, a Latin code page in which characters with diacritics have different codes than in Windows-1252.
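
To illustrate how one stored byte can turn into different characters, the sketch below decodes the single byte 0xF1 (the code for ñ in Windows-1252) under three code pages. It assumes that the local `iconv` build recognises the names "CP1252", "CP1253" and "CP850"; `iconvlist()` shows what is actually available. The variable names are illustrative only.

```r
# One byte, 0xF1, with no declared encoding.
byte <- rawToChar(as.raw(0xf1))

# Code page names are an assumption; check iconvlist() on your system.
code_pages <- c("CP1252", "CP1253", "CP850")

# Interpret the same byte under each code page and convert to UTF-8.
decoded <- vapply(
  code_pages,
  function(cp) iconv(byte, from = cp, to = "UTF-8"),
  character(1)
)
decoded
# CP1252 gives "ñ", CP1253 (Greek) gives "ρ", CP850 gives "±"
```

The Greek result matches the aρo seen in the output above.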

The "interesting" thing is that:

names(datos2)[[1]] == names(datos1)[[1]]
## [1] TRUE

Because

names(datos1)[[1]]
## [1] "aρo"

and

names(datos2)[[1]]
## [1] "aρo"

That would mean that R itself makes a mess of the conversions, and that it's filter that does a proper conversion.
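
One possible way to see how two names can compare equal and still differ underneath is to look at their declared encodings and raw bytes. This is a base R sketch under that assumption, not a reconstruction of what dplyr 0.4.1 did internally; `latin1_name` and `utf8_name` are made-up names.

```r
# The same text, "año", stored in two different encodings.
latin1_name <- rawToChar(as.raw(c(0x61, 0xf1, 0x6f)))  # bytes: a, 0xF1, o
Encoding(latin1_name) <- "latin1"                      # declare the bytes as Latin-1
utf8_name <- "a\u00f1o"                                # \u escapes yield a UTF-8 string

# `==` translates strings in different encodings before comparing them.
latin1_name == utf8_name

# The declared encodings and the underlying bytes are nevertheless different.
Encoding(c(latin1_name, utf8_name))
charToRaw(latin1_name)
charToRaw(utf8_name)
```

Code that compares names byte by byte, instead of through `==`, would treat these two as different, which is the kind of mismatch that `Encoding()` and `charToRaw()` can help track down.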

The moral of all this: don't use non-English characters in names, or ensure you use the same locale as the machine's (which is rather fragile).
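
A small check along those lines could flag column names that contain anything outside 7-bit ASCII before they cause trouble. The helper `non_ascii_names()` is hypothetical, written here in base R for illustration, and the data frame is a stand-in.

```r
# Hypothetical helper: return the column names that contain
# at least one byte outside the 7-bit ASCII range.
non_ascii_names <- function(df) {
  nms <- names(df)
  has_high_byte <- vapply(
    nms,
    function(nm) any(as.integer(charToRaw(nm)) > 127L),
    logical(1)
  )
  unname(nms[has_high_byte])
}

# Stand-in data; the first name is "año", written with an escape.
datos <- data.frame(x = 2001:2005, gedad = letters[1:5])
names(datos)[1] <- "a\u00f1o"

non_ascii_names(datos)
```

The check works on bytes, so it flags the name whether it is stored as Latin-1 or as UTF-8.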

UPDATE

There is semi-official confirmation that R does go through the system locale, because it assumes that this actually is the locale used by the system. Windows, though, uses UTF-16 throughout, and the "System Locale" is actually what the label in the "Regional Settings" box says: the locale used for legacy, non-Unicode applications.

If I remember correctly, "System Locale" used to be the locale of the overall system (including the UI language etc) before Windows 2000 and NT. Nowadays you can even have a different UI language per user but the name has stuck.
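
To see which locale and character set an R session is actually working with, something like the following base R sketch can help. The `codepage` element is only reported on Windows, so the last line returns NULL elsewhere; `info` is an illustrative name.

```r
# The character-type locale R is currently using.
Sys.getlocale("LC_CTYPE")

# Localization details of the running session.
info <- l10n_info()
info[["UTF-8"]]     # TRUE if the session runs in a UTF-8 locale
info[["Latin-1"]]   # TRUE if the session runs in a Latin-1 locale
info$codepage       # Windows only: the code page in use; NULL on other systems
```

Comparing this output between two machines is a quick way to check whether they would interpret the same bytes differently.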

