If we decide to use subset
function, then we need to watch out:
For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.
So only non-NA values will be retained.
If you want to keep NA
cases, use logical or condition to tell R not to drop NA
cases:
subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`
Don't use directly (to be explained soon):
df2 <- df1[df1$Height < 40, ]
Example
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)
subset(df1, Height < 40 | is.na(Height))
# Height y
#1 NA 1
#2 2 2
#3 4 3
#4 NA 4
df1[df1$Height < 40, ]
# Height y
#1 NA NA
#2 2 2
#3 4 3
#4 NA NA
The reason that the latter fails, is that indexing by NA
gives NA
. Consider this simple example with a vector:
x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA 2 NA
We need to somehow replace those NA
with TRUE
. The most straightforward way is to add another "or" condition is.na(ind)
:
x[ind | is.na(ind)]
# [1] 1 2 3
This is exactly what will happen in your situation. If your Height
contains NA
, then logical operation Height < 40
ends up a mix of TRUE
/ FALSE
/ NA
, so we need replace NA
by TRUE
as above.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…