Spark >= 2.4
Since Spark 2.4 VectorAssembler
extends HasHandleInvalid
. It means you can skip
:
assembler.setHandleInvalid("skip").transform(df).show
+---+---+---------+
| x1| x2| features|
+---+---+---------+
|3.0|4.0|[3.0,4.0]|
+---+---+---------+
keep
(note that ML algorithms are unlikely to handle this correctly):
assembler.setHandleInvalid("keep").transform(df).show
+----+----+---------+
| x1| x2| features|
+----+----+---------+
| 1.0|null|[1.0,NaN]|
|null| 2.0|[NaN,2.0]|
| 3.0| 4.0|[3.0,4.0]|
+----+----+---------+
or default to error
.
Spark < 2.4
There is nothing wrong with VectorAssembler
. Spark Vector
just cannot contain null
values.
import org.apache.spark.ml.feature.VectorAssembler
val df = Seq(
(Some(1.0), None), (None, Some(2.0)), (Some(3.0), Some(4.0))
).toDF("x1", "x2")
val assembler = new VectorAssembler()
.setInputCols(df.columns).setOutputCol("features")
assembler.transform(df).show(3)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<x1:double,x2:double>) => vector)
...
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
Null are not meaningful for ML algorithms and cannot be represented using scala.Double
.
You have to either drop:
assembler.transform(df.na.drop).show(2)
+---+---+---------+
| x1| x2| features|
+---+---+---------+
|3.0|4.0|[3.0,4.0]|
+---+---+---------+
or fill / impute (see also Replace missing values with mean - Spark Dataframe):
// For example with averages
val replacements: Map[String,Any] = Map("x1" -> 2.0, "x2" -> 3.0)
assembler.transform(df.na.fill(replacements)).show(3)
+---+---+---------+
| x1| x2| features|
+---+---+---------+
|1.0|3.0|[1.0,3.0]|
|2.0|2.0|[2.0,2.0]|
|3.0|4.0|[3.0,4.0]|
+---+---+---------+
nulls
.