You're right that VectorAssembler chooses dense or sparse output per row based on whichever representation uses less memory.
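To make that decision concrete, here's a small sketch of the kind of size heuristic Spark uses (it mirrors the rule in `Vectors.compressed` in Spark's source, where sparse storage is chosen when roughly 1.5 * (nnz + 1) < size; treat the exact constant as an assumption, not a guarantee):

```python
def prefers_sparse(values):
    """Rough sketch of MLlib's dense-vs-sparse choice.

    Sparse storage costs roughly 12 bytes per stored element
    (a 4-byte index plus an 8-byte value) versus 8 bytes per
    element dense, hence the 1.5 ratio.
    """
    size = len(values)
    nnz = sum(1 for v in values if v != 0.0)
    return 1.5 * (nnz + 1.0) < size

print(prefers_sparse([0.0, 3.0, 0.0, 4.0]))  # 2 nonzeros out of 4 -> False (dense wins)
print(prefers_sparse([0.0] * 10 + [1.0]))    # 1 nonzero out of 11 -> True (sparse wins)
```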
You don't need a UDF to convert a SparseVector to a DenseVector; just use the toArray() method:
from pyspark.ml.linalg import SparseVector, DenseVector
a = SparseVector(4, [1, 3], [3.0, 4.0])
b = DenseVector(a.toArray())
Also, StandardScaler accepts SparseVector input unless you set withMean=True when creating it. If you do need to de-mean, you have to subtract a (presumably non-zero) value from every component, so the sparse vector won't be sparse any more.
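To see why de-meaning destroys sparsity, here's a small NumPy sketch (reusing the same values as the SparseVector above):

```python
import numpy as np

# A mostly-zero vector, matching SparseVector(4, [1, 3], [3.0, 4.0]).
x = np.array([0.0, 3.0, 0.0, 4.0])

# Subtracting the mean (1.75 here) shifts every component,
# so the previously-zero slots become -1.75.
centered = x - x.mean()

print(np.count_nonzero(x))         # 2
print(np.count_nonzero(centered))  # 4 -- no zeros left
```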