You can combine a when clause with a NULL literal and type casting as follows:
from pyspark.sql.functions import when, lit, col
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])
def replace(column, value):
    # Keep the value unless it equals `value`; otherwise emit NULL.
    return when(column != value, column).otherwise(lit(None))
df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## |  x|   y|
## +---+----+
## |  1| foo|
## |  2|null|
## +---+----+
This doesn't introduce a BatchPythonEvaluation step in the execution plan, so the rows never leave the JVM for a Python worker, and because of that it should be significantly more efficient than using a UDF.