I'm using Spark 1.6.1 and encountering a strange behaviour: I'm running an UDF with some heavy computations (a physics simulations) on a dataframe containing some input data, and building up a result-Dataframe containing many columns (~40).
Strangely, my UDF is called more than once per Record of my input Dataframe in this case (1.6 times more often), which I find unacceptable because its very expensive. If I reduce the number of columns (e.g. to 20), then this behavior disappears.
I managed to write down a small script which demonstrates this:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf
object Demo {
case class Result(a: Double)
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val numRuns = sc.accumulator(0) // to count the number of udf calls
val myUdf = udf((i:Int) => {numRuns.add(1);Result(i.toDouble)})
val data = sc.parallelize((1 to 100), numSlices = 5).toDF("id")
// get results of UDF
var results = data
.withColumn("tmp", myUdf($"id"))
.withColumn("result", $"tmp.a")
// add many columns to dataframe (must depend on the UDF's result)
for (i <- 1 to 42) {
results=results.withColumn(s"col_$i",$"result")
}
// trigger action
val res = results.collect()
println(res.size) // prints 100
println(numRuns.value) // prints 160
}
}
Now, is there a way to solve this without reducing the number of columns?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…