Let's look at df.rdd first. It is defined as:
    lazy val rdd: RDD[Row] = {
      // use a local variable to make sure the map closure doesn't capture the whole DataFrame
      val schema = this.schema
      queryExecution.toRdd.mapPartitions { rows =>
        val converter = CatalystTypeConverters.createToScalaConverter(schema)
        rows.map(converter(_).asInstanceOf[Row])
      }
    }
So first it runs queryExecution.toRdd, which prepares the execution plan from the operators used to build up the DataFrame and computes an RDD[InternalRow] that represents the result of that plan.
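To see this step for yourself, here is a minimal sketch that prints the physical plan behind a small DataFrame and then pulls out the underlying RDD[InternalRow]. It assumes Spark 2.x or later (SparkSession) running in local mode; the object name InspectPlan, the helper internalRows, and the sample data are made up for illustration. Note that queryExecution is a developer-facing API, so treat this as something for poking around rather than for production code.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

object InspectPlan {
  // Hypothetical helper: exposes the RDD[InternalRow] that df.rdd starts from.
  def internalRows(df: DataFrame): RDD[InternalRow] = df.queryExecution.toRdd

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for experimenting.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inspect-plan")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label").filter($"id" > 1)

    // The physical plan that Spark prepared from the DataFrame operators.
    println(df.queryExecution.executedPlan)

    // Only count the rows here: InternalRow instances are meant for internal
    // use, so this sketch avoids holding on to them.
    println(internalRows(df).count())

    spark.stop()
  }
}
```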
Next, the InternalRows of that RDD (which are for internal use only) are mapped to normal Rows. For each row, this entails the following:
    override def toScala(row: InternalRow): Row = {
      if (row == null) {
        null
      } else {
        val ar = new Array[Any](row.numFields)
        var idx = 0
        while (idx < row.numFields) {
          ar(idx) = converters(idx).toScala(row, idx)
          idx += 1
        }
        new GenericRowWithSchema(ar, structType)
      }
    }
So it loops over all fields of the row, converts each of them from Catalyst space to 'scala' space, and creates the final Row from them. toDF will pretty much do these things in reverse.
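A round trip through an RDD could look roughly like the sketch below. It assumes Spark 2.x or later (SparkSession) in local mode; the object name RddRoundTrip, the helper doubleIds, and the column names are invented for illustration, and the map step stands in for whatever RDD-only logic you actually need.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RddRoundTrip {
  // Hypothetical RDD-only step: doubles the first (Int) column.
  def doubleIds(spark: SparkSession, df: DataFrame): DataFrame = {
    // DataFrame -> RDD[Row]: every InternalRow is converted to a Row here.
    val rows: RDD[Row] = df.rdd

    // Plain RDD code; Catalyst cannot look inside this closure.
    val doubled: RDD[Row] = rows.map(r => Row(r.getInt(0) * 2, r.getString(1)))

    // RDD[Row] -> DataFrame: the rows are converted back to Catalyst's format.
    spark.createDataFrame(doubled, df.schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for experimenting.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-round-trip")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
    doubleIds(spark, df).show()

    spark.stop()
  }
}
```

Both conversions touch every field of every row, so the cost grows with the number of rows and columns.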
All of this will indeed have some impact on your performance. How much depends on how expensive these conversions are compared to the things you do with the data. The bigger potential impact, however, is that Spark's Catalyst optimizer can only optimize the operations between the conversions to and from RDDs, rather than optimizing the full execution plan as a whole. It would be interesting to see which operations you have trouble with; I find most things can be done using basic expressions or UDFs. Using modules that only work on RDDs is a very valid use case, though!
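As a rough illustration of staying inside the DataFrame API, here is a sketch that does the same kind of per-row work once with a built-in expression and once with a UDF. It assumes Spark 2.x or later; the object name StayInDataFrame, the helpers withBuiltin and withUdf, and the column names are made up. A UDF is still a black box to the optimizer, but the plan stays in one piece, so the operations around it can still be optimized together.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf, upper}

object StayInDataFrame {
  // Built-in expression: fully visible to Catalyst.
  def withBuiltin(df: DataFrame): DataFrame =
    df.withColumn("label_upper", upper(col("label")))

  // Hypothetical UDF for logic that has no built-in equivalent.
  private val shout = udf((s: String) => if (s == null) null else s.toUpperCase + "!")

  def withUdf(df: DataFrame): DataFrame =
    df.withColumn("label_shout", shout(col("label")))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for experimenting.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("stay-in-dataframe")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

    // Print the plans to compare what the optimizer gets to see in each case.
    withBuiltin(df).explain(true)
    withUdf(df).explain(true)

    spark.stop()
  }
}
```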