According to the Spark RDD docs:
All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently.
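To make the quoted behavior concrete, here's a minimal sketch of laziness in action (assuming a SparkSession named spark; the numbers and column name are just illustrative):

import spark.implicits._
val df = spark.range(1000000L).toDF("id")  // transformation: builds a logical plan, runs nothing
val evens = df.filter($"id" % 2 === 0)     // still a transformation: no job launched yet
val n = evens.count()                      // action: only now does Spark actually execute the plan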
There are times when I need certain operations on my dataframes to happen right then and there. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in code, there's very little guarantee that Spark will actually execute them inline with the rest of the code. For example:
val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'
// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
// Now do some stuff with 'unionDataFrame'...
So my workaround for this (so far) has been to run .show() or .count() immediately following my time-sensitive dataframe op, like so:
val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count() // Forces the union to execute/compute
// Now do some stuff with 'unionDataFrame'...
...which forces Spark to execute the dataframe op right then and there, inline.
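One wrinkle I've noticed with this trick: .count() forces execution but doesn't keep the result around, so a later action re-runs the union anyway. The variant I've seen suggested (a sketch, using the same hypothetical dataframes as above) pairs the action with cache():

val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.cache()  // marks the dataframe for persistence (itself lazy)
unionDataFrame.count()  // action: materializes the union into the cache
// Subsequent actions on 'unionDataFrame' read the cached data
// instead of recomputing the union from scratch.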
This feels awfully hacky/kludgy to me. So I ask: is there a more generally accepted and/or efficient way to force dataframe ops to happen on demand (and not be lazily evaluated)?