According to the Spark RDD docs:
All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently.
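To make the quoted behavior concrete, here's a minimal sketch of laziness in action (assuming a SparkSession named spark; the numbers and column name are just illustrative):

import spark.implicits._
val df = spark.range(1000000L).toDF("id")  // transformation: builds a logical plan, runs nothing
val evens = df.filter($"id" % 2 === 0)     // still a transformation: no job launched yet
val n = evens.count()                      // action: only now does Spark actually execute the plan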
There are times when I need certain operations on my dataframes to happen right then and there. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in code, there's very little guarantee that Spark will actually execute them inline with the rest of the code. For example:
val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'
// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
// Now do some stuff with 'unionDataFrame'...
So my workaround for this (so far) has been to run .show() or .count() immediately following my time-sensitive dataframe op, like so:
val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count() // Forces the union to execute/compute
// Now do some stuff with 'unionDataFrame'...
...which forces Spark to execute the dataframe op right then and there, inline.
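One wrinkle I've noticed with this trick: .count() forces execution but doesn't keep the result around, so a later action re-runs the union anyway. The variant I've seen suggested (a sketch, using the same hypothetical dataframes as above) pairs the action with cache():

val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.cache()  // marks the dataframe for persistence (itself lazy)
unionDataFrame.count()  // action: materializes the union into the cache
// Subsequent actions on 'unionDataFrame' read the cached data
// instead of recomputing the union from scratch.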
This feels awfully hacky/kludgy to me. So I ask: is there a more generally accepted and/or efficient way to force dataframe ops to happen on demand (and not be lazily evaluated)?