I believe there are other aspects of the problem ignored by climbage and eliasah:
- code readability
- code maintainability
- codebase size
If an operation doesn't reduce the amount of data, it has to be, one way or another, semantically equivalent to groupByKey. Let's assume we have an RDD[(Int, String)]:
import scala.util.Random
Random.setSeed(1)
def randomString = Random.alphanumeric.take(Random.nextInt(10)).mkString("")
val rdd = sc.parallelize((1 to 20).map(_ => (Random.nextInt(5), randomString)))
and we want to concatenate all strings for a given key. With groupByKey it is pretty simple:
rdd.groupByKey.mapValues(_.mkString(""))
A naive solution with reduceByKey looks like this:
rdd.reduceByKey(_ + _)
It is short and arguably easy to understand, but it suffers from two issues:
- it is extremely inefficient, since it creates a new String object on every merge*
- it suggests that the operation you perform is less expensive than it really is, especially if you analyze only the DAG or the debug string
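To make the first issue concrete, here is a rough local sketch in plain Scala (no Spark involved). The input of 1000 five-character values for a single key is made up for illustration, and the sequential left-to-right merge order is an assumption; Spark merges within and across partitions, so the exact numbers would differ, but the pattern is the same.

```scala
// Hypothetical input: 1000 values of 5 characters each, all for one key.
val parts: Seq[String] = Seq.fill(1000)("abcde")

// reduce(_ + _): every merge allocates a new String that holds
// everything concatenated so far.
val viaPlus: String = parts.reduce(_ + _)

// Lengths of the intermediate Strings created by the 999 merges:
// 10, 15, ..., 5000. Their sum is the number of characters copied.
val charsCopiedByPlus: Long =
  parts.scanLeft(0L)(_ + _.length).drop(2).sum

// StringBuilder appends in place, so each character is copied roughly
// once (ignoring internal buffer growth).
val viaBuilder: String =
  parts.foldLeft(new StringBuilder)(_ ++= _).toString
val charsCopiedByBuilder: Long = parts.map(_.length.toLong).sum
```

Both approaches build the same 5000-character string, but under these assumptions the `+` version copies about 2.5 million characters along the way, while the builder copies about 5000.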
To deal with the first problem we need a mutable data structure:
import scala.collection.mutable.StringBuilder
rdd.combineByKey[StringBuilder](
(s: String) => new StringBuilder(s),
(sb: StringBuilder, s: String) => sb ++= s,
(sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)
).mapValues(_.toString)
It still suggests something other than what is really going on, and it is quite verbose, especially if repeated multiple times in your script. You can of course extract the anonymous functions
val createStringCombiner = (s: String) => new StringBuilder(s)
val mergeStringValue = (sb: StringBuilder, s: String) => sb ++= s
val mergeStringCombiners = (sb1: StringBuilder, sb2: StringBuilder) =>
sb1.append(sb2)
rdd.combineByKey(createStringCombiner, mergeStringValue, mergeStringCombiners)
but at the end of the day it still means additional effort to understand the code, increased complexity, and no real added value. One thing I find particularly troubling is the explicit inclusion of mutable data structures. Even if Spark handles almost all of the complexity, it means we no longer have elegant, referentially transparent code.
My point is: if you really reduce the amount of data, by all means use reduceByKey. Otherwise you make your code harder to write and harder to analyze, and gain nothing in return.
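For contrast, here is a minimal sketch of a case where the amount of data really does shrink. It assumes a spark-shell session in which `sc` is the predefined SparkContext; the `pairs` data and the name `totalLengthByKey` are made up for illustration.

```scala
// Assumes spark-shell, where `sc` (SparkContext) is already defined.
// Hypothetical sample data: (key, string) pairs.
val pairs = sc.parallelize(
  Seq((1, "a"), (1, "bcd"), (2, "ef"), (2, ""), (3, "ghij"))
)

// Each String is mapped to its length and the lengths are summed per key.
// The merged value is a single Int no matter how many values a key has,
// so map-side combining shrinks the data before the shuffle.
val totalLengthByKey = pairs.mapValues(_.length).reduceByKey(_ + _)

// Bring the (small) result back to the driver as a Map.
val result: Map[Int, Int] = totalLengthByKey.collect().toMap
```

Here the combined value stays the same size regardless of how many records share a key, which is exactly the situation reduceByKey is meant for. String concatenation is different: the combined value grows with every record.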
Note:
This answer is focused on the Scala RDD API. The current Python implementation is quite different from its JVM counterpart and includes optimizations which provide a significant advantage over a naive reduceByKey implementation in the case of groupBy-like operations.
For the Dataset API, see DataFrame / Dataset groupBy behaviour/optimization.
* See Spark performance for Scala vs Python for a convincing example