I want to group a list of values per key and was doing something like this:
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).groupByKey().collect.foreach(println)
(red,CompactBuffer(zero, two))
(yellow,CompactBuffer(one))
But I noticed a Databricks blog post, "Avoid GroupByKey", which recommends not using groupByKey on large datasets.
Is there a way to achieve the same result using reduceByKey?
I tried this, but it concatenates all the values into a single string. By the way, in my case both the key and the value are of type String.
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).reduceByKey(_ ++ _).collect.foreach(println)
(red,zerotwo)
(yellow,one)
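(For context on why this happens: with String values, `_ ++ _` is string concatenation. A common workaround, sketched here and not part of the original question, is to wrap each value in a collection first so that `++` concatenates collections instead:)

```scala
sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .mapValues(List(_))    // wrap each value in a one-element List
  .reduceByKey(_ ++ _)   // now ++ is List concatenation, not String concatenation
  .collect
  .foreach(println)
// e.g. (red,List(zero, two))
//      (yellow,List(one))
```

Note that this still shuffles roughly the same amount of data as groupByKey, since the lists only grow during the reduce; the Databricks advice is mainly about aggregations that actually shrink the data (sums, counts, etc.).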