scala - Using ReduceByKey to group list of values

Question


asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I want to group a list of values per key and was doing something like this:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).groupByKey().collect.foreach(println)

(red,CompactBuffer(zero, two))
(yellow,CompactBuffer(one))

But I noticed a blog post from Databricks that recommends not using groupByKey for large datasets:

Avoid GroupByKey

Is there a way to achieve the same result using reduceByKey?

I tried this, but it concatenates all the values into a single string. By the way, in my case both key and value are of type String.

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).reduceByKey(_ ++ _).collect.foreach(println)

(red,zerotwo)
(yellow,one)


1 Answer

深蓝 · Answer 1 · 2021-10-23T20:07:16+0000

Use aggregateByKey:

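import scala.collection.mutable.ListBuffer

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(ListBuffer.empty[String])(
    // seqOp: append a value to the buffer within a partition
    (numList, num) => { numList += num; numList },
    // combOp: merge buffers coming from different partitions
    (numList1, numList2) => { numList1.appendAll(numList2); numList1 })
  .mapValues(_.toList)
  .collect()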

Result: Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))

See this answer for details on aggregateByKey, and this link for the rationale behind using a mutable collection (ListBuffer).
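
To make the roles of the two functions passed to aggregateByKey more concrete, here is a rough sketch in plain Scala (no Spark needed) that imitates what happens to the sample data. The split into partition1 and partition2, and the helper aggregatePartition, are made up for illustration; this is only a model of the semantics, not how Spark implements it.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical split of the input across two partitions, for illustration only.
val partition1 = Seq(("red", "zero"), ("yellow", "one"))
val partition2 = Seq(("red", "two"))

// seqOp: fold one value into the per-partition buffer for its key.
val seqOp = (buf: ListBuffer[String], v: String) => { buf += v; buf }
// combOp: merge buffers for the same key coming from different partitions.
val combOp = (a: ListBuffer[String], b: ListBuffer[String]) => { a ++= b; a }

// Apply seqOp within one partition, starting each key from a fresh empty buffer.
def aggregatePartition(part: Seq[(String, String)]): Map[String, ListBuffer[String]] =
  part.foldLeft(Map.empty[String, ListBuffer[String]]) { case (acc, (k, v)) =>
    acc.updated(k, seqOp(acc.getOrElse(k, ListBuffer.empty[String]), v))
  }

val partials = Seq(partition1, partition2).map(aggregatePartition)

// Apply combOp across partitions, key by key.
val merged = partials.reduce { (left, right) =>
  right.foldLeft(left) { case (acc, (k, buf)) =>
    acc.updated(k, acc.get(k).map(combOp(_, buf)).getOrElse(buf))
  }
}

// Same final step as mapValues(_.toList) in the answer.
val result = merged.map { case (k, buf) => (k, buf.toList) }
result.foreach(println)
```

In this model every value still ends up in the merged buffer, so nothing is actually reduced before the merge step.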

EDIT:

Is there a way to achieve the same result using reduceByKey?

The aggregateByKey approach above actually performs worse; see the comments by @zero323 for details.
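
For reference, the reduceByKey variant asked about can be sketched as follows. This is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the app name and the names pairs and grouped are placeholders. Each value is wrapped in a one-element List first, so that ++ appends lists instead of concatenating strings. Every value still has to cross the shuffle, so it should not be expected to beat groupByKey.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; in spark-shell the existing `sc` is reused.
val conf = new SparkConf().setMaster("local[2]").setAppName("reduceByKey-sketch")
val sc = SparkContext.getOrCreate(conf)

val pairs = sc.parallelize(Seq(("red", "zero"), ("yellow", "one"), ("red", "two")))

// Wrap each value in a one-element List so that ++ appends lists
// rather than concatenating the strings themselves.
val grouped = pairs
  .mapValues(List(_))
  .reduceByKey(_ ++ _)
  .collect()
  .toMap

grouped.foreach(println)
```

The order of values inside each list depends on how the data is partitioned, so it should not be relied upon.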
