Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
285 views
in Technique[技术] by (71.8m points)

Skewed dataset join in Spark?

I am joining two big datasets using Spark RDD. One dataset is very much skewed so few of the executor tasks taking a long time to finish the job. How can I solve this scenario?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

Short version:

  • Add random element to large RDD and create new join key with it
  • Add random element to small RDD using explode/flatMap to increase number of entries and create new join key
  • Join RDDs on new join key which will now be distributed better due to random seeding

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...