I am joining two big datasets using Spark RDD. One dataset is very much skewed so few of the executor tasks taking a long time to finish the job. How can I solve this scenario?
Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Short version:
2.1m questions
2.1m answers
60 comments
57.0k users