I am currently implementing the algorithm from this paper, which describes a MapReduce approach to finding connected components: https://www.cse.unr.edu/~hkardes/pdfs/ccf.pdf
As a beginner in the Big Data world, I started by implementing the CCF-Iterate (with secondary sorting) algorithm on a small graph: 6 edges and 8 nodes. I'm running this code on the free tier of Databricks.
It takes about a minute to produce a result, which seems too long for such a small example. How can I reduce this time? What kinds of optimizations are possible? Any advice would be really appreciated.
The goal is to eventually run this algorithm on large graphs.
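For reference, if I understand the algorithm correctly, the test graph below has two components, {1, 2, 3, 4, 5} and {6, 7, 8}, so the result I expect is every node paired with the smallest node id of its component:

expected = [(2, 1), (3, 1), (4, 1), (5, 1), (7, 6), (8, 6)]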
PySpark code:
graph = sc.parallelize([(2, 3), (1, 2), (2, 4), (3, 5), (6, 7), (7, 8)])

counter_new_pair = sc.accumulator(1)
while counter_new_pair.value > 0:
    counter_new_pair = sc.accumulator(0)

    # CCF-Iterate with secondary sorting
    mapping_1 = graph.map(lambda x: (x[0], x[1]))
    mapping_2 = graph.map(lambda x: (x[1], x[0]))  # reversed edges
    fusion = mapping_1.union(mapping_2)  # both directions of every edge
    fusion = fusion.groupByKey().map(lambda x: (x[0], list(x[1])))
    fusion = fusion.map(lambda x: (x[0], sorted(x[1])))  # smallest neighbour first
    values = fusion.filter(lambda x: x[1][0] < x[0])  # keep keys that are not the local minimum
    key_min_value = values.map(lambda x: (x[0], x[1][0]))  # (key, min) pairs
    values = values.map(lambda x: (x[1][0], x[1][1:]))  # remaining neighbours, keyed by the minimum
    values = values.filter(lambda x: len(x[1]) != 0)
    values = values.flatMap(lambda x: [(val, x[0]) for val in x[1]])  # new (node, min) pairs
    values.foreach(lambda x: counter_new_pair.add(1))  # action: count new pairs found this iteration
    joined = values.union(key_min_value)

    # CCF-Dedup
    mapping = joined.map(lambda x: ((x[0], x[1]), None))
    graph = mapping.groupByKey().map(lambda x: (x[0][0], x[0][1]))
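For what it's worth, here is a variant I have been sketching (untested at scale, and I'm not sure it is fully equivalent or faster): emit both edge directions in a single flatMap instead of two maps plus a union, replace the dedup groupByKey with distinct(), count the new pairs with a single count() action instead of foreach plus an accumulator, and cache/checkpoint the graph so the lineage does not grow without bound across iterations. The ccf_iterate helper and the checkpoint path are my own, just for illustration.

sc.setCheckpointDir("/tmp/ccf-checkpoint")  # arbitrary path, only for this sketch

def ccf_iterate(graph):
    fusion = (graph
              .flatMap(lambda e: [(e[0], e[1]), (e[1], e[0])])  # both directions at once
              .groupByKey()
              .mapValues(sorted))                               # smallest neighbour first
    candidates = fusion.filter(lambda kv: kv[1][0] < kv[0])     # key is not the local minimum
    key_min = candidates.map(lambda kv: (kv[0], kv[1][0]))      # (key, min) pairs
    new_pairs = candidates.flatMap(
        lambda kv: [(v, kv[1][0]) for v in kv[1][1:]])          # new (node, min) pairs
    return key_min.union(new_pairs).distinct(), new_pairs       # distinct() replaces CCF-Dedup

graph = sc.parallelize([(2, 3), (1, 2), (2, 4), (3, 5), (6, 7), (7, 8)])
iteration, new_count = 0, 1
while new_count > 0:
    graph, new_pairs = ccf_iterate(graph)
    graph = graph.cache()
    new_count = new_pairs.count()  # single action per iteration
    iteration += 1
    if iteration % 5 == 0:
        graph.checkpoint()         # truncate the growing lineage every few iterations

I suspect most of the one minute is per-iteration job scheduling overhead rather than the data itself, which is why I focused on reducing the number of stages and actions per iteration, but I'd welcome corrections.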
Thanks
question from:
https://stackoverflow.com/questions/65921915/improve-pyspark-implementation-for-finding-connected-components-in-a-graph