I have a dataframe like this:
df = spark.createDataFrame([(0, ["B","C","D","E"]),(1,["E","A","C"]),(2, ["F","A","E","B"]),(3,["E","G","A"]),(4,["A","C","E","B","D"])], ["id","items"])
which creates a DataFrame df like this:
+---+---------------+
| id|          items|
+---+---------------+
|  0|   [B, C, D, E]|
|  1|      [E, A, C]|
|  2|   [F, A, E, B]|
|  3|      [E, G, A]|
|  4|[A, C, E, B, D]|
+---+---------------+
I would like to get a result like this:
+---+-----+
|all|count|
+---+-----+
| F| 1|
| E| 5|
| B| 3|
| D| 2|
| C| 3|
| A| 4|
| G| 1|
+---+-----+
This essentially finds all distinct elements in df["items"]
and counts their frequencies. If my data were of a more manageable size, I would just do this:
from pyspark.sql.functions import explode

all_items = df.select(explode("items").alias("all"))
result = all_items.groupBy("all").count()  # groupBy already yields one row per distinct key
result.show()
But because my data has millions of rows and thousands of elements in each list, this is not an option. I was thinking of doing this row by row, so that I only ever work with two lists at a time. Because most elements are repeated across many rows (though the list in each row is a set), this approach should solve my problem. The trouble is that I don't really know how to express this in Spark, as I've only just started learning it. Could anyone help, please?
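For reference, here is a rough, untested sketch of the pairwise merging I have in mind, using an RDD treeAggregate over collections.Counter. The choice of treeAggregate, the Counter representation, and depth=2 are all my own guesses rather than anything I found in the Spark docs:

from collections import Counter

# Sketch of the row-by-row idea: fold each row's items into a Counter,
# then merge the partial Counters two at a time. treeAggregate does the
# pairwise combining across partitions; depth=2 is a guessed merge depth.
counts = df.rdd.treeAggregate(
    Counter(),
    lambda acc, row: acc + Counter(row["items"]),  # fold one row's items in
    lambda a, b: a + b,                            # merge two partial counts
    depth=2,
)

# Turn the final Counter back into a DataFrame with the desired shape.
result = spark.createDataFrame(list(counts.items()), ["all", "count"])
result.show()

I don't know whether merging Python Counters like this scales any better than the explode approach, which is part of what I'm asking.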