

pyspark - Get top 10 words from column in spark

# Keep only rows in the 'india' headline category and sort them by publish date
date_range = mydata[mydata.headline_category == 'india'].sort('publish_date')
date_range.show()

+-------------------+-----------------+--------------------+ 
|       publish_date|headline_category|       headline_text|
+-------------------+-----------------+--------------------+ 
|2001-01-04 00:00:00|            india|Dudhwa tiger died...| 
|2001-01-05 00:00:00|            india|MP best in forest...| 
|2001-05-28 00:00:00|            india|India-Bangladesh ...| 
|2001-05-28 00:00:00|            india|Govt to modernise...| 
|2001-05-28 00:00:00|            india|Priyanka is the C...| 
|2001-05-28 00:00:00|            india|MPs riling Relian...| 
|2001-05-28 00:00:00|            india|CBI probing A-I's...| 
|2001-05-28 00:00:00|            india|Gujarat braces as...| 
|2001-05-28 00:00:00|            india|Ayodhya may force...| 
|2001-05-28 00:00:00|            india|3 new frigates to...| 
|2001-05-28 00:00:00|            india|Plea in SC challe...| 
|2001-05-28 00:00:00|            india|Kashmiri Sikhs pr...| 
|2001-05-28 00:00:00|            india|Bengal to revamp ...| 
|2001-05-29 00:00:00|            india|Rs 280 cr sanctio...| 
|2001-05-29 00:00:00|            india|DD Metro is up fo...| 
|2001-05-29 00:00:00|            india|Govt employees' n...| 
|2001-05-29 00:00:00|            india|BMS; Left to oppo...| 
|2001-05-29 00:00:00|            india|CBI vetting paper...| 
|2001-05-29 00:00:00|            india|Indo-Pak ties: Fr...| 
|2001-05-29 00:00:00|            india|BJP; Samata to st...|
+-------------------+-----------------+--------------------+

How can I find the top 10 words in the headline_text column for the headline_category 'india'?



1 Answer


You can split each headline into words, explode the resulting array so every word gets its own row, group by word, count the rows, and keep the 10 words with the highest counts.

import pyspark.sql.functions as F

result = (
    date_range
    # split each headline on spaces and explode so every word becomes its own row
    .withColumn('words', F.explode(F.split('headline_text', ' ')))
    .groupBy('words')          # group identical words together
    .count()                   # count occurrences of each word
    .orderBy(F.desc('count'))  # most frequent words first
    .limit(10)                 # keep the top 10
)
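
If you also want the counts to be case-insensitive and to ignore empty tokens produced by repeated spaces, a minimal variant could look like the sketch below (assuming the same date_range DataFrame as above; the top_words name is just illustrative):

import pyspark.sql.functions as F

# Sketch only: lower-case the headline, split on runs of whitespace,
# drop empty tokens, then count and keep the 10 most frequent words.
top_words = (
    date_range
    .withColumn('words', F.explode(F.split(F.lower('headline_text'), r'\s+')))
    .filter(F.col('words') != '')
    .groupBy('words')
    .count()
    .orderBy(F.desc('count'))
    .limit(10)
)

top_words.show()

Either way, calling .show() on the result prints the 10 most frequent words together with their counts.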
