

pyspark - Get top 10 words from column in spark

# Keep only rows in the 'india' headline category and sort them by publish date
date_range = mydata[mydata.headline_category == 'india'].sort('publish_date')
date_range.show()

+-------------------+-----------------+--------------------+ 
|       publish_date|headline_category|       headline_text|
+-------------------+-----------------+--------------------+ 
|2001-01-04 00:00:00|            india|Dudhwa tiger died...| 
|2001-01-05 00:00:00|            india|MP best in forest...| 
|2001-05-28 00:00:00|            india|India-Bangladesh ...| 
|2001-05-28 00:00:00|            india|Govt to modernise...| 
|2001-05-28 00:00:00|            india|Priyanka is the C...| 
|2001-05-28 00:00:00|            india|MPs riling Relian...| 
|2001-05-28 00:00:00|            india|CBI probing A-I's...| 
|2001-05-28 00:00:00|            india|Gujarat braces as...| 
|2001-05-28 00:00:00|            india|Ayodhya may force...| 
|2001-05-28 00:00:00|            india|3 new frigates to...| 
|2001-05-28 00:00:00|            india|Plea in SC challe...| 
|2001-05-28 00:00:00|            india|Kashmiri Sikhs pr...| 
|2001-05-28 00:00:00|            india|Bengal to revamp ...| 
|2001-05-29 00:00:00|            india|Rs 280 cr sanctio...| 
|2001-05-29 00:00:00|            india|DD Metro is up fo...| 
|2001-05-29 00:00:00|            india|Govt employees' n...| 
|2001-05-29 00:00:00|            india|BMS; Left to oppo...| 
|2001-05-29 00:00:00|            india|CBI vetting paper...| 
|2001-05-29 00:00:00|            india|Indo-Pak ties: Fr...| 
|2001-05-29 00:00:00|            india|BJP; Samata to st...|
+-------------------+-----------------+--------------------+

How can I find the top 10 words in the headline_text column for the headline_category 'india'?



1 Answer


You can split each headline into words, explode the resulting array so every word gets its own row, group by word, count the rows, and keep the 10 words with the highest counts.

import pyspark.sql.functions as F

result = (
    date_range
    # split each headline on spaces and explode so every word becomes its own row
    .withColumn('words', F.explode(F.split('headline_text', ' ')))
    .groupBy('words')          # group identical words together
    .count()                   # count occurrences of each word
    .orderBy(F.desc('count'))  # most frequent words first
    .limit(10)                 # keep the top 10
)
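
If you also want the counts to be case-insensitive and to ignore empty tokens produced by repeated spaces, a minimal variant could look like the sketch below (assuming the same date_range DataFrame as above; the top_words name is just illustrative):

import pyspark.sql.functions as F

# Sketch only: lower-case the headline, split on runs of whitespace,
# drop empty tokens, then count and keep the 10 most frequent words.
top_words = (
    date_range
    .withColumn('words', F.explode(F.split(F.lower('headline_text'), r'\s+')))
    .filter(F.col('words') != '')
    .groupBy('words')
    .count()
    .orderBy(F.desc('count'))
    .limit(10)
)

top_words.show()

Either way, calling .show() on the result prints the 10 most frequent words together with their counts.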
