Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
720 views
in Technique[技术] by (71.8m points)

apache spark - How to split a list to multiple columns in Pyspark?

I have:

key   value
a    [1,2,3]
b    [2,3,4]

I want:

key value1 value2 value3
a     1      2      3
b     2      3      4

It seems that in scala I can write:df.select($"value._1", $"value._2", $"value._3"), but it is not possible in python.

So is there a good way to do this?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

It depends on the type of your "list":

  • If it is of type ArrayType():

    df = hc.createDataFrame(sc.parallelize([['a', [1,2,3]], ['b', [2,3,4]]]), ["key", "value"])
    df.printSchema()
    df.show()
    root
     |-- key: string (nullable = true)
     |-- value: array (nullable = true)
     |    |-- element: long (containsNull = true)
    

    you can access the values like you would with python using []:

    df.select("key", df.value[0], df.value[1], df.value[2]).show()
    +---+--------+--------+--------+
    |key|value[0]|value[1]|value[2]|
    +---+--------+--------+--------+
    |  a|       1|       2|       3|
    |  b|       2|       3|       4|
    +---+--------+--------+--------+
    
    +---+-------+
    |key|  value|
    +---+-------+
    |  a|[1,2,3]|
    |  b|[2,3,4]|
    +---+-------+
    
  • If it is of type StructType(): (maybe you built your dataframe by reading a JSON)

    df2 = df.select("key", psf.struct(
            df.value[0].alias("value1"), 
            df.value[1].alias("value2"), 
            df.value[2].alias("value3")
        ).alias("value"))
    df2.printSchema()
    df2.show()
    root
     |-- key: string (nullable = true)
     |-- value: struct (nullable = false)
     |    |-- value1: long (nullable = true)
     |    |-- value2: long (nullable = true)
     |    |-- value3: long (nullable = true)
    
    +---+-------+
    |key|  value|
    +---+-------+
    |  a|[1,2,3]|
    |  b|[2,3,4]|
    +---+-------+
    

    you can directly 'split' the column using *:

    df2.select('key', 'value.*').show()
    +---+------+------+------+
    |key|value1|value2|value3|
    +---+------+------+------+
    |  a|     1|     2|     3|
    |  b|     2|     3|     4|
    +---+------+------+------+
    

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...