Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

apache spark - PySpark: how to use a row value from one column to access another column whose name matches that value

I have a PySpark df:

+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|
+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1|
|  1|  2| 43|  8| 10| 20| 43| e1|
|  2|  3| 15|  0|  1| 23|  7| b1|
|  3|  4|  2|  6| 11|  5|  8| d1|
|  4|  5|  6|  7|  2|  8|  1| f1|
+---+---+---+---+---+---+---+---+

I eventually want to create another column "out" whose values are determined by the "ref" column. For example, in the first row the "ref" column has the value b1, so in the "out" column I would like to see the value of column "b1" for that row, i.e. 23. Here is the expected output:

+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1| 23|
|  1|  2| 43|  8| 10| 20| 43| e1| 20|
|  2|  3| 15|  0|  1| 23|  7| b1| 15|
|  3|  4|  2|  6| 11|  5|  8| d1| 11|
|  4|  5|  6|  7|  2|  8|  1| f1|  1|
+---+---+---+---+---+---+---+---+---+

Please advise on how to build the "out" column. I'm using Spark 1.6. Thanks.

1 Answer

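Independent of version, you can convert to an RDD, map over the rows, and convert back to a DataFrame (the example uses the spark session object from 2.x; on 1.6, call sqlContext.createDataFrame instead):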

df = spark.createDataFrame(
    [(0, 1, 23, 4, 8, 9, 5, "b1"), (1, 2, 43, 8, 10, 20, 43, "e1")], 
    ("id", "a1", "b1", "c1", "d1", "e1", "f1", "ref")
)

df.rdd.map(lambda row: row + (row[row.ref], )).toDF(df.columns + ["out"])
+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1| 23|
|  1|  2| 43|  8| 10| 20| 43| e1| 20|
+---+---+---+---+---+---+---+---+---+
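To make the per-row step easier to see, here is a minimal plain-Python sketch of the lookup with no Spark involved. The helper name lookup_ref is hypothetical, rows are modelled as dicts (roughly what Row.asDict() would return), and the third row's ref is deliberately changed to a name that does not exist to show the fallback.

```python
def lookup_ref(row, ref_col="ref", default=None):
    """Return the value of the column whose name is stored in row[ref_col].

    `row` is a plain dict (a stand-in for Row.asDict(), not a Spark Row).
    A ref that names no existing column yields `default` instead of raising.
    """
    return row.get(row.get(ref_col), default)


rows = [
    {"id": 0, "a1": 1, "b1": 23, "c1": 4, "d1": 8, "e1": 9, "f1": 5, "ref": "b1"},
    {"id": 1, "a1": 2, "b1": 43, "c1": 8, "d1": 10, "e1": 20, "f1": 43, "ref": "e1"},
    # Hypothetical row: "zz" is not a column name, so the lookup falls back to None
    {"id": 2, "a1": 3, "b1": 15, "c1": 0, "d1": 1, "e1": 23, "f1": 7, "ref": "zz"},
]

out = [lookup_ref(r) for r in rows]
print(out)  # [23, 20, None]
```

With Spark available, something like df.rdd.map(lambda r: tuple(r) + (lookup_ref(r.asDict()),)) should slot into the same pipeline, though that variant is untested here.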

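You could also preserve the schema:

from pyspark.sql.types import LongType, StructField, StructType

# Build a new schema; df.schema.add(...) would mutate the existing StructType in place
out_schema = StructType(df.schema.fields + [StructField("out", LongType())])

spark.createDataFrame(
    df.rdd.map(lambda row: row + (row[row.ref], )),
    out_schema)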

With the DataFrame API you can compose complex Column expressions instead of going through an RDD. In 1.6:

from pyspark.sql.functions import array, col, udf
from pyspark.sql.types import LongType, MapType, StringType

data_cols = [x for x in df.columns if x not in {"id", "ref"}]

# Literal map from column name to index
name_to_index = udf(
    lambda: {x: i for i, x in enumerate(data_cols)},
    MapType(StringType(), LongType())
)()

# Array of data
data_array = array(*[col(c) for c in data_cols])
df.withColumn("out", data_array[name_to_index[col("ref")]])
+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1| 23|
|  1|  2| 43|  8| 10| 20| 43| e1| 20|
+---+---+---+---+---+---+---+---+---+
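The 1.6 version does two lookups per row: column name to position, then position into the array of data values. A rough plain-Python model of that indirection is below. It has no Spark in it, and the function name lookup_by_index is made up for illustration.

```python
data_cols = ["a1", "b1", "c1", "d1", "e1", "f1"]

# Same mapping the zero-argument udf returns: column name -> position in the array
name_to_index = {name: i for i, name in enumerate(data_cols)}


def lookup_by_index(row):
    # Mirrors array(*[col(c) for c in data_cols]) evaluated for one row
    data_array = [row[c] for c in data_cols]
    # Mirrors data_array[name_to_index[col("ref")]]
    return data_array[name_to_index[row["ref"]]]


row = {"id": 1, "a1": 2, "b1": 43, "c1": 8, "d1": 10, "e1": 20, "f1": 43, "ref": "e1"}
print(name_to_index["e1"], lookup_by_index(row))  # 4 20
```

Since the values go through an array, the data columns are expected to share a type (all integers in this example).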

In 2.x you can skip intermediate objects:

from pyspark.sql.functions import create_map, lit, col
from itertools import chain

# Map from column name to column value
name_to_value = create_map(*chain.from_iterable(
    (lit(c), col(c)) for c in data_cols
))

df.withColumn("out", name_to_value[col("ref")])
+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1| 23|
|  1|  2| 43|  8| 10| 20| 43| e1| 20|
+---+---+---+---+---+---+---+---+---+
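The chain.from_iterable call is there because create_map takes alternating key and value arguments rather than pairs. A small sketch with strings standing in for the Column objects (so it runs without Spark) shows the flattened argument list:

```python
from itertools import chain

data_cols = ["a1", "b1", "c1", "d1", "e1", "f1"]

# Strings stand in for lit(c) and col(c); the real code passes Column objects
pairs = [(f"lit({c!r})", f"col({c!r})") for c in data_cols]

# Flatten [(k1, v1), (k2, v2), ...] into [k1, v1, k2, v2, ...]
flat_args = list(chain.from_iterable(pairs))

print(flat_args[:4])  # ["lit('a1')", "col('a1')", "lit('b1')", "col('b1')"]
print(len(flat_args))  # 12
```

Each literal column name is the map key and the column itself is the value, so indexing the map with col("ref") picks the matching value per row.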

Finally, you can use when:

from pyspark.sql.functions import col, lit, when
from functools import reduce

# Fold the columns into one nested when(...).otherwise(...) expression
out = reduce(
    lambda acc, x: when(col("ref") == x, col(x)).otherwise(acc), 
    data_cols,
    lit(None)
)

df.withColumn("out", out)
+---+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|out|
+---+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1| 23|
|  1|  2| 43|  8| 10| 20| 43| e1| 20|
+---+---+---+---+---+---+---+---+---+
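If the reduce is hard to read, it may help to run the same fold over strings and look at the nesting it produces. This is only an illustration of the fold order, written as SQL-like text with three of the columns. It is not the expression Spark actually builds.

```python
from functools import reduce

data_cols = ["a1", "b1", "c1"]

# Same shape as the when/otherwise fold, but building a readable string
case_expr = reduce(
    lambda acc, x: f"CASE WHEN ref = '{x}' THEN {x} ELSE {acc} END",
    data_cols,
    "NULL",  # plays the role of lit(None)
)

print(case_expr)
# CASE WHEN ref = 'c1' THEN c1 ELSE CASE WHEN ref = 'b1' THEN b1 ELSE CASE WHEN ref = 'a1' THEN a1 ELSE NULL END END END
```

The last column ends up outermost, and a ref that matches no column falls through every branch to the initial NULL.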
