You can achieve this using one left join.
Create Example DataFrames
Using the sample data provided by @Shankar Koirala in his answer.
data1 = [
(1, "a"),
(2, "b"),
(3, "c")
]
df1 = sqlCtx.createDataFrame(data1, ["id", "value"])
data2 = [
(1, "x"),
(2, "y")
]
df2 = sqlCtx.createDataFrame(data2, ["id", "value"])
Do a left join
Join the two DataFrames using a left join on the id
column. This will keep all of the rows in the left DataFrame. For the rows in the right DataFrame that don't have a matching id
, the value will be null
.
import pyspark.sql.functions as f
df1.alias('l').join(df2.alias('r'), on='id', how='left')
.select(
'id',
f.col('l.value').alias('left_value'),
f.col('r.value').alias('right_value')
)
.show()
#+---+----------+-----------+
#| id|left_value|right_value|
#+---+----------+-----------+
#| 1| a| x|
#| 3| c| null|
#| 2| b| y|
#+---+----------+-----------+
Select the desired data
We will use the fact that the unmatched id
s have a null
to select the final columns. Use pyspark.sql.functions.when()
to use the right value if it is not null, otherwise keep the left value.
df1.alias('l').join(df2.alias('r'), on='id', how='left')
.select(
'id',
f.when(
~f.isnull(f.col('r.value')),
f.col('r.value')
).otherwise(f.col('l.value')).alias('value')
)
.show()
#+---+-----+
#| id|value|
#+---+-----+
#| 1| x|
#| 3| c|
#| 2| y|
#+---+-----+
You can sort this output if you want the id
s in order.
Using pyspark-sql
You can do the same thing using a pyspark-sql query:
df1.registerTempTable('df1')
df2.registerTempTable('df2')
query = """SELECT l.id,
CASE WHEN r.value IS NOT NULL THEN r.value ELSE l.value END AS value
FROM df1 l LEFT JOIN df2 r ON l.id = r.id"""
sqlCtx.sql(query.replace("
", "")).show()
#+---+-----+
#| id|value|
#+---+-----+
#| 1| x|
#| 3| c|
#| 2| y|
#+---+-----+