The resulting schema is correct, but I get every value twice
While schema is correct the output you've provided doesn't reflect actual result. In practice you'll get Cartesian product of timeStamp
and reading
for each input row.
My feeling is that there is something about the lazy evaluation
No, it has nothing to do with lazy evaluation. The way you use explode
is just wrong. To understand what is going on lets trace execution for date
equal 100:
val df100 = df.where($"date" === 100)
step by step. First explode
will generate two rows, one for 1 and one for 2:
val df100WithReading = df100.withColumn("reading", explode(df("data.reading")))
df100WithReading.show
// +------------------+----+------+-------+
// | data|date|userId|reading|
// +------------------+----+------+-------+
// |[[1,101], [2,102]]| 100| 1| 1|
// |[[1,101], [2,102]]| 100| 1| 2|
// +------------------+----+------+-------+
The second explode generate two rows (timeStamp
equal 101 and 102) for each row from the previous step:
val df100WithReadingAndTs = df100WithReading
.withColumn("timeStamp", explode(df("data.timeStamp")))
df100WithReadingAndTs.show
// +------------------+----+------+-------+---------+
// | data|date|userId|reading|timeStamp|
// +------------------+----+------+-------+---------+
// |[[1,101], [2,102]]| 100| 1| 1| 101|
// |[[1,101], [2,102]]| 100| 1| 1| 102|
// |[[1,101], [2,102]]| 100| 1| 2| 101|
// |[[1,101], [2,102]]| 100| 1| 2| 102|
// +------------------+----+------+-------+---------+
If you want correct results explode
data and select
afterwards:
val exploded = df.withColumn("data", explode($"data"))
.select($"userId", $"date",
$"data".getItem("reading"), $"data".getItem("timestamp"))
exploded.show
// +------+----+-------------+---------------+
// |userId|date|data[reading]|data[timestamp]|
// +------+----+-------------+---------------+
// | 1| 100| 1| 101|
// | 1| 100| 2| 102|
// | 1| 200| 3| 201|
// | 1| 200| 4| 202|
// +------+----+-------------+---------------+