Spark 2.4.0
As of Spark 2.4, Spark SQL supports higher-order functions that are to manipulate complex data structures, including arrays.
The "modern" solution would be as follows:
scala> input.show(false)
+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details |
+-------+-------+-------------------------+
|10 |Finance|[100, 200, 300, 400, 500]|
|20 |IT |[10, 20, 50, 100] |
+-------+-------+-------------------------+
input.createOrReplaceTempView("mytable")
val sqlText = "select dept_id, dept_nm, aggregate(emp_details, 0, (acc, value) -> acc + value) as sum from mytable"
scala> sql(sqlText).show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
You can find a good reading on higher-order functions in the following articles and video:
- Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4
- Working with Nested Data Using Higher Order Functions in SQL on Databricks
- An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell (Databricks)
Spark 2.3.2 and earlier
DISCLAIMER I would not recommend this approach (even though it got the most upvotes) because of the deserialization that Spark SQL does to execute Dataset.map
. The query forces Spark to deserialize the data and load it onto JVM (from memory regions that are managed by Spark outside JVM). That will inevitably lead to more frequent GCs and hence make performance worse.
One solution would be to use Dataset
solution where the combination of Spark SQL and Scala could show its power.
scala> val inventory = Seq(
| (10, "Finance", Seq(100, 200, 300, 400, 500)),
| (20, "IT", Seq(10, 20, 50, 100))).toDF("dept_id", "dept_nm", "emp_details")
inventory: org.apache.spark.sql.DataFrame = [dept_id: int, dept_nm: string ... 1 more field]
// I'm too lazy today for a case class
scala> inventory.as[(Long, String, Seq[Int])].
map { case (deptId, deptName, details) => (deptId, deptName, details.sum) }.
toDF("dept_id", "dept_nm", "sum").
show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
I'm leaving the slice part as an exercise as it's equally simple.