We can calculate averages across keys in Spark using either combineByKey or foldByKey. The foldByKey approach is worked through below, with a combineByKey sketch at the end.
foldByKey
foldByKey(zeroValue)((accumulator, value) => { /* combine accumulator and value */ })
In Spark's Scala API the signature is foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)], so the zero value and both arguments of the combining function all share the RDD's value type.
Input Data:
employee,department,salary
e1,d1,100
e2,d1,500
e5,d2,200
e6,d1,300
e7,d3,200
e7,d3,500
The 1 at the end of each tuple is a record count, so one pass accumulates both the salary sum and the number of records per department. Because foldByKey is a fold, the zero value must have the same type as the RDD's values, here (Int, Int).
val data = sc.textFile("employees.csv")  // hypothetical path; any RDD[String] of the rows above works
// Parse "employee,department,salary" rows into (department, (salary, 1))
val depSalary = data.filter(!_.startsWith("employee"))  // skip the header row
  .map(_.split(',')).map(x => (x(1), (x(2).toInt, 1)))
// Zero value for the fold: (salary sum, record count)
val zero = (0, 0)
val depSalarySumCount = depSalary.foldByKey(zero)((acc, v) => (acc._1 + v._1, acc._2 + v._2))
// Average = salary sum / record count (integer division truncates)
val result = depSalarySumCount.map(x => (x._1, x._2._1 / x._2._2))
result.collect()
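With the six data rows above, the fold yields (d1,(900,3)), (d2,(200,1)) and (d3,(700,2)), so result.collect() should return Array((d1,300), (d2,200), (d3,350)) (key order may vary).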
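The combineByKey route mentioned at the start does the same job by building the (sum, count) pair explicitly. A minimal sketch, reusing the parsed rows from above; the names sumCount and avg are my own, not from the original:

val sumCount = depSalary.map { case (dep, (salary, _)) => (dep, salary) }
  .combineByKey(
    (salary: Int) => (salary, 1),                                      // createCombiner: first value seen for a key
    (acc: (Int, Int), salary: Int) => (acc._1 + salary, acc._2 + 1),   // mergeValue: fold another value into the pair
    (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))      // mergeCombiners: merge partial results across partitions
val avg = sumCount.map { case (dep, (sum, count)) => (dep, sum / count) }
avg.collect()

Unlike foldByKey, combineByKey lets the accumulated type differ from the value type, which is why the rows here can stay as plain (department, salary) pairs without the trailing 1.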