I am using Hive through Spark, and my Spark code runs an INSERT INTO query against a partitioned table. The input data is 200+ GB. When Spark writes to the partitioned table, it produces very small files (in the KB range), so the output partitioned table folder now contains 5000+ tiny files. I want to merge these into a few larger files of roughly 200 MB each. I tried using the Hive merge settings, but they don't seem to work.
// Enable dynamic partitioning
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
// Hive small-file merge settings
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles=true")
// JSON SerDe jar for the source table
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
// Dynamic-partition insert into the target table
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work under MapReduce-based Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?
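For reference, the kind of approach I am hoping exists is something along these lines (a rough, untested sketch, not something I have confirmed works; it assumes c is the column that feeds the date partition, since it is the last column in my SELECT, and that ~200 shuffle partitions would give files of roughly the right size):

// Spark's write path does not seem to honor the Hive merge settings, so the idea
// would be to shuffle the data by the partition column before the insert, so that
// each dynamic partition is written by only a few tasks and therefore a few files.
// The value 200 is only an illustrative guess and would need tuning for ~200 MB files.
hiveContext.sql("SET spark.sql.shuffle.partitions=200")

val result8 = hiveContext.sql(
  "INSERT INTO TABLE partition_table PARTITION (date) " +
  "SELECT a, b, c FROM partition_json_table DISTRIBUTE BY c")

Is this (or repartitioning/coalescing the DataFrame before writing) the recommended way to control output file sizes in Spark?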