I'm trying to take a simple random sample with Scala from an existing table containing around 100 million records.
import org.apache.spark.sql.SaveMode

val nSamples = 3e5.toInt  // target sample size: 300,000 rows
val frac = 1e-5           // per-row sampling probability

// Sample without replacement, cap at nSamples rows, and save as a new table
val table = spark.table("db_name.table_name").sample(false, frac).limit(nSamples)
(table
  .write
  .mode(SaveMode.Overwrite)
  .saveAsTable("db_name.new_name")
)
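As a sanity check on the numbers (this snippet is my own addition, not part of the job itself), the expected sample size can be estimated before writing anything, since sample() keeps each row with probability frac:

// Expected output of sample() is roughly frac * total row count
val total = spark.table("db_name.table_name").count()  // ~1e8 rows per the description above
val expectedRows = (total * frac).toLong               // ~1,000 rows for frac = 1e-5, well below nSamples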
But it is taking far too long (about 5 hours, by my estimate).
Useful information:
- I have ~6 workers.
- The table has 11,433 partitions; I'm not sure if that partitions-to-workers ratio is reasonable.
I'm running Spark 2.1.0 using Scala.
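For reference, the partition count above was read off the table roughly like this (a minimal sketch; the exact call is an assumption on my part, not from the original post):

// Number of partitions of the table's underlying RDD (returns 11433 here)
val numPartitions = spark.table("db_name.table_name").rdd.getNumPartitions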
I have tried:
- Removing the .limit() part.
- Changing frac to 1.0, 0.1, etc. (both variants are sketched below).
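Concretely, those attempts look like this (reconstructed from the description above, reusing the same names; not verbatim from my original code):

// Attempt 1: drop the .limit() call entirely
spark.table("db_name.table_name")
  .sample(false, frac)
  .write
  .mode(SaveMode.Overwrite)
  .saveAsTable("db_name.new_name")

// Attempt 2: larger fractions (e.g. 0.1 or 1.0) with the limit kept
val table01 = spark.table("db_name.table_name").sample(false, 0.1).limit(nSamples)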
Question: how can I make it faster?
Best,