I've been running some Spark LDA topic modeling jobs on Google Cloud Dataproc and keep hitting problems, mainly executor disassociation errors at seemingly random intervals, which I suspect come down to insufficient memory allocation on my executors. That in turn seems related to the automatic cluster configuration. My latest attempt uses n1-standard-8 machines (8 cores, 30GB RAM) for both the master and the worker nodes (6 workers, so 48 worker cores in total).
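In case it matters, the cluster was created with something like the following (the cluster name matches the cluster-3-m master hostname in the config below; the zone is a placeholder, and I'm reconstructing the command from memory rather than quoting it exactly):

# Create a 6-worker Dataproc cluster of n1-standard-8 machines (8 vCPUs, 30GB RAM each)
gcloud dataproc clusters create cluster-3 \
  --zone us-central1-a \
  --master-machine-type n1-standard-8 \
  --worker-machine-type n1-standard-8 \
  --num-workers 6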
But when I look at /etc/spark/conf/spark-defaults.conf, I see this:
spark.master yarn-client
spark.eventLog.enabled true
spark.eventLog.dir hdfs://cluster-3-m/user/spark/eventlog
# Dynamic allocation on YARN
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.initialExecutors 100000
spark.dynamicAllocation.maxExecutors 100000
spark.shuffle.service.enabled true
spark.scheduler.minRegisteredResourcesRatio 0.0
spark.yarn.historyServer.address cluster-3-m:18080
spark.history.fs.logDirectory hdfs://cluster-3-m/user/spark/eventlog
spark.executor.cores 4
spark.executor.memory 9310m
spark.yarn.executor.memoryOverhead 930
# Overkill
spark.yarn.am.memory 9310m
spark.yarn.am.memoryOverhead 930
spark.driver.memory 7556m
spark.driver.maxResultSize 3778m
spark.akka.frameSize 512
# Add ALPN for Bigtable
spark.driver.extraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
spark.executor.extraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
But these values don't make much sense to me. Why do executors get only 4 of the 8 cores? And only 9.3GB of the 30GB of RAM? My impression was that all of this configuration was supposed to be handled automatically, but even my attempts at manual tweaking aren't getting me anywhere.
For instance, I tried launching the shell with:
spark-shell --conf spark.executor.cores=8 --conf spark.executor.memory=24g
But this failed with:
java.lang.IllegalArgumentException: Required executor memory (24576+930 MB) is above the max threshold (22528 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.
Following the error's suggestion, I tried increasing yarn.scheduler.maximum-allocation-mb in /etc/hadoop/conf/yarn-site.xml, but it had no effect.
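The edit was along these lines (the exact number isn't the point; 26624MB here is just an illustrative value comfortably above the 24576+930MB the executor would need):

<property>
  <!-- Raised from the cluster default of 22528MB reported in the error; 26624MB is just an example -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>26624</value>
</property>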
Even when I try a different cluster setup (e.g. using executors with 60+GB of RAM), I end up with the same problem: for some reason the max threshold stays at 22528MB.
Is there something I'm doing wrong here, or is this a problem with Google's automatic configuration?