configuration - Spark cluster full of heartbeat timeouts, executors exiting on their own

Question

Welcome To Ask or Share your Answers For Others

configuration - Spark cluster full of heartbeat timeouts, executors exiting on their own

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

configuration - Spark cluster full of heartbeat timeouts, executors exiting on their own

My Apache Spark cluster is running an application that is giving me lots of executor timeouts:

10:23:30,761 ERROR ~ Lost executor 5 on slave2.cluster: Executor heartbeat timed out after 177005 ms
10:23:30,806 ERROR ~ Lost executor 1 on slave4.cluster: Executor heartbeat timed out after 176991 ms
10:23:30,812 ERROR ~ Lost executor 4 on slave6.cluster: Executor heartbeat timed out after 176981 ms
10:23:30,816 ERROR ~ Lost executor 6 on slave3.cluster: Executor heartbeat timed out after 176984 ms
10:23:30,820 ERROR ~ Lost executor 0 on slave5.cluster: Executor heartbeat timed out after 177004 ms
10:23:30,835 ERROR ~ Lost executor 3 on slave7.cluster: Executor heartbeat timed out after 176982 ms

However, in my configuration I can confirm I successfully increased the executor heartbeat interval:

When I visit the logs of executors marked as EXITED (i.e.: the driver removed them when it couldn't get a heartbeat), it appears that executors killed themselves because they didn't receive any tasks from the driver:

16/05/16 10:11:26 ERROR TransportChannelHandler: Connection to /10.0.0.4:35328 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
16/05/16 10:11:26 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: spark://[email protected]:35328

How can I turn off heartbeats and/or prevent the executors from timing out?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:28:27+0000

Missing heartbeats and executors being killed by YARN is nearly always due to OOMs. You should inspect the logs on the individual executors (look for the text "running beyond physical memory"). If you have many executors and find it cumbersome to inspect all of the logs manually, I recommend monitoring your job in the Spark UI while it runs. As soon as a task fails, it will report the cause in the UI, so it's easy to see. Note that some tasks will report failure due to missing executors that have already been killed, so make sure you look at causes for each of the individual failing tasks.

Note also that most OOM problems can be solved quickly by simply repartitioning your data at appropriate places in your code (again look at the Spark UI for hints as to where there might be a need for a call to repartition). Otherwise, you might want to scale up your machines to accommodate the need for memory.

Categories

configuration - Spark cluster full of heartbeat timeouts, executors exiting on their own

configuration - Spark cluster full of heartbeat timeouts, executors exiting on their own

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags