Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
1.1k views
in Technique by (71.8m points)

mpi - Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. The job ran for 12 hours and was working as expected. Then I got "Data unpack would read past end of buffer in file util/show_help.c at line 501". I often get errors like "ORTE has lost communication with a remote daemon", but those usually appear at the beginning of the job. That is annoying, but it costs far less time than getting an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              barbun40
  Local adapter:           mlx5_0
  Local port:              1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   barbun40
  Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in device init
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: barbun64
  Local PID:  252415
  Peer host:  barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[15284,1],35]
  Exit code:    9
--------------------------------------------------------------------------
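The help text in the log itself names two possible workarounds for the openib warnings. A minimal sketch of how they might be passed on the mpirun command line (the application name ./my_app and the process count 128 are placeholders, not from the original post):

```shell
# Option 1 (the intent for Open MPI 4.x): route InfiniBand traffic
# through UCX and disable the legacy openib BTL.
mpirun --mca pml ucx --mca btl ^openib -np 128 ./my_app

# Option 2: keep the openib BTL and explicitly allow raw InfiniBand
# ports, as the warning suggests.
mpirun --mca btl_openib_allow_ib true -np 128 ./my_app

# To see every help/error message instead of the aggregated
# "N more processes have sent help message" summary:
mpirun --mca orte_base_help_aggregate 0 -np 128 ./my_app
```

The same MCA parameters can also be exported as environment variables in the Slurm batch script (e.g. `export OMPI_MCA_pml=ucx`) if editing the mpirun line is inconvenient. Note these settings address the openib warnings at startup; whether they prevent the show_help.c unpack error after 12 hours is not certain.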


1 Answer

0 votes
by (71.8m points)
Awaiting a reply from an expert.
