Unverified commit 71cbbdac Author: muyuliufeng Committer: GitHub

Fix the issue where a trainer process exits abnormally but the distributed launch process still exits with a normal status #44583 (#44807)

Parent ed0e95a8
......
@@ -658,7 +658,7 @@ def watch_local_trainers(procs, nranks):
             "ABORT!!! Out of all {} trainers, the trainer process with rank={} was aborted. Please check its log."
             .format(nranks, error_rank))
         terminate_local_procs(procs)
-        return
+        raise
     except:
         logger.error(
             "ABORT!!! Out of all {} trainers, the trainer process with rank={} was aborted. Please check its log."
......
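
The effect of swapping `return` for `raise` in the hunk above can be illustrated with a minimal sketch. It assumes the handler containing the changed line catches an exit-style exception such as `SystemExit`; the hunk does not show that `except` clause, so this is an assumption. The names `watch_with_return`, `watch_with_raise`, and `exit_code_of` are hypothetical and are not part of the patched module.

```python
import sys


def watch_with_return():
    """Old behaviour (assumed): the handler swallows the exit and returns normally."""
    try:
        sys.exit(1)  # stand-in for the watcher reacting to a failed trainer
    except SystemExit:
        # cleanup such as terminate_local_procs(procs) would happen here
        return


def watch_with_raise():
    """New behaviour (assumed): the handler cleans up, then re-raises the exit."""
    try:
        sys.exit(1)  # stand-in for the watcher reacting to a failed trainer
    except SystemExit:
        # cleanup such as terminate_local_procs(procs) would happen here
        raise  # a bare raise re-raises the exception currently being handled


def exit_code_of(fn):
    """Call fn and report the exit status the caller would observe."""
    try:
        fn()
    except SystemExit as exc:
        return exc.code
    return 0  # fn returned normally, so the caller sees success


if __name__ == "__main__":
    print(exit_code_of(watch_with_return))
    print(exit_code_of(watch_with_raise))
```

With `return`, the caller has no way to tell that a trainer failed, which matches the symptom in the commit title. With a bare `raise`, the original exception, including its exit code, propagates to the launcher.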