"Unknown exception caught" in multi-machine training
Created by: ccmeteorljh
Paddle version: 0.15
Model: se-resnext
Environment: multi-machine, single GPU per machine, asynchronous training
PaddleCloud job: http://paddlecloud.baidu-int.com:8088/paddle/jobRunInfo?jobId=job-e6c5b7cd20745dde&flag=jobs&groupName=k8s_gpu_demo&groupId=c0a1f165-6279-5320-b9e7-e0218c7a87f5&currentPage=1&currentKey=1
F0822 11:08:56.576814 3256 exception_holder.h:34] Unknown exception caught
*** Check failure stack trace: ***
@ 0x7f5fa75d853d google::LogMessage::Fail()
@ 0x7f5fa75dbfec google::LogMessage::SendToLog()
@ 0x7f5fa75d8063 google::LogMessage::Flush()
@ 0x7f5fa75dd4fe google::LogMessageFatal::~LogMessageFatal()
@ 0x7f5fa839c002 paddle::framework::details::ExceptionHolder::Catch()
@ 0x7f5fa839894d _ZZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNS0_13BlockingQueueIPNS1_13VarHandleBaseEEEPNS1_12OpHandleBaseEENKUlvE_clEv
@ 0x7f5fa8398ea5 paddle::framework::details::ThreadedSSAGraphExecutor::RunOp()
@ 0x7f5fa839aa58 paddle::framework::details::ThreadedSSAGraphExecutor::Run()
@ 0x7f5fa83a0db7 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run()
@ 0x7f5fa763675c paddle::framework::ParallelExecutor::Run()
@ 0x7f5fa7549780 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework16ParallelExecutorERKSt6vectorISsSaISsEERKSsE91_vIS6_SB_SD_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESV_
@ 0x7f5fa7565484 pybind11::cpp_function::dispatcher()
@ 0x7f60108df631 PyEval_EvalFrameEx
@ 0x7f60108e0bce PyEval_EvalCodeEx
@ 0x7f60108df20a PyEval_EvalFrameEx
@ 0x7f60108e0bce PyEval_EvalCodeEx
@ 0x7f60108df20a PyEval_EvalFrameEx
@ 0x7f60108df560 PyEval_EvalFrameEx
@ 0x7f60108df560 PyEval_EvalFrameEx
@ 0x7f60108e0bce PyEval_EvalCodeEx
@ 0x7f60108e0ce2 PyEval_EvalCode
@ 0x7f60109009e0 PyRun_FileExFlags
@ 0x7f6010900bbf PyRun_SimpleFileExFlags
@ 0x7f6010916454 Py_Main
@ 0x7f600fbcacdd __libc_start_main
*** Aborted at 1534907336 (unix time) try "date -d @1534907336" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0xcb8) received by PID 3256 (TID 0x7f6010fd9700) from PID 3256; stack trace: ***
@ 0x7f60105d9500 (unknown)
@ 0x7f600fbde8a5 __GI_raise
@ 0x7f600fbe0085 __GI_abort
@ 0x7f5fa75e2fbb google::FindSymbol()
@ 0x7f5fa75e397a google::GetSymbolFromObjectFile()
@ 0x7f5fa75e4042 google::SymbolizeAndDemangle()
@ 0x7f5fa75e1848 google::DumpStackTrace()
@ 0x7f5fa75e1906 google::DumpStackTraceAndExit()
@ 0x7f5fa75d853d google::LogMessage::Fail()
@ 0x7f5fa75dbfec google::LogMessage::SendToLog()
@ 0x7f5fa75d8063 google::LogMessage::Flush()
@ 0x7f5fa75dd4fe google::LogMessageFatal::~LogMessageFatal()
@ 0x7f5fa839c002 paddle::framework::details::ExceptionHolder::Catch()
@ 0x7f5fa839894d _ZZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNS0_13BlockingQueueIPNS1_13VarHandleBaseEEEPNS1_12OpHandleBaseEENKUlvE_clEv
@ 0x7f5fa8398ea5 paddle::framework::details::ThreadedSSAGraphExecutor::RunOp()
@ 0x7f5fa839aa58 paddle::framework::details::ThreadedSSAGraphExecutor::Run()
@ 0x7f5fa83a0db7 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run()
@ 0x7f5fa763675c paddle::framework::ParallelExecutor::Run()
@ 0x7f5fa7549780 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework16ParallelExecutorERKSt6vectorISsSaISsEERKSsE91_vIS6_SB_SD_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESV_
@ 0x7f5fa7565484 pybind11::cpp_function::dispatcher()
@ 0x7f60108df631 PyEval_EvalFrameEx
@ 0x7f60108e0bce PyEval_EvalCodeEx
@ 0x7f60108df20a PyEval_EvalFrameEx
@ 0x7f60108e0bce PyEval_EvalCodeEx
@ 0x7f60108df20a PyEval_EvalFrameEx
@ 0x7f60108df560 PyEval_EvalFrameEx
@ 0x7f60108df560 PyEval_EvalFrameEx
@ 0x7f60108e0bce PyEval_EvalCodeEx
@ 0x7f60108e0ce2 PyEval_EvalCode
@ 0x7f60109009e0 PyRun_FileExFlags
@ 0x7f6010900bbf PyRun_SimpleFileExFlags
@ 0x7f6010916454 Py_Main
/root/paddlejob/run.sh: line 239: 3256 Aborted (core dumped) python train.py