Cluster training core dump: is there a way to debug further or get more logs?
Created by: lyp2github
Job submitted through paddlecloud, off_smart cluster, jobid=job-e6c5bc61035b7045

I1017 00:36:32.808667 30555 ParameterClient2.cpp:113] pserver 0 10.75.54.40:8000
W1017 00:36:32.808787 30555 LightNetwork.cpp:397] connection refused by pserver, try again!
W1017 00:36:33.808917 30555 LightNetwork.cpp:397] connection refused by pserver, try again!
W1017 00:36:34.809073 30555 LightNetwork.cpp:397] connection refused by pserver, try again!
W1017 00:36:35.809216 30555 LightNetwork.cpp:397] connection refused by pserver, try again!
W1017 00:36:36.809355 30555 LightNetwork.cpp:397] connection refused by pserver, try again!
W1017 00:36:37.809502 30555 LightNetwork.cpp:397] connection refused by pserver, try again!
W1017 00:36:38.809648 30555 LightNetwork.cpp:397] connection refused by pserver, try again!
W1017 00:36:39.809792 30555 LightNetwork.cpp:397] connection refused by pserver, try again!
F1017 00:36:39.809828 30555 LightNetwork.cpp:399] connection refused by pserver, maybe pserver failed!
*** Check failure stack trace: ***
@ 0x7f351fda3f5d google::LogMessage::Fail()
@ 0x7f351fda7a0c google::LogMessage::SendToLog()
@ 0x7f351fda3a83 google::LogMessage::Flush()
@ 0x7f351fda8f1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f351fb8536f paddle::SocketClient::TcpClient()
@ 0x7f351fb85511 paddle::SocketClient::SocketClient()
@ 0x7f352197a0af paddle::ParameterClient2::init()
@ 0x7f35214d744d paddle::RemoteParameterUpdater::init()
@ 0x7f351fd83e3a ParameterUpdater::init()
@ 0x7f351f950d0b _wrap_ParameterUpdater_init
@ 0x4b4cb9 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b5d10 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b5d10 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b5d10 PyEval_EvalFrameEx
@ 0x4b6b28 PyEval_EvalCodeEx
@ 0x4b6c52 PyEval_EvalCode
@ 0x4e1c7d PyRun_FileExFlags
@ 0x4e3501 PyRun_SimpleFileExFlags
@ 0x4159dd Py_Main
@ 0x7f3579f17bd5 __libc_start_main
@ 0x414b71 (unknown)
@ (nil) (unknown)
('use_gpu_flag', False)
('model type: ', 'rank')
.//paddle/start_trainer.sh: line 89: 30555 Aborted (core dumped)
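
The fatal line in the log comes from the trainer failing to open a TCP connection to the pserver, so one thing that could be checked independently of Paddle is whether that address accepts connections at all. Below is a minimal standalone sketch using only the Python 3 standard library. The function name `pserver_reachable` and its parameters are hypothetical and not part of Paddle; it only probes the port and says nothing about why the pserver might be down.

```python
import socket
import time


def pserver_reachable(host, port, retries=3, delay=1.0, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `retries` attempts.

    Hypothetical helper, not a Paddle API: it mimics the trainer's
    retry-on-refusal loop but returns False instead of aborting.
    """
    for attempt in range(retries):
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError,
            # or a timeout) when the connection cannot be established.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Wait before the next attempt, but not after the last one.
            if attempt < retries - 1:
                time.sleep(delay)
    return False
```

For example, calling `pserver_reachable("10.75.54.40", 8000)` from the trainer node would probe the pserver address printed in the log above. If it returns False, the problem is more likely on the pserver side (not started, crashed, or unreachable) than in the trainer itself.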