connection refused by pserver, maybe pserver failed!
Created by: alexqdh
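I have already read issue #3532, which reports a similar error, and the machine is already set to the unlimited configuration. When I submit a job to the GPU cluster, train.log still shows the error below. How can I troubleshoot this?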
Thu Nov 23 15:09:00 2017[1,0]:F1123 15:09:00.210635 9537 LightNetwork.cpp:399] connection refused by pserver, maybe pserver failed!
Thu Nov 23 15:09:00 2017[1,0]:*** Check failure stack trace: ***
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14ac784f5d google::LogMessage::Fail()
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14ac788a0c google::LogMessage::SendToLog()
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14ac784a83 google::LogMessage::Flush()
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14ac789f1e google::LogMessageFatal::~LogMessageFatal()
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14ac56636f paddle::SocketClient::TcpClient()
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14ac566511 paddle::SocketClient::SocketClient()
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14ae35b0af paddle::ParameterClient2::init()
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14adeb844d paddle::RemoteParameterUpdater::init()
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14ac764e3a ParameterUpdater::init()
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14ac331d0b _wrap_ParameterUpdater_init
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4b4cb9 PyEval_EvalFrameEx
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4b5d10 PyEval_EvalFrameEx
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4b5d10 PyEval_EvalFrameEx
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4b5d10 PyEval_EvalFrameEx
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4b6c52 PyEval_EvalCode
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4e1c7d PyRun_FileExFlags
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4e3501 PyRun_SimpleFileExFlags
Thu Nov 23 15:09:00 2017[1,0]: @ 0x4159dd Py_Main
Thu Nov 23 15:09:00 2017[1,0]: @ 0x7f14dc9fdbd5 __libc_start_main
Thu Nov 23 15:09:00 2017[1,0]: @ 0x414b71 (unknown)
Thu Nov 23 15:09:00 2017[1,0]: @ (nil) (unknown)
Thu Nov 23 15:09:00 2017[1,0]:./train.sh: line 206: 9537 Aborted python27-gcc482/bin/python conf/trainer_config.conf
Thu Nov 23 15:09:00 2017[1,0]:+ '[' -n 9536 ']'
Thu Nov 23 15:09:00 2017[1,0]:+ kill -9 9536
Thu Nov 23 15:09:00 2017[1,0]:+ '[' -n '' ']'
Thu Nov 23 15:09:00 2017[1,0]:+ '[' 0 -ne 0 ']'
Thu Nov 23 15:09:00 2017[1,1]:F1123 15:09:00.883389 19284 LightNetwork.cpp:399] connection refused by pserver, maybe pserver failed!
Thu Nov 23 15:09:00 2017[1,1]:*** Check failure stack trace: ***
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349c89af5d google::LogMessage::Fail()
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349c89ea0c google::LogMessage::SendToLog()
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349c89aa83 google::LogMessage::Flush()
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349c89ff1e google::LogMessageFatal::~LogMessageFatal()
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349c67c36f paddle::SocketClient::TcpClient()
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349c67c511 paddle::SocketClient::SocketClient()
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349e4710af paddle::ParameterClient2::init()
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349dfce44d paddle::RemoteParameterUpdater::init()
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349c87ae3a ParameterUpdater::init()
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f349c447d0b _wrap_ParameterUpdater_init
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4b4cb9 PyEval_EvalFrameEx
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4b5d10 PyEval_EvalFrameEx
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4b5d10 PyEval_EvalFrameEx
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4b5d10 PyEval_EvalFrameEx
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4b6c52 PyEval_EvalCode
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4e1c7d PyRun_FileExFlags
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4e3501 PyRun_SimpleFileExFlags
Thu Nov 23 15:09:00 2017[1,1]: @ 0x4159dd Py_Main
Thu Nov 23 15:09:00 2017[1,1]: @ 0x7f34ccad3bd5 __libc_start_main
Thu Nov 23 15:09:00 2017[1,1]: @ 0x414b71 (unknown)
Thu Nov 23 15:09:00 2017[1,1]: @ (nil) (unknown)
Thu Nov 23 15:09:01 2017[1,1]:./train.sh: line 206: 19284 Aborted python27-gcc482/bin/python conf/trainer_config.conf
Thu Nov 23 15:09:01 2017[1,1]:+ '[' -n 19283 ']'
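
For anyone narrowing this down: both traces abort inside paddle::SocketClient::TcpClient() during paddle::ParameterClient2::init(), so the trainer fails while opening its TCP connection to the pserver. A refused connection usually means nothing is listening at that address and port. The snippet below is a minimal sketch for checking this from a trainer node. It is not part of Paddle; the function name probe_pserver is hypothetical, it assumes Python 3 (the job itself runs python27), and the host and port have to come from your own job configuration.

```python
import socket


def probe_pserver(host, port, timeout=3.0):
    """Attempt one TCP connection to host:port and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <reason>".
    host and port are placeholders; take them from your own job config.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: often a firewall or an unreachable host.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return "error: %s" % exc
```

Calling `probe_pserver(host, port)` for each pserver endpoint right before the trainer starts may help separate three cases: "refused" suggests the pserver process has not started yet or has already exited, "timeout" points toward the network path, and "open" suggests the endpoint is reachable and the problem lies elsewhere, for example in startup ordering.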