connection refused by pserver, maybe pserver failed
Created by: youan1
FYI, the error log is below. Could the Paddle team please check whether a network problem caused the training failure?
Thu Sep 21 10:59:47 2017[1,89]<stderr>:+ '[' --local == --ports_num ']'
Thu Sep 21 10:59:47 2017[1,89]<stderr>:+ '[' --local == --ports_num_for_sparse ']'
Thu Sep 21 10:59:47 2017[1,89]<stderr>:+ export PADDLE_NUM_GRADIENT_SERVERS=100
Thu Sep 21 10:59:47 2017[1,89]<stderr>:+ PADDLE_NUM_GRADIENT_SERVERS=100
Thu Sep 21 10:59:47 2017[1,89]<stderr>:+ export PADDLE_TRAINER_ID=89
Thu Sep 21 10:59:47 2017[1,89]<stderr>:+ PADDLE_TRAINER_ID=89
Thu Sep 21 10:59:47 2017[1,89]<stderr>:+ export PADDLE_PSERVERS=10.76.52.19,10.76.50.40,10.76.50.42,10.76.50.43,10.76.50.44,10.76.50.45,10.76.50.46,10.76.50.47,10.76.50.48,10.76.50.49,10.76.51.11,10.76.51.12,10.76.51.13,10.76.51.14,10.76.51.16,10.76.51.18,10.76.51.20,10.76.51.21,10.76.51.22,10.76.51.23,10.76.51.24,10.76.51.25,10.76.51.27,10.76.51.28,10.76.51.29,10.76.51.30,10.76.51.33,10.76.51.34,10.76.51.35,10.76.50.39,10.76.52.20,10.76.52.44,10.76.52.45,10.76.52.47,10.76.52.48,10.76.52.49,10.76.52.50,10.76.53.11,10.76.53.12,10.76.53.13,10.76.53.14,10.76.53.15,10.76.53.16,10.76.53.17,10.76.53.18,10.76.53.19,10.76.53.20,10.76.53.21,10.76.53.22,10.76.53.23,10.76.53.24,10.76.53.25,10.76.53.26,10.76.53.27,10.76.53.28,10.76.53.29,10.76.53.30,10.76.53.31,10.76.53.32,10.76.53.33,10.76.53.35,10.76.53.36,10.76.53.37,10.76.53.38,10.76.53.40,10.76.53.41,10.76.53.42,10.76.53.43,10.76.53.44,10.76.53.45,10.76.53.46,10.76.53.47,10.76.53.49,10.76.53.50,10.76.57.26,10.76.57.27,10.76.57.28,10.76.57.29,10.76.57.30,10.76.57.32,10.76.57.33,10.76.57.34,10.76.57.35,10.76.57.36,10.76.57.37,10.76.57.38,10.76.57.39,10.76.57.40,10.76.57.41,10.76.57.42,10.76.57.43,10.76.57.44,10.76.57.45,10.76.57.46,10.76.57.47,10.76.57.48,10.76.57.49,10.76.57.50,10.76.58.11,10.76.68.20
Thu Sep 21 10:59:47 2017[1,89]<stderr>:+ PADDLE_PSERVERS=10.76.52.19,10.76.50.40,10.76.50.42,10.76.50.43,10.76.50.44,10.76.50.45,10.76.50.46,10.76.50.47,10.76.50.48,10.76.50.49,10.76.51.11,10.76.51.12,10.76.51.13,10.76.51.14,10.76.51.16,10.76.51.18,10.76.51.20,10.76.51.21,10.76.51.22,10.76.51.23,10.76.51.24,10.76.51.25,10.76.51.27,10.76.51.28,10.76.51.29,10.76.51.30,10.76.51.33,10.76.51.34,10.76.51.35,10.76.50.39,10.76.52.20,10.76.52.44,10.76.52.45,10.76.52.47,10.76.52.48,10.76.52.49,10.76.52.50,10.76.53.11,10.76.53.12,10.76.53.13,10.76.53.14,10.76.53.15,10.76.53.16,10.76.53.17,10.76.53.18,10.76.53.19,10.76.53.20,10.76.53.21,10.76.53.22,10.76.53.23,10.76.53.24,10.76.53.25,10.76.53.26,10.76.53.27,10.76.53.28,10.76.53.29,10.76.53.30,10.76.53.31,10.76.53.32,10.76.53.33,10.76.53.35,10.76.53.36,10.76.53.37,10.76.53.38,10.76.53.40,10.76.53.41,10.76.53.42,10.76.53.43,10.76.53.44,10.76.53.45,10.76.53.46,10.76.53.47,10.76.53.49,10.76.53.50,10.76.57.26,10.76.57.27,10.76.57.28,10.76.57.29,10.76.57.30,10.76.57.32,10.76.57.33,10.76.57.34,10.76.57.35,10.76.57.36,10.76.57.37,10.76.57.38,10.76.57.39,10.76.57.40,10.76.57.41,10.76.57.42,10.76.57.43,10.76.57.44,10.76.57.45,10.76.57.46,10.76.57.47,10.76.57.48,10.76.57.49,10.76.57.50,10.76.58.11,10.76.68.20
Thu Sep 21 10:59:47 2017[1,89]<stderr>:+ python27-gcc482/bin/python conf/trainer_config.conf
Thu Sep 21 11:00:39 2017[1,47]<stderr>:F0921 11:00:39.882642 29114 LightNetwork.cpp:395] connection refused by pserver, maybe pserver failed!
Thu Sep 21 11:00:39 2017[1,47]<stderr>:*** Check failure stack trace: ***
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd829bb427d google::LogMessage::Fail()
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd829bb7d2c google::LogMessage::SendToLog()
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd829bb3da3 google::LogMessage::Flush()
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd829bb923e google::LogMessageFatal::~LogMessageFatal()
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd829a113c1 paddle::SocketClient::TcpClient()
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd829a115a1 paddle::SocketClient::SocketClient()
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd82a6949b0 paddle::ParameterClient2::init()
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd82a22109d paddle::RemoteParameterUpdater::init()
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd829b941ea ParameterUpdater::init()
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd82983df7b _wrap_ParameterUpdater_init
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4b4cb9 PyEval_EvalFrameEx
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4b5d10 PyEval_EvalFrameEx
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4b5d10 PyEval_EvalFrameEx
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4b5d10 PyEval_EvalFrameEx
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4b6b28 PyEval_EvalCodeEx
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4b6c52 PyEval_EvalCode
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4e1c7d PyRun_FileExFlags
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4e3501 PyRun_SimpleFileExFlags
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x4159dd Py_Main
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x7fd82c62abd5 __libc_start_main
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ 0x414b71 (unknown)
Thu Sep 21 11:00:39 2017[1,47]<stderr>: @ (nil) (unknown)
Thu Sep 21 11:00:40 2017[1,47]<stderr>:./train.sh: line 239: 29114 Aborted python27-gcc482/bin/python conf/trainer_config.conf
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ '[' 134 -ne 0 ']'
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ kill_pserver2_exit
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ ps aux
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ grep paddle_pserver2
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ grep paddle_cluster_job
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ grep -v grep
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ cut -c10-14
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ xargs kill -9
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Thu Sep 21 11:00:40 2017[1,47]<stderr>:[./common.sh : 399] [kill_pserver2_exit]
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ echo '[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit'
Thu Sep 21 11:00:40 2017[1,47]<stderr>:[FATAL]: paddle_trainer failed kill paddle_pserver2 and exit
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ get_stack
Thu Sep 21 11:00:40 2017[1,47]<stderr>:+ set +x
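
To help tell a network problem apart from a pserver that never started, one option is to probe each address in `PADDLE_PSERVERS` from the failing trainer node before training begins. The following is only a minimal sketch in Python 3, not part of Paddle: `parse_pservers` and `probe_pservers` are hypothetical helper names, and the port has to be supplied by the caller because the pserver port does not appear in this log.

```python
import socket


def parse_pservers(value):
    """Split a comma-separated PADDLE_PSERVERS string into a list of hosts."""
    return [h.strip() for h in value.split(",") if h.strip()]


def probe_pservers(hosts, port, timeout=3.0):
    """Attempt a plain TCP connect to every host on the given port.

    Returns a dict mapping host -> None when the connect succeeded,
    or host -> an error string when it failed (refused, timed out, ...).
    The port is an assumption supplied by the caller; use the one the
    job actually passes to the pserver processes.
    """
    results = {}
    for host in hosts:
        try:
            # create_connection resolves the address and applies the timeout.
            with socket.create_connection((host, port), timeout=timeout):
                results[host] = None
        except OSError as exc:
            # ConnectionRefusedError and socket.timeout are both OSError subclasses.
            results[host] = "{}: {}".format(type(exc).__name__, exc)
    return results
```

For example, calling `probe_pservers(parse_pservers(os.environ["PADDLE_PSERVERS"]), port)` on the trainer node and printing only the entries whose value is not `None` would list the pservers that cannot be reached. A refused connection usually means nothing is listening on that port (the pserver exited or has not started yet), while a timeout points more towards a routing or firewall problem.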