Error when training in pserver mode with multiple machines and multiple GPUs
Created by: dashulu
When training on multiple machines with multiple GPUs using pserver, the following error is reported:

    F1126 15:56:27.350294 4956 grpc_client.cc:288] Get name:[conv2d_4.b_0], ep:[10.255.121.20:30012], status:[-1] meets grpc error:Socket closed
    *** Check failure stack trace: ***
    @ 0x7f69cefa510d google::LogMessage::Fail()
    @ 0x7f69cefa8bbc google::LogMessage::SendToLog()
    @ 0x7f69cefa4c33 google::LogMessage::Flush()
    @ 0x7f69cefaa0ce google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f69d00a3f7a paddle::operators::distributed::GRPCClient::Proceed()
    @ 0x7f6a23d92470 (unknown)
    @ 0x7f6a2ed0f851 start_thread
    @ 0x7f6a2e3d290d clone
    @ (nil) (unknown)
    /root/paddlejob/run.sh: line 313: 4917 Aborted (core dumped) python train.py
    *error messages [/root/paddlejob/run.sh : 316] [start_paddle_trainer] [ERROR]: execute user cmd failed, Mon Nov 26 15:56:42 CST 2018
With the same code, training runs normally when the number of GPUs is configured as 1.