status:[-1] meets grpc error:OS Error train.py core dump
Created by: lighthearts
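- Version and environment information: 1) PaddlePaddle version: fluid 1.0.0 2) System environment: in-house paddle cloud
- Training information: 1) Multi-machine CPU 2) Memory: 15 GB 3) Operator information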
- Problem description: Using code similar to that under fluid/PaddleRec/ctr, I am training in pserver mode. During the third epoch, training aborts with the following error:

      F1210 04:41:17.187237 2534 grpc_client.cc:288] Send name:[cuidFactors@GRAD.block4], ep:[10.90.219.20:30038], status:[-1] meets grpc error:OS Error
      *** Check failure stack trace: ***
          @ 0x7f5b0125640d google::LogMessage::Fail()
          @ 0x7f5b01259ebc google::LogMessage::SendToLog()
          @ 0x7f5b01255f33 google::LogMessage::Flush()
          @ 0x7f5b0125b3ce google::LogMessageFatal::~LogMessageFatal()
          @ 0x7f5b01c539da paddle::operators::distributed::GRPCClient::Proceed()
          @ 0x7f5b10384470 (unknown)
          @ 0x7f5b38ab7851 start_thread
          @ 0x7f5b3817a90d clone
          @ (nil) (unknown)
      /root/paddlejob/run.sh: line 313: 2516 Aborted (core dumped) python train.py
      *error messages
      [/root/paddlejob/run.sh : 316] [start_paddle_trainer] [ERROR]: execute user cmd failed, Mon Dec 10 04:41:25 CST 2018
      ~/paddlejob/workspace