MPI training dumps core: "meets grpc error"
Created by: maosengshulei
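1) PaddlePaddle version: paddle-fluid-v1.3.0 2) Running on CPU 4) System environment: (left as the template prompt: please describe the system type and version, e.g. Mac OS 10.14, and the Python version)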
- Training info: 1) Multi-machine MPI training with pserver enabled. Reproduction info: job-id: job-0bb5cf2cc9b8a180; job-0bb5cf2c4aedcb60
- Problem description: the job dumped core twice.
First occurrence:
getRPC name:[DenseFeatFactors], ep:[10.76.128.38:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
*** Check failure stack trace: ***
@ 0x7fae35c6ec0d google::LogMessage::Fail()
@ 0x7fae35c726bc google::LogMessage::SendToLog()
@ 0x7fae35c6e733 google::LogMessage::Flush()
@ 0x7fae35c73bce google::LogMessageFatal::~LogMessageFatal()
@ 0x7fae3677630a paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7fae8c0438a0 execute_native_thread_routine
@ 0x7fae92bb21c3 start_thread
@ 0x7fae921da12d __clone
@ (nil) (unknown)
Second occurrence:
GetRPC name:[fc_1.w_0.block6], ep:[10.76.87.28:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:
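
For triage, the "meets grpc error" lines can be reduced to the variable being fetched, the pserver endpoint, and the gRPC status code. The following is a minimal sketch, not part of Paddle: the function names `parse_grpc_errors` and `failing_endpoints` are hypothetical, and the regular expression is modelled only on the two log lines quoted in this issue, so other versions may need a different pattern. gRPC status code 14 is `UNAVAILABLE`.

```python
import re
from collections import Counter

# Assumption: the pattern mirrors the two log lines quoted in this issue.
# It is not taken from Paddle's source and may not match other versions.
GRPC_ERROR_RE = re.compile(
    r"(?P<rpc>\w+RPC) name:\[(?P<var>[^\]]+)\], "
    r"ep:\[(?P<ep>[^\]]+)\], status:\[(?P<status>-?\d+)\] "
    r"meets grpc error, error_code:(?P<code>\d+) "
    r"error_message:(?P<msg>.*?) error_details:"
)


def parse_grpc_errors(log_text):
    """Return one dict per 'meets grpc error' line found in log_text."""
    errors = []
    for match in GRPC_ERROR_RE.finditer(log_text):
        errors.append({
            "rpc": match.group("rpc"),        # e.g. GetRPC
            "var": match.group("var"),        # variable or block being fetched
            "ep": match.group("ep"),          # pserver endpoint, host:port
            "status": int(match.group("status")),
            "code": int(match.group("code")),  # gRPC status code
            "message": match.group("msg"),
        })
    return errors


def failing_endpoints(log_text):
    """Count how often each endpoint appears in gRPC error lines."""
    return Counter(err["ep"] for err in parse_grpc_errors(log_text))


if __name__ == "__main__":
    sample = (
        "getRPC name:[DenseFeatFactors], ep:[10.76.128.38:62004], status:[-1] "
        "meets grpc error, error_code:14 error_message:Socket closed error_details:\n"
        "GetRPC name:[fc_1.w_0.block6], ep:[10.76.87.28:62004], status:[-1] "
        "meets grpc error, error_code:14 error_message:Socket closed error_details:\n"
    )
    for endpoint, count in failing_endpoints(sample).items():
        print(endpoint, count)
```

In both crashes the trainer reports "Socket closed" from a different pserver endpoint, which suggests (but does not prove) that the pserver side exited or dropped the connection first; the logs of the pservers at those endpoints would be the next place to look.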