Frequent OS Error in multi-machine training
Created by: ccmeteorljh
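Both synchronous and asynchronous multi-machine training frequently hit this problem at startup. Sometimes even making the trainer sleep for 30s first does not help.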
F0829 21:31:20.322667 803 grpc_client.cc:295] Get name:[batch_norm_19.w_0], ep:[10.255.122.14:6174] meets grpc error:OS Error
*** Check failure stack trace: ***
@ 0x7fdfd17fafed google::LogMessage::Fail()
@ 0x7fdfd17fea9c google::LogMessage::SendToLog()
@ 0x7fdfd17fab13 google::LogMessage::Flush()
@ 0x7fdfd17fffae google::LogMessageFatal::~LogMessageFatal()
@ 0x7fdfd2696e38 paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7fe02526b470 (unknown)
@ 0x7fe0301a8851 start_thread
@ 0x7fe02f86b90d clone
@ (nil) (unknown)
*********************error messages********************
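One possible alternative to the fixed 30s sleep, assuming (this is not confirmed by the log) that the error comes from the trainer contacting the pserver endpoint before it is listening: poll the endpoint until it accepts a TCP connection, and only then start the trainer. This is a minimal sketch using only the Python standard library. The name `wait_for_endpoint` and the timeout values are hypothetical and not part of Paddle.

```python
import socket
import time


def wait_for_endpoint(endpoint, timeout_s=120.0, interval_s=1.0):
    """Poll a "host:port" endpoint until a TCP connection succeeds.

    Hypothetical helper, not a Paddle API.
    Returns True once the port accepts a connection, False on timeout.
    """
    host, port = endpoint.rsplit(":", 1)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only shows that something is listening;
            # the socket is closed again immediately.
            with socket.create_connection((host, int(port)), timeout=interval_s):
                return True
        except OSError:
            # Connection refused, unreachable, or timed out: retry until deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, the trainer script could call `wait_for_endpoint("10.255.122.14:6174")` for each pserver endpoint before building the program. An open port does not guarantee that the gRPC service behind it is fully initialized, so this may narrow the window without removing the failure.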