fluid cluster training fails with "meets grpc error:OS Error"
Created by: oshixiaoxiliu
Fri Nov 2 15:37:12 2018[1,133]:F1102 15:37:12.763226 49369 grpc_client.cc:295] Get name:[embedding_5.w_0.block70], ep:[10.182.108.31:8000] meets grpc error:OS Error
Fri Nov 2 15:37:12 2018[1,133]:*** Check failure stack trace: ***
Fri Nov 2 15:37:12 2018[1,133]: @ 0x7f2e104e1bed google::LogMessage::Fail()
Fri Nov 2 15:37:12 2018[1,133]: @ 0x7f2e104e569c google::LogMessage::SendToLog()
Fri Nov 2 15:37:12 2018[1,133]: @ 0x7f2e104e1713 google::LogMessage::Flush()
Fri Nov 2 15:37:12 2018[1,133]: @ 0x7f2e104e6bae google::LogMessageFatal::~LogMessageFatal()
Fri Nov 2 15:37:12 2018[1,133]: @ 0x7f2e10d42250 paddle::operators::distributed::GRPCClient::Proceed()
Fri Nov 2 15:37:12 2018[1,133]: @ 0x7f2e1d5a28a0 execute_native_thread_routine
Fri Nov 2 15:37:12 2018[1,133]: @ 0x7f2ebab691c3 start_thread
Fri Nov 2 15:37:12 2018[1,133]: @ 0x7f2eba19112d __clone
Fri Nov 2 15:37:12 2018[1,133]: @ (nil) (unknown)
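No error message can be found on the corresponding machine. We run into this problem frequently; training succeeds only occasionally.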