Pserver crashes after a gRPC error when training with Paddle on an MPI cluster
Created by: wawltor
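The reader currently in use pulls training data from a Redis cluster in real time, so fetching a batch can take a long time for the first few batches.
The following error then occurs: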
F0704 22:23:29.637022 9426 grpc_client.cc:414] GetRPC name:[graphsage_mean_1_l.b_0], ep:[10.104.150.34:62003], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
* Check failure stack trace: *
@ 0x7fa21b49cb6d google::LogMessage::Fail()
@ 0x7fa21b4a061c google::LogMessage::SendToLog()
@ 0x7fa21b49c693 google::LogMessage::Flush()
@ 0x7fa21b4a1b2e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa21c1dbaae paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7fa22c2478a0 execute_native_thread_routine
@ 0x7fa2fcaf61c3 start_thread
@ 0x7fa2fc11e12d __clone
@ (nil) (unknown)
Full logs: http://10.104.46.22:8900/fileview.html?path=/home/disk1/normandy/maybach/app-user-20190704210854-14615/
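
If the slow first batches are indeed what pushes the RPC past its deadline, one possible mitigation is to fetch batches in a background thread so the Redis latency overlaps with training. Below is a minimal sketch using only the Python standard library. The name `prefetch_reader` and the `buffer_size` parameter are hypothetical, not part of Paddle; the sketch only assumes the usual reader-creator convention (a callable that returns an iterator over batches). It has not been tested against this job.

```python
import queue
import threading


def prefetch_reader(reader, buffer_size=8):
    """Wrap a reader creator so batches are fetched by a background thread.

    `reader` is assumed to be a callable returning an iterator over batches.
    `buffer_size` (hypothetical parameter) bounds how many batches are
    held in memory ahead of the consumer.
    """
    end = object()  # sentinel marking the end of the stream

    def wrapped():
        q = queue.Queue(maxsize=buffer_size)

        def worker():
            try:
                for item in reader():
                    q.put(item)  # blocks while the buffer is full
            finally:
                # Always signal completion, even if the reader raised,
                # so the consumer does not block forever.
                q.put(end)

        threading.Thread(target=worker, daemon=True).start()

        while True:
            item = q.get()
            if item is end:
                break
            yield item

    return wrapped
```

Usage would look like `train_reader = prefetch_reader(redis_reader, buffer_size=16)`, where `redis_reader` stands for the existing Redis-backed reader. Note that in this sketch an exception inside the wrapped reader ends the stream silently; a production version would want to propagate it to the consumer.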