Repeated job submissions all fail with "meets grpc error"
Created by: zhaoyang1708
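I am submitting training jobs with paddlecloud. The jobs have failed multiple times, and each time the log shows an error similar to the following: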
F0820 06:06:23.906448 3871 grpc_client.cc:408] GetRPC name:[sequence_conv_9.w_0.block3], ep:[10.89.128.16:62004], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
*** Check failure stack trace: ***
@ 0x7f1f209fdc0d google::LogMessage::Fail()
@ 0x7f1f20a016bc google::LogMessage::SendToLog()
@ 0x7f1f209fd733 google::LogMessage::Flush()
@ 0x7f1f20a02bce google::LogMessageFatal::~LogMessageFatal()
@ 0x7f1f21602e0e paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f1f039318a0 execute_native_thread_routine
@ 0x7f1f892b51c3 start_thread
@ 0x7f1f888dd12d __clone
@ (nil) (unknown)
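
To compare the failures across the different job logs, the fatal line can be split into its fields (RPC method, variable name, endpoint, error code). Below is a minimal sketch in Python; the regular expression is an assumption derived only from the single sample line above, and the helper name `parse_grpc_error` is hypothetical, not part of Paddle. For reference, gRPC status code 4 is `DEADLINE_EXCEEDED`, which agrees with the message in the log.

```python
import re

# Pattern for the "meets grpc error" line printed by grpc_client.cc.
# Assumption: built from the one sample line in this report; other Paddle
# versions may format the line differently.
GRPC_ERROR_RE = re.compile(
    r"(?P<method>\w+) name:\[(?P<var>[^\]]*)\], "
    r"ep:\[(?P<endpoint>[^\]]*)\], "
    r"status:\[(?P<status>-?\d+)\] meets grpc error, "
    r"error_code:(?P<code>\d+) "
    r"error_message:(?P<message>.*?) error_details:"
)


def parse_grpc_error(line):
    """Return the fields of a 'meets grpc error' log line, or None if no match."""
    match = GRPC_ERROR_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or counted easily.
    fields["status"] = int(fields["status"])
    fields["code"] = int(fields["code"])
    return fields


# The fatal line from the log above, reassembled as one string.
sample = (
    "F0820 06:06:23.906448 3871 grpc_client.cc:408] GetRPC "
    "name:[sequence_conv_9.w_0.block3], ep:[10.89.128.16:62004], status:[-1] "
    "meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:"
)
print(parse_grpc_error(sample))
```

Running this over the logs of each failed job would show whether the timeouts always point at the same endpoint or variable block, or are spread across several.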
After investigating, the paddlecloud team suggested asking the Paddle team.
Job links: http://yq01-hpc-lvliang-mon.dmop.baidu.com:8919/taskinfo.html?appId=app-user-20190820004839-25295&appname=zhaorutao_online_tbbsq_paddlecloud
http://10.88.35.21:8910/fileview.html?type=logsdir&path=/&instance=0.app-user-20190819233733-25255--zhaorutao_online_tbbsq_paddlecloud
http://10.89.128.25:8910/fileview.html?type=logsdir&path=/&instance=0.app-user-20190819221033-25205--zhaorutao_online_tbbsq_paddlecloud