gRPC error reported during Paddle cluster training
Created by: yangjing14
- Version and environment info: 1) PaddlePaddle version: 1.4.1 2) Paddle Cloud, MPI job
- Problem description: Job link: http://10.90.102.22:8910/fileview.html?type=logsdir&path=/&instance=0.app-user-20190821225301-12979--train_job_quanmin_siamese2_32_epoch3_4_paddlecloud
During training, the job suddenly failed with gRPC error code 14. The log is as follows:

Thu Aug 22 01:28:55 2019[1,12]:F0822 01:28:55.904119 14626 grpc_client.cc:418] SendRPC name:[emb_title@GRAD.block48.trainer_12], ep:[10.90.122.15:62004], status:[-1] meets grpc error, error_code:14 error_message:OS Error error_details:
Thu Aug 22 01:28:55 2019[1,12]:*** Check failure stack trace: ***
Thu Aug 22 01:28:55 2019[1,12]: @ 0x7fbfd9126c0d google::LogMessage::Fail()
Thu Aug 22 01:28:55 2019[1,12]: @ 0x7fbfd912a6bc google::LogMessage::SendToLog()
Thu Aug 22 01:28:55 2019[1,12]: @ 0x7fbfd9126733 google::LogMessage::Flush()
Thu Aug 22 01:28:55 2019[1,12]: @ 0x7fbfd912bbce google::LogMessageFatal::~LogMessageFatal()
Thu Aug 22 01:28:55 2019[1,12]: @ 0x7fbfd9d2be0e paddle::operators::distributed::GRPCClient::Proceed()
Thu Aug 22 01:28:55 2019[1,12]: @ 0x7fc04349d8a0 execute_native_thread_routine
Thu Aug 22 01:28:55 2019[1,12]: @ 0x7fc04e1ae1c3 start_thread
Thu Aug 22 01:28:55 2019[1,12]: @ 0x7fc04d7d612d __clone
Thu Aug 22 01:28:55 2019[1,12]: @ (nil) (unknown)
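For triage, it may help to pull the failing variable, the endpoint, and the error code out of the fatal log line. In the standard gRPC status code list, code 14 is UNAVAILABLE, which usually means the peer could not be reached. The following is only a minimal sketch: the function name `parse_grpc_failure`, the regular expression, and the partial code table are illustrative assumptions modeled on the single `grpc_client.cc` line quoted above, not part of Paddle's API, and other Paddle versions may format this line differently.

```python
import re

# Partial table of standard gRPC status codes (numeric value -> name).
GRPC_STATUS_NAMES = {
    0: "OK",
    1: "CANCELLED",
    2: "UNKNOWN",
    4: "DEADLINE_EXCEEDED",
    8: "RESOURCE_EXHAUSTED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
}

# Hypothetical pattern, written against the one log line in this report.
LOG_PATTERN = re.compile(
    r"(?P<rpc>\w+) name:\[(?P<name>[^\]]+)\], "
    r"ep:\[(?P<host>[^:\]]+):(?P<port>\d+)\], "
    r"status:\[(?P<status>-?\d+)\] meets grpc error, "
    r"error_code:(?P<code>\d+) error_message:(?P<message>.*?) error_details:"
)


def parse_grpc_failure(line):
    """Return a dict describing the failed RPC, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "rpc": match.group("rpc"),
        "name": match.group("name"),
        "host": match.group("host"),
        "port": int(match.group("port")),
        "code": code,
        # Codes missing from the partial table are reported as UNRECOGNIZED.
        "code_name": GRPC_STATUS_NAMES.get(code, "UNRECOGNIZED"),
        "message": match.group("message"),
    }


# The fatal line from the report, copied verbatim.
SAMPLE_LINE = (
    "Thu Aug 22 01:28:55 2019[1,12]:F0822 01:28:55.904119 14626 "
    "grpc_client.cc:418] SendRPC name:[emb_title@GRAD.block48.trainer_12], "
    "ep:[10.90.122.15:62004], status:[-1] meets grpc error, "
    "error_code:14 error_message:OS Error error_details:"
)

if __name__ == "__main__":
    print(parse_grpc_failure(SAMPLE_LINE))
```

Run against the line above, the sketch reports that trainer 12 failed while sending `emb_title@GRAD.block48.trainer_12` to `10.90.122.15:62004`. A reasonable next step would be to check whether the process at that endpoint was still alive at 01:28:55.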