meets grpc error:OS Error
Created by: LoveGarden
Training with fluid 0.15 on MPI fails with the following error:

Tue Oct 23 11:15:31 2018[1,132]:INFO[2018-10-23 11:15:31][train.py][441] - pass_id=0 batch_id=140 cost=1009.5504 auc=[0.64529329] py_reader.queue.size: 100 acc=0.9277344 read_data_time=9.53674316406e-07 train_batch_time=10.4956641197 sample_per_second=390.256390951
Tue Oct 23 11:18:40 2018[1,128]:F1023 11:18:40.494853 9754 grpc_client.cc:295] Get name:[fc_0.w_0.block2], ep:[10.182.9.15:8000] meets grpc error:OS Error
Tue Oct 23 11:18:40 2018[1,128]:*** Check failure stack trace: ***
Tue Oct 23 11:18:40 2018[1,128]: @ 0x7f6fff50cebd google::LogMessage::Fail()
Tue Oct 23 11:18:40 2018[1,128]: @ 0x7f6fff51096c google::LogMessage::SendToLog()
Tue Oct 23 11:18:40 2018[1,128]: @ 0x7f6fff50c9e3 google::LogMessage::Flush()
Tue Oct 23 11:18:40 2018[1,128]: @ 0x7f6fff511e7e google::LogMessageFatal::~LogMessageFatal()
Tue Oct 23 11:18:40 2018[1,128]: @ 0x7f6fffd6e1d0 paddle::operators::distributed::GRPCClient::Proceed()
Tue Oct 23 11:18:40 2018[1,128]: @ 0x7f700a3978a0 execute_native_thread_routine
Tue Oct 23 11:18:40 2018[1,128]: @ 0x7f70a7a1f1c3 start_thread
Tue Oct 23 11:18:40 2018[1,128]: @ 0x7f70a704712d __clone
Tue Oct 23 11:18:40 2018[1,128]: @ (nil) (unknown)
Tue Oct 23 11:18:44 2018[1,128]:.//paddle/start_trainer.sh: line 89: 49394 Aborted /home/disk1/normandy/maybach/app-user-20181023101155-15287/workspace/python27-gcc482//bin/python -u mpi_train.py --use_parallel_exe 1 --batch_size 4096 --cpu_num=1 --train_data_dir train_data --test_data_dir test_data
Tue Oct 23 11:18:44 2018[1,128]:+ trainer_ret=134
Tue Oct 23 11:18:44 2018[1,128]:+ popd
Tue Oct 23 11:18:44 2018[1,128]:/home/disk1/nor