多机同步下启动就core掉
Created by: ccmeteorljh
paddle:develop 分支 8月15日编译版本; 模型:resnet 运行代码库:https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark 运行脚本:
python fluid_benchmark.py --update_method pserver --model resnet --batch_size 64 --pass_num 1 --iterations 100
运行环境:2ps,4trainer pserver日志:
*** Aborted at 1534389265 (unix time) try "date -d @1534389265" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 633 (TID 0x7fe4257fb700) from PID 0; stack trace: ***
@ 0x7fe604059500 (unknown)
@ 0x7fe5a657d297 paddle::operators::distributed::VariableResponse::CopyLodTensorData()
@ 0x7fe5a657dc1c paddle::operators::distributed::VariableResponse::ProcSerializedField()
@ 0x7fe5a657455a paddle::operators::distributed::GRPCVariableResponse::Parse()
@ 0x7fe5a656f76e grpc::ServerInterface::PayloadAsyncRequest<>::FinalizeResult()
@ 0x7fe5a6593bf3 grpc::CompletionQueue::AsyncNextInternal()
@ 0x7fe5a656b729 paddle::operators::distributed::AsyncGRPCServer::HandleRequest()
@ 0x7fe5a656f58f std::thread::_Impl<>::_M_run()
@ 0x7fe5f9114470 (unknown)
@ 0x7fe604051851 start_thread
@ 0x7fe60371490d clone
@ 0x0 (unknown)
*********************error messages********************
[/root/paddlejob/run.sh : 136] [start_fluid_pserver]
trainer报错:
F0816 11:14:31.137490 851 grpc_client.cc:295] Send name:[fc_0.w_0@GRAD.block1.trainer_1], ep:[10.255.118.23:6174] meets grpc error:OS Error
*** Check failure stack trace: ***
@ 0x7fc085dd949d google::LogMessage::Fail()
@ 0x7fc085ddcf4c google::LogMessage::SendToLog()
@ 0x7fc085dd8fc3 google::LogMessage::Flush()
@ 0x7fc085dde45e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fc086c5b2b8 paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7fc0d9558470 (unknown)
@ 0x7fc0e4495851 start_thread
@ 0x7fc0e3b5890d clone
@ (nil) (unknown)
*********************error messages********************