fluid MPI cluster training fails with: Get name:[embedding_0.w_0.block15], ep:[10.109.92.24:8000] meets grpc error:Socket closed
Created by: 333caowei
The trainer.log on node 0 shows:
F1012 18:13:48.747632 57302 grpc_client.cc:295] Get name:[embedding_0.w_0.block15], ep:[10.109.92.24:8000] meets grpc error:Socket closed
*** Check failure stack trace: ***
@ 0x7f6251a18ebd google::LogMessage::Fail()
@ 0x7f6251a1c96c google::LogMessage::SendToLog()
@ 0x7f6251a189e3 google::LogMessage::Flush()
@ 0x7f6251a1de7e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f625227a1d0 paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f62e69318a0 execute_native_thread_routine
@ 0x7f62f0d751c3 start_thread
@ 0x7f62f039d12d __clone
@ (nil) (unknown)
.//paddle/start_trainer.sh: line 89: 57154 Aborted /home/disk1/normandy/maybach/app-user-20181012180450-24117/workspace/python27-gcc482//bin/python -u train.py
The pserver.log shows:
++ /home/disk1/normandy/maybach/app-user-20181012180450-24117/workspace/python27-gcc482/bin/python -u train.py
vocab size: 314158 5212 47321 3438 35212 43 133
get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
get_startup_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
passing pserver_program to get_startup_program() is deprecated, you can use new API get_pserver_programs() to get both pserver main program and startup program.
E1012 18:13:48.701248 2456 listen_and_serv_op.cc:69] run sub program error holder_ should not be null
Tensor not initialized yet when Tensor::type() is called. at [/paddle/paddle/fluid/framework/tensor.h:139]
PaddlePaddle Call Stacks:
0 0x7fe61e0a8da6p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1 0x7fe61e0ab006p paddle::framework::Tensor::type() const + 150
2 0x7fe61ea546a5p paddle::framework::OperatorWithKernel::IndicateDataType(paddle::framework::ExecutionContext const&) const + 149
3 0x7fe61ea54a7fp paddle::framework::OperatorWithKernel::GetExpectedKernelType(paddle::framework::ExecutionContext const&) const + 47
4 0x7fe61ea54f67p paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 199
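After I adjusted the number of nodes and re-ran the job, it ran normally. I initially assumed a faulty node was the cause, but the successful run also included node 10.109.92.24.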
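One way to check which side failed first is to compare the glog timestamps in the two logs. The sketch below is only an illustration: `parse_glog` is a hypothetical helper (not part of Paddle), the year 2018 is assumed from the workspace path, and the comparison is only meaningful if the clocks on the two hosts are reasonably in sync. Applied to the two lines above, it suggests the pserver error was logged roughly 46 ms before the trainer reported "Socket closed".

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_RE = re.compile(
    r"([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+) ([\w.]+):(\d+)\]"
)

def parse_glog(line, year=2018):
    """Hypothetical helper: return (severity, timestamp, source) or None.

    glog does not record the year, so it is passed in (2018 is an
    assumption taken from the workspace path in the logs).
    """
    m = GLOG_RE.search(line)
    if m is None:
        return None
    severity, mmdd, clock, _tid, src, lineno = m.groups()
    ts = datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
    return severity, ts, f"{src}:{lineno}"

# The two lines are copied from trainer.log and pserver.log above.
trainer_line = (
    "F1012 18:13:48.747632 57302 grpc_client.cc:295] "
    "Get name:[embedding_0.w_0.block15], ep:[10.109.92.24:8000] "
    "meets grpc error:Socket closed"
)
pserver_line = (
    "E1012 18:13:48.701248 2456 listen_and_serv_op.cc:69] "
    "run sub program error holder_ should not be null"
)

trainer = parse_glog(trainer_line)
pserver = parse_glog(pserver_line)

# Positive value: the pserver logged its error before the trainer aborted.
delta_ms = (trainer[1] - pserver[1]).total_seconds() * 1000
print(f"pserver {pserver[2]} -> trainer {trainer[2]}: {delta_ms:.1f} ms")
```

If that ordering holds, the "Socket closed" on the trainer would be a symptom of the pserver-side "Tensor not initialized" error rather than a problem with the 10.109.92.24 machine itself, which would fit the observation that the same node worked in the later run.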