gRPC timeout during multi-machine asynchronous training
Created by: ccmeteorljh
Paddle version: 0.14
Model: machine_translation
Code repository: https://github.com/xuezhong/transformer-nist
Job configuration: 2 pservers, 4 trainers, 8 GPUs
Launch script:
FLAGS_rpc_deadline=3000000 python -u thirdparty/model/transformer_cloud/train.py --src_vocab_fpath ./thirdparty/nist06n/cn_30001.dict --trg_vocab_fpath ./thirdparty/nist06n/en_30001.dict --train_file_pattern './train_data/part-*' --val_file_pattern './test_data/part-*' --batch_size 1024 --use_token_batch True --special_token '_GO' '_EOS' '_UNK' --pass_num=100 --iterations=1000 --local False --sync False
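One thing that may be worth ruling out is whether the deadline setting actually reaches the training process. The sketch below is not part of the job; it only shows, in plain Python, how an environment variable set for a launch is seen by the child process. It assumes the intended variable name is `FLAGS_rpc_deadline` and that the value is in milliseconds (both are assumptions here), and the helper names `build_env` and `read_flag_in_child` are hypothetical.

```python
import os
import subprocess
import sys


def build_env(rpc_deadline_ms):
    """Return a copy of the current environment with FLAGS_rpc_deadline set.

    The variable name and the millisecond unit are assumptions.
    """
    env = dict(os.environ)
    env["FLAGS_rpc_deadline"] = str(rpc_deadline_ms)
    return env


def read_flag_in_child(env, name="FLAGS_rpc_deadline"):
    """Start a child Python process and return the value it sees for `name`."""
    code = "import os; print(os.environ.get(%r, ''))" % name
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode().strip()


if __name__ == "__main__":
    # Same value as in the launch script above.
    env = build_env(3000000)
    print(read_flag_in_child(env))
```

If the child prints an empty string for the name being checked, the variable was not exported under that name, and the process would fall back to whatever default deadline it has.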
The error log is as follows:
Traceback (most recent call last):
File "thirdparty/model/transformer_cloud/train.py", line 541, in <module>
train(args)
File "thirdparty/model/transformer_cloud/train.py", line 531, in train
train_loop(exe, trainer_prog)
File "thirdparty/model/transformer_cloud/train.py", line 463, in train_loop
outs = train_exe.run(fetch_list=[sum_cost.name, token_num.name], feed=feed_list)
File "/usr/local/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 269, in run
self.executor.run(fetch_list, fetch_var_name)
paddle.fluid.core.EnforceNotMet: internal error in RPCClient at [/paddle/paddle/fluid/operators/fetch_barrier_op.cc:48]
PaddlePaddle Call Stacks:
0 0x7fef66eaad96p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1 0x7fef67bf3c0bp paddle::operators::FetchBarrierOp::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 763
2 0x7fef67dfb86dp paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 205
3 0x7fef67c37a33p paddle::framework::details::RPCOpHandle::RunImpl() + 883
4 0x7fef67c548b5p paddle::framework::details::OpHandleBase::Run(bool) + 117
5 0x7fef67c158bap
6 0x7fef67a9e983p std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()(), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&) + 35
7 0x7fef66ffc137p std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()()>&, bool&) + 39
8 0x7fefc4d9fb23p pthread_once + 83
9 0x7fef67c146c2p
10 0x7fef66ffdc74p _ZZN10ThreadPoolC1EmENKUlvE_clEv + 404
11 0x7fefba022470p
12 0x7fefc4d9a851p
13 0x7fefc445d90dp clone + 109