Running "fit a line" fails on tag v0.12.0
Created by: xjqbest
When running distributed CPU training on k8s with 2 pservers and 2 trainers, one trainer succeeded and the other failed. The log of the trainer that succeeded is:
...
[245.54922]
[106.66316]
[78.47267]
[86.7585]
[6.032805]
('infer shape: ', (10, 1))
('infer results: ', array([[-46.680927],
[-51.959793],
[-33.93647 ],
[-31.00824 ],
[-38.361206],
[-46.148376],
[-28.084446],
[-35.666855],
[-33.81091 ],
[-50.53453 ]], dtype=float32))
The log of the trainer that failed is:
...
81.19149]
[95.01187]
[104.61168]
[139.18037]
[113.60637]
[71.95655]
[13.743915]
E0523 18:38:30.050364 1005 grpc_client.cc:236] proc param error:name:[fc_0.b_0] ep:[10.255.92.12:30002] grpc error:Deadline Exceeded
E0523 18:38:30.051322 1004 grpc_client.cc:236] proc param error:name:[fc_0.w_0] ep:[10.255.93.18:30002] grpc error:Deadline Exceeded
Traceback (most recent call last):
  File "train.py", line 175, in <module>
    main(use_cuda, is_local)
  File "train.py", line 168, in main
    train(use_cuda, save_dirname, is_local)
  File "train.py", line 121, in train
    train_loop(t.get_trainer_program())
  File "train.py", line 85, in train_loop
    fetch_list=[avg_cost])
  File "/usr/local/lib/python2.7/site-packages/paddle/fluid/executor.py", line 336, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: at [/paddle/paddle/fluid/operators/send_op.cc:82]
PaddlePaddle Call Stacks:
0 0x7f9830615566p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1 0x7f9830c754edp paddle::operators::SendOp::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 3197
2 0x7f9830ce5818p paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 56
3 0x7f983069cf90p paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool) + 336
4 0x7f983069d6d4p paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool) + 100
5 0x7f983062c01bp _ZZN8pybind1112cpp_function10initializeIZNS0_C1IvN6paddle9framework8ExecutorEJRKNS4_11ProgramDescEPNS4_5ScopeEibbEJNS_4nameENS_9is_methodENS_7siblingEEEEMT0_FT_DpT1_EDpRKT2_EUlPS5_S8_SA_ibbE_vJSO_S8_SA_ibbEJSB_SC_SD_EEEvOSF_PFSE_SH_ESN_ENUlRNS_6detail13function_callEE1_4_FUNESV_ + 555
6 0x7f9830625d84p pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 2596
7 0x7f98c4fe2631p PyEval_EvalFrameEx + 24497
8 0x7f98c4fe3bcep PyEval_EvalCodeEx + 2190
9 0x7f98c4fe220ap PyEval_EvalFrameEx + 23434
10 0x7f98c4fe3bcep PyEval_EvalCodeEx + 2190
11 0x7f98c4fe220ap PyEval_EvalFrameEx + 23434
12 0x7f98c4fe3bcep PyEval_EvalCodeEx + 2190
13 0x7f98c4fe220ap PyEval_EvalFrameEx + 23434
14 0x7f98c4fe3bcep PyEval_EvalCodeEx + 2190
15 0x7f98c4fe220ap PyEval_EvalFrameEx + 23434
16 0x7f98c4fe3bcep PyEval_EvalCodeEx + 2190
17 0x7f98c4fe3ce2p PyEval_EvalCode + 50
18 0x7f98c50039e0p PyRun_FileExFlags + 176
19 0x7f98c5003bbfp PyRun_SimpleFileExFlags + 239
20 0x7f98c5019454p Py_Main + 3188
21 0x7f98c42cdcddp __libc_start_main + 253
22 0x400649p
There are no errors in the pservers' logs.