fluid distributed training fails when more than 2 ps are started
Created by: flyhighzy
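I am running a fluid distributed job on a self-built k8s cluster. Each pod first starts a ps and then starts a trainer. With 2 pods the job runs normally. With 3 pods, one ps crashes and training fails; with 4 pods, 2 ps crash; with 5 pods, 3 ps crash; and so on. Version used: latest-gpu from Docker Hub, corresponding to commit id e8b4e0d6.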
ps error output:
* Aborted at 1535100063 (unix time) try "date -d @1535100063" if you are using GNU date *
PC: @ 0x0 (unknown)
* SIGSEGV (@0x0) received by PID 13 (TID 0x7fcdf4995700) from PID 0; stack trace: *
@ 0x7fcdf4372390 (unknown)
@ 0x7fcd9c998ab9 paddle::operators::distributed::AsyncGRPCServer::ShutDownImpl()
@ 0x7fcd9c996133 paddle::operators::distributed::RPCServer::ShutDown()
@ 0x7fcd9c85fa18 paddle::operators::ListenAndServOp::Stop()
@ 0x7fcd9c85fb47 paddle::operators::ListenAndServOp::~ListenAndServOp()
@ 0x7fcd9c85fce1 paddle::operators::ListenAndServOp::~ListenAndServOp()
@ 0x7fcd9bb1f7e6 paddle::framework::ExecutorPrepareContext::~ExecutorPrepareContext()
@ 0x7fcd9bb1f856 std::default_delete<>::operator()()
@ 0x7fcd9bb23ae5 paddle::framework::Executor::Run()
@ 0x7fcd9ba4130d ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework8ExecutorERKNS4_11ProgramDescEPNS4_5ScopeEibbE65_vIS6_S9_SB_ibbEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNEST
@ 0x7fcd9ba6c514 pybind11::cpp_function::dispatcher()
@ 0x4c37ed PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c1e6f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4eb30f (unknown)
@ 0x4e5422 PyRun_FileExFlags
@ 0x4e3cd6 PyRun_SimpleFileExFlags
@ 0x493ae2 Py_Main
@ 0x7fcdf3fb7830 __libc_start_main
@ 0x4933e9 _start
@ 0x0 (unknown)

Full ps log printed after setting the logging flags:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0824 08:40:55.146638 13 init.cc:47] Init commandline: dummy train.py --tryfromenv=use_pinned_memory,check_nan_inf,benchmark,warpctc_dir,eager_delete_scope,use_mkldnn,initial_cpu_memory_in_mb,init_allocated_mem,free_idle_memory,paddle_num_threads,dist_threadpool_size,cpu_deterministic,rpc_deadline,rpc_server_profile_period,rpc_server_profile_path,fraction_of_gpu_memory_to_use,cudnn_deterministic
I0824 08:40:56.670580 13 dynamic_loader.cc:76] Try to find library: libcublas.so from default system path.
I0824 08:40:59.454721 13 dynamic_loader.cc:76] Try to find library: libcudnn.so from default system path.
0.0.0
I0824 08:41:03.697319 13 op_desc.cc:463] CompileTime infer shape on uniform_random
I0824 08:41:03.698568 13 op_desc.cc:463] CompileTime infer shape on mul
I0824 08:41:03.698941 13 mul_op.cc:41] mul operator x.shape=-1, 13 y.shape=13, 1 x_num_col_dims=1 y_num_col_dims=1
I0824 08:41:03.699544 13 op_desc.cc:463] CompileTime infer shape on fill_constant
I0824 08:41:03.742079 13 op_desc.cc:463] CompileTime infer shape on elementwise_add
I0824 08:41:03.743115 13 op_desc.cc:463] CompileTime infer shape on elementwise_sub
I0824 08:41:03.743959 13 op_desc.cc:463] CompileTime infer shape on square
I0824 08:41:03.744910 13 op_desc.cc:463] CompileTime infer shape on mean
I0824 08:41:03.746531 13 op_desc.cc:463] CompileTime infer shape on fill_constant
I0824 08:41:03.747144 13 op_desc.cc:463] CompileTime infer shape on mean_grad
I0824 08:41:03.747772 13 op_desc.cc:463] CompileTime infer shape on square_grad
I0824 08:41:03.748400 13 op_desc.cc:463] CompileTime infer shape on elementwise_sub_grad
I0824 08:41:03.749043 13 op_desc.cc:463] CompileTime infer shape on elementwise_add_grad
I0824 08:41:03.749712 13 op_desc.cc:463] CompileTime infer shape on mul_grad
I0824 08:41:03.751401 13 op_desc.cc:463] CompileTime infer shape on fill_constant
I0824 08:41:03.752104 13 op_desc.cc:463] CompileTime infer shape on sgd
I0824 08:41:03.752894 13 op_desc.cc:463] CompileTime infer shape on sgd
192.168.138.111:7164,192.168.147.60:7164,192.168.92.124:7164,192.168.231.68:7164,192.168.127.131:7164 5 0
I0824 08:41:03.759470 13 op_desc.cc:463] CompileTime infer shape on send_barrier
I0824 08:41:03.760350 13 op_desc.cc:463] CompileTime infer shape on fetch_barrier
I0824 08:41:03.761288 13 op_desc.cc:463] CompileTime infer shape on fetch_barrier
pserver
I0824 08:41:03.766665 13 scope.cc:129] Create variable fetch
I0824 08:41:03.767114 13 executor.cc:106] Create Variable fetch global, which pointer is 0x31f79080
I0824 08:41:03.767493 13 scope.cc:129] Create variable feed
I0824 08:41:03.767717 13 executor.cc:106] Create Variable feed global, which pointer is 0x31f79130
I0824 08:41:03.768206 13 executor.cc:106] Create Variable fetch global, which pointer is 0x31f79080
I0824 08:41:03.768697 13 executor.cc:106] Create Variable feed global, which pointer is 0x31f79130
I0824 08:41:03.769125 13 operator.cc:130] CPUPlace Op(listen_and_serv), inputs:{X[]}, outputs:{}.
I0824 08:41:03.769584 13 listen_and_serv_op.cc:265] sync_mode:1, fan_in:5, end_point:192.168.127.131:7164, checkpoint_block_id: -1
I0824 08:41:03.769991 13 rpc_server.cc:132] RegisterRPC rpc_name:RequestSend, handler:0x31f7b4f0, cond:0
I0824 08:41:03.770208 13 rpc_server.cc:132] RegisterRPC rpc_name:RequestGet, handler:0x31f7b590, cond:1
I0824 08:41:03.770372 13 rpc_server.cc:132] RegisterRPC rpc_name:RequestPrefetch, handler:0x31f7b610, cond:2
I0824 08:41:03.770526 13 rpc_server.cc:132] RegisterRPC rpc_name:RequestCheckpoint, handler:0x31f779d0, cond:3
I0824 08:41:03.783799 13 rpc_server.cc:60] RPCServer ShutDown
I0824 08:41:03.784157 13 grpc_server.cc:321] server_ shutdown!
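
To line up the crash with the log, the abort timestamp from the trace can be compared against the glog timestamps. The following is only a rough sketch: it assumes the container clock is UTC and that the log is from 2018 (glog does not print the year; the year here is taken from the unix timestamp). The helper name `parse_glog_time` is made up for illustration and is not part of Paddle or glog.

```python
from datetime import datetime, timezone

# Value copied from the "Aborted at" line of the ps trace.
ABORT_UNIX_TIME = 1535100063

# glog prefixes copied from the ps log (format: <severity>mmdd hh:mm:ss.uuuuuu).
LAST_REGISTER_RPC = "I0824 08:41:03.770526"  # RegisterRPC RequestCheckpoint
SERVER_SHUTDOWN = "I0824 08:41:03.784157"    # "server_ shutdown!"


def parse_glog_time(prefix, year=2018):
    """Parse a glog prefix into a naive datetime (hypothetical helper).

    glog omits the year, so it has to be supplied by the caller.
    """
    stamp = prefix[1:]  # drop the severity letter, e.g. "I"
    return datetime.strptime(f"{year}{stamp}", "%Y%m%d %H:%M:%S.%f")


# Convert the abort time to a naive UTC datetime (assumes the log clock is UTC).
abort_time = datetime.fromtimestamp(ABORT_UNIX_TIME, tz=timezone.utc).replace(tzinfo=None)
register_time = parse_glog_time(LAST_REGISTER_RPC)
shutdown_time = parse_glog_time(SERVER_SHUTDOWN)

print("abort time (UTC):      ", abort_time)
print("register -> shutdown:  ", shutdown_time - register_time)
print("abort in shutdown second:", abort_time == shutdown_time.replace(microsecond=0))
```

Under those assumptions, the abort falls in the same second as the last log line, and the server shuts down roughly 14 ms after the last RPC handler is registered, which is consistent with the SIGSEGV being raised inside `AsyncGRPCServer::ShutDownImpl()` as shown in the stack trace.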