Opened Aug 26, 2018 by saxon_zh (@saxon_zh, Guest)

fluid distributed training fails when more than 2 parameter servers (ps) are started

Created by: flyhighzy

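I am running a fluid distributed training job on a self-built k8s cluster. Each pod starts a ps first and then a trainer. With 2 pods the job runs normally. With 3 pods, one ps crashes and training fails; with 4 pods, 2 ps crash; with 5 pods, 3 ps crash; and so on. Version used: latest-gpu from Docker Hub, which corresponds to commit id e8b4e0d6.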
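The four runs reported above can be tabulated to make the pattern explicit. This is only a restatement of the numbers in the report, not a diagnosis; the variable names are illustrative.

```python
# Reported observations from the issue: pods started -> pservers that crashed.
observed = {2: 0, 3: 1, 4: 2, 5: 3}

# Pservers still alive in each run.
survivors = {pods: pods - crashed for pods, crashed in observed.items()}

for pods, alive in survivors.items():
    print(f"{pods} pods: {observed[pods]} crashed, {alive} still running")
```

In every reported run exactly two pservers stay up, regardless of how many pods were started.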

ps error output:

* Aborted at 1535100063 (unix time) try "date -d @1535100063" if you are using GNU date *
PC: @ 0x0 (unknown)
* SIGSEGV (@0x0) received by PID 13 (TID 0x7fcdf4995700) from PID 0; stack trace: *
@ 0x7fcdf4372390 (unknown)
@ 0x7fcd9c998ab9 paddle::operators::distributed::AsyncGRPCServer::ShutDownImpl()
@ 0x7fcd9c996133 paddle::operators::distributed::RPCServer::ShutDown()
@ 0x7fcd9c85fa18 paddle::operators::ListenAndServOp::Stop()
@ 0x7fcd9c85fb47 paddle::operators::ListenAndServOp::~ListenAndServOp()
@ 0x7fcd9c85fce1 paddle::operators::ListenAndServOp::~ListenAndServOp()
@ 0x7fcd9bb1f7e6 paddle::framework::ExecutorPrepareContext::~ExecutorPrepareContext()
@ 0x7fcd9bb1f856 std::default_delete<>::operator()()
@ 0x7fcd9bb23ae5 paddle::framework::Executor::Run()
@ 0x7fcd9ba4130d ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework8ExecutorERKNS4_11ProgramDescEPNS4_5ScopeEibbE65_vIS6_S9_SB_ibbEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNEST
@ 0x7fcd9ba6c514 pybind11::cpp_function::dispatcher()
@ 0x4c37ed PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c1e6f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4eb30f (unknown)
@ 0x4e5422 PyRun_FileExFlags
@ 0x4e3cd6 PyRun_SimpleFileExFlags
@ 0x493ae2 Py_Main
@ 0x7fcdf3fb7830 __libc_start_main
@ 0x4933e9 _start
@ 0x0 (unknown)
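The abort line suggests `date -d @1535100063` to decode the timestamp. Where GNU date is not available, the same conversion can be sketched in Python with the standard library; the variable names are illustrative.

```python
from datetime import datetime, timezone

# Value printed in the "Aborted at ... (unix time)" line of the crash output.
abort_unix_time = 1535100063

# Interpret the timestamp as UTC so the result does not depend on the local timezone.
abort_utc = datetime.fromtimestamp(abort_unix_time, tz=timezone.utc)
print(abort_utc.isoformat())  # 2018-08-24T08:41:03+00:00
```

The result, 08:41:03 on Aug 24, matches the `I0824 08:41:03` timestamps on the `RPCServer ShutDown` and `server_ shutdown!` entries in the ps log, which is consistent with the crash happening right as the server shuts down (assuming the container clock is set to UTC).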

Full ps log after setting the logging flags:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0824 08:40:55.146638 13 init.cc:47] Init commandline: dummy train.py --tryfromenv=use_pinned_memory,check_nan_inf,benchmark,warpctc_dir,eager_delete_scope,use_mkldnn,initial_cpu_memory_in_mb,init_allocated_mem,free_idle_memory,paddle_num_threads,dist_threadpool_size,cpu_deterministic,rpc_deadline,rpc_server_profile_period,rpc_server_profile_path,fraction_of_gpu_memory_to_use,cudnn_deterministic
I0824 08:40:56.670580 13 dynamic_loader.cc:76] Try to find library: libcublas.so from default system path.
I0824 08:40:59.454721 13 dynamic_loader.cc:76] Try to find library: libcudnn.so from default system path.
0.0.0
I0824 08:41:03.697319 13 op_desc.cc:463] CompileTime infer shape on uniform_random
I0824 08:41:03.698568 13 op_desc.cc:463] CompileTime infer shape on mul
I0824 08:41:03.698941 13 mul_op.cc:41] mul operator x.shape=-1, 13 y.shape=13, 1 x_num_col_dims=1 y_num_col_dims=1
I0824 08:41:03.699544 13 op_desc.cc:463] CompileTime infer shape on fill_constant
I0824 08:41:03.742079 13 op_desc.cc:463] CompileTime infer shape on elementwise_add
I0824 08:41:03.743115 13 op_desc.cc:463] CompileTime infer shape on elementwise_sub
I0824 08:41:03.743959 13 op_desc.cc:463] CompileTime infer shape on square
I0824 08:41:03.744910 13 op_desc.cc:463] CompileTime infer shape on mean
I0824 08:41:03.746531 13 op_desc.cc:463] CompileTime infer shape on fill_constant
I0824 08:41:03.747144 13 op_desc.cc:463] CompileTime infer shape on mean_grad
I0824 08:41:03.747772 13 op_desc.cc:463] CompileTime infer shape on square_grad
I0824 08:41:03.748400 13 op_desc.cc:463] CompileTime infer shape on elementwise_sub_grad
I0824 08:41:03.749043 13 op_desc.cc:463] CompileTime infer shape on elementwise_add_grad
I0824 08:41:03.749712 13 op_desc.cc:463] CompileTime infer shape on mul_grad
I0824 08:41:03.751401 13 op_desc.cc:463] CompileTime infer shape on fill_constant
I0824 08:41:03.752104 13 op_desc.cc:463] CompileTime infer shape on sgd
I0824 08:41:03.752894 13 op_desc.cc:463] CompileTime infer shape on sgd
192.168.138.111:7164,192.168.147.60:7164,192.168.92.124:7164,192.168.231.68:7164,192.168.127.131:7164 5 0
I0824 08:41:03.759470 13 op_desc.cc:463] CompileTime infer shape on send_barrier
I0824 08:41:03.760350 13 op_desc.cc:463] CompileTime infer shape on fetch_barrier
I0824 08:41:03.761288 13 op_desc.cc:463] CompileTime infer shape on fetch_barrier
pserver
I0824 08:41:03.766665 13 scope.cc:129] Create variable fetch
I0824 08:41:03.767114 13 executor.cc:106] Create Variable fetch global, which pointer is 0x31f79080
I0824 08:41:03.767493 13 scope.cc:129] Create variable feed
I0824 08:41:03.767717 13 executor.cc:106] Create Variable feed global, which pointer is 0x31f79130
I0824 08:41:03.768206 13 executor.cc:106] Create Variable fetch global, which pointer is 0x31f79080
I0824 08:41:03.768697 13 executor.cc:106] Create Variable feed global, which pointer is 0x31f79130
I0824 08:41:03.769125 13 operator.cc:130] CPUPlace Op(listen_and_serv), inputs:{X[]}, outputs:{}.
I0824 08:41:03.769584 13 listen_and_serv_op.cc:265] sync_mode:1, fan_in:5, end_point:192.168.127.131:7164, checkpoint_block_id: -1
I0824 08:41:03.769991 13 rpc_server.cc:132] RegisterRPC rpc_name:RequestSend, handler:0x31f7b4f0, cond:0
I0824 08:41:03.770208 13 rpc_server.cc:132] RegisterRPC rpc_name:RequestGet, handler:0x31f7b590, cond:1
I0824 08:41:03.770372 13 rpc_server.cc:132] RegisterRPC rpc_name:RequestPrefetch, handler:0x31f7b610, cond:2
I0824 08:41:03.770526 13 rpc_server.cc:132] RegisterRPC rpc_name:RequestCheckpoint, handler:0x31f779d0, cond:3
I0824 08:41:03.783799 13 rpc_server.cc:60] RPCServer ShutDown
I0824 08:41:03.784157 13 grpc_server.cc:321] server_ shutdown!
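When comparing logs from several pods, it can help to pull the `listen_and_serv` settings out of each ps log automatically. The following is a minimal sketch using only the Python standard library. The regular expression assumes the usual glog prefix layout (severity letter, MMDD, time, thread id, `file:line]`), and the function name `parse_listen_and_serv` is hypothetical, not part of Paddle.

```python
import re

# Assumed glog prefix: severity, MMDD, hh:mm:ss.uuuuuu, thread id, file:line], message.
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<file>[\w.]+):(?P<line>\d+)\] (?P<msg>.*)$"
)

def parse_listen_and_serv(line):
    """Return fan_in and end_point from a listen_and_serv log line, or None."""
    m = GLOG_LINE.match(line)
    if m is None:
        return None
    msg = m.group("msg")
    fan_in = re.search(r"fan_in:(\d+)", msg)
    end_point = re.search(r"end_point:([\d.]+:\d+)", msg)
    if fan_in is None or end_point is None:
        return None
    return {
        "file": m.group("file"),
        "fan_in": int(fan_in.group(1)),
        "end_point": end_point.group(1),
    }

# Line copied from the ps log in this issue.
sample = (
    "I0824 08:41:03.769584 13 listen_and_serv_op.cc:265] sync_mode:1, fan_in:5, "
    "end_point:192.168.127.131:7164, checkpoint_block_id: -1"
)
# Endpoint list printed by train.py in the same log.
endpoints = (
    "192.168.138.111:7164,192.168.147.60:7164,192.168.92.124:7164,"
    "192.168.231.68:7164,192.168.127.131:7164"
).split(",")

info = parse_listen_and_serv(sample)
print(info)
print(info["end_point"] in endpoints, len(endpoints))
```

For this log, the crashing ps is listening on an endpoint that does appear in the five-entry endpoint list, and `fan_in` has the same value as the number of endpoints. That is only an observation about this one run; it does not by itself explain the crash.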

Reference: paddlepaddle/Paddle#12956