Opened Jan 15, 2020 by saxon_zh (Guest)

Failed to run resnet dygraph with the following errors

Created by: leonleeldc

I tried to run the distributed version of the ResNet dygraph model. First, I set:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda-10.0/lib64/:/home/dingcheng/.conda/envs/sac_v1/lib/python3.5/site-packages/torch/lib/

Then, following the README, I ran:

dingcheng@DINGCHENGSERVER:/media/data1/dingcheng/software/paddlepaddle/models/dygraph/resnet$ /usr/local/bin/anaconda3/bin/python3.6 -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py --use_data_parallel 1
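For reference, train.py in models/dygraph/resnet follows the usual fluid dygraph data-parallel pattern, roughly like the sketch below (TinyNet is just a hypothetical stand-in for the ResNet model, and the exact layer APIs may differ slightly between Paddle versions):

```python
# Rough sketch (not the actual train.py) of the data-parallel dygraph pattern
# used when the script is started with paddle.distributed.launch and
# --use_data_parallel 1. TinyNet is a hypothetical placeholder for the ResNet.
import numpy as np
import paddle.fluid as fluid


class TinyNet(fluid.dygraph.Layer):
    def __init__(self):
        super(TinyNet, self).__init__("tiny_net")
        self.w = self.create_parameter(attr=None, shape=[8, 8], dtype="float32")

    def forward(self, x):
        return fluid.layers.reduce_mean(fluid.layers.matmul(x, self.w))


# paddle.distributed.launch starts one process per selected GPU and sets the
# PADDLE_TRAINER_* environment variables that Env()/prepare_context() read.
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
with fluid.dygraph.guard(place):
    strategy = fluid.dygraph.parallel.prepare_context()  # initializes NCCL across ranks
    model = fluid.dygraph.parallel.DataParallel(TinyNet(), strategy)

    x = fluid.dygraph.to_variable(np.random.rand(4, 8).astype("float32"))
    # The first forward call broadcasts the rank-0 parameters to the other
    # ranks; that broadcast is where the failure reported below occurs.
    out = model(x)
```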

I got the following error messages:

----------- Configuration Arguments -----------
cluster_node_ips: 127.0.0.1
log_dir: ./mylog
node_ip: 127.0.0.1
print_config: True
selected_gpus: 0,1,2,3
started_port: 6170
training_script: train.py
training_script_args: ['--use_data_parallel', '1']
use_paddlecloud: True

trainers_endpoints: 127.0.0.1:6170,127.0.0.1:6171,127.0.0.1:6172,127.0.0.1:6173 , node_id: 0 , current_node_ip: 127.0.0.1 , num_nodes: 1 , node_ips: ['127.0.0.1'] , nranks: 4
2020-01-14 18:15:55,786-ERROR: ABORT!!! Out of all 4 trainers, the trainer process with rank=[0] was aborted. Please check its log.
ERROR 2020-01-14 18:15:55,786 launch.py:269] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0] was aborted. Please check its log.

Then I checked the worker log files, as follows.

dingcheng@DINGCHENGSERVER:/media/data1/dingcheng/software/paddlepaddle/models/dygraph/resnet$ less mylog/workerlog.0

I0114 18:07:36.576611 55212 nccl_context.cc:127] init nccl context nranks: 4 local rank: 1 gpu id: 1
W0114 18:07:40.439381 55212 device_context.cc:236] Please NOTE: device: 1, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 10.0
W0114 18:07:40.451385 55212 device_context.cc:244] device: 1, cuDNN Version: 7.6.
load finished
start data reader (trainers_num: 4, trainer_id: 1)
Traceback (most recent call last):
  File "train.py", line 381, in <module>
    train_resnet()
  File "train.py", line 337, in train_resnet
    out = resnet(img)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 178, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel.py", line 148, in forward
    return self._layers(*inputs, **kwargs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 178, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "train.py", line 220, in forward
    y = self.conv(inputs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 178, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "train.py", line 96, in forward
    y = self._conv(inputs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 176, in __call__
    parallel_helper._broadcast_parameters(self._parameters.values())
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 43, in _broadcast_parameters
    collective._broadcast(param, 0, sync_mode=True)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/layers/collective.py", line 60, in _broadcast
    "root": root})
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/framework.py", line 2479, in append_op
    kwargs.get("stop_gradient", False))
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/tracer.py", line 47, in trace_op
    not stop_gradient)
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::operators::NCCLBroadcastOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::NCCLBroadcastOpKernel, paddle::operators::NCCLBroadcastOpKernel, paddle::operators::NCCLBroadcastOpKernel, paddle::operators::NCCLBroadcastOpKernel, paddle::operators::NCCLBroadcastOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4   paddle::imperative::PreparedOp::Run()
5   paddle::imperative::OpBase::Run(std::map<std::string, std::vector<std::shared_ptr<paddle::imperative::VarBase>, std::allocator<std::shared_ptr<paddle::imperative::VarBase> > >, std::less<std::string>, std::allocator<std::pair<std::string const, std::vector<std::shared_ptr<paddle::imperative::VarBase>, std::allocator<std::shared_ptr<paddle::imperative::VarBase> > > > > > const&, std::map<std::string, std::vector<std::shared_ptr<paddle::imperative::VarBase>, std::allocator<std::shared_ptr<paddle::imperative::VarBase> > >, std::less<std::string>, std::allocator<std::pair<std::string const, std::vector<std::shared_ptr<paddle::imperative::VarBase>, std::allocator<std::shared_ptr<paddle::imperative::VarBase> > > > > > const&)
6   paddle::imperative::Tracer::TraceOp(std::string const&, std::map<std::string, std::vector<std::shared_ptr<paddle::imperative::VarBase>, std::allocator<std::shared_ptr<paddle::imperative::VarBase> > >, std::less<std::string>, std::allocator<std::pair<std::string const, std::vector<std::shared_ptr<paddle::imperative::VarBase>, std::allocator<std::shared_ptr<paddle::imperative::VarBase> > > > > > const&, std::map<std::string, std::vector<std::shared_ptr<paddle::imperative::VarBase>, std::allocator<std::shared_ptr<paddle::imperative::VarBase> > >, std::less<std::string>, std::allocator<std::pair<std::string const, std::vector<std::shared_ptr<paddle::imperative::VarBase>, std::allocator<std::shared_ptr<paddle::imperative::VarBase> > > > > > const&, std::unordered_map<std::string, boost::variant<boost::blank, int, float, std::string, std::vector<int, std::allocator >, std::vector<float, std::allocator >, std::vector<std::string, std::allocator<std::string> >, bool, std::vector<bool, std::allocator >, paddle::framework::BlockDesc*, long, std::vector<paddle::framework::BlockDesc*, std::allocator<paddle::framework::BlockDesc*> >, std::vector<long, std::allocator >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, boost::variant<boost::blank, int, float, std::string, std::vector<int, std::allocator >, std::vector<float, std::allocator >, std::vector<std::string, std::allocator<std::string> >, bool, std::vector<bool, std::allocator >, paddle::framework::BlockDesc*, long, std::vector<paddle::framework::BlockDesc*, std::allocator<paddle::framework::BlockDesc*> >, std::vector<long, std::allocator >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, paddle::platform::Place const&, bool)


Error Message Summary:

Error: Paddle internal Check failed. (Please help us create a new issue, here we need to find the developer to add a user friendly error message) at (/paddle/paddle/fluid/operators/distributed_ops/broadcast_op.cu.cc:60)
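From the stack, the abort happens while the first forward call broadcasts parameters from rank 0. My reading of paddle/fluid/dygraph/parallel_helper.py as referenced in the traceback (a paraphrase, not the exact Paddle source) is roughly:

```python
# Paraphrase of the _broadcast_parameters helper named in the traceback:
# every trainable parameter is broadcast synchronously from rank 0, and it is
# this broadcast/NCCL kernel that hits the internal check at
# broadcast_op.cu.cc:60.
from paddle.fluid.layers import collective


def _broadcast_parameters(parameters):
    for param in parameters:
        if param.trainable:
            collective._broadcast(param, 0, sync_mode=True)
```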

I am not sure what the problem is or how to fix it. Could you give me some help?

By the way, the single-GPU version seems to run without any problem.
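In case it helps narrow things down, here is a quick environment check that can be run on this machine (just a sketch; the single-GPU path never touches NCCL, so this only confirms what Paddle sees of the CUDA setup):

```python
# Quick environment check (a sketch): report the Paddle build and the GPUs it
# can see, to help separate an environment problem from a framework bug.
import paddle
import paddle.fluid as fluid
from paddle.fluid import core

print("Paddle version:", paddle.__version__)
print("Compiled with CUDA:", fluid.is_compiled_with_cuda())
print("CUDA devices visible to Paddle:", core.get_cuda_device_count())
```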

Reference: paddlepaddle/Paddle#22285