Failed to run resnet dygraph with the following errors
Created by: leonleeldc
I tried to run the distributed version of the ResNet dygraph example. First, I set the library path:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda-10.0/lib64/:/home/dingcheng/.conda/envs/sac_v1/lib/python3.5/site-packages/torch/lib/
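A quick way to confirm that the CUDA and NCCL libraries this export is meant to expose are actually resolvable (a minimal sketch; the paths are the ones from the export above, and core_avx is the module name that appears in the traceback further down):

```bash
# Which NCCL / CUDA runtime / cuDNN libraries can the dynamic linker find?
ldconfig -p | grep -E 'libnccl|libcudart|libcudnn'

# Is the CUDA 10.0 runtime referenced in the export above present?
ls /usr/local/cuda-10.0/lib64/libcudart.so*

# Which CUDA/NCCL libraries does Paddle's compiled core actually resolve to?
ldd "$(/usr/local/bin/anaconda3/bin/python3.6 -c 'import paddle.fluid.core_avx as core_avx; print(core_avx.__file__)')" \
    | grep -E 'nccl|cudart|cudnn'
```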
Then I ran the command from the README:

dingcheng@DINGCHENGSERVER:/media/data1/dingcheng/software/paddlepaddle/models/dygraph/resnet$ /usr/local/bin/anaconda3/bin/python3.6 -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py --use_data_parallel 1
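A quick sanity check that this Paddle build is CUDA-enabled and sees all four cards looks like this (a minimal sketch, assuming the fluid 1.6/1.7-era API; as far as I know, install_check.run_check() also runs a small test program):

```bash
# Sanity check: is this Paddle build CUDA-enabled, and does it see all four GPUs?
/usr/local/bin/anaconda3/bin/python3.6 -c "
import paddle.fluid as fluid
print('compiled with CUDA:', fluid.core.is_compiled_with_cuda())
print('visible GPU count :', fluid.core.get_cuda_device_count())
fluid.install_check.run_check()
"
```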
When I ran the launch command above, I got the following error messages:
----------- Configuration Arguments -----------
cluster_node_ips: 127.0.0.1
log_dir: ./mylog
node_ip: 127.0.0.1
print_config: True
selected_gpus: 0,1,2,3
started_port: 6170
training_script: train.py
training_script_args: ['--use_data_parallel', '1']
use_paddlecloud: True
trainers_endpoints: 127.0.0.1:6170,127.0.0.1:6171,127.0.0.1:6172,127.0.0.1:6173 , node_id: 0 , current_node_ip: 127.0.0.1 , num_nodes: 1 , node_ips: ['127.0.0.1'] , nranks: 4
2020-01-14 18:15:55,786-ERROR: ABORT!!! Out of all 4 trainers, the trainer process with rank=[0] was aborted. Please check its log.
ERROR 2020-01-14 18:15:55,786 launch.py:269] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0] was aborted. Please check its log.
Then I checked the worker log files:
dingcheng@DINGCHENGSERVER:/media/data1/dingcheng/software/paddlepaddle/models/dygraph/resnet$ less mylog/workerlog.0
I0114 18:07:36.576611 55212 nccl_context.cc:127] init nccl context nranks: 4 local rank: 1 gpu id: 1
W0114 18:07:40.439381 55212 device_context.cc:236] Please NOTE: device: 1, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 10.0
W0114 18:07:40.451385 55212 device_context.cc:244] device: 1, cuDNN Version: 7.6.
load finished
start data reader (trainers_num: 4, trainer_id: 1)
Traceback (most recent call last):
  File "train.py", line 381, in <module>
    train_resnet()
  File "train.py", line 337, in train_resnet
    out = resnet(img)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 178, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel.py", line 148, in forward
    return self._layers(*inputs, **kwargs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 178, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "train.py", line 220, in forward
    y = self.conv(inputs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 178, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "train.py", line 96, in forward
    y = self._conv(inputs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 176, in __call__
    parallel_helper._broadcast_parameters(self._parameters.values())
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 43, in _broadcast_parameters
    collective._broadcast(param, 0, sync_mode=True)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/layers/collective.py", line 60, in _broadcast
    "root": root})
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/framework.py", line 2479, in append_op
    kwargs.get("stop_gradient", False))
  File "/usr/local/bin/anaconda3/lib/python3.6/site-packages/paddle/fluid/dygraph/tracer.py", line 47, in trace_op
    not stop_gradient)
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::exception_ptr::exception_ptr, char const*, int)
2   paddle::operators::NCCLBroadcastOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3   std::Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::NCCLBroadcastOpKernel, paddle::operators::NCCLBroadcastOpKernel, paddle::operators::NCCLBroadcastOpKernel, paddle::operators::NCCLBroadcastOpKernel, paddle::operators::NCCLBroadcastOpKernelpaddle::platform::float16 >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::M_invoke(std::Any_data const&, paddle::framework::ExecutionContext const&)
4   paddle::imperative::PreparedOp::Run()
5   paddle::imperative::OpBase::Run(std::map<std::string, std::vector<std::shared_ptrpaddle::imperative::VarBase, std::allocator<std::shared_ptrpaddle::imperative::VarBase > >, std::lessstd::string, std::allocator<std::pair<std::string const, std::vector<std::shared_ptrpaddle::imperative::VarBase, std::allocator<std::shared_ptrpaddle::imperative::VarBase > > > > > const&, std::map<std::string, std::vector<std::shared_ptrpaddle::imperative::VarBase, std::allocator<std::shared_ptrpaddle::imperative::VarBase > >, std::lessstd::string, std::allocator<std::pair<std::string const, std::vector<std::shared_ptrpaddle::imperative::VarBase, std::allocator<std::shared_ptrpaddle::imperative::VarBase > > > > > const&)
6   paddle::imperative::Tracer::TraceOp(std::string const&, std::map<std::string, std::vector<std::shared_ptrpaddle::imperative::VarBase, std::allocator<std::shared_ptrpaddle::imperative::VarBase > >, std::lessstd::string, std::allocator<std::pair<std::string const, std::vector<std::shared_ptrpaddle::imperative::VarBase, std::allocator<std::shared_ptrpaddle::imperative::VarBase > > > > > const&, std::map<std::string, std::vector<std::shared_ptrpaddle::imperative::VarBase, std::allocator<std::shared_ptrpaddle::imperative::VarBase > >, std::lessstd::string, std::allocator<std::pair<std::string const, std::vector<std::shared_ptrpaddle::imperative::VarBase, std::allocator<std::shared_ptrpaddle::imperative::VarBase > > > > > const&, std::unordered_map<std::string, boost::variant<boost::blank, int, float, std::string, std::vector<int, std::allocator >, std::vector<float, std::allocator >, std::vector<std::string, std::allocatorstd::string >, bool, std::vector<bool, std::allocator >, paddle::framework::BlockDesc*, long, std::vector<paddle::framework::BlockDesc*, std::allocatorpaddle::framework::BlockDesc* >, std::vector<long, std::allocator >, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void_, boost::detail::variant::void_>, std::hashstd::string, std::equal_tostd::string, std::allocator<std::pair<std::string const, boost::variant<boost::blank, int, float, std::string, std::vector<int, std::allocator >, std::vector<float, std::allocator >, std::vector<std::string, std::allocatorstd::string >, bool, std::vector<bool, std::allocator >, paddle::framework::BlockDesc*, long, std::vector<paddle::framework::BlockDesc*, std::allocatorpaddle::framework::BlockDesc* >, std::vector<long, std::allocator >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, paddle::platform::Place const&, bool)
Error Message Summary:
Error: Paddle internal Check failed. (Please help us create a new issue, here we need to find the developer to add a user friendly error message) at (/paddle/paddle/fluid/operators/distributed_ops/broadcast_op.cu.cc:60)
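The failure happens while NCCL broadcasts the parameters from rank 0 (collective._broadcast in the traceback above), so rerunning with NCCL debug logging enabled might show more detail in the worker logs. A minimal sketch; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, and NCCL_P2P_DISABLE is only a diagnostic switch to rule out peer-to-peer problems between the cards:

```bash
# Rerun with NCCL debug output so the worker logs show what NCCL itself reports.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL

# Optional diagnostic: rule out GPU peer-to-peer issues between the four cards.
# export NCCL_P2P_DISABLE=1

/usr/local/bin/anaconda3/bin/python3.6 -m paddle.distributed.launch \
    --selected_gpus=0,1,2,3 --log_dir ./mylog train.py --use_data_parallel 1
```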
I am not sure what the problem is or how to fix it. Could you give me some help?
By the way, the single-GPU version seems to run without problems.
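One way to compare the four worker logs and to try a reduced multi-GPU run, in case that helps narrow things down (a sketch; the grep pattern is just a guess at the relevant keywords):

```bash
# Each rank writes its own log; the first error is not always in workerlog.0.
grep -n -E 'Error|ERROR|ABORT|EnforceNotMet' mylog/workerlog.*

# Try the same script on just two cards to see whether the parameter-broadcast
# failure depends on the number of GPUs involved.
/usr/local/bin/anaconda3/bin/python3.6 -m paddle.distributed.launch \
    --selected_gpus=0,1 --log_dir ./mylog2 train.py --use_data_parallel 1
```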