Exception in multi-GPU mode
Created by: Bond-H
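- Version and environment info:
  1) PaddlePaddle version: 1.4.1
  3) GPU: K40m, CUDA 8.0, cuDNN 5.1
  4) System environment: CentOS 6.3, Python 2.7
- Training info:
  1) The single-machine, single-GPU model trains normally; the single-machine, multi-GPU model raises an exception.

The main code of the single-GPU model is as follows: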
```
import math
import os

import paddle.fluid as fluid

# nets, args, vocab_size, num_labels and pyreader_name come from the rest of the script.
pyreader = fluid.layers.py_reader(
    capacity=50,
    shapes=([-1, 1], [-1, 1]),
    dtypes=('int64', 'int64'),
    lod_levels=(1, 1),
    name=pyreader_name,
    use_double_buffer=False)
words, targets = fluid.layers.read_file(pyreader)
avg_cost, crf_decode = nets.lex_net(words, targets, args, vocab_size, num_labels)
(precision, recall, f1_score, num_infer_chunks, num_label_chunks,
 num_correct_chunks) = fluid.layers.chunk_eval(
     input=crf_decode,
     label=targets,
     chunk_scheme="IOB",
     num_chunk_types=int(math.ceil((num_labels - 1) / 2.0)))
chunk_evaluator = fluid.metrics.ChunkEvaluator()
chunk_evaluator.reset()
# Single-GPU execution on the device selected by FLAGS_selected_gpus (default 0).
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
exe = fluid.Executor(place)
```
For multi-GPU mode, the Executor is replaced as follows:
exe = fluid.ParallelExecutor(use_cuda=True,loss_name=avg_cost.name)
Initializing ParallelExecutor then fails with the following error:
Traceback (most recent call last):
  File "run_sequence_labeling_parrallel.py", line 359, in <module>
    main(args)
  File "run_sequence_labeling_parrallel.py", line 172, in main
    args, "train_reader", dataset.vocab_size, dataset.num_labels)
  File "run_sequence_labeling_parrallel.py", line 101, in create_model
    exe = fluid.ParallelExecutor(use_cuda=True,loss_name=avg_cost.name)
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 191, in __init__
    local_scopes, exec_strategy, build_strategy)
paddle.fluid.core.EnforceNotMet: Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled. at [/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:163]
PaddlePaddle Call Stacks:
0 0x7fc5efdb9975p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 357
1 0x7fc5efdb9cf9p paddle::platform::EnforceNotMet::EnforceNotMet(std::exception_ptr::exception_ptr, char const*, int) + 137
2 0x7fc5f1a5ffa5p paddle::platform::dynload::GetNCCLDsoHandle() + 1813
3 0x7fc5efefa799p void std::once_call_impl<std::Bind_simple<paddle::platform::dynload::DynLoad__ncclCommInitAll::operator()<ncclComm**, int, int*>(ncclComm**, int, int*)::{lambda()#1 (closed)} ()> >() + 9
4 0x7fc62508f973p pthread_once + 83
5 0x7fc5efefdbcdp paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > const&, ncclUniqueId*, unsigned long, unsigned long) + 2093
6 0x7fc5efef9902p paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > const&, std::unordered_set<std::string, std::hashstd::string, std::equal_tostd::string, std::allocatorstd::string > const&, paddle::framework::ProgramDesc const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocatorpaddle::framework::Scope* > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&) + 3922
7 0x7fc5efe17148p
8 0x7fc5efde528ep
9 0x7fc6252e97a3p PyObject_Call + 67
10 0x7fc6252f863dp
11 0x7fc6252e97a3p PyObject_Call + 67
12 0x7fc625342584p
13 0x7fc62533ee3bp
14 0x7fc6252e97a3p PyObject_Call + 67
15 0x7fc62537fb69p PyEval_EvalFrameEx + 15289
16 0x7fc6253854e9p PyEval_EvalCodeEx + 2025
17 0x7fc62530e377p
18 0x7fc6252e97a3p PyObject_Call + 67
19 0x7fc6252f863dp
20 0x7fc6252e97a3p PyObject_Call + 67
21 0x7fc625342584p
22 0x7fc62533ee3bp
23 0x7fc6252e97a3p PyObject_Call + 67
24 0x7fc62537fb69p PyEval_EvalFrameEx + 15289
25 0x7fc6253854e9p PyEval_EvalCodeEx + 2025
26 0x7fc6253829b8p PyEval_EvalFrameEx + 27144
27 0x7fc6253854e9p PyEval_EvalCodeEx + 2025
28 0x7fc6253829b8p PyEval_EvalFrameEx + 27144
29 0x7fc6253854e9p PyEval_EvalCodeEx + 2025
30 0x7fc62538570ap PyEval_EvalCode + 26
31 0x7fc62539e9cdp
32 0x7fc62539fb48p PyRun_FileExFlags + 120
33 0x7fc6253a0d68p PyRun_SimpleFileExFlags + 232
34 0x7fc6253b2f8cp Py_Main + 2988
35 0x7fc6245ebbd5p __libc_start_main + 245
36 0x7fc6256808bfp
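The error message above asks for LD_LIBRARY_PATH to point at libnccl.so. As a rough diagnostic, here is a minimal sketch that checks whether the dynamic loader can open the library and lists matching files in the LD_LIBRARY_PATH directories. The helper name `check_shared_library` is hypothetical and not part of PaddlePaddle, and the check only approximates what Paddle's own loader does.

```python
import ctypes
import os


def check_shared_library(name="libnccl.so"):
    """Return (loadable, candidates) for a shared library name.

    loadable   -- True if the dynamic loader can open `name`.
    candidates -- files in LD_LIBRARY_PATH directories whose names start with `name`.
    """
    try:
        ctypes.CDLL(name)  # on Linux this goes through dlopen()
        loadable = True
    except OSError:
        loadable = False

    candidates = []
    for directory in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        try:
            entries = sorted(os.listdir(directory))
        except OSError:
            continue  # unreadable directory, skip it
        # Versioned files such as libnccl.so.2 also start with the plain name.
        candidates.extend(
            os.path.join(directory, entry)
            for entry in entries
            if entry.startswith(name))
    return loadable, candidates


if __name__ == "__main__":
    loadable, candidates = check_shared_library()
    print("libnccl.so loadable: {}".format(loadable))
    print("candidates on LD_LIBRARY_PATH: {}".format(candidates))
```

If `loadable` is False and `candidates` is empty, NCCL is probably either not installed or installed in a directory that LD_LIBRARY_PATH does not cover, which would match the EnforceNotMet message.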
After removing '.name' from the loss, the Executor is replaced as follows:
exe = fluid.ParallelExecutor(use_cuda=True,loss_name=avg_cost)
Initializing ParallelExecutor then fails with the following error:
Traceback (most recent call last):
  File "run_sequence_labeling_parrallel.py", line 359, in <module>
    main(args)
  File "run_sequence_labeling_parrallel.py", line 172, in main
    args, "train_reader", dataset.vocab_size, dataset.num_labels)
  File "run_sequence_labeling_parrallel.py", line 101, in create_model
    exe = fluid.ParallelExecutor(use_cuda=True,loss_name=avg_cost)
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 190, in __init__
    cpt.to_text(loss_name) if loss_name else six.u(''), scope,
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/paddle/compat.py", line 76, in to_text
    return _to_text(obj, encoding)
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/paddle/compat.py", line 103, in _to_text
    return six.u(obj)
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/six.py", line 653, in u
    return unicode(s.replace(r'\\', r'\\\\'), "unicode_escape")
AttributeError: 'Variable' object has no attribute 'replace'
The same error also occurs when using the CompiledProgram.with_data_parallel() method:
exe = fluid.Executor(place)
compiled_prog = fluid.compiler.CompiledProgram(
    train_program).with_data_parallel(
        loss_name=train_ret["avg_cost"])
avg_cost, nums_infer, nums_label, nums_correct = exe.run(
    compiled_prog,
    fetch_list=[
        train_ret["avg_cost"],
        train_ret["num_infer_chunks"],
        train_ret["num_label_chunks"],
        train_ret["num_correct_chunks"],
    ], )
The error message is:
Traceback (most recent call last):
  File "run_sequence_labeling_parrallel1.py", line 346, in <module>
    main(args)
  File "run_sequence_labeling_parrallel1.py", line 259, in main
    train_ret["num_correct_chunks"],
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 527, in run
    program._compile(scope, self.place)
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/paddle/fluid/compiler.py", line 231, in _compile
    self._executor = self._compile_data_parallel()
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/paddle/fluid/compiler.py", line 202, in _compile_data_parallel
    if self._loss_name else six.u(''), self._scope, self._local_scopes,
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/paddle/compat.py", line 76, in to_text
    return _to_text(obj, encoding)
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/paddle/compat.py", line 103, in _to_text
    return six.u(obj)
  File "/home/work/huangdingbang/anaconda2/lib/python2.7/site-packages/six.py", line 653, in u
    return unicode(s.replace(r'\\', r'\\\\'), "unicode_escape")
AttributeError: 'Variable' object has no attribute 'replace'
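Both AttributeError tracebacks end in six.u(obj) calling .replace on the value passed as loss_name, which suggests that loss_name is expected to be a string (the variable's name) rather than the Variable itself. A minimal sketch of a guard that normalizes the argument before it is handed over follows. `resolve_loss_name` and `FakeVariable` are hypothetical names used only for illustration; `FakeVariable` stands in for a Paddle Variable so the snippet runs without Paddle installed.

```python
try:
    string_types = (basestring,)  # Python 2: covers str and unicode
except NameError:
    string_types = (str,)  # Python 3


def resolve_loss_name(loss):
    """Return the loss name as a string, given either a name or an object with .name."""
    if isinstance(loss, string_types):
        return loss
    name = getattr(loss, "name", None)
    if isinstance(name, string_types):
        return name
    raise TypeError(
        "loss must be a string or expose a string .name, got %r" % type(loss))


class FakeVariable(object):
    """Hypothetical stand-in for a Variable; only the .name attribute matters here."""

    def __init__(self, name):
        self.name = name


if __name__ == "__main__":
    print(resolve_loss_name("mean_0.tmp_0"))
    print(resolve_loss_name(FakeVariable("mean_0.tmp_0")))
```

Under this reading, the first attempt with loss_name=avg_cost.name passed the right type, and its failure comes from the missing libnccl.so rather than from the loss_name argument.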