Following the documentation produces an error; requesting help!
Created by: zhaoguoqing12
API is deprecated since 2.0.0 Please use FleetAPI instead. WIKI: https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler
W0822 16:45:36.462678 28971 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 52, Driver API Version: 10.0, Runtime API Version: 10.0
W0822 16:45:36.466768 28971 device_context.cc:245] device: 0, cuDNN Version: 7.6.
W0822 16:45:37.586989 28971 dynamic_loader.cc:120] Can not find library: libnccl.so. The process maybe hang. Please try to add the lib path to LD_LIBRARY_PATH.
/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py:789: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/train.py", line 153, in <module>
    main(args)
  File "tools/train.py", line 90, in main
    exe.run(startup_prog)
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 790, in run
    six.reraise(*sys.exc_info())
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 785, in run
    use_program_cache=use_program_cache)
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 838, in _run_impl
    use_program_cache=use_program_cache)
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 912, in _run_program
    fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::dynload::GetNCCLDsoHandle()
3 void std::__once_call_impl<std::_Bind_simple<paddle::platform::dynload::DynLoad__ncclGetUniqueId::operator()<ncclUniqueId*>(ncclUniqueId*)::{lambda()#1} ()> >()
4 paddle::operators::GenNCCLIdOp::GenerateAndSend(paddle::framework::Scope*, paddle::platform::DeviceContext const&, std::string const&, std::vector<std::string, std::allocator<std::string> > const&) const
5 paddle::operators::GenNCCLIdOp::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7 paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
8 paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)
Python Call Stacks (More useful to users):
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2525, in append_op
    attrs=kwargs.get("attrs", None))
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/transpiler/distribute_transpiler.py", line 397, in _transpile_nccl2
    self.config.hierarchical_allreduce_inter_nranks
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/transpiler/distribute_transpiler.py", line 625, in transpile
    wait_port=self.config.wait_port)
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/incubate/fleet/collective/__init__.py", line 285, in _transpile
    current_endpoint=current_endpoint)
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/incubate/fleet/collective/__init__.py", line 358, in _try_to_compile
    self._transpile(startup_program, main_program)
  File "/data/data_pc_phone/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/incubate/fleet/collective/__init__.py", line 424, in minimize
    fleet.main_program = self._try_to_compile(startup_program, main_program)
  File "/data/PaddleClas/tools/program.py", line 363, in build
    optimizer.minimize(fetchs['loss'][0])
  File "tools/train.py", line 75, in main
    config, train_prog, startup_prog, is_train=True)
  File "tools/train.py", line 153, in <module>
    main(args)
Error Message Summary:
Error: Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled. at (/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:177)
[operator < gen_nccl_id > error]
2020-08-22 16:45:39,064-ERROR: ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
ERROR 2020-08-22 16:45:39,064 launch.py:284] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
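Since the error message points at LD_LIBRARY_PATH, a quick way to check whether any directory on that path actually contains libnccl.so might look like the sketch below. The helper name find_library_in_path is hypothetical and not part of Paddle. It only inspects the directories listed in LD_LIBRARY_PATH (or a list you pass in); it does not consult the ldconfig cache or the default system directories that the dynamic loader also searches. It looks for the exact file name, because the log shows the loader asking for libnccl.so rather than a versioned name.

```python
import os
from pathlib import Path


def find_library_in_path(name, search_dirs=None):
    """Return the full paths of `name` found in the given directories.

    If `search_dirs` is None, the directories come from LD_LIBRARY_PATH.
    This is a hypothetical diagnostic helper, not a Paddle API.
    """
    if search_dirs is None:
        raw = os.environ.get("LD_LIBRARY_PATH", "")
        search_dirs = [d for d in raw.split(os.pathsep) if d]

    found = []
    for directory in search_dirs:
        candidate = Path(directory) / name
        # exists() follows symlinks, so a dangling symlink is not reported.
        if candidate.exists():
            found.append(str(candidate))
    return found


if __name__ == "__main__":
    hits = find_library_in_path("libnccl.so")
    if hits:
        print("libnccl.so found at:")
        for hit in hits:
            print("  " + hit)
    else:
        print("libnccl.so not found in any LD_LIBRARY_PATH directory")
```

If nothing is found, the directory that holds libnccl.so on the machine would need to be added to LD_LIBRARY_PATH before launching training, as the error message suggests.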