ppyolo trains on a single GPU, but fails when multiple GPUs are specified. What is causing this error?
Created by: atomrun39
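Training works with CUDA_VISIBLE_DEVICES=0 python tools/train.py -c configs/ppyolo/ppyolo.yml --eval. After I increased the batch size, the single GPU ran out of memory, so I tried multi-GPU training. However, CUDA_VISIBLE_DEVICES=0,1 python tools/train.py -c configs/ppyolo/ppyolo.yml --eval fails with the error below. What is the problem, and how can I fix it?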
C++ Traceback (most recent call last):
0 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocator<paddle::framework::Scope*> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*)
1 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope*, paddle::framework::details::BuildStrategy*)
2 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope*, paddle::framework::details::BuildStrategy const&)
3 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<ncclUniqueId*, std::allocator<ncclUniqueId*> > const&, unsigned long, unsigned long)
4 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, ncclUniqueId*, unsigned long, unsigned long)
5 void std::__once_call_impl<std::_Bind_simple<paddle::platform::dynload::DynLoad__ncclCommInitAll::operator()<ncclComm**, int, int*>(ncclComm**, int, int*)::{lambda()#1} ()> >()
6 paddle::platform::dynload::GetNCCLDsoHandle()
7 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
8 std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
Error Message Summary:
PreconditionNotMetError: The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
Suggestions:
- Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
- Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX; at (/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:196)
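
To narrow down the "cannot open shared object file" part of the message, one option is to check whether the dynamic loader can open libnccl.so at all, and whether any NCCL file exists in the directories it searches. The following is only a minimal sketch: the helper names (`can_load`, `ld_library_dirs`, `find_nccl_candidates`) are made up for illustration, and the extra directories listed in `EXTRA_DIRS` are assumed common install locations, not paths taken from the log above.

```python
import ctypes
import glob
import os

# Assumed common locations; adjust to the actual CUDA/NCCL install.
EXTRA_DIRS = ["/usr/lib/x86_64-linux-gnu", "/usr/lib64", "/usr/local/cuda/lib64"]


def can_load(name="libnccl.so"):
    """Return True if the dynamic loader can open the library by this name."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


def ld_library_dirs(env=None):
    """Return the non-empty entries of LD_LIBRARY_PATH as a list."""
    env = os.environ if env is None else env
    return [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]


def find_nccl_candidates(search_dirs):
    """Return sorted paths of files matching libnccl.so* in the given directories."""
    found = set()
    for directory in search_dirs:
        found.update(glob.glob(os.path.join(directory, "libnccl.so*")))
    return sorted(found)


if __name__ == "__main__":
    print("libnccl.so loadable:", can_load())
    for path in find_nccl_candidates(ld_library_dirs() + EXTRA_DIRS):
        print("found:", path)
```

If nothing is found, NCCL is probably not installed. If a file is found but the load still fails, its directory is likely missing from LD_LIBRARY_PATH, or only a versioned file name is present while the error asks for the unversioned name libnccl.so.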