Video model NeXtVLAD stuck during training
Created by: Haijunlv
The video model NeXtVLAD gets stuck during the training process. I use the official Docker image: paddlepaddle/paddle:1.3.0-gpu-cuda9.0-cudnn7. GPU environment: 8x Tesla V100, driver version: 384.81.
I followed the YouTube-8M data preprocessing and then started training. Train script:
export CUDA_VISIBLE_DEVICES=4,5,6,7
python train.py --model-name="NEXTVLAD" --config=./configs/nextvlad.txt --epoch-num=6 --valid-interval=1 --log-interval=1
nextvlad.txt:
num_gpus = 4
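For reference, num_gpus = 4 is meant to match the four devices exposed by CUDA_VISIBLE_DEVICES. A quick sanity check run inside the container (a minimal sketch; note that nvidia-smi lists all GPUs on the host regardless of CUDA_VISIBLE_DEVICES):
# Devices exposed to the training process; should contain 4 entries to match num_gpus = 4
$ echo $CUDA_VISIBLE_DEVICES          # expect: 4,5,6,7
# All GPUs physically present on the host (8 Tesla V100 here)
$ nvidia-smi -L | wc -l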
Training log (the process hangs after the last line):
[INFO: train.py: 221]: Namespace(batch_size=None, config='./configs/nextvlad.txt', epoch_num=6, learning_rate=None, log_interval=1, model_name='NEXTVLAD', no_memory_optimize=False, no_use_pyreader=False, pretrain=None, resume=None, save_dir='checkpoints', use_gpu=True, valid_interval=1)
W0323 03:15:33.823276 2265 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.0, Runtime API Version: 9.0
W0323 03:15:33.823355 2265 device_context.cc:271] device: 0, cuDNN Version: 7.0.
cur_time:1553310943.58
period:1.50254201889
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 0 , loss = 2941.624023, Hit@1 = 0.00, PERR = 0.00, GAP = 0.00
cur_time:1553310945.17
period:0.693611860275
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 1 , loss = 2904.056396, Hit@1 = 0.00, PERR = 0.00, GAP = 0.00
cur_time:1553310945.94
period:0.621890068054
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 2 , loss = 2848.791504, Hit@1 = 0.01, PERR = 0.01, GAP = 0.00
cur_time:1553310946.64
period:0.602577924728
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 3 , loss = 2775.756592, Hit@1 = 0.03, PERR = 0.01, GAP = 0.00
cur_time:1553310947.47
period:0.574059009552
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 4 , loss = 2688.623047, Hit@1 = 0.06, PERR = 0.03, GAP = 0.01
cur_time:1553310948.13
period:0.571026086807
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 5 , loss = 2595.953613, Hit@1 = 0.08, PERR = 0.04, GAP = 0.01
cur_time:1553310948.77
period:0.997609853745
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 6 , loss = 2510.459717, Hit@1 = 0.08, PERR = 0.05, GAP = 0.02
cur_time:1553310949.85
period:0.483380794525
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 7 , loss = 2416.960205, Hit@1 = 0.07, PERR = 0.02, GAP = 0.01
cur_time:1553310950.42
period:0.460392951965
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 8 , loss = 2313.718994, Hit@1 = 0.13, PERR = 0.07, GAP = 0.03
cur_time:1553310950.95
period:0.521590948105
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 9 , loss = 2227.182129, Hit@1 = 0.10, PERR = 0.04, GAP = 0.03
cur_time:1553310951.56
Tracing the code, the process is stuck at tools/train_utils.py L105: train_outs = train_exe.run(fetch_list=train_fetch_list)
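One way to see where it is blocked would be to dump the stacks of the hung process. This is a hedged sketch of the diagnostics (pid 2265 comes from the log above and should be replaced with the real trainer pid; py-spy and gdb would need to be installed in the container):
# Dump Python and native stacks of the hung trainer process
$ pip install py-spy
$ py-spy dump --pid 2265 --native
# Or dump all native thread backtraces with gdb to see if threads are waiting inside NCCL
$ gdb -p 2265 -batch -ex "thread apply all bt"
# Re-running with NCCL and framework debug logging can also show whether an all-reduce never returns
$ export NCCL_DEBUG=INFO
$ export GLOG_v=3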
#############################
Actually, there was one strange bug that happened right after installing the Docker image:
Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory )
I followed the solution below to work around it:
If you run single-machine multi-GPU training with the GPU version, you may get an error that nccl.so cannot be found. The workaround is as follows:
# If you are using a CUDA 8 environment (check with nvcc --version)
# Download the NCCL library for CUDA 8
$ wget 10.86.69.44:8192/nccl_2.1.4-1+cuda8.0_x86_64.txz
$ tar -Jxf nccl_2.1.4-1+cuda8.0_x86_64.txz
$ export LD_LIBRARY_PATH=`pwd`/nccl_2.1.4-1+cuda8.0_x86_64/lib:$LD_LIBRARY_PATH
# If you are using a CUDA 9 environment
# Download the NCCL library for CUDA 9
$ wget 10.86.69.44:8192/nccl_2.2.12-1+cuda9.0_x86_64.txz
$ tar -Jxf nccl_2.2.12-1+cuda9.0_x86_64.txz
$ export LD_LIBRARY_PATH=`pwd`/nccl_2.2.12-1+cuda9.0_x86_64/lib:$LD_LIBRARY_PATH
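After setting LD_LIBRARY_PATH as above, the libnccl.so error went away. A minimal check that the library now resolves (assuming train.py is launched from the same shell):
# libnccl.so should now be resolvable through LD_LIBRARY_PATH
$ ls nccl_2.2.12-1+cuda9.0_x86_64/lib/libnccl.so*
$ python -c "import ctypes; ctypes.CDLL('libnccl.so')"   # raises OSError if it still cannot be found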
For reference, the detailed error log of the original libnccl.so failure is the following:
Traceback (most recent call last):
File "train.py", line 226, in <module>
train(args)
File "train.py", line 171, in train
main_program=train_prog)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 191, in __init__
local_scopes, exec_strategy, build_strategy)
paddle.fluid.core.EnforceNotMet: Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled. at [/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:163]
PaddlePaddle Call Stacks:
0 0x7f52bacfd8f5p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 357
1 0x7f52bacfdc79p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 137
2 0x7f52bc7ac365p paddle::platform::dynload::GetNCCLDsoHandle() + 1813
3 0x7f52bae3e7e9p void std::__once_call_impl<std::_Bind_simple<decltype (ncclCommInitAll({parm#1}...)) paddle::platform::dynload::DynLoad__ncclCommInitAll::operator()<ncclComm**, int, int*>(ncclComm**, int, int*)::{lambda()#1} ()> >() + 9
4 0x7f5332f6ba99p
5 0x7f52bae41c1dp paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > const&, ncclUniqueId*, unsigned long, unsigned long) + 2093
6 0x7f52bae3d952p paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, paddle::framework::ProgramDesc const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocator<paddle::framework::Scope*> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&) + 3922
7 0x7f52bad5b098p
8 0x7f52bad291fep
9 0x4eef5ep
10 0x4eeb66p
11 0x4aaafbp
12 0x4c166dp PyEval_EvalFrameEx + 22413
13 0x4b9b66p PyEval_EvalCodeEx + 774
14 0x4d57a3p
15 0x4eef5ep
16 0x4eeb66p
17 0x4aaafbp
18 0x4c166dp PyEval_EvalFrameEx + 22413
19 0x4b9b66p PyEval_EvalCodeEx + 774
20 0x4c1f56p PyEval_EvalFrameEx + 24694
21 0x4b9b66p PyEval_EvalCodeEx + 774
22 0x4eb69fp
23 0x4e58f2p PyRun_FileExFlags + 130
24 0x4e41a6p PyRun_SimpleFileExFlags + 390
25 0x4938cep Py_Main + 1358
26 0x7f5332bb3830p __libc_start_main + 240
27 0x493299p _start + 41
Could anyone help me solve this hang during training?