PaddlePaddle / models · Issue #1917
Closed
Opened March 23, 2019 by saxon_zh (Guest)

video NeXtVLAD stuck at training

Created by: Haijunlv

The video model NeXtVLAD gets stuck during the training process. I am using the official docker image paddlepaddle/paddle:1.3.0-gpu-cuda9.0-cudnn7. GPU environment: 8x Tesla V100, driver version: 384.81.

I followed the YouTube-8M data preprocessing and then started training. Train script:

    export CUDA_VISIBLE_DEVICES=4,5,6,7
    python train.py --model-name="NEXTVLAD" --config=./configs/nextvlad.txt \
        --epoch-num=6 --valid-interval=1 --log-interval=1

nextvlad.txt: num_gpus = 4
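
As a sanity check (not part of the original report), the number of devices exposed through CUDA_VISIBLE_DEVICES should match num_gpus in the config; with 4 visible devices and num_gpus = 4 the setup above looks consistent. A minimal way to verify this inside the container, assuming nvidia-smi is available:

    # count the devices the training process will actually see
    echo $CUDA_VISIBLE_DEVICES | tr ',' '\n' | wc -l
    # list the physical GPUs visible inside the container
    nvidia-smi -L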

Training log up to the point where it hangs:

[INFO: train.py:  221]: Namespace(batch_size=None, config='./configs/nextvlad.txt', epoch_num=6, learning_rate=None, log_interval=1, model_name='NEXTVLAD', no_memory_optimize=False, no_use_pyreader=False, pretrain=None, resume=None, save_dir='checkpoints', use_gpu=True, valid_interval=1)
W0323 03:15:33.823276  2265 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.0, Runtime API Version: 9.0
W0323 03:15:33.823355  2265 device_context.cc:271] device: 0, cuDNN Version: 7.0.
cur_time:1553310943.58
period:1.50254201889
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 0  , loss = 2941.624023, Hit@1 = 0.00, PERR = 0.00, GAP = 0.00
cur_time:1553310945.17
period:0.693611860275
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 1  , loss = 2904.056396, Hit@1 = 0.00, PERR = 0.00, GAP = 0.00
cur_time:1553310945.94
period:0.621890068054
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 2  , loss = 2848.791504, Hit@1 = 0.01, PERR = 0.01, GAP = 0.00
cur_time:1553310946.64
period:0.602577924728
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 3  , loss = 2775.756592, Hit@1 = 0.03, PERR = 0.01, GAP = 0.00
cur_time:1553310947.47
period:0.574059009552
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 4  , loss = 2688.623047, Hit@1 = 0.06, PERR = 0.03, GAP = 0.01
cur_time:1553310948.13
period:0.571026086807
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 5  , loss = 2595.953613, Hit@1 = 0.08, PERR = 0.04, GAP = 0.01
cur_time:1553310948.77
period:0.997609853745
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 6  , loss = 2510.459717, Hit@1 = 0.08, PERR = 0.05, GAP = 0.02
cur_time:1553310949.85
period:0.483380794525
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 7  , loss = 2416.960205, Hit@1 = 0.07, PERR = 0.02, GAP = 0.01
cur_time:1553310950.42
period:0.460392951965
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 8  , loss = 2313.718994, Hit@1 = 0.13, PERR = 0.07, GAP = 0.03
cur_time:1553310950.95
period:0.521590948105
[INFO: metrics_util.py:   68]: [TRAIN] Epoch 0, iter 9  , loss = 2227.182129, Hit@1 = 0.10, PERR = 0.04, GAP = 0.03
cur_time:1553310951.56

Tracing from the log, the process is stuck at tools/train_utils.py L105: train_outs = train_exe.run(fetch_list=train_fetch_list)
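
Not from the original report, but two generic ways to get more information when a multi-GPU run hangs inside train_exe.run() are to raise the NCCL log level and to dump the native stacks of the hung process. A rough sketch (the <PID> placeholder and tool availability are assumptions):

    # re-run training with verbose NCCL logging to see whether all ranks finish initialization
    export NCCL_DEBUG=INFO
    python train.py --model-name="NEXTVLAD" --config=./configs/nextvlad.txt \
        --epoch-num=6 --valid-interval=1 --log-interval=1

    # or attach gdb to the hung process (replace <PID> with the training process id)
    # and dump all native thread stacks to see where it is blocked
    gdb -p <PID> -batch -ex "thread apply all bt"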

Separately, there was one strange bug right after installing the docker image: Failed to find dynamic library: libnccl.so (libnccl.so: cannot open shared object file: No such file or directory). I followed the solution below to fix it.

If you train with multiple GPUs on a single machine using the GPU build, you may get an error that nccl.so cannot be found. The fix is as follows:
# If your environment is CUDA 8 (check with nvcc --version)
# Download the NCCL library for CUDA 8
$ wget 10.86.69.44:8192/nccl_2.1.4-1+cuda8.0_x86_64.txz
$ tar -Jxf nccl_2.1.4-1+cuda8.0_x86_64.txz
$ export LD_LIBRARY_PATH=`pwd`/nccl_2.1.4-1+cuda8.0_x86_64/lib:$LD_LIBRARY_PATH

# If your environment is CUDA 9
# Download the NCCL library for CUDA 9
$ wget 10.86.69.44:8192/nccl_2.2.12-1+cuda9.0_x86_64.txz
$ tar -Jxf nccl_2.2.12-1+cuda9.0_x86_64.txz
$ export LD_LIBRARY_PATH=`pwd`/nccl_2.2.12-1+cuda9.0_x86_64/lib:$LD_LIBRARY_PATH
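
After exporting LD_LIBRARY_PATH as above, a quick way to confirm that libnccl.so is now loadable from the same shell (a sketch using plain ctypes, not anything Paddle-specific):

    # dlopen() searches LD_LIBRARY_PATH, so this only succeeds if the export took effect
    python -c "import ctypes; ctypes.CDLL('libnccl.so'); print('libnccl.so loaded')"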

The detailed bug log is as follows:

Traceback (most recent call last):
  File "train.py", line 226, in <module>
    train(args)
  File "train.py", line 171, in train
    main_program=train_prog)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 191, in __init__
    local_scopes, exec_strategy, build_strategy)
paddle.fluid.core.EnforceNotMet: Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory )
 Please specify its path correctly using following ways:
 Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
 For instance, issue command: export LD_LIBRARY_PATH=...
 Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled. at [/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:163]
PaddlePaddle Call Stacks:
0       0x7f52bacfd8f5p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 357
1       0x7f52bacfdc79p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 137
2       0x7f52bc7ac365p paddle::platform::dynload::GetNCCLDsoHandle() + 1813
3       0x7f52bae3e7e9p void std::__once_call_impl<std::_Bind_simple<decltype (ncclCommInitAll({parm#1}...)) paddle::platform::dynload::DynLoad__ncclCommInitAll::operator()<ncclComm**, int, int*>(ncclComm**, int, int*)::{lambda()#1} ()> >() + 9
4       0x7f5332f6ba99p
5       0x7f52bae41c1dp paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > const&, ncclUniqueId*, unsigned long, unsigned long) + 2093
6       0x7f52bae3d952p paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > const&, std::unordered_set<std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::string> > const&, paddle::framework::ProgramDesc const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocator<paddle::framework::Scope*> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&) + 3922
7       0x7f52bad5b098p
8       0x7f52bad291fep
9             0x4eef5ep
10            0x4eeb66p
11            0x4aaafbp
12            0x4c166dp PyEval_EvalFrameEx + 22413
13            0x4b9b66p PyEval_EvalCodeEx + 774
14            0x4d57a3p
15            0x4eef5ep
16            0x4eeb66p
17            0x4aaafbp
18            0x4c166dp PyEval_EvalFrameEx + 22413
19            0x4b9b66p PyEval_EvalCodeEx + 774
20            0x4c1f56p PyEval_EvalFrameEx + 24694
21            0x4b9b66p PyEval_EvalCodeEx + 774
22            0x4eb69fp
23            0x4e58f2p PyRun_FileExFlags + 130
24            0x4e41a6p PyRun_SimpleFileExFlags + 390
25            0x4938cep Py_Main + 1358
26      0x7f5332bb3830p __libc_start_main + 240
27            0x493299p _start + 41

Could anyone help me solve the hang problem?

Issue reference: paddlepaddle/models#1917