PaddlePaddle / models
Opened Mar 28, 2019 by saxon_zh (@saxon_zh, Guest)

video NeXtVLAD training error

Created by: Haijunlv

Training the NeXtVLAD video model fails with an error. It happens when the model runs validation after one epoch of training.
To check whether the data was broken, I used the same data for training and validation, and found that the error always happens at the first-epoch validation. I have not modified any of the main process code; I am just trying to train. I hope someone can help me!

Environment: the official Docker image paddlepaddle/paddle:latest-gpu-cuda9.0-cudnn7. GPU environment: 8 Tesla V100, driver version: 384.81

I followed the YouTube-8M data preprocessing steps and then started training.

Train script:
export CUDA_VISIBLE_DEVICES=4,5,6,7
python train.py --model-name="NEXTVLAD" --config=./configs/nextvlad.txt --epoch-num=6 --valid-interval=1 --log-interval=1

In nextvlad.txt:
num_gpus = 4
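With --epoch-num=6 and --valid-interval=1, validation would be expected after every epoch, which is consistent with the failure appearing right after epoch 0 finishes. A minimal sketch of that schedule, assuming validation runs whenever (epoch + 1) is a multiple of the interval; the function name `validation_epochs` is hypothetical and the rule is an assumption, not something read from train.py:

```python
def validation_epochs(epoch_num, valid_interval):
    """Return the zero-based epochs after which validation would run.

    Assumed rule (not confirmed against train.py): validate when
    (epoch + 1) is a multiple of valid_interval; a non-positive
    interval disables validation.
    """
    if valid_interval <= 0:
        return []
    return [e for e in range(epoch_num) if (e + 1) % valid_interval == 0]


# Flags from the train script in this report.
print(validation_epochs(6, 1))
```

Under this assumption the first validation pass is the one after epoch 0, so an error there points at the validation setup rather than at a particular data file.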

Error log:

file path:/home/lvhaijun/dataset/youtube-8m/frame/pkl/train/train3581.pkl
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24299 , loss = 4.340643, Hit@1 = 0.88, PERR = 0.79, GAP = 0.84
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24300 , loss = 4.450137, Hit@1 = 0.86, PERR = 0.76, GAP = 0.82
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24301 , loss = 4.025729, Hit@1 = 0.82, PERR = 0.75, GAP = 0.83
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24302 , loss = 4.144448, Hit@1 = 0.84, PERR = 0.72, GAP = 0.82
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24303 , loss = 4.669774, Hit@1 = 0.88, PERR = 0.78, GAP = 0.82
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24304 , loss = 3.699670, Hit@1 = 0.93, PERR = 0.81, GAP = 0.86
[INFO: train_utils.py: 122]: [TRAIN] Epoch 0 training finished, average time: 0.519670748235
file path:/home/lvhaijun/dataset/youtube-8m/frame/pkl/train/train0000.pkl
Traceback (most recent call last):
  File "train.py", line 226, in <module>
    train(args)
  File "train.py", line 216, in train
    test_fetch_list = valid_fetch_list, test_metrics = valid_metrics)
  File "/home/lvhaijun/video_recognition/utils_project/paddle-model/models/fluid/PaddleCV/video/tools/train_utils.py", line 127, in train_with_pyreader
    test_metrics, log_interval)
  File "/home/lvhaijun/video_recognition/utils_project/paddle-model/models/fluid/PaddleCV/video/tools/train_utils.py", line 43, in test_with_pyreader
    test_outs = test_exe.run(fetch_list=test_fetch_list)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 192, in run
    return_numpy=return_numpy)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 543, in run
    return_numpy=return_numpy)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 444, in _run_parallel
    exe.run(fetch_var_names, fetch_var_name)
paddle.fluid.core.EnforceNotMet: Cannot find fetched variable(sigmoid_0.tmp_0).(Perhaps the main_program is not set to ParallelExecutor) at [/paddle/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc:142]
PaddlePaddle Call Stacks:
0 0x7f507444cf50p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 352
1 0x7f507444d2c9p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 137
2 0x7f5075d6707fp paddle::framework::details::ThreadedSSAGraphExecutor::InsertFetchOps(std::vector<std::string, std::allocator<std::string> > const&, std::vector<paddle::framework::details::FetchOpHandle*, std::allocator<paddle::framework::details::FetchOpHandle*> >*, std::unordered_set<paddle::framework::details::VarHandleBase*, std::hash<paddle::framework::details::VarHandleBase*>, std::equal_to<paddle::framework::details::VarHandleBase*>, std::allocator<paddle::framework::details::VarHandleBase*> >*, std::unordered_set<paddle::framework::details::OpHandleBase*, std::hash<paddle::framework::details::OpHandleBase*>, std::equal_to<paddle::framework::details::OpHandleBase*>, std::allocator<paddle::framework::details::OpHandleBase*> >*, std::unordered_map<paddle::framework::details::OpHandleBase*, unsigned long, std::hash<paddle::framework::details::OpHandleBase*>, std::equal_to<paddle::framework::details::OpHandleBase*>, std::allocator<std::pair<paddle::framework::details::OpHandleBase* const, unsigned long> > >*, std::unordered_set<paddle::framework::details::VarHandleBase*, std::hash<paddle::framework::details::VarHandleBase*>, std::equal_to<paddle::framework::details::VarHandleBase*>, std::allocator<paddle::framework::details::VarHandleBase*> >*, std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*) + 5119
3 0x7f5075d67632p paddle::framework::details::ThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&) + 1106
4 0x7f5075d5818ap paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&) + 394
5 0x7f5074596072p paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, std::string const&) + 562
6 0x7f507443cc7ep
7 0x7f507447362ep
8 0x4c5326p PyEval_EvalFrameEx + 37958
9 0x4b9b66p PyEval_EvalCodeEx + 774
10 0x4c1f56p PyEval_EvalFrameEx + 24694
11 0x4b9b66p PyEval_EvalCodeEx + 774
12 0x4c17c6p PyEval_EvalFrameEx + 22758
13 0x4b9b66p PyEval_EvalCodeEx + 774
14 0x4c17c6p PyEval_EvalFrameEx + 22758
15 0x4b9b66p PyEval_EvalCodeEx + 774
16 0x4c17c6p PyEval_EvalFrameEx + 22758
17 0x4b9b66p PyEval_EvalCodeEx + 774
18 0x4c17c6p PyEval_EvalFrameEx + 22758
19 0x4b9b66p PyEval_EvalCodeEx + 774
20 0x4c1f56p PyEval_EvalFrameEx + 24694
21 0x4b9b66p PyEval_EvalCodeEx + 774
22 0x4eb69fp
23 0x4e58f2p PyRun_FileExFlags + 130
24 0x4e41a6p PyRun_SimpleFileExFlags + 390
25 0x4938cep Py_Main + 1358
26 0x7f50f1e69830p __libc_start_main + 240
27 0x493299p _start + 41
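When triaging a failure like this, it can help to pull the missing variable name and the C++ source location out of the exception text, so the name can be compared against the variables in the validation program. A minimal sketch using only the standard library; the helper name `parse_fetch_error` is hypothetical, and the pattern assumes the exact message wording seen in this report (other Paddle versions may phrase it differently):

```python
import re

# Pattern written against the message format quoted in this report only.
_FETCH_ERR = re.compile(
    r"Cannot find fetched variable\((?P<var>[^)]+)\)"
    r".*?at \[(?P<file>[^:\]]+):(?P<line>\d+)\]",
    re.DOTALL,
)


def parse_fetch_error(message):
    """Return (variable, source_file, line), or None if the text does not match."""
    m = _FETCH_ERR.search(message)
    if m is None:
        return None
    return m.group("var"), m.group("file"), int(m.group("line"))


# The exception line from the log in this report.
msg = (
    "paddle.fluid.core.EnforceNotMet: Cannot find fetched variable(sigmoid_0.tmp_0)."
    "(Perhaps the main_program is not set to ParallelExecutor) at "
    "[/paddle/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc:142]"
)
print(parse_fetch_error(msg))
```

Here the extracted name is sigmoid_0.tmp_0, which is the variable the validation executor was asked to fetch but could not find in the program it was running.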

Reference: paddlepaddle/models#1936