video NeXtVLAD training error
Created by: Haijunlv
Training the NeXtVLAD video model fails with an error. It happens when the model runs validation after the first training epoch.
I tried using the same data for training and validation to check whether the data is broken, and found that the error always happens at the first-epoch validation.
I have not modified any of the main process code. I am just trying to train!
Hope someone can help me!
Environment: official Docker image paddlepaddle/paddle:latest-gpu-cuda9.0-cudnn7. GPU environment: 8 Tesla V100, driver version: 384.81.
I followed the YouTube-8M data preprocessing and then started training.
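One way to rule out broken preprocessed data is to try unpickling every .pkl file before training. This is only a minimal sketch: the helper name `find_unreadable_pickles` is hypothetical and not part of the PaddleCV video code, and it assumes Python 3 (on Python 2.7, drop the `encoding` argument). A clean result only shows that the files load, not that their contents are what the reader expects.

```python
import os
import pickle


def find_unreadable_pickles(directory):
    """Return the sorted names of .pkl files in `directory` that fail to unpickle."""
    bad = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".pkl"):
            continue  # skip anything that is not a pickle file
        path = os.path.join(directory, name)
        try:
            with open(path, "rb") as f:
                # encoding="bytes" lets Python 3 read pickles written by Python 2.
                pickle.load(f, encoding="bytes")
        except Exception:
            # Truncated or corrupted files end up here.
            bad.append(name)
    return bad
```

For example, calling it on the train directory from the log (/home/lvhaijun/dataset/youtube-8m/frame/pkl/train) and on the validation directory should return an empty list if every file loads.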
train script:
export CUDA_VISIBLE_DEVICES=4,5,6,7
python train.py --model-name="NEXTVLAD" --config=./configs/nextvlad.txt --epoch-num=6 --valid-interval=1 --log-interval=1
in nextvlad.txt:
num_gpus = 4
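Since the run depends on both CUDA_VISIBLE_DEVICES and the num_gpus setting, a quick check that the two agree can be sketched as below. The helper names are hypothetical, and the parser assumes plain `key = value` lines like the `num_gpus = 4` excerpt above; it returns every num_gpus value it finds, in case the config sets it in more than one place. A mismatch may or may not be related to the error reported here.

```python
import os


def visible_gpu_count(env=None):
    """Count the device ids in CUDA_VISIBLE_DEVICES; return None if it is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([d for d in value.split(",") if d.strip()])


def read_num_gpus(config_text):
    """Return every integer assigned to num_gpus in the given config text."""
    values = []
    for line in config_text.splitlines():
        key, sep, val = line.partition("=")
        if sep and key.strip() == "num_gpus":
            values.append(int(val.strip()))
    return values
```

With the settings in this report, `visible_gpu_count({"CUDA_VISIBLE_DEVICES": "4,5,6,7"})` gives 4, which matches `num_gpus = 4`.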
error log:
file path:/home/lvhaijun/dataset/youtube-8m/frame/pkl/train/train3581.pkl
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24299 , loss = 4.340643, Hit@1 = 0.88, PERR = 0.79, GAP = 0.84
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24300 , loss = 4.450137, Hit@1 = 0.86, PERR = 0.76, GAP = 0.82
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24301 , loss = 4.025729, Hit@1 = 0.82, PERR = 0.75, GAP = 0.83
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24302 , loss = 4.144448, Hit@1 = 0.84, PERR = 0.72, GAP = 0.82
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24303 , loss = 4.669774, Hit@1 = 0.88, PERR = 0.78, GAP = 0.82
[INFO: metrics_util.py: 68]: [TRAIN] Epoch 0, iter 24304 , loss = 3.699670, Hit@1 = 0.93, PERR = 0.81, GAP = 0.86
[INFO: train_utils.py: 122]: [TRAIN] Epoch 0 training finished, average time: 0.519670748235
file path:/home/lvhaijun/dataset/youtube-8m/frame/pkl/train/train0000.pkl
Traceback (most recent call last):
  File "train.py", line 226, in <module>
    train(args)
  File "train.py", line 216, in train
    test_fetch_list = valid_fetch_list, test_metrics = valid_metrics)
  File "/home/lvhaijun/video_recognition/utils_project/paddle-model/models/fluid/PaddleCV/video/tools/train_utils.py", line 127, in train_with_pyreader
    test_metrics, log_interval)
  File "/home/lvhaijun/video_recognition/utils_project/paddle-model/models/fluid/PaddleCV/video/tools/train_utils.py", line 43, in test_with_pyreader
    test_outs = test_exe.run(fetch_list=test_fetch_list)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 192, in run
    return_numpy=return_numpy)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 543, in run
    return_numpy=return_numpy)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 444, in _run_parallel
    exe.run(fetch_var_names, fetch_var_name)
paddle.fluid.core.EnforceNotMet: Cannot find fetched variable(sigmoid_0.tmp_0).(Perhaps the main_program is not set to ParallelExecutor) at [/paddle/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc:142]
PaddlePaddle Call Stacks:
0 0x7f507444cf50p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 352
1 0x7f507444d2c9p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 137
2 0x7f5075d6707fp paddle::framework::details::ThreadedSSAGraphExecutor::InsertFetchOps(std::vector<std::string, std::allocator<std::string> > const&, std::vector<paddle::framework::details::FetchOpHandle*, std::allocator<paddle::framework::details::FetchOpHandle*> >*, std::unordered_set<paddle::framework::details::VarHandleBase*, std::hash<paddle::framework::details::VarHandleBase*>, std::equal_to<paddle::framework::details::VarHandleBase*>, std::allocator<paddle::framework::details::VarHandleBase*> >*, std::unordered_set<paddle::framework::details::OpHandleBase*, std::hash<paddle::framework::details::OpHandleBase*>, std::equal_to<paddle::framework::details::OpHandleBase*>, std::allocator<paddle::framework::details::OpHandleBase*> >*, std::unordered_map<paddle::framework::details::OpHandleBase*, unsigned long, std::hash<paddle::framework::details::OpHandleBase*>, std::equal_to<paddle::framework::details::OpHandleBase*>, std::allocator<std::pair<paddle::framework::details::OpHandleBase* const, unsigned long> > >*, std::unordered_set<paddle::framework::details::VarHandleBase*, std::hash<paddle::framework::details::VarHandleBase*>, std::equal_to<paddle::framework::details::VarHandleBase*>, std::allocator<paddle::framework::details::VarHandleBase*> >*, std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*) + 5119
3 0x7f5075d67632p paddle::framework::details::ThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&) + 1106
4 0x7f5075d5818ap paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&) + 394
5 0x7f5074596072p paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, std::string const&) + 562
6 0x7f507443cc7ep
7 0x7f507447362ep
8 0x4c5326p PyEval_EvalFrameEx + 37958
9 0x4b9b66p PyEval_EvalCodeEx + 774
10 0x4c1f56p PyEval_EvalFrameEx + 24694
11 0x4b9b66p PyEval_EvalCodeEx + 774
12 0x4c17c6p PyEval_EvalFrameEx + 22758
13 0x4b9b66p PyEval_EvalCodeEx + 774
14 0x4c17c6p PyEval_EvalFrameEx + 22758
15 0x4b9b66p PyEval_EvalCodeEx + 774
16 0x4c17c6p PyEval_EvalFrameEx + 22758
17 0x4b9b66p PyEval_EvalCodeEx + 774
18 0x4c17c6p PyEval_EvalFrameEx + 22758
19 0x4b9b66p PyEval_EvalCodeEx + 774
20 0x4c1f56p PyEval_EvalFrameEx + 24694
21 0x4b9b66p PyEval_EvalCodeEx + 774
22 0x4eb69fp
23 0x4e58f2p PyRun_FileExFlags + 130
24 0x4e41a6p PyRun_SimpleFileExFlags + 390
25 0x4938cep Py_Main + 1358
26 0x7f50f1e69830p __libc_start_main + 240
27 0x493299p _start + 41