Cannot find fetched variable(im_id).
Created by: ellinyang
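Environment:
Linux 16.06 / NVIDIA P4 / CUDA9, cudnn7.0
Paddle version: 1.7.1_cuda9_cudnn7.0
PaddleDetection release/0.2
Reproduction:
When running faster_rcnn_r50_fpn_2x with PaddleDetection, both training and evaluation work with the default start() and reset() data reading flow.
For business reasons we need to iterate over the data with a for-range loop and feed it into the model, so I changed iterable in fluid.io.DataLoader to True. With this change training runs normally, but an error is raised when eval_run is called during training. I have confirmed that iterable is set to true in reader.yml and that the data output is normal. Could someone from the Paddle team please take a look?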
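For readers unfamiliar with the two reading styles mentioned above, here is a minimal sketch of how their control flow differs. It deliberately does not use Paddle: `StandInLoader`, `run`, `eval_non_iterable` and `eval_iterable` are hypothetical stand-ins written for illustration only, and the batch contents are made up. Only the loop shape (`start()` / `reset()` versus `for it, data in enumerate(loader())`) mirrors what the issue describes.

```python
class StandInLoader:
    """Hypothetical stand-in for a data loader; this is NOT the Paddle API."""

    def __init__(self, batches):
        self._batches = list(batches)
        self._cursor = None  # None means "not started"

    # --- non-iterable style: the runner pulls batches internally ---
    def start(self):
        self._cursor = 0

    def reset(self):
        self._cursor = None

    def _pull(self):
        if self._cursor is None:
            raise RuntimeError("loader not started")
        if self._cursor >= len(self._batches):
            raise EOFError  # signals that the data is exhausted
        batch = self._batches[self._cursor]
        self._cursor += 1
        return batch

    # --- iterable style: the caller receives each batch in Python ---
    def __call__(self):
        return iter(self._batches)


def run(feed=None, loader=None):
    """Stand-in for one executor step; returns the image ids of the batch."""
    batch = feed if feed is not None else loader._pull()
    return batch["im_id"]


def eval_non_iterable(loader):
    # start() / reset() flow: no feed is passed, the step reads the data itself.
    results = []
    loader.start()
    try:
        while True:
            results.append(run(loader=loader))
    except EOFError:
        loader.reset()
    return results


def eval_iterable(loader):
    # for-range flow: each batch is handed to the step as an explicit feed.
    results = []
    for it, data in enumerate(loader()):
        results.append(run(feed=data))
    return results


# Made-up batches, only to exercise both loops.
batches = [{"im_id": [0, 1]}, {"im_id": [2, 3]}]
```

In the for-range flow every batch passes through Python before the step runs, so fields such as im_id are already available in `data`. Whether reading im_id from the fed batch instead of fetching it avoids the error reported here is an untested assumption, not something the logs confirm.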
Changes in train.py:
Changes in eval_run:
Error log:
I0424 10:57:04.039811 34768 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0424 10:57:04.073230 34768 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0424 10:57:04.147104 34768 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0424 10:57:04.171646 34768 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
2020-04-24 10:57:05,668-INFO: iter: 0, lr: 0.006667, 'loss_cls': '3.527403', 'loss_bbox': '0.032837', 'loss_rpn_cls': '0.692826', 'loss_rpn_bbox': '0.102103', 'loss': '4.355169', time: 25.332, eta: 21:06:37
2020-04-24 10:57:25,907-INFO: iter: 20, lr: 0.007200, 'loss_cls': '1.405681', 'loss_bbox': '0.330584', 'loss_rpn_cls': '0.642447', 'loss_rpn_bbox': '0.052094', 'loss': '2.438134', time: 1.041, eta: 0:51:43
2020-04-24 10:57:45,843-INFO: iter: 40, lr: 0.007733, 'loss_cls': '1.382848', 'loss_bbox': '0.209869', 'loss_rpn_cls': '0.612721', 'loss_rpn_bbox': '0.086118', 'loss': '2.293122', time: 1.022, eta: 0:50:23
2020-04-24 10:57:55,730-INFO: Save model to /home/lj/data/objectDet/paddle-job-79090-0/work_dirs/faster_rcnn_r50_fpn_2x_fluid171_02_20200422_1/faster_rcnn_r50_fpn_2x_79090/50.
/home/lj/DLcode/PaddleDetection/tools/train_feeder.py(89)eval_run_feeder()
-> for it, data in enumerate(loader()):
(Pdb) c
I0424 10:58:15.984709 34768 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0424 10:58:15.994505 34768 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
/root/miniconda3/lib/python3.7/site-packages/paddle/fluid/executor.py:782: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/train_feeder.py", line 437, in <module>
    main()
  File "tools/train_feeder.py", line 358, in main
    resolution=resolution)
  File "tools/train_feeder.py", line 89, in eval_run_feeder
    for it, data in enumerate(loader()):
  File "/root/miniconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 783, in run
    six.reraise(*sys.exc_info())
  File "/root/miniconda3/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/root/miniconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 778, in run
    use_program_cache=use_program_cache)
  File "/root/miniconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 843, in _run_impl
    return_numpy=return_numpy)
  File "/root/miniconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 677, in _run_parallel
    tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2 paddle::framework::details::FastThreadedSSAGraphExecutor::InsertFetchOps(std::vector<std::string, std::allocatorstd::string > const&, std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >, std::unordered_map<std::string, std::vector<paddle::framework::details::VarHandleBase, std::allocatorpaddle::framework::details::VarHandleBase* >, std::hashstd::string, std::equal_tostd::string, std::allocator<std::pair<std::string const, std::vector<paddle::framework::details::VarHandleBase*, std::allocatorpaddle::framework::details::VarHandleBase* > > > >, std::unordered_map<paddle::framework::details::OpHandleBase, std::atomic, std::hashpaddle::framework::details::OpHandleBase*, std::equal_topaddle::framework::details::OpHandleBase*, std::allocator<std::pair<paddle::framework::details::OpHandleBase* const, std::atomic > > >, std::vector<paddle::framework::details::OpHandleBase, std::allocatorpaddle::framework::details::OpHandleBase* >, std::vector<paddle::framework::details::OpHandleBase, std::allocatorpaddle::framework::details::OpHandleBase* >*)
3 paddle::framework::details::FastThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocatorstd::string > const&)
4 paddle::framework::details::ScopeBufferedMonitor::Apply(std::function<void ()> const&, bool)
5 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocatorstd::string > const&)
6 paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocatorstd::string > const&)
Error Message Summary:
PreconditionNotMetError: Cannot find fetched variable(im_id). Perhaps the main_program is not set to ParallelExecutor. [Hint: Expected fetched_var_it != fetched_vars->end(), but received fetched_var_it == fetched_vars->end().] at (/paddle/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc:147)