SENet154-vd-FPN Cascade Mask: error when reading data, help wanted
Created by: KK-Jiang
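Hi all: training on AI Studio fails with [operator < read > error]: Blocking queue is killed because the data reader raises an exception.
Version and environment: 1) PaddlePaddle version: 1.8.0 2) System/GPU: AI Studio, V100 3) PaddleDetection 0.3
Training information: 1) single GPU 2) 16 GB 3) the error is [operator < read > error]
Reproduction: I used the official PaddleDetection-release-0.3 with the config file cascade_mask_rcnn_dcnv2_se154_vd_fpn_gn_s1x.yml, changed only basic settings (class_num, batch_size, data paths, lr schedule, and so on), and then started training directly, which produced the error. I also tried training on my Windows machine; after finally getting the environment installed, I got the same error.
Problem description: the error log is below. None of the following attempts fixed it: gradually reducing worker_num (any non-zero value still fails; with 0, the run prints normal logs and then hangs, even after more than ten hours); reducing capacity in DataLoader.from_generator; setting use_double_buffer to False; setting iterable to True (default False).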
Detailed error:
2020-05-26 11:24:35,963-INFO: places would be ommited when DataLoader is not iterable
2020-05-26 11:24:39,118-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-4] exits for reason[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-5] exits for reason[consumer[consumer-c14-4] exits for reason[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]]]
2020-05-26 11:24:39,119-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-6] exits for reason[consumer[consumer-c14-5] exits for reason[consumer[consumer-c14-4] exits for reason[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]]]]
2020-05-26 11:24:39,120-WARNING: recv endsignal from outq with errmsg[consumer[consumer-c14-7] exits for reason[consumer[consumer-c14-6] exits for reason[consumer[consumer-c14-5] exits for reason[consumer[consumer-c14-4] exits for reason[consumer[consumer-c14-3] exits for reason[consumer[consumer-c14-2] exits for reason[consumer[consumer-c14-1] exits for reason[consumer[consumer-c14-0] exits for reason[producer[producer-c14] failed with error: cannot reshape array of size 1 into shape (2)]]]]]]]]]
2020-05-26 11:24:39,120-WARNING: Your reader has raised an exception!
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1156, in thread_main
    six.reraise(*sys.exc_info())
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1136, in thread_main
    for tensors in self._tensor_reader():
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1206, in tensor_reader_impl
    for slots in paddle_reader():
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/data_feeder.py", line 505, in reader_creator
    for item in reader():
  File "/home/aistudio/PaddleDetection-release-0.3/ppdet/data/reader.py", line 421, in _reader
    reader.reset()
  File "/home/aistudio/PaddleDetection-release-0.3/ppdet/data/parallel_map.py", line 259, in reset
    assert not self._exit, "cannot reset for already stopped dataset"
AssertionError: cannot reset for already stopped dataset
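The producer error in the log above, `cannot reshape array of size 1 into shape (2)`, looks like what NumPy raises when a one-element array is reshaped into coordinate pairs. One possible cause (an assumption, not something the log confirms) is a malformed polygon in the `segmentation` field of the COCO-style annotation file. Below is a minimal sketch for scanning the annotations; `find_bad_polygons`, `scan_file`, and the `min_coords` threshold are hypothetical names chosen for illustration and are not part of PaddleDetection:

```python
import json


def find_bad_polygons(coco, min_coords=6):
    """Return (annotation id, polygon index, coordinate count) for suspicious polygons.

    Assumes the COCO layout: coco["annotations"][i]["segmentation"] is either
    a list of flat [x1, y1, x2, y2, ...] polygons or an RLE dict.
    """
    bad = []
    for ann in coco.get("annotations", []):
        seg = ann.get("segmentation")
        # RLE masks are dicts; only polygon lists are checked here.
        if not isinstance(seg, list):
            continue
        for i, poly in enumerate(seg):
            is_seq = isinstance(poly, (list, tuple))
            # A bare number in place of a coordinate list counts as one value.
            n = len(poly) if is_seq else 1
            # A valid polygon needs at least 3 points and an even number of values.
            if not is_seq or n < min_coords or n % 2 != 0:
                bad.append((ann.get("id"), i, n))
    return bad


def scan_file(path):
    """Load an annotation JSON file and report its suspicious polygons."""
    with open(path) as f:
        return find_bad_polygons(json.load(f))
```

Calling `scan_file` on the training annotation file lists the annotation ids worth inspecting by hand. An empty result would suggest the reshape failure comes from somewhere else in the reader pipeline.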
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
loading annotations into memory...
Done (t=12.03s)
creating index...
index created!
Traceback (most recent call last):
  File "PaddleDetection-release-0.3/tools/train.py", line 366, in <module>
    main()
  File "PaddleDetection-release-0.3/tools/train.py", line 239, in main
    outs = exe.run(compiled_train_prog, fetch_list=train_values)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
    return_merged=return_merged)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2 paddle::operators::reader::BlockingQueue<std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> > >::Receive(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >)
3 paddle::operators::reader::PyReader::ReadNext(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >)
4 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, unsigned long> >::_M_invoke(std::_Any_data const&)
5 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
6 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const
Python Call Stacks (More useful to users):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
    attrs=kwargs.get("attrs", None))
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1078, in _init_non_iterable
    attrs={'drop_last': self._drop_last})
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 976, in __init__
    self._init_non_iterable()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 608, in from_generator
    iterable, return_list, drop_last)
  File "/home/aistudio/PaddleDetection-release-0.3/ppdet/modeling/architectures/cascade_mask_rcnn.py", line 426, in build_inputs
    iterable=iterable) if use_dataloader else None
  File "PaddleDetection-release-0.3/tools/train.py", line 112, in main
    feed_vars, train_loader = model.build_inputs(**inputs_def)
  File "PaddleDetection-release-0.3/tools/train.py", line 366, in <module>
    main()
Error Message Summary:
Error: Blocking queue is killed because the data reader raises an exception
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] at (/paddle/paddle/fluid/operators/reader/blocking_queue.h:141)
  [operator < read > error]
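The summary above only reports that the blocking queue was killed; the underlying failure is the producer error buried inside the nested `exits for reason[...]` warnings earlier in the log. Below is a minimal sketch for pulling the distinct producer errors out of such a log. The regular expression is fitted to the message format seen in this log only, and `root_causes` is a hypothetical helper name, not a PaddleDetection function:

```python
import re

# Matches "producer[<name>] failed with error: <message>]" as it appears in
# the warnings above. The message is assumed not to contain "]" itself;
# if it does, the captured text is cut at the first "]".
PRODUCER_ERROR = re.compile(r"producer\[[^\]]+\] failed with error: ([^\]]+)\]")


def root_causes(log_text):
    """Return the distinct producer error messages, in order of first appearance."""
    seen = []
    for match in PRODUCER_ERROR.finditer(log_text):
        message = match.group(1).strip()
        if message not in seen:
            seen.append(message)
    return seen
```

Applied to the log in this report, every nested warning collapses to the same single message about the failed reshape, which suggests the data itself is the place to look rather than the DataLoader settings.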