Core dump at training iteration 23060
Created by: nihuizhidao
After several hours of training, a core dump occurred. Error message:
/home/scc/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/train.py", line 326, in main()
  File "tools/train.py", line 236, in main
    outs = exe.run(compiled_train_prog, fetch_list=train_values)
  File "/home/scc/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/home/scc/anaconda3/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/scc/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/home/scc/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
    return_merged=return_merged)
  File "/home/scc/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2 void paddle::operators::GPUGather<float, int>(paddle::platform::DeviceContext const&, paddle::framework::Tensor const&, paddle::framework::Tensor const&, paddle::framework::Tensor*)
3 paddle::operators::CUDAGenerateProposalsKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
4 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDAGenerateProposalsKernel<paddle::platform::CUDADeviceContext, float> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
6 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
7 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
8 paddle::framework::details::ComputationOpHandle::RunImpl()
9 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
10 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue > const&, unsigned long*)
11 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
12 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
13 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const
Python Call Stacks (More useful to users):
  File "/home/scc/anaconda3/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/scc/anaconda3/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/home/scc/anaconda3/lib/python3.7/site-packages/paddle/fluid/layers/detection.py", line 2846, in generate_proposals
    'RpnRoisLod': rpn_rois_lod
  File "/home/scc/Projects/AIDetectionProjects/Jinghuajian/ppdet/core/workspace.py", line 150, in partial_apply
    return op(*args, **kwargs_)
  File "/home/scc/Projects/AIDetectionProjects/Jinghuajian/ppdet/modeling/anchor_heads/rpn_head.py", line 438, in _get_single_proposals
    variances=self.anchor_var)
  File "/home/scc/Projects/AIDetectionProjects/Jinghuajian/ppdet/modeling/anchor_heads/rpn_head.py", line 462, in get_proposals
    fpn_feat, im_info, lvl, mode)
  File "/home/scc/Projects/AIDetectionProjects/Jinghuajian/ppdet/modeling/architectures/faster_rcnn.py", line 100, in build
    rois = self.rpn_head.get_proposals(body_feats, im_info, mode=mode)
  File "/home/scc/Projects/AIDetectionProjects/Jinghuajian/ppdet/modeling/architectures/faster_rcnn.py", line 240, in train
    return self.build(feed_vars, 'train')
  File "tools/train.py", line 117, in main
    train_fetches = model.train(feed_vars)
  File "tools/train.py", line 326, in main()
Error Message Summary:
InvalidArgumentError: The index of gather_op should not be emptywhen the index's rank is 1.
  [Hint: Expected index.dims()[0] > 0, but received index.dims()[0]:0 <= 0:0.] at (/paddle/paddle/fluid/operators/gather.cu.h:83)
  [operator < generate_proposals > error]
terminate called without an active exception
W0716 19:09:20.089466 3925 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0716 19:09:20.089524 3925 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0716 19:09:20.089537 3925 init.cc:221] The detail failure signal is:
W0716 19:09:20.089586 3925 init.cc:224] *** Aborted at 1594897760 (unix time) try "date -d @1594897760" if you are using GNU date ***
W0716 19:09:20.095199 3925 init.cc:224] PC: @ 0x0 (unknown)
W0716 19:09:20.095360 3925 init.cc:224] *** SIGABRT (@0x3e800000dfc) received by PID 3580 (TID 0x7ff221b03700) from PID 3580; stack trace: ***
W0716 19:09:20.099287 3925 init.cc:224] @ 0x7ff2764f0630 (unknown)
W0716 19:09:20.103327 3925 init.cc:224] @ 0x7ff276149387 __GI_raise
W0716 19:09:20.107179 3925 init.cc:224] @ 0x7ff27614aa78 __GI_abort
W0716 19:09:20.133680 3925 init.cc:224] @ 0x7ff20281184a __gnu_cxx::__verbose_terminate_handler()
W0716 19:09:20.136224 3925 init.cc:224] @ 0x7ff20280ff47 __cxxabiv1::__terminate()
W0716 19:09:20.142647 3925 init.cc:224] @ 0x7ff20280ff7d std::terminate()
W0716 19:09:20.145336 3925 init.cc:224] @ 0x7ff20280fc5a __gxx_personality_v0
W0716 19:09:20.162426 3925 init.cc:224] @ 0x7ff26f238b97 _Unwind_ForcedUnwind_Phase2
W0716 19:09:20.168728 3925 init.cc:224] @ 0x7ff26f238e7d _Unwind_ForcedUnwind
W0716 19:09:20.172485 3925 init.cc:224] @ 0x7ff2764ef362 __GI___pthread_unwind
W0716 19:09:20.175529 3925 init.cc:224] @ 0x7ff2764e9ef7 __pthread_exit
W0716 19:09:20.176205 3925 init.cc:224] @ 0x5595c2d871c9 PyThread_exit_thread
W0716 19:09:20.176414 3925 init.cc:224] @ 0x5595c2c19cb1 PyEval_RestoreThread.cold.787
W0716 19:09:20.202080 3925 init.cc:224] @ 0x7ff1bffa2669 pybind11::gil_scoped_release::~gil_scoped_release()
W0716 19:09:20.211416 3925 init.cc:224] @ 0x7ff1c008ab75 ZZN8pybind1112cpp_function10initializeIZN6paddle6pybind10BindReaderEPNS_6moduleEEUlRNS2_9operators6reader40OrderedMultiDeviceLoDTensorBlockingQueueERKSt6vectorINS2_9framework9LoDTensorESaISC_EEE2_bIS9_SG_EINS_4nameENS_9is_methodENS_7siblingENS_10call_guardIINS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES11
W0716 19:09:20.218363 3925 init.cc:224] @ 0x7ff1bffbfe49 pybind11::cpp_function::dispatcher()
W0716 19:09:20.219130 3925 init.cc:224] @ 0x5595c2d11114 _PyMethodDef_RawFastCallKeywords
W0716 19:09:20.219820 3925 init.cc:224] @ 0x5595c2d11231 _PyCFunction_FastCallKeywords
W0716 19:09:20.220517 3925 init.cc:224] @ 0x5595c2d75e8f _PyEval_EvalFrameDefault
W0716 19:09:20.221141 3925 init.cc:224] @ 0x5595c2cca9da _PyEval_EvalCodeWithName
W0716 19:09:20.221791 3925 init.cc:224] @ 0x5595c2ccb805 _PyFunction_FastCallDict
W0716 19:09:20.222491 3925 init.cc:224] @ 0x5595c2d729f5 _PyEval_EvalFrameDefault
W0716 19:09:20.223073 3925 init.cc:224] @ 0x5595c2d1068b _PyFunction_FastCallKeywords
W0716 19:09:20.223747 3925 init.cc:224] @ 0x5595c2d71260 _PyEval_EvalFrameDefault
W0716 19:09:20.224359 3925 init.cc:224] @ 0x5595c2d1068b _PyFunction_FastCallKeywords
W0716 19:09:20.225055 3925 init.cc:224] @ 0x5595c2d71260 _PyEval_EvalFrameDefault
W0716 19:09:20.225721 3925 init.cc:224] @ 0x5595c2ccb73b _PyFunction_FastCallDict
W0716 19:09:20.226342 3925 init.cc:224] @ 0x5595c2ce6943 _PyObject_Call_Prepend
W0716 19:09:20.227012 3925 init.cc:224] @ 0x5595c2cd9b9e PyObject_Call
W0716 19:09:20.227315 3925 init.cc:224] @ 0x5595c2dc5af7 t_bootstrap
W0716 19:09:20.227457 3925 init.cc:224] @ 0x5595c2d82e18 pythread_wrapper
W0716 19:09:20.231935 3925 init.cc:224] @ 0x7ff2764e8ea5 start_thread
Aborted (core dumped)
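For readers of the log: the Error Message Summary says the gather inside generate_proposals received an index with zero entries. Below is a minimal pure-Python sketch of how that could happen. It is not Paddle's implementation; `filter_small_boxes`, `gather`, and the sample boxes are hypothetical, and the idea that candidate boxes were all filtered out (for example because they became NaN) is an assumption the log does not confirm.

```python
def filter_small_boxes(boxes, min_size):
    """Return indices of (x1, y1, x2, y2) boxes at least min_size wide and tall."""
    keep = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        # Every comparison involving NaN is False, so a NaN box is never kept.
        if (x2 - x1) >= min_size and (y2 - y1) >= min_size:
            keep.append(i)
    return keep


def gather(rows, index):
    """Select rows by index, rejecting an empty index like the check in the log."""
    if len(index) <= 0:
        raise ValueError("index should not be empty: expected len(index) > 0")
    return [rows[i] for i in index]


nan = float("nan")
healthy = [(0.0, 0.0, 32.0, 32.0), (5.0, 5.0, 5.5, 5.5)]   # hypothetical boxes
diverged = [(nan, nan, nan, nan), (nan, 0.0, nan, 16.0)]   # hypothetical NaN boxes

# One healthy box survives the size filter, so the gather succeeds.
print(gather(healthy, filter_small_boxes(healthy, 1.0)))

# No NaN box survives, so the index is empty and the gather is rejected.
try:
    gather(diverged, filter_small_boxes(diverged, 1.0))
except ValueError as err:
    print("empty index:", err)
```

If this assumption holds, the loss values logged just before iteration 23060 would be worth checking for NaN or inf.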
Partial configuration:
architecture: FasterRCNN
max_iters: 180000
snapshot_iter: 800
use_gpu: true
log_smooth_window: 20
save_dir: output
pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_vd_64x4d_pretrained.tar
weights: output/faster_rcnn_x101_vd_64x4d_fpn_1x/model_final
metric: COCO
num_classes: 5
I am training on my own dataset, fine-tuning from the pretrained weights (transfer learning).
The output does not seem to contain any obvious hint about the cause... Please help take a look; this is somewhat urgent.
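Because the data is custom and the config uses `metric: COCO`, one low-cost check is to scan the annotations for degenerate boxes. The sketch below assumes COCO-style annotations, where each `bbox` is `[x, y, width, height]`. The helper `find_bad_boxes` and the sample records are hypothetical, and nothing in the log shows that bad annotations are the actual cause.

```python
import math


def find_bad_boxes(annotations):
    """Return ids of annotations whose bbox is non-finite or has non-positive size."""
    bad = []
    for ann in annotations:
        x, y, w, h = ann["bbox"]
        finite = all(math.isfinite(v) for v in (x, y, w, h))
        if not finite or w <= 0 or h <= 0:
            bad.append(ann["id"])
    return bad


# Hypothetical records in COCO annotation layout, for illustration only.
sample = [
    {"id": 1, "bbox": [10.0, 20.0, 50.0, 40.0]},          # valid
    {"id": 2, "bbox": [10.0, 20.0, 0.0, 40.0]},           # zero width
    {"id": 3, "bbox": [10.0, 20.0, 50.0, -5.0]},          # negative height
    {"id": 4, "bbox": [float("nan"), 0.0, 10.0, 10.0]},   # NaN coordinate
]

print(find_bad_boxes(sample))
```

For a real dataset, the same function could be applied to the `annotations` list obtained by loading the annotation file with `json.load`.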