PPyolo core dump when training on a custom dataset (#1257) · Issue · PaddlePaddle / PaddleDetection

PPyolo core dump when training on a custom dataset

Created by: nihuizhidao

Environment: CentOS 7, CUDA 10.2, training on two GPUs, paddlepaddle 1.8.4.post107, PaddleDetection release 0.4.

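While fine-tuning on my own dataset (already converted to COCO format), the following error occurs when training reaches the evaluation step (at iteration 1000):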

2020-08-19 16:59:40,701-INFO: Save model to output/ppyolo/1000.
/home/xxx/anaconda3/envs/pp184/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/train.py", line 368, in <module>
    main()
  File "tools/train.py", line 286, in main
    resolution=resolution)
  File "/home/xxx/Users/xxx/xxx/PaddleDetection-release-0.4/ppdet/utils/eval_utils.py", line 129, in eval_run
    return_numpy=False)
  File "/home/xxx/anaconda3/envs/pp184/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/home/xxx/anaconda3/envs/pp184/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/xxx/anaconda3/envs/pp184/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/home/xxx/anaconda3/envs/pp184/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
    return_merged=return_merged)
  File "/home/xxx/anaconda3/envs/pp184/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::CublasHandleHolder::CublasHandleHolder(CUstream_st*, cublasMath_t)
3 paddle::platform::CUDAContext::CUDAContext(paddle::platform::CUDAPlace const&, paddle::platform::stream::Priority const&)
4 paddle::platform::CUDADeviceContext::CUDADeviceContext(paddle::platform::CUDAPlace)
5 std::_Function_handler<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > (), std::reference_wrapper<std::_Bind_simple<paddle::platform::EmplaceDeviceContext<paddle::platform::CUDADeviceContext, paddle::platform::CUDAPlace>(std::map<paddle::platform::Place, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::less<paddle::platform::Place>, std::allocator<std::pair<paddle::platform::Place const, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > > > >, paddle::platform::Place)::{lambda()#1} ()> > >::_M_invoke(std::_Any_data const&)
6 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::__future_base::_Result_base::_Deleter>, std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > >::_M_invoke(std::_Any_data const&)
7 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
8 std::__future_base::_Deferred_state<std::_Bind_simple<paddle::platform::EmplaceDeviceContext<paddle::platform::CUDADeviceContext, paddle::platform::CUDAPlace>(std::map<paddle::platform::Place, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::less<paddle::platform::Place>, std::allocator<std::pair<paddle::platform::Place const, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > > > >, paddle::platform::Place)::{lambda()#1} ()>, std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >::_M_run_deferred()
9 paddle::platform::DeviceContextPool::Get(paddle::platform::Place const&)
10 paddle::framework::details::FastThreadedSSAGraphExecutor::InsertFetchOps(std::vector<std::string, std::allocatorstd::string > const&, boost::variant<std::vector<boost::variant<paddle::framework::LoDTensor, std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::framework::LoDTensor, std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > >, std::vector<std::vector<boost::variant<paddle::framework::LoDTensor, std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::framework::LoDTensor, std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > >, std::allocator<std::vector<boost::variant<paddle::framework::LoDTensor, std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >, boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::framework::LoDTensor, std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > > >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::unordered_map<std::string, std::vector<paddle::framework::details::VarHandleBase, std::allocatorpaddle::framework::details::VarHandleBase* >, std::hashstd::string, std::equal_tostd::string, 
std::allocator<std::pair<std::string const, std::vector<paddle::framework::details::VarHandleBase*, std::allocatorpaddle::framework::details::VarHandleBase* > > > >, std::unordered_map<paddle::framework::details::OpHandleBase, std::atomic, std::hashpaddle::framework::details::OpHandleBase*, std::equal_topaddle::framework::details::OpHandleBase*, std::allocator<std::pair<paddle::framework::details::OpHandleBase* const, std::atomic > > >, std::vector<paddle::framework::details::OpHandleBase, std::allocatorpaddle::framework::details::OpHandleBase* >, std::vector<paddle::framework::details::OpHandleBase, std::allocatorpaddle::framework::details::OpHandleBase* >*, bool)
11 paddle::framework::details::FastThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)
12 paddle::framework::details::ScopeBufferedMonitor::Apply(std::function<void ()> const&, bool)
13 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)
14 paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)

Error Message Summary:

ExternalError: Cublas error, CUBLAS_STATUS_ALLOC_FAILED at (/paddle/paddle/fluid/platform/cuda_helper.h:81)

terminate called without an active exception
W0819 16:59:47.753170 32004 init.cc:226] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0819 16:59:47.753242 32004 init.cc:228] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0819 16:59:47.753255 32004 init.cc:231] The detail failure signal is:

W0819 16:59:47.753284 32004 init.cc:234] *** Aborted at 1597827587 (unix time) try "date -d @1597827587" if you are using GNU date ***
W0819 16:59:47.757735 32004 init.cc:234] PC: @ 0x0 (unknown)
W0819 16:59:47.757876 32004 init.cc:234] *** SIGABRT (@0x3e8000059a7) received by PID 22951 (TID 0x7f6071fff700) from PID 22951; stack trace: ***
W0819 16:59:47.760473 32004 init.cc:234] @ 0x7f6577ecb630 (unknown)
W0819 16:59:47.763413 32004 init.cc:234] @ 0x7f6577b24387 __GI_raise
W0819 16:59:47.766125 32004 init.cc:234] @ 0x7f6577b25a78 __GI_abort
W0819 16:59:47.785084 32004 init.cc:234] @ 0x7f65450ce84a __gnu_cxx::__verbose_terminate_handler()
W0819 16:59:47.787011 32004 init.cc:234] @ 0x7f65450ccf47 __cxxabiv1::__terminate()
W0819 16:59:47.789022 32004 init.cc:234] @ 0x7f65450ccf7d std::terminate()
W0819 16:59:47.790805 32004 init.cc:234] @ 0x7f65450ccc5a __gxx_personality_v0
W0819 16:59:47.797116 32004 init.cc:234] @ 0x7f657812db97 _Unwind_ForcedUnwind_Phase2
W0819 16:59:47.800135 32004 init.cc:234] @ 0x7f657812de7d _Unwind_ForcedUnwind
W0819 16:59:47.802764 32004 init.cc:234] @ 0x7f6577eca362 __GI___pthread_unwind
W0819 16:59:47.805341 32004 init.cc:234] @ 0x7f6577ec4ef7 __pthread_exit
W0819 16:59:47.850818 32004 init.cc:234] @ 0x561b9fe731c9 PyThread_exit_thread
W0819 16:59:47.851243 32004 init.cc:234] @ 0x561b9fd05cb1 PyEval_RestoreThread.cold.787
W0819 16:59:47.851969 32004 init.cc:234] @ 0x7f650e8435d5 (unknown)
W0819 16:59:47.853279 32004 init.cc:234] @ 0x561b9fdfd114 _PyMethodDef_RawFastCallKeywords
W0819 16:59:47.854166 32004 init.cc:234] @ 0x561b9fdfd231 _PyCFunction_FastCallKeywords
W0819 16:59:47.854974 32004 init.cc:234] @ 0x561b9fe61a5d _PyEval_EvalFrameDefault
W0819 16:59:47.855684 32004 init.cc:234] @ 0x561b9fdb66f9 _PyEval_EvalCodeWithName
W0819 16:59:47.856374 32004 init.cc:234] @ 0x561b9fdb7805 _PyFunction_FastCallDict
W0819 16:59:47.857045 32004 init.cc:234] @ 0x561b9fdd2943 _PyObject_Call_Prepend
W0819 16:59:47.857429 32004 init.cc:234] @ 0x561b9fe1112a slot_tp_call
W0819 16:59:47.858115 32004 init.cc:234] @ 0x561b9fe1218b _PyObject_FastCallKeywords
W0819 16:59:47.858860 32004 init.cc:234] @ 0x561b9fe61626 _PyEval_EvalFrameDefault
W0819 16:59:47.859539 32004 init.cc:234] @ 0x561b9fdb773b _PyFunction_FastCallDict
W0819 16:59:47.860206 32004 init.cc:234] @ 0x561b9fdd2943 _PyObject_Call_Prepend
W0819 16:59:47.860591 32004 init.cc:234] @ 0x561b9fe1112a slot_tp_call
W0819 16:59:47.861285 32004 init.cc:234] @ 0x561b9fe1218b _PyObject_FastCallKeywords
W0819 16:59:47.862025 32004 init.cc:234] @ 0x561b9fe61e8f _PyEval_EvalFrameDefault
W0819 16:59:47.862701 32004 init.cc:234] @ 0x561b9fdb66f9 _PyEval_EvalCodeWithName
W0819 16:59:47.863380 32004 init.cc:234] @ 0x561b9fdb7805 _PyFunction_FastCallDict
W0819 16:59:47.864060 32004 init.cc:234] @ 0x561b9fdd2943 _PyObject_Call_Prepend
W0819 16:59:47.864805 32004 init.cc:234] @ 0x561b9fdc5b9e PyObject_Call
Aborted (core dumped)

What causes this error? Is the GPU running out of memory? Thanks.
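One way to look into the memory question: the cuBLAS documentation describes CUBLAS_STATUS_ALLOC_FAILED as a resource allocation failure inside the library, usually caused by a failed cudaMalloc(), so checking per-GPU memory usage around the evaluation step is a reasonable first check. It is not a confirmed diagnosis for this issue. The sketch below is not part of PaddleDetection; the helper names (parse_gpu_memory, query_gpu_memory, report) are hypothetical, and it assumes nvidia-smi is on the PATH of the training machine.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def report(rows):
    """Format one human-readable line per GPU, including free memory."""
    return [
        "GPU {}: {} / {} MiB used, {} MiB free".format(i, used, total, total - used)
        for i, used, total in rows
    ]
```

Printing the lines returned by report(query_gpu_memory()) from a second terminal shortly before iteration 1000 would show whether either GPU is already close to full when evaluation starts.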
