Floating point exception (core dumped): PaddlePaddle reports an error when running on large datasets (e.g. the AISHELL data)
Closed
Created by: xikunlun001
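Running my own test set of a few dozen short audio clips works fine, but running on a large dataset always fails with the following error: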
................
*** Aborted at 1522777380 (unix time) try "date -d @1522777380" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGFPE (@0x7f0b7f3cc7a4) received by PID 29812 (TID 0x7f0bb86c0700) from PID 2134689700; stack trace: ***
@ 0x7f0bb82d0390 (unknown)
@ 0x7f0b7f3cc7a4 paddle::GpuVectorT<>::getAbsMax()
@ 0x7f0b7f74ddef paddle::OptimizerWithGradientClipping::update()
@ 0x7f0b7f739e1e paddle::SgdThreadUpdater::updateImpl()
@ 0x7f0b7f5d67a1 ParameterUpdater::update()
@ 0x7f0b7f07fff7 _wrap_ParameterUpdater_update
@ 0x4bc3fa PyEval_EvalFrameEx
@ 0x4c136f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c1e6f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c1e6f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4eb30f (unknown)
@ 0x4e5422 PyRun_FileExFlags
@ 0x4e3cd6 PyRun_SimpleFileExFlags
@ 0x493ae2 Py_Main
@ 0x7f0bb7f15830 __libc_start_main
@ 0x4933e9 _start
@ 0x0 (unknown)
Floating point exception (core dumped)
Created by: fancyerii
Also, the cost keeps growing during training. Is this normal? (I haven't read the code, so I'm not sure whether cost means the loss.)
I0531 12:00:22.599843 26564 GradientMachine.cpp:101] Init parameters done.
...................................................................................................
Pass: 0, Batch: 100, TrainCost: 49.042408
...................................................................................................
Pass: 0, Batch: 200, TrainCost: 52.070030
...................................................................................................
Pass: 0, Batch: 300, TrainCost: 54.210860
...................................................................................................
Pass: 0, Batch: 400, TrainCost: 57.374444
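A quick way to confirm that the cost really is rising batch after batch, rather than eyeballing the log, is to parse the `TrainCost` values. The following is only a minimal sketch: it assumes the `Pass: ..., Batch: ..., TrainCost: ...` format shown in the log above, and the helper names `parse_train_cost` and `is_steadily_increasing` are made up for illustration. It does not diagnose the SIGFPE itself.

```python
import re

# Sample lines copied from the training log above (progress dots omitted).
LOG = """\
Pass: 0, Batch: 100, TrainCost: 49.042408
Pass: 0, Batch: 200, TrainCost: 52.070030
Pass: 0, Batch: 300, TrainCost: 54.210860
Pass: 0, Batch: 400, TrainCost: 57.374444
"""

# Assumed log format: "Pass: <int>, Batch: <int>, TrainCost: <float>".
PATTERN = re.compile(r"Pass: (\d+), Batch: (\d+), TrainCost: (\d+(?:\.\d+)?)")


def parse_train_cost(text):
    """Return a list of (pass_id, batch_id, cost) tuples found in the log text."""
    return [(int(p), int(b), float(c)) for p, b, c in PATTERN.findall(text)]


def is_steadily_increasing(costs):
    """True if there are at least two values and each one exceeds the previous."""
    return len(costs) > 1 and all(b > a for a, b in zip(costs, costs[1:]))


records = parse_train_cost(LOG)
costs = [cost for _, _, cost in records]
print("costs:", costs)
print("steadily increasing:", is_steadily_increasing(costs))
```

If the check reports a steady increase over many batches, the training run is likely diverging, and it may be worth looking at the learning rate and the input data before the crash point.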
Created by: shoegazerstella
hey @xikunlun001, how did you solve this? I am also having this problem during training after 91 epochs.
--------Time: 43.372358 sec, epoch: 91, train loss: 26.726669, test loss: 51.427845
W1203 01:06:31.706848 3752 init.cc:205] *** Aborted at 1575331591 (unix time) try "date -d @1575331591" if you are using GNU date ***
W1203 01:06:31.709036 3752 init.cc:205] PC: @ 0x0 (unknown)
W1203 01:06:31.713179 3752 init.cc:205] *** SIGFPE (@0x7f499a1ed378) received by PID 3682 (TID 0x7f4580ff9700) from PID 18446744072000295800; stack trace: ***
W1203 01:06:31.715488 3752 init.cc:205] @ 0x7f4a01cbf390 (unknown)
W1203 01:06:31.728520 3752 init.cc:205] @ 0x7f499a1ed378 paddle::framework::Scope::FindVarLocally()
W1203 01:06:31.732019 3752 init.cc:205] @ 0x7f499a1edcae paddle::framework::Scope::VarInternal()
W1203 01:06:31.735564 3752 init.cc:205] @ 0x7f499a1ede2d paddle::framework::Scope::Var()
W1203 01:06:31.740332 3752 init.cc:205] @ 0x7f499833d455 paddle::operators::RecurrentGradOp::RunImpl()
W1203 01:06:31.743172 3752 init.cc:205] @ 0x7f499a17f99c paddle::framework::OperatorBase::Run()
W1203 01:06:31.745193 3752 init.cc:205] @ 0x7f4999f66dbd _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
W1203 01:06:31.748976 3752 init.cc:205] @ 0x7f4999f66635 paddle::framework::details::OpHandleBase::RunAndRecordEvent()
W1203 01:06:31.754258 3752 init.cc:205] @ 0x7f4999f698db paddle::framework::details::ComputationOpHandle::RunImpl()
W1203 01:06:31.759212 3752 init.cc:205] @ 0x7f4999f23f36 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W1203 01:06:31.764627 3752 init.cc:205] @ 0x7f4999f22c7f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W1203 01:06:31.766867 3752 init.cc:205] @ 0x7f4999f22f44 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W1203 01:06:31.773646 3752 init.cc:205] @ 0x7f4997b20ef3 std::_Function_handler<>::_M_invoke()
W1203 01:06:31.779834 3752 init.cc:205] @ 0x7f499797b6e7 std::__future_base::_State_base::_M_do_set()
W1203 01:06:31.781775 3752 init.cc:205] @ 0x7f4a01cbca99 __pthread_once_slow
W1203 01:06:31.783601 3752 init.cc:205] @ 0x7f4999f1e722 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W1203 01:06:31.789702 3752 init.cc:205] @ 0x7f499797cc64 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W1203 01:06:31.792124 3752 init.cc:205] @ 0x7f49c51c8421 execute_native_thread_routine_compat
W1203 01:06:31.793973 3752 init.cc:205] @ 0x7f4a01cb56ba start_thread
W1203 01:06:31.795827 3752 init.cc:205] @ 0x7f4a012db41d clone
W1203 01:06:31.797710 3752 init.cc:205] @ 0x0 (unknown)
Floating point exception (core dumped)