[Blocker] CRNN-CTC crashes when the training is ran
Created by: Sand3r-
Description
CRNN-CTC from the newest models developo branch crashes during training when ran with newest paddle version from develop branch ().
Models commit hash: 4c6882ab55fbf9b03c405d2fce2b65a3ba90970f Paddle commit hash: 3300a532
Steps to reproduce
- Download the newest paddle version and build (potentially with MKLDNN support)
- Download the models repository
- In the main models directory run
source fluid/ocr_recognition/scripts/train.sh CPU
Actual behavior
The training results with the following output:
/home/mgallus/src/paddle/build/python/paddle/fluid/evaluator.py:69: Warning: The EditDistance is deprecated, because maintain a modified program inside evaluator cause bug easily, please use fluid.metrics.EditDistance instead.
% (self.__class__.__name__, self.__class__.__name__), Warning)
*** Aborted at 1533633833 (unix time) try "date -d @1533633833" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 45492 (TID 0x7fa4bf86d700) from PID 0; stack trace: ***
@ 0x7fa4bf460390 (unknown)
@ 0x7fa4a42ff6ac paddle::framework::make_dim<>()
@ 0x7fa4a42ff74f paddle::framework::make_ddim()
@ 0x7fa4a42ffb98 paddle::framework::make_ddim()
@ 0x7fa4a42ffc99 paddle::framework::make_ddim()
@ 0x7fa4a3933731 paddle::operators::trim_trailing_singular_dims()
@ 0x7fa4a3e9a1cc paddle::operators::ElementwiseComputeEx<>()
@ 0x7fa4a3e90a16 paddle::operators::CompareOpKernel<>::Compute()
@ 0x7fa4a3e89fe5 _ZZNK6paddle9framework24OpKernelRegistrarFunctorINS_8platform8CPUPlaceELb0ELm2EINS_9operators15CompareOpKernelINS2_16CPUDeviceContextENS4_12EqualFunctorIiEEEENS5_IS6_NS7_IlEEEENS5_IS6_NS7_IfEEEENS5_IS6_NS7_IdEEEEEEclEPKcSI_ENKUlRKNS0_16ExecutionContextEE_clESL_
@ 0x7fa4a3ea4fe3 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform8CPUPlaceELb0ELm2EINS0_9operators15CompareOpKernelINS7_16CPUDeviceContextENS9_12EqualFunctorIiEEEENSA_ISB_NSC_IlEEEENSA_ISB_NSC_IfEEEENSA_ISB_NSC_IdEEEEEEclEPKcSN_EUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
@ 0x7fa4a4293087 std::function<>::operator()()
@ 0x7fa4a428e8fc paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7fa4a428b040 paddle::framework::OperatorBase::Run()
@ 0x7fa4a35c2871 paddle::framework::Executor::RunPreparedContext()
@ 0x7fa4a35bf27e paddle::framework::Executor::Run()
@ 0x7fa4a3401ac1 _ZZN6paddle6pybindL13pybind11_initEvENKUlRNS_9framework8ExecutorERKNS1_11ProgramDescEPNS1_5ScopeEibbE63_clES3_S6_S8_ibb
@ 0x7fa4a3427f04 _ZN8pybind116detail15argument_loaderIJRN6paddle9framework8ExecutorERKNS3_11ProgramDescEPNS3_5ScopeEibbEE9call_implIvRZNS2_6pybindL13pybind11_initEvEUlS5_S8_SA_ibbE63_JLm0ELm1ELm2ELm3ELm4ELm5EEEET_OT0_NS0_14index_sequenceIJXspT1_EEEE
@ 0x7fa4a342553f _ZN8pybind116detail15argument_loaderIJRN6paddle9framework8ExecutorERKNS3_11ProgramDescEPNS3_5ScopeEibbEE4callIvRZNS2_6pybindL13pybind11_initEvEUlS5_S8_SA_ibbE63_EENSt9enable_ifIXsrSt7is_voidIT_E5valueENS0_9void_typeEE4typeEOT0_
@ 0x7fa4a341f37d _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework8ExecutorERKNS4_11ProgramDescEPNS4_5ScopeEibbE63_vJS6_S9_SB_ibbEJNS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENKUlRNS_6detail13function_callEE1_clEST_
@ 0x7fa4a341f42b _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework8ExecutorERKNS4_11ProgramDescEPNS4_5ScopeEibbE63_vJS6_S9_SB_ibbEJNS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNEST_
@ 0x7fa4a3439e32 pybind11::cpp_function::dispatcher()
@ 0x4c37ed PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c1e6f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4c136f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4eb30f (unknown)
@ 0x4e5422 PyRun_FileExFlags
./train.sh: line 54: 45492 Segmentation fault (core dumped) python ../ctc_train.py --use_gpu $use_gpu --parallel $parallel --batch_size $batch_size --save_model_period 1 --total_step 1 --save_model_dir $save_model_dir
Expected behavior
The training should run just fine.
Additional notes
In the case of MKLDNN output, I have investigated that the error occurs upon calling ElementwiseOp
of operation equal
on fill_constant and editdistance ops. The fill_constant has had its dimensions set to 0, which caused trim_trailing_singular_dims() to yield error upon creating a dim with make_ddim().
However, since there are different errors depending on whether CPU or MKLDNN versions are used, I guess the problem lays somewhere deeper.