Debug mode UT fail
Created by: tensor-tang
最近在debug模式下 跑element wise add 发现一个core dump,在release下居然是跑过的。
最简单的复现方式是 CPU only的情况下,跑一下这个ut。
cmake .. -DCMAKE_BUILD_TYPE=DEBUG -DWITH_GPU=OFF -DWITH_FLUID_ONLY=ON -DWITH_TESTING=ON -DWITH_MKL=ON -DWITH_MKLDNN=OFF -DWITH_CONTRIB=OFF -DCMAKE_INSTALL_PREFIX=./tmpInstall
make -j
make test ARGS="-R test_elementwise_add_op -V"
我得到的结果是:
318: *** SIGSEGV (@0x0) received by PID 27390 (TID 0x7f5cd843b700) from PID 0; stack trace: ***
318: @ 0x318b20f500 (unknown)
318: @ 0x7f5c9cd45057 paddle::framework::make_dim<>()
318: @ 0x7f5c9cd450d7 paddle::framework::make_ddim()
318: @ 0x7f5c9cd454cc paddle::framework::make_ddim()
318: @ 0x7f5c9cd455a4 paddle::framework::make_ddim()
318: @ 0x7f5c9c3731af paddle::operators::trim_trailing_singular_dims()
318: @ 0x7f5c9c4d783f paddle::operators::ElementwiseComputeEx<>()
318: @ 0x7f5c9c727d10 paddle::operators::default_elementwise_add<>()
318: @ 0x7f5c9c726c1f paddle::operators::ElementwiseAddKernel<>::Compute()
318: @ 0x7f5c9c72671c _ZZNK6paddle9framework24OpKernelRegistrarFunctorINS_8platform8CPUPlaceELb0ELm0EINS_9operators20ElementwiseAddKernelINS2_16CPUDeviceContextEfEENS5_IS6_dEENS5_IS6_iEENS5_IS6_lEEEEclEPKcSD_ENKUlRKNS0_16ExecutionContextEE_clESG_
318: @ 0x7f5c9c72aa1e _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform8CPUPlaceELb0ELm0EJNS0_9operators20ElementwiseAddKernelINS7_16CPUDeviceContextEfEENSA_ISB_dEENSA_ISB_iEENSA_ISB_lEEEEclEPKcSI_EUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
318: @ 0x7f5c9ccec15b std::function<>::operator()()
318: @ 0x7f5c9cce7f8c paddle::framework::OperatorWithKernel::RunImpl()
318: @ 0x7f5c9cce4ba8 paddle::framework::OperatorBase::Run()
318: @ 0x7f5c9bfb7eb5 _ZZN6paddle6pybindL13pybind11_initEvENKUlRNS_9framework12OperatorBaseERKNS1_5ScopeERKNS_8platform8CPUPlaceEE55_clES3_S6_SA_
318: @ 0x7f5c9bfda6dd _ZN8pybind116detail15argument_loaderIIRN6paddle9framework12OperatorBaseERKNS3_5ScopeERKNS2_8platform8CPUPlaceEEE9call_implIvRZNS2_6pybindL13pybind11_initEvEUlS5_S8_SC_E55_ILm0ELm1ELm2EEEET_OT0_NS0_14index_sequenceIIXspT1_EEEE
318: @ 0x7f5c9bfd8418 _ZN8pybind116detail15argument_loaderIIRN6paddle9framework12OperatorBaseERKNS3_5ScopeERKNS2_8platform8CPUPlaceEEE4callIvRZNS2_6pybindL13pybind11_initEvEUlS5_S8_SC_E55_EENSt9enable_ifIXsrSt7is_voidIT_E5valueENS0_9void_typeEE4typeEOT0_
318: @ 0x7f5c9bfd19f2 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework12OperatorBaseERKNS4_5ScopeERKNS2_8platform8CPUPlaceEE55_vIS6_S9_SD_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENKUlRNS_6detail13function_callEE1_clESV_
318: @ 0x7f5c9bfd1a85 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework12OperatorBaseERKNS4_5ScopeERKNS2_8platform8CPUPlaceEE55_vIS6_S9_SD_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESV_
318: @ 0x7f5c9bfea48b pybind11::cpp_function::dispatcher()
318: @ 0x7f5cd85363e4 PyEval_EvalFrameEx
318: @ 0x7f5cd8537130 PyEval_EvalCodeEx
318: @ 0x7f5cd85354a1 PyEval_EvalFrameEx
318: @ 0x7f5cd8537130 PyEval_EvalCodeEx
318: @ 0x7f5cd85354a1 PyEval_EvalFrameEx
318: @ 0x7f5cd8537130 PyEval_EvalCodeEx
318: @ 0x7f5cd85354a1 PyEval_EvalFrameEx
318: @ 0x7f5cd8537130 PyEval_EvalCodeEx
318: @ 0x7f5cd85354a1 PyEval_EvalFrameEx
318: @ 0x7f5cd8537130 PyEval_EvalCodeEx
318: @ 0x7f5cd85354a1 PyEval_EvalFrameEx
318: @ 0x7f5cd8537130 PyEval_EvalCodeEx
1/1 Test #318: test_elementwise_add_op ..........***Exception: SegFault 6.80 sec
查了一下发现是
https://github.com/PaddlePaddle/Paddle/blob/c6af7201e99bdb9dfe8798cab65aa4c625ac1cab/paddle/fluid/operators/elementwise_op_function.h#L72-L84
这里如果进来的就是一个{1}的shape,framework::make_ddim(trim_dims);
的时候其实是trim_dims是size为0 的。
最终导致
https://github.com/PaddlePaddle/Paddle/blob/c6af7201e99bdb9dfe8798cab65aa4c625ac1cab/paddle/fluid/framework/ddim.cc#L29-L31
这里 访问*d
挂掉,不会得到一个size为0 的ddim。
这个在release下可能是O3优化掉了,所以没事。
为了避免有别的单侧有类似情况,建议我们定期跑一下debug下的UT,比如每天晚上。