When training object_detection in multi-machine NCCL mode, the program hangs with GPU utilization at 100%
Created by: kolinwei
After killing the program, the following error message appears:

0926 13:34:48.170254 21140 operator.cc:686] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[CUDNN]
*** Aborted at 1537940266 (unix time) try "date -d @1537940266" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGTERM (@0x4a82) received by PID 21140 (TID 0x7fc42f5d1740) from PID 19074; stack trace: ***
@ 0x38c040f130 (unknown)
@ 0x7ffd831c57c2 ([vdso]+0x7c1)
@ 0x38bfd09f7d (unknown)
@ 0x7fc3cdee2f2e (unknown)
@ 0x7fc3cdf76585 (unknown)
@ 0x7fc3cdece083 (unknown)
@ 0x7fc3cdece1d9 (unknown)
@ 0x7fc3cdde7407 (unknown)
@ 0x7fc3cdf2c7e2 cuStreamSynchronize
@ 0x7fc3d95ddfa4 cudart::cudaApiStreamSynchronize()
@ 0x7fc3d960c6fd cudaStreamSynchronize
@ 0x7fc3d94fbbe4 paddle::platform::CUDADeviceContext::RunCudnnFuncWithWorkspace()
@ 0x7fc3d905d563 paddle::operators::CUDNNConvGradOpKernel<>::Compute()
@ 0x7fc3d905e013 ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators21CUDNNConvGradOpKernelIfEENSA_IdEEEEclEPKcSF_EUlS4_E_E9_M_invokeERKSt9_Any_dataS4
@ 0x7fc3d9431826 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7fc3d942e9ac paddle::framework::OperatorBase::Run()
@ 0x7fc3d9268167 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details19ComputationOpHandle7RunImplEvEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fc3d9284790 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fc3d9283ff5 paddle::framework::details::OpHandleBase::RunAndRecordEvent()
@ 0x7fc3d9267c3f paddle::framework::details::ComputationOpHandle::RunImpl()
@ 0x7fc3d9285095 paddle::framework::details::OpHandleBase::Run()
@ 0x7fc3d9233f1a _ZZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNS0_13BlockingQueueIPNS1_13VarHandleBaseEEEPNS1_12OpHandleBaseEENKUlvE_clEv
@ 0x7fc3d9234845 paddle::framework::details::ThreadedSSAGraphExecutor::RunOp()
@ 0x7fc3d9236350 paddle::framework::details::ThreadedSSAGraphExecutor::Run()
@ 0x7fc3d923ac5d paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run()
@ 0x7fc3d824bc9c paddle::framework::ParallelExecutor::Run()
@ 0x7fc3d8162a20 ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework16ParallelExecutorERKSt6vectorISsSaISsEERKSsE95_vIS6_SB_SD_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESV
@ 0x7fc3d817e764 pybind11::cpp_function::dispatcher()
@ 0x7fc42f6ca404 PyEval_EvalFrameEx
@ 0x7fc42f6cb150 PyEval_EvalCodeEx
@ 0x7fc42f6c94c1 PyEval_EvalFrameEx
@ 0x7fc42f6cb150 PyEval_EvalCodeEx

Note: the multi-machine NCCL setup was simulated on a single machine.