Paddle int8 single-machine multi-GPU training problem
Created by: suytingwan
Paddle version: 1.5.2. GPU: K40. Training fails after a few dozen batches, with a per-GPU batch size of 32.
Configuration:
export FLAGS_fast_eager_deletion_mode=1
export FLAGS_eager_delete_tensor_gb=0.0
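For anyone reproducing this from a Python launcher instead of a shell, the same two flags could be set roughly as follows. This is only a minimal sketch: `REPORTED_FLAGS` and `apply_reported_flags` are hypothetical names introduced for illustration, and it assumes the flags are placed in the environment before `paddle.fluid` is imported, since, as far as I know, Paddle reads `FLAGS_*` environment variables when the module is first loaded.

```python
import os

# The two flags from the configuration in this report (values kept as strings,
# exactly as they would appear in the shell `export` lines).
REPORTED_FLAGS = {
    "FLAGS_fast_eager_deletion_mode": "1",
    "FLAGS_eager_delete_tensor_gb": "0.0",
}


def apply_reported_flags(env=None):
    """Copy the reported flags into `env` (defaults to os.environ).

    Hypothetical helper, not part of Paddle. Returns the mapping it updated.
    """
    if env is None:
        env = os.environ
    for name, value in REPORTED_FLAGS.items():
        env[name] = value
    return env


if __name__ == "__main__":
    apply_reported_flags()
    # Assumption: import paddle.fluid only after this point so the flags
    # are visible when Paddle initializes.
    for name in REPORTED_FLAGS:
        print(name, "=", os.environ[name])
```

Passing a plain dict instead of `os.environ` makes it easy to check the values without touching the real process environment.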
The error message is as follows:
F1008 16:22:31.066076 20318 all_reduce_op_handle.cc:73] cudaStreamSynchronize an illegal memory access was encountered
*** Check failure stack trace: ***
F1008 16:22:31.066076 20329 device_context.cc:333] cudaStreamSynchronize an illegal memory access was encountered errno:77
*** Check failure stack trace: ***
@ 0x7f524077c7dd google::LogMessage::Fail()
@ 0x7f524077c7dd google::LogMessage::Fail()
@ 0x7f524078028c google::LogMessage::SendToLog()
@ 0x7f524078028c google::LogMessage::SendToLog()
@ 0x7f524077c303 google::LogMessage::Flush()
@ 0x7f524077c303 google::LogMessage::Flush()
@ 0x7f524078179e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f524078179e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f5242777fcd _ZNSt17_Function_handlerIFvvEZNK6paddle8platform17CUDADeviceContext4WaitEvEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f524249b7ed paddle::framework::details::AllReduceOpHandle::RunAllReduceFuncs()
@ 0x7f5242785a54 paddle::platform::TemporaryAllocator::Release()
@ 0x7f524249d158 paddle::framework::details::AllReduceOpHandle::RunImpl()
@ 0x7f524277afa1 paddle::platform::CUDADeviceContext::Wait()
@ 0x7f52424ac2e0 paddle::framework::details::OpHandleBase::Run()
@ 0x7f52424aba57 paddle::framework::details::OpHandleBase::RecordWaitEventOnCtx()
@ 0x7f524248d656 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7f524248ece5 paddle::framework::details::FetchOpHandle::WaitInputVarGenerated()
@ 0x7f524248c2bf paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f524248f354 paddle::framework::details::FetchOpHandle::RunImpl()
@ 0x7f524248c67f _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f52424ac2e0 paddle::framework::details::OpHandleBase::Run()
@ 0x7f524086bad3 std::_Function_handler<>::_M_invoke()
@ 0x7f524248d656 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7f52406ffb37 std::__future_base::_State_base::_M_do_set()
@ 0x7f524248c2bf paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f52a8cd3973 __GI___pthread_once