fluid 1.5.2: cudaGetLastError out of memory errno:2 during training
Created by: kebinC
Environment: single machine, multiple GPUs
After more than 2000 training rounds, the run fails with cudaGetLastError out of memory errno:2:
F0912 17:23:26.346724 783 all_reduce_op_handle.cc:75] cudaGetLastError out of memory errno:2
*** Check failure stack trace: ***
@ 0x7fdffb52a17d google::LogMessage::Fail()
@ 0x7fdffb52dc2c google::LogMessage::SendToLog()
@ 0x7fdffb529ca3 google::LogMessage::Flush()
@ 0x7fdffb52f13e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fdffd36d7a5 paddle::framework::details::AllReduceOpHandle::RunAllReduceFuncs()
@ 0x7fdffd36ec78 paddle::framework::details::AllReduceOpHandle::RunImpl()
@ 0x7fdffd3600b6 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7fdffd35edff paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7fdffd35f0c4 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7fdffb625273 std::_Function_handler<>::_M_invoke()
@ 0x7fdffb49f7b7 std::__future_base::_State_base::_M_do_set()
@ 0x7fe09562d620 __pthread_once_internal
@ 0x7fdffd35a8e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7fdffb4a0d34 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7fe01d32d470 (unknown)
@ 0x7fe095628893 start_thread
@ 0x7fe094c59bfd clone
@ (nil) (unknown)
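
Since the failure only appears after more than 2000 rounds, one thing worth checking is whether GPU memory usage grows steadily during training or stays flat until the crash. Below is a minimal diagnostic sketch, not part of the original report. It assumes `nvidia-smi` is on the PATH and reports numeric memory values; the helper names `parse_memory_csv`, `query_gpu_memory`, and `memory_growth` are hypothetical and not part of Paddle.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'index, used, total' CSV lines into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def query_gpu_memory():
    """Ask nvidia-smi for the current per-GPU memory usage in MiB (assumes nvidia-smi is installed)."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_memory_csv(out)


def memory_growth(samples):
    """Return the change in used MiB per GPU between the first and last sample."""
    first, last = samples[0], samples[-1]
    return {gpu: last[gpu][0] - first[gpu][0] for gpu in first if gpu in last}
```

A possible way to use it: call `query_gpu_memory()` every few hundred rounds inside the training loop, append the result to a list, and print `memory_growth(samples)`. A value that keeps rising on one or more GPUs would suggest memory accumulating across rounds, while a flat value would point toward a single large allocation (for example an unusually large batch) at the failing step.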