Created by: xiegegege
1. Run the command `python train.py --model=GoogleNet`; after about 120 batches the loss becomes NaN (a NaN-check sketch follows the stack trace below).
2. On the paddle develop branch, the following error appears after training for a while (not always reproducible):
Pass 0, trainbatch 2330, loss nan, acc1 0.00000, acc5 0.00781, lr 0.10000, time 0.65 sec
F0505 20:34:00.984820 209240 all_reduce_op_handle.cc:78] cudaStreamSynchronize an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7f7ad2ed07ad google::LogMessage::Fail()
@ 0x7f7ad2ed425c google::LogMessage::SendToLog()
@ 0x7f7ad2ed02d3 google::LogMessage::Flush()
@ 0x7f7ad2ed576e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f7ad465d503 paddle::framework::details::AllReduceOpHandle::RunAllReduceFuncs()
@ 0x7f7ad465eeee paddle::framework::details::AllReduceOpHandle::RunImpl()
@ 0x7f7ad46cebc0 paddle::framework::details::OpHandleBase::Run()
@ 0x7f7ad46b6a72 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f7ad46b6daf _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f7ad400d3f3 std::_Function_handler<>::_M_invoke()
@ 0x7f7ad2e6bc77 std::__future_base::_State_base::_M_do_set()
@ 0x7f7bb5b49e03 __pthread_once_internal
@ 0x7f7ad46b32c2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7f7ad2e6d1f4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7f7b6995d470 (unknown)
@ 0x7f7bb5b44aa1 start_thread
@ 0x7f7bb4ffec4d clone
@ (nil) (unknown)
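
A minimal debugging sketch, not from the original report: it assumes the training loop in train.py fetches the loss as a numpy array each step, and checks it with np.isnan so training stops at the exact batch where divergence begins, before the later illegal-memory-access crash. The `run_one_batch` helper is a hypothetical placeholder for the existing `exe.run(..., fetch_list=[...])` call in train.py.

```python
# Debugging sketch only -- `run_one_batch` is a hypothetical stand-in for the
# loss-fetching call already present in train.py.
import numpy as np

def run_one_batch(batch_id):
    # Placeholder: return this batch's loss as a numpy array.
    return np.array([0.1 / (batch_id + 1)])

for batch_id in range(200):
    loss_val = run_one_batch(batch_id)
    if np.isnan(loss_val).any():
        # Stop at the first NaN so the offending batch (inputs, learning rate,
        # gradients) can be inspected.
        print("loss first became NaN at batch %d" % batch_id)
        break
    print("batch %d, loss %.5f" % (batch_id, float(loss_val.mean())))
```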