ncclGroupEnd() == ncclSuccess
Created by: 333caowei
During single-machine, multi-GPU training, the following error suddenly appeared partway through the run: Check failed: dynload::ncclGroupEnd() == ncclSuccess (1 vs. 0)
2018-11-25 15:43:04 , Epoch_id: 0, Batch_id: 17680, Cost: 0.087121
2018-11-25 15:43:07 , Epoch_id: 0, Batch_id: 17690, Cost: 0.092389
2018-11-25 15:44:25 , Epoch_id: 0, Batch_id: 17700, Cost: 0.088337
2018-11-25 15:44:29 , Epoch_id: 0, Batch_id: 17710, Cost: 0.095991
2018-11-25 15:44:32 , Epoch_id: 0, Batch_id: 17720, Cost: 0.087461
2018-11-25 15:44:36 , Epoch_id: 0, Batch_id: 17730, Cost: 0.087458
F1125 15:44:39.266677 200102 nccl_helper.h:62] Check failed: dynload::ncclGroupEnd() == ncclSuccess (1 vs. 0)
*** Check failure stack trace: ***
@ 0x7fb8680c0f9d google::LogMessage::Fail()
@ 0x7fb8680c4a4c google::LogMessage::SendToLog()
@ 0x7fb8680c0ac3 google::LogMessage::Flush()
@ 0x7fb8680c5f5e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fb869369ea6 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details17BroadcastOpHandle15BroadcastOneVarERKNS3_9VarHandleERKSt6vectorIPS5_SaIS9_EERKS8_IPKNS2_5ScopeESaISG_EEEUlvE1_E9_M_invokeERKSt9_Any_data
@ 0x7fb869376790 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fb869376790 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fb869376790 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fb869376790 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fb869376790 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fb869376790 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fb869376790 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fb869376790 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details12OpHandleBase17RunAndRecordEventERKSt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fb869375ff5 paddle::framework::details::OpHandleBase::RunAndRecordEvent()
@ 0x7fb86936bb58 paddle::framework::details::BroadcastOpHandle::BroadcastOneVar()
@ 0x7fb86936c4e4 paddle::framework::details::BroadcastOpHandle::RunImpl()
@ 0x7fb869377095 paddle::framework::details::OpHandleBase::Run()
@ 0x7fb86930ee8a _ZZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNS0_13BlockingQueueIPNS1_13VarHandleBaseEEEPNS1_12OpHandleBaseEENKUlvE_clEv
@ 0x7fb868109593 std::_Function_handler<>::_M_invoke()
@ 0x7fb868108d67 std::__future_base::_State_base::_M_do_set()
@ 0x7fb8d0c0c973 __GI___pthread_once
@ 0x7fb86930de62 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details24ThreadedSSAGraphExecutor5RunOpEPNS3_13BlockingQueueIPNS4_13VarHandleBaseEEEPNS4_12OpHandleBaseEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7fb86810aea4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7fb8c111b8a0 execute_native_thread_routine
@ 0x7fb8d0c071c3 start_thread
@ 0x7fb8d022f12d __clone
@ (nil) (unknown)
The Paddle version is PaddlePaddle 1.1.0.post97.
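
For anyone triaging this, here is a minimal sketch of how the "(1 vs. 0)" in the check failure might be read, and how more NCCL diagnostics could be requested on the next run. It assumes the `ncclResult_t` values from the NCCL 2.x `nccl.h` header (other NCCL versions may number them differently), and `describe_nccl_result` is a hypothetical helper, not part of Paddle or NCCL. `NCCL_DEBUG` is a standard NCCL environment variable; it has to be set before the framework initializes NCCL.

```python
import os

# Assumption: ncclResult_t values as declared in the NCCL 2.x nccl.h header.
# Other NCCL versions may use a different numbering.
NCCL_RESULTS = {
    0: "ncclSuccess",
    1: "ncclUnhandledCudaError",
    2: "ncclSystemError",
    3: "ncclInternalError",
    4: "ncclInvalidArgument",
    5: "ncclInvalidUsage",
}


def describe_nccl_result(code):
    """Hypothetical helper: map a numeric NCCL result code to its enum name."""
    return NCCL_RESULTS.get(code, "unknown ncclResult_t value %d" % code)


# Ask NCCL to log its own diagnostics on the next run. setdefault keeps any
# value that is already present in the environment.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# The glog message "(1 vs. 0)" shows the returned value (1) against the
# expected ncclSuccess (0).
print(describe_nccl_result(1))
```

Under that assumption the failing call returned a CUDA-level error rather than an NCCL usage error, so the NCCL log output from a rerun would probably be the most useful next piece of information.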