Training stops abnormally
Created by: AN-ZE
Fresh installation on a new machine. Environment: Ubuntu 16.04 + CUDA 10 + cuDNN 7 + NCCL 2.5.6 + Paddle 1.8 + PaddleDetection 0.4
Model: Cascade R-CNN + DCN
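Multi-GPU training fails with the error below. What does this error message mean? If I stop and start training again it runs, but the resulting model is abnormal (all prediction results are empty).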
2020-09-25 18:05:05,751-INFO: places would be ommited when DataLoader is not iterable
W0925 18:05:33.641295 29199 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 114. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 101.
2020-09-25 18:05:38,064-INFO: iter: 0, lr: 0.000100, 'loss_cls_0': '3.613413', 'loss_loc_0': '0.012379', 'loss_cls_1': '1.649544', 'loss_loc_1': '0.004582', 'loss_cls_2': '0.896943', 'loss_loc_2': '0.000360', 'loss_rpn_cls': '0.671247', 'loss_rpn_bbox': '0.006319', 'loss': '6.854786', time: 0.000, eta: 0:00:03
W0925 18:05:47.226415 29387 operator.cc:187] elementwise_add raises an exception thrust::system::system_error, parallel_for failed: invalid configuration argument
F0925 18:05:47.226567 29387 exception_holder.h:37] std::exception caught, parallel_for failed: invalid configuration argument
*** Check failure stack trace: ***
W0925 18:05:47.228652 29383 operator.cc:187] elementwise_add raises an exception thrust::system::system_error, parallel_for failed: invalid configuration argument
F0925 18:05:47.228725 29383 exception_holder.h:37] std::exception caught, parallel_for failed: invalid configuration argument
*** Check failure stack trace: ***
@ 0x7f8aef24483d google::LogMessage::Fail()
@ 0x7f8aef24483d google::LogMessage::Fail()
@ 0x7f8aef2482ec google::LogMessage::SendToLog()
@ 0x7f8aef2482ec google::LogMessage::SendToLog()
@ 0x7f8aef244363 google::LogMessage::Flush()
@ 0x7f8aef2497fe google::LogMessageFatal::~LogMessageFatal()
@ 0x7f8af24105d8 paddle::framework::details::ExceptionHolder::Catch()
@ 0x7f8aef244363 google::LogMessage::Flush()
@ 0x7f8af24adbfe paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7f8af24ab59f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f8aef2497fe google::LogMessageFatal::~LogMessageFatal()
@ 0x7f8af24ab864 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f8af24105d8 paddle::framework::details::ExceptionHolder::Catch()
@ 0x7f8aef2a2413 std::_Function_handler<>::_M_invoke()
@ 0x7f8af24adbfe paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7f8aef09c107 std::__future_base::_State_base::_M_do_set()
@ 0x7f8b8913ba99 __pthread_once_slow
@ 0x7f8af24a7a32 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7f8af24ab59f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f8af24ab864 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f8aef09e564 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7f8b59220421 execute_native_thread_routine_compat
@ 0x7f8aef2a2413 std::_Function_handler<>::_M_invoke()
@ 0x7f8b891346ba start_thread
@ 0x7f8b88e6a4dd clone
@ 0x7f8aef09c107 std::__future_base::_State_base::_M_do_set()
@ (nil) (unknown)
Aborted (core dumped)