Training PPYOLO crashes with an unexplained error partway through
Created by: yeyupiaoling
Environment:
- Ubuntu 18.04
- PaddlePaddle 1.8.4.post107
- CUDA 10.0
- NCCL 2.4.8 for CUDA 10.0
- Python 3.7
Error message:
2020-09-10 18:04:12,059-INFO: iter: 3700, lr: 0.000925, 'loss_xy': '0.604115', 'loss_wh': '0.627810', 'loss_obj': '3.390031', 'loss_cls': '2.499329', 'loss_iou': '2.253458', 'loss_iou_aware': '0.017610', 'loss': '9.358524', time: 0.535, eta: 3 days, 1:46:08
2020-09-10 18:05:06,607-INFO: iter: 3800, lr: 0.000950, 'loss_xy': '0.586375', 'loss_wh': '0.599361', 'loss_obj': '3.342260', 'loss_cls': '2.450277', 'loss_iou': '2.174241', 'loss_iou_aware': '0.018188', 'loss': '9.133036', time: 0.544, eta: 3 days, 2:59:30
2020-09-10 18:06:01,443-INFO: iter: 3900, lr: 0.000975, 'loss_xy': '0.581221', 'loss_wh': '0.586602', 'loss_obj': '3.305507', 'loss_cls': '2.463003', 'loss_iou': '2.171795', 'loss_iou_aware': '0.016012', 'loss': '9.102219', time: 0.550, eta: 3 days, 3:48:00
2020-09-10 18:06:54,202-INFO: iter: 4000, lr: 0.001000, 'loss_xy': '0.583331', 'loss_wh': '0.616600', 'loss_obj': '3.176893', 'loss_cls': '2.360175', 'loss_iou': '2.219807', 'loss_iou_aware': '0.015864', 'loss': '9.072341', time: 0.526, eta: 3 days, 0:28:04
2020-09-10 18:07:48,287-INFO: iter: 4100, lr: 0.001000, 'loss_xy': '0.580252', 'loss_wh': '0.612756', 'loss_obj': '3.205532', 'loss_cls': '2.373779', 'loss_iou': '2.197740', 'loss_iou_aware': '0.015548', 'loss': '8.962381', time: 0.541, eta: 3 days, 2:35:08
W0910 18:08:32.227979 1947 init.cc:226] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0910 18:08:32.228004 1947 init.cc:228] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0910 18:08:32.228008 1947 init.cc:231] The detail failure signal is:
W0910 18:08:32.228010 1947 init.cc:234] *** Aborted at 1599732512 (unix time) try "date -d @1599732512" if you are using GNU date ***
W0910 18:08:32.228685 1947 init.cc:234] PC: @ 0x0 (unknown)
W0910 18:08:32.229272 1947 init.cc:234] *** SIGSEGV (@0x10) received by PID 1872 (TID 0x7f8eeefdc700) from PID 16; stack trace: ***
W0910 18:08:32.229846 1947 init.cc:234] @ 0x7f91ff75cfd0 (unknown)
W0910 18:08:32.231119 1947 init.cc:234] @ 0x7f916d73908a paddle::platform::proto::MessageDesc::MergePartialFromCodedStream()
W0910 18:08:32.233204 1947 init.cc:234] @ 0x7f916d739971 paddle::platform::proto::AllMessageDesc::MergePartialFromCodedStream()
W0910 18:08:32.235473 1947 init.cc:234] @ 0x7f916d73a3dd paddle::platform::proto::cudaerrorDesc::MergePartialFromCodedStream()
W0910 18:08:32.237517 1947 init.cc:234] @ 0x7f916cfed270 google::protobuf::MessageLite::ParseFromCodedStream()
W0910 18:08:32.239045 1947 init.cc:234] @ 0x7f916cfed41a google::protobuf::MessageLite::ParseFromZeroCopyStream()
W0910 18:08:32.240597 1947 init.cc:234] @ 0x7f916cff3b79 google::protobuf::Message::ParseFromIstream()
W0910 18:08:32.242121 1947 init.cc:234] @ 0x7f916a0b90e2 paddle::platform::build_nvidia_error_msg()
W0910 18:08:32.243868 1947 init.cc:234] @ 0x7f916d6cb2df paddle::platform::GpuMemcpyAsync()
W0910 18:08:32.245286 1947 init.cc:234] @ 0x7f916d69b44e paddle::memory::Copy<>()
W0910 18:08:32.246776 1947 init.cc:234] @ 0x7f916a4bb272 paddle::operators::SumToLoDTensor<>()
W0910 18:08:32.248064 1947 init.cc:234] @ 0x7f916a4c30b8 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EINS0_9operators9SumKernelINS7_17CUDADeviceContextEfEENSA_ISB_dEENSA_ISB_iEENSA_ISB_lEENSA_ISB_NS7_7float16EEEEEclEPKcSK_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0910 18:08:32.250301 1947 init.cc:234] @ 0x7f916d5f9830 paddle::framework::OperatorWithKernel::RunImpl()
W0910 18:08:32.253427 1947 init.cc:234] @ 0x7f916d5fa021 paddle::framework::OperatorWithKernel::RunImpl()
W0910 18:08:32.255237 1947 init.cc:234] @ 0x7f916d5f2fe1 paddle::framework::OperatorBase::Run()
W0910 18:08:32.257776 1947 init.cc:234] @ 0x7f916d3028c6 paddle::framework::details::ComputationOpHandle::RunImpl()
W0910 18:08:32.260211 1947 init.cc:234] @ 0x7f916d2a8aa1 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0910 18:08:32.262575 1947 init.cc:234] @ 0x7f916d2a659f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0910 18:08:32.263298 1947 init.cc:234] @ 0x7f916d2a6864 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0910 18:08:32.266253 1947 init.cc:234] @ 0x7f916a09d413 std::_Function_handler<>::_M_invoke()
W0910 18:08:32.269752 1947 init.cc:234] @ 0x7f9169e97107 std::__future_base::_State_base::_M_do_set()
W0910 18:08:32.270504 1947 init.cc:234] @ 0x7f91ff50e827 __pthread_once_slow
W0910 18:08:32.271306 1947 init.cc:234] @ 0x7f916d2a2a32 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0910 18:08:32.273910 1947 init.cc:234] @ 0x7f9169e99564 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0910 18:08:32.274421 1947 init.cc:234] @ 0x7f91d0aaca50 (unknown)
W0910 18:08:32.275040 1947 init.cc:234] @ 0x7f91ff5066db start_thread
W0910 18:08:32.275523 1947 init.cc:234] @ 0x7f91ff83fa3f clone
W0910 18:08:32.275995 1947 init.cc:234] @ 0x0 (unknown)
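The trace above is hard to read with the glog prefixes and the unresolved `(unknown)` frames mixed in. Below is a minimal sketch of a helper that keeps only the frames that have a symbol name. The name `extract_frames` and the regular expression are my own for illustration and are not part of PaddlePaddle. It assumes every frame line has the `] @ <address> <symbol>` shape seen in this log.

```python
import re

# Matches glog-style stack frames such as:
#   W0910 18:08:32.242121 1947 init.cc:234] @ 0x7f916a0b90e2 paddle::platform::build_nvidia_error_msg()
# Lines like "] PC: @ 0x0 (unknown)" or "] *** SIGSEGV ..." do not match,
# because the "@" must directly follow the "]" (ignoring whitespace).
FRAME_RE = re.compile(r"\]\s+@\s+(0x[0-9a-f]+)\s+(.+)$")


def extract_frames(log_text):
    """Return (address, symbol) pairs for frames with a resolved symbol.

    Frames are returned in the order they appear in the log, i.e. the
    innermost frame first.
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a stack frame line
        address, symbol = match.group(1), match.group(2).strip()
        if symbol == "(unknown)":
            continue  # skip frames without a symbol name
        frames.append((address, symbol))
    return frames


if __name__ == "__main__":
    # Hypothetical file name: save the log above to train.log first.
    with open("train.log", encoding="utf-8") as f:
        for address, symbol in extract_frames(f.read()):
            print(address, symbol)
```

Read this way, the innermost named frames in the trace are the protobuf parsing calls under `paddle::platform::build_nvidia_error_msg()`, which is itself called from `paddle::platform::GpuMemcpyAsync()` inside the sum kernel. So the SIGSEGV appears to be raised while an error message for the memcpy call was being built, not by the copy itself.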