lexical analysis: training on my own dataset fails with [killed PaddlePaddle thread/process accidentally]
Created by: xxxsssyyy
- Version and environment info: 1) PaddlePaddle version: 1.7.1, Python 2.7.16 [GCC 7.3.0] 2) CPU: not specified (training on GPU) 3) GPU: V100, 4 cards 4) System environment: not specified
- Training info: 1) Single machine, 4 GPUs 2) GPU memory: 32480 MiB per card 3) Operator info: not specified
- Problem description: My dataset is very large (tens of millions of samples). With batch_size set to 60000, I launched training with `sh run.sh train_multi_gpu`, and I don't understand what the error message below means.
[train] step = 667, loss = 5.18007, P: 0.83305, R: 0.72868, F1: 0.77738, elapsed time 44.67638
[train] step = 668, loss = 4.71820, P: 0.83359, R: 0.73958, F1: 0.78377, elapsed time 44.60957
W0415 01:28:43.293865 57018 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0415 01:28:43.293926 57018 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0415 01:28:43.293931 57018 init.cc:214] The detail failure signal is:
W0415 01:28:43.293936 57018 init.cc:217] *** Aborted at 1586885323 (unix time) try "date -d @1586885323" if you are using GNU date ***
W0415 01:28:43.295604 57018 init.cc:217] PC: @ 0x0 (unknown)
W0415 01:28:43.295755 57018 init.cc:217] *** SIGSEGV (@0x50) received by PID 56823 (TID 0x7fef1b517700) from PID 80; stack trace: ***
W0415 01:28:43.296983 57018 init.cc:217] @ 0x7fef851685d0 (unknown)
W0415 01:28:43.298117 57018 init.cc:217] @ 0x7fef85380b16 _dl_relocate_object
W0415 01:28:43.299258 57018 init.cc:217] @ 0x7fef8538959c dl_open_worker
W0415 01:28:43.300359 57018 init.cc:217] @ 0x7fef85384704 _dl_catch_error
W0415 01:28:43.301456 57018 init.cc:217] @ 0x7fef85388abb _dl_open
W0415 01:28:43.302605 57018 init.cc:217] @ 0x7fef847bfb12 do_dlopen
W0415 01:28:43.303709 57018 init.cc:217] @ 0x7fef85384704 _dl_catch_error
W0415 01:28:43.304888 57018 init.cc:217] @ 0x7fef847bfbd2 __GI___libc_dlopen_mode
W0415 01:28:43.306037 57018 init.cc:217] @ 0x7fef84796fc5 init
W0415 01:28:43.307126 57018 init.cc:217] @ 0x7fef85165e40 __GI___pthread_once
W0415 01:28:43.308282 57018 init.cc:217] @ 0x7fef847970dc __GI___backtrace
W0415 01:28:43.310690 57018 init.cc:217] @ 0x7fee36fb8d45 paddle::platform::GetTraceBackString<>()
W0415 01:28:43.312077 57018 init.cc:217] @ 0x7fee39bb088a paddle::memory::allocation::CUDAAllocator::AllocateImpl()
W0415 01:28:43.313971 57018 init.cc:217] @ 0x7fee39bd4172 paddle::memory::allocation::AlignedAllocator::AllocateImpl()
W0415 01:28:43.316745 57018 init.cc:217] @ 0x7fee39bd2631 paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl()
W0415 01:28:43.318388 57018 init.cc:217] @ 0x7fee39bb214b paddle::memory::allocation::RetryAllocator::AllocateImpl()
W0415 01:28:43.319717 57018 init.cc:217] @ 0x7fee39bace03 paddle::memory::allocation::AllocatorFacade::Alloc()
W0415 01:28:43.321595 57018 init.cc:217] @ 0x7fee39bad09e paddle::memory::allocation::AllocatorFacade::AllocShared()
W0415 01:28:43.322841 57018 init.cc:217] @ 0x7fee39bac81c paddle::memory::AllocShared()
W0415 01:28:43.324249 57018 init.cc:217] @ 0x7fee39b997b2 paddle::framework::Tensor::mutable_data()
W0415 01:28:43.326561 57018 init.cc:217] @ 0x7fee382f2d9a paddle::operators::GRUGradKernel<>::BatchCompute()
W0415 01:28:43.328370 57018 init.cc:217] @ 0x7fee382f4143 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators13GRUGradKernelINS7_17CUDADeviceContextEfEENSA_ISB_dEEEEclEPKcSG_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0415 01:28:43.330576 57018 init.cc:217] @ 0x7fee39b11b16 paddle::framework::OperatorWithKernel::RunImpl()
W0415 01:28:43.333703 57018 init.cc:217] @ 0x7fee39b122e1 paddle::framework::OperatorWithKernel::RunImpl()
W0415 01:28:43.335433 57018 init.cc:217] @ 0x7fee39b0b430 paddle::framework::OperatorBase::Run()
W0415 01:28:43.337113 57018 init.cc:217] @ 0x7fee3988ce79 paddle::framework::details::OpHandleBase::RunAndRecordEvent()
W0415 01:28:43.339591 57018 init.cc:217] @ 0x7fee3989008b paddle::framework::details::ComputationOpHandle::RunImpl()
W0415 01:28:43.341863 57018 init.cc:217] @ 0x7fee39847281 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0415 01:28:43.344367 57018 init.cc:217] @ 0x7fee39845fef paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0415 01:28:43.345208 57018 init.cc:217] @ 0x7fee398462b4 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0415 01:28:43.348004 57018 init.cc:217] @ 0x7fee3732bb23 std::_Function_handler<>::_M_invoke()
W0415 01:28:43.350818 57018 init.cc:217] @ 0x7fee370ba407 std::__future_base::_State_base::_M_do_set()
run.sh: line 40: 56823 Segmentation fault (core dumped) python train.py --use_cuda true
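The stack trace shows the failure originating in `paddle::memory::allocation::CUDAAllocator::AllocateImpl()` inside `GRUGradKernel<>::BatchCompute()`, i.e. a GPU memory allocation made during the GRU backward pass, which points toward GPU memory exhaustion with batch_size = 60000. As a minimal mitigation sketch (the assumption that memory exhaustion is the cause, and the specific flag values, are mine, not confirmed by the log), one could lower the batch size and set Paddle 1.x memory flags before launching:

```shell
# Sketch, assuming the segfault is GPU memory exhaustion (not confirmed):
# 1) Reduce batch_size well below 60000 in the training config.
# 2) Release intermediate tensors eagerly instead of caching them.
export FLAGS_eager_delete_tensor_gb=0.0
# 3) Cap the initial GPU memory pool to leave headroom on each 32 GB V100.
export FLAGS_fraction_of_gpu_memory_to_use=0.9
sh run.sh train_multi_gpu
```

If the crash disappears at a smaller batch size, that would confirm the memory-exhaustion hypothesis; the batch size can then be raised gradually to find the limit.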