lexical analysis: training on my own dataset fails with [killed PaddlePaddle thread/process accidentally]
Created by: xxxsssyyy
- Version and environment info: 1) PaddlePaddle version: 1.7.1, Python 2.7.16 [GCC 7.3.0] 2) CPU: not specified (training on GPU) 3) GPU: V100, 4 cards 4) System environment: not specified
- Training info: 1) Single machine, 4 GPUs 2) GPU memory: 32480 MiB per card 3) Operator info: not specified
- Problem description: My dataset is very large (tens of millions of samples). With batch_size set to 60000, I launched training with `sh run.sh train_multi_gpu`, and I don't understand what the error message below means.
[train] step = 667, loss = 5.18007, P: 0.83305, R: 0.72868, F1: 0.77738, elapsed time 44.67638
[train] step = 668, loss = 4.71820, P: 0.83359, R: 0.73958, F1: 0.78377, elapsed time 44.60957
W0415 01:28:43.293865 57018 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0415 01:28:43.293926 57018 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0415 01:28:43.293931 57018 init.cc:214] The detail failure signal is:
W0415 01:28:43.293936 57018 init.cc:217] *** Aborted at 1586885323 (unix time) try "date -d @1586885323" if you are using GNU date ***
W0415 01:28:43.295604 57018 init.cc:217] PC: @ 0x0 (unknown)
W0415 01:28:43.295755 57018 init.cc:217] *** SIGSEGV (@0x50) received by PID 56823 (TID 0x7fef1b517700) from PID 80; stack trace: ***
W0415 01:28:43.296983 57018 init.cc:217] @ 0x7fef851685d0 (unknown)
W0415 01:28:43.298117 57018 init.cc:217] @ 0x7fef85380b16 _dl_relocate_object
W0415 01:28:43.299258 57018 init.cc:217] @ 0x7fef8538959c dl_open_worker
W0415 01:28:43.300359 57018 init.cc:217] @ 0x7fef85384704 _dl_catch_error
W0415 01:28:43.301456 57018 init.cc:217] @ 0x7fef85388abb _dl_open
W0415 01:28:43.302605 57018 init.cc:217] @ 0x7fef847bfb12 do_dlopen
W0415 01:28:43.303709 57018 init.cc:217] @ 0x7fef85384704 _dl_catch_error
W0415 01:28:43.304888 57018 init.cc:217] @ 0x7fef847bfbd2 __GI___libc_dlopen_mode
W0415 01:28:43.306037 57018 init.cc:217] @ 0x7fef84796fc5 init
W0415 01:28:43.307126 57018 init.cc:217] @ 0x7fef85165e40 __GI___pthread_once
W0415 01:28:43.308282 57018 init.cc:217] @ 0x7fef847970dc __GI___backtrace
W0415 01:28:43.310690 57018 init.cc:217] @ 0x7fee36fb8d45 paddle::platform::GetTraceBackString<>()
W0415 01:28:43.312077 57018 init.cc:217] @ 0x7fee39bb088a paddle::memory::allocation::CUDAAllocator::AllocateImpl()
W0415 01:28:43.313971 57018 init.cc:217] @ 0x7fee39bd4172 paddle::memory::allocation::AlignedAllocator::AllocateImpl()
W0415 01:28:43.316745 57018 init.cc:217] @ 0x7fee39bd2631 paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl()
W0415 01:28:43.318388 57018 init.cc:217] @ 0x7fee39bb214b paddle::memory::allocation::RetryAllocator::AllocateImpl()
W0415 01:28:43.319717 57018 init.cc:217] @ 0x7fee39bace03 paddle::memory::allocation::AllocatorFacade::Alloc()
W0415 01:28:43.321595 57018 init.cc:217] @ 0x7fee39bad09e paddle::memory::allocation::AllocatorFacade::AllocShared()
W0415 01:28:43.322841 57018 init.cc:217] @ 0x7fee39bac81c paddle::memory::AllocShared()
W0415 01:28:43.324249 57018 init.cc:217] @ 0x7fee39b997b2 paddle::framework::Tensor::mutable_data()
W0415 01:28:43.326561 57018 init.cc:217] @ 0x7fee382f2d9a paddle::operators::GRUGradKernel<>::BatchCompute()
W0415 01:28:43.328370 57018 init.cc:217] @ 0x7fee382f4143 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators13GRUGradKernelINS7_17CUDADeviceContextEfEENSA_ISB_dEEEEclEPKcSG_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0415 01:28:43.330576 57018 init.cc:217] @ 0x7fee39b11b16 paddle::framework::OperatorWithKernel::RunImpl()
W0415 01:28:43.333703 57018 init.cc:217] @ 0x7fee39b122e1 paddle::framework::OperatorWithKernel::RunImpl()
W0415 01:28:43.335433 57018 init.cc:217] @ 0x7fee39b0b430 paddle::framework::OperatorBase::Run()
W0415 01:28:43.337113 57018 init.cc:217] @ 0x7fee3988ce79 paddle::framework::details::OpHandleBase::RunAndRecordEvent()
W0415 01:28:43.339591 57018 init.cc:217] @ 0x7fee3989008b paddle::framework::details::ComputationOpHandle::RunImpl()
W0415 01:28:43.341863 57018 init.cc:217] @ 0x7fee39847281 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0415 01:28:43.344367 57018 init.cc:217] @ 0x7fee39845fef paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0415 01:28:43.345208 57018 init.cc:217] @ 0x7fee398462b4 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0415 01:28:43.348004 57018 init.cc:217] @ 0x7fee3732bb23 std::_Function_handler<>::_M_invoke()
W0415 01:28:43.350818 57018 init.cc:217] @ 0x7fee370ba407 std::__future_base::_State_base::_M_do_set()
run.sh: line 40: 56823 Segmentation fault (core dumped) python train.py --use_cuda true
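The stack trace shows the failure originating in `paddle::memory::allocation::CUDAAllocator::AllocateImpl()` inside `GRUGradKernel<>::BatchCompute()`, i.e. a GPU memory allocation made during the GRU backward pass, which points toward GPU memory exhaustion with batch_size = 60000. As a minimal mitigation sketch (the assumption that memory exhaustion is the cause, and the specific flag values, are mine, not confirmed by the log), one could lower the batch size and set Paddle 1.x memory flags before launching:

```shell
# Sketch, assuming the segfault is GPU memory exhaustion (not confirmed):
# 1) Reduce batch_size well below 60000 in the training config.
# 2) Release intermediate tensors eagerly instead of caching them.
export FLAGS_eager_delete_tensor_gb=0.0
# 3) Cap the initial GPU memory pool to leave headroom on each 32 GB V100.
export FLAGS_fraction_of_gpu_memory_to_use=0.9
sh run.sh train_multi_gpu
```

If the crash disappears at a smaller batch size, that would confirm the memory-exhaustion hypothesis; the batch size can then be raised gradually to find the limit.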