Segmentation fault (core dumped) at the beginning of training
Created by: nihuizhidao
Training crashes with a segmentation fault right after the first iteration:
```
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2020-05-14 17:13:37,547-INFO: 24 samples in file custom_dataset/tiantian_v3_seg_coco/annotations/instance_val.json
2020-05-14 17:13:37,548-INFO: places would be ommited when DataLoader is not iterable
W0514 17:13:38.423727  3942 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.2, Runtime API Version: 10.0
W0514 17:13:38.428025  3942 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-05-14 17:13:39,522-INFO: Found /home/dell/.cache/paddle/weights/mask_rcnn_r50_fpn_1x
2020-05-14 17:13:39,523-INFO: Loading parameters from /home/dell/.cache/paddle/weights/mask_rcnn_r50_fpn_1x...
2020-05-14 17:13:39,758-WARNING: /tmp/tmpj0vboqhy/mask_rcnn_r50_fpn_1x.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
2020-05-14 17:13:39,758-WARNING: /tmp/tmpj0vboqhy/mask_rcnn_r50_fpn_1x.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
2020-05-14 17:13:40,095-INFO: 99 samples in file custom_dataset/tiantian_v3_seg_coco/annotations/instance_train.json
2020-05-14 17:13:40,096-INFO: places would be ommited when DataLoader is not iterable
I0514 17:13:40.195358  3942 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 2 cards are used, so 2 programs are executed in parallel.
W0514 17:13:42.282743  3942 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 84. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 84.
I0514 17:13:42.291462  3942 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0514 17:13:42.634323  3942 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0514 17:13:42.718104  3942 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
2020-05-14 17:13:43,547-INFO: iter: 0, lr: 0.000003, 'loss_cls': '2.054725', 'loss_bbox': '0.205467', 'loss_rpn_cls': '0.102608', 'loss_rpn_bbox': '0.031138', 'loss_mask': '1.888842', 'loss': '4.282780', time: 0.058, eta: 0:19:22
W0514 17:13:44.734632  4295 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0514 17:13:44.734692  4295 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0514 17:13:44.734707  4295 init.cc:214] The detail failure signal is:
W0514 17:13:44.734725  4295 init.cc:217] *** Aborted at 1589447624 (unix time) try "date -d @1589447624" if you are using GNU date ***
W0514 17:13:44.737655  4295 init.cc:217] PC: @ 0x0 (unknown)
W0514 17:13:44.737893  4295 init.cc:217] *** SIGSEGV (@0x100000029) received by PID 3942 (TID 0x7f77177fe700) from PID 41; stack trace: ***
W0514 17:13:44.742903  4295 init.cc:217] @ 0x7f7854fe0390 (unknown)
W0514 17:13:44.749749  4295 init.cc:217] @ 0x7f780b51f688 paddle::memory::detail::MemoryBlock::Split()
W0514 17:13:44.758744  4295 init.cc:217] @ 0x7f780b51dbb2 _ZN6paddle6memory6detail14BuddyAllocator12SplitToAllocESt23_Rb_tree_const_iteratorISt5tupleIJmmPvEEEm
W0514 17:13:44.768972  4295 init.cc:217] @ 0x7f780b51e095 paddle::memory::detail::BuddyAllocator::Alloc()
W0514 17:13:44.776578  4295 init.cc:217] @ 0x7f780b50da35 paddle::memory::legacy::Alloc<>()
W0514 17:13:44.785235  4295 init.cc:217] @ 0x7f780b50e9d5 paddle::memory::allocation::NaiveBestFitAllocator::AllocateImpl()
W0514 17:13:44.790421  4295 init.cc:217] @ 0x7f780b507d43 paddle::memory::allocation::AllocatorFacade::Alloc()
W0514 17:13:44.797467  4295 init.cc:217] @ 0x7f780b507fde paddle::memory::allocation::AllocatorFacade::AllocShared()
W0514 17:13:44.802448  4295 init.cc:217] @ 0x7f780b50775c paddle::memory::AllocShared()
W0514 17:13:44.807770  4295 init.cc:217] @ 0x7f780b4f3bb2 paddle::framework::Tensor::mutable_data()
W0514 17:13:44.817210  4295 init.cc:217] @ 0x7f780acbb400 paddle::operators::GenerateProposalLabelsKernel<>::Compute()
W0514 17:13:44.822504  4295 init.cc:217] @ 0x7f780acbca93 ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform8CPUPlaceELb0ELm0EJNS0_9operators28GenerateProposalLabelsKernelIfEENSA_IdEEEEclEPKcSF_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4
W0514 17:13:44.830654  4295 init.cc:217] @ 0x7f780b4637e6 paddle::framework::OperatorWithKernel::RunImpl()
W0514 17:13:44.842005  4295 init.cc:217] @ 0x7f780b463fb1 paddle::framework::OperatorWithKernel::RunImpl()
W0514 17:13:44.848700  4295 init.cc:217] @ 0x7f780b45d100 paddle::framework::OperatorBase::Run()
W0514 17:13:44.857833  4295 init.cc:217] @ 0x7f780b1e2376 paddle::framework::details::ComputationOpHandle::RunImpl()
W0514 17:13:44.866263  4295 init.cc:217] @ 0x7f780b199531 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0514 17:13:44.876678  4295 init.cc:217] @ 0x7f780b19829f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0514 17:13:44.879704  4295 init.cc:217] @ 0x7f780b198564 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0514 17:13:44.890637  4295 init.cc:217] @ 0x7f7808b6c983 std::_Function_handler<>::_M_invoke()
W0514 17:13:44.901640  4295 init.cc:217] @ 0x7f78088fac37 std::__future_base::_State_base::_M_do_set()
W0514 17:13:44.904276  4295 init.cc:217] @ 0x7f7854fdda99 __pthread_once_slow
W0514 17:13:44.906966  4295 init.cc:217] @ 0x7f780b193a52 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0514 17:13:44.917790  4295 init.cc:217] @ 0x7f78088fce64 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0514 17:13:44.919467  4295 init.cc:217] @ 0x7f7836393421 execute_native_thread_routine_compat
W0514 17:13:44.921916  4295 init.cc:217] @ 0x7f7854fd66ba start_thread
W0514 17:13:44.924396  4295 init.cc:217] @ 0x7f7854d0c41d clone
W0514 17:13:44.926872  4295 init.cc:217] @ 0x0 (unknown)
Segmentation fault (core dumped)
```
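The trace dies inside `paddle::memory::detail::BuddyAllocator` / `NaiveBestFitAllocator` while the CPU-side `GenerateProposalLabelsKernel` allocates a tensor. As a hedged first experiment (not a confirmed fix), PaddlePaddle 1.x exposes `FLAGS_allocator_strategy`, which switches to the alternative `auto_growth` allocator; the `tools/train.py` invocation below assumes the standard PaddleDetection repo layout and config path:

```shell
# Crash is inside BuddyAllocator/NaiveBestFitAllocator, so try the
# alternative allocator strategy before digging deeper:
export FLAGS_allocator_strategy=auto_growth

# The log above shows eager garbage collection already on; keep it:
export FLAGS_eager_delete_tensor_gb=0

echo "allocator_strategy=${FLAGS_allocator_strategy}"

# Then rerun training (path/config assumed from a standard
# PaddleDetection checkout):
# python tools/train.py -c configs/mask_rcnn_r50_fpn_1x.yml
```

If the segfault disappears under `auto_growth`, that points at an allocator bug rather than bad input data; if it persists, the annotations become the prime suspect.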
System information:
Linux DL 4.15.0-72-generic #81~16.04.1-Ubuntu SMP Tue Nov 26 16:34:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
GTX 1080 Ti × 2 (both GPUs are in use)
batch size = 2
Using the following (partial) configuration:
```yaml
architecture: MaskRCNN
use_gpu: true
max_iters: 20000
snapshot_iter: 200
log_smooth_window: 20
save_dir: output
pretrain_weights: https://paddlemodels.bj.bcebos.com/object_detection/mask_rcnn_r50_fpn_1x.tar
metric: COCO
weights: output/mask_rcnn_r50_fpn_1x_tian/model_final
num_classes: 7
finetune_exclude_pretrained_params: ['cls_score', 'bbox_pred', 'mask_fcn_logits']
```
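Since the crash happens inside `GenerateProposalLabelsKernel`, one plausible (unconfirmed) cause is a mismatch between `num_classes` and the category ids in the COCO annotation files: for RCNN-family configs in PaddleDetection, `num_classes` includes background, so `num_classes: 7` implies exactly 6 foreground categories with ids in `[1, 6]`. A quick sanity check, using a hypothetical in-memory stand-in for `instance_train.json` (in practice, `json.load` the real file):

```python
# Hypothetical minimal stand-in for the COCO dict loaded from
# custom_dataset/tiantian_v3_seg_coco/annotations/instance_train.json;
# only the "categories" section matters for this check.
coco = {
    "categories": [
        {"id": i, "name": f"class_{i}"} for i in range(1, 7)  # 6 foreground classes
    ],
    # "images" and "annotations" omitted for brevity
}

num_classes = 7  # value from the YAML above; RCNN configs count background

cat_ids = sorted(c["id"] for c in coco["categories"])

# num_classes must equal (#foreground categories) + 1 (background):
assert len(cat_ids) == num_classes - 1, (
    f"num_classes={num_classes} but annotations define "
    f"{len(cat_ids)} categories"
)

# Category ids outside [1, num_classes - 1] can index out of bounds
# inside the proposal-label kernel:
assert cat_ids[0] >= 1 and cat_ids[-1] <= num_classes - 1, cat_ids

print("annotation categories consistent with num_classes")
```

Running this against the real train and val JSON files rules category-id problems in or out in a few seconds.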
What could be causing this? I tried the suggestions in #632 and #619, but the error still occurs.