Core dump at the very end of training (at the last log_smooth_window)
Created by: nihuizhidao
A core dump with no detailed error message occurs at the last log_smooth_window while fine-tuning YOLOv3 on a custom dataset. Part of my config yaml file:
```yaml
architecture: YOLOv3
use_gpu: true
max_iters: 25000
log_smooth_window: 20
save_dir: output
snapshot_iter: 200
metric: COCO
pretrain_weights: https://paddlemodels.bj.bcebos.com/object_detection/yolov3_darknet.tar
weights: output/yolov3_darknet_schwarz_v3/model_final
num_classes: 8
finetune_exclude_pretrained_params: ['yolo_output']
use_fine_grained_loss: false

LearningRate:
  base_lr: 0.00005
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones:
    - 18750
    - 22500
  - !LinearWarmup
    start_factor: 0.
    steps: 200
```
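For reference, here is a minimal sketch of how I read the LearningRate section above (a PiecewiseDecay schedule wrapped by LinearWarmup); the actual PaddlePaddle scheduler implementation may differ in details, so treat this as my interpretation of the config rather than the library's code:

```python
# Sketch of the configured schedule: LinearWarmup (start_factor=0., steps=200)
# on top of PiecewiseDecay (gamma=0.1, milestones=[18750, 22500], base_lr=5e-5).
def lr_at(step, base_lr=5e-5, gamma=0.1, milestones=(18750, 22500),
          warmup_steps=200, start_factor=0.0):
    # piecewise decay: multiply by gamma for every milestone already passed
    decayed = base_lr * gamma ** sum(step >= m for m in milestones)
    if step < warmup_steps:
        # linear warmup from start_factor towards the (decayed) target value
        frac = step / warmup_steps
        return decayed * (start_factor + (1.0 - start_factor) * frac)
    return decayed

for it in (100, 1000, 19000, 24900):
    print(it, lr_at(it))
# 100   -> 2.5e-05  (inside warmup)
# 1000  -> 5e-05
# 19000 -> 5e-06    (after the first milestone)
# 24900 -> 5e-07    (after the second milestone)
```

Under this reading, the lr after iter 22500 is 5e-7, which is below the six-decimal precision of the training log; presumably that is why the log below prints lr: 0.000000.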
Training batch_size is 10, running on 2 GPUs.
The error message and its context:
```
2020-05-09 07:48:32,532-INFO: iter: 24820, lr: 0.000000, 'loss': '9.387529', time: 2.270, eta: 0:06:48
2020-05-09 07:49:11,330-INFO: iter: 24840, lr: 0.000000, 'loss': '9.227110', time: 1.970, eta: 0:05:15
2020-05-09 07:49:46,325-INFO: iter: 24860, lr: 0.000000, 'loss': '9.516720', time: 1.718, eta: 0:04:00
2020-05-09 07:50:24,220-INFO: iter: 24880, lr: 0.000000, 'loss': '9.507334', time: 1.961, eta: 0:03:55
2020-05-09 07:50:59,543-INFO: iter: 24900, lr: 0.000000, 'loss': '9.366526', time: 1.704, eta: 0:02:50
2020-05-09 07:51:35,928-INFO: iter: 24920, lr: 0.000000, 'loss': '9.407713', time: 1.881, eta: 0:02:30
2020-05-09 07:52:12,927-INFO: iter: 24940, lr: 0.000000, 'loss': '9.048078', time: 1.792, eta: 0:01:47
2020-05-09 07:52:48,232-INFO: iter: 24960, lr: 0.000000, 'loss': '9.399294', time: 1.822, eta: 0:01:12
2020-05-09 07:53:26,531-INFO: iter: 24980, lr: 0.000000, 'loss': '8.900979', time: 1.847, eta: 0:00:36
2020-05-09 07:53:59,817-INFO: Save model to output/yolov3_tt_v1/model_final.
2020-05-09 07:54:07,028-INFO: Test iter 0
2020-05-09 07:54:08,037-INFO: Test finish iter 7
2020-05-09 07:54:08,037-INFO: Total number of images: 65, inference time: 51.95583207253508 fps.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2020-05-09 07:54:08,181-INFO: Start evaluate...
Loading and preparing results...
DONE (t=0.03s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
DONE (t=0.77s).
Accumulating evaluation results...
DONE (t=0.28s).
Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.295
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.509
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.330
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.402
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.671
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.279
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.293
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.383
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.383
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.443
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.689
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.376
2020-05-09 07:54:09,291-INFO: Best test box ap: 0.31016948065022704, in iter: 9200
terminate called without an active exception
W0509 07:54:10.161070 22881 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0509 07:54:10.161113 22881 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0509 07:54:10.161131 22881 init.cc:214] The detail failure signal is:
W0509 07:54:10.161152 22881 init.cc:217] *** Aborted at 1588982050 (unix time) try "date -d @1588982050" if you are using GNU date ***
W0509 07:54:10.169255 22881 init.cc:217] PC: @ 0x0 (unknown)
W0509 07:54:10.169443 22881 init.cc:217] *** SIGABRT (@0x3e8000033bf) received by PID 13247 (TID 0x7f9439fff700) from PID 13247; stack trace: ***
W0509 07:54:10.176609 22881 init.cc:217] @ 0x7f9afac08390 (unknown)
W0509 07:54:10.181762 22881 init.cc:217] @ 0x7f9afa862428 gsignal
W0509 07:54:10.186179 22881 init.cc:217] @ 0x7f9afa86402a abort
W0509 07:54:10.189432 22881 init.cc:217] @ 0x7f9aeafc584a __gnu_cxx::__verbose_terminate_handler()
W0509 07:54:10.192270 22881 init.cc:217] @ 0x7f9aeafc3f47 __cxxabiv1::__terminate()
W0509 07:54:10.195497 22881 init.cc:217] @ 0x7f9aeafc3f7d std::terminate()
W0509 07:54:10.198510 22881 init.cc:217] @ 0x7f9aeafc3c5a __gxx_personality_v0
W0509 07:54:10.202011 22881 init.cc:217] @ 0x7f9af9a02b97 _Unwind_ForcedUnwind_Phase2
W0509 07:54:10.205464 22881 init.cc:217] @ 0x7f9af9a02e7d _Unwind_ForcedUnwind
W0509 07:54:10.208946 22881 init.cc:217] @ 0x7f9afac07070 __GI___pthread_unwind
W0509 07:54:10.212412 22881 init.cc:217] @ 0x7f9afabff845 __pthread_exit
W0509 07:54:10.213181 22881 init.cc:217] @ 0x561430665059 PyThread_exit_thread
W0509 07:54:10.213379 22881 init.cc:217] @ 0x5614304eac10 PyEval_RestoreThread.cold.799
W0509 07:54:10.215291 22881 init.cc:217] @ 0x7f9ae76d4b27 pyopencv_cv_imdecode()
W0509 07:54:10.216156 22881 init.cc:217] @ 0x5614305ebab4 _PyMethodDef_RawFastCallKeywords
W0509 07:54:10.216982 22881 init.cc:217] @ 0x5614305ebbd1 _PyCFunction_FastCallKeywords
W0509 07:54:10.217809 22881 init.cc:217] @ 0x56143065257b _PyEval_EvalFrameDefault
W0509 07:54:10.218571 22881 init.cc:217] @ 0x561430597389 _PyEval_EvalCodeWithName
W0509 07:54:10.219349 22881 init.cc:217] @ 0x5614305984c5 _PyFunction_FastCallDict
W0509 07:54:10.220100 22881 init.cc:217] @ 0x5614305b7a73 _PyObject_Call_Prepend
W0509 07:54:10.220489 22881 init.cc:217] @ 0x5614305ff27a slot_tp_call
W0509 07:54:10.221262 22881 init.cc:217] @ 0x5614306002db _PyObject_FastCallKeywords
W0509 07:54:10.222090 22881 init.cc:217] @ 0x561430652146 _PyEval_EvalFrameDefault
W0509 07:54:10.222860 22881 init.cc:217] @ 0x5614305983fb _PyFunction_FastCallDict
W0509 07:54:10.223621 22881 init.cc:217] @ 0x5614305b7a73 _PyObject_Call_Prepend
W0509 07:54:10.224014 22881 init.cc:217] @ 0x5614305ff27a slot_tp_call
W0509 07:54:10.224784 22881 init.cc:217] @ 0x5614306002db _PyObject_FastCallKeywords
W0509 07:54:10.225611 22881 init.cc:217] @ 0x561430652a39 _PyEval_EvalFrameDefault
W0509 07:54:10.226372 22881 init.cc:217] @ 0x561430597389 _PyEval_EvalCodeWithName
W0509 07:54:10.227146 22881 init.cc:217] @ 0x5614305984c5 _PyFunction_FastCallDict
terminate called recursively
W0509 07:54:10.227896 22881 init.cc:217] @ 0x5614305b7a73 _PyObject_Call_Prepend
W0509 07:54:10.228734 22881 init.cc:217] @ 0x5614305a9fde PyObject_Call
Aborted (core dumped)
```
My environment: paddlepaddle-gpu 1.7.1.post107, 2x GTX 1080Ti.
This is not the first time a core dump has happened in this training task. I suspect it has something to do with max_iters, because when I reduce or change max_iters it does not happen again, but I don't know why, or where I went wrong. Thanks!
This has happened quite a few times, and the project is urgent right now; every run core dumps during the last ~20 iterations, which really hurts my progress... A related question: if I resume training from a checkpoint, does training still strictly follow the config file? For example, will the learning rate of the subsequent iterations continue from where it left off? The lr here depends on the iteration number, so when resuming from a checkpoint, how does Paddle handle this: does it strictly follow the config yaml, or does it treat it as a new training run with the lr starting from warmup again?
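To make the question concrete, here is a tiny sketch of the two behaviours I can imagine when resuming from the checkpoint saved at iter 9200; the resume flag shown in the comment is my assumption about tools/train.py, so please correct me if that is not how resuming is supposed to be done:

```python
# Hypothetical resume command (flag name assumed, may differ in the actual tools/train.py):
#   python -u tools/train.py -c my_yolov3_config.yml -r output/yolov3_tt_v1/9200
#
# Two possible readings of what the lr scheduler does after restoring that checkpoint:
base_lr, start_factor, warmup_steps = 5e-5, 0.0, 200

# (a) the schedule is driven by the restored global step (9200): warmup is long over
#     and the first milestone (18750) has not been reached, so lr stays at base_lr
lr_if_continued = base_lr                                                        # 5e-05

# (b) training is treated as a brand-new run: warmup restarts from step 0
lr_if_restarted = base_lr * (start_factor + (1 - start_factor) * 0 / warmup_steps)  # 0.0

print(lr_if_continued, lr_if_restarted)
```

Which of these matches what Paddle actually does when restoring from a checkpoint?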