PaddlePaddle / PaddleDetection
Opened May 09, 2020 by saxon_zh (Guest)

Core dump at the end of training, in the last log_smooth_window

Created by: nihuizhidao

A core dump without a detailed error message occurred in the last log_smooth_window while fine-tuning YOLOv3 on a custom dataset. Part of my config YAML file:

    architecture: YOLOv3
    use_gpu: true
    max_iters: 25000
    log_smooth_window: 20
    save_dir: output
    snapshot_iter: 200
    metric: COCO
    pretrain_weights: https://paddlemodels.bj.bcebos.com/object_detection/yolov3_darknet.tar
    weights: output/yolov3_darknet_schwarz_v3/model_final
    num_classes: 8
    finetune_exclude_pretrained_params: ['yolo_output']
    use_fine_grained_loss: false

    LearningRate:
      base_lr: 0.00005
      schedulers:
      - !PiecewiseDecay
        gamma: 0.1
        milestones:
        - 18750
        - 22500
      - !LinearWarmup
        start_factor: 0.
        steps: 200
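To make the schedule in this config concrete, here is a minimal Python sketch of a linear warmup followed by piecewise decay, using the numbers above. It is an illustration under stated assumptions, not PaddleDetection's own scheduler code: it assumes warmup simply governs the first 200 iterations and that each milestone takes effect at the iteration equal to it. The function names `piecewise_lr` and `lr_at` are hypothetical.

```python
from bisect import bisect_right

# Values copied from the config above.
BASE_LR = 0.00005
GAMMA = 0.1
MILESTONES = [18750, 22500]
WARMUP_STEPS = 200
START_FACTOR = 0.0


def piecewise_lr(step):
    """Piecewise-constant decay: one factor of GAMMA per milestone reached."""
    return BASE_LR * GAMMA ** bisect_right(MILESTONES, step)


def lr_at(step):
    """Linear warmup for the first WARMUP_STEPS iterations, then piecewise decay.

    Assumption: warmup ramps from BASE_LR * START_FACTOR up to BASE_LR.
    """
    if step < WARMUP_STEPS:
        start_lr = BASE_LR * START_FACTOR
        return start_lr + (BASE_LR - start_lr) * step / WARMUP_STEPS
    return piecewise_lr(step)


for step in (0, 100, 200, 18750, 22500, 24980):
    print(step, lr_at(step))
```

Under these assumptions the rate after iteration 22500 is about 5e-7, small enough that a log printing six decimal places may show it as 0.000000, as in the log entries further down.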

Training batch_size is 10, running on 2 GPUs.

The error message and its context:

    2020-05-09 07:48:32,532-INFO: iter: 24820, lr: 0.000000, 'loss': '9.387529', time: 2.270, eta: 0:06:48
    2020-05-09 07:49:11,330-INFO: iter: 24840, lr: 0.000000, 'loss': '9.227110', time: 1.970, eta: 0:05:15
    2020-05-09 07:49:46,325-INFO: iter: 24860, lr: 0.000000, 'loss': '9.516720', time: 1.718, eta: 0:04:00
    2020-05-09 07:50:24,220-INFO: iter: 24880, lr: 0.000000, 'loss': '9.507334', time: 1.961, eta: 0:03:55
    2020-05-09 07:50:59,543-INFO: iter: 24900, lr: 0.000000, 'loss': '9.366526', time: 1.704, eta: 0:02:50
    2020-05-09 07:51:35,928-INFO: iter: 24920, lr: 0.000000, 'loss': '9.407713', time: 1.881, eta: 0:02:30
    2020-05-09 07:52:12,927-INFO: iter: 24940, lr: 0.000000, 'loss': '9.048078', time: 1.792, eta: 0:01:47
    2020-05-09 07:52:48,232-INFO: iter: 24960, lr: 0.000000, 'loss': '9.399294', time: 1.822, eta: 0:01:12
    2020-05-09 07:53:26,531-INFO: iter: 24980, lr: 0.000000, 'loss': '8.900979', time: 1.847, eta: 0:00:36
    2020-05-09 07:53:59,817-INFO: Save model to output/yolov3_tt_v1/model_final.
    2020-05-09 07:54:07,028-INFO: Test iter 0
    2020-05-09 07:54:08,037-INFO: Test finish iter 7
    2020-05-09 07:54:08,037-INFO: Total number of images: 65, inference time: 51.95583207253508 fps.
    loading annotations into memory...
    Done (t=0.00s)
    creating index...
    index created!
    2020-05-09 07:54:08,181-INFO: Start evaluate...
    Loading and preparing results...
    DONE (t=0.03s)
    creating index...
    index created!
    Running per image evaluation...
    Evaluate annotation type bbox
    DONE (t=0.77s).
    Accumulating evaluation results...
    DONE (t=0.28s).
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.295
     Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.509
     Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.330
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.402
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.671
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.279
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.293
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.383
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.383
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.443
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.689
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.376
    2020-05-09 07:54:09,291-INFO: Best test box ap: 0.31016948065022704, in iter: 9200
    terminate called without an active exception
    W0509 07:54:10.161070 22881 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
    W0509 07:54:10.161113 22881 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
    W0509 07:54:10.161131 22881 init.cc:214] The detail failure signal is:

    W0509 07:54:10.161152 22881 init.cc:217] *** Aborted at 1588982050 (unix time) try "date -d @1588982050" if you are using GNU date ***
    W0509 07:54:10.169255 22881 init.cc:217] PC: @ 0x0 (unknown)
    W0509 07:54:10.169443 22881 init.cc:217] *** SIGABRT (@0x3e8000033bf) received by PID 13247 (TID 0x7f9439fff700) from PID 13247; stack trace: ***
    W0509 07:54:10.176609 22881 init.cc:217] @ 0x7f9afac08390 (unknown)
    W0509 07:54:10.181762 22881 init.cc:217] @ 0x7f9afa862428 gsignal
    W0509 07:54:10.186179 22881 init.cc:217] @ 0x7f9afa86402a abort
    W0509 07:54:10.189432 22881 init.cc:217] @ 0x7f9aeafc584a __gnu_cxx::__verbose_terminate_handler()
    W0509 07:54:10.192270 22881 init.cc:217] @ 0x7f9aeafc3f47 __cxxabiv1::__terminate()
    W0509 07:54:10.195497 22881 init.cc:217] @ 0x7f9aeafc3f7d std::terminate()
    W0509 07:54:10.198510 22881 init.cc:217] @ 0x7f9aeafc3c5a __gxx_personality_v0
    W0509 07:54:10.202011 22881 init.cc:217] @ 0x7f9af9a02b97 _Unwind_ForcedUnwind_Phase2
    W0509 07:54:10.205464 22881 init.cc:217] @ 0x7f9af9a02e7d _Unwind_ForcedUnwind
    W0509 07:54:10.208946 22881 init.cc:217] @ 0x7f9afac07070 __GI___pthread_unwind
    W0509 07:54:10.212412 22881 init.cc:217] @ 0x7f9afabff845 __pthread_exit
    W0509 07:54:10.213181 22881 init.cc:217] @ 0x561430665059 PyThread_exit_thread
    W0509 07:54:10.213379 22881 init.cc:217] @ 0x5614304eac10 PyEval_RestoreThread.cold.799
    W0509 07:54:10.215291 22881 init.cc:217] @ 0x7f9ae76d4b27 pyopencv_cv_imdecode()
    W0509 07:54:10.216156 22881 init.cc:217] @ 0x5614305ebab4 _PyMethodDef_RawFastCallKeywords
    W0509 07:54:10.216982 22881 init.cc:217] @ 0x5614305ebbd1 _PyCFunction_FastCallKeywords
    W0509 07:54:10.217809 22881 init.cc:217] @ 0x56143065257b _PyEval_EvalFrameDefault
    W0509 07:54:10.218571 22881 init.cc:217] @ 0x561430597389 _PyEval_EvalCodeWithName
    W0509 07:54:10.219349 22881 init.cc:217] @ 0x5614305984c5 _PyFunction_FastCallDict
    W0509 07:54:10.220100 22881 init.cc:217] @ 0x5614305b7a73 _PyObject_Call_Prepend
    W0509 07:54:10.220489 22881 init.cc:217] @ 0x5614305ff27a slot_tp_call
    W0509 07:54:10.221262 22881 init.cc:217] @ 0x5614306002db _PyObject_FastCallKeywords
    W0509 07:54:10.222090 22881 init.cc:217] @ 0x561430652146 _PyEval_EvalFrameDefault
    W0509 07:54:10.222860 22881 init.cc:217] @ 0x5614305983fb _PyFunction_FastCallDict
    W0509 07:54:10.223621 22881 init.cc:217] @ 0x5614305b7a73 _PyObject_Call_Prepend
    W0509 07:54:10.224014 22881 init.cc:217] @ 0x5614305ff27a slot_tp_call
    W0509 07:54:10.224784 22881 init.cc:217] @ 0x5614306002db _PyObject_FastCallKeywords
    W0509 07:54:10.225611 22881 init.cc:217] @ 0x561430652a39 _PyEval_EvalFrameDefault
    W0509 07:54:10.226372 22881 init.cc:217] @ 0x561430597389 _PyEval_EvalCodeWithName
    W0509 07:54:10.227146 22881 init.cc:217] @ 0x5614305984c5 _PyFunction_FastCallDict
    terminate called recursively
    W0509 07:54:10.227896 22881 init.cc:217] @ 0x5614305b7a73 _PyObject_Call_Prepend
    W0509 07:54:10.228734 22881 init.cc:217] @ 0x5614305a9fde PyObject_Call
    Aborted (core dumped)
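The frames pyopencv_cv_imdecode, PyEval_RestoreThread, and PyThread_exit_thread in the trace suggest, though they do not prove, that a background thread was still decoding images when the Python interpreter began shutting down after the final evaluation. One general way to avoid that situation is to stop and join worker threads before the main program returns. The sketch below shows the pattern with the standard library only; `decode`, `worker`, and `run_readers` are hypothetical names and are not part of PaddleDetection or OpenCV.

```python
import queue
import threading


def decode(item):
    """Hypothetical stand-in for an image decode call such as cv2.imdecode."""
    return item * 2


def worker(inputs, outputs, stop_event):
    """Pull items until asked to stop, so the thread ends on its own terms."""
    while not stop_event.is_set():
        try:
            item = inputs.get(timeout=0.05)
        except queue.Empty:
            continue
        outputs.put(decode(item))


def run_readers(items, num_workers=2):
    """Decode items on worker threads, then shut the workers down cleanly."""
    inputs, outputs = queue.Queue(), queue.Queue()
    stop_event = threading.Event()
    threads = [
        threading.Thread(target=worker, args=(inputs, outputs, stop_event))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        inputs.put(item)
    results = [outputs.get() for _ in items]
    # Signal the workers and wait for them before returning, so no thread
    # is still inside decode() when the interpreter starts to shut down.
    stop_event.set()
    for t in threads:
        t.join()
    return sorted(results), [t.is_alive() for t in threads]


results, alive = run_readers([1, 2, 3])
print(results, alive)
```

Whether the reader threads in this training run can be stopped this way depends on the data-loading code in use, which the thread does not show.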

My environment: paddlepaddle-gpu 1.7.1.post107, 2x GTX 1080Ti

This is not the first time a core dump has happened in this training task. I suspect it is related to max_iters, because when I reduce or otherwise change max_iters it does not happen again, but I don't know why or what I did wrong. Thanks!

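This has happened quite a few times. My project is on a tight schedule right now, and every time the core dump hits in the last 20 iterations, which really hurts progress. A related question: if I resume training from a checkpoint, does training still strictly follow the config file? For example, does the learning rate for the following iterations continue from where it left off? The lr used here depends on the iteration. When resuming from checkpoints, how does Paddle handle this: does it strictly follow the config YAML, or does it treat the run as a new training job with the lr starting again from warmup?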
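Because the learning rate is a function of the iteration index, whether it continues or restarts after a resume comes down to whether the iteration counter is restored together with the weights. This thread does not say how PaddleDetection handles that, so the following is only a toy sketch of the two outcomes. The `train` loop, the `warmup_lr` function, and the JSON checkpoint format are all hypothetical and invented for illustration.

```python
import json
import os
import tempfile


def warmup_lr(it, base_lr=5e-5, warmup_steps=200):
    """Toy schedule: linear warmup from 0 to base_lr, then constant."""
    return base_lr * min(1.0, it / warmup_steps)


def train(max_iters, lr_fn, ckpt_path, snapshot_iter, resume=False):
    """Toy loop: returns the (iteration, lr) pairs it actually ran."""
    start_iter = 0
    if resume and os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start_iter = json.load(f)["iter"]  # restore the counter
    history = []
    for it in range(start_iter, max_iters):
        history.append((it, lr_fn(it)))
        if (it + 1) % snapshot_iter == 0:
            # Save the number of completed iterations with the snapshot.
            with open(ckpt_path, "w") as f:
                json.dump({"iter": it + 1}, f)
    return history


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "ckpt.json")
    train(400, warmup_lr, ckpt, snapshot_iter=200)            # first run stops at 400
    resumed = train(600, warmup_lr, ckpt, 200, resume=True)   # counter restored
    fresh = train(600, warmup_lr, ckpt, 200, resume=False)    # counter reset to 0

print(resumed[0], fresh[0])
```

In this sketch the resumed run starts at iteration 400 with the full base rate, while the fresh run starts at iteration 0 and goes through warmup again.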

Reference: paddlepaddle/PaddleDetection#619