PaddleDetection training is interrupted in CPU mode (#2920) · Issue · PaddlePaddle / models

PaddleDetection training is interrupted in CPU mode

Created by: suyali

loading annotations into memory...
Done (t=12.66s)
creating index...
index created!
2019-07-25 10:18:14,038-INFO: 118287 samples in file dataset/coco/annotations/instances_train2017.json
2019-07-25 10:18:15,180-WARNING: You can try our memory optimize feature to save your memory usage:
     # create a build_strategy variable to set memory optimize option
     build_strategy = compiler.BuildStrategy()
     build_strategy.enable_inplace = True
     build_strategy.memory_optimize = True

     # pass the build_strategy to with_data_parallel API
     compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
         loss_name=loss.name, build_strategy=build_strategy)
  
 !!! Memory optimize is our experimental feature !!!
     some variables may be removed/reused internal to save memory usage, 
     in order to fetch the right value of the fetch_list, please set the 
     persistable property to true for each variable in fetch_list

     # Sample
     conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None) 
     # if you need to fetch conv1, then:
     conv1.persistable = True

2019-07-25 10:18:15,841-INFO: Load model and fuse batch norm from https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_cos_pretrained.tar...
2019-07-25 10:18:15,848-INFO: Found /home/suyali/.cache/paddle/weights/ResNet50_cos_pretrained
The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list, i.e export CPU_NUM=1. CPU_NUM indicates that how many CPUPlace are used in the current task.
!!! The default number of CPUPlaces is 1.

I0725 10:18:16.109668 2658 parallel_executor.cc:329] The number of CPUPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0725 10:18:16.152879 2658 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
2019-07-25 10:18:56,877-INFO: iter: 0, lr: 0.003333, b'loss_bbox': 0.050583, b'loss_rpn_cls': 0.693647, b'loss': 5.588, b'loss_rpn_bbox': 0.23873, b'loss_cls': 4.60504, time: 0.000
2019-07-25 10:19:37,233-INFO: iter: 1, lr: 0.003347, b'loss_bbox': 0.121054, b'loss_rpn_cls': 0.692905, b'loss': 5.256546, b'loss_rpn_bbox': 0.213924, b'loss_cls': 4.228662, time: 40.777
2019-07-25 10:20:17,638-INFO: iter: 2, lr: 0.003360, b'loss_bbox': 0.050583, b'loss_rpn_cls': 0.693647, b'loss': 4.925092, b'loss_rpn_bbox': 0.23873, b'loss_cls': 3.852284, time: 40.356
Floating point exception (core dumped)

The problem occurs with both mask_rcnn_r50_1x.yml and faster_rcnn_r50_1x.yml.
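
The log shows neither a Python traceback for the floating point exception nor an explicit CPU_NUM setting. A minimal sketch of how both could be addressed before re-running, using only the Python standard library: `install_crash_diagnostics` is a hypothetical helper name, not part of PaddleDetection, and it is assumed to be called at the very top of the training script, before Paddle is imported. It is a diagnostic aid, not a fix for the crash.

```python
import faulthandler
import os
import sys


def install_crash_diagnostics(cpu_num="1", stream=None):
    """Hypothetical helper: call once at the top of the training script."""
    # Keep a CPU_NUM already exported in the shell; otherwise set the
    # default value named in the log (1) explicitly.
    os.environ.setdefault("CPU_NUM", cpu_num)

    # faulthandler dumps the Python traceback when the process receives
    # SIGFPE, SIGSEGV, SIGABRT, SIGBUS or SIGILL, which shows which Python
    # call was active when the native code crashed.
    faulthandler.enable(file=stream if stream is not None else sys.stderr,
                        all_threads=True)
    return os.environ["CPU_NUM"]


# Usage (at the top of the training script):
#     install_crash_diagnostics()
```

The same traceback dump can be enabled without editing any code by running the script with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable.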

PaddlePaddle / models: synced successfully about 1 year ago
