PaddleDetection training aborts in CPU mode
Created by: suyali
loading annotations into memory...
Done (t=12.66s)
creating index...
index created!
2019-07-25 10:18:14,038-INFO: 118287 samples in file dataset/coco/annotations/instances_train2017.json
2019-07-25 10:18:15,180-WARNING: You can try our memory optimize feature to save your memory usage:
# create a build_strategy variable to set memory optimize option
build_strategy = compiler.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
# pass the build_strategy to with_data_parallel API
compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
loss_name=loss.name, build_strategy=build_strategy)
!!! Memory optimize is our experimental feature !!!
some variables may be removed/reused internal to save memory usage,
in order to fetch the right value of the fetch_list, please set the
persistable property to true for each variable in fetch_list
# Sample
conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None)
# if you need to fetch conv1, then:
conv1.persistable = True
2019-07-25 10:18:15,841-INFO: Load model and fuse batch norm from https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_cos_pretrained.tar...
2019-07-25 10:18:15,848-INFO: Found /home/suyali/.cache/paddle/weights/ResNet50_cos_pretrained
The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list, i.e export CPU_NUM=1. CPU_NUM indicates that how many CPUPlace are used in the current task.
!!! The default number of CPUPlaces is 1.
I0725 10:18:16.109668 2658 parallel_executor.cc:329] The number of CPUPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0725 10:18:16.152879 2658 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
2019-07-25 10:18:56,877-INFO: iter: 0, lr: 0.003333, b'loss_bbox': 0.050583, b'loss_rpn_cls': 0.693647, b'loss': 5.588, b'loss_rpn_bbox': 0.23873, b'loss_cls': 4.60504, time: 0.000
2019-07-25 10:19:37,233-INFO: iter: 1, lr: 0.003347, b'loss_bbox': 0.121054, b'loss_rpn_cls': 0.692905, b'loss': 5.256546, b'loss_rpn_bbox': 0.213924, b'loss_cls': 4.228662, time: 40.777
2019-07-25 10:20:17,638-INFO: iter: 2, lr: 0.003360, b'loss_bbox': 0.050583, b'loss_rpn_cls': 0.693647, b'loss': 4.925092, b'loss_rpn_bbox': 0.23873, b'loss_cls': 3.852284, time: 40.356
Floating point exception (core dumped)
The same crash occurs with both mask_rcnn_r50_1x.yml and faster_rcnn_r50_1x.yml.
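
For anyone trying to reproduce this, here is a minimal sketch of a wrapper that might help narrow down where the floating point exception comes from. It sets CPU_NUM explicitly (the log warns that it is not specified) and enables Python's built-in faulthandler through the PYTHONFAULTHANDLER environment variable, so a Python-level traceback should be printed when the process receives SIGFPE. The helper names run_with_fault_trace and describe_exit are hypothetical, and the training command shown in the comment is an assumption, since the exact command used is not part of this report.

```python
import os
import signal
import subprocess
import sys


def run_with_fault_trace(cmd, cpu_num=1):
    """Run cmd in a child process with CPU_NUM set and faulthandler enabled.

    Returns (returncode, stdout, stderr). Output is captured rather than
    streamed, so nothing is shown until the child process has finished.
    """
    env = dict(os.environ)
    # The log reports that CPU_NUM is unset; set it explicitly.
    env["CPU_NUM"] = str(cpu_num)
    # Makes a child Python interpreter call faulthandler.enable() at startup,
    # which dumps the Python traceback on SIGFPE, SIGSEGV and similar signals.
    env["PYTHONFAULTHANDLER"] = "1"
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


def describe_exit(returncode):
    """Turn a subprocess return code into a short description.

    On POSIX, a negative return code -N means the child was killed by signal N.
    """
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)


# Assumed invocation (the real command is not shown in this report):
# rc, out, err = run_with_fault_trace(
#     [sys.executable, "tools/train.py", "-c", "configs/faster_rcnn_r50_1x.yml"])
# print(describe_exit(rc))
# print(err[-2000:])  # tail of stderr, where the faulthandler traceback would be
```

If the crash reproduces, the tail of stderr should show which Python call was active when the signal arrived, which may help tell whether the fault is in data loading or inside the executor run.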