YOLOv3 + ResNet50: loss turns negative when resuming training from a checkpoint
Created by: XiminLin
Both the checkpoint and this run were trained in the AI Studio GPU environment, with paddlepaddle-gpu==1.8.0.post97 and Python 3.
Apart from changing the transforms in reader.yml, everything else was kept the same.
Here is my model config:
YOLOv3:
  backbone: ResNet
  yolo_head: YOLOv3Head

ResNet:
  norm_type: bn
  freeze_norm: false
  norm_decay: 0.
  variant: d
  depth: 50
  dcn_v2_stages: [5]
  feature_maps: [3, 4, 5]

YOLOv3Head:
  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
  anchors: [[19, 29], [28, 20], [25, 40],
            [31, 47], [36, 37], [41, 26],
            [47, 66], [48, 33], [67, 53]]
  norm_decay: 0.
  yolo_loss: YOLOv3Loss
  nms:
    background_label: -1
    keep_top_k: 100
    nms_threshold: 0.45
    nms_top_k: 1000
    normalized: false
    score_threshold: 0.01

YOLOv3Loss:
  # batch_size here is only used for the fine-grained loss, not for the
  # training batch size; the training batch size is set in
  # configs/yolov3_reader.yml TrainReader.batch_size, and batch_size
  # here should be set to the same value as TrainReader.batch_size
  batch_size: 8
  ignore_thresh: 0.7  # default is 0.7
  label_smooth: true
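Per the comment above, the batch_size in YOLOv3Loss has to mirror the reader's batch size when the fine-grained loss is used; a minimal sketch of the matching reader entry (file path taken from the comment, and assuming my reader really does use 8):

TrainReader:
  # must equal YOLOv3Loss.batch_size when use_fine_grained_loss is on
  batch_size: 8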
The checkpoint was trained without any data augmentation; its reader yml config was:
sample_transforms:
- !DecodeImage
  to_rgb: True
  with_mixup: False
- !RandomInterpImage
  target_size: 608
- !NormalizeBox {}
- !PadBox
  num_max_boxes: 50
- !BboxXYXY2XYWH {}
- !NormalizeImage
  mean: [0.8937, 0.9031, 0.8988]
  std: [0.19, 0.1995, 0.2022]
  is_scale: true
  is_channel_first: false
batch_transforms:
- !Permute
  to_bgr: false
  channel_first: true
# Gt2YoloTarget is only used when use_fine_grained_loss is set to true;
# this operator is deleted automatically when use_fine_grained_loss
# is false
- !Gt2YoloTarget
  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
  anchors: [[10, 13], [16, 30], [33, 23],
            [30, 61], [62, 45], [59, 119],
            [116, 90], [156, 198], [373, 326]]
  downsample_ratios: [32, 16, 8]
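As the comment above notes, Gt2YoloTarget only takes effect together with use_fine_grained_loss, which in the stock PaddleDetection yolov3 configs is a top-level key of the main yml (the value below is shown for illustration, not copied from my file):

# top-level key in the main config; Gt2YoloTarget and the fine-grained
# batch_size above only matter when this is true
use_fine_grained_loss: true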
I then resumed training with python train.py -r checkpoint_path, this time with augmentation enabled. The new reader config is:
sample_transforms:
- !DecodeImage
  to_rgb: True
  with_mixup: True
- !MixupImage
  alpha: 1.5
  beta: 1.5
- !ColorDistort {}
- !RandomExpand
  ratio: 3
  fill_value: [231.438, 236.2575, 235.416]
- !RandomCrop {}
- !RandomInterpImage
  target_size: 608
- !RandomFlipImage
  is_normalized: false
- !NormalizeBox {}
- !PadBox
  num_max_boxes: 50
- !BboxXYXY2XYWH {}
- !NormalizeImage
  mean: [0.8937, 0.9031, 0.8988]
  std: [0.19, 0.1995, 0.2022]
  is_scale: true
  is_channel_first: false
batch_transforms:
- !RandomShape
  sizes: [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
  random_inter: True
- !Permute
  to_bgr: false
  channel_first: true
# Gt2YoloTarget is only used when use_fine_grained_loss is set to true;
# this operator is deleted automatically when use_fine_grained_loss
# is false
- !Gt2YoloTarget
  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
  anchors: [[10, 13], [16, 30], [33, 23],
            [30, 61], [62, 45], [59, 119],
            [116, 90], [156, 198], [373, 326]]
  downsample_ratios: [32, 16, 8]
After resuming, the loss goes negative and the trained model gets worse...
2020-08-18 10:02:53,083-INFO: iter: 61100, lr: 0.000165, 'loss': '49.983803', time: 1.080, eta: 11:40:19
2020-08-18 10:03:32,268-INFO: iter: 61200, lr: 0.000164, 'loss': '-3.282897', time: 0.390, eta: 4:12:26
2020-08-18 10:04:29,751-INFO: iter: 61300, lr: 0.000163, 'loss': '-483.048737', time: 0.575, eta: 6:10:44
2020-08-18 10:05:29,555-INFO: iter: 61400, lr: 0.000162, 'loss': '-1974.388916', time: 0.597, eta: 6:24:04
2020-08-18 10:06:31,471-INFO: iter: 61500, lr: 0.000162, 'loss': '-4106.348633', time: 0.614, eta: 6:33:59
2020-08-18 10:07:28,676-INFO: iter: 61600, lr: 0.000161, 'loss': '-2150.716797', time: 0.574, eta: 6:07:15
2020-08-18 10:08:28,551-INFO: iter: 61700, lr: 0.000160, 'loss': '-24875.148438', time: 0.601, eta: 6:23:36
2020-08-18 10:09:24,851-INFO: iter: 61800, lr: 0.000159, 'loss': '-175619.359375', time: 0.567, eta: 6:01:05
2020-08-18 10:10:23,562-INFO: iter: 61900, lr: 0.000159, 'loss': '-316771.875000', time: 0.587, eta: 6:12:42
2020-08-18 10:11:22,151-INFO: iter: 62000, lr: 0.000158, 'loss': '-657982.000000', time: 0.584, eta: 6:09:47
Is this the loss overflowing, and what should I do about it?
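For what it's worth, even the largest value in the log (about -6.6e5 at iter 62000) is many orders of magnitude below the float32 limit, so to me this looks more like divergence than a true numeric overflow; a quick sanity check (plain numpy, used only to print the dtype limit):

import numpy as np

# float32 tops out around 3.4e38, so a loss of -657982 is an ordinary
# representable value rather than an overflow artifact
print(np.finfo(np.float32).max)  # 3.4028235e+38
print(np.float32(-657982.0))     # -657982.0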
Thanks!