YOLOv3 + ResNet50: loss turns negative when resuming training from a checkpoint
Created by: XiminLin
Both the checkpoint and this run were trained in the AI Studio GPU environment, with paddlepaddle-gpu==1.8.0.post97 and Python 3.
Apart from changing the transforms in reader.yml, everything else was kept the same.
Here is my model config:
YOLOv3:
  backbone: ResNet
  yolo_head: YOLOv3Head

ResNet:
  norm_type: bn
  freeze_norm: false
  norm_decay: 0.
  variant: d
  depth: 50
  dcn_v2_stages: [5]
  feature_maps: [3, 4, 5]

YOLOv3Head:
  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
  anchors: [[19, 29], [28, 20], [25, 40],
            [31, 47], [36, 37], [41, 26],
            [47, 66], [48, 33], [67, 53]]
  norm_decay: 0.
  yolo_loss: YOLOv3Loss
  nms:
    background_label: -1
    keep_top_k: 100
    nms_threshold: 0.45
    nms_top_k: 1000
    normalized: false
    score_threshold: 0.01

YOLOv3Loss:
  # batch_size here is only used for the fine-grained loss, not for the
  # training batch size; the training batch size is set in
  # configs/yolov3_reader.yml TrainReader.batch_size, and batch_size
  # here should be set to the same value as TrainReader.batch_size
  batch_size: 8
  ignore_thresh: 0.7  # default is 0.7
  label_smooth: true
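Per the comment above, the batch_size in YOLOv3Loss has to mirror the reader's batch size when the fine-grained loss is used; a minimal sketch of the matching reader entry (file path taken from the comment, and assuming my reader really does use 8):

TrainReader:
  # must equal YOLOv3Loss.batch_size when use_fine_grained_loss is on
  batch_size: 8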
The checkpoint was trained without any data augmentation; its reader yml config was:
sample_transforms:
- !DecodeImage
  to_rgb: True
  with_mixup: False
- !RandomInterpImage
  target_size: 608
- !NormalizeBox {}
- !PadBox
  num_max_boxes: 50
- !BboxXYXY2XYWH {}
- !NormalizeImage
  mean: [0.8937, 0.9031, 0.8988]
  std: [0.19, 0.1995, 0.2022]
  is_scale: true
  is_channel_first: false
batch_transforms:
- !Permute
  to_bgr: false
  channel_first: true
# Gt2YoloTarget is only used when use_fine_grained_loss is set to true;
# this operator is deleted automatically when use_fine_grained_loss
# is false
- !Gt2YoloTarget
  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
  anchors: [[10, 13], [16, 30], [33, 23],
            [30, 61], [62, 45], [59, 119],
            [116, 90], [156, 198], [373, 326]]
  downsample_ratios: [32, 16, 8]
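As the comment above notes, Gt2YoloTarget only takes effect together with use_fine_grained_loss, which in the stock PaddleDetection yolov3 configs is a top-level key of the main yml (the value below is shown for illustration, not copied from my file):

# top-level key in the main config; Gt2YoloTarget and the fine-grained
# batch_size above only matter when this is true
use_fine_grained_loss: true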
I then resumed training with python train.py -r checkpoint_path, this time with augmentation enabled. The new reader config is:
sample_transforms:
- !DecodeImage
  to_rgb: True
  with_mixup: True
- !MixupImage
  alpha: 1.5
  beta: 1.5
- !ColorDistort {}
- !RandomExpand
  ratio: 3
  fill_value: [231.438, 236.2575, 235.416]
- !RandomCrop {}
- !RandomInterpImage
  target_size: 608
- !RandomFlipImage
  is_normalized: false
- !NormalizeBox {}
- !PadBox
  num_max_boxes: 50
- !BboxXYXY2XYWH {}
- !NormalizeImage
  mean: [0.8937, 0.9031, 0.8988]
  std: [0.19, 0.1995, 0.2022]
  is_scale: true
  is_channel_first: false
batch_transforms:
- !RandomShape
  sizes: [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
  random_inter: True
- !Permute
  to_bgr: false
  channel_first: true
# Gt2YoloTarget is only used when use_fine_grained_loss is set to true;
# this operator is deleted automatically when use_fine_grained_loss
# is false
- !Gt2YoloTarget
  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
  anchors: [[10, 13], [16, 30], [33, 23],
            [30, 61], [62, 45], [59, 119],
            [116, 90], [156, 198], [373, 326]]
  downsample_ratios: [32, 16, 8]
After resuming, the loss goes negative and the trained model gets worse...
2020-08-18 10:02:53,083-INFO: iter: 61100, lr: 0.000165, 'loss': '49.983803', time: 1.080, eta: 11:40:19
2020-08-18 10:03:32,268-INFO: iter: 61200, lr: 0.000164, 'loss': '-3.282897', time: 0.390, eta: 4:12:26
2020-08-18 10:04:29,751-INFO: iter: 61300, lr: 0.000163, 'loss': '-483.048737', time: 0.575, eta: 6:10:44
2020-08-18 10:05:29,555-INFO: iter: 61400, lr: 0.000162, 'loss': '-1974.388916', time: 0.597, eta: 6:24:04
2020-08-18 10:06:31,471-INFO: iter: 61500, lr: 0.000162, 'loss': '-4106.348633', time: 0.614, eta: 6:33:59
2020-08-18 10:07:28,676-INFO: iter: 61600, lr: 0.000161, 'loss': '-2150.716797', time: 0.574, eta: 6:07:15
2020-08-18 10:08:28,551-INFO: iter: 61700, lr: 0.000160, 'loss': '-24875.148438', time: 0.601, eta: 6:23:36
2020-08-18 10:09:24,851-INFO: iter: 61800, lr: 0.000159, 'loss': '-175619.359375', time: 0.567, eta: 6:01:05
2020-08-18 10:10:23,562-INFO: iter: 61900, lr: 0.000159, 'loss': '-316771.875000', time: 0.587, eta: 6:12:42
2020-08-18 10:11:22,151-INFO: iter: 62000, lr: 0.000158, 'loss': '-657982.000000', time: 0.584, eta: 6:09:47
Is this the loss overflowing, and what should I do about it?
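For what it's worth, even the largest value in the log (about -6.6e5 at iter 62000) is many orders of magnitude below the float32 limit, so to me this looks more like divergence than a true numeric overflow; a quick sanity check (plain numpy, used only to print the dtype limit):

import numpy as np

# float32 tops out around 3.4e38, so a loss of -657982 is an ordinary
# representable value rather than an overflow artifact
print(np.finfo(np.float32).max)  # 3.4028235e+38
print(np.float32(-657982.0))     # -657982.0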
Thanks!