ResNet / SENet training produces NaN loss
Created by: ellinyang
Version and environment info:
1) Docker image: paddle:1.4.1-gpu-cuda9.0-cudnn7
2) GPU: Tesla V100 (16 GB memory)
Training info:
1) Single machine, single GPU
2) batch_size = 16
3) Images resized to 448x448 (originals are 48x70)
4) Custom dataset: 8000 samples, 5 main classes
Problem: when training ResNet and SENet on my own dataset, training runs normally for a few batches, then the loss becomes NaN and accuracy drops to 0.
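A NaN that appears only after several dozen healthy batches often traces back to a bad sample or an out-of-range label rather than to the model itself. Before blaming the network, it may be worth validating each batch as it comes out of the reader. The helper below is a hypothetical sketch using NumPy; the function name `sanity_check_batch` and its usage are my own illustration, not part of the original training script:

```python
import numpy as np

def sanity_check_batch(images, labels, class_dim=6):
    """Flag the two most common data-side causes of NaN loss:
    non-finite pixel values and labels outside [0, class_dim)."""
    images = np.asarray(images, dtype=np.float32)
    labels = np.asarray(labels)
    problems = []
    if not np.isfinite(images).all():
        problems.append("non-finite pixel values")
    if labels.min() < 0 or labels.max() >= class_dim:
        problems.append("label out of range [0, %d)" % class_dim)
    return problems

# A label of 6 in a 6-class setup (class_dim=6, valid labels 0-5)
# is flagged before it can poison the softmax cross-entropy.
print(sanity_check_batch(np.zeros((2, 3, 448, 448)), [2, 6]))
```

Running such a check once over the whole dataset before training starts is cheap compared to a wasted GPU run.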
Details
------------- Configuration Arguments -------------
batch_size : 16
checkpoint : None
class_dim : 6
data_dir : dataset
enable_ce : False
fp16 : False
image_shape : 3,448,448
l2_decay : 0.00012
lr : 0.01
lr_strategy : cosine_decay
model : SE_ResNeXt50_32x4d
model_save_dir : train_SE_ResNeXt50_32x4d_0610/export_models
momentum_rate : 0.9
num_epochs : 80
pretrained_model : SE_ResNeXt50_32x4d_pretrained
scale_loss : 1.0
total_images : 8557
use_gpu : True
with_mem_opt : 1
----------------------------------------------------
Pass 0, trainbatch 0, loss 1.49129, acc1 0.00000, acc5 0.62500, lr 0.01000, time 6.36 sec
Pass 0, trainbatch 10, loss 1.78435, acc1 0.31250, acc5 0.75000, lr 0.01000, time 0.59 sec
Pass 0, trainbatch 20, loss 1.04638, acc1 0.50000, acc5 0.87500, lr 0.01000, time 0.61 sec
Pass 0, trainbatch 30, loss 0.79907, acc1 0.43750, acc5 0.87500, lr 0.01000, time 0.60 sec
Pass 0, trainbatch 40, loss 1.52700, acc1 0.43750, acc5 0.93750, lr 0.01000, time 0.63 sec
Pass 0, trainbatch 50, loss 1.00796, acc1 0.56250, acc5 0.87500, lr 0.01000, time 0.64 sec
Pass 0, trainbatch 60, loss 0.65415, acc1 0.50000, acc5 0.81250, lr 0.01000, time 0.60 sec
Pass 0, trainbatch 70, loss 0.67213, acc1 0.56250, acc5 0.87500, lr 0.01000, time 0.60 sec
Pass 0, trainbatch 80, loss 1.23143, acc1 0.25000, acc5 0.68750, lr 0.01000, time 0.60 sec
Pass 0, trainbatch 90, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.65 sec
Pass 0, trainbatch 100, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.64 sec
Pass 0, trainbatch 110, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.65 sec
Pass 0, trainbatch 120, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.63 sec
Pass 0, trainbatch 130, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.64 sec
Pass 0, trainbatch 140, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.65 sec
Pass 0, trainbatch 150, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.64 sec
Pass 0, trainbatch 160, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.63 sec
Pass 0, trainbatch 170, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.64 sec
Pass 0, trainbatch 180, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.63 sec
Pass 0, trainbatch 190, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.65 sec
Pass 0, trainbatch 200, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.71 sec
Pass 0, trainbatch 210, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.66 sec
Pass 0, trainbatch 220, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.63 sec
Pass 0, trainbatch 230, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.64 sec
Pass 0, trainbatch 240, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.66 sec
Pass 0, trainbatch 250, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.64 sec
Pass 0, trainbatch 260, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.63 sec
Pass 0, trainbatch 270, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.63 sec
Pass 0, trainbatch 280, loss nan, acc1 0.00000, acc5 0.00000, lr 0.01000, time 0.65 sec
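The log shows the loss diverging between batch 80 and batch 90 of the very first pass, which is a classic symptom of a learning rate that is too aggressive for fine-tuning a pretrained model on a small custom dataset. One common, framework-agnostic mitigation is a linear warmup before the configured decay schedule kicks in. The sketch below is illustrative only; `warmup_lr` and its constants are my own assumptions, not part of the original `cosine_decay` configuration:

```python
def warmup_lr(step, base_lr=0.01, warmup_steps=500):
    """Linear warmup: ramp the learning rate from near zero up to
    base_lr over the first warmup_steps batches, then hold it there
    (a decay schedule such as cosine_decay would take over after)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Batch 0 starts at a tiny rate instead of the full 0.01,
# which gives the pretrained weights time to adapt safely.
print(warmup_lr(0), warmup_lr(499), warmup_lr(1000))
```

If warmup alone does not help, halving `lr` (e.g. 0.005 or 0.001) or adding gradient clipping are the usual next steps to try.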