Loss intermittently becomes NaN during training, causing the training run to crash
Created by: xk2261
- Version and environment info: 1) PaddlePaddle version: 1.6.2 2) GPU: V100 3) System environment: Red Hat 4.8.5-28, Python 3.7.5
- Training info: 1) Single GPU 2) 16 GB GPU memory
During training, the loss intermittently becomes NaN; I have checked the training data and it looks fine. After printing the op outputs with fluid.layers.Print, the problem seems to occur at the elementwise_mul step: both operands being multiplied look fine, but after the multiplication and the subsequent reduce_sum the value becomes NaN. My current guess is that this is caused by a precision issue in Paddle. Could you please advise?
Below is the relevant code and the log output:
Code:

```python
ins_num = global_features[:, self.dim_num_vertices]
mask = fluid.layers.unsqueeze(
    fluid.layers.sequence_mask(ins_num, maxlen=self.max_vertices),
    axes=[-1],
    name="mask")
mask = fluid.layers.cast(mask, dtype="float32")
mask = fluid.layers.Print(mask)
edge_output = fluid.layers.Print(edge_output)
loss = self.focal_loss(edge_output, gt_adj_matrix)
loss = fluid.layers.Print(loss)
loss = fluid.layers.reduce_sum(loss, dim=-1, name="loss_reduce_mean")
loss = fluid.layers.Print(loss)
loss = fluid.layers.reduce_sum(
    fluid.layers.elementwise_mul(loss, fluid.layers.squeeze(mask, axes=[-1])),
    dim=-1, name="loss_reduce_sum")
loss = fluid.layers.Print(loss)
```
Log (only the mask, loss_reduce_mean, and loss_reduce_sum entries from the step where the problem occurred are shown):
```text
1580801253 The place is:CUDAPlace(0)
Tensor[cast_27.tmp_0]
shape: [1,500,1,]
dtype: f
data: 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1580801253 The place is:CUDAPlace(0)
Tensor[loss_reduce_mean.tmp_2]
shape: [1,500,]
dtype: f
data: 1.76454,13.4798,2.20408,3.50847,4.648,0.820761,1.42328,12.5072,5.4794,3.08568,0.807624,4.13455,8.763,9.18524,7.37464,2.88556,14.1432,1.34116,7.93732,1.59432,
1580801253 The place is:CUDAPlace(0)
Tensor[loss_reduce_sum.tmp_2]
shape: [1,]
dtype: f
data: nan,
```
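
One thing I notice in the log above is that only the first 20 of the 500 values of loss_reduce_mean are printed (the default summarize of fluid.layers.Print), so a NaN could also be hiding in the part of the tensor that is not shown. Below is a minimal sketch of the extra checks I could add before the final reduce_sum to narrow down where the NaN first appears; it assumes fluid.layers.has_nan / has_inf are available in 1.6.2 as documented, and reuses the loss and mask variables from the snippet above.

```python
# Sketch of extra NaN/Inf checks (assumes fluid.layers.has_nan / has_inf exist
# in 1.6.2; `loss` and `mask` are the variables from the code snippet above).

# Print all 500 per-instance loss values instead of the default first 20,
# so a NaN hidden past the 20th element would show up in the log.
loss = fluid.layers.Print(loss, summarize=-1, message="per_instance_loss")

# Compute the masked loss explicitly so it can be inspected before the sum.
masked_loss = fluid.layers.elementwise_mul(
    loss, fluid.layers.squeeze(mask, axes=[-1]))

# has_nan / has_inf return boolean scalars; cast to float32 before printing.
nan_flag = fluid.layers.cast(fluid.layers.has_nan(masked_loss), "float32")
inf_flag = fluid.layers.cast(fluid.layers.has_inf(masked_loss), "float32")
nan_flag = fluid.layers.Print(nan_flag, message="masked_loss_has_nan")
inf_flag = fluid.layers.Print(inf_flag, message="masked_loss_has_inf")

loss = fluid.layers.reduce_sum(masked_loss, dim=-1, name="loss_reduce_sum")
```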