RuntimeWarning: Invalid value encountered in median
Created by: ellinyang
-
Version and environment information: 1) PaddlePaddle version: paddle1.6.2-cuda9-cudnn7 2) GPU: P4 (8G) 3) System environment: Ubuntu 16.04, Python 2.7
-
Training information: 1) Distributed training (fleet.dgc) on 2 machines with 2 GPUs 2) GPU memory: P4 (8G) 3) Learning rate: 0.00125 per GPU
-
Problem description: I am training a RetinaNet model. Training ran normally for a few dozen batches, then suddenly printed the warning "RuntimeWarning: Invalid value encountered in median". From that point on every loss value is nan. The log is shown below.

2020-04-08T14:58:09.615807324Z Epoch 0, batch 40, lr: 0.003000, 'loss_bbox': '0.414711', 'loss': '2.200727', 'loss_cls': '1.781398', time: 1.030, eta: 1 day, 0:29:47
2020-04-08T14:58:20.086420166Z Epoch 0, batch 50, lr: 0.003333, 'loss_bbox': '0.393112', 'loss': '2.016517', 'loss_cls': '1.605640', time: 1.047, eta: 1 day, 0:54:19
2020-04-08T14:58:30.249262274Z Epoch 0, batch 60, lr: 0.003667, 'loss_bbox': '0.366174', 'loss': '1.618668', 'loss_cls': '1.262222', time: 1.016, eta: 1 day, 0:10:14
2020-04-08T14:58:36.399604018Z /usr/local/lib/python2.7/dist-packages/numpy/lib/function_base.py:3405: RuntimeWarning: Invalid value encountered in median
2020-04-08T14:58:36.399635198Z r = func(a, **kwargs)
2020-04-08T14:58:40.284843706Z Epoch 0, batch 70, lr: 0.004000, 'loss_bbox': 'nan', 'loss': 'nan', 'loss_cls': 'nan', time: 1.004, eta: 23:51:54
2020-04-08T14:58:50.384185447Z Epoch 0, batch 80, lr: 0.004333, 'loss_bbox': 'nan', 'loss': 'nan', 'loss_cls': 'nan', time: 1.010, eta: 1 day, 0:00:51
2020-04-08T14:59:00.326033855Z Epoch 0, batch 90, lr: 0.004667, 'loss_bbox': 'nan', 'loss': 'nan', 'loss_cls': 'nan', time: 0.994, eta: 23:38:11
2020-04-08T14:59:10.442273324Z Epoch 0, batch 100, lr: 0.005000, 'loss_bbox': 'nan', 'loss': 'nan', 'loss_cls': 'nan', time: 1.012, eta: 1 day, 0:02:54
2020-04-08T14:59:21.057705036Z Epoch 0, batch 110, lr: 0.005000, 'loss_bbox': 'nan', 'loss': 'nan', 'loss_cls': 'nan', time: 1.062, eta: 1 day, 1:13:56
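
For anyone trying to narrow this down: np.median returns NaN whenever its input contains a NaN (and some NumPy versions also print this warning), so the warning is most likely a symptom of a loss that had already become NaN between batch 60 and batch 70, not the cause. The sketch below is not part of PaddlePaddle or the training script; `check_losses` is a hypothetical helper, and the loss values are copied from the log above. It shows one way to stop at the first non-finite loss so the offending batch can be inspected.

```python
import warnings

import numpy as np


def check_losses(step, losses):
    """Hypothetical guard: raise as soon as any logged loss is NaN or Inf."""
    bad = [name for name, value in sorted(losses.items())
           if not np.isfinite(float(value))]
    if bad:
        raise FloatingPointError(
            "non-finite loss at batch %d: %s" % (step, ", ".join(bad)))


# Batch 60 from the log: all values are finite, so nothing is raised.
check_losses(60, {"loss_bbox": "0.366174", "loss": "1.618668",
                  "loss_cls": "1.262222"})

# A median over a window that already contains NaN is itself NaN.
window = np.array([2.200727, 2.016517, 1.618668, float("nan")])
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)  # silence the warning if emitted
    smoothed = np.median(window)

# Batch 70 from the log: the guard fires on the first NaN.
try:
    check_losses(70, {"loss_bbox": "nan", "loss": "nan", "loss_cls": "nan"})
    stopped = False
except FloatingPointError as err:
    stopped = True
    message = str(err)
```

Calling a guard like this on the fetched loss values after every step, instead of only on the smoothed values printed every 10 batches, would pinpoint the exact batch where the loss first diverges.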