NaN in CPU training of resnet (#26673) · Issue · PaddlePaddle / Paddle

NaN in CPU training of resnet

Created by: sfraczek

Training ResNet from the models repo on CPU (without MKLDNN) produces a NaN loss.

Paddle models version

commit 64cde5d16de78181e348abc85fa45cee29512ec6
Author: LiuHao liuhao19900412@gmail.com
Date: Thu Aug 6 18:23:44 2020 +0800

Update run_ernie_classifier.py (#4790)

Paddle version

I checked the two commits below and some in between; none of them work.

I checked this one first:

commit 36868e84
Author: yukavio 67678385+yukavio@users.noreply.github.com
Date: Mon Aug 24 18:35:03 2020 +0800

fix one_hot example doc test=document_fix (#26585)

* fix one_hot example doc test=document_fix

* fix some bug

I also checked this one later:

commit 9d2bd0ac
Author: 123malin malin10@baidu.com
Date: Wed Jun 3 14:05:17 2020 +0800

Add try_catch mechanism to downpour_worker, print all program parameters (#24700)

* test=develop, add try_catch for debug

Other info

Build type: Release
CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
OS: Ubuntu 16.04.5 LTS
Python: 2.7.12

To reproduce

In models/dygraph/resnet, change the place in train.py from CUDA to CPU as in the diff below, then run: python train.py

-    place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \
-        if args.use_data_parallel else fluid.CUDAPlace(0)
+    place = fluid.CPUPlace()

Log

[Epoch 0, batch 0] loss 4.60813, acc1 0.03125, acc5 0.06250, batch_cost: 14.17363 s, reader_cost: 0.14763 s
[Epoch 0, batch 10] loss nan, acc1 0.01136, acc5 0.06250, batch_cost: 11.84618 s, reader_cost: 0.00004 s
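
To narrow down where the loss first goes bad, a small log scanner can help. This is only a sketch: the helper name `first_non_finite_loss` and the regular expression are my own, they assume the log format shown above, and they assume Python 3 (for `math.isfinite`) rather than the Python 2.7 used in this report.

```python
import math
import re

# Matches lines shaped like the log above, e.g.
# "[Epoch 0, batch 10] loss nan, acc1 0.01136, ..."
# (the pattern is an assumption based on those two lines only)
LOSS_LINE = re.compile(r"\[Epoch (\d+), batch (\d+)\] loss (\S+?),")


def first_non_finite_loss(lines):
    """Return (epoch, batch) of the first logged NaN/inf loss, or None."""
    for line in lines:
        match = LOSS_LINE.search(line)
        if match is None:
            continue  # not a training progress line
        epoch, batch, loss_text = match.groups()
        try:
            loss = float(loss_text)  # float("nan") and float("inf") parse fine
        except ValueError:
            continue  # unexpected token in the loss field
        if not math.isfinite(loss):
            return int(epoch), int(batch)
    return None


log = [
    "[Epoch 0, batch 0] loss 4.60813, acc1 0.03125, acc5 0.06250, "
    "batch_cost: 14.17363 s, reader_cost: 0.14763 s",
    "[Epoch 0, batch 10] loss nan, acc1 0.01136, acc5 0.06250, "
    "batch_cost: 11.84618 s, reader_cost: 0.00004 s",
]
print(first_non_finite_loss(log))  # (0, 10)
```

Only batches 0 and 10 appear in the log above, so the NaN may have first appeared at any batch from 1 to 10; printing the loss at every batch would be needed to pin down the exact step.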
