NaN in CPU training of resnet
Created by: sfraczek
Training ResNet from the models repo on CPU (without MKLDNN) produces a NaN loss.
Paddle models version
commit 64cde5d16de78181e348abc85fa45cee29512ec6
Author: LiuHao liuhao19900412@gmail.com
Date: Thu Aug 6 18:23:44 2020 +0800
Update run_ernie_classifier.py (#4790)
Paddle version
I checked the two commits below and several in between; none of them work.
Checked first:
commit 36868e84
Author: yukavio 67678385+yukavio@users.noreply.github.com
Date: Mon Aug 24 18:35:03 2020 +0800
fix one_hot example doc test=document_fix (#26585)
* fix one_hot example doc test=document_fix
* fix some bug
Checked afterwards as well:
commit 9d2bd0ac
Author: 123malin malin10@baidu.com
Date: Wed Jun 3 14:05:17 2020 +0800
Add try_catch mechanism to downpour_worker, print all parameters of the program (#24700)
* test=develop, add try_catch for debug
Other info
Build type: Release
CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
OS: Ubuntu 16.04.5 LTS
Python: 2.7.12
To reproduce
Directory: models/dygraph/resnet
Command: python train.py
Before running, change the place from CUDA to CPU:
- place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \
- if args.use_data_parallel else fluid.CUDAPlace(0)
+ place = fluid.CPUPlace()
Log
[Epoch 0, batch 0] loss 4.60813, acc1 0.03125, acc5 0.06250, batch_cost: 14.17363 s, reader_cost: 0.14763 s
[Epoch 0, batch 10] loss nan, acc1 0.01136, acc5 0.06250, batch_cost: 11.84618 s, reader_cost: 0.00004 s
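
The log above shows only batches 0 and 10, so the loss first becomes NaN somewhere in batches 1 to 10. To narrow that down, a small check like the following could be applied to the per-batch loss values. This is a minimal sketch and not part of train.py: the helper name first_non_finite and the sample loss values are hypothetical, and it assumes each loss has already been converted to a plain Python float.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or Inf loss, or None if all are finite."""
    for i, value in enumerate(losses):
        v = float(value)
        # math.isnan / math.isinf exist in both Python 2.7 and Python 3
        if math.isnan(v) or math.isinf(v):
            return i
    return None


# Hypothetical per-batch losses, for illustration only (not taken from the log)
sample = [4.60813, 4.7, 5.2, float("nan"), float("nan")]
print(first_non_finite(sample))
```

Inside the training loop, the same condition could be checked on every batch and the loop stopped at the first hit, so that the inputs and parameters of that exact batch can be inspected.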