Cost going to NaN with Paddle v0.10.0 for MT example
Created by: alvations
After installing from source off the develop branch, the paddle command seems to work fine:
$ git log
commit 7bce40d7be9174bea90e75df684ce8526485b36a
Merge: 603fd43 252ef0c
Author: gangliao <liaogang@baidu.com>
Date: Wed Jun 21 10:22:04 2017 +0800
Merge pull request #2538 from wangkuiyi/generic.cmake-comments
Rewrite tutorial comments in generic.cmake
$ sudo paddle version
PaddlePaddle 0.10.0, compiled with
with_avx: ON
with_gpu: ON
with_double: OFF
with_python: ON
with_rdma: OFF
with_timer: OFF
$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle.v2 as paddle
>>> paddle.init(use_gpu=True, trainer_count=4)
I0622 16:51:44.955044 28154 Util.cpp:166] commandline: --use_gpu=True --trainer_count=4
>>> exit()
I then cloned the book repo and ran train.py from the machine translation example, but the CPU training stopped with a floating point exception:
$ git clone https://github.com/PaddlePaddle/book.git
$ cd book/08.machine_translation/
book/08.machine_translation/$ python train.py
I0622 16:54:11.143401 28309 Util.cpp:166] commandline: --use_gpu=False --trainer_count=1
I0622 16:54:11.374763 28309 GradientMachine.cpp:85] Initing parameters..
I0622 16:54:13.712622 28309 GradientMachine.cpp:92] Init parameters done.
Pass 0, Batch 0, Cost 230.933862, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 230.808911, {'classification_error_evaluator': 0.9642857313156128}
.........
Pass 0, Batch 20, Cost 343.881104, {'classification_error_evaluator': 0.916167676448822}
.........
Pass 0, Batch 30, Cost 244.960254, {'classification_error_evaluator': 0.8907563090324402}
.....*** Aborted at 1498121868 (unix time) try "date -d @1498121868" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGFPE (@0x7f2047213b49) received by PID 28309 (TID 0x7f201a5d3700) from PID 1193360201; stack trace: ***
@ 0x7f2048fc8390 (unknown)
@ 0x7f2047213b49 paddle::AssignCpuEvaluate<>()
@ 0x7f204721a9a7 paddle::AssignEvaluate<>()
@ 0x7f2047211183 paddle::adamApply()
@ 0x7f2047208909 paddle::AdamParameterOptimizer::update()
@ 0x7f20471f2b6e paddle::SgdThreadUpdater::threadUpdateDense()
@ 0x7f20471f3d9f _ZNSt17_Function_handlerIFvimEZN6paddle16SgdThreadUpdater11finishBatchEfEUlimE_E9_M_invokeERKSt9_Any_dataOiOm
@ 0x7f2046ffec1c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
@ 0x7f2045b39c80 (unknown)
@ 0x7f2048fbe6ba start_thread
@ 0x7f2048cf43dd clone
@ 0x0 (unknown)
Floating point exception (core dumped)
When switching to GPU training, the cost goes to NaN:
book/08.machine_translation$ sed -i "s|\(use_gpu=.*\)|use_gpu=True, trainer_count=4\)|g" train.py
ltan@walle1:~/book/08.machine_translation$ python train.py
I0622 17:04:29.819021 28398 Util.cpp:166] commandline: --use_gpu=True --trainer_count=4
I0622 17:04:35.025086 28398 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=4 numDevices=4
I0622 17:04:35.179461 28398 GradientMachine.cpp:85] Initing parameters..
I0622 17:04:37.593305 28398 GradientMachine.cpp:92] Init parameters done.
Pass 0, Batch 0, Cost 232.981567, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 284.369263, {'classification_error_evaluator': 0.9420289993286133}
.........
Pass 0, Batch 20, Cost 265.632788, {'classification_error_evaluator': 0.9224806427955627}
.........
Pass 0, Batch 30, Cost 168.668164, {'classification_error_evaluator': 0.9146341681480408}
.........
Pass 0, Batch 40, Cost 119.270068, {'classification_error_evaluator': 0.8965517282485962}
.........
Pass 0, Batch 50, Cost 224.066553, {'classification_error_evaluator': 0.9174311757087708}
.........
Pass 0, Batch 60, Cost 295.795679, {'classification_error_evaluator': 0.9305555820465088}
.........
Pass 0, Batch 70, Cost 256.279614, {'classification_error_evaluator': 0.9599999785423279}
.........
Pass 0, Batch 80, Cost 206.731763, {'classification_error_evaluator': 0.9504950642585754}
.........
Pass 0, Batch 90, Cost 484.451318, {'classification_error_evaluator': 0.9037656784057617}
.........
Pass 0, Batch 100, Cost 181.277283, {'classification_error_evaluator': 0.966292142868042}
.........
Pass 0, Batch 110, Cost 281.560010, {'classification_error_evaluator': 0.9424460530281067}
.........
Pass 0, Batch 120, Cost 198.955090, {'classification_error_evaluator': 0.9693877696990967}
.........
Pass 0, Batch 130, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 140, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 150, Cost nan, {'classification_error_evaluator': 1.0}
The same thing happens with a single GPU trainer:
book/08.machine_translation$ sed -i "s|\(use_gpu=.*\)|use_gpu=True, trainer_count=1\)|g" train.py
ltan@walle1:~/book/08.machine_translation$ python train.py
I0622 17:09:47.405041 28503 Util.cpp:166] commandline: --use_gpu=True --trainer_count=1
I0622 17:09:52.146150 28503 GradientMachine.cpp:85] Initing parameters..
I0622 17:09:54.330538 28503 GradientMachine.cpp:92] Init parameters done.
Pass 0, Batch 0, Cost 253.607739, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 245.239307, {'classification_error_evaluator': 0.9495798349380493}
.........
Pass 0, Batch 20, Cost 362.484961, {'classification_error_evaluator': 0.9034090638160706}
.........
Pass 0, Batch 30, Cost 228.537988, {'classification_error_evaluator': 0.9099099040031433}
.........
Pass 0, Batch 40, Cost 277.921631, {'classification_error_evaluator': 0.9333333373069763}
.........
Pass 0, Batch 50, Cost 273.311084, {'classification_error_evaluator': 0.8872180581092834}
.........
Pass 0, Batch 60, Cost 310.044189, {'classification_error_evaluator': 0.9006622433662415}
.........
Pass 0, Batch 70, Cost 262.669629, {'classification_error_evaluator': 0.921875}
.........
Pass 0, Batch 80, Cost 135.404944, {'classification_error_evaluator': 0.9242424368858337}
.........
Pass 0, Batch 90, Cost 272.579102, {'classification_error_evaluator': 0.932330846786499}
.........
Pass 0, Batch 100, Cost 348.291699, {'classification_error_evaluator': 0.929411768913269}
.........
Pass 0, Batch 110, Cost 257.603052, {'classification_error_evaluator': 0.920634925365448}
.........
Pass 0, Batch 120, Cost 212.971094, {'classification_error_evaluator': 0.9903846383094788}
.........
Pass 0, Batch 130, Cost 198.442700, {'classification_error_evaluator': 0.9587628841400146}
.........
Pass 0, Batch 140, Cost 192.191089, {'classification_error_evaluator': 0.936170220375061}
.........
Pass 0, Batch 150, Cost 365.744531, {'classification_error_evaluator': 0.9329608678817749}
.........
Pass 0, Batch 160, Cost 226.738013, {'classification_error_evaluator': 0.9009009003639221}
.........
Pass 0, Batch 170, Cost 294.002539, {'classification_error_evaluator': 0.9444444179534912}
.........
Pass 0, Batch 180, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 190, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 200, Cost nan, {'classification_error_evaluator': 1.0}
.........
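For reference, the per-batch cost lines above are printed from the event handler in train.py. Below is a minimal sketch of that handler with an extra NaN check added so the first diverging batch can be pinpointed (the RuntimeError is my own addition; I'm assuming the v2 paddle.event.EndIteration API):

```python
import math
import sys

import paddle.v2 as paddle

def event_handler(event):
    # Log the cost every 10 batches, print dots in between (as in the book's
    # train.py), and abort as soon as the cost turns NaN.
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 10 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
        else:
            sys.stdout.write('.')
            sys.stdout.flush()
        if math.isnan(event.cost):
            raise RuntimeError("cost became NaN at pass %d, batch %d" %
                               (event.pass_id, event.batch_id))
```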
I've tried changing the values for:
- learning rate
- batch size
- L2 regularizer
- gradient clipping
- encoder/decoder dimensions
- vocab size
But the cost still goes to NaN every time, and I can't get through a single epoch without it diverging (a rough sketch of where I'm adjusting these settings is below).
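This is only an illustrative sketch of where those knobs live in the v2 API, not the exact values from any one run; cost, parameters, dict_size and event_handler are defined elsewhere in train.py, and I'm assuming gradient_clipping_threshold is accepted as an optimizer keyword in this build:

```python
import paddle.v2 as paddle

# Illustrative values only.
optimizer = paddle.optimizer.Adam(
    learning_rate=5e-5,                                            # learning rate
    regularization=paddle.optimizer.L2Regularization(rate=8e-4),   # L2 regularizer
    gradient_clipping_threshold=25.0)                              # gradient clipping (assumed keyword)

trainer = paddle.trainer.SGD(
    cost=cost,                       # cost/parameters come from the seqToseq network in train.py,
    parameters=parameters,           # where vocab size (dict_size) and encoder/decoder dims are set
    update_equation=optimizer)

trainer.train(
    reader=paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.train(dict_size), buf_size=8192),
        batch_size=5),               # batch size
    event_handler=event_handler,
    num_passes=2)
```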
This is possibly related to #1738 (closed).