Cost going to NaN with Paddle v0.10.0 for MT example
Created by: alvations
After installing from source off the develop branch, the paddle command seems to work fine:
$ git log
commit 7bce40d7be9174bea90e75df684ce8526485b36a
Merge: 603fd43 252ef0c
Author: gangliao <liaogang@baidu.com>
Date: Wed Jun 21 10:22:04 2017 +0800
Merge pull request #2538 from wangkuiyi/generic.cmake-comments
Rewrite tutorial comments in generic.cmake
$ sudo paddle version
PaddlePaddle 0.10.0, compiled with
with_avx: ON
with_gpu: ON
with_double: OFF
with_python: ON
with_rdma: OFF
with_timer: OFF
$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle.v2 as paddle
>>> paddle.init(use_gpu=True, trainer_count=4)
I0622 16:51:44.955044 28154 Util.cpp:166] commandline: --use_gpu=True --trainer_count=4
>>> exit()
I then cloned the book repo and ran train.py from the machine translation example, but the CPU training stopped with a floating point exception:
$ git clone https://github.com/PaddlePaddle/book.git
$ cd book/08.machine_translation/
book/08.machine_translation/$ python train.py
I0622 16:54:11.143401 28309 Util.cpp:166] commandline: --use_gpu=False --trainer_count=1
I0622 16:54:11.374763 28309 GradientMachine.cpp:85] Initing parameters..
I0622 16:54:13.712622 28309 GradientMachine.cpp:92] Init parameters done.
Pass 0, Batch 0, Cost 230.933862, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 230.808911, {'classification_error_evaluator': 0.9642857313156128}
.........
Pass 0, Batch 20, Cost 343.881104, {'classification_error_evaluator': 0.916167676448822}
.........
Pass 0, Batch 30, Cost 244.960254, {'classification_error_evaluator': 0.8907563090324402}
.....*** Aborted at 1498121868 (unix time) try "date -d @1498121868" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGFPE (@0x7f2047213b49) received by PID 28309 (TID 0x7f201a5d3700) from PID 1193360201; stack trace: ***
@ 0x7f2048fc8390 (unknown)
@ 0x7f2047213b49 paddle::AssignCpuEvaluate<>()
@ 0x7f204721a9a7 paddle::AssignEvaluate<>()
@ 0x7f2047211183 paddle::adamApply()
@ 0x7f2047208909 paddle::AdamParameterOptimizer::update()
@ 0x7f20471f2b6e paddle::SgdThreadUpdater::threadUpdateDense()
@ 0x7f20471f3d9f _ZNSt17_Function_handlerIFvimEZN6paddle16SgdThreadUpdater11finishBatchEfEUlimE_E9_M_invokeERKSt9_Any_dataOiOm
@ 0x7f2046ffec1c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
@ 0x7f2045b39c80 (unknown)
@ 0x7f2048fbe6ba start_thread
@ 0x7f2048cf43dd clone
@ 0x0 (unknown)
Floating point exception (core dumped)
When switching to GPU training, the cost goes to NaN:
book/08.machine_translation$ sed -i "s|\(use_gpu=.*\)|use_gpu=True, trainer_count=4\)|g" train.py
ltan@walle1:~/book/08.machine_translation$ python train.py
I0622 17:04:29.819021 28398 Util.cpp:166] commandline: --use_gpu=True --trainer_count=4
I0622 17:04:35.025086 28398 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=4 numDevices=4
I0622 17:04:35.179461 28398 GradientMachine.cpp:85] Initing parameters..
I0622 17:04:37.593305 28398 GradientMachine.cpp:92] Init parameters done.
Pass 0, Batch 0, Cost 232.981567, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 284.369263, {'classification_error_evaluator': 0.9420289993286133}
.........
Pass 0, Batch 20, Cost 265.632788, {'classification_error_evaluator': 0.9224806427955627}
.........
Pass 0, Batch 30, Cost 168.668164, {'classification_error_evaluator': 0.9146341681480408}
.........
Pass 0, Batch 40, Cost 119.270068, {'classification_error_evaluator': 0.8965517282485962}
.........
Pass 0, Batch 50, Cost 224.066553, {'classification_error_evaluator': 0.9174311757087708}
.........
Pass 0, Batch 60, Cost 295.795679, {'classification_error_evaluator': 0.9305555820465088}
.........
Pass 0, Batch 70, Cost 256.279614, {'classification_error_evaluator': 0.9599999785423279}
.........
Pass 0, Batch 80, Cost 206.731763, {'classification_error_evaluator': 0.9504950642585754}
.........
Pass 0, Batch 90, Cost 484.451318, {'classification_error_evaluator': 0.9037656784057617}
.........
Pass 0, Batch 100, Cost 181.277283, {'classification_error_evaluator': 0.966292142868042}
.........
Pass 0, Batch 110, Cost 281.560010, {'classification_error_evaluator': 0.9424460530281067}
.........
Pass 0, Batch 120, Cost 198.955090, {'classification_error_evaluator': 0.9693877696990967}
.........
Pass 0, Batch 130, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 140, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 150, Cost nan, {'classification_error_evaluator': 1.0}
The same thing happens with a single GPU trainer:
book/08.machine_translation$ sed -i "s|\(use_gpu=.*\)|use_gpu=True, trainer_count=1\)|g" train.py
ltan@walle1:~/book/08.machine_translation$ python train.py
I0622 17:09:47.405041 28503 Util.cpp:166] commandline: --use_gpu=True --trainer_count=1
I0622 17:09:52.146150 28503 GradientMachine.cpp:85] Initing parameters..
I0622 17:09:54.330538 28503 GradientMachine.cpp:92] Init parameters done.
Pass 0, Batch 0, Cost 253.607739, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 245.239307, {'classification_error_evaluator': 0.9495798349380493}
.........
Pass 0, Batch 20, Cost 362.484961, {'classification_error_evaluator': 0.9034090638160706}
.........
Pass 0, Batch 30, Cost 228.537988, {'classification_error_evaluator': 0.9099099040031433}
.........
Pass 0, Batch 40, Cost 277.921631, {'classification_error_evaluator': 0.9333333373069763}
.........
Pass 0, Batch 50, Cost 273.311084, {'classification_error_evaluator': 0.8872180581092834}
.........
Pass 0, Batch 60, Cost 310.044189, {'classification_error_evaluator': 0.9006622433662415}
.........
Pass 0, Batch 70, Cost 262.669629, {'classification_error_evaluator': 0.921875}
.........
Pass 0, Batch 80, Cost 135.404944, {'classification_error_evaluator': 0.9242424368858337}
.........
Pass 0, Batch 90, Cost 272.579102, {'classification_error_evaluator': 0.932330846786499}
.........
Pass 0, Batch 100, Cost 348.291699, {'classification_error_evaluator': 0.929411768913269}
.........
Pass 0, Batch 110, Cost 257.603052, {'classification_error_evaluator': 0.920634925365448}
.........
Pass 0, Batch 120, Cost 212.971094, {'classification_error_evaluator': 0.9903846383094788}
.........
Pass 0, Batch 130, Cost 198.442700, {'classification_error_evaluator': 0.9587628841400146}
.........
Pass 0, Batch 140, Cost 192.191089, {'classification_error_evaluator': 0.936170220375061}
.........
Pass 0, Batch 150, Cost 365.744531, {'classification_error_evaluator': 0.9329608678817749}
.........
Pass 0, Batch 160, Cost 226.738013, {'classification_error_evaluator': 0.9009009003639221}
.........
Pass 0, Batch 170, Cost 294.002539, {'classification_error_evaluator': 0.9444444179534912}
.........
Pass 0, Batch 180, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 190, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 200, Cost nan, {'classification_error_evaluator': 1.0}
.........
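For reference, the per-batch cost lines above are printed from the event handler in train.py. Below is a minimal sketch of that handler with an extra NaN check added so the first diverging batch can be pinpointed (the RuntimeError is my own addition; I'm assuming the v2 paddle.event.EndIteration API):

```python
import math
import sys

import paddle.v2 as paddle

def event_handler(event):
    # Log the cost every 10 batches, print dots in between (as in the book's
    # train.py), and abort as soon as the cost turns NaN.
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 10 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
        else:
            sys.stdout.write('.')
            sys.stdout.flush()
        if math.isnan(event.cost):
            raise RuntimeError("cost became NaN at pass %d, batch %d" %
                               (event.pass_id, event.batch_id))
```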
I've tried changing the values for:
- learning rate
- batch size
- L2 regularizer
- gradient clipping
- encoder/decoder dimensions
- vocab size
But the cost still goes to NaN every time, and I can't get through a single epoch without it diverging (a rough sketch of where I'm adjusting these settings is below).
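This is only an illustrative sketch of where those knobs live in the v2 API, not the exact values from any one run; cost, parameters, dict_size and event_handler are defined elsewhere in train.py, and I'm assuming gradient_clipping_threshold is accepted as an optimizer keyword in this build:

```python
import paddle.v2 as paddle

# Illustrative values only.
optimizer = paddle.optimizer.Adam(
    learning_rate=5e-5,                                            # learning rate
    regularization=paddle.optimizer.L2Regularization(rate=8e-4),   # L2 regularizer
    gradient_clipping_threshold=25.0)                              # gradient clipping (assumed keyword)

trainer = paddle.trainer.SGD(
    cost=cost,                       # cost/parameters come from the seqToseq network in train.py,
    parameters=parameters,           # where vocab size (dict_size) and encoder/decoder dims are set
    update_equation=optimizer)

trainer.train(
    reader=paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.train(dict_size), buf_size=8192),
        batch_size=5),               # batch size
    event_handler=event_handler,
    num_passes=2)
```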
This is possibly related to #1738 (closed).