PaddlePaddle / Paddle · Issue #2563
Closed
Opened June 22, 2017 by saxon_zh (Guest)

Cost going to NaN with Paddle v0.10.0 for MT example

Created by: alvations

After installing from source off the develop branch, the paddle command seems to work fine:

$ git log 
commit 7bce40d7be9174bea90e75df684ce8526485b36a
Merge: 603fd43 252ef0c
Author: gangliao <liaogang@baidu.com>
Date:   Wed Jun 21 10:22:04 2017 +0800

    Merge pull request #2538 from wangkuiyi/generic.cmake-comments
    
    Rewrite tutorial comments in generic.cmake

$ sudo paddle version
PaddlePaddle 0.10.0, compiled with
    with_avx: ON
    with_gpu: ON
    with_double: OFF
    with_python: ON
    with_rdma: OFF
    with_timer: OFF

$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle.v2 as paddle
>>> paddle.init(use_gpu=True, trainer_count=4)
I0622 16:51:44.955044 28154 Util.cpp:166] commandline:  --use_gpu=True --trainer_count=4 
>>> exit()

I then cloned the book repo and ran train.py from the machine translation example, but the CPU training stopped with a floating point exception:

$ git clone https://github.com/PaddlePaddle/book.git
$ cd book/08.machine_translation/

book/08.machine_translation/$ python train.py 
I0622 16:54:11.143401 28309 Util.cpp:166] commandline:  --use_gpu=False --trainer_count=1 
I0622 16:54:11.374763 28309 GradientMachine.cpp:85] Initing parameters..
I0622 16:54:13.712622 28309 GradientMachine.cpp:92] Init parameters done.

Pass 0, Batch 0, Cost 230.933862, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 230.808911, {'classification_error_evaluator': 0.9642857313156128}
.........
Pass 0, Batch 20, Cost 343.881104, {'classification_error_evaluator': 0.916167676448822}
.........
Pass 0, Batch 30, Cost 244.960254, {'classification_error_evaluator': 0.8907563090324402}
.....*** Aborted at 1498121868 (unix time) try "date -d @1498121868" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGFPE (@0x7f2047213b49) received by PID 28309 (TID 0x7f201a5d3700) from PID 1193360201; stack trace: ***
    @     0x7f2048fc8390 (unknown)
    @     0x7f2047213b49 paddle::AssignCpuEvaluate<>()
    @     0x7f204721a9a7 paddle::AssignEvaluate<>()
    @     0x7f2047211183 paddle::adamApply()
    @     0x7f2047208909 paddle::AdamParameterOptimizer::update()
    @     0x7f20471f2b6e paddle::SgdThreadUpdater::threadUpdateDense()
    @     0x7f20471f3d9f _ZNSt17_Function_handlerIFvimEZN6paddle16SgdThreadUpdater11finishBatchEfEUlimE_E9_M_invokeERKSt9_Any_dataOiOm
    @     0x7f2046ffec1c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
    @     0x7f2045b39c80 (unknown)
    @     0x7f2048fbe6ba start_thread
    @     0x7f2048cf43dd clone
    @                0x0 (unknown)
Floating point exception (core dumped)
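The stack trace bottoms out in paddle::adamApply / paddle::AdamParameterOptimizer::update. For orientation, a standard Adam step looks like the numpy sketch below (a generic re-derivation of the textbook algorithm, not Paddle's actual implementation): once a single gradient overflows float32, grad * grad becomes inf and NaN/inf then propagates into every parameter on the next update.

```python
import numpy as np

def adam_step(param, grad, m, v, t,
              lr=5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # One textbook Adam update. This is a generic sketch of the
    # algorithm, not Paddle's adamApply code; hyperparameter names
    # follow the original Adam formulation.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# float32 tops out near 3.4e38: a gradient of ~2e19 already makes
# grad * grad overflow to inf, after which the update emits inf/NaN.
g = np.float32(2e19)
print(np.isinf(g * g))  # True (numpy also raises an overflow warning)
```

So the CPU SIGFPE and the GPU NaNs are plausibly the same divergence, just surfacing differently in the optimizer.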

After switching to GPU training, the cost goes to NaN:

book/08.machine_translation$ sed -i "s|\(use_gpu=.*\)|use_gpu=True, trainer_count=4\)|g" train.py
book/08.machine_translation$ python test.py
python: can't open file 'test.py': [Errno 2] No such file or directory
ltan@walle1:~/book/08.machine_translation$ python train.py
I0622 17:04:29.819021 28398 Util.cpp:166] commandline:  --use_gpu=True --trainer_count=4 
I0622 17:04:35.025086 28398 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=4 numDevices=4
I0622 17:04:35.179461 28398 GradientMachine.cpp:85] Initing parameters..
I0622 17:04:37.593305 28398 GradientMachine.cpp:92] Init parameters done.

Pass 0, Batch 0, Cost 232.981567, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 284.369263, {'classification_error_evaluator': 0.9420289993286133}
.........
Pass 0, Batch 20, Cost 265.632788, {'classification_error_evaluator': 0.9224806427955627}
.........
Pass 0, Batch 30, Cost 168.668164, {'classification_error_evaluator': 0.9146341681480408}
.........
Pass 0, Batch 40, Cost 119.270068, {'classification_error_evaluator': 0.8965517282485962}
.........
Pass 0, Batch 50, Cost 224.066553, {'classification_error_evaluator': 0.9174311757087708}
.........
Pass 0, Batch 60, Cost 295.795679, {'classification_error_evaluator': 0.9305555820465088}
.........
Pass 0, Batch 70, Cost 256.279614, {'classification_error_evaluator': 0.9599999785423279}
.........
Pass 0, Batch 80, Cost 206.731763, {'classification_error_evaluator': 0.9504950642585754}
.........
Pass 0, Batch 90, Cost 484.451318, {'classification_error_evaluator': 0.9037656784057617}
.........
Pass 0, Batch 100, Cost 181.277283, {'classification_error_evaluator': 0.966292142868042}
.........
Pass 0, Batch 110, Cost 281.560010, {'classification_error_evaluator': 0.9424460530281067}
.........
Pass 0, Batch 120, Cost 198.955090, {'classification_error_evaluator': 0.9693877696990967}
.........
Pass 0, Batch 130, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 140, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 150, Cost nan, {'classification_error_evaluator': 1.0}

Similarly with 1 GPU trainer:

book/08.machine_translation$ sed -i "s|\(use_gpu=.*\)|use_gpu=True, trainer_count=1\)|g" train.py

ltan@walle1:~/book/08.machine_translation$ python train.py
I0622 17:09:47.405041 28503 Util.cpp:166] commandline:  --use_gpu=True --trainer_count=1 
I0622 17:09:52.146150 28503 GradientMachine.cpp:85] Initing parameters..
I0622 17:09:54.330538 28503 GradientMachine.cpp:92] Init parameters done.

Pass 0, Batch 0, Cost 253.607739, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 245.239307, {'classification_error_evaluator': 0.9495798349380493}
.........
Pass 0, Batch 20, Cost 362.484961, {'classification_error_evaluator': 0.9034090638160706}
.........
Pass 0, Batch 30, Cost 228.537988, {'classification_error_evaluator': 0.9099099040031433}
.........
Pass 0, Batch 40, Cost 277.921631, {'classification_error_evaluator': 0.9333333373069763}
.........
Pass 0, Batch 50, Cost 273.311084, {'classification_error_evaluator': 0.8872180581092834}
.........
Pass 0, Batch 60, Cost 310.044189, {'classification_error_evaluator': 0.9006622433662415}
.........
Pass 0, Batch 70, Cost 262.669629, {'classification_error_evaluator': 0.921875}
.........
Pass 0, Batch 80, Cost 135.404944, {'classification_error_evaluator': 0.9242424368858337}
.........
Pass 0, Batch 90, Cost 272.579102, {'classification_error_evaluator': 0.932330846786499}
.........
Pass 0, Batch 100, Cost 348.291699, {'classification_error_evaluator': 0.929411768913269}
.........
Pass 0, Batch 110, Cost 257.603052, {'classification_error_evaluator': 0.920634925365448}
.........
Pass 0, Batch 120, Cost 212.971094, {'classification_error_evaluator': 0.9903846383094788}
.........
Pass 0, Batch 130, Cost 198.442700, {'classification_error_evaluator': 0.9587628841400146}
.........
Pass 0, Batch 140, Cost 192.191089, {'classification_error_evaluator': 0.936170220375061}
.........
Pass 0, Batch 150, Cost 365.744531, {'classification_error_evaluator': 0.9329608678817749}
.........
Pass 0, Batch 160, Cost 226.738013, {'classification_error_evaluator': 0.9009009003639221}
.........
Pass 0, Batch 170, Cost 294.002539, {'classification_error_evaluator': 0.9444444179534912}
.........
Pass 0, Batch 180, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 190, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 200, Cost nan, {'classification_error_evaluator': 1.0}
.........

I've tried changing the values for:

  • learning rate
  • batch size
  • L2 regularizer
  • gradient clipping
  • encoder/decoder dimensions
  • vocab size

But the cost still goes to NaN; I can't get through a single epoch without it diverging.
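Until the root cause is found, a guard in train.py's event handler can at least abort the run the moment the cost stops being finite, instead of burning batches on NaN parameters (a minimal sketch; diverged and check are hypothetical helper names, not Paddle API):

```python
import math

def diverged(cost):
    # True once the cost is no longer a finite number.
    return math.isnan(cost) or math.isinf(cost)

def check(batch_id, cost):
    # Hypothetical hook to call from the per-batch event handler
    # with the cost reported for that batch.
    if diverged(cost):
        raise RuntimeError("cost diverged at batch %d" % batch_id)

# Mimicking the logs above: the first finite costs pass, NaN aborts.
for batch_id, cost in [(0, 230.9), (10, 284.3)]:
    check(batch_id, cost)
```

In the book's train.py this check would go inside the EndIteration branch of the event handler, right where the cost is printed.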

Possibly this is a related issue: #1738 (closed)
