Opened Oct 10, 2016 by saxon_zh (@saxon_zh, Guest)

Cuda Error: an illegal memory access was encountered

Created by: jamestang0219

While training an LSTM model, I hit the error below. Here is part of the training log:

`I1010 05:43:00.041766 119630 TrainerInternal.cpp:204] ___fc_layer_0__.w0   avg_abs_val=0.175171    max_val=1.02055     avg_abs_grad=0.0117175   max_grad=1.19871
I1010 05:43:00.042105 119630 TrainerInternal.cpp:204] ___fc_layer_0__.wbias avg_abs_val=0.127698    max_val=0.430612    avg_abs_grad=0.0627174   max_grad=1.82515
I1010 05:43:00.042553 119630 TrainerInternal.cpp:204] ___lstmemory_0__.w0  avg_abs_val=0.133314    max_val=0.794175    avg_abs_grad=0.0206954   max_grad=2.81496
I1010 05:43:00.042837 119630 TrainerInternal.cpp:204] ___lstmemory_0__.wbias avg_abs_val=0.115955    max_val=0.508302    avg_abs_grad=0.101172    max_grad=14.5017
I1010 05:43:00.043148 119630 TrainerInternal.cpp:204] ___fc_layer_1__.w0   avg_abs_val=0.274876    max_val=0.900992    avg_abs_grad=0.382739    max_grad=2.3795
I1010 05:43:00.043421 119630 TrainerInternal.cpp:204] ___fc_layer_1__.wbias avg_abs_val=0.217184    max_val=0.217184    avg_abs_grad=0.373559    max_grad=0.37356

I1010 05:43:00.043450 119630 TrainerInternal.cpp:162]  Batch=6400 samples=819200 AvgCost=0.294956 CurrentCost=0.31619 Eval: classification_error_evaluator=0.128513  CurrentEval: classification_error_evaluator=0.1375
...................
I1010 05:43:06.412021 119630 TrainerInternal.cpp:162]  Batch=6420 samples=821760 AvgCost=0.294991 CurrentCost=0.30623 Eval: classification_error_evaluator=0.128552  CurrentEval: classification_error_evaluator=0.141016
...................
I1010 05:43:13.165990 119630 TrainerInternal.cpp:162]  Batch=6440 samples=824320 AvgCost=0.29512 CurrentCost=0.336758 Eval: classification_error_evaluator=0.128653  CurrentEval: classification_error_evaluator=0.160938
...................
I1010 05:43:19.573108 119630 TrainerInternal.cpp:162]  Batch=6460 samples=826880 AvgCost=0.295171 CurrentCost=0.311461 Eval: classification_error_evaluator=0.128692  CurrentEval: classification_error_evaluator=0.141406
...................F1010 05:43:26.722699 119642 hl_cuda_device.cc:646] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7f1c2eccadaa  (unknown)
    @     0x7f1c2eccace4  (unknown)
    @     0x7f1c2ecca6e6  (unknown)
    @     0x7f1c2eccd687  (unknown)
    @           0x8ae00b  hl_stream_synchronize()
    @           0x8ca3b0  hl_max_sequence_backward()
    @           0x6c860e  paddle::GpuMatrix::maxSequenceBackward()
    @           0x5f71db  paddle::MaxLayer::backward()
    @           0x67228e  paddle::NeuralNetwork::backward()
    @           0x65324c  paddle::TrainerThread::backward()
    @           0x65337d  paddle::TrainerThread::computeThread()
    @     0x7f1c2e847a60  (unknown)
    @     0x7f1c2f883184  start_thread
    @     0x7f1c2dfaf37d  (unknown)
    @              (nil)  (unknown)
/usr/local/bin/paddle: line 46: 119630 Aborted                 (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}`
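To make the per-parameter statistics in a log like this easier to scan, here is a minimal sketch that pulls the `max_grad` values out of the `TrainerInternal.cpp:204` lines and lists the largest ones. The function names and the threshold of 10.0 are hypothetical choices for illustration, not part of Paddle, and a large gradient is not claimed to be the cause of the illegal memory access; this only summarizes what the log already shows.

```python
import re

# Matches the per-parameter statistics lines printed at TrainerInternal.cpp:204.
STAT_LINE = re.compile(
    r"TrainerInternal\.cpp:204\]\s+(?P<name>\S+)\s+"
    r"avg_abs_val=(?P<avg_abs_val>\S+)\s+max_val=(?P<max_val>\S+)\s+"
    r"avg_abs_grad=(?P<avg_abs_grad>\S+)\s+max_grad=(?P<max_grad>\S+)"
)


def parse_param_stats(log_text):
    """Return {parameter_name: {stat_name: float}} for every statistics line."""
    stats = {}
    for match in STAT_LINE.finditer(log_text):
        fields = match.groupdict()
        name = fields.pop("name")
        stats[name] = {key: float(value) for key, value in fields.items()}
    return stats


def largest_gradients(stats, threshold):
    """List (name, max_grad) pairs whose max_grad exceeds threshold, largest first."""
    flagged = [(n, s["max_grad"]) for n, s in stats.items() if s["max_grad"] > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Two lines copied from the log above.
SAMPLE = """\
I1010 05:43:00.042553 119630 TrainerInternal.cpp:204] ___lstmemory_0__.w0  avg_abs_val=0.133314    max_val=0.794175    avg_abs_grad=0.0206954   max_grad=2.81496
I1010 05:43:00.042837 119630 TrainerInternal.cpp:204] ___lstmemory_0__.wbias avg_abs_val=0.115955    max_val=0.508302    avg_abs_grad=0.101172    max_grad=14.5017
"""

if __name__ == "__main__":
    # The threshold is an arbitrary illustrative value (assumption).
    for name, grad in largest_gradients(parse_param_stats(SAMPLE), threshold=10.0):
        print(name, grad)
```

On the two sample lines, only `___lstmemory_0__.wbias` exceeds the illustrative threshold.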

I also checked the nvidia-smi output before the error occurred:

Mon Oct 10 05:12:26 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.93     Driver Version: 352.93         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   52C    P0    66W / 125W |   1497MiB /  4095MiB |     41%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K520           Off  | 0000:00:04.0     Off |                  N/A |
| N/A   56C    P0    55W / 125W |   1257MiB /  4095MiB |     47%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K520           Off  | 0000:00:05.0     Off |                  N/A |
| N/A   54C    P0    61W / 125W |   1079MiB /  4095MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K520           Off  | 0000:00:06.0     Off |                  N/A |
| N/A   58C    P0    60W / 125W |   1049MiB /  4095MiB |     70%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1484MiB |
|    1    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1243MiB |
|    2    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1066MiB |
|    3    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1036MiB |
+-----------------------------------------------------------------------------+
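As a quick way to read the memory column of a snapshot like this, here is a minimal sketch that extracts the "used / total" figures per GPU row. The helper name `memory_usage` is hypothetical. Note that the snapshot was taken about half an hour before the crash, so it describes memory use at that moment only.

```python
import re

# Matches the "used / total" memory column, e.g. "1497MiB /  4095MiB".
MEMORY_COLUMN = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def memory_usage(smi_text):
    """Return a list of (used_mib, total_mib, fraction_used), one per GPU row."""
    rows = []
    for used, total in MEMORY_COLUMN.findall(smi_text):
        used_mib, total_mib = int(used), int(total)
        rows.append((used_mib, total_mib, used_mib / total_mib))
    return rows


# The four per-GPU rows copied from the nvidia-smi output above.
SAMPLE = """\
| N/A   52C    P0    66W / 125W |   1497MiB /  4095MiB |     41%      Default |
| N/A   56C    P0    55W / 125W |   1257MiB /  4095MiB |     47%      Default |
| N/A   54C    P0    61W / 125W |   1079MiB /  4095MiB |     76%      Default |
| N/A   58C    P0    60W / 125W |   1049MiB /  4095MiB |     70%      Default |
"""

if __name__ == "__main__":
    for gpu_index, (used, total, fraction) in enumerate(memory_usage(SAMPLE)):
        print(f"GPU {gpu_index}: {used} / {total} MiB ({fraction:.0%})")
```

In this snapshot every GPU is using well under half of its 4095 MiB.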

The model initializes successfully, but the error occurs once Paddle is training on samples.

I've tried several times and get the same error each time.

Reference: paddlepaddle/Paddle#182