Cuda Error: an illegal memory access was encountered (#182) · Issue · PaddlePaddle / Paddle

Cuda Error: an illegal memory access was encountered

Created by: jamestang0219

When I'm training the LSTM model, the error occur. Here is the partial training log:

`I1010 05:43:00.041766 119630 TrainerInternal.cpp:204] ___fc_layer_0__.w0   avg_abs_val=0.175171    max_val=1.02055     avg_abs_grad=0.0117175   max_grad=1.19871
I1010 05:43:00.042105 119630 TrainerInternal.cpp:204] ___fc_layer_0__.wbias avg_abs_val=0.127698    max_val=0.430612    avg_abs_grad=0.0627174   max_grad=1.82515
I1010 05:43:00.042553 119630 TrainerInternal.cpp:204] ___lstmemory_0__.w0  avg_abs_val=0.133314    max_val=0.794175    avg_abs_grad=0.0206954   max_grad=2.81496
I1010 05:43:00.042837 119630 TrainerInternal.cpp:204] ___lstmemory_0__.wbias avg_abs_val=0.115955    max_val=0.508302    avg_abs_grad=0.101172    max_grad=14.5017
I1010 05:43:00.043148 119630 TrainerInternal.cpp:204] ___fc_layer_1__.w0   avg_abs_val=0.274876    max_val=0.900992    avg_abs_grad=0.382739    max_grad=2.3795
I1010 05:43:00.043421 119630 TrainerInternal.cpp:204] ___fc_layer_1__.wbias avg_abs_val=0.217184    max_val=0.217184    avg_abs_grad=0.373559    max_grad=0.37356

I1010 05:43:00.043450 119630 TrainerInternal.cpp:162]  Batch=6400 samples=819200 AvgCost=0.294956 CurrentCost=0.31619 Eval: classification_error_evaluator=0.128513  CurrentEval: classification_error_evaluator=0.1375
...................
I1010 05:43:06.412021 119630 TrainerInternal.cpp:162]  Batch=6420 samples=821760 AvgCost=0.294991 CurrentCost=0.30623 Eval: classification_error_evaluator=0.128552  CurrentEval: classification_error_evaluator=0.141016
...................
I1010 05:43:13.165990 119630 TrainerInternal.cpp:162]  Batch=6440 samples=824320 AvgCost=0.29512 CurrentCost=0.336758 Eval: classification_error_evaluator=0.128653  CurrentEval: classification_error_evaluator=0.160938
...................
I1010 05:43:19.573108 119630 TrainerInternal.cpp:162]  Batch=6460 samples=826880 AvgCost=0.295171 CurrentCost=0.311461 Eval: classification_error_evaluator=0.128692  CurrentEval: classification_error_evaluator=0.141406
...................F1010 05:43:26.722699 119642 hl_cuda_device.cc:646] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7f1c2eccadaa  (unknown)
    @     0x7f1c2eccace4  (unknown)
    @     0x7f1c2ecca6e6  (unknown)
    @     0x7f1c2eccd687  (unknown)
    @           0x8ae00b  hl_stream_synchronize()
    @           0x8ca3b0  hl_max_sequence_backward()
    @           0x6c860e  paddle::GpuMatrix::maxSequenceBackward()
    @           0x5f71db  paddle::MaxLayer::backward()
    @           0x67228e  paddle::NeuralNetwork::backward()
    @           0x65324c  paddle::TrainerThread::backward()
    @           0x65337d  paddle::TrainerThread::computeThread()
    @     0x7f1c2e847a60  (unknown)
    @     0x7f1c2f883184  start_thread
    @     0x7f1c2dfaf37d  (unknown)
    @              (nil)  (unknown)
/usr/local/bin/paddle: line 46: 119630 Aborted                 (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}`

and I checked nvidia source manager before the error occured:

Mon Oct 10 05:12:26 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.93     Driver Version: 352.93         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   52C    P0    66W / 125W |   1497MiB /  4095MiB |     41%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K520           Off  | 0000:00:04.0     Off |                  N/A |
| N/A   56C    P0    55W / 125W |   1257MiB /  4095MiB |     47%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K520           Off  | 0000:00:05.0     Off |                  N/A |
| N/A   54C    P0    61W / 125W |   1079MiB /  4095MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K520           Off  | 0000:00:06.0     Off |                  N/A |
| N/A   58C    P0    60W / 125W |   1049MiB /  4095MiB |     70%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1484MiB |
|    1    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1243MiB |
|    2    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1066MiB |
|    3    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1036MiB |
+-----------------------------------------------------------------------------+`

The model can be initialized successfully, but when Paddle trained samples, the error will occur.

I've tried several times, each time get the same error.

PaddlePaddle / Paddle 1 年多 前同步成功

Cuda Error: an illegal memory access was encountered

PaddlePaddle / Paddle
1 年多前同步成功