Cuda Error: an illegal memory access was encountered
Created by: jamestang0219
When I'm training the LSTM model, the error occur. Here is the partial training log:
`I1010 05:43:00.041766 119630 TrainerInternal.cpp:204] ___fc_layer_0__.w0 avg_abs_val=0.175171 max_val=1.02055 avg_abs_grad=0.0117175 max_grad=1.19871
I1010 05:43:00.042105 119630 TrainerInternal.cpp:204] ___fc_layer_0__.wbias avg_abs_val=0.127698 max_val=0.430612 avg_abs_grad=0.0627174 max_grad=1.82515
I1010 05:43:00.042553 119630 TrainerInternal.cpp:204] ___lstmemory_0__.w0 avg_abs_val=0.133314 max_val=0.794175 avg_abs_grad=0.0206954 max_grad=2.81496
I1010 05:43:00.042837 119630 TrainerInternal.cpp:204] ___lstmemory_0__.wbias avg_abs_val=0.115955 max_val=0.508302 avg_abs_grad=0.101172 max_grad=14.5017
I1010 05:43:00.043148 119630 TrainerInternal.cpp:204] ___fc_layer_1__.w0 avg_abs_val=0.274876 max_val=0.900992 avg_abs_grad=0.382739 max_grad=2.3795
I1010 05:43:00.043421 119630 TrainerInternal.cpp:204] ___fc_layer_1__.wbias avg_abs_val=0.217184 max_val=0.217184 avg_abs_grad=0.373559 max_grad=0.37356
I1010 05:43:00.043450 119630 TrainerInternal.cpp:162] Batch=6400 samples=819200 AvgCost=0.294956 CurrentCost=0.31619 Eval: classification_error_evaluator=0.128513 CurrentEval: classification_error_evaluator=0.1375
...................
I1010 05:43:06.412021 119630 TrainerInternal.cpp:162] Batch=6420 samples=821760 AvgCost=0.294991 CurrentCost=0.30623 Eval: classification_error_evaluator=0.128552 CurrentEval: classification_error_evaluator=0.141016
...................
I1010 05:43:13.165990 119630 TrainerInternal.cpp:162] Batch=6440 samples=824320 AvgCost=0.29512 CurrentCost=0.336758 Eval: classification_error_evaluator=0.128653 CurrentEval: classification_error_evaluator=0.160938
...................
I1010 05:43:19.573108 119630 TrainerInternal.cpp:162] Batch=6460 samples=826880 AvgCost=0.295171 CurrentCost=0.311461 Eval: classification_error_evaluator=0.128692 CurrentEval: classification_error_evaluator=0.141406
...................F1010 05:43:26.722699 119642 hl_cuda_device.cc:646] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7f1c2eccadaa (unknown)
@ 0x7f1c2eccace4 (unknown)
@ 0x7f1c2ecca6e6 (unknown)
@ 0x7f1c2eccd687 (unknown)
@ 0x8ae00b hl_stream_synchronize()
@ 0x8ca3b0 hl_max_sequence_backward()
@ 0x6c860e paddle::GpuMatrix::maxSequenceBackward()
@ 0x5f71db paddle::MaxLayer::backward()
@ 0x67228e paddle::NeuralNetwork::backward()
@ 0x65324c paddle::TrainerThread::backward()
@ 0x65337d paddle::TrainerThread::computeThread()
@ 0x7f1c2e847a60 (unknown)
@ 0x7f1c2f883184 start_thread
@ 0x7f1c2dfaf37d (unknown)
@ (nil) (unknown)
/usr/local/bin/paddle: line 46: 119630 Aborted (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}`
and I checked nvidia source manager before the error occured:
Mon Oct 10 05:12:26 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.93 Driver Version: 352.93 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 52C P0 66W / 125W | 1497MiB / 4095MiB | 41% Default |
+-------------------------------+----------------------+----------------------+
| 1 GRID K520 Off | 0000:00:04.0 Off | N/A |
| N/A 56C P0 55W / 125W | 1257MiB / 4095MiB | 47% Default |
+-------------------------------+----------------------+----------------------+
| 2 GRID K520 Off | 0000:00:05.0 Off | N/A |
| N/A 54C P0 61W / 125W | 1079MiB / 4095MiB | 76% Default |
+-------------------------------+----------------------+----------------------+
| 3 GRID K520 Off | 0000:00:06.0 Off | N/A |
| N/A 58C P0 60W / 125W | 1049MiB / 4095MiB | 70% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 119630 C ...ocal/bin/../opt/paddle/bin/paddle_trainer 1484MiB |
| 1 119630 C ...ocal/bin/../opt/paddle/bin/paddle_trainer 1243MiB |
| 2 119630 C ...ocal/bin/../opt/paddle/bin/paddle_trainer 1066MiB |
| 3 119630 C ...ocal/bin/../opt/paddle/bin/paddle_trainer 1036MiB |
+-----------------------------------------------------------------------------+`
The model can be initialized successfully, but when Paddle trained samples, the error will occur.
I've tried several times, each time get the same error.