operator < conv2d > error
Created by: hawkcoder
When I run examples/tiny/run_train.py, I get this CUDA error in the test stage, while the train stage runs fine. Why does this happen? Can somebody help me?
Environment: Python 2.7, PaddlePaddle 1.8.0.post107
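For completeness, here is a minimal sketch of how the install itself could be sanity-checked (assuming the standard `paddle.fluid.install_check` utility ships with this wheel; the expected values in the comments are just what I would guess for the post107 build):

```python
# Minimal install sanity check -- assumes paddle.fluid.install_check is
# available in the 1.8.0.post107 wheel.
import paddle
import paddle.fluid as fluid
from paddle.fluid import install_check

print(paddle.__version__)              # expected: 1.8.0
print(fluid.is_compiled_with_cuda())   # expected: True for a CUDA 10.x wheel
install_check.run_check()              # runs a tiny model on the GPU(s) to verify the install
```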
The error was:
3f93d4eb8892 /DeepSpeech/examples/tiny {develop} bash run_train.sh
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
----------- Configuration Arguments -----------
augment_conf_path: conf/augmentation.config
batch_size: 4
dev_manifest: data/tiny/manifest.tiny
init_from_pretrained_model: None
is_local: 1
learning_rate: 1e-05
max_duration: 27.0
mean_std_path: data/tiny/mean_std.npz
min_duration: 0.0
num_conv_layers: 2
num_epoch: 20
num_iter_print: 1
num_rnn_layers: 3
num_samples: 64
output_model_dir: ./checkpoints/tiny
rnn_layer_size: 2048
save_epoch: 1
share_rnn_weights: 1
shuffle_method: batch_shuffle_clipped
specgram_type: linear
test_off: 0
train_manifest: data/tiny/manifest.tiny
use_gpu: 1
use_gru: 0
use_sortagrad: 1
vocab_path: data/tiny/vocab.txt
------------------------------------------------
W0602 03:40:12.703999 393 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.0
W0602 03:40:12.706190 393 device_context.cc:260] device: 0, cuDNN Version: 7.5.
W0602 03:40:13.205840 393 device_context.h:155] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.6, but CUDNN version in your machine is 7.5, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
W0602 03:40:15.150069 393 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 29. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 29.
epoch: 0, batch: 0, train loss: 161.631409
epoch: 0, batch: 1, train loss: 250.135147
epoch: 0, batch: 2, train loss: 328.794434
epoch: 0, batch: 3, train loss: 318.444031
epoch: 0, batch: 4, train loss: 342.752563
epoch: 0, batch: 5, train loss: 407.027008
epoch: 0, batch: 6, train loss: 512.739380
epoch: 0, batch: 7, train loss: 706.538940
----------Begin test...
/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "train.py", line 142, in <module>
    main()
  File "train.py", line 138, in main
    train()
  File "train.py", line 133, in train
    test_off=args.test_off)
  File "/DeepSpeech/model_utils/model.py", line 366, in train
    fetch_list=[ctc_loss])
  File "/DeepSpeech/model_utils/model.py", line 217, in test
    return_numpy=False)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 1154, in _run_impl
    use_program_cache=use_program_cache)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 1229, in _run_program
    fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7   paddle::framework::Executor::RunPartialPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, long, long, bool, bool, bool)
8   paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
9   paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)
Python Call Stacks (More useful to users):
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/framework.py", line 2610, in append_op attrs=kwargs.get("attrs", None)) File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/layer_helper.py", line 43, in append_op return self.main_program.current_block().append_op(*args, **kwargs) File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/layers/nn.py", line 2928, in conv2d "data_format": data_format, File "/DeepSpeech/model_utils/network.py", line 47, in conv_bn_layer bias_attr=False) File "/DeepSpeech/model_utils/network.py", line 299, in conv_group name='layer_0', ) File "/DeepSpeech/model_utils/network.py", line 408, in deep_speech_v2_network masks=masks) File "/DeepSpeech/model_utils/model.py", line 145, in create_network share_rnn_weights=self._share_rnn_weights) File "/DeepSpeech/model_utils/model.py", line 296, in train test_reader, _, ctc_loss = self.create_network() File "train.py", line 133, in train test_off=args.test_off) File "train.py", line 138, in main train() File "train.py", line 142, in main()
Error Message Summary:
ExternalError: Cudnn error, CUDNN_STATUS_EXECUTION_FAILED at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:300)
  [operator < conv2d > error]
Failed in training!
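In case it helps narrow things down, below is a minimal sketch that exercises the cuDNN conv2d kernel on its own, outside of DeepSpeech. The input shape, filter size, stride, and padding are only rough guesses at the first conv layer, not values taken from model_utils/network.py:

```python
# Standalone cuDNN conv2d check (Paddle 1.8 fluid API). All layer hyperparameters
# here are illustrative guesses, not the exact DeepSpeech2 configuration.
import numpy as np
import paddle.fluid as fluid

# Input shaped like a small spectrogram batch: (batch, channel, freq, time).
audio = fluid.data(name='audio', shape=[None, 1, 161, 200], dtype='float32')
conv = fluid.layers.conv2d(
    input=audio,
    num_filters=32,
    filter_size=(41, 11),
    stride=(2, 3),
    padding=(20, 5),
    use_cudnn=True)  # same cuDNN kernel that raised CUDNN_STATUS_EXECUTION_FAILED

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

x = np.random.rand(4, 1, 161, 200).astype('float32')
out, = exe.run(feed={'audio': x}, fetch_list=[conv])
print(out.shape)
```

If this also fails with CUDNN_STATUS_EXECUTION_FAILED, then the cuDNN 7.5 vs 7.6 mismatch from the warning above is probably the culprit rather than anything in the DeepSpeech code.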