Paddle classification training on T4 fails with CUDNN_STATUS_EXECUTION_FAILED
Created by: chenqiuhui
- Version and environment info: 1) PaddlePaddle version: installed with conda install --yes paddlepaddle-gpu=1.6.2=py36_gpu_cuda9.0_many_linux
The host driver reports Driver Version: 418.116.00, CUDA Version: 10.1, GPU: T4. The image has CUDA Version 9.0.176 and cuDNN 7.3.1.
- Training info: single T4 card, 15079.69 MB of GPU memory, batch size 1
- Problem description: Training runs normally on V100, but on T4 it fails with the error below. It reproduces every time:
/opt/conda/lib/python3.6/site-packages/paddle/fluid/executor.py:779: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "train.py", line 435, in
    estimator.fit()
  File "train.py", line 274, in fit
    fetch_list=train_fetch_list)
  File "/opt/conda/lib/python3.6/site-packages/paddle/fluid/executor.py", line 780, in run
    six.reraise(*sys.exc_info())
  File "/opt/conda/lib/python3.6/site-packages/six.py", line 686, in reraise
    raise value
  File "/opt/conda/lib/python3.6/site-packages/paddle/fluid/executor.py", line 775, in run
    use_program_cache=use_program_cache)
  File "/opt/conda/lib/python3.6/site-packages/paddle/fluid/executor.py", line 834, in _run_impl
    return_numpy=return_numpy)
  File "/opt/conda/lib/python3.6/site-packages/paddle/fluid/executor.py", line 674, in _run_parallel
    tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString(std::string&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
2 paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3 std::_Function_handler, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7 paddle::framework::details::ComputationOpHandle::RunImpl()
8 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9 paddle::framework::details::FastThreadedSSAGraphExecutor::RunTracedOps(std::vector > const&)
10 paddle::framework::details::FastThreadedSSAGraphExecutor::Run(std::vector > const&)
11 paddle::framework::details::ScopeBufferedMonitor::Apply(std::function const&, bool)
12 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector > const&)
13 paddle::framework::ParallelExecutor::Run(std::vector > const&)
Python Call Stacks (More useful to users):
  File "/opt/conda/lib/python3.6/site-packages/paddle/fluid/framework.py", line 2488, in append_op
    attrs=kwargs.get("attrs", None))
  File "/opt/conda/lib/python3.6/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/paddle/fluid/layers/nn.py", line 2803, in conv2d
    "data_format": data_format,
  File "/tmp/bml/data/24c08aa70df44057bb1f1dc20503ae0a/image_classification/models/resnet.py", line 126, in conv_bn_layer
    name=name + '.conv2d.output.1')
  File "/tmp/bml/data/24c08aa70df44057bb1f1dc20503ae0a/image_classification/models/resnet.py", line 56, in net
    name="conv1")
  File "/tmp/bml/data/24c08aa70df44057bb1f1dc20503ae0a/image_classification/build_model.py", line 39, in _basic_model
    net_out = model.net(input=image, class_dim=args.class_dim)
  File "/tmp/bml/data/24c08aa70df44057bb1f1dc20503ae0a/image_classification/build_model.py", line 120, in create_model
    loss_out = _basic_model(data, model, args, is_train)
  File "train.py", line 341, in build_program
    data_loader, loss_out = create_model(model, args, is_train)
  File "train.py", line 213, in fit
    args=args)
  File "train.py", line 435, in
    estimator.fit()
Error Message Summary:
Error: CUDNN_STATUS_EXECUTION_FAILED at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:288) [operator < conv2d > error]
Log at startup:
W0115 09:30:22.522198 27 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 9.0
W0115 09:30:22.525074 27 device_context.cc:244] device: 0, cuDNN Version: 7.3.
I0115 09:30:26.071838 27 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0115 09:30:26.101428 27 build_strategy.cc:363] SeqOnlyAllReduceOps:0, num_trainers:1
I0115 09:30:26.133617 27 parallel_executor.cc:285] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0115 09:30:26.151190 27 parallel_executor.cc:368] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
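
One detail in the startup log may be worth checking, though it is only a possible lead and not a confirmed cause: the device reports CUDA Capability 75 (Turing), while the Runtime API Version is 9.0. Native Turing support arrived with CUDA 10.0, whereas V100 (capability 70) is covered by CUDA 9.0, which would be consistent with the job working on V100 and failing on T4. The sketch below parses the device_context.cc line and flags that combination. The helper names (`parse_device_line`, `runtime_predates_gpu`) and the small lookup table are illustrative assumptions, not part of Paddle.

```python
import re

# Assumption: earliest CUDA toolkit release with native support for a given
# compute capability (Volta 7.0 -> CUDA 9.0, Turing 7.5 -> CUDA 10.0).
MIN_RUNTIME_FOR_CAPABILITY = {70: (9, 0), 75: (10, 0)}

# Matches the fields printed by the device_context.cc line in the log above.
LOG_PATTERN = re.compile(
    r"CUDA Capability: (\d+), "
    r"Driver API Version: (\d+)\.(\d+), "
    r"Runtime API Version: (\d+)\.(\d+)"
)


def parse_device_line(line):
    """Extract capability, driver and runtime versions; None if no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    cap, d_major, d_minor, r_major, r_minor = (int(g) for g in match.groups())
    return {
        "capability": cap,
        "driver": (d_major, d_minor),
        "runtime": (r_major, r_minor),
    }


def runtime_predates_gpu(info):
    """True if the CUDA runtime is older than the GPU architecture needs.

    Returns None when the capability is not in the lookup table.
    """
    required = MIN_RUNTIME_FOR_CAPABILITY.get(info["capability"])
    if required is None:
        return None
    return info["runtime"] < required


line = (
    "W0115 09:30:22.522198 27 device_context.cc:236] Please NOTE: device: 0, "
    "CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 9.0"
)
info = parse_device_line(line)
print(info)
print(runtime_predates_gpu(info))
```

If the check returns True, one thing to try would be a paddlepaddle-gpu build and image based on CUDA 10 instead of the cuda9.0 package, to see whether the conv2d failure still occurs on T4.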