CUDNN_STATUS_EXECUTION_FAILED error during training (#4374) · Issue · PaddlePaddle / models

Created by: chaogehah

I'm on Ubuntu 16.04 with an RTX 2080 Ti, and I'm running training inside an Anaconda virtual environment. The network is ResNet50_vd. Versions: paddle 1.7.0.post97 + CUDA 9.0 + cuDNN 7.3.1.
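For reference, the `.post97` suffix in the version string appears to encode the CUDA/cuDNN pair the wheel was built against. The sketch below decodes it under the assumption that paddlepaddle-gpu 1.x wheels follow the convention "last digit = cuDNN major version, preceding digits = CUDA major version"; the helper name `decode_post_suffix` is hypothetical and not part of Paddle.

```python
import re


def decode_post_suffix(version):
    """Return (cuda_major, cudnn_major) from a version such as '1.7.0.post97'.

    Assumption: the digits after '.post' are <CUDA major><cuDNN major>,
    with cuDNN taking the last digit. Returns None if there is no suffix.
    """
    match = re.search(r"\.post(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    return int(digits[:-1]), int(digits[-1])


if __name__ == "__main__":
    # Version string taken from the report above.
    print(decode_post_suffix("1.7.0.post97"))
```

Under that assumption the installed wheel targets CUDA 9 and cuDNN 7, which agrees with the CUDA 9.0 + cuDNN 7.3.1 setup described in the report.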

2020-03-05 14:27:22,808-INFO: ------------- Configuration Arguments -------------
2020-03-05 14:27:22,808-INFO: batch_size : 16
2020-03-05 14:27:22,808-INFO: checkpoint : None
2020-03-05 14:27:22,808-INFO: class_dim : 2
2020-03-05 14:27:22,808-INFO: data_dir : ./data/ILSVRC2012/
2020-03-05 14:27:22,808-INFO: data_format : NCHW
2020-03-05 14:27:22,808-INFO: decay_epochs : 2.4
2020-03-05 14:27:22,808-INFO: decay_rate : 0.97
2020-03-05 14:27:22,808-INFO: drop_connect_rate : 0.2
2020-03-05 14:27:22,808-INFO: ema_decay : 0.9999
2020-03-05 14:27:22,808-INFO: enable_ce : False
2020-03-05 14:27:22,808-INFO: finetune_exclude_pretrained_params : None
2020-03-05 14:27:22,808-INFO: fuse_bn_act_ops : False
2020-03-05 14:27:22,808-INFO: fuse_elewise_add_act_ops : False
2020-03-05 14:27:22,808-INFO: image_mean : [0.485, 0.456, 0.406]
2020-03-05 14:27:22,808-INFO: image_shape : [3, 224, 224]
2020-03-05 14:27:22,808-INFO: image_std : [0.229, 0.224, 0.225]
2020-03-05 14:27:22,808-INFO: interpolation : None
2020-03-05 14:27:22,808-INFO: is_profiler : False
2020-03-05 14:27:22,808-INFO: l2_decay : 7e-05
2020-03-05 14:27:22,808-INFO: label_smoothing_epsilon : 0.1
2020-03-05 14:27:22,808-INFO: lower_ratio : 0.75
2020-03-05 14:27:22,808-INFO: lower_scale : 0.08
2020-03-05 14:27:22,808-INFO: lr : 0.1
2020-03-05 14:27:22,808-INFO: lr_strategy : cosine_decay
2020-03-05 14:27:22,808-INFO: max_iter : 0
2020-03-05 14:27:22,808-INFO: mixup_alpha : 0.2
2020-03-05 14:27:22,808-INFO: model : ResNet50_vd
2020-03-05 14:27:22,808-INFO: model_save_dir : output/
2020-03-05 14:27:22,808-INFO: momentum_rate : 0.9
2020-03-05 14:27:22,808-INFO: num_epochs : 200
2020-03-05 14:27:22,808-INFO: padding_type : SAME
2020-03-05 14:27:22,808-INFO: pretrained_model : None
2020-03-05 14:27:22,808-INFO: print_step : 10
2020-03-05 14:27:22,808-INFO: profiler_path : ./profilier_files
2020-03-05 14:27:22,808-INFO: random_seed : None
2020-03-05 14:27:22,808-INFO: reader_buf_size : 2048
2020-03-05 14:27:22,808-INFO: reader_thread : 8
2020-03-05 14:27:22,808-INFO: resize_short_size : 256
2020-03-05 14:27:22,808-INFO: same_feed : 0
2020-03-05 14:27:22,808-INFO: save_step : 1
2020-03-05 14:27:22,808-INFO: scale_loss : 1.0
2020-03-05 14:27:22,808-INFO: step_epochs : [30, 60, 90]
2020-03-05 14:27:22,808-INFO: test_batch_size : 8
2020-03-05 14:27:22,809-INFO: total_images : 476
2020-03-05 14:27:22,809-INFO: upper_ratio : 1.3333333333333333
2020-03-05 14:27:22,809-INFO: use_aa : False
2020-03-05 14:27:22,809-INFO: use_dali : False
2020-03-05 14:27:22,809-INFO: use_dynamic_loss_scaling : True
2020-03-05 14:27:22,809-INFO: use_ema : False
2020-03-05 14:27:22,809-INFO: use_fp16 : False
2020-03-05 14:27:22,809-INFO: use_gpu : True
2020-03-05 14:27:22,809-INFO: use_label_smoothing : 1
2020-03-05 14:27:22,809-INFO: use_mixup : 1
2020-03-05 14:27:22,809-INFO: use_se : True
2020-03-05 14:27:22,809-INFO: validate : 1
2020-03-05 14:27:22,809-INFO: warm_up_epochs : 5.0
2020-03-05 14:27:22,809-INFO: ----------------------------------------------------
W0305 14:27:26.086107 23366 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 9.0
W0305 14:27:26.087620 23366 device_context.cc:245] device: 0, cuDNN Version: 7.3.
I0305 14:27:26.850368 23366 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0305 14:27:26.869619 23366 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0305 14:27:26.908351 23366 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0305 14:27:26.920976 23366 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
2020-03-05 14:27:27,033-INFO: [Pass 0, train batch 0] loss 0.71246, lr 0.10000, elapse 0.2654 sec
/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py:782: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "train.py", line 304, in <module>
    main()
  File "train.py", line 300, in main
    train(args)
  File "train.py", line 250, in train
    fetch_list=train_fetch_list)
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py", line 783, in run
    six.reraise(*sys.exc_info())
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py", line 778, in run
    use_program_cache=use_program_cache)
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py", line 843, in _run_impl
    return_numpy=return_numpy)
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py", line 677, in _run_parallel
    tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):

0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7   paddle::framework::details::ComputationOpHandle::RunImpl()
8   paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9   paddle::framework::details::FastThreadedSSAGraphExecutor::RunTracedOps(std::vector<paddle::framework::details::OpHandleBase*, std::allocator<paddle::framework::details::OpHandleBase*> > const&)
10  paddle::framework::details::FastThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)
11  paddle::framework::details::ScopeBufferedMonitor::Apply(std::function<void ()> const&, bool)
12  paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)
13  paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)

Python Call Stacks (More useful to users):

  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2525, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 1405, in conv2d
    "data_format": data_format,
  File "/home/junc-lin/Downloads/models-develop/PaddleCV/image_classification/models/resnet_vd.py", line 146, in conv_bn_layer
    bias_attr=False)
  File "/home/junc-lin/Downloads/models-develop/PaddleCV/image_classification/models/resnet_vd.py", line 67, in net
    name='conv1_1')
  File "/home/junc-lin/Downloads/models-develop/PaddleCV/image_classification/build_model.py", line 98, in _mixup_model
    net_out = model.net(input=image, class_dim=args.class_dim)
  File "/home/junc-lin/Downloads/models-develop/PaddleCV/image_classification/build_model.py", line 125, in create_model
    loss_out = _mixup_model(data, model, args, is_train)
  File "train.py", line 65, in build_program
    data_loader, loss_out = create_model(model, args, is_train)
  File "train.py", line 166, in train
    args=args)
  File "train.py", line 300, in main
    train(args)
  File "train.py", line 304, in <module>
    main()

Error Message Summary:

Error: An error occurred here. There is no accurate error hint for this error yet. We are continuously in the process of increasing hint for this kind of error check. It would be helpful if you could inform us of how this conversion went by opening a github issue. And we will resolve it with high priority.

New issue link: https://github.com/PaddlePaddle/Paddle/issues/new
Recommended issue content: all error stack information
[Hint: CUDNN_STATUS_EXECUTION_FAILED] at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:286)
[operator < conv2d > error]
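One thing worth checking in the log is the `device_context.cc` warning, which reports CUDA Capability 75 alongside Runtime API Version 9.0. Compute capability 7.5 (Turing, which includes the RTX 2080 Ti) is only supported from CUDA 10.0 onward, so a CUDA 9.0 runtime is a plausible cause of the cuDNN failure, although nothing in this report confirms it. The sketch below parses that warning line and flags such a mismatch; the function name `check_device_log` and the lookup table are illustrative assumptions, not part of Paddle.

```python
import re

# Minimum CUDA toolkit (major, minor) able to target each compute capability.
# Illustrative subset only; extend it for other GPU generations.
MIN_CUDA_FOR_CAPABILITY = {
    60: (8, 0),   # Pascal
    61: (8, 0),   # Pascal
    70: (9, 0),   # Volta
    75: (10, 0),  # Turing, e.g. RTX 2080 Ti
    80: (11, 0),  # Ampere
}

# Matches the warning Paddle prints from device_context.cc.
LOG_PATTERN = re.compile(
    r"CUDA Capability: (\d+), Driver API Version: ([\d.]+), "
    r"Runtime API Version: ([\d.]+)"
)


def parse_version(text):
    """Turn '10.1' into (10, 1) so versions compare numerically."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def check_device_log(line):
    """Return a dict describing whether the runtime can target the GPU.

    Returns None if the line does not match or the capability is not
    in the table above.
    """
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    capability = int(match.group(1))
    runtime = parse_version(match.group(3))
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if required is None:
        return None
    return {
        "capability": capability,
        "runtime": runtime,
        "required": required,
        "ok": runtime >= required,
    }


if __name__ == "__main__":
    # Warning line copied from the training log in this issue.
    line = (
        "W0305 14:27:26.086107 23366 device_context.cc:237] Please NOTE: "
        "device: 0, CUDA Capability: 75, Driver API Version: 10.1, "
        "Runtime API Version: 9.0"
    )
    print(check_device_log(line))
```

For the line logged here the check reports `ok` as `False`, because capability 75 needs a CUDA 10.0 or newer runtime. If that is the cause, installing a Paddle build for CUDA 10 would be the thing to try, but this is an untested suggestion.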
