Opened Mar 05, 2020 by saxon_zh (@saxon_zh, Guest)

CUDNN_STATUS_EXECUTION_FAILED error during training

Created by: chaogehah

I am on Ubuntu 16.04 with an RTX 2080 Ti GPU, running training inside an Anaconda virtual environment. The network is ResNet50_vd, with paddle 1.7.0.post97 + CUDA 9.0 + cuDNN 7.3.1.

2020-03-05 14:27:22,808-INFO: ------------- Configuration Arguments -------------
2020-03-05 14:27:22,808-INFO: batch_size : 16
2020-03-05 14:27:22,808-INFO: checkpoint : None
2020-03-05 14:27:22,808-INFO: class_dim : 2
2020-03-05 14:27:22,808-INFO: data_dir : ./data/ILSVRC2012/
2020-03-05 14:27:22,808-INFO: data_format : NCHW
2020-03-05 14:27:22,808-INFO: decay_epochs : 2.4
2020-03-05 14:27:22,808-INFO: decay_rate : 0.97
2020-03-05 14:27:22,808-INFO: drop_connect_rate : 0.2
2020-03-05 14:27:22,808-INFO: ema_decay : 0.9999
2020-03-05 14:27:22,808-INFO: enable_ce : False
2020-03-05 14:27:22,808-INFO: finetune_exclude_pretrained_params : None
2020-03-05 14:27:22,808-INFO: fuse_bn_act_ops : False
2020-03-05 14:27:22,808-INFO: fuse_elewise_add_act_ops : False
2020-03-05 14:27:22,808-INFO: image_mean : [0.485, 0.456, 0.406]
2020-03-05 14:27:22,808-INFO: image_shape : [3, 224, 224]
2020-03-05 14:27:22,808-INFO: image_std : [0.229, 0.224, 0.225]
2020-03-05 14:27:22,808-INFO: interpolation : None
2020-03-05 14:27:22,808-INFO: is_profiler : False
2020-03-05 14:27:22,808-INFO: l2_decay : 7e-05
2020-03-05 14:27:22,808-INFO: label_smoothing_epsilon : 0.1
2020-03-05 14:27:22,808-INFO: lower_ratio : 0.75
2020-03-05 14:27:22,808-INFO: lower_scale : 0.08
2020-03-05 14:27:22,808-INFO: lr : 0.1
2020-03-05 14:27:22,808-INFO: lr_strategy : cosine_decay
2020-03-05 14:27:22,808-INFO: max_iter : 0
2020-03-05 14:27:22,808-INFO: mixup_alpha : 0.2
2020-03-05 14:27:22,808-INFO: model : ResNet50_vd
2020-03-05 14:27:22,808-INFO: model_save_dir : output/
2020-03-05 14:27:22,808-INFO: momentum_rate : 0.9
2020-03-05 14:27:22,808-INFO: num_epochs : 200
2020-03-05 14:27:22,808-INFO: padding_type : SAME
2020-03-05 14:27:22,808-INFO: pretrained_model : None
2020-03-05 14:27:22,808-INFO: print_step : 10
2020-03-05 14:27:22,808-INFO: profiler_path : ./profilier_files
2020-03-05 14:27:22,808-INFO: random_seed : None
2020-03-05 14:27:22,808-INFO: reader_buf_size : 2048
2020-03-05 14:27:22,808-INFO: reader_thread : 8
2020-03-05 14:27:22,808-INFO: resize_short_size : 256
2020-03-05 14:27:22,808-INFO: same_feed : 0
2020-03-05 14:27:22,808-INFO: save_step : 1
2020-03-05 14:27:22,808-INFO: scale_loss : 1.0
2020-03-05 14:27:22,808-INFO: step_epochs : [30, 60, 90]
2020-03-05 14:27:22,808-INFO: test_batch_size : 8
2020-03-05 14:27:22,809-INFO: total_images : 476
2020-03-05 14:27:22,809-INFO: upper_ratio : 1.3333333333333333
2020-03-05 14:27:22,809-INFO: use_aa : False
2020-03-05 14:27:22,809-INFO: use_dali : False
2020-03-05 14:27:22,809-INFO: use_dynamic_loss_scaling : True
2020-03-05 14:27:22,809-INFO: use_ema : False
2020-03-05 14:27:22,809-INFO: use_fp16 : False
2020-03-05 14:27:22,809-INFO: use_gpu : True
2020-03-05 14:27:22,809-INFO: use_label_smoothing : 1
2020-03-05 14:27:22,809-INFO: use_mixup : 1
2020-03-05 14:27:22,809-INFO: use_se : True
2020-03-05 14:27:22,809-INFO: validate : 1
2020-03-05 14:27:22,809-INFO: warm_up_epochs : 5.0
2020-03-05 14:27:22,809-INFO: ----------------------------------------------------
W0305 14:27:26.086107 23366 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 9.0
W0305 14:27:26.087620 23366 device_context.cc:245] device: 0, cuDNN Version: 7.3.
I0305 14:27:26.850368 23366 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0305 14:27:26.869619 23366 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0305 14:27:26.908351 23366 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0305 14:27:26.920976 23366 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
2020-03-05 14:27:27,033-INFO: [Pass 0, train batch 0] loss 0.71246, lr 0.10000, elapse 0.2654 sec
/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py:782: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "train.py", line 304, in <module>
    main()
  File "train.py", line 300, in main
    train(args)
  File "train.py", line 250, in train
    fetch_list=train_fetch_list)
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py", line 783, in run
    six.reraise(*sys.exc_info())
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py", line 778, in run
    use_program_cache=use_program_cache)
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py", line 843, in _run_impl
    return_numpy=return_numpy)
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/executor.py", line 677, in _run_parallel
    tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7 paddle::framework::details::ComputationOpHandle::RunImpl()
8 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9 paddle::framework::details::FastThreadedSSAGraphExecutor::RunTracedOps(std::vector<paddle::framework::details::OpHandleBase*, std::allocator<paddle::framework::details::OpHandleBase*> > const&)
10 paddle::framework::details::FastThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)
11 paddle::framework::details::ScopeBufferedMonitor::Apply(std::function<void ()> const&, bool)
12 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)
13 paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)


Python Call Stacks (More useful to users):

  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2525, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/home/junc-lin/anaconda3/envs/paddle_test/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 1405, in conv2d
    "data_format": data_format,
  File "/home/junc-lin/Downloads/models-develop/PaddleCV/image_classification/models/resnet_vd.py", line 146, in conv_bn_layer
    bias_attr=False)
  File "/home/junc-lin/Downloads/models-develop/PaddleCV/image_classification/models/resnet_vd.py", line 67, in net
    name='conv1_1')
  File "/home/junc-lin/Downloads/models-develop/PaddleCV/image_classification/build_model.py", line 98, in _mixup_model
    net_out = model.net(input=image, class_dim=args.class_dim)
  File "/home/junc-lin/Downloads/models-develop/PaddleCV/image_classification/build_model.py", line 125, in create_model
    loss_out = _mixup_model(data, model, args, is_train)
  File "train.py", line 65, in build_program
    data_loader, loss_out = create_model(model, args, is_train)
  File "train.py", line 166, in train
    args=args)
  File "train.py", line 300, in main
    train(args)
  File "train.py", line 304, in <module>
    main()


Error Message Summary:

Error: An error occurred here. There is no accurate error hint for this error yet. We are continuously in the process of increasing hint for this kind of error check. It would be helpful if you could inform us of how this conversion went by opening a github issue. And we will resolve it with high priority.

  • New issue link: https://github.com/PaddlePaddle/Paddle/issues/new
  • Recommended issue content: all error stack information
[Hint: CUDNN_STATUS_EXECUTION_FAILED] at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:286)
[operator < conv2d > error]
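
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the device_context log line above reports CUDA Capability 75 with Runtime API Version 9.0, and native support for compute capability 7.5 (Turing, which includes the RTX 2080 Ti) was first added in CUDA 10.0. The sketch below parses that log line and compares the runtime version against a small hand-written table of minimum CUDA versions. The table `MIN_CUDA_FOR_CAPABILITY` and the helper `check_device_line` are hypothetical names introduced for illustration and are not part of Paddle.

```python
import re

# Minimum CUDA toolkit version with native support for a given compute
# capability. Hand-written for illustration; only a few entries are listed.
MIN_CUDA_FOR_CAPABILITY = {
    60: (8, 0),   # Pascal
    61: (8, 0),   # Pascal
    70: (9, 0),   # Volta
    75: (10, 0),  # Turing, e.g. RTX 2080 Ti
}

# Matches the "Please NOTE" line that device_context.cc prints in the log above.
LOG_PATTERN = re.compile(
    r"CUDA Capability: (\d+), "
    r"Driver API Version: (\d+)\.(\d+), "
    r"Runtime API Version: (\d+)\.(\d+)"
)


def check_device_line(line):
    """Parse one log line.

    Returns (capability, runtime, required, ok), or None if the line does
    not match. `ok` is None when the capability is not in the table.
    """
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    capability = int(match.group(1))
    runtime = (int(match.group(4)), int(match.group(5)))
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    ok = None if required is None else runtime >= required
    return capability, runtime, required, ok


line = (
    "W0305 14:27:26.086107 23366 device_context.cc:237] Please NOTE: "
    "device: 0, CUDA Capability: 75, Driver API Version: 10.1, "
    "Runtime API Version: 9.0"
)
print(check_device_line(line))  # (75, (9, 0), (10, 0), False)
```

If the check reports `False`, as it does for the line in this report, trying a Paddle build compiled against CUDA 10.x with a matching cuDNN would be a reasonable next experiment. Whether that resolves this particular error is not confirmed here.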
Reference: paddlepaddle/models#4374