Image classification training error (#4710) · Issue · PaddlePaddle / models

Image classification training error

Created by: mcl-stone

Running the command above produces the following error:

[root@72b2d0dbbbe6 image_classification]# python3 train.py --data_dir=./data/mask/ --total_images=186 --class_dim=2 --validate=True --model=ResNet50_vd --batch_size=8 --lr_strategy=cosine_decay --lr=0.1 --num_epochs=200 --model_save_dir=output/ --l2_decay=7e-5 --use_mixup=True --use_label_smoothing=True --label_smoothing_epsilon=0.1
2020-06-19 10:08:37,855-INFO: ------------- Configuration Arguments -------------
2020-06-19 10:08:37,855-INFO: batch_size : 8
2020-06-19 10:08:37,855-INFO: checkpoint : None
2020-06-19 10:08:37,855-INFO: class_dim : 2
2020-06-19 10:08:37,855-INFO: data_dir : ./data/mask/
2020-06-19 10:08:37,855-INFO: data_format : NCHW
2020-06-19 10:08:37,855-INFO: decay_epochs : 2.4
2020-06-19 10:08:37,855-INFO: decay_rate : 0.97
2020-06-19 10:08:37,855-INFO: drop_connect_rate : 0.2
2020-06-19 10:08:37,855-INFO: ema_decay : 0.9999
2020-06-19 10:08:37,855-INFO: enable_ce : False
2020-06-19 10:08:37,855-INFO: finetune_exclude_pretrained_params : None
2020-06-19 10:08:37,856-INFO: fuse_bn_act_ops : False
2020-06-19 10:08:37,856-INFO: fuse_elewise_add_act_ops : False
2020-06-19 10:08:37,856-INFO: image_mean : [0.485, 0.456, 0.406]
2020-06-19 10:08:37,856-INFO: image_shape : [3, 224, 224]
2020-06-19 10:08:37,856-INFO: image_std : [0.229, 0.224, 0.225]
2020-06-19 10:08:37,856-INFO: interpolation : None
2020-06-19 10:08:37,856-INFO: is_profiler : False
2020-06-19 10:08:37,856-INFO: l2_decay : 7e-05
2020-06-19 10:08:37,856-INFO: label_smoothing_epsilon : 0.1
2020-06-19 10:08:37,856-INFO: lower_ratio : 0.75
2020-06-19 10:08:37,856-INFO: lower_scale : 0.08
2020-06-19 10:08:37,856-INFO: lr : 0.1
2020-06-19 10:08:37,856-INFO: lr_strategy : cosine_decay
2020-06-19 10:08:37,856-INFO: max_iter : 0
2020-06-19 10:08:37,856-INFO: mixup_alpha : 0.2
2020-06-19 10:08:37,856-INFO: model : ResNet50_vd
2020-06-19 10:08:37,856-INFO: model_save_dir : output/
2020-06-19 10:08:37,856-INFO: momentum_rate : 0.9
2020-06-19 10:08:37,856-INFO: num_epochs : 200
2020-06-19 10:08:37,857-INFO: padding_type : SAME
2020-06-19 10:08:37,857-INFO: pretrained_model : None
2020-06-19 10:08:37,857-INFO: print_step : 10
2020-06-19 10:08:37,857-INFO: profiler_path : ./profilier_files
2020-06-19 10:08:37,857-INFO: random_seed : None
2020-06-19 10:08:37,857-INFO: reader_buf_size : 8
2020-06-19 10:08:37,857-INFO: reader_thread : 8
2020-06-19 10:08:37,857-INFO: resize_short_size : 256
2020-06-19 10:08:37,857-INFO: same_feed : 0
2020-06-19 10:08:37,857-INFO: save_step : 1
2020-06-19 10:08:37,857-INFO: scale_loss : 1.0
2020-06-19 10:08:37,857-INFO: step_epochs : [30, 60, 90]
2020-06-19 10:08:37,857-INFO: test_batch_size : 8
2020-06-19 10:08:37,857-INFO: total_images : 186
2020-06-19 10:08:37,857-INFO: upper_ratio : 1.3333333333333333
2020-06-19 10:08:37,857-INFO: use_aa : False
2020-06-19 10:08:37,857-INFO: use_dali : False
2020-06-19 10:08:37,857-INFO: use_dynamic_loss_scaling : True
2020-06-19 10:08:37,857-INFO: use_ema : False
2020-06-19 10:08:37,857-INFO: use_fp16 : False
2020-06-19 10:08:37,857-INFO: use_gpu : True
2020-06-19 10:08:37,858-INFO: use_label_smoothing : 1
2020-06-19 10:08:37,858-INFO: use_mixup : 1
2020-06-19 10:08:37,858-INFO: use_se : True
2020-06-19 10:08:37,858-INFO: validate : 1
2020-06-19 10:08:37,858-INFO: warm_up_epochs : 5.0
2020-06-19 10:08:37,858-INFO: ----------------------------------------------------
W0619 10:08:39.362601 5105 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.0
W0619 10:08:39.366940 5105 device_context.cc:260] device: 0, cuDNN Version: 7.6.
2020-06-19 10:08:41,431-WARNING: img(./data/mask/train/0/aada8594b5c91353a100d936490ecd3d.jpg) is None, pass it.
2020-06-19 10:08:41,884-INFO: [Pass 0, train batch 0] loss 0.65583, lr 0.10000, elapse 0.4723 sec
/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "train.py", line 304, in <module>
    main()
  File "train.py", line 300, in main
    train(args)
  File "train.py", line 250, in train
    fetch_list=train_fetch_list)
  File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
    return_merged=return_merged)
  File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7 paddle::framework::details::ComputationOpHandle::RunImpl()
8 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9 paddle::framework::details::FastThreadedSSAGraphExecutor::RunTracedOps(std::vector<paddle::framework::details::OpHandleBase*, std::allocator<paddle::framework::details::OpHandleBase*> > const&)
10 paddle::framework::details::FastThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)
11 paddle::framework::details::ScopeBufferedMonitor::Apply(std::function<void ()> const&, bool)
12 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)
13 paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)

Python Call Stacks (More useful to users):

File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/framework.py", line 2610, in append_op
  attrs=kwargs.get("attrs", None))
File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
  return self.main_program.current_block().append_op(*args, **kwargs)
File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/layers/nn.py", line 2933, in conv2d
  "data_format": data_format,
File "/host/Documents/models-release-1.8/models-release-1.8/PaddleCV/image_classification/models/resnet_vd.py", line 146, in conv_bn_layer
  bias_attr=False)
File "/host/Documents/models-release-1.8/models-release-1.8/PaddleCV/image_classification/models/resnet_vd.py", line 67, in net
  name='conv1_1')
File "/host/Documents/models-release-1.8/models-release-1.8/PaddleCV/image_classification/build_model.py", line 98, in _mixup_model
  net_out = model.net(input=image, class_dim=args.class_dim)
File "/host/Documents/models-release-1.8/models-release-1.8/PaddleCV/image_classification/build_model.py", line 125, in create_model
  loss_out = _mixup_model(data, model, args, is_train)
File "train.py", line 65, in build_program
  data_loader, loss_out = create_model(model, args, is_train)
File "train.py", line 166, in train
  args=args)
File "train.py", line 300, in main
  train(args)
File "train.py", line 304, in <module>
  main()

Error Message Summary:

ExternalError: Cudnn error, CUDNN_STATUS_EXECUTION_FAILED at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:300) [operator < conv2d > error]

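I then ran an environment check, which reported no errors. The dataset consists of images for mask classification. Any help resolving this would be appreciated!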
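
The log above includes a warning that one training image was read as None and skipped. Whether that is related to the cuDNN failure is not known from this report, but it may be worth ruling out damaged files in the dataset. The following is a minimal sketch, assuming a directory layout like the `./data/mask/train/<label>/*.jpg` path seen in the warning; `find_suspect_images` is a hypothetical helper, not part of the PaddlePaddle models repository. It only flags files that are empty or do not begin with the JPEG start-of-image marker, and it does not decode the images, so a file that passes can still be unreadable.

```python
import os

# Every JPEG file begins with the two-byte start-of-image (SOI) marker.
JPEG_SOI = b"\xff\xd8"


def find_suspect_images(root, extensions=(".jpg", ".jpeg")):
    """Return (path, reason) pairs for files that are empty or lack a JPEG header.

    Hypothetical helper for illustration; it checks only the first two bytes.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue  # skip files that are not named like JPEGs
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                suspects.append((path, "empty file"))
                continue
            with open(path, "rb") as f:
                header = f.read(2)
            if header != JPEG_SOI:
                suspects.append((path, "missing JPEG start marker"))
    return suspects


if __name__ == "__main__":
    # Assumed dataset location, taken from the path in the warning above.
    for path, reason in find_suspect_images("./data/mask/train"):
        print(path, reason)
```

Any file it reports could be re-downloaded or removed before retrying; a clean result here does not by itself explain the CUDNN_STATUS_EXECUTION_FAILED error.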
