Image classification training fails with an error
Created by: mcl-stone
Running the command fails with the error below:
[root@72b2d0dbbbe6 image_classification]# python3 train.py --data_dir=./data/mask/ --total_images=186 --class_dim=2 --validate=True --model=ResNet50_vd --batch_size=8 --lr_strategy=cosine_decay --lr=0.1 --num_epochs=200 --model_save_dir=output/ --l2_decay=7e-5 --use_mixup=True --use_label_smoothing=True --label_smoothing_epsilon=0.1
2020-06-19 10:08:37,855-INFO: ------------- Configuration Arguments -------------
2020-06-19 10:08:37,855-INFO: batch_size : 8
2020-06-19 10:08:37,855-INFO: checkpoint : None
2020-06-19 10:08:37,855-INFO: class_dim : 2
2020-06-19 10:08:37,855-INFO: data_dir : ./data/mask/
2020-06-19 10:08:37,855-INFO: data_format : NCHW
2020-06-19 10:08:37,855-INFO: decay_epochs : 2.4
2020-06-19 10:08:37,855-INFO: decay_rate : 0.97
2020-06-19 10:08:37,855-INFO: drop_connect_rate : 0.2
2020-06-19 10:08:37,855-INFO: ema_decay : 0.9999
2020-06-19 10:08:37,855-INFO: enable_ce : False
2020-06-19 10:08:37,855-INFO: finetune_exclude_pretrained_params : None
2020-06-19 10:08:37,856-INFO: fuse_bn_act_ops : False
2020-06-19 10:08:37,856-INFO: fuse_elewise_add_act_ops : False
2020-06-19 10:08:37,856-INFO: image_mean : [0.485, 0.456, 0.406]
2020-06-19 10:08:37,856-INFO: image_shape : [3, 224, 224]
2020-06-19 10:08:37,856-INFO: image_std : [0.229, 0.224, 0.225]
2020-06-19 10:08:37,856-INFO: interpolation : None
2020-06-19 10:08:37,856-INFO: is_profiler : False
2020-06-19 10:08:37,856-INFO: l2_decay : 7e-05
2020-06-19 10:08:37,856-INFO: label_smoothing_epsilon : 0.1
2020-06-19 10:08:37,856-INFO: lower_ratio : 0.75
2020-06-19 10:08:37,856-INFO: lower_scale : 0.08
2020-06-19 10:08:37,856-INFO: lr : 0.1
2020-06-19 10:08:37,856-INFO: lr_strategy : cosine_decay
2020-06-19 10:08:37,856-INFO: max_iter : 0
2020-06-19 10:08:37,856-INFO: mixup_alpha : 0.2
2020-06-19 10:08:37,856-INFO: model : ResNet50_vd
2020-06-19 10:08:37,856-INFO: model_save_dir : output/
2020-06-19 10:08:37,856-INFO: momentum_rate : 0.9
2020-06-19 10:08:37,856-INFO: num_epochs : 200
2020-06-19 10:08:37,857-INFO: padding_type : SAME
2020-06-19 10:08:37,857-INFO: pretrained_model : None
2020-06-19 10:08:37,857-INFO: print_step : 10
2020-06-19 10:08:37,857-INFO: profiler_path : ./profilier_files
2020-06-19 10:08:37,857-INFO: random_seed : None
2020-06-19 10:08:37,857-INFO: reader_buf_size : 8
2020-06-19 10:08:37,857-INFO: reader_thread : 8
2020-06-19 10:08:37,857-INFO: resize_short_size : 256
2020-06-19 10:08:37,857-INFO: same_feed : 0
2020-06-19 10:08:37,857-INFO: save_step : 1
2020-06-19 10:08:37,857-INFO: scale_loss : 1.0
2020-06-19 10:08:37,857-INFO: step_epochs : [30, 60, 90]
2020-06-19 10:08:37,857-INFO: test_batch_size : 8
2020-06-19 10:08:37,857-INFO: total_images : 186
2020-06-19 10:08:37,857-INFO: upper_ratio : 1.3333333333333333
2020-06-19 10:08:37,857-INFO: use_aa : False
2020-06-19 10:08:37,857-INFO: use_dali : False
2020-06-19 10:08:37,857-INFO: use_dynamic_loss_scaling : True
2020-06-19 10:08:37,857-INFO: use_ema : False
2020-06-19 10:08:37,857-INFO: use_fp16 : False
2020-06-19 10:08:37,857-INFO: use_gpu : True
2020-06-19 10:08:37,858-INFO: use_label_smoothing : 1
2020-06-19 10:08:37,858-INFO: use_mixup : 1
2020-06-19 10:08:37,858-INFO: use_se : True
2020-06-19 10:08:37,858-INFO: validate : 1
2020-06-19 10:08:37,858-INFO: warm_up_epochs : 5.0
2020-06-19 10:08:37,858-INFO: ----------------------------------------------------
W0619 10:08:39.362601 5105 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.0
W0619 10:08:39.366940 5105 device_context.cc:260] device: 0, cuDNN Version: 7.6.
2020-06-19 10:08:41,431-WARNING: img(./data/mask/train/0/aada8594b5c91353a100d936490ecd3d.jpg) is None, pass it.
2020-06-19 10:08:41,884-INFO: [Pass 0, train batch 0] loss 0.65583, lr 0.10000, elapse 0.4723 sec
/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "train.py", line 304, in <module>
main()
File "train.py", line 300, in main
train(args)
File "train.py", line 250, in train
fetch_list=train_fetch_list)
File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/usr/local/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
return_merged=return_merged)
File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7 paddle::framework::details::ComputationOpHandle::RunImpl()
8 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9 paddle::framework::details::FastThreadedSSAGraphExecutor::RunTracedOps(std::vector<paddle::framework::details::OpHandleBase*, std::allocator<paddle::framework::details::OpHandleBase*> > const&)
10 paddle::framework::details::FastThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)
11 paddle::framework::details::ScopeBufferedMonitor::Apply(std::function<void ()> const&, bool)
12 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)
13 paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)
Python Call Stacks (More useful to users):
File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/framework.py", line 2610, in append_op
attrs=kwargs.get("attrs", None))
File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
return self.main_program.current_block().append_op(*args, **kwargs)
File "/usr/local/lib64/python3.6/site-packages/paddle/fluid/layers/nn.py", line 2933, in conv2d
"data_format": data_format,
File "/host/Documents/models-release-1.8/models-release-1.8/PaddleCV/image_classification/models/resnet_vd.py", line 146, in conv_bn_layer
bias_attr=False)
File "/host/Documents/models-release-1.8/models-release-1.8/PaddleCV/image_classification/models/resnet_vd.py", line 67, in net
name='conv1_1')
File "/host/Documents/models-release-1.8/models-release-1.8/PaddleCV/image_classification/build_model.py", line 98, in _mixup_model
net_out = model.net(input=image, class_dim=args.class_dim)
File "/host/Documents/models-release-1.8/models-release-1.8/PaddleCV/image_classification/build_model.py", line 125, in create_model
loss_out = _mixup_model(data, model, args, is_train)
File "train.py", line 65, in build_program
data_loader, loss_out = create_model(model, args, is_train)
File "train.py", line 166, in train
args=args)
File "train.py", line 300, in main
train(args)
File "train.py", line 304, in <module>
main()
Error Message Summary:
ExternalError: Cudnn error, CUDNN_STATUS_EXECUTION_FAILED at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:300) [operator < conv2d > error]
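
The log above also contains a warning that one training image was read as None and skipped. Whether that is related to the cuDNN failure is not established here, but it is cheap to rule out damaged files in the dataset. The following is a minimal sketch, not part of the PaddleCV code: the helper names `looks_like_image` and `find_suspect_images` are hypothetical, and the check only compares the first bytes of each file against known JPEG/PNG/BMP signatures, which is a rougher test than actually decoding the image.

```python
import os

# File extensions treated as images (an assumption; adjust to the dataset).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")

# Leading magic bytes of common image formats. This is a rough check only:
# a file can pass it and still fail to decode.
SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"BM",                 # BMP
)


def looks_like_image(path):
    """Return True if the file starts with a known image signature."""
    try:
        with open(path, "rb") as f:
            head = f.read(8)
    except OSError:
        # Unreadable files are treated as suspect.
        return False
    return any(head.startswith(sig) for sig in SIGNATURES)


def find_suspect_images(root):
    """Walk `root` and return a sorted list of image-named files whose
    header does not match any known signature (including empty files)."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(IMAGE_EXTENSIONS):
                continue  # skip label lists and other non-image files
            path = os.path.join(dirpath, name)
            if not looks_like_image(path):
                suspects.append(path)
    return sorted(suspects)
```

For the dataset in this report, calling `find_suspect_images("./data/mask/train")` would list any files worth inspecting or re-downloading, such as the one named in the warning.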