aistudio: cascade_rcnn_mobilenetv3_fpn training error (#1450) · Issue · PaddlePaddle / PaddleDetection

Created by: Fauny

Paddle version: None
Paddle With CUDA: None
OS: Ubuntu 18.04
Python version: 3.6.9
CUDA version: 10.0.326
cuDNN version: 7.6.3
Nvidia driver version: None

Custom dataset; the error occurs with both 640 and 320 input sizes.
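
Because the dataset is custom, it may be worth ruling out annotation problems before looking at the model. The sketch below assumes a COCO-format annotation file, such as the one named by anno_path in the TrainReader config further down. It counts images without any usable box and boxes with non-positive width or height. The function name summarize_coco is hypothetical, and whether such entries have anything to do with this particular error is not established here.

```python
import json
from collections import Counter


def summarize_coco(coco):
    """Return basic sanity statistics for a parsed COCO-style annotation dict."""
    image_ids = {img["id"] for img in coco.get("images", [])}
    boxes_per_image = Counter()
    degenerate = []
    for ann in coco.get("annotations", []):
        _, _, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        if w <= 0 or h <= 0:
            degenerate.append(ann["id"])  # zero or negative extent
            continue
        boxes_per_image[ann["image_id"]] += 1
    return {
        "num_images": len(image_ids),
        "num_categories": len(coco.get("categories", [])),
        "images_without_valid_boxes": sorted(image_ids - set(boxes_per_image)),
        "degenerate_annotation_ids": degenerate,
    }


# Usage (path assembled from dataset_dir and anno_path in the config; adjust as needed):
# with open("dataset/data50/annotations/instance_train.json") as f:
#     print(summarize_coco(json.load(f)))
```

If the report lists many images without valid boxes, filtering them out and retraining would be one way to test whether they matter.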

2020-09-21 11:38:22,275-INFO: iter: 39600, lr: 0.003861, 'loss_cls_0': '0.000000', 'loss_loc_0': '3.746526', 'loss_cls_1': '0.000000', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.000000', 'loss_loc_2': '0.000000', 'loss_rpn_cls': '0.250100', 'loss_rpn_bbox': '0.029995', 'loss': '4.017196', time: 0.072, eta: 9:10:32
Traceback (most recent call last):
  File "tools/train.py", line 372, in <module>
    main()
  File "tools/train.py", line 245, in main
    outs = exe.run(compiled_train_prog, fetch_list=train_values)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1082, in run
    six.reraise(*sys.exc_info())
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1080, in run
    return_merged=return_merged)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1178, in _run_impl
    return_merged=return_merged)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 893, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_noavx.EnforceNotMet: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2   paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
3   paddle::operators::ElementwiseAddKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
4   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::ElementwiseAddKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::ElementwiseAddKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::ElementwiseAddKernel<paddle::platform::CUDADeviceContext, int>, paddle::operators::ElementwiseAddKernel<paddle::platform::CUDADeviceContext, long>, paddle::operators::ElementwiseAddKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
6   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
7   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
8   paddle::framework::details::ComputationOpHandle::RunImpl()
9   paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
10  paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue<unsigned long> > const&, unsigned long*)
11  std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
12  std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
13  ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const

------------------------------------------
Python Call Stacks (More useful to users):
------------------------------------------
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2798, in append_op
    attrs=kwargs.get("attrs", None))
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 135, in append_bias_op
    attrs={'axis': dim_start})
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 363, in fc
    pre_activation = helper.append_bias_op(pre_bias, dim_start=num_flatten_dims)
  File "/home/aistudio/cascade_rcnn_mobilenetv3_fpn/PaddleDetection/ppdet/modeling/roi_heads/cascade_head.py", line 352, in __call__
    regularizer=L2Decay(0.)))
  File "/home/aistudio/cascade_rcnn_mobilenetv3_fpn/PaddleDetection/ppdet/modeling/roi_heads/cascade_head.py", line 79, in get_output
    head_feat = self.head(roi_feat, wb_scalar, name)
  File "/home/aistudio/cascade_rcnn_mobilenetv3_fpn/PaddleDetection/ppdet/modeling/architectures/cascade_rcnn.py", line 158, in build
    name='_' + str(i + 1) if i > 0 else '')
  File "/home/aistudio/cascade_rcnn_mobilenetv3_fpn/PaddleDetection/ppdet/modeling/architectures/cascade_rcnn.py", line 327, in train
    return self.build(feed_vars, 'train')
  File "tools/train.py", line 117, in main
    train_fetches = model.train(feed_vars)
  File "tools/train.py", line 372, in <module>
    main()

----------------------
Error Message Summary:
----------------------
Error: When calling this method, the Tensor's numel must be equal or larger than zero. Please check Tensor::dims, or Tensor::Resize has been called first. The Tensor's shape is [-1, 128] now
  [Hint: Expected numel() >= 0, but received numel():-128 < 0:0.] at (/paddle/paddle/fluid/framework/tensor.cc:45)
  [operator < elementwise_add > error]
terminate called without an active exception
W0921 11:38:23.122191  1643 init.cc:235] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0921 11:38:23.122239  1643 init.cc:237] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0921 11:38:23.122243  1643 init.cc:240] The detail failure signal is:

W0921 11:38:23.122251  1643 init.cc:243] *** Aborted at 1600659503 (unix time) try "date -d @1600659503" if you are using GNU date ***
W0921 11:38:23.124091  1643 init.cc:243] PC: @                0x0 (unknown)
W0921 11:38:23.124198  1643 init.cc:243] *** SIGABRT (@0x3e80000062e) received by PID 1582 (TID 0x7f5bacc3f700) from PID 1582; stack trace: ***
W0921 11:38:23.125437  1643 init.cc:243]     @     0x7f5bbdc28390 (unknown)
W0921 11:38:23.126608  1643 init.cc:243]     @     0x7f5bbd882428 gsignal
W0921 11:38:23.127753  1643 init.cc:243]     @     0x7f5bbd88402a abort
W0921 11:38:23.128615  1643 init.cc:243]     @     0x7f5b7e60184a __gnu_cxx::__verbose_terminate_handler()
W0921 11:38:23.129341  1643 init.cc:243]     @     0x7f5b7e5fff47 __cxxabiv1::__terminate()
W0921 11:38:23.130137  1643 init.cc:243]     @     0x7f5b7e5fff7d std::terminate()
W0921 11:38:23.130897  1643 init.cc:243]     @     0x7f5b7e5ffc5a __gxx_personality_v0
W0921 11:38:23.131584  1643 init.cc:243]     @     0x7f5b7e8f2b97 _Unwind_ForcedUnwind_Phase2
W0921 11:38:23.132267  1643 init.cc:243]     @     0x7f5b7e8f2e7d _Unwind_ForcedUnwind
W0921 11:38:23.133441  1643 init.cc:243]     @     0x7f5bbdc27070 __GI___pthread_unwind
W0921 11:38:23.134588  1643 init.cc:243]     @     0x7f5bbdc1f845 __pthread_exit
W0921 11:38:23.134867  1643 init.cc:243]     @     0x55de55a09e59 PyThread_exit_thread
W0921 11:38:23.134949  1643 init.cc:243]     @     0x55de5588fc17 PyEval_RestoreThread.cold.798
W0921 11:38:23.136456  1643 init.cc:243]     @     0x7f5b40694e39 pybind11::gil_scoped_release::~gil_scoped_release()
W0921 11:38:23.136919  1643 init.cc:243]     @     0x7f5b407e0a0c _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybind10BindReaderEPNS_6moduleEEUlRNS2_9operators6reader40OrderedMultiDeviceLoDTensorBlockingQueueERKSt6vectorINS2_9framework9LoDTensorESaISC_EEE2_bIS9_SG_EINS_4nameENS_9is_methodENS_7siblingENS_10call_guardIINS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES11_
W0921 11:38:23.138195  1643 init.cc:243]     @     0x7f5b406b1666 pybind11::cpp_function::dispatcher()
W0921 11:38:23.138525  1643 init.cc:243]     @     0x55de5598b744 _PyMethodDef_RawFastCallKeywords
W0921 11:38:23.138772  1643 init.cc:243]     @     0x55de5598b861 _PyCFunction_FastCallKeywords
W0921 11:38:23.139005  1643 init.cc:243]     @     0x55de559f76e8 _PyEval_EvalFrameDefault
W0921 11:38:23.139221  1643 init.cc:243]     @     0x55de5593b81a _PyEval_EvalCodeWithName
W0921 11:38:23.139432  1643 init.cc:243]     @     0x55de5593c635 _PyFunction_FastCallDict
W0921 11:38:23.139658  1643 init.cc:243]     @     0x55de559f4232 _PyEval_EvalFrameDefault
W0921 11:38:23.139865  1643 init.cc:243]     @     0x55de5598accb _PyFunction_FastCallKeywords
W0921 11:38:23.140095  1643 init.cc:243]     @     0x55de559f2a93 _PyEval_EvalFrameDefault
W0921 11:38:23.140291  1643 init.cc:243]     @     0x55de5598accb _PyFunction_FastCallKeywords
W0921 11:38:23.140516  1643 init.cc:243]     @     0x55de559f2a93 _PyEval_EvalFrameDefault
W0921 11:38:23.140728  1643 init.cc:243]     @     0x55de5593c56b _PyFunction_FastCallDict
W0921 11:38:23.140941  1643 init.cc:243]     @     0x55de5595ae53 _PyObject_Call_Prepend
W0921 11:38:23.141172  1643 init.cc:243]     @     0x55de5594ddbe PyObject_Call
W0921 11:38:23.141268  1643 init.cc:243]     @     0x55de55a4a817 t_bootstrap
W0921 11:38:23.141319  1643 init.cc:243]     @     0x55de55a05788 pythread_wrapper
W0921 11:38:23.142601  1643 init.cc:243]     @     0x7f5bbdc1e6ba start_thread
Aborted (core dumped)
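
For readers unfamiliar with the message in the error summary: the reported element count is just the product of the reported shape. The first dimension is still the placeholder -1 when the elementwise_add kernel tries to allocate its output, and the second dimension, 128, matches mlp_dim of CascadeTwoFCHead in the config. A minimal sketch of that arithmetic (the helper numel here is plain Python for illustration, not Paddle's API):

```python
from functools import reduce
from operator import mul


def numel(shape):
    """Number of elements implied by a shape: the product of its dimensions."""
    return reduce(mul, shape, 1)


shape = [-1, 128]          # shape quoted in the error message
print(numel(shape))        # -128, the value the enforce check rejects
print(numel(shape) >= 0)   # False
```

This only restates what the check tests; it does not explain why the first dimension was never resolved to a real size.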

Configuration file:

architecture: CascadeRCNN
max_iters: 500000
snapshot_iter: 2000
use_gpu: true
log_smooth_window: 200
save_dir: output
pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV3_large_x1_0_ssld_pretrained.tar
weights: output/big/model_final
metric: COCO
num_classes: 1

CascadeRCNN:
  backbone: MobileNetV3RCNN
  fpn: FPN
  rpn_head: FPNRPNHead
  roi_extractor: FPNRoIAlign
  bbox_head: CascadeBBoxHead
  bbox_assigner: CascadeBBoxAssigner

MobileNetV3RCNN:
  norm_type: bn
  freeze_norm: true
  norm_decay: 0.0
  feature_maps: [2, 3, 4]
  conv_decay: 0.00001
  lr_mult_list: [1.0, 1.0, 1.0, 1.0, 1.0]
  scale: 1.0
  model_name: large

FPN:
  min_level: 2
  max_level: 6
  num_chan: 48
  has_extra_convs: true
  spatial_scale: [0.0625, 0.125, 0.25]

FPNRPNHead:
  anchor_generator:
    anchor_sizes: [32, 64, 128, 256, 512]
    aspect_ratios: [0.5, 1.0, 2.0]
    stride: [16.0, 16.0]
    variance: [1.0, 1.0, 1.0, 1.0]
  anchor_start_size: 24
  min_level: 2
  max_level: 6
  num_chan: 48
  rpn_target_assign:
    rpn_batch_size_per_im: 256
    rpn_fg_fraction: 0.5
    rpn_positive_overlap: 0.7
    rpn_negative_overlap: 0.3
    rpn_straddle_thresh: 0.0
  train_proposal:
    min_size: 0.0
    nms_thresh: 0.7
    pre_nms_top_n: 2000
    post_nms_top_n: 2000
  test_proposal:
    min_size: 0.0
    nms_thresh: 0.7
    pre_nms_top_n: 300
    post_nms_top_n: 100

FPNRoIAlign:
  canconical_level: 4
  canonical_size: 224
  min_level: 2
  max_level: 5
  box_resolution: 7
  sampling_ratio: 2

CascadeBBoxAssigner:
  batch_size_per_im: 512
  bbox_reg_weights: [10, 20, 30]
  bg_thresh_lo: [0.0, 0.0, 0.0]
  bg_thresh_hi: [0.5, 0.6, 0.7]
  fg_thresh: [0.5, 0.6, 0.7]
  fg_fraction: 0.25

CascadeBBoxHead:
  head: CascadeTwoFCHead
  bbox_loss: BalancedL1Loss
  nms:
    keep_top_k: 100
    nms_threshold: 0.5
    score_threshold: 0.05

BalancedL1Loss:
  alpha: 0.5
  gamma: 1.5
  beta: 1.0
  loss_weight: 1.0

CascadeTwoFCHead:
  mlp_dim: 128

LearningRate:
  base_lr: 0.005
  schedulers:
  - !CosineDecay
    max_iters: 125000
  - !LinearWarmup
    start_factor: 0.1
    steps: 500

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.00004
    type: L2

TrainReader:
  inputs_def:
    fields: ['image', 'im_info', 'im_id', 'gt_bbox', 'gt_class', 'is_crowd']
  dataset:
    !COCODataSet
    image_dir: train
    anno_path: annotations/instance_train.json
    dataset_dir: dataset/data50
  sample_transforms:
  - !DecodeImage
    to_rgb: true
  - !RandomFlipImage
    prob: 0.5
  - !AutoAugmentImage
    autoaug_type: v1
  - !NormalizeImage
    is_channel_first: false
    is_scale: true
    mean: [0.485,0.456,0.406]
    std: [0.229, 0.224,0.225]
  - !ResizeImage
    target_size: [416, 448, 480, 512, 544, 576, 608, 640, 672]
    max_size: 1000
    interp: 1
    use_cv2: true
  - !Permute
    to_bgr: false
    channel_first: true
  batch_transforms:
  - !PadBatch
    pad_to_stride: 32
    use_padded_im_info: false
  batch_size: 2
  shuffle: true
  worker_num: 2
  use_process: false


TestReader:
  inputs_def:
    # set image_shape if needed
    fields: ['image', 'im_info', 'im_id', 'im_shape']
  dataset:
    !ImageFolder
    anno_path: annotations/instance_val.json
  sample_transforms:
  - !DecodeImage
    to_rgb: true
    with_mixup: false
  - !NormalizeImage
    is_channel_first: false
    is_scale: true
    mean: [0.485,0.456,0.406]
    std: [0.229, 0.224,0.225]
  - !ResizeImage
    interp: 1
    max_size: 640
    target_size: 640
    use_cv2: true
  - !Permute
    channel_first: true
    to_bgr: false
  batch_transforms:
  - !PadBatch
    pad_to_stride: 32
    use_padded_im_info: true
  batch_size: 1
  shuffle: false



EvalReader:
  inputs_def:
    fields: ['image', 'im_info', 'im_id', 'im_shape']
    # for voc
    #fields: ['image', 'im_info', 'im_id', 'gt_bbox', 'gt_class', 'is_difficult']
  dataset:
    !COCODataSet
    image_dir: val
    anno_path: annotations/instance_val.json
    dataset_dir: dataset/data50
  sample_transforms:
  - !DecodeImage
    to_rgb: true
    with_mixup: false
  - !NormalizeImage
    is_channel_first: false
    is_scale: true
    mean: [0.485,0.456,0.406]
    std: [0.229, 0.224,0.225]
  - !ResizeImage
    interp: 1
    max_size: 640
    target_size: 640
    use_cv2: true
  - !Permute
    channel_first: true
    to_bgr: false
  batch_transforms:
  - !PadBatch
    pad_to_stride: 32
    use_padded_im_info: true
  batch_size: 1
  shuffle: false
  drop_empty: false
  worker_num: 2
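
As a cross-check that the log and the config belong together, the learning rate printed at iter 39600 can be reproduced from the LearningRate block. The sketch assumes the usual half-cosine formula and ignores warmup, which ends after 500 steps. The function name cosine_lr is hypothetical.

```python
import math


def cosine_lr(base_lr, step, max_iters):
    """Half-cosine decay: base_lr * 0.5 * (cos(pi * step / max_iters) + 1)."""
    return base_lr * 0.5 * (math.cos(math.pi * step / max_iters) + 1.0)


base_lr = 0.005            # LearningRate.base_lr
cosine_max_iters = 125000  # max_iters of the CosineDecay scheduler
step = 39600               # last iteration printed in the log

lr = cosine_lr(base_lr, step, cosine_max_iters)
print(f"{lr:.6f}")         # 0.003861, the lr value in the log line
```

Note that the scheduler's max_iters (125000) differs from the top-level max_iters (500000) in this config.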
