add vit + mask rcnn (#7592)

3f7e70d4 · Wenyu · GitHub · 9daf164b · 3f7e70d4 · 3f7e70d4
4 changed file
--- a/configs/vitdet/README.md
+++ b/configs/vitdet/README.md
@@ -13,11 +13,14 @@ non-trivial when new architectures, such as Vision Transformer (ViT) models, arr

 ## Model Zoo

-| Backbone | Pretrained | Model | Scheduler | Images/GPU  | Box AP | Config | Download |
-|:------:|:--------:|:--------------:|:--------------:|:--------------:|:------:|:------:|:--------:|
-| ViT-base | CAE | Cascade RCNN  | 1x | 1 | 52.7 | [config](./cascade_rcnn_vit_base_hrfpn_cae_1x_coco.yml) | [model](https://bj.bcebos.com/v1/paddledet/models/cascade_rcnn_vit_base_hrfpn_cae_1x_coco.pdparams) |
-| ViT-large | CAE | Cascade RCNN  | 1x | 1 | 55.7 | [config](./cascade_rcnn_vit_large_hrfpn_cae_1x_coco.yml) | [model](https://bj.bcebos.com/v1/paddledet/models/cascade_rcnn_vit_large_hrfpn_cae_1x_coco.pdparams) |
-| ViT-base | CAE | PP-YOLOE  | 36e | 2 | 52.2 | [config](./ppyoloe_vit_base_csppan_cae_36e_coco.yml) | [model](https://bj.bcebos.com/v1/paddledet/models/ppyoloe_vit_base_csppan_cae_36e_coco.pdparams) |
+| Model | Backbone | Pretrained | Scheduler | Images/GPU  | Box AP | Mask AP | Config | Download |
+|:------:|:--------:|:--------------:|:--------------:|:--------------:|:--------------:|:------:|:------:|:--------:|
+| Cascade RCNN | ViT-base | CAE | 1x | 1 | 52.7 | - | [config](./cascade_rcnn_vit_base_hrfpn_cae_1x_coco.yml) | [model](https://bj.bcebos.com/v1/paddledet/models/cascade_rcnn_vit_base_hrfpn_cae_1x_coco.pdparams) |
+| Cascade RCNN | ViT-large | CAE | 1x | 1 | 55.7 | - | [config](./cascade_rcnn_vit_large_hrfpn_cae_1x_coco.yml) | [model](https://bj.bcebos.com/v1/paddledet/models/cascade_rcnn_vit_large_hrfpn_cae_1x_coco.pdparams) |
+| PP-YOLOE | ViT-base | CAE | 36e | 2 | 52.2 | - | [config](./ppyoloe_vit_base_csppan_cae_36e_coco.yml) | [model](https://bj.bcebos.com/v1/paddledet/models/ppyoloe_vit_base_csppan_cae_36e_coco.pdparams) |
+| Mask RCNN | ViT-base | CAE | 1x | 1 | 50.6 | 44.9 | [config](./mask_rcnn_vit_base_hrfpn_cae_1x_coco.yml) | [model](https://bj.bcebos.com/v1/paddledet/models/mask_rcnn_vit_base_hrfpn_cae_1x_coco.pdparams) |
+| Mask RCNN | ViT-large | CAE | 1x | 1 | 54.2 | 47.4 | [config](./mask_rcnn_vit_large_hrfpn_cae_1x_coco.yml) | [model](https://bj.bcebos.com/v1/paddledet/models/mask_rcnn_vit_large_hrfpn_cae_1x_coco.pdparams) |
+

 **Notes:**
 - Model is trained on COCO train2017 dataset and evaluated on val2017 results of `mAP(IoU=0.5:0.95)

--- a/configs/vitdet/faster_rcnn_vit_base_fpn_cae_1x_coco.yml
+++ b/configs/vitdet/faster_rcnn_vit_base_fpn_cae_1x_coco.yml
@@ -2,7 +2,7 @@
 _BASE_: [
  '../datasets/coco_detection.yml',
  '../runtime.yml',
-  './_base_/reader.yml',
+  './_base_/faster_rcnn_reader.yml',
  './_base_/optimizer_base_1x.yml'
 ]

@@ -81,15 +81,30 @@ RPNHead:
    nms_thresh: 0.7
    pre_nms_top_n: 1000
    post_nms_top_n: 1000
+  loss_rpn_bbox: SmoothL1Loss
+
+
+SmoothL1Loss:
+  beta: 0.1111111111111111


 BBoxHead:
-  head: TwoFCHead
+  # head: TwoFCHead
+  head: XConvNormHead
  roi_extractor:
    resolution: 7
    sampling_ratio: 0
    aligned: True
  bbox_assigner: BBoxAssigner
+  loss_normalize_pos: True
+  bbox_loss: GIoULoss
+
+
+GIoULoss:
+  loss_weight: 10.
+  reduction: 'none'
+  eps: 0.000001 # 1e-6
+

 BBoxAssigner:
  batch_size_per_im: 512
@@ -98,8 +113,13 @@ BBoxAssigner:
  fg_fraction: 0.25
  use_random: True

-TwoFCHead:
-  out_channel: 1024
+# TwoFCHead:
+#   out_channel: 1024
+
+XConvNormHead:
+  num_convs: 4
+  norm_type: bn
+

 BBoxPostProcess:
  decode: RCNNBox

--- a/configs/vitdet/mask_rcnn_vit_large_hrfpn_cae_1x_coco.yml
+++ b/configs/vitdet/mask_rcnn_vit_large_hrfpn_cae_1x_coco.yml
+_BASE_: [
+  './mask_rcnn_vit_base_hrfpn_cae_1x_coco.yml'
+]
+
+weights: output/mask_rcnn_vit_large_hrfpn_cae_1x_coco/model_final
+
+
+depth: &depth 24
+dim: &dim 1024
+use_fused_allreduce_gradients: &use_checkpoint True
+
+VisionTransformer:
+  img_size: [800, 1344]
+  embed_dim: *dim
+  depth: *depth
+  num_heads: 16
+  drop_path_rate: 0.25
+  out_indices: [7, 11, 15, 23]
+  use_checkpoint: *use_checkpoint
+  pretrained: https://bj.bcebos.com/v1/paddledet/models/pretrained/vit_large_cae_pretrained.pdparams
+
+HRFPN:
+  in_channels: [*dim, *dim, *dim, *dim]
+
+OptimizerBuilder:
+  optimizer:
+    layer_decay: 0.9
+    weight_decay: 0.02
+    num_layers: *depth
--- a/ppdet/modeling/backbones/vision_transformer.py
+++ b/ppdet/modeling/backbones/vision_transformer.py
@@ -509,16 +509,24 @@ class VisionTransformer(nn.Layer):
        dim = x.shape[-1]
        # we add a small number to avoid floating point error in the interpolation
        # see discussion at https://github.com/facebookresearch/dino/issues/8
-        w0, h0 = w0 + 0.1, h0 + 0.1
+        # w0, h0 = w0 + 0.1, h0 + 0.1
+        # patch_pos_embed = nn.functional.interpolate(
+        #     patch_pos_embed.reshape([
+        #         1, self.patch_embed.num_patches_w,
+        #         self.patch_embed.num_patches_h, dim
+        #     ]).transpose((0, 3, 1, 2)),
+        #     scale_factor=(w0 / self.patch_embed.num_patches_w,
+        #                   h0 / self.patch_embed.num_patches_h),
+        #     mode='bicubic', )

        patch_pos_embed = nn.functional.interpolate(
            patch_pos_embed.reshape([
                1, self.patch_embed.num_patches_w,
                self.patch_embed.num_patches_h, dim
            ]).transpose((0, 3, 1, 2)),
-            scale_factor=(w0 / self.patch_embed.num_patches_w,
-                          h0 / self.patch_embed.num_patches_h),
+            (w0, h0),
            mode='bicubic', )
+
        assert int(w0) == patch_pos_embed.shape[-2] and int(
            h0) == patch_pos_embed.shape[-1]
        patch_pos_embed = patch_pos_embed.transpose(