Internal change

PiperOrigin-RevId: 394772382

2f43cff2 · Xianzhi Du · A. Unique TensorFlower · bff5aad0
4 changed files
--- a/official/README.md
+++ b/official/README.md
@@ -52,6 +52,7 @@ In the near future, we will add:
 | [Mask R-CNN](vision/beta/MODEL_GARDEN.md) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) |
 | [ShapeMask](vision/detection) | [ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors](https://arxiv.org/abs/1904.03239) |
 | [SpineNet](vision/beta/MODEL_GARDEN.md) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) |
+| [Cascade RCNN-RS and RetinaNet-RS](vision/beta/MODEL_GARDEN.md) | [Simple Training Strategies and Model Scaling for Object Detection](https://arxiv.org/abs/2107.00057)|

 ### Natural Language Processing


--- a/official/vision/beta/MODEL_GARDEN.md
+++ b/official/vision/beta/MODEL_GARDEN.md
@@ -54,9 +54,12 @@ depth, label smoothing and dropout.

 ### Common Settings and Notes

-* We provide models based on two detection frameworks, [RetinaNet](https://arxiv.org/abs/1708.02002)
-  or [Mask R-CNN](https://arxiv.org/abs/1703.06870), and two backbones, [ResNet-FPN](https://arxiv.org/abs/1612.03144)
-  or [SpineNet](https://arxiv.org/abs/1912.05027).
+* We provide models adopting [ResNet-FPN](https://arxiv.org/abs/1612.03144) and
+  [SpineNet](https://arxiv.org/abs/1912.05027) backbones  based on detection frameworks:
+  * [RetinaNet](https://arxiv.org/abs/1708.02002) and [RetinaNet-RS](https://arxiv.org/abs/2107.00057)
+  * [Mask R-CNN](https://arxiv.org/abs/1703.06870)
+  * [Cascade RCNN](https://arxiv.org/abs/1712.00726) and [Cascade RCNN-RS](https://arxiv.org/abs/2107.00057)
+
 * Models are all trained on COCO train2017 and evaluated on COCO val2017.
 * Training details:
  * Models finetuned from ImageNet pretrained checkpoints adopt the 12 or 36
@@ -99,13 +102,22 @@ depth, label smoothing and dropout.

 ### Instance Segmentation Baselines

-#### Mask R-CNN (ImageNet pretrained)
-
 #### Mask R-CNN (Trained from scratch)

 | Backbone     | Resolution    | Epochs  | FLOPs (B)  | Params (M) | Box AP | Mask AP | Download |
 | ------------ |:-------------:| -------:|-----------:|-----------:|-------:|--------:|---------:|
-| SpineNet-49  | 640x640       |  350    | 215.7      | 40.8       | 42.6   | 37.9    | config   |
+ResNet50-FPN | 640x640    | 350    | 227.7     | 46.3       | 42.3   | 37.6    | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/maskrcnn/r50fpn_640_coco_scratch_tpu4x4.yaml) |
+| SpineNet-49  | 640x640       |  350    | 215.7      | 40.8       | 42.6   | 37.9    | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/maskrcnn/coco_spinenet49_mrcnn_tpu.yaml) |
+SpineNet-96  | 1024x1024  | 500    | 315.0     | 55.2       | 48.1   | 42.4    | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/maskrcnn/coco_spinenet96_mrcnn_tpu.yaml) |
+SpineNet-143 | 1280x1280  | 500    | 498.8     | 79.2       | 49.3   | 43.4    | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/maskrcnn/coco_spinenet143_mrcnn_tpu.yaml) |
+
+
+#### Cascade RCNN-RS (Trained from scratch)
+
+backbone     | resolution | epochs | params (M) | box AP | mask AP | download
+------------ | :--------: | -----: | ---------: | -----: | ------: | -------:
+SpineNet-49  | 640x640    | 500    | 56.4       | 46.4   | 40.0    | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/maskrcnn/coco_spinenet49_cascadercnn_tpu.yaml)|
+SpineNet-143 | 1280x1280  | 500    | 94.9       | 51.9   | 45.0    | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/maskrcnn/coco_spinenet143_cascadercnn_tpu.yaml)|

 ## Semantic Segmentation

@@ -131,7 +143,7 @@ depth, label smoothing and dropout.

 ### Common Settings and Notes

-* We provide models for video classification with two backbones: 
+* We provide models for video classification with two backbones:
  [SlowOnly](https://arxiv.org/abs/1812.03982) and 3D-ResNet (R3D) used in
  [Spatiotemporal Contrastive Video Representation Learning](https://arxiv.org/abs/2008.03800).
 * Training and evaluation details:

--- a/official/vision/beta/configs/maskrcnn.py
+++ b/official/vision/beta/configs/maskrcnn.py
@@ -13,7 +13,7 @@
 # limitations under the License.

 # Lint as: python3
-"""Mask R-CNN configuration definition."""
+"""R-CNN(-RS) configuration definition."""

 import dataclasses
 import os
@@ -432,7 +432,7 @@ def maskrcnn_spinenet_coco() -> cfg.ExperimentConfig:

 @exp_factory.register_config_factory('cascadercnn_spinenet_coco')
 def cascadercnn_spinenet_coco() -> cfg.ExperimentConfig:
-  """COCO object detection with Cascade R-CNN with SpineNet backbone."""
+  """COCO object detection with Cascade RCNN-RS with SpineNet backbone."""
  steps_per_epoch = 463
  coco_val_samples = 5000
  train_batch_size = 256

--- a/official/vision/beta/modeling/maskrcnn_model.py
+++ b/official/vision/beta/modeling/maskrcnn_model.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-"""Mask R-CNN model."""
+"""R-CNN(-RS) models."""

 from typing import Any, List, Mapping, Optional, Tuple, Union

@@ -24,7 +24,7 @@ from official.vision.beta.ops import box_ops

 @tf.keras.utils.register_keras_serializable(package='Vision')
 class MaskRCNNModel(tf.keras.Model):
-  """The Mask R-CNN model."""
+  """The Mask R-CNN(-RS) and Cascade RCNN-RS models."""

  def __init__(self,
               backbone: tf.keras.Model,
@@ -48,7 +48,7 @@ class MaskRCNNModel(tf.keras.Model):
               aspect_ratios: Optional[List[float]] = None,
               anchor_size: Optional[float] = None,
               **kwargs):
-    """Initializes the Mask R-CNN model.
+    """Initializes the R-CNN(-RS) model.

    Args:
      backbone: `tf.keras.Model`, the backbone network.
@@ -65,19 +65,18 @@ class MaskRCNNModel(tf.keras.Model):
      mask_roi_aligner: the ROI alginer for mask prediction.
      class_agnostic_bbox_pred: if True, perform class agnostic bounding box
        prediction. Needs to be `True` for Cascade RCNN models.
-      cascade_class_ensemble: if True, ensemble classification scores over
-        all detection heads.
+      cascade_class_ensemble: if True, ensemble classification scores over all
+        detection heads.
      min_level: Minimum level in output feature maps.
      max_level: Maximum level in output feature maps.
-      num_scales: A number representing intermediate scales added
-        on each level. For instances, num_scales=2 adds one additional
-        intermediate anchor scales [2^0, 2^0.5] on each level.
-      aspect_ratios: A list representing the aspect raito
-        anchors added on each level. The number indicates the ratio of width to
-        height. For instances, aspect_ratios=[1.0, 2.0, 0.5] adds three anchors
-        on each scale level.
-      anchor_size: A number representing the scale of size of the base
-        anchor to the feature stride 2^level.
+      num_scales: A number representing intermediate scales added on each level.
+        For instances, num_scales=2 adds one additional intermediate anchor
+        scales [2^0, 2^0.5] on each level.
+      aspect_ratios: A list representing the aspect raito anchors added on each
+        level. The number indicates the ratio of width to height. For instances,
+        aspect_ratios=[1.0, 2.0, 0.5] adds three anchors on each scale level.
+      anchor_size: A number representing the scale of size of the base anchor to
+        the feature stride 2^level.
      **kwargs: keyword arguments to be passed.
    """
    super(MaskRCNNModel, self).__init__(**kwargs)
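
Note on the anchor arguments documented in the `__init__` docstring above: the sketch below shows one plausible reading of how `num_scales`, `aspect_ratios`, and `anchor_size` combine into per-level anchor shapes. It is an illustration only, not the Model Garden anchor generator. The helper name `anchor_shapes` and the example values passed to it are assumptions. The formula takes the docstring at face value: the base size is `anchor_size` times the feature stride `2^level`, intermediate scales are `2^(i / num_scales)`, and each aspect ratio is width divided by height.

```python
import math


def anchor_shapes(min_level, max_level, num_scales, aspect_ratios, anchor_size):
    """Hypothetical helper: returns {level: [(width, height), ...]}.

    Assumes the base anchor side is anchor_size * 2**level, scaled by
    2**(i / num_scales) for i in range(num_scales), and that each aspect
    ratio is width / height with the anchor area kept constant.
    """
    shapes = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        per_level = []
        for scale_idx in range(num_scales):
            # num_scales=2 yields the scales [2^0, 2^0.5] from the docstring.
            scale = 2 ** (scale_idx / num_scales)
            base = anchor_size * stride * scale
            for ratio in aspect_ratios:
                # Keep the area at base**2 while setting width / height = ratio.
                width = base * math.sqrt(ratio)
                height = base / math.sqrt(ratio)
                per_level.append((width, height))
        shapes[level] = per_level
    return shapes


# Example values are assumptions chosen for illustration.
shapes = anchor_shapes(
    min_level=3,
    max_level=7,
    num_scales=2,
    aspect_ratios=[1.0, 2.0, 0.5],
    anchor_size=4.0,
)
for level, anchors in shapes.items():
    print(level, len(anchors), anchors[0])
```

Under these assumptions each level carries `num_scales * len(aspect_ratios)` anchor shapes per location, which is six for the example values.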