From 470986b60ba11322ed896dbd7d4d664a0f403781 Mon Sep 17 00:00:00 2001
From: Dan Kondratyuk
Date: Mon, 21 Mar 2022 12:29:27 -0700
Subject: [PATCH] Add video classification with MoViNets to official readme.

PiperOrigin-RevId: 436276225
---
 official/README.md                   |  8 ++++++++
 official/projects/movinet/README.md  |  3 +--
 official/vision/beta/MODEL_GARDEN.md | 16 +++++++++++++++-
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/official/README.md b/official/README.md
index 001d48ff8..06f09d4e5 100644
--- a/official/README.md
+++ b/official/README.md
@@ -20,6 +20,7 @@ In the near future, we will add:
 * State-of-the-art language understanding models.
 * State-of-the-art image classification models.
 * State-of-the-art object detection and instance segmentation models.
+* State-of-the-art video classification models.
 
 ## Table of Contents
 
@@ -27,6 +28,7 @@ In the near future, we will add:
   * [Computer Vision](#computer-vision)
     + [Image Classification](#image-classification)
     + [Object Detection and Segmentation](#object-detection-and-segmentation)
+    + [Video Classification](#video-classification)
   * [Natural Language Processing](#natural-language-processing)
   * [Recommendation](#recommendation)
 - [How to get started with the official models](#how-to-get-started-with-the-official-models)
@@ -55,6 +57,12 @@ In the near future, we will add:
 | [SpineNet](vision/beta/MODEL_GARDEN.md) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) |
 | [Cascade RCNN-RS and RetinaNet-RS](vision/beta/MODEL_GARDEN.md) | [Simple Training Strategies and Model Scaling for Object Detection](https://arxiv.org/abs/2107.00057)|
 
+#### Video Classification
+
+| Model | Reference (Paper) |
+|-------|-------------------|
+| [Mobile Video Networks (MoViNets)](projects/movinet) | [MoViNets: Mobile Video Networks for Efficient Video Recognition](https://arxiv.org/abs/2103.11511) |
+
 ### Natural Language Processing
 
 | Model | Reference (Paper) |
diff --git a/official/projects/movinet/README.md b/official/projects/movinet/README.md
index 0e72f7459..36bcfe89d 100644
--- a/official/projects/movinet/README.md
+++ b/official/projects/movinet/README.md
@@ -176,8 +176,7 @@ devices.
 See the [TF Lite Example](#tf-lite-example) to export and run your own models.
 We also provide [quantized TF Lite binaries via TF Hub](https://tfhub.dev/s?deployment-format=lite&q=movinet).
 For reference, MoViNet-A0-Stream runs with a similar latency to
-[MobileNetV3-Large]
-(https://tfhub.dev/google/imagenet/mobilenet_v3_large_100_224/classification/)
+[MobileNetV3-Large](https://tfhub.dev/google/imagenet/mobilenet_v3_large_100_224/classification/)
 with +5% accuracy on Kinetics 600.
 
 | Model Name | Input Shape | Pixel 4 Latency\* | x86 Latency\* | TF Lite Binary |
diff --git a/official/vision/beta/MODEL_GARDEN.md b/official/vision/beta/MODEL_GARDEN.md
index ebb0cf280..d8bd43d9e 100644
--- a/official/vision/beta/MODEL_GARDEN.md
+++ b/official/vision/beta/MODEL_GARDEN.md
@@ -171,8 +171,10 @@ evaluated on [COCO](https://cocodataset.org/) val2017.
     [Spatiotemporal Contrastive Video Representation Learning](https://arxiv.org/abs/2008.03800).
   * ResNet-3D-RS (R3D-RS) in
     [Revisiting 3D ResNets for Video Recognition](https://arxiv.org/pdf/2109.01696.pdf).
+  * Mobile Video Networks (MoViNets) in
+    [MoViNets: Mobile Video Networks for Efficient Video Recognition](https://arxiv.org/abs/2103.11511).
 
-* Training and evaluation details:
+* Training and evaluation details (SlowFast and ResNet):
   * All models are trained from scratch with vision modality (RGB) for 200
     epochs.
   * We use batch size of 1024 and cosine learning rate decay with linear warmup
@@ -192,6 +194,12 @@ evaluated on [COCO](https://cocodataset.org/) val2017.
 | R3D-RS-152 | 32 x 2 | 79.9 | 94.3 | -
 | R3D-RS-200 | 32 x 2 | 80.4 | 94.4 | -
 | R3D-RS-200 | 48 x 2 | 81.0 | - | -
+| MoViNet-A0-Base | 50 x 5 | 69.40 | 89.18 | -
+| MoViNet-A1-Base | 50 x 5 | 74.57 | 92.03 | -
+| MoViNet-A2-Base | 50 x 5 | 75.91 | 92.63 | -
+| MoViNet-A3-Base | 120 x 2 | 79.34 | 94.52 | -
+| MoViNet-A4-Base | 80 x 3 | 80.64 | 94.93 | -
+| MoViNet-A5-Base | 120 x 2 | 81.39 | 95.06 | -
 
 ### Kinetics-600 Action Recognition Baselines
 
@@ -201,3 +209,9 @@ evaluated on [COCO](https://cocodataset.org/) val2017.
 | R3D-50 | 32 x 2 | 79.5 | 94.8 | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/video_classification/k600_3d-resnet50_tpu.yaml) |
 | R3D-RS-200 | 32 x 2 | 83.1 | - | -
 | R3D-RS-200 | 48 x 2 | 83.8 | - | -
+| MoViNet-A0-Base | 50 x 5 | 72.05 | 90.92 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a0_k600_8x8.yaml) |
+| MoViNet-A1-Base | 50 x 5 | 76.69 | 93.40 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a1_k600_8x8.yaml) |
+| MoViNet-A2-Base | 50 x 5 | 78.62 | 94.17 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a2_k600_8x8.yaml) |
+| MoViNet-A3-Base | 120 x 2 | 81.79 | 95.67 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a3_k600_8x8.yaml) |
+| MoViNet-A4-Base | 80 x 3 | 83.48 | 96.16 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a4_k600_8x8.yaml) |
+| MoViNet-A5-Base | 120 x 2 | 84.27 | 96.39 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a5_k600_8x8.yaml) |
-- 
GitLab
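
Reviewer note: the "Input Shape" column in the tables this patch adds reads as frames × temporal stride (e.g. MoViNet-A0-Base takes 50 frames sampled every 5th frame). A minimal NumPy sketch of assembling such a clip tensor, assuming A0's 172×172 input resolution from the MoViNet paper (the helper name `sample_clip` and the dummy video are illustrative, not part of the patch):

```python
import numpy as np

# "50 x 5" in the table: 50 frames at temporal stride 5.
# 172x172 resolution is an assumption based on MoViNet-A0's published config.
NUM_FRAMES, STRIDE, SIZE = 50, 5, 172

def sample_clip(video: np.ndarray, num_frames: int = NUM_FRAMES,
                stride: int = STRIDE) -> np.ndarray:
    """Subsample a [T, H, W, 3] video into a fixed-length clip, wrapping if short."""
    indices = (np.arange(num_frames) * stride) % video.shape[0]
    return video[indices]

# Dummy video: 300 frames of 172x172 RGB in [0, 1].
video = np.random.rand(300, SIZE, SIZE, 3).astype(np.float32)
clip = sample_clip(video)
batch = clip[np.newaxis]  # Model input layout: [batch, frames, H, W, 3]
print(batch.shape)  # (1, 50, 172, 172, 3)
```

The same sampling logic applies to the other rows by swapping in their frames/stride pairs (e.g. 120 × 2 for A3/A5, 80 × 3 for A4).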