From 470986b60ba11322ed896dbd7d4d664a0f403781 Mon Sep 17 00:00:00 2001
From: Dan Kondratyuk
Date: Mon, 21 Mar 2022 12:29:27 -0700
Subject: [PATCH] Add video classification with MoViNets to official readme.

PiperOrigin-RevId: 436276225
---
 official/README.md                   |  8 ++++++++
 official/projects/movinet/README.md  |  3 +--
 official/vision/beta/MODEL_GARDEN.md | 16 +++++++++++++++-
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/official/README.md b/official/README.md
index 001d48ff8..06f09d4e5 100644
--- a/official/README.md
+++ b/official/README.md
@@ -20,6 +20,7 @@ In the near future, we will add:
 * State-of-the-art language understanding models.
 * State-of-the-art image classification models.
 * State-of-the-art object detection and instance segmentation models.
+* State-of-the-art video classification models.
 
 ## Table of Contents
 
@@ -27,6 +28,7 @@ In the near future, we will add:
   * [Computer Vision](#computer-vision)
     + [Image Classification](#image-classification)
     + [Object Detection and Segmentation](#object-detection-and-segmentation)
+    + [Video Classification](#video-classification)
   * [Natural Language Processing](#natural-language-processing)
   * [Recommendation](#recommendation)
 - [How to get started with the official models](#how-to-get-started-with-the-official-models)
@@ -55,6 +57,12 @@ In the near future, we will add:
 | [SpineNet](vision/beta/MODEL_GARDEN.md) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) |
 | [Cascade RCNN-RS and RetinaNet-RS](vision/beta/MODEL_GARDEN.md) | [Simple Training Strategies and Model Scaling for Object Detection](https://arxiv.org/abs/2107.00057)|
 
+#### Video Classification
+
+| Model | Reference (Paper) |
+|-------|-------------------|
+| [Mobile Video Networks (MoViNets)](projects/movinet) | [MoViNets: Mobile Video Networks for Efficient Video Recognition](https://arxiv.org/abs/2103.11511) |
+
 ### Natural Language Processing
 
 | Model | Reference (Paper) |
diff --git a/official/projects/movinet/README.md b/official/projects/movinet/README.md
index 0e72f7459..36bcfe89d 100644
--- a/official/projects/movinet/README.md
+++ b/official/projects/movinet/README.md
@@ -176,8 +176,7 @@ devices.
 See the [TF Lite Example](#tf-lite-example) to export and run your own models.
 We also provide [quantized TF Lite binaries via TF Hub](https://tfhub.dev/s?deployment-format=lite&q=movinet).
 For reference, MoViNet-A0-Stream runs with a similar latency to
-[MobileNetV3-Large]
-(https://tfhub.dev/google/imagenet/mobilenet_v3_large_100_224/classification/)
+[MobileNetV3-Large](https://tfhub.dev/google/imagenet/mobilenet_v3_large_100_224/classification/)
 with +5% accuracy on Kinetics 600.
 
 | Model Name | Input Shape | Pixel 4 Latency\* | x86 Latency\* | TF Lite Binary |
diff --git a/official/vision/beta/MODEL_GARDEN.md b/official/vision/beta/MODEL_GARDEN.md
index ebb0cf280..d8bd43d9e 100644
--- a/official/vision/beta/MODEL_GARDEN.md
+++ b/official/vision/beta/MODEL_GARDEN.md
@@ -171,8 +171,10 @@ evaluated on [COCO](https://cocodataset.org/) val2017.
     [Spatiotemporal Contrastive Video Representation Learning](https://arxiv.org/abs/2008.03800).
   * ResNet-3D-RS (R3D-RS) in
     [Revisiting 3D ResNets for Video Recognition](https://arxiv.org/pdf/2109.01696.pdf).
+  * Mobile Video Networks (MoViNets) in
+    [MoViNets: Mobile Video Networks for Efficient Video Recognition](https://arxiv.org/abs/2103.11511).
 
-* Training and evaluation details:
+* Training and evaluation details (SlowFast and ResNet):
   * All models are trained from scratch with vision modality (RGB) for 200
     epochs.
   * We use batch size of 1024 and cosine learning rate decay with linear warmup
@@ -192,6 +194,12 @@ evaluated on [COCO](https://cocodataset.org/) val2017.
 | R3D-RS-152 | 32 x 2 | 79.9 | 94.3 | -
 | R3D-RS-200 | 32 x 2 | 80.4 | 94.4 | -
 | R3D-RS-200 | 48 x 2 | 81.0 | - | -
+| MoViNet-A0-Base | 50 x 5 | 69.40 | 89.18 | -
+| MoViNet-A1-Base | 50 x 5 | 74.57 | 92.03 | -
+| MoViNet-A2-Base | 50 x 5 | 75.91 | 92.63 | -
+| MoViNet-A3-Base | 120 x 2 | 79.34 | 94.52 | -
+| MoViNet-A4-Base | 80 x 3 | 80.64 | 94.93 | -
+| MoViNet-A5-Base | 120 x 2 | 81.39 | 95.06 | -
 
 ### Kinetics-600 Action Recognition Baselines
 
@@ -201,3 +209,9 @@ evaluated on [COCO](https://cocodataset.org/) val2017.
 | R3D-50 | 32 x 2 | 79.5 | 94.8 | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/video_classification/k600_3d-resnet50_tpu.yaml) |
 | R3D-RS-200 | 32 x 2 | 83.1 | - | -
 | R3D-RS-200 | 48 x 2 | 83.8 | - | -
+| MoViNet-A0-Base | 50 x 5 | 72.05 | 90.92 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a0_k600_8x8.yaml) |
+| MoViNet-A1-Base | 50 x 5 | 76.69 | 93.40 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a1_k600_8x8.yaml) |
+| MoViNet-A2-Base | 50 x 5 | 78.62 | 94.17 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a2_k600_8x8.yaml) |
+| MoViNet-A3-Base | 120 x 2 | 81.79 | 95.67 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a3_k600_8x8.yaml) |
+| MoViNet-A4-Base | 80 x 3 | 83.48 | 96.16 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a4_k600_8x8.yaml) |
+| MoViNet-A5-Base | 120 x 2 | 84.27 | 96.39 | [config](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/yaml/movinet_a5_k600_8x8.yaml) |
-- 
GitLab
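
Reviewer note: the "Input Shape" column in the tables this patch adds reads as frames × temporal stride (e.g. MoViNet-A0-Base takes 50 frames sampled every 5th frame). A minimal NumPy sketch of assembling such a clip tensor, assuming A0's 172×172 input resolution from the MoViNet paper (the helper name `sample_clip` and the dummy video are illustrative, not part of the patch):

```python
import numpy as np

# "50 x 5" in the table: 50 frames at temporal stride 5.
# 172x172 resolution is an assumption based on MoViNet-A0's published config.
NUM_FRAMES, STRIDE, SIZE = 50, 5, 172

def sample_clip(video: np.ndarray, num_frames: int = NUM_FRAMES,
                stride: int = STRIDE) -> np.ndarray:
    """Subsample a [T, H, W, 3] video into a fixed-length clip, wrapping if short."""
    indices = (np.arange(num_frames) * stride) % video.shape[0]
    return video[indices]

# Dummy video: 300 frames of 172x172 RGB in [0, 1].
video = np.random.rand(300, SIZE, SIZE, 3).astype(np.float32)
clip = sample_clip(video)
batch = clip[np.newaxis]  # Model input layout: [batch, frames, H, W, 3]
print(batch.shape)  # (1, 50, 172, 172, 3)
```

The same sampling logic applies to the other rows by swapping in their frames/stride pairs (e.g. 120 × 2 for A3/A5, 80 × 3 for A4).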