* We provide models based on two detection frameworks, [RetinaNet](https://arxiv.org/abs/1708.02002) and [Mask R-CNN](https://arxiv.org/abs/1703.06870), and two backbones, [ResNet-FPN](https://arxiv.org/abs/1612.03144) and [SpineNet](https://arxiv.org/abs/1912.05027).
* All models are trained on COCO train2017 and evaluated on COCO val2017.
* Training details:
    * Models finetuned from ImageNet pretrained checkpoints use a 12- or 36-epoch schedule. Models trained from scratch use a 350-epoch schedule.
    * The default training data augmentation applies horizontal flipping and scale jittering with a random scale factor in [0.5, 2.0].
    * Unless noted, all models are trained with L2 weight regularization and ReLU activation.
    * We use a batch size of 256 and a stepwise learning rate that decays at the last 30 and last 10 epochs.
    * We use square images as input, resizing the long side of each image to the target size and then padding the short side with zeros.
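
The stepwise schedule and the square-input preprocessing described above can be sketched in plain Python. This is a minimal illustration, not the training code itself: the function names, the base learning rate of `0.1`, and the decay factor of `0.1` are assumptions chosen for the example, and the geometry helper only computes sizes rather than resizing pixels.

```python
def stepwise_lr(epoch, total_epochs, base_lr=0.1, decay=0.1):
    """Learning rate for a 0-indexed epoch.

    The rate is multiplied by `decay` once 30 epochs remain and again
    once 10 epochs remain. `base_lr` and `decay` are placeholder values.
    """
    boundaries = (total_epochs - 30, total_epochs - 10)
    drops = sum(epoch >= b for b in boundaries)
    return base_lr * decay ** drops


def resize_and_pad_geometry(height, width, target_size):
    """Return ((new_h, new_w), (pad_h, pad_w)) for a square input.

    The long side is scaled to `target_size`; the short side keeps the
    aspect ratio and is zero-padded up to `target_size`.
    """
    scale = target_size / max(height, width)
    new_h = min(target_size, round(height * scale))
    new_w = min(target_size, round(width * scale))
    return (new_h, new_w), (target_size - new_h, target_size - new_w)
```

For a 350-epoch run, `stepwise_lr` under these assumptions drops the rate at epochs 320 and 340. A 480x640 image with a target size of 640 keeps its size and receives 160 rows of zero padding.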
* We provide models for video classification with two backbones: [SlowOnly](https://arxiv.org/abs/1812.03982) and 3D-ResNet (R3D), as used in [Spatiotemporal Contrastive Video Representation Learning](https://arxiv.org/abs/2008.03800).
* Training and evaluation details:
    * All models are trained from scratch on the vision modality (RGB) for 200 epochs.
    * We use a batch size of 1024 and cosine learning rate decay with linear warmup over the first 5 epochs.
    * We follow [SlowFast](https://arxiv.org/abs/1812.03982) and perform 30-view evaluation.
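
The cosine schedule with linear warmup described above can be sketched as follows. This is an illustrative sketch only: the function name and the base learning rate of `0.1` are assumptions, and the actual training configuration may apply the schedule per step rather than per epoch.

```python
import math


def cosine_lr_with_warmup(epoch, total_epochs=200, warmup_epochs=5, base_lr=0.1):
    """Learning rate at a (possibly fractional) epoch.

    Ramps linearly from 0 to `base_lr` over `warmup_epochs`, then follows
    a half-cosine from `base_lr` down to 0 at `total_epochs`.
    `base_lr` is a placeholder value.
    """
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Passing a fractional epoch (for example `step / steps_per_epoch`) gives a smooth per-step schedule with the same shape.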
### Kinetics-400 Action Recognition Baselines
| Model | Input (frame x stride) | Top-1 | Top-5 | Download |