rebenchmark for Pyslowfast comparison (#28)

* add resolution for K400
* rebenchmark on 32G V100
* add new i3d config and i3d benchmark
* minor
* minor

Co-authored-by: N linjintao <linjintao@sensetime.com>

b96aab9a · 3aefca87
Showing 137 additions and 11 deletions

README.md +6 -5

configs/recognition/i3d/i3d_r50_video_8x8x1_100e_kinetics400_rgb.py +125 -0

docs/benchmark.md +6 -6

--- a/README.md
+++ b/README.md
@@ -65,12 +65,13 @@ We compare with other popular codebases and the [results](https://mmaction2.read
 | Model | MMAction2 (s/iter) | MMAction (s/iter) | Temporal-Shift-Module (s/iter) | PySlowFast (s/iter) |
 | :--- | :---------------: | :--------------------: | :----------------------------: | :-----------------: |
 | [TSN](/configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py) | **0.29** | 0.36 | 0.45 | x |
-| [I3D (setting1)](/configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py) | **0.45** | 0.58 | x | x |
-| [I3D (setting2)](/configs/recognition/i3d/i3d_r50_8x8x1_100e_kinetics400_rgb.py) | **0.32** | x | x | 0.56 |
+| [I3D (video)](/configs/recognition/i3d/i3d_r50_video_8x8x1_100e_kinetics400_rgb.py) | **0.31** | x | x | 0.59 |
+| [I3D (rawframe)](/configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py) | **0.45** | 0.58 | x | x |
 | [TSM](/configs/recognition/tsm/tsm_r50_1x1x8_50e_kinetics400_rgb.py) | **0.30** | x | 0.38 | x |
-| [Slowonly](/configs/recognition/slowonly/slowonly_r50_4x16x1_256e_kinetics400_rgb.py) | **0.30** | x | x | 1.03 |
-| [Slowfast](/configs/recognition/slowfast/slowfast_r50_4x16x1_256e_kinetics400_rgb.py) | **0.80** | x | x | 1.40 |
-| [R(2+1)D](/configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py) | **0.48** | x | x | x |
+| [Slowonly](/configs/recognition/slowonly/slowonly_r50_video_4x16x1_256e_kinetics400_rgb.py) | **0.27** | x | x | 0.89 |
+| [Slowfast](/configs/recognition/slowfast/slowfast_r50_video_4x16x1_256e_kinetics400_rgb.py) | **0.68** | x | x | 1.07 |
+| [R(2+1)D](/configs/recognition/r2plus1d/r2plus1d_r34_video_8x8x1_180e_kinetics400_rgb.py) | **0.45** | x | x | x |
+

 Supported methods for action recognition:
 - [x] [TSN](configs/recognition/tsn/README.md)
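
The updated README table reports training time in s/iter, so the relative speed against PySlowFast is just the ratio of the two columns. The following is a minimal sketch of that arithmetic, using only the three rows that have a PySlowFast entry; the names `timings` and `speedup` are illustrative and not part of MMAction2.

```python
# s/iter values copied from the updated table: (MMAction2, PySlowFast).
# Only rows with a PySlowFast measurement are included.
timings = {
    "I3D (video)": (0.31, 0.59),
    "Slowonly": (0.27, 0.89),
    "Slowfast": (0.68, 1.07),
}


def speedup(mmaction2_s_per_iter, other_s_per_iter):
    """Return how many times faster MMAction2 is (lower s/iter is better)."""
    return other_s_per_iter / mmaction2_s_per_iter


# Ratio per model, e.g. a value of 2.0 would mean half the time per iteration.
ratios = {name: speedup(ours, theirs) for name, (ours, theirs) in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

These ratios only compare per-iteration training time under the benchmark settings of this commit; they say nothing about accuracy.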

--- a/configs/recognition/i3d/i3d_r50_video_8x8x1_100e_kinetics400_rgb.py
+++ b/configs/recognition/i3d/i3d_r50_video_8x8x1_100e_kinetics400_rgb.py
+# model settings
+model = dict(
+    type='Recognizer3D',
+    backbone=dict(
+        type='ResNet3d',
+        pretrained2d=True,
+        pretrained='torchvision://resnet50',
+        depth=50,
+        conv_cfg=dict(type='Conv3d'),
+        norm_eval=False,
+        inflate=((1, 1, 1), (1, 0, 1, 0), (1, 0, 1, 0, 1, 0), (0, 1, 0)),
+        zero_init_residual=False),
+    cls_head=dict(
+        type='I3DHead',
+        num_classes=400,
+        in_channels=2048,
+        spatial_type='avg',
+        dropout_ratio=0.5,
+        init_std=0.01))
+# model training and testing settings
+train_cfg = None
+test_cfg = dict(average_clips=None)
+# dataset settings
+dataset_type = 'VideoDataset'
+data_root = 'data/kinetics400/videos_train'
+data_root_val = 'data/kinetics400/videos_val'
+ann_file_train = 'data/kinetics400/kinetics400_train_list_videos.txt'
+ann_file_val = 'data/kinetics400/kinetics400_val_list_videos.txt'
+ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'
+img_norm_cfg = dict(
+    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
+train_pipeline = [
+    dict(type='DecordInit'),
+    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=1),
+    dict(type='DecordDecode'),
+    dict(type='Resize', scale=(-1, 256)),
+    dict(
+        type='MultiScaleCrop',
+        input_size=224,
+        scales=(1, 0.8),
+        random_crop=False,
+        max_wh_scale_gap=0),
+    dict(type='Resize', scale=(224, 224), keep_ratio=False),
+    dict(type='Flip', flip_ratio=0.5),
+    dict(type='Normalize', **img_norm_cfg),
+    dict(type='FormatShape', input_format='NCTHW'),
+    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
+    dict(type='ToTensor', keys=['imgs', 'label'])
+]
+val_pipeline = [
+    dict(type='DecordInit'),
+    dict(
+        type='SampleFrames',
+        clip_len=8,
+        frame_interval=8,
+        num_clips=1,
+        test_mode=True),
+    dict(type='DecordDecode'),
+    dict(type='Resize', scale=(-1, 256)),
+    dict(type='CenterCrop', crop_size=224),
+    dict(type='Flip', flip_ratio=0),
+    dict(type='Normalize', **img_norm_cfg),
+    dict(type='FormatShape', input_format='NCTHW'),
+    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
+    dict(type='ToTensor', keys=['imgs'])
+]
+test_pipeline = [
+    dict(type='DecordInit'),
+    dict(
+        type='SampleFrames',
+        clip_len=8,
+        frame_interval=8,
+        num_clips=10,
+        test_mode=True),
+    dict(type='DecordDecode'),
+    dict(type='Resize', scale=(-1, 256)),
+    dict(type='ThreeCrop', crop_size=256),
+    dict(type='Flip', flip_ratio=0),
+    dict(type='Normalize', **img_norm_cfg),
+    dict(type='FormatShape', input_format='NCTHW'),
+    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
+    dict(type='ToTensor', keys=['imgs'])
+]
+data = dict(
+    videos_per_gpu=8,
+    workers_per_gpu=4,
+    train=dict(
+        type=dataset_type,
+        ann_file=ann_file_train,
+        data_prefix=data_root,
+        pipeline=train_pipeline),
+    val=dict(
+        type=dataset_type,
+        ann_file=ann_file_val,
+        data_prefix=data_root_val,
+        pipeline=val_pipeline),
+    test=dict(
+        type=dataset_type,
+        ann_file=ann_file_val,
+        data_prefix=data_root_val,
+        pipeline=test_pipeline))
+# optimizer
+optimizer = dict(
+    type='SGD', lr=0.01, momentum=0.9,
+    weight_decay=0.0001)  # this lr is used for 8 gpus
+optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
+# learning policy
+lr_config = dict(policy='step', step=[40, 80])
+total_epochs = 100
+checkpoint_config = dict(interval=5)
+evaluation = dict(
+    interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'], topk=(1, 5))
+log_config = dict(
+    interval=20,
+    hooks=[
+        dict(type='TextLoggerHook'),
+        # dict(type='TensorboardLoggerHook'),
+    ])
+# runtime settings
+dist_params = dict(backend='nccl')
+log_level = 'INFO'
+work_dir = './work_dirs/i3d_r50_video_8x8x1_100e_kinetics400_rgb/'
+load_from = None
+resume_from = None
+workflow = [('train', 1)]
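
As a reading aid for the new config, the sketch below works through two of its settings: the frame indices implied by `clip_len=8, frame_interval=8`, and the learning rate implied by `lr_config = dict(policy='step', step=[40, 80])` with `lr=0.01`. It is a simplified illustration, not the MMAction2 or MMCV implementation: the helper names are hypothetical, the clip start offset is passed in by hand (the real `SampleFrames` chooses it), and the decay factor `gamma=0.1` is an assumption about the step policy's default.

```python
def clip_frame_indices(start, clip_len=8, frame_interval=8):
    """Frame indices of one clip: clip_len frames, frame_interval apart."""
    return [start + i * frame_interval for i in range(clip_len)]


def clip_span(clip_len=8, frame_interval=8):
    """Number of source frames covered from the first to the last sampled frame."""
    return (clip_len - 1) * frame_interval + 1


def step_lr(epoch, base_lr=0.01, steps=(40, 80), gamma=0.1):
    """Step schedule: multiply base_lr by gamma once per milestone reached.

    gamma=0.1 is an assumed default; the config itself does not state it.
    """
    milestones_passed = sum(1 for s in steps if epoch >= s)
    return base_lr * gamma ** milestones_passed


print(clip_frame_indices(0))  # one 8x8 clip starting at frame 0
print(clip_span())            # frames spanned by that clip
for epoch in (0, 40, 80):
    print(epoch, step_lr(epoch))
```

With `num_clips=10` in `test_pipeline`, ten such clips are sampled per video at test time, whereas training and validation use a single clip.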
--- a/docs/benchmark.md
+++ b/docs/benchmark.md
@@ -8,7 +8,7 @@ Here we compare our MMAction2 repo with other video understanding toolboxes in t
 by the training time per iteration. Here, we use
 - commit id [7f3490d](https://github.com/open-mmlab/mmaction/tree/7f3490d3db6a67fe7b87bfef238b757403b670e3)(1/5/2020) of MMAction
 - commit id [8d53d6f](https://github.com/mit-han-lab/temporal-shift-module/tree/8d53d6fda40bea2f1b37a6095279c4b454d672bd)(5/5/2020) of Temporal-Shift-Module
-- commit id [133e40f](https://github.com/facebookresearch/SlowFast/tree/133e40f8349ce37b0e6168639da0811a413579c8)(30/5/2020) of PySlowFast
+- commit id [8299c98](https://github.com/facebookresearch/SlowFast/tree/8299c9862f83a067fa7114ce98120ae1568a83ec)(7/7/2020) of PySlowFast
 - commit id [f13707f](https://github.com/wzmsltw/BSN-boundary-sensitive-network/tree/f13707fbc362486e93178c39f9c4d398afe2cb2f)(12/12/2018) of BSN(boundary sensitive network)
 - commit id [45d0514](https://github.com/JJBOY/BMN-Boundary-Matching-Network/tree/45d05146822b85ca672b65f3d030509583d0135a)(17/10/2019) of BMN(boundary matching network)

@@ -24,12 +24,12 @@ The training speed is measure with s/iter. The lower, the better.
 | Model | MMAction2 (s/iter) | MMAction (s/iter) | Temporal-Shift-Module (s/iter) | PySlowFast (s/iter) |
 | :--- | :---------------: | :--------------------: | :----------------------------: | :-----------------: |
 | [TSN](/configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py) | **0.29** | 0.36 | 0.45 | x |
-| [I3D (setting1)](/configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py) | **0.45** | 0.58 | x | x |
-| [I3D (setting2)](/configs/recognition/i3d/i3d_r50_8x8x1_100e_kinetics400_rgb.py) | **0.32** | x | x | 0.56 |
+| [I3D (video)](/configs/recognition/i3d/i3d_r50_video_8x8x1_100e_kinetics400_rgb.py) | **0.31** | x | x | 0.59 |
+| [I3D (rawframe)](/configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py) | **0.45** | 0.58 | x | x |
 | [TSM](/configs/recognition/tsm/tsm_r50_1x1x8_50e_kinetics400_rgb.py) | **0.30** | x | 0.38 | x |
-| [Slowonly](/configs/recognition/slowonly/slowonly_r50_4x16x1_256e_kinetics400_rgb.py) | **0.30** | x | x | 1.03 |
-| [Slowfast](/configs/recognition/slowfast/slowfast_r50_4x16x1_256e_kinetics400_rgb.py) | **0.80** | x | x | 1.40 |
-| [R(2+1)D](/configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py) | **0.48** | x | x | x |
+| [Slowonly](/configs/recognition/slowonly/slowonly_r50_video_4x16x1_256e_kinetics400_rgb.py) | **0.27** | x | x | 0.89 |
+| [Slowfast](/configs/recognition/slowfast/slowfast_r50_video_4x16x1_256e_kinetics400_rgb.py) | **0.68** | x | x | 1.07 |
+| [R(2+1)D](/configs/recognition/r2plus1d/r2plus1d_r34_video_8x8x1_180e_kinetics400_rgb.py) | **0.45** | x | x | x |

 ## Localizers