# Getting Started

This page provides basic tutorials about the usage of MMAction2.
For installation instructions, please see [install.md](install.md).

## Datasets

It is recommended to symlink the dataset root to `$MMACTION2/data`.
If your folder structure is different, you may need to change the corresponding paths in config files.

```
mmaction2
├── mmaction
├── tools
├── configs
├── data
│   ├── kinetics400
│   │   ├── rawframes_train
│   │   ├── rawframes_val
│   │   ├── kinetics_train_list.txt
│   │   ├── kinetics_val_list.txt
│   ├── ucf101
│   │   ├── rawframes_train
│   │   ├── rawframes_val
│   │   ├── ucf101_train_list.txt
│   │   ├── ucf101_val_list.txt

```
For more information on data preparation, please see [data_preparation.md](data_preparation.md).
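
As a quick orientation, a rawframe annotation list such as `kinetics_train_list.txt` usually stores one sample per line: the frame directory (relative to the data root), the total number of frames, and the label index. The two entries below are made up purely for illustration:

```
some_class/video_0001 300 0
another_class/video_0002 240 1
```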

To use custom datasets, please refer to [Tutorial 2: Adding New Dataset](tutorials/new_dataset.md).

## Inference with Pre-Trained Models

We provide testing scripts to evaluate a whole dataset (Kinetics-400, Something-Something V1&V2, (Multi-)Moments in Time, etc.),
and some high-level APIs for easier integration into other projects.

### Test a dataset

- [x] single GPU
- [x] single node multiple GPUs
- [x] multiple nodes

You can use the following commands to test a dataset.

```shell
# single-gpu testing
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}] \
    [--gpu-collect] [--tmpdir ${TMPDIR}] [--options ${OPTIONS}] [--average-clips ${AVG_TYPE}] \
    [--launcher ${JOB_LAUNCHER}] [--local_rank ${LOCAL_RANK}]

# multi-gpu testing
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}] \
    [--gpu-collect] [--tmpdir ${TMPDIR}] [--options ${OPTIONS}] [--average-clips ${AVG_TYPE}] \
    [--launcher ${JOB_LAUNCHER}] [--local_rank ${LOCAL_RANK}]
```

Optional arguments:
- `RESULT_FILE`: Filename of the output results. If not specified, the results will not be saved to a file.
- `EVAL_METRICS`: Items to be evaluated on the results. Allowed values depend on the dataset, e.g., `top_k_accuracy`, `mean_class_accuracy` are available for all datasets in recognition, `mean_average_precision` for Multi-Moments in Time, `AR@AN` for ActivityNet, etc.
- `--gpu-collect`: If specified, recognition results will be collected using GPU communication. Otherwise, results on different GPUs will be saved to `TMPDIR` and collected by the rank 0 worker.
- `TMPDIR`: Temporary directory used for collecting results from multiple workers, available when `--gpu-collect` is not specified.
- `OPTIONS`: Custom options used for evaluation. Allowed values depend on the arguments of the `evaluate` function of the dataset.
- `AVG_TYPE`: The method used to average multiple test clips. If set to `prob`, it will apply softmax before averaging the clip scores. Otherwise, it will directly average the clip scores.
- `JOB_LAUNCHER`: Launcher for distributed job initialization. Allowed choices are `none`, `pytorch`, `slurm`, `mpi`. In particular, if set to `none`, it will test in a non-distributed mode.
- `LOCAL_RANK`: ID for local rank. If not specified, it will be set to 0.

Examples:

Assume that you have already downloaded the checkpoints to the directory `checkpoints/`.

1. Test TSN on Kinetics-400 (without saving the test results) and evaluate the top-k accuracy and mean class accuracy.

    ```shell
    python tools/test.py configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/SOME_CHECKPOINT.pth \
        --eval top_k_accuracy mean_class_accuracy
    ```

2. Test TSN on Something-Something V1 with 8 GPUs, and evaluate the top-k accuracy.

    ```shell
    ./tools/dist_test.sh configs/recognition/tsn/tsn_r50_1x1x8_50e_sthv1_rgb.py \
        checkpoints/SOME_CHECKPOINT.pth \
        8 --out results.pkl --eval top_k_accuracy
    ```

3. Test TSN on Kinetics-400 in a slurm environment and evaluate the top-k accuracy.

    ```shell
    python tools/test.py configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/SOME_CHECKPOINT.pth \
        --launcher slurm --eval top_k_accuracy
    ```

### Video demo

We provide a demo script to predict the recognition result using a single video.

```shell
python demo/demo.py ${CONFIG_FILE} ${CHECKPOINT_FILE} ${VIDEO_FILE} ${LABEL_FILE} [--use-frames] \
    [--device ${DEVICE_TYPE}] [--fps ${FPS}] [--font-size ${FONT_SIZE}] [--font-color ${FONT_COLOR}] \
    [--target-resolution ${TARGET_RESOLUTION}] [--resize-algorithm ${RESIZE_ALGORITHM}] [--out-filename ${OUT_FILE}]
```

Optional arguments:
- `--use-frames`: If specified, the demo will take rawframes as input. Otherwise, it will take a video as input.
- `DEVICE_TYPE`: Type of device to run the demo. Allowed values are cuda device like `cuda:0` or `cpu`. If not specified, it will be set to `cuda:0`.
- `FPS`: FPS value of the output video when using rawframes as input. If not specified, it will be set to 30.
- `FONT_SIZE`: Font size of the label added in the video. If not specified, it will be set to 20.
- `FONT_COLOR`: Font color of the label added in the video. If not specified, it will be `white`.
- `TARGET_RESOLUTION`: Resolution `(desired_width, desired_height)` for resizing the frames before output when using a video as input. If not specified, it will be `None` and the frames are resized by keeping the existing aspect ratio.
- `RESIZE_ALGORITHM`: Resize algorithm used for resizing. If not specified, it will be set to `bicubic`.
- `OUT_FILE`: Path to the output file, which can be in video or gif format. If not specified, it will be set to `None` and no output file will be generated.

Examples:

Assume that you are located at `$MMACTION2` and have already downloaded the checkpoints to the directory `checkpoints/`.

1. Recognize a video file as input by using a TSN model on cuda by default.

    ```shell
    # The demo.mp4 and label_map.txt are both from Kinetics-400
    python demo/demo.py configs/recognition/tsn/tsn_r50_video_inference_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth \
        demo/demo.mp4 demo/label_map.txt
    ```

2. Recognize a list of rawframes as input by using a TSN model on cpu.

    ```shell
    python demo/demo.py configs/recognition/tsn/tsn_r50_inference_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth \
        PATH_TO_FRAMES/ LABEL_FILE --use-frames --device cpu
    ```

3. Recognize a video file as input by using a TSN model and then generate an mp4 file.

    ```shell
    # The demo.mp4 and label_map.txt are both from Kinetics-400
    python demo/demo.py configs/recognition/tsn/tsn_r50_video_inference_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth \
        demo/demo.mp4 demo/label_map.txt --out-filename demo/demo_out.mp4
    ```

4. Recognize a list of rawframes as input by using a TSN model and then generate a gif file.

    ```shell
    python demo/demo.py configs/recognition/tsn/tsn_r50_inference_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth \
        PATH_TO_FRAMES/ LABEL_FILE --use-frames --out-filename demo/demo_out.gif
    ```

5. Recognize a video file as input by using a TSN model, then generate an mp4 file with a given resolution and resize algorithm.

    ```shell
    # The demo.mp4 and label_map.txt are both from Kinetics-400
    python demo/demo.py configs/recognition/tsn/tsn_r50_video_inference_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth \
        demo/demo.mp4 demo/label_map.txt --target-resolution 340 256 --resize-algorithm bilinear \
        --out-filename demo/demo_out.mp4
    ```

    ```shell
    # The demo.mp4 and label_map.txt are both from Kinetics-400
    # If either dimension is set to -1, the frames are resized by keeping the existing aspect ratio
    # For --target-resolution 170 -1, original resolution (340, 256) -> target resolution (170, 128)
    python demo/demo.py configs/recognition/tsn/tsn_r50_video_inference_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth \
        demo/demo.mp4 demo/label_map.txt --target-resolution 170 -1 --resize-algorithm bilinear \
        --out-filename demo/demo_out.mp4
    ```

6. Recognize a video file as input by using a TSN model, then generate an mp4 file with the label rendered in red with a font size of 10.

    ```shell
    # The demo.mp4 and label_map.txt are both from Kinetics-400
    python demo/demo.py configs/recognition/tsn/tsn_r50_video_inference_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth \
        demo/demo.mp4 demo/label_map.txt --font-size 10 --font-color red \
        --out-filename demo/demo_out.mp4
    ```

7. Recognize a list of rawframes as input by using a TSN model and then generate a gif file at 24 fps.

    ```shell
    python demo/demo.py configs/recognition/tsn/tsn_r50_inference_1x1x3_100e_kinetics400_rgb.py \
        checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth \
        PATH_TO_FRAMES/ LABEL_FILE --use-frames --fps 24 --out-filename demo/demo_out.gif
    ```

### High-level APIs for testing a video and rawframes

Here is an example of building the model and testing a given video.

```python
import torch

from mmaction.apis import init_recognizer, inference_recognizer

config_file = 'configs/recognition/tsn/tsn_r50_video_inference_1x1x3_100e_kinetics400_rgb.py'
# download the checkpoint from model zoo and put it in `checkpoints/`
checkpoint_file = 'checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth'

# assign the desired device.
device = 'cuda:0' # or 'cpu'
device = torch.device(device)

# build the model from a config file and a checkpoint file
model = init_recognizer(config_file, checkpoint_file, device=device)

# test a single video and show the result:
video = 'demo/demo.mp4'
labels = 'demo/label_map.txt'
results = inference_recognizer(model, video, labels)

# show the results
print('The top-5 labels with corresponding scores are:')
for result in results:
    print(f'{result[0]}: ', result[1])
```

Here is an example of building the model and testing with a given rawframes directory.

```python
import torch

from mmaction.apis import init_recognizer, inference_recognizer

config_file = 'configs/recognition/tsn/tsn_r50_inference_1x1x3_100e_kinetics400_rgb.py'
# download the checkpoint from model zoo and put it in `checkpoints/`
checkpoint_file = 'checkpoints/tsn_r50_1x1x3_100e_kinetics400_rgb_20200614-e508be42.pth'

# assign the desired device.
device = 'cuda:0' # or 'cpu'
device = torch.device(device)

# build the model from a config file and a checkpoint file
model = init_recognizer(config_file, checkpoint_file, device=device, use_frames=True)

# test rawframe directory of a single video and show the result:
video = 'SOME_DIR_PATH/'
labels = 'demo/label_map.txt'
results = inference_recognizer(model, video, labels, use_frames=True)

# show the results
print('The top-5 labels with corresponding scores are:')
for result in results:
    print(f'{result[0]}: ', result[1])
```

**Note**: We define `data_prefix` in the config files and set it to `None` by default in our provided inference configs.
If `data_prefix` is not `None`, the path of the video file (or rawframe directory) will be `osp.join(data_prefix, video)`.
Here, `video` is the parameter used in the demo scripts above.
This detail can be found in `rawframe_dataset.py` and `video_dataset.py`. For example,

* When video (rawframes) path is `SOME_DIR_PATH/VIDEO.mp4` (`SOME_DIR_PATH/VIDEO_NAME/img_xxxxx.jpg`), and `data_prefix` is None in the config file,
the param `video` should be `SOME_DIR_PATH/VIDEO.mp4` (`SOME_DIR_PATH/VIDEO_NAME`).

* When video (rawframes) path is `SOME_DIR_PATH/VIDEO.mp4` (`SOME_DIR_PATH/VIDEO_NAME/img_xxxxx.jpg`), and `data_prefix` is `SOME_DIR_PATH` in the config file,
the param `video` should be `VIDEO.mp4` (`VIDEO_NAME`).

* When rawframes path is `VIDEO_NAME/img_xxxxx.jpg`, and `data_prefix` is None in the config file, the param `video` should be `VIDEO_NAME`.
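
The path resolution described above boils down to a simple join. Below is a minimal sketch of the logic (the directory and file names are placeholders, and the real implementation lives in the dataset classes mentioned above):

```python
import os.path as osp


def resolve_sample_path(data_prefix, video):
    """Mimic how the datasets build the final path from `data_prefix` and `video`."""
    if data_prefix is None:
        # inference configs: pass the full path (or rawframe directory) yourself
        return video
    return osp.join(data_prefix, video)


print(resolve_sample_path(None, 'SOME_DIR_PATH/VIDEO.mp4'))  # SOME_DIR_PATH/VIDEO.mp4
print(resolve_sample_path('SOME_DIR_PATH', 'VIDEO.mp4'))     # SOME_DIR_PATH/VIDEO.mp4
```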

A notebook demo can be found in [demo/demo.ipynb](/demo/demo.ipynb).


## Build a Model

### Build a model with basic components

In MMAction2, model components are basically categorized into 4 types.

- recognizer: the whole recognizer model pipeline, usually contains a backbone and cls_head.
- backbone: usually an FCN network to extract feature maps, e.g., ResNet, BNInception.
- cls_head: the component for classification task, usually contains an FC layer with some pooling layers.
- localizer: the model for localization task, currently available: BSN, BMN.

Following some basic pipelines (e.g., `Recognizer2D`), the model structure
can be customized through config files with little effort, as sketched below.
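
For example, a recognizer that composes a backbone with a cls_head can be declared in a config as in the snippet below; the field values mirror the provided TSN configs but are shown for illustration only:

```python
# Illustrative recognizer definition: a 2D recognizer = ResNet backbone + TSN head.
model = dict(
    type='Recognizer2D',
    backbone=dict(
        type='ResNet',
        pretrained='torchvision://resnet50',
        depth=50,
        norm_eval=False),
    cls_head=dict(
        type='TSNHead',
        num_classes=400,
        in_channels=2048,
        spatial_type='avg',
        consensus=dict(type='AvgConsensus', dim=1),
        dropout_ratio=0.4,
        init_std=0.01))
```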

If we want to implement some new components, e.g., the temporal shift backbone structure as
in [TSM: Temporal Shift Module for Efficient Video Understanding](https://arxiv.org/abs/1811.08383), there are several things to do.

1. Create a new file `mmaction/models/backbones/resnet_tsm.py`.

    ```python
    from ..registry import BACKBONES
    from .resnet import ResNet

    @BACKBONES.register_module()
    class ResNetTSM(ResNet):

        def __init__(self,
                     depth,
                     num_segments=8,
                     is_shift=True,
                     shift_div=8,
                     shift_place='blockres',
                     temporal_pool=False,
                     **kwargs):
            pass

        def forward(self, x):
            # implementation is ignored
            pass
    ```

2. Import the module in `mmaction/models/backbones/__init__.py`
    ```python
    from .resnet_tsm import ResNetTSM
    ```

3. Modify the config file from

    ```python
    backbone=dict(
        type='ResNet',
        pretrained='torchvision://resnet50',
        depth=50,
        norm_eval=False)
    ```
   to
    ```python
    backbone=dict(
        type='ResNetTSM',
        pretrained='torchvision://resnet50',
        depth=50,
        norm_eval=False,
        shift_div=8)
    ```

### Write a new model

To write a new recognition pipeline, you need to inherit from `BaseRecognizer`,
which defines the following abstract methods.

- `forward_train()`: forward method of the training mode.
- `forward_test()`: forward method of the testing mode.

[Recognizer2D](/mmaction/models/recognizers/recognizer2d.py) and [Recognizer3D](/mmaction/models/recognizers/recognizer3d.py)
are good examples which show how to do that.
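
A minimal skeleton of a custom recognizer could look like the sketch below. The helper calls (`extract_feat`, `cls_head.loss`) mirror the existing recognizers, but treat this as a starting point rather than a drop-in implementation:

```python
from ..registry import RECOGNIZERS
from .base import BaseRecognizer


@RECOGNIZERS.register_module()
class MyRecognizer(BaseRecognizer):
    """Skeleton recognizer implementing the two required abstract methods."""

    def forward_train(self, imgs, labels):
        # training mode: extract features, classify, and return a dict of losses
        x = self.extract_feat(imgs)
        cls_score = self.cls_head(x)
        return self.cls_head.loss(cls_score, labels)

    def forward_test(self, imgs):
        # testing mode: no labels are given, return the classification scores
        x = self.extract_feat(imgs)
        cls_score = self.cls_head(x)
        return cls_score.cpu().numpy()
```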


## Train a Model

### Iteration pipeline

MMAction2 implements distributed training and non-distributed training,
which use `MMDistributedDataParallel` and `MMDataParallel` respectively.

We adopt distributed training for both single machine and multiple machines.
Supposing that the server has 8 GPUs, 8 processes will be started and each process runs on a single GPU.

Each process keeps an isolated model, data loader, and optimizer.
Model parameters are only synchronized once at the beginning.
After a forward and backward pass, gradients will be all-reduced among all GPUs,
and the optimizer will update the model parameters.
Since the gradients are all-reduced, the model parameters stay the same across all processes after each iteration.
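
Conceptually, each training iteration works like the plain PyTorch DDP sketch below; MMAction2 wraps the same mechanism in `MMDistributedDataParallel` and a runner instead of calling DDP directly like this:

```python
# Conceptual sketch: one training iteration on a single process; every rank runs it in parallel.
import torch
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


def train_one_iter(ddp_model: DDP, optimizer: torch.optim.Optimizer, inputs, labels):
    loss = F.cross_entropy(ddp_model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()   # gradients are all-reduced across GPUs during backward
    optimizer.step()  # identical gradients -> identical parameter updates on every rank
    return loss
```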

### Training setting

L
linjintao 已提交
363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378
All outputs (log files and checkpoints) will be saved to the working directory,
which is specified by `work_dir` in the config file.

By default we evaluate the model on the validation set after each epoch; you can change the evaluation interval by modifying the `interval` argument in the training config:
```python
evaluation = dict(interval=5)  # This evaluates the model every 5 epochs.
```

According to the [Linear Scaling Rule](https://arxiv.org/abs/1706.02677), you need to set the learning rate proportional to the batch size if you use different GPUs or videos per GPU, e.g., lr=0.01 for 4 GPUs * 2 videos/gpu and lr=0.08 for 16 GPUs * 4 videos/gpu.
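
A quick way to apply the rule is to scale the base learning rate by the ratio of total batch sizes; the reference point below is taken from the example above and is only a rule of thumb:

```python
# Linear Scaling Rule: lr is proportional to the total batch size (GPUs x videos per GPU).
def scaled_lr(num_gpus, videos_per_gpu, base_lr=0.01, base_batch_size=4 * 2):
    return base_lr * (num_gpus * videos_per_gpu) / base_batch_size


print(scaled_lr(16, 4))  # 0.08, matching the example above
```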

### Train with a single GPU

```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

If you want to specify the working directory in the command, you can add an argument `--work-dir ${YOUR_WORK_DIR}`.

### Train with multiple GPUs

```shell
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```

Optional arguments are:

- `--validate` (**strongly recommended**): Perform evaluation every k epochs during training, where k defaults to 5 and can be modified by changing the `interval` value of the `evaluation` dict in each config file.
- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file.
- `--gpus ${GPU_NUM}`: Number of gpus to use, which is only applicable to non-distributed training.
- `--gpu-ids ${GPU_IDS}`: IDs of gpus to use, which is only applicable to non-distributed training.
- `--seed ${SEED}`: Seed id for random state in python, numpy and pytorch to generate random numbers.
- `--deterministic`: If specified, it will set deterministic options for CUDNN backend.
- `JOB_LAUNCHER`: Launcher for distributed job initialization. Allowed choices are `none`, `pytorch`, `slurm`, `mpi`. In particular, if set to `none`, it will run in a non-distributed mode.
- `LOCAL_RANK`: ID for local rank. If not specified, it will be set to 0.

Difference between `resume-from` and `load-from`:
`resume-from` loads both the model weights and optimizer status, and the epoch is also inherited from the specified checkpoint. It is usually used for resuming the training process that is interrupted accidentally.
`load-from` only loads the model weights and the training epoch starts from 0. It is usually used for finetuning.

Here is an example of using 8 GPUs to resume training from a TSN checkpoint.

```shell
./tools/dist_train.sh configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py 8 --resume-from work_dirs/tsn_r50_1x1x3_100e_kinetics400_rgb/latest.pth
```
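
Alternatively, the same behavior can usually be controlled from the config file through the `resume_from` / `load_from` fields; the snippet below is a sketch, so check the runtime settings of the config you actually use:

```python
# In the config: resume training (weights + optimizer state + epoch) from a checkpoint ...
resume_from = 'work_dirs/tsn_r50_1x1x3_100e_kinetics400_rgb/latest.pth'
# ... or only load the weights for finetuning (training then starts from epoch 0).
load_from = None
```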

### Train with multiple machines

If you can run MMAction2 on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single machine training.)

```shell
[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} [--work-dir ${WORK_DIR}]
```

Here is an example of using 16 GPUs to train TSN on the dev partition in a slurm cluster. (use `GPUS_PER_NODE=8` to specify a single slurm cluster node with 8 GPUs.)

```shell
GPUS=16 ./tools/slurm_train.sh dev tsn_r50_k400 configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py --work-dir work_dirs/tsn_r50_1x1x3_100e_kinetics400_rgb
```

You can check [slurm_train.sh](/tools/slurm_train.sh) for full arguments and environment variables.

If you have multiple machines connected only via Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
Usually it is slow if you do not have high-speed networking like InfiniBand.

### Launch multiple jobs on a single machine

If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
you need to specify different ports (29500 by default) for each job to avoid communication conflict.

If you use `dist_train.sh` to launch training jobs, you can set the port in commands.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
```

If you launch training jobs with slurm, you need to modify the config files (usually the 6th line from the bottom) to set different communication ports.

In `config1.py`,
```python
dist_params = dict(backend='nccl', port=29500)
```

In `config2.py`,
```python
dist_params = dict(backend='nccl', port=29501)
```

Then you can launch two jobs with `config1.py` and `config2.py`.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py [--work-dir ${WORK_DIR}]
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py [--work-dir ${WORK_DIR}]
```

## Useful Tools

We provide lots of useful tools under the `tools/` directory.

### Analyze logs

You can plot loss/top-k acc curves given a training log file. Run `pip install seaborn` first to install the dependency.

![acc_curve_image](imgs/acc_curve.png)

```shell
python tools/analysis/analyze_logs.py plot_curve ${JSON_LOGS} [--keys ${KEYS}] [--title ${TITLE}] [--legend ${LEGEND}] [--backend ${BACKEND}] [--style ${STYLE}] [--out ${OUT_FILE}]
```

Examples:

- Plot the classification loss of some run.

```shell
python tools/analysis/analyze_logs.py plot_curve log.json --keys loss_cls --legend loss_cls
```

- Plot the top-1 acc and top-5 acc of some run, and save the figure to a pdf.

```shell
python tools/analysis/analyze_logs.py plot_curve log.json --keys top1_acc top5_acc --out results.pdf
```

- Compare the top-1 acc of two runs in the same figure.

```shell
python tools/analysis/analyze_logs.py plot_curve log1.json log2.json --keys top1_acc --legend run1 run2
```

You can also compute the average training speed.

```shell
python tools/analysis/analyze_logs.py cal_train_time ${JSON_LOGS} [--include-outliers]
```

- Compute the average training speed for a config file.

```shell
python tools/analysis/analyze_logs.py cal_train_time work_dirs/some_exp/20200422_153324.log.json
```

The output is expected to be like the following.

```
-----Analyze train time of work_dirs/some_exp/20200422_153324.log.json-----
slowest epoch 60, average time is 0.9736
fastest epoch 18, average time is 0.9001
time std over epochs is 0.0177
average iter time: 0.9330 s/iter

```

### Get the FLOPs and params (experimental)

We provide a script adapted from [flops-counter.pytorch](https://github.com/sovrasov/flops-counter.pytorch) to compute the FLOPs and params of a given model.

```shell
python tools/analysis/get_flops.py ${CONFIG_FILE} [--shape ${INPUT_SHAPE}]
```

We will get a result like this:

```
Input shape: (1, 3, 32, 340, 256)
Flops: 37.1 GMac
Params: 28.04 M
```

**Note**: This tool is still experimental and we do not guarantee that the number is correct.
You may use the result for simple comparisons, but double-check it before you adopt it in technical reports or papers.

(1) FLOPs are related to the input shape while parameters are not. The default input shape is (1, 3, 340, 256) for 2D recognizers and (1, 3, 32, 340, 256) for 3D recognizers.
(2) Some operators, such as GN and custom operators, are not counted in FLOPs. Refer to [`mmcv.cnn.get_model_complexity_info()`](https://github.com/open-mmlab/mmcv/blob/master/mmcv/cnn/utils/flops_counter.py) for details.
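
If you prefer to call the underlying counter from Python, a minimal sketch with `mmcv.cnn.get_model_complexity_info` is shown below; the torchvision backbone only keeps the example self-contained, while `tools/analysis/get_flops.py` handles a full recognizer config:

```python
# Minimal sketch of the mmcv FLOPs/params counter on a plain torchvision backbone.
import torchvision
from mmcv.cnn import get_model_complexity_info

model = torchvision.models.resnet50()
flops, params = get_model_complexity_info(
    model, (3, 224, 224), as_strings=True, print_per_layer_stat=False)
print(f'Flops: {flops}\nParams: {params}')
```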

### Publish a model

Before you upload a model to AWS, you may want to:
(1) convert model weights to CPU tensors.
(2) delete the optimizer states.
(3) compute the hash of the checkpoint file and append the hash id to the filename.

```shell
python tools/publish_model.py ${INPUT_FILENAME} ${OUTPUT_FILENAME}
```

E.g.,

```shell
python tools/publish_model.py work_dirs/tsn_r50_1x1x3_100e_kinetics400_rgb/latest.pth tsn_r50_1x1x3_100e_kinetics400_rgb.pth
```

The final output filename will be `tsn_r50_1x1x3_100e_kinetics400_rgb-{hash id}.pth`.
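
The hash suffix is computed from the checkpoint file itself. Below is a sketch of the idea, assuming the usual SHA-256 digest truncated to 8 hex characters; check `tools/publish_model.py` for the exact behavior:

```python
# Sketch: derive the short hash suffix appended to a published checkpoint name.
import hashlib


def short_hash(path, length=8):
    sha = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha.update(chunk)
    return sha.hexdigest()[:length]


# e.g. tsn_r50_1x1x3_100e_kinetics400_rgb.pth -> tsn_r50_1x1x3_100e_kinetics400_rgb-<hash>.pth
```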

## Tutorials

Currently, we provide some tutorials for users on how to [finetune a model](tutorials/finetune.md),
[add a new dataset](tutorials/new_dataset.md), and [add new modules](tutorials/new_modules.md).