{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. PP-TSM Introduction\n", "\n", "Video classification is similar to image classification which belongs to the recognition task. For a given input video, the video classification model aims to output its predicted label category. If tags are all action categories, this task is also called action recognition. Different from image classification, video classification often requires the use of temporal information between multiple frames of images. PP-TSM is a practical industrial video classification model developed by PaddleVideo. Based on the implementation of state-of-the-art algorithms, we slim the model size and optimize the accuracy with the considerations of the trade-off between speed and precision.\n", "\n", "PP-TSM is produced based on ResNet-50 backbone. Optimized methods includes data augmentation, network structure fine-tuning, training strategy, preciceBN, pretrain model selection and model distillation. Under the premise of basically not increasing the amount of calculation, using the center-sampling evaluation method, the accuracy of PP-TSM on Kinetics-400 is 3.95 points higher than that of the original paper, reaching 76.16%, which exceeds the 3D model under the same backbone network, and the inference speed is 4.5 times faster!\n", "\n", "More information about PaddleVideo can be found here https://github.com/PaddlePaddle/PaddleVideo .\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Model Effects and Application Scenarios\n", "### 2.1 action recognition Tasks:\n", "\n", "#### 2.1.1 Datasets:\n", "\n", "The dataset is mainly in Kinetics-400, which is divided into training set and test set.\n", "\n", "#### 2.1.2 Model Effects:\n", "\n", "The recognition effect of PP-TSM on the picture is:\n", "\n", "
\n", "\n", "
\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. How to Use the Model\n", "\n", "### 3.1 Model Inference:\n", "* Download " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "scrolled": true, "tags": [] }, "outputs": [], "source": [ "%cd ~/work\n", "\n", "!git clone https://github.com/PaddlePaddle/PaddleVideo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# The script needs to be run in the PaddleVideo directory\n", "%cd ~/work/PaddleVideo/\n", "\n", "# Install the required dependencies [already persisted, no need to install again].\n", "!pip install --upgrade pip\n", "!pip install -r requirements.txt --user\n", "\n", "# Install PaddleVideo\n", "!pip install ppvideo==2.3.0 --user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Quick experience\n", "\n", "Congratulations! Now that you've successfully installed PaddleVideo, let's get a quick feel at action recognition." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "!ppvideo --model_name='ppTSM' --use_gpu=False --video_file='data/example.avi'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An video with the predicted result is generated.\n", "\n", "The result is as follows:\n", "\n", "```txt\n", "Current video file: data/example.avi\n", " top-1 classes: [5]\n", " top-1 scores: [0.95056254]\n", " top-1 label names: ['archery']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Model Training\n", "* Clone the PaddleVideo repository (see 3.1 for details)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Prepare the datasets and pretrain model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 进入PaddleVideo工作目录\n", "%cd ~/work/PaddleVideo/\n", "\n", "# 下载Kinetics-400小数据集\n", "%pushd ./data/k400\n", "!wget -nc https://videotag.bj.bcebos.com/Data/k400_videos_small.tar\n", "!tar -xf k400_videos_small.tar\n", "%popd\n", "\n", "# 下载预训练模型\n", "!wget -nc -P ./data https://videotag.bj.bcebos.com/PaddleVideo/PretrainModel/ResNet50_vd_ssld_v2_pretrained.pdparams --no-check-certificate\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Change yaml configurations files.\n", "\n", "\n", "Change yaml configurations files` configs/recognition/pptsm/pptsm_k400_videos_uniform.yaml`\n", "\n", "```\n", "DATASET: #DATASET field\n", " batch_size: 4 # adjust according to GPU memory\n", " num_workers: 4 \n", " test_batch_size: 1\n", " train:\n", " format: \"VideoDataset\" \n", " data_prefix: \"data/k400/videos\" \n", " file_path: \"data/k400/train_small_videos.list \" # modify train dataset path\n", " valid:\n", " format: \"VideoDataset\" \n", " data_prefix: \"data/k400/videos\" \n", " file_path: \"data/k400/val_small_videos.list\" #modify validation dataset path\n", " test:\n", " format: \"VideoDataset\"\n", " data_prefix: \"data/k400/videos\" \n", " file_path: \"data/k400/val_small_videos.list\" #modify validation dataset path\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Train the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "%cd ~/work/PaddleVideo/\n", "%env CUDA_VISIBLE_DEVICES=0\n", "\n", "# Start training\n", "!python main.py --validate -c configs/recognition/pptsm/pptsm_k400_videos_uniform.yaml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Model evaluation\n", "\n", "Setting `--validate` in train script, the evaluation will be implemented during training. For a trained model, the evaluation code as follows.\n", "\n", "`-c` specify configuration file, `-w` specify checkpoints." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%cd ~/work/PaddleVideo/\n", "%env CUDA_VISIBLE_DEVICES=0\n", "\n", "!python main.py --test -c configs/recognition/pptsm/pptsm_k400_videos_uniform.yaml -w output/ppTSM/ppTSM_best.pdparams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Model Principles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Temporal Shift Module\n", "\n", "PP-TSM use Temporal Shift Module to learn temporal information. This method greatly improves the model's ability to use the video information without adding any additional parameters or computation.\n", "\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Data Augmentation VideoMix\n", "\n", "For Video Mix-up,The two videos are mixed with a certain weight value to form a new input video, which improves the space-time capability of the network.\n", "\n", "* precise BN\n", "\n", "In order to obtain more accurate mean and variance for BN layer to use in testing, in the experiment, we fix the parameters in the network after the network has trained an Epoch, then input the training data into the network for forward calculation, save the mean and square error of each step, and finally obtain the accurate mean and variance of all training samples to improve the testing accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Note\n", "\n", "Each configuration file provided by the PP-TSM model is placed in the `configs/recognition/pptsm` directory. The configuration file name is organized in the following format:\n", "\n", "```ModelName_Backbone_Datasest_DataFormat_EvaluationMethod_others.yaml```\n", "\n", "* Data format includes frame and video, video indicates training with online video, frame indicates training with offline frames. Due to the differences caused by decoding, the models obtained from the training of this two formats may have some differences in accuracy.\n", "\n", "* The evaluation methods include uniform and dense. Uniform means central sampling, and dense means dense sampling.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Related papers and citations\n", "\n", "```\n", "@inproceedings{lin2019tsm,\n", " title={TSM: Temporal Shift Module for Efficient Video Understanding},\n", " author={Lin, Ji and Gan, Chuang and Han, Song},\n", " booktitle={Proceedings of the IEEE International Conference on Computer Vision},\n", " year={2019}\n", "} \n", "```\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }