Unverified · Commit 49d0000c · authored by JYChen, committed by GitHub

add en doc for action development (#6484)

Parent 2cb48107
简体中文 | [English](./README_en.md)
# Secondary Development for Action Recognition Task
In industrial applications, action recognition algorithms inevitably need to be customized for new action types, or existing action recognition models need to be optimized to improve their performance in specific scenes. Given the diversity of behaviors, PP-Human supports the recognition of five abnormal behaviors: smoking, making phone calls, falling, fighting, and people intrusion. According to the characteristics of each behavior, it integrates five action recognition solutions based on video classification, detection, image classification, tracking, and skeleton keypoints, which can cover the recognition of 90%+ action types and meet various development needs. In this document, we use these cases to introduce how to choose an action recognition solution for the target behavior and how to carry out secondary development of action recognition algorithms with PaddleDetection, including solution selection, data preparation, model optimization, and the development process for adding new actions.
...@@ -5,7 +7,7 @@
## Solution Selection
In PaddleDetection's PP-Human, we provide multiple solutions for action recognition: video-classification-based, image-classification-based, detection-based, tracking-based, and skeleton-based action recognition, to meet the needs of different scenes and different target behaviors. For secondary development, the first step is to decide which solution to adopt: the key is to analyze the scene and the specific behavior, take factors such as data collection cost into account, and choose a suitable recognition solution. Here we briefly list the advantages, disadvantages, and applicable scenes of the solutions currently supported in PaddleDetection for reference.
<img width="1091" alt="image" src="https://user-images.githubusercontent.com/22989727/178742352-d0c61784-3e93-4406-b2a2-9067f42cb343.png">
...@@ -43,7 +45,7 @@
Reason: Unlike the actions above, fighting is a typical multi-person behavior. Therefore, detection and tracking models are no longer used to extract individual pedestrians and their IDs; instead, the whole video clip is processed. In addition, mutual occlusion between targets in fighting scenes is extremely severe, so keypoint recognition accuracy is low and a skeleton-based solution can hardly guarantee precision.
The following sections describe in detail the data preparation, model optimization, and the procedure for adding new actions for each major category of solutions:
1. [Action recognition based on detection with human ID](./idbased_det.md)
2. [Action recognition based on classification with human ID](./idbased_clas.md)
......
[简体中文](./README.md) | English
# Secondary Development for Action Recognition Task
In industrial applications, action recognition algorithms inevitably need to be customized for new action types, or existing action recognition models need to be optimized to improve their performance in specific scenarios. Given the diversity of behaviors, PP-Human supports the recognition of five abnormal behaviors: smoking, making phone calls, falling, fighting, and people intrusion. According to the characteristics of each behavior, PP-Human integrates five action recognition solutions based on video classification, detection, image classification, tracking, and skeleton keypoints, which can cover the recognition of 90%+ action types and meet various development needs. In this document, we use these cases to introduce how to select an action recognition solution according to the expected behavior, and how to carry out secondary development of the action recognition algorithm with PaddleDetection, including solution selection, data preparation, model optimization, and the development process for adding new actions.
## Solution Selection
In PaddleDetection's PP-Human, we provide a variety of solutions for action recognition: video-classification-based, image-classification-based, detection-based, tracking-based, and skeleton-based action recognition, to meet the needs of different scenes and different target behaviors. For secondary development, the first step is to determine which solution to adopt: the key is to analyze the scene and the specific behavior, take factors such as data collection cost into account, and choose a suitable recognition solution. The following figure briefly lists the advantages, disadvantages, and applicable scenes of the solutions currently supported in PaddleDetection for reference.
<img width="1091" alt="image" src="https://user-images.githubusercontent.com/22989727/178742352-d0c61784-3e93-4406-b2a2-9067f42cb343.png">
The following takes several specific actions currently supported in PaddleDetection as examples to explain the basis for selecting each solution:
### Smoking
Solution selection: action recognition based on detection with human ID.
Reason: The smoking action has an obvious characteristic object, the cigarette, so we can assume that when a cigarette is detected in the image region corresponding to a person, that person is smoking. Compared with video-based or skeleton-based recognition solutions, training a detection model only requires data collected and annotated at the image level rather than the video level, which significantly reduces the difficulty of data collection and labeling. In addition, detection tasks have abundant pretrained model resources, so the performance of the model is better guaranteed.
### Making Phone Calls
Solution selection: action recognition based on classification with human ID.
Reason: Although a mobile phone is a characteristic object of the phone-call action, in order to distinguish it from actions such as looking at the phone, and considering that the phone is often heavily occluded in security scenes (for example by the hand or the head), a detection model has difficulty locating the target correctly. Meanwhile, phone calls usually last a long time and the person's posture changes little, so a frame-level image classification strategy can be employed. In addition, the phone-call action can largely be judged from the upper body, so half-body images can be used to remove redundant information and reduce the difficulty of model training.
### Falling
Solution selection: action recognition based on skeleton.
Reason: Falling is a distinctly temporal action that can be distinguished from a single person and is scene-independent. Since PP-Human targets security monitoring scenes, where backgrounds are complicated and real-time inference must be considered in deployment, skeleton-based action recognition is adopted to obtain better generalization and running speed.
### People Intrusion
Solution selection: action recognition based on tracking with human ID.
Reason: Intrusion can be judged by whether the pedestrian's trajectory or location falls inside a selected area, and it is unrelated to the pedestrian's body action. Therefore, it is only necessary to track the person and analyze the coordinate results to determine whether an intrusion has occurred, as sketched below.
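As an illustration only (not the exact logic used inside PP-Human), the kind of region test such a tracking-based solution relies on can be sketched as a point-in-polygon check on the tracked person's foot point:
```python
def point_in_polygon(px, py, polygon):
    """Ray-casting test: return True if point (px, py) lies inside polygon [(x0, y0), (x1, y1), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal line at py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def is_intruding(bbox, region_polygon):
    # Use the bottom-center of the tracked person's bounding box as the location to test.
    x1, y1, x2, y2 = bbox
    return point_in_polygon((x1 + x2) / 2.0, y2, region_polygon)
```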
### Fighting
Solution selection: action recognition based on video classification.
Reason: Unlike the actions above, fighting is a typical multi-person behavior. Therefore, the detection and tracking models are no longer used to extract individual pedestrians and their IDs; instead, the entire video clip is processed. In addition, mutual occlusion between targets in fighting scenes is extremely severe, so keypoint recognition accuracy is poor and skeleton-based solutions can hardly guarantee precision.
The following pages describe each of the five major categories of solutions in detail, including data preparation, model optimization, and how to add new actions:
1. [Action recognition based on detection with human ID](./idbased_det_en.md)
2. [Action recognition based on classification with human ID](./idbased_clas_en.md)
3. [Action recognition based on skeleton](./skeletonbased_rec_en.md)
4. [Action recognition based on tracking with human ID](../mot_en.md)
5. [Action recognition based on video classification](./videobased_rec_en.md)
简体中文 | [English](./idbased_clas_en.md)
# Development of the Classification Model Based on Human ID
## Environmental Preparation
......
[简体中文](./idbased_clas.md) | English
# Development for Action Recognition Based on Classification with Human ID
## Environmental Preparation
The model of action recognition based on classification with human id is trained with [PaddleClas](https://github.com/PaddlePaddle/PaddleClas). Please refer to [Install PaddleClas](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.4/docs/en/installation/install_paddleclas_en.md) to complete the environment installation for subsequent model training and usage processes.
## Data Preparation
The model of action recognition based on classification with human ID directly recognizes the image frames of a video, so the model training process is the same as that of a usual image classification model.
### Dataset Download
The action recognition of making phone calls is trained on the public dataset [UAV-Human](https://github.com/SUTDCV/UAV-Human). Please fill in the relevant application materials through this link to obtain the download link.
The RGB video in this dataset is included in the `UAVHuman/ActionRecognition/RGBVideos` path, and the file name of each video is its annotation information.
### Image Processing for Training and Validation
The `A` field in the video file name (i.e. the action) is related to action recognition, so from the file names we can find the videos of the action type that we expect to recognize.
- Positive sample video: Taking phone calls as an example, we just need to find the files containing `A024`.
- Negative sample video: All videos except the target action.
Since converting video data into images introduces a lot of redundancy, for positive sample videos we sample one frame every 8 frames and use the pedestrian detection model to process each frame into a half-body image (take the upper half of the detection box, i.e. `img = img[:H//2, :, :]`). Images sampled from the positive sample videos are regarded as positive samples, and images sampled from the negative sample videos are regarded as negative samples.
**Note**: A positive sample video does not consist entirely of the phone-call action. There are usually some irrelevant actions at the beginning and end of the video, which need to be removed.
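For reference, a minimal sketch of this sampling and half-body cropping step might look as follows (the frame interval and the way person boxes are obtained are assumptions; any pedestrian detector that outputs `[x1, y1, x2, y2]` boxes can be used):
```python
import cv2

def extract_halfbody_frames(video_path, detect_person, interval=8):
    """Sample one frame every `interval` frames and crop the upper half of each detected person."""
    crops = []
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % interval == 0:
            # detect_person(frame) is assumed to return a list of [x1, y1, x2, y2] person boxes
            for x1, y1, x2, y2 in detect_person(frame):
                person = frame[int(y1):int(y2), int(x1):int(x2), :]
                h = person.shape[0]
                crops.append(person[:h // 2, :, :])  # keep only the upper half-body
        frame_idx += 1
    cap.release()
    return crops
```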
### Preparation for Annotation File
The model of action recognition based on classification with human id is trained with [PaddleClas](https://github.com/PaddlePaddle/PaddleClas). Thus the model trained with this scheme needs to prepare the desired image data and corresponding annotation files. Please refer to [Image Classification Datasets](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.4/docs/en/data_preparation/classification_dataset_en.md) to prepare the data. An example of an annotation file is as follows, where `0` and `1` are the corresponding categories of the image:
```
# Each line uses "space" to separate the image path and label
train/000001.jpg 0
train/000002.jpg 0
train/000003.jpg 1
...
```
Additionally, the label file `phone_label_list.txt` helps map category numbers to specific type names:
```
0 make_a_phone_call # type 0
1 normal # type 1
```
After the above files are ready, place them under the `dataset` directory; the file structure is as follows:
```
data/
├── images # All images
├── phone_label_list.txt # Label file
├── phone_train_list.txt # Training list, including pictures and their corresponding types
└── phone_val_list.txt # Validation list, including pictures and their corresponding types
```
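A minimal sketch for generating such list files from a directory of sampled images is shown below (the per-class directory layout and the 9:1 train/validation split are assumptions):
```python
import os
import random

def write_list_file(samples, path):
    # Each line: "<relative image path> <label>"
    with open(path, "w") as f:
        for rel_path, label in samples:
            f.write(f"{rel_path} {label}\n")

# Assumed layout: data/images/make_a_phone_call/*.jpg -> label 0, data/images/normal/*.jpg -> label 1
samples = []
for label, cls_dir in enumerate(["make_a_phone_call", "normal"]):
    for name in os.listdir(os.path.join("data/images", cls_dir)):
        samples.append((os.path.join("images", cls_dir, name), label))

random.shuffle(samples)
split = int(len(samples) * 0.9)  # assumed 9:1 train/val split
write_list_file(samples[:split], "data/phone_train_list.txt")
write_list_file(samples[split:], "data/phone_val_list.txt")
```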
## Model Optimization
### Detection-Tracking Model Optimization
The performance of action recognition based on classification with human ID depends on the upstream detection and tracking models. If the pedestrian location cannot be accurately detected in the actual scene, or it is difficult to correctly assign the person ID between different frames, the performance of the action recognition part will be limited. If you encounter the above problems in actual use, please refer to [Secondary Development of Detection Task](../detection_en.md) and [Secondary Development of Multi-target Tracking Task](../mot_en.md) for detection/tracking model optimization.
### Half-Body Prediction
For the action of making a phone call, the classification can be achieved from the upper-body image alone. Therefore, during training and prediction, the input image is changed from the pedestrian's full body to the half body.
## Add New Action
### Data Preparation
Referring to the previous introduction, complete the data preparation part and place it under `{root of PaddleClas}/dataset`:
```
data/
├── images # All images
├── label_list.txt # Label file
├── train_list.txt # Training list, including pictures and their corresponding types
└── val_list.txt # Validation list, including pictures and their corresponding types
```
The training list and validation list files are as follows:
```
# Each line uses "space" to separate the image path and label
train/000001.jpg 0
train/000002.jpg 0
train/000003.jpg 1
train/000004.jpg 2 # For the newly added categories, simply fill in the corresponding category number.
...
```
`label_list.txt` should give the name of every category, including the newly added one:
```
0 make_a_phone_call # class 0
1 Your_New_Action # class 1
...
n normal # class n
```
### Configuration File Settings
The [training configuration file](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/configs/practical_models/PPHGNet_tiny_calling_halfbody.yaml) has already been integrated into PaddleClas. The settings that need attention are as follows:
```yaml
# model architecture
Arch:
  name: PPHGNet_tiny
  class_num: 2 # Corresponding to the number of action categories
...
# Please correctly set image_root and cls_label_path to ensure that image_root + the image path in cls_label_path can access the image correctly
DataLoader:
  Train:
    dataset:
      name: ImageNetDataset
      image_root: ./dataset/
      cls_label_path: ./dataset/phone_train_list_halfbody.txt
...
Infer:
  infer_imgs: docs/images/inference_deployment/whl_demo.jpg
  batch_size: 1
  transforms:
    - DecodeImage:
        to_rgb: True
        channel_first: False
    - ResizeImage:
        size: 224
    - NormalizeImage:
        scale: 1.0/255.0
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
        order: ''
    - ToCHWImage:
  PostProcess:
    name: Topk
    topk: 2 # Number of top-k results to display; do not exceed the total number of categories
    class_id_map_file: dataset/phone_label_list.txt # path of label_list.txt
```
### Model Training And Evaluation
#### Model Training
Start training with the following command:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 -m paddle.distributed.launch \
--gpus="0,1,2,3" \
tools/train.py \
-c ./ppcls/configs/practical_models/PPHGNet_tiny_calling_halfbody.yaml \
-o Arch.pretrained=True
```
where `Arch.pretrained=True` means loading pretrained weights to help training.
#### Model Evaluation
After training the model, use the following command to evaluate the model metrics.
```bash
python3 tools/eval.py \
-c ./ppcls/configs/practical_models/PPHGNet_tiny_calling_halfbody.yaml \
-o Global.pretrained_model=output/PPHGNet_tiny/best_model
```
Where `-o Global.pretrained_model="output/PPHGNet_tiny/best_model"` specifies the path where the current best weight is located. If other weights are needed, just replace the corresponding path.
#### Model Export
For the detailed introduction of model export, please refer to [here](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/en/inference_deployment/export_model_en.md#2-export-classification-model)
You can refer to the following steps:
```bash
python tools/export_model.py \
-c ./PPHGNet_tiny_calling_halfbody.yaml \
-o Global.pretrained_model=./output/PPHGNet_tiny/best_model \
-o Global.save_inference_dir=./output_inference/PPHGNet_tiny_calling_halfbody
```
Then rename the exported model and add the configuration file to suit the usage of PP-Human.
```bash
cd ./output_inference/PPHGNet_tiny_calling_halfbody
mv inference.pdiparams model.pdiparams
mv inference.pdiparams.info model.pdiparams.info
mv inference.pdmodel model.pdmodel
# Download configuration file for inference
wget https://bj.bcebos.com/v1/paddledet/models/pipeline/infer_configs/PPHGNet_tiny_calling_halfbody/infer_cfg.yml
```
At this point, this model can be used in PP-Human.
简体中文 | [English](./idbased_det_en.md)
# Development for Action Recognition Based on Detection with Human ID
## Environmental Preparation
...@@ -5,7 +7,7 @@
The detection-based action recognition model with human ID is trained directly with [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection). Please follow the [Installation Guide](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/docs/tutorials/INSTALL_cn.md) to complete the environment installation for the subsequent model training and usage.
## Data Preparation
In the detection-based action recognition solution, the data preparation process is the same as that of a general detection model; see [Data Preparation for Object Detection](../../../tutorials/data/PrepareDetDataSet.md) for details. Organize the images and annotation data into one of the formats supported by PaddleDetection.
**Note**: In the actual prediction process, a single-person image is used for prediction. Therefore, during training it is recommended to crop the images into single-person images before annotating the cigarette bounding boxes, to improve accuracy.
...@@ -24,7 +26,7 @@
## Add New Action
### Data Preparation
Refer to [Data Preparation for Object Detection](../../../tutorials/data/PrepareDetDataSet.md) to complete the training data preparation.
After preparation, the data path is:
```
...@@ -130,16 +132,16 @@ TestDataset:
```
### Model Training and Evaluation
#### Model Training
Refer to [PP-YOLOE](../../../../configs/ppyoloe/README_cn.md) and execute the following steps:
```bash
# At Root of PaddleDetection
python -m paddle.distributed.launch --gpus 0,1,2,3 tools/train.py -c configs/pphuman/ppyoloe_crn_s_80e_smoking_visdrone.yml --eval
```
#### Model Evaluation
After the model is trained, the model metrics can be evaluated with the following command:
```bash
......
[简体中文](./idbased_det.md) | English
# Development for Action Recognition Based on Detection with Human ID
## Environmental Preparation
The model of action recognition based on detection with human id is trained with [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection). Please refer to [Installation](../../../tutorials/INSTALL.md) to complete the environment installation for subsequent model training and usage processes.
## Data Preparation
The model of action recognition based on detection with human ID directly recognizes the image frames of a video, so the model training process is the same as that of a general detection model. For details, please refer to [Data Preparation for Detection](../../../tutorials/data/PrepareDetDataSet_en.md). Please process the images and annotations into one of the formats PaddleDetection supports.
**Note**: In the actual prediction process, a single-person image is used for prediction. Therefore, it is recommended to crop the images into single-person images during training, and then label the cigarette detection bounding boxes on these crops to improve accuracy.
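A minimal sketch of this cropping step, assuming person boxes are already available (for example from an existing pedestrian detector or from person annotations), might look like this:
```python
import os
import cv2

def crop_person_images(image_path, person_boxes, out_dir):
    """Save one crop per person box; cigarette boxes are then annotated on these single-person crops."""
    os.makedirs(out_dir, exist_ok=True)
    img = cv2.imread(image_path)
    stem = os.path.splitext(os.path.basename(image_path))[0]
    for i, (x1, y1, x2, y2) in enumerate(person_boxes):
        crop = img[int(y1):int(y2), int(x1):int(x2), :]
        cv2.imwrite(os.path.join(out_dir, f"{stem}_person{i}.jpg"), crop)
```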
## Model Optimization
### Detection-Tracking Model Optimization
The performance of action recognition based on detection with human id depends on the pre-order detection and tracking models. If the pedestrian location cannot be accurately detected in the actual scene, or it is difficult to correctly assign the person ID between different frames, the performance of the action recognition part will be limited. If you encounter the above problems in actual use, please refer to [Secondary Development of Detection Task](../detection_en.md) and [Secondary Development of Multi-target Tracking Task](../mot_en.md) for detection/track model optimization.
### Larger Resolution
From the surveillance perspective, cigarette detection is a typical small-object detection problem. Using a larger input resolution can help improve the overall performance of the model.
### Pretrained Model
Using a model pretrained on the small-object dataset VisDrone for training increases the mAP of the model from 38.1 to 39.7.
## Add New Action
### Data Preparation
Please refer to [Data Preparation for Detection](../../../tutorials/data/PrepareDetDataSet_en.md) to complete the data preparation.
When this step is finished, the data path will look like:
```
dataset/smoking
├── smoking # all images
│   ├── 1.jpg
│   ├── 2.jpg
├── smoking_test_cocoformat.json # Validation file
├── smoking_train_cocoformat.json # Training file
```
Taking the `COCO` format as an example, the content of the completed json annotation file is as follows:
```json
# The "images" field contains the path, id and corresponding width and height information of the images.
"images": [
{
"file_name": "smoking/1.jpg",
"id": 0, # Here id is the picture id serial number, do not duplicate
"height": 437,
"width": 212
},
{
"file_name": "smoking/2.jpg",
"id": 1,
"height": 655,
"width": 365
},
...
# The "categories" field contains all category information. If you want to add more detection categories, please add them here. The example is as follows.
"categories": [
{
"supercategory": "cigarette",
"id": 1,
"name": "cigarette"
},
{
"supercategory": "Class_Defined_by_Yourself",
"id": 2,
"name": "Class_Defined_by_Yourself"
},
...
# The "annotations" field contains information about all instances, including category, bounding box coordinates, id, image id and other information
"annotations": [
{
"category_id": 1, # Corresponding to the defined category, where 1 represents cigarette
"bbox": [
97.0181345931,
332.7033243081,
7.5943999555,
16.4545332369
],
"id": 0, # Here id is the id serial number of the instance, do not duplicate
"image_id": 0, # Here is the id serial number of the image where the instance is located, which may be duplicated. In this case, there are multiple instance objects on one image.
"iscrowd": 0,
"area": 124.96230648208665
},
{
"category_id": 2, # Corresponding to the defined category, where 2 represents Class_Defined_by_Yourself
"bbox": [
114.3895698372,
221.9131122343,
25.9530363697,
50.5401234568
],
"id": 1,
"image_id": 1,
"iscrowd": 0,
"area": 1311.6696622034585
  },
  ...
```
### Configuration File Settings
Refer to the [configuration file](../../../../configs/pphuman/ppyoloe_crn_s_80e_smoking_visdrone.yml); the key settings to pay attention to are as follows:
```yaml
metric: COCO
num_classes: 1 # If more categories are added, please modify here accordingly
# Set image_dir,anno_path,dataset_dir correctly
# Ensure that dataset_dir + anno_path can correctly access to the path of the annotation file
# Ensure that dataset_dir + image_dir + the image path in the annotation file can correctly access to the image path
TrainDataset:
  !COCODataSet
    image_dir: ""
    anno_path: smoking_train_cocoformat.json
    dataset_dir: dataset/smoking
    data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']

EvalDataset:
  !COCODataSet
    image_dir: ""
    anno_path: smoking_test_cocoformat.json
    dataset_dir: dataset/smoking

TestDataset:
  !ImageFolder
    anno_path: smoking_test_cocoformat.json
    dataset_dir: dataset/smoking
```
### Model Training And Evaluation
#### Model Training
Following [PP-YOLOE](../../../../configs/ppyoloe/README.md), start training with the following command:
```bash
# At Root of PaddleDetection
python -m paddle.distributed.launch --gpus 0,1,2,3 tools/train.py -c configs/pphuman/ppyoloe_crn_s_80e_smoking_visdrone.yml --eval
```
#### Model Evaluation
After training the model, use the following command to evaluate the model metrics.
```bash
# At Root of PaddleDetection
python tools/eval.py -c configs/pphuman/ppyoloe_crn_s_80e_smoking_visdrone.yml
```
#### Model Export
Note: If you predict in a TensorRT environment, please enable `-o trt=True` for better performance.
```bash
# At Root of PaddleDetection
python tools/export_model.py -c configs/pphuman/ppyoloe_crn_s_80e_smoking_visdrone.yml -o weights=output/ppyoloe_crn_s_80e_smoking_visdrone/best_model trt=True
```
After exporting the model, you can get:
```
ppyoloe_crn_s_80e_smoking_visdrone/
├── infer_cfg.yml
├── model.pdiparams
├── model.pdiparams.info
└── model.pdmodel
```
At this point, this model can be used in PP-Human.
简体中文 | [English](./skeletonbased_rec_en.md)
# Skeleton-Based Action Recognition
## Environmental Preparation
...@@ -143,7 +145,6 @@ DATASET: #DATASET field
```
### Model Training and Testing
- In PaddleVideo, start training with the following command:
```bash
......
[简体中文](./skeletonbased_rec.md) | English
# Skeleton-Based Action Recognition
## Environmental Preparation
The skeleton-based action recognition is trained with [PaddleVideo](https://github.com/PaddlePaddle/PaddleVideo). Please refer to [Installation](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/en/install.md) to complete the environment installation for subsequent model training and usage processes.
## Data Preparation
For the skeleton-based model, you can refer to [this document](https://github.com/PaddlePaddle/PaddleVideo/tree/develop/applications/PPHuman#%E5%87%86%E5%A4%87%E8%AE%AD%E7%BB%83%E6%95%B0%E6%8D%AE) to prepare training data adapted to PaddleVideo. The main process includes the following steps:
### Data Format Description
STGCN is a model based on sequences of skeleton point coordinates. In PaddleVideo, the training data is `Numpy` data stored in `.npy` format, and the labels can be files stored in `.npy` or `.pkl` format. The required dimension layout for the sequence data is `(N,C,T,V,M)`. The current solution only supports actions performed by a single person (there can be multiple people in the video, and action recognition is performed for each person separately), that is, `M=1`.
| Dim | Size | Description |
| ---- | ---- | ---------- |
| N | Not Fixed | The number of sequences in the dataset |
| C | 2 | Keypoint coordinate, i.e. (x, y) |
| T | 50 | The temporal dimension of the action sequence (i.e. the number of continuous frames)|
| V | 17 | The number of keypoints of each person, here we use the definition of the `COCO` dataset, see [here](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/docs/tutorials/PrepareKeypointDataSet_en.md#description-for-coco-datasetkeypoint) |
| M | 1 | The number of persons, here we only predict a single person for each action sequence |
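To make the layout concrete, here is a hedged sketch of what such a data file looks like (random values, purely illustrative):
```python
import numpy as np

# (N, C, T, V, M) = (number of sequences, coordinate channels, frames, keypoints, persons)
N, C, T, V, M = 4, 2, 50, 17, 1
data = np.random.rand(N, C, T, V, M).astype(np.float32)
np.save("example_data.npy", data)
print(np.load("example_data.npy").shape)  # (4, 2, 50, 17, 1)
```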
### Get The Skeleton Point Coordinates of The Sequence
Here a sequence refers to an action segment, which can be a video or an ordered collection of images. For each sequence to be labeled, the coordinates of the skeleton points (also known as keypoints) can be obtained through model prediction or manual annotation.
- Model prediction: You can directly select a model from the [PaddleDetection KeyPoint Models](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/configs/keypoint/README_en.md) and follow `3. Training and Testing - Deployment Prediction - Detect + keypoint top-down model joint deployment` to get the 17 keypoint coordinates of the target sequence.
When using a model to predict the coordinates, you can refer to the following steps. Please note that these commands are executed under PaddleDetection.
```bash
# current path is under root of PaddleDetection
# Step 1: download pretrained inference models.
wget https://bj.bcebos.com/v1/paddledet/models/pipeline/mot_ppyoloe_l_36e_pipeline.zip
wget https://bj.bcebos.com/v1/paddledet/models/pipeline/dark_hrnet_w32_256x192.zip
unzip -d output_inference/ mot_ppyoloe_l_36e_pipeline.zip
unzip -d output_inference/ dark_hrnet_w32_256x192.zip
# Step 2: Get the keypoint coordinates
# if your data is image sequence
python deploy/python/det_keypoint_unite_infer.py --det_model_dir=output_inference/mot_ppyoloe_l_36e_pipeline/ --keypoint_model_dir=output_inference/dark_hrnet_w32_256x192 --image_dir={your image directory path} --device=GPU --save_res=True
# if your data is video
python deploy/python/det_keypoint_unite_infer.py --det_model_dir=output_inference/mot_ppyoloe_l_36e_pipeline/ --keypoint_model_dir=output_inference/dark_hrnet_w32_256x192 --video_file={your video file path} --device=GPU --save_res=True
```
We can get a detection result file named `det_keypoint_unite_image_results.json`. The detail of content can be seen at [Here](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/deploy/python/det_keypoint_unite_infer.py#L108).
### Uniform Sequence Length
Since the length of each action in the actual data is different, the first step is to pre-determine the time sequence length according to your data and the actual scene (in PP-Human, we use 50 frames as an action sequence), and do the following processing to the data:
- If the actual length exceeds the predetermined length, a 50-frame segment is randomly cropped from the sequence.
- If the actual length is less than the predetermined length, the sequence is padded with 0 until it reaches 50 frames.
- If the actual length is exactly equal to the predetermined length, no processing is required.
Note: After this step is completed, please strictly confirm that the processed data contains a complete action and causes no ambiguity in prediction. It is recommended to confirm this by visualizing the data.
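A minimal sketch of this step, assuming each sequence is a `(T_actual, 17, 2)` array of keypoint coordinates:
```python
import numpy as np

def uniform_length(seq, target_len=50):
    """Randomly crop or zero-pad a (T_actual, V, C) keypoint sequence to target_len frames."""
    t = seq.shape[0]
    if t > target_len:
        start = np.random.randint(0, t - target_len + 1)
        return seq[start:start + target_len]
    if t < target_len:
        pad = np.zeros((target_len - t,) + seq.shape[1:], dtype=seq.dtype)
        return np.concatenate([seq, pad], axis=0)
    return seq
```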
### Save to PaddleVideo usable formats
After the previous two steps, we have the annotation of each person's action fragment. At this point we have a list `all_kpts`, which contains multiple keypoint sequence fragments, each with a shape of (T, V, C) (in our case (50, 17, 2)); it is further converted into a format usable by PaddleVideo.
- Adjust dimension order: `np.transpose` and `np.expand_dims` can be used to convert the dimension of each fragment into (C, T, V, M) format.
- Combine and save all clips as one file
Note: `class_id` is an `int` variable, as in other classification tasks. For example, `0: falling, 1: other`.
We provide a [script file](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/applications/PPHuman/datasets/prepare_dataset.py) to do this step, which can directly process the generated `det_keypoint_unite_image_results.json` file. The script parses the content of the json file, uniforms the length of the training sequences, and saves the data files as described in the preceding steps.
```bash
mkdir {root of PaddleVideo}/applications/PPHuman/datasets/annotations
mv det_keypoint_unite_image_results.json {root of PaddleVideo}/applications/PPHuman/datasets/annotations/det_keypoint_unite_image_results_{video_id}_{camera_id}.json
cd {root of PaddleVideo}/applications/PPHuman/datasets/
python prepare_dataset.py
```
Now, we have available training data (`.npy`) and corresponding annotation files (`.pkl`).
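If you prefer to assemble the files yourself instead of using the provided script, a rough sketch of the dimension conversion is shown below; note that the exact label file layout expected by `SkeletonDataset` should be taken from `prepare_dataset.py`, and the pickle layout used here is only an assumption:
```python
import pickle
import numpy as np

# all_kpts: list of (50, 17, 2) keypoint arrays; all_labels: list of int class ids (e.g. 0: falling, 1: other)
def save_dataset(all_kpts, all_labels, data_path="train_data.npy", label_path="train_label.pkl"):
    clips = []
    for kpts in all_kpts:
        clip = np.transpose(kpts, (2, 0, 1))         # (T, V, C) -> (C, T, V)
        clips.append(np.expand_dims(clip, axis=-1))  # -> (C, T, V, M) with M = 1
    np.save(data_path, np.stack(clips).astype(np.float32))  # (N, C, T, V, M)
    with open(label_path, "wb") as f:
        # Assumed layout: (sample_names, labels); follow prepare_dataset.py for the exact format.
        sample_names = [str(i) for i in range(len(all_labels))]
        pickle.dump((sample_names, list(all_labels)), f)
```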
## Model Optimization
### Detection-Tracking Model Optimization
The performance of skeleton-based action recognition depends on the upstream detection and tracking models. If the pedestrian location cannot be accurately detected in the actual scene, or it is difficult to correctly assign the person ID between different frames, the performance of the action recognition part will be limited. If you encounter the above problems in actual use, please refer to [Secondary Development of Detection Task](../detection_en.md) and [Secondary Development of Multi-target Tracking Task](../mot_en.md) for detection/tracking model optimization.
### Keypoint Model Optimization
As the core feature of this solution, the skeleton point localization performance also determines the overall effect of action recognition. If there are obvious errors in the keypoint coordinates recognized in the actual scene, it is difficult to distinguish the specific action from the skeleton sequence composed of these keypoints.
You can refer to [Secondary Development of Keypoint Detection Task](../keypoint_detection_en.md) to optimize the keypoint model.
### Coordinate Normalization
After getting the coordinates of the skeleton points, it is recommended to normalize them with respect to the detection bounding box of each person, to reduce the convergence difficulty caused by differences in the person's position and scale.
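One common way to do this, sketched below under the assumption that the keypoints and the person box are in the same image coordinate system, is to map the coordinates into a box-relative range:
```python
import numpy as np

def normalize_by_bbox(kpts, bbox):
    """kpts: (T, V, 2) array of (x, y) coordinates; bbox: (x1, y1, x2, y2) person detection box."""
    x1, y1, x2, y2 = bbox
    w, h = max(x2 - x1, 1e-6), max(y2 - y1, 1e-6)
    out = kpts.astype(np.float32)
    out[..., 0] = (out[..., 0] - x1) / w  # x relative to the box, roughly in [0, 1]
    out[..., 1] = (out[..., 1] - y1) / h  # y relative to the box, roughly in [0, 1]
    return out
```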
## Add New Action
In skeleton-based action recognition, the model used is [ST-GCN](https://arxiv.org/abs/1801.07455), adapted to PaddleVideo. Refer to the [training steps](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/en/model_zoo/recognition/stgcn.md) and complete the model training and export process.
### Data Preparation And Configuration File Settings
- Prepare the training data (`.npy`) and the corresponding annotation file (`.pkl`) according to `Data Preparation`, and place them under `{root of PaddleVideo}/applications/PPHuman/datasets/`.
- Refer to the [configuration file](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/applications/PPHuman/configs/stgcn_pphuman.yaml); the settings to pay attention to are as follows:
```yaml
MODEL: #MODEL field
  framework:
  backbone:
    name: "STGCN"
    in_channels: 2 # This corresponds to the C dimension in the data format description, representing two-dimensional coordinates.
    dropout: 0.5
    layout: 'coco_keypoint'
    data_bn: True
  head:
    name: "STGCNHead"
    num_classes: 2 # If there are multiple action types in the data, this needs to be modified to match the number of types.
    if_top5: False # When the number of action types is less than 5, please set it to False, otherwise an error will be raised.

...

# Please set the data and label path of the train/valid/test part correctly according to the data path
DATASET: #DATASET field
  batch_size: 64
  num_workers: 4
  test_batch_size: 1
  test_num_workers: 0
  train:
    format: "SkeletonDataset" # Mandatory, indicates the type of dataset, associated to 'paddlevideo/loader/dataset'
    file_path: "./applications/PPHuman/datasets/train_data.npy" # Mandatory, train data index file path
    label_path: "./applications/PPHuman/datasets/train_label.pkl"
  valid:
    format: "SkeletonDataset" # Mandatory, indicates the type of dataset, associated to 'paddlevideo/loader/dataset'
    file_path: "./applications/PPHuman/datasets/val_data.npy" # Mandatory, valid data index file path
    label_path: "./applications/PPHuman/datasets/val_label.pkl"
    test_mode: True
  test:
    format: "SkeletonDataset" # Mandatory, indicates the type of dataset, associated to 'paddlevideo/loader/dataset'
    file_path: "./applications/PPHuman/datasets/val_data.npy" # Mandatory, valid data index file path
    label_path: "./applications/PPHuman/datasets/val_label.pkl"
    test_mode: True
```
### Model Training And Evaluation
- In PaddleVideo, start training with the following command:
```bash
# current path is under root of PaddleVideo
python main.py -c applications/PPHuman/configs/stgcn_pphuman.yaml
# Since the task may overfit, it is recommended to evaluate the model during training in order to save the best model.
python main.py --validate -c applications/PPHuman/configs/stgcn_pphuman.yaml
```
- After the model is trained, use the following command to test it:
```bash
python main.py --test -c applications/PPHuman/configs/stgcn_pphuman.yaml -w output/STGCN/STGCN_best.pdparams
```
### Model Export
In PaddleVideo, use the following command to export the model, obtaining the model structure file `STGCN.pdmodel` and the weight file `STGCN.pdiparams`, and add the configuration file for inference:
```bash
# current path is under root of PaddleVideo
python tools/export_model.py -c applications/PPHuman/configs/stgcn_pphuman.yaml \
-p output/STGCN/STGCN_best.pdparams \
-o output_inference/STGCN
cp applications/PPHuman/configs/infer_cfg.yml output_inference/STGCN
# Rename model files to adapt PP-Human
cd output_inference/STGCN
mv STGCN.pdiparams model.pdiparams
mv STGCN.pdiparams.info model.pdiparams.info
mv STGCN.pdmodel model.pdmodel
```
The directory structure will look like:
```
STGCN
├── infer_cfg.yml
├── model.pdiparams
├── model.pdiparams.info
├── model.pdmodel
```
At this point, this model can be used in PP-Human.
**Note**: If the length of the video sequence or the number of keypoints is changed during training, the content of the `INFERENCE` field in the configuration file needs to be modified accordingly to ensure correct prediction:
```yaml
# The dimension of the sequence data is (N,C,T,V,M)
INFERENCE:
  name: 'STGCN_Inference_helper'
  num_channels: 2 # Corresponding to the C dimension
  window_size: 50 # Corresponding to the T dimension; please set it according to the sequence length
  vertex_nums: 17 # Corresponding to the V dimension; please set it according to the number of keypoints
  person_nums: 1 # Corresponding to the M dimension
```