# Secondary Development for Action Recognition Task
In the process of industrial implementation, applying action recognition algorithms inevitably leads to the need for customized action types, or for optimizing existing action recognition models to improve their performance in specific scenarios. In view of the diversity of behaviors, PP-Human supports the recognition of five abnormal behaviors: smoking, making phone calls, falling, fighting, and people intrusion. At the same time, according to the different behaviors, PP-Human integrates five action recognition technical solutions: video-classification-based, detection-based, image-classification-based, tracking-based, and skeleton-based, which can cover 90%+ of action type recognition and meet various development needs. In this document, we use cases to introduce how to select an action recognition solution according to the expected behavior, and how to use PaddleDetection to carry out the secondary development of the action recognition algorithm, including: solution selection, data preparation, model optimization, and the development process for adding new actions.
## Solution Selection
In PaddleDetection's PP-Human, we provide a variety of solutions for behavior recognition: video classification, image classification, detection, tracking-based, and skeleton point-based behavior recognition solutions, in order to meet the needs of different scenes and different target behaviors.
The following uses several specific actions that PaddleDetection currently supports as examples to introduce the basis for selecting a solution for each action:
### Smoking
Solution selection: action recognition based on detection with human id.
Reason: The smoking action has an obvious characteristic object, the cigarette, so we can assume that when a cigarette is detected in the image region corresponding to a person, that person is performing the smoking action. Compared with video-based or skeleton-based recognition schemes, training a detection model only requires collecting data at the image level rather than the video level, which significantly reduces the difficulty of data collection and labeling. In addition, detection tasks have abundant pretrained model resources, so the performance of the model is better guaranteed.
### Making Phone Calls
Solution selection: action recognition based on classification with human id.
Reason: Although the calling action has a characteristic object, the mobile phone, we need to distinguish it from actions such as looking at the phone. Moreover, in security scenes the phone is often heavily occluded during a call (for example, by the hand or the head), which makes it hard for a detection model to detect the target correctly. At the same time, calls usually last a long time and the person's posture changes little, so a frame-level image classification strategy can be employed. In addition, the action of making a phone call can mainly be judged from the upper body, so half-body images can be used to remove redundant information and reduce the difficulty of model training.
### Falling
Solution selection: action recognition based on skeleton.
Reason: Falling is an obvious temporal action that can be distinguished from the person alone and is scene-independent. Since PP-Human targets security monitoring scenes, where background changes are complicated and real-time inference must be considered at deployment, skeleton-based action recognition is adopted to obtain better generalization and running speed.
### People Intrusion
Solution selection: action recognition based on tracking with human id.
Reason: Intrusion can be judged by whether the pedestrian's path or location falls within a selected area, and it is unrelated to the pedestrian's body action. Therefore, it is only necessary to track the person and analyze the coordinate results to decide whether there is intrusion behavior.
### Fighting
Solution selection: action recognition based on video classification.
Reason: Unlike the actions above, fighting is a typical multi-person action. Therefore, detection and tracking models are no longer used to extract pedestrians and their IDs; instead, the entire video clip is processed. In addition, mutual occlusion between targets in fighting scenes is extremely severe, which degrades the accuracy of keypoint recognition.
The following are detailed descriptions of the five categories of solutions, including data preparation, model optimization, and how to add new actions.
1. [action recognition based on detection with human id](./idbased_det_en.md)
2. [action recognition based on classification with human id](./idbased_clas_en.md)
3. [action recognition based on skeleton](./skeletonbased_rec_en.md)
4. [action recognition based on tracking with human id](../mot_en.md)
5. [action recognition based on video classification](./videobased_rec_en.md)
# Development for Action Recognition Based on Classification with Human ID
## Environmental Preparation
The model of action recognition based on classification with human id is trained with [PaddleClas](https://github.com/PaddlePaddle/PaddleClas). Please refer to [Install PaddleClas](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.4/docs/en/installation/install_paddleclas_en.md) to complete the environment installation for subsequent model training and usage processes.
## Data Preparation
The model of action recognition based on classification with human id directly recognizes the image frames of a video, so the model training process is the same as that of a usual image classification model.
### Dataset Download
The action recognition of making phone calls is trained on the public dataset [UAV-Human](https://github.com/SUTDCV/UAV-Human). Please fill in the relevant application materials through this link to obtain the download link.
The RGB video in this dataset is included in the `UAVHuman/ActionRecognition/RGBVideos` path, and the file name of each video is its annotation information.
### Image Processing for Training and Validation
According to the video file name, in which the `A` field (i.e. action) indicates the action type, we can find the video data of the action we expect to recognize.
- Positive sample video: taking making phone calls as an example, we just need to find the files containing `A024`.
- Negative sample video: all videos except the target action.
In view of the redundancy produced when converting video data into images, for positive sample videos we sample one frame every 8 frames and use the pedestrian detection model to process it into a half-body image (take the upper half of the detection box, i.e. `img = img[: H // 2, :, :]`). Images sampled from positive sample videos are regarded as positive samples, and images sampled from negative sample videos are regarded as negative samples.
**Note**: The positive sample videos do not consist entirely of the phone-call action. There are some redundant actions at the beginning and end of each video that need to be removed.
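The following is a minimal sketch of the sampling and half-body cropping described above, assuming the pedestrian detection step has already produced one person box `(x1, y1, x2, y2)` per sampled frame; the paths and the hypothetical `detect_person` helper are illustrative:
```python
import cv2

def extract_halfbody_samples(video_path, out_dir, interval=8):
    """Sample one frame every `interval` frames and keep the upper half of the person box."""
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            # `detect_person` is a hypothetical wrapper around your pedestrian detector;
            # it should return one integer (x1, y1, x2, y2) box for the target person.
            x1, y1, x2, y2 = detect_person(frame)
            person = frame[y1:y2, x1:x2, :]
            half_body = person[: person.shape[0] // 2, :, :]  # upper half of the detection box
            cv2.imwrite(f"{out_dir}/{saved:06d}.jpg", half_body)
            saved += 1
        idx += 1
    cap.release()
```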
### Preparation for Annotation File
The model of action recognition based on classification with human id is trained with [PaddleClas](https://github.com/PaddlePaddle/PaddleClas). Thus, to train the model with this scheme, you need to prepare the image data and the corresponding annotation files. Please refer to [Image Classification Datasets](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.4/docs/en/data_preparation/classification_dataset_en.md) to prepare the data. An example of an annotation file is as follows, where `0` and `1` are the categories of the corresponding images:
```
# Each line uses "space" to separate the image path and label
train/000001.jpg 0
train/000002.jpg 0
train/000003.jpg 1
...
```
Additionally, the label file `phone_label_list.txt` helps map category numbers to specific type names:
```
0 make_a_phone_call # type 0
1 normal # type 1
```
After the above preparation is finished, place the data under the `dataset` directory. The file structure is as follows:
```
data/
├── images # All images
├── phone_label_list.txt # Label file
├── phone_train_list.txt # Training list, including pictures and their corresponding types
└── phone_val_list.txt # Validation list, including pictures and their corresponding types
```
## Model Optimization
### Detection-Tracking Model Optimization
The performance of action recognition based on classification with human id depends on the pre-order detection and tracking models. If the pedestrian location cannot be accurately detected in the actual scene, or it is difficult to correctly assign the person ID between different frames, the performance of the action recognition part will be limited. If you encounter the above problems in actual use, please refer to [Secondary Development of Detection Task](../detection_en.md) and [Secondary Development of Multi-target Tracking Task](../mot_en.md) for detection/track model optimization.
### Half-Body Prediction
In the action of making a phone call, the classification can be achieved from the upper-body image alone. Therefore, during training and prediction, the input image is changed from the pedestrian's full body to the upper half of the body.
## Add New Action
### Data Preparation
Referring to the previous introduction, complete the data preparation part and place it under `{root of PaddleClas}/dataset`:
```
data/
├── images # All images
├── label_list.txt # Label file
├── train_list.txt # Training list, including pictures and their corresponding types
└── val_list.txt # Validation list, including pictures and their corresponding types
```
The training and validation list files are as follows:
```
# Each line uses "space" to separate the image path and label
train/000001.jpg 0
train/000002.jpg 0
train/000003.jpg 1
train/000004.jpg 2  # For the newly added categories, simply fill in the corresponding category number.
...
```
`label_list.txt` should give the name of each category, including the newly added one:
```
0 make_a_phone_call # class 0
1 Your New Action   # class 1
...
n normal            # class n
```
### Configuration File Settings
The [training configuration file](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/configs/practical_models/PPHGNet_tiny_calling_halfbody.yaml) has been integrated in PaddleClas. The settings that need attention are as follows:
```yaml
# model architecture
Arch:
  name: PPHGNet_tiny
  class_num: 2  # Corresponding to the number of action categories
...

# Please correctly set image_root and cls_label_path to ensure that image_root + the image path in cls_label_path can access the image correctly
```
In the training and export commands, `-o Global.pretrained_model="output/PPHGNet_tiny/best_model"` specifies the path where the current best weights are located. If other weights are needed, just replace the corresponding path.
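As a hedged sketch only (assuming PaddleClas's standard `tools/train.py` and `tools/eval.py` entry points; the paths are illustrative), training and evaluation with this configuration could look like:
```bash
# Run from the root of PaddleClas; adjust GPU settings and paths to your environment.
python tools/train.py \
    -c ./ppcls/configs/practical_models/PPHGNet_tiny_calling_halfbody.yaml

# Evaluate with the best weights produced by training.
python tools/eval.py \
    -c ./ppcls/configs/practical_models/PPHGNet_tiny_calling_halfbody.yaml \
    -o Global.pretrained_model="output/PPHGNet_tiny/best_model"
```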
#### Model Export
For the detailed introduction of model export, please refer to [here](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/en/inference_deployment/export_model_en.md#2-export-classification-model)
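For reference, a sketch of the export step under the same assumptions (PaddleClas's `tools/export_model.py`; the output directory is illustrative):
```bash
python tools/export_model.py \
    -c ./ppcls/configs/practical_models/PPHGNet_tiny_calling_halfbody.yaml \
    -o Global.pretrained_model="output/PPHGNet_tiny/best_model" \
    -o Global.save_inference_dir="output_inference/PPHGNet_tiny_calling_halfbody"
```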
# Development for Action Recognition Based on Detection with Human ID
## Environmental Preparation
The model of action recognition based on detection with human id is trained with [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection). Please refer to [Installation](../../../tutorials/INSTALL.md) to complete the environment installation for subsequent model training and usage processes.
## Data Preparation
The model of action recognition based on detection with human id directly recognizes the image frames of a video, so the model training process is the same as the data preparation process for a general detection model. For details, please refer to [Data Preparation for Detection](../../../tutorials/data/PrepareDetDataSet_en.md). Please process the images and annotations into one of the formats that PaddleDetection supports.
**Note**: In the actual prediction process, a single-person image is used for prediction. So it is recommended to crop the images into single-person images during training and label the cigarette bounding boxes on them to improve accuracy.
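A minimal sketch of this cropping step, assuming each full frame comes with a person box and a cigarette box in full-frame coordinates (the variable names are illustrative); the cigarette box must be shifted into the cropped image's coordinate system before it is written to the annotation file:
```python
def crop_person_with_cigarette(frame, person_box, cig_box):
    """Crop the person region and translate the cigarette box into crop coordinates."""
    px1, py1, px2, py2 = person_box
    crop = frame[py1:py2, px1:px2, :]          # single-person image used for training
    cx1, cy1, cx2, cy2 = cig_box
    # Express the cigarette box relative to the cropped image.
    shifted_box = (cx1 - px1, cy1 - py1, cx2 - px1, cy2 - py1)
    return crop, shifted_box
```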
## Model Optimization
### Detection-Tracking Model Optimization
The performance of action recognition based on detection with human id depends on the pre-order detection and tracking models. If the pedestrian location cannot be accurately detected in the actual scene, or it is difficult to correctly assign the person ID between different frames, the performance of the action recognition part will be limited. If you encounter the above problems in actual use, please refer to [Secondary Development of Detection Task](../detection_en.md) and [Secondary Development of Multi-target Tracking Task](../mot_en.md) for detection/track model optimization.
### Larger resolution
The detection of cigarettes from the monitoring perspective is a typical small-object detection problem. Using a larger input resolution can help improve the overall performance of the model.
### Pretrained model
Using the model pretrained on the small-object dataset VisDrone for training increases the mAP of the model from 38.1 to 39.7.
## Add New Action
### Data Preparation
Please refer to [Data Preparation for Detection](../../../tutorials/data/PrepareDetDataSet_en.md) to complete the data preparation part.
# Development for Action Recognition Based on Skeleton
## Environmental Preparation
The skeleton-based action recognition model is trained with [PaddleVideo](https://github.com/PaddlePaddle/PaddleVideo). Please refer to [Installation](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/en/install.md) to complete the environment installation for subsequent model training and usage processes.
## Data Preparation
For the skeleton-based model, you can refer to [this document](https://github.com/PaddlePaddle/PaddleVideo/tree/develop/applications/PPHuman#%E5%87%86%E5%A4%87%E8%AE%AD%E7%BB%83%E6%95%B0%E6%8D%AE) to prepare training data adapted to PaddleVideo. The main process includes the following steps:
### Data Format Description
STGCN is a model based on sequences of skeleton point coordinates. In PaddleVideo, the training data is `Numpy` data stored in `.npy` format, and labels can be stored in `.npy` or `.pkl` format. The dimension requirement for the sequence data is `(N,C,T,V,M)`. The current solution only supports actions performed by a single person (there can be multiple people in the video, and action recognition is performed for each person separately), that is, `M=1`.
| Dim | Size | Description |
| ---- | ---- | ---------- |
| N | Not Fixed | The number of sequences in the dataset |
| C | 2 | Keypoint coordinate, i.e. (x, y) |
| T | 50 | The temporal dimension of the action sequence (i.e. the number of continuous frames)|
| V | 17 | The number of keypoints of each person, here we use the definition of the `COCO` dataset, see [here](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/docs/tutorials/PrepareKeypointDataSet_en.md#description-for-coco-datasetkeypoint) |
| M | 1 | The number of persons, here we only predict a single person for each action sequence |
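For intuition, a minimal sketch (with illustrative values and file names) of building a dummy training array with this layout:
```python
import numpy as np

N, C, T, V, M = 100, 2, 50, 17, 1  # dataset size and the dimensions described in the table above
data = np.zeros((N, C, T, V, M), dtype=np.float32)   # keypoint (x, y) sequences
labels = np.zeros((N,), dtype=np.int64)              # one class id per sequence

np.save("train_data.npy", data)
np.save("train_label.npy", labels)  # labels may also be stored as .pkl, as noted above
```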
### Get The Skeleton Point Coordinates of The Sequence
For a sequence to be labeled (here a sequence refers to an action segment, which can be a video or an ordered collection of images), the coordinates of the skeleton points (also known as keypoints) can be obtained through model prediction or manual annotation.
- Model prediction: You can directly select a model from the [PaddleDetection KeyPoint Models](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/configs/keypoint/README_en.md) and follow `3. Training and Testing - Deployment Prediction - Detect + keypoint top-down model joint deployment` to obtain the 17 keypoint coordinates of the target sequence.
When using the model to predict the coordinates, you can refer to the following steps. Note that these commands are executed in PaddleDetection.
```bash
python deploy/python/det_keypoint_unite_infer.py \
    --det_model_dir=output_inference/mot_ppyoloe_l_36e_pipeline/ \
    --keypoint_model_dir=output_inference/dark_hrnet_w32_256x192 \
    --video_file={your video file path} \
    --device=GPU \
    --save_res=True
```
We can get a detection result file named `det_keypoint_unite_image_results.json`. The detail of content can be seen at [Here](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/deploy/python/det_keypoint_unite_infer.py#L108).
### Uniform Sequence Length
Since the length of each action in the actual data is different, the first step is to pre-determine the time sequence length according to your data and the actual scene (in PP-Human, we use 50 frames as an action sequence), and do the following processing to the data:
- If the actual length exceeds the predetermined length, a 50-frame segment is randomly cropped from the sequence.
- If the actual length is less than the predetermined length, the sequence is padded with 0 until it reaches 50 frames.
- If the actual length is exactly equal to the predetermined length, no processing is required.
Note: After this step is completed, please strictly confirm that the processed data contains a complete action and that there is no ambiguity in prediction. It is recommended to confirm this by visualizing the data.
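A minimal sketch of this step, assuming each keypoint sequence is an array of shape `(T, V, C)` (the window length of 50 and the function name are illustrative):
```python
import numpy as np

def to_fixed_length(seq, target_len=50):
    """Crop or zero-pad a (T, V, C) keypoint sequence to exactly target_len frames."""
    t = seq.shape[0]
    if t > target_len:
        start = np.random.randint(0, t - target_len + 1)  # random 50-frame crop
        return seq[start:start + target_len]
    if t < target_len:
        pad = np.zeros((target_len - t,) + seq.shape[1:], dtype=seq.dtype)
        return np.concatenate([seq, pad], axis=0)         # pad with zeros at the end
    return seq                                            # already the right length
```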
### Save in a Format Usable by PaddleVideo
After the first two steps of processing, we have the annotation for each action segment. At this point we have a list `all_kpts` containing multiple keypoint sequence fragments, each with a shape of (T, V, C) (in our case (50, 17, 2)), which needs to be further converted into a format usable by PaddleVideo.
- Adjust dimension order: `np.transpose` and `np.expand_dims` can be used to convert the dimension of each fragment into (C, T, V, M) format.
- Combine and save all clips as one file
Note: `class_id` is an `int` variable, similar to other classification tasks. For example, `0: falling, 1: other`.
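A minimal sketch of the conversion described above (the names `all_kpts` and `all_labels` are assumed here, and the label file format is illustrative; the official script mentioned below covers the full pipeline):
```python
import pickle

import numpy as np

# all_kpts: list of arrays with shape (50, 17, 2); all_labels: list of int class ids
clips = []
for kpts in all_kpts:
    clip = np.transpose(kpts, (2, 0, 1))   # (T, V, C) -> (C, T, V)
    clip = np.expand_dims(clip, axis=-1)   # (C, T, V) -> (C, T, V, M) with M=1
    clips.append(clip)

data = np.stack(clips).astype(np.float32)  # (N, C, T, V, M)
np.save("train_data.npy", data)
with open("train_label.pkl", "wb") as f:
    pickle.dump(all_labels, f)
```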
We provide a [script file](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/applications/PPHuman/datasets/prepare_dataset.py) for this step, which can directly process the generated `det_keypoint_unite_image_results.json` file. The script parses the content of the json file, uniforms the length of the training data sequences, and saves the data files as described in the preceding steps.
```bash
mkdir {root of PaddleVideo}/applications/PPHuman/datasets/annotations
mv det_keypoint_unite_image_results.json {root of PaddleVideo}/applications/PPHuman/datasets/annotations/det_keypoint_unite_image_results_{video_id}_{camera_id}.json
cd {root of PaddleVideo}/applications/PPHuman/datasets/
python prepare_dataset.py
```
Now, we have available training data (`.npy`) and corresponding annotation files (`.pkl`).
## Model Optimization
### Detection-Tracking Model Optimization
The performance of skeleton-based action recognition depends on the pre-order detection and tracking models. If the pedestrian location cannot be accurately detected in the actual scene, or it is difficult to correctly assign the person ID between different frames, the performance of the action recognition part will be limited. If you encounter the above problems in actual use, please refer to [Secondary Development of Detection Task](../detection_en.md) and [Secondary Development of Multi-target Tracking Task](../mot_en.md) for detection/tracking model optimization.
### Keypoint Model Optimization
As the core feature of the scheme, the skeleton point localization performance also determines the overall effect of action recognition. If there are obvious errors in the keypoint coordinates recognized in the actual scene, it is difficult to distinguish the specific action from the skeleton composed of the keypoints.
You can refer to [Secondary Development of Keypoint Detection Task](../keypoint_detection_en.md) to optimize the keypoint model.
### Coordinate Normalization
After getting coordinates of the skeleton points, it is recommended to perform normalization processing according to the detection bounding box of each person to reduce the convergence difficulty brought by the difference in the position and scale of the person.
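A minimal sketch of such normalization, assuming keypoints of shape `(T, V, 2)` and a detection box `(x1, y1, x2, y2)` for the person (the names are illustrative):
```python
import numpy as np

def normalize_by_bbox(kpts, bbox):
    """Map keypoint coordinates into the [0, 1] range of the person's detection box."""
    x1, y1, x2, y2 = bbox
    w, h = max(x2 - x1, 1e-6), max(y2 - y1, 1e-6)
    out = kpts.astype(np.float32)
    out[..., 0] = (out[..., 0] - x1) / w   # normalize x by the box width
    out[..., 1] = (out[..., 1] - y1) / h   # normalize y by the box height
    return out
```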
## Add New Action
In skeleton-based action recognition, the model is [ST-GCN](https://arxiv.org/abs/1801.07455), adapted to PaddleVideo. Refer to the [Training Step](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/en/model_zoo/recognition/stgcn.md) to complete the model training and export process.
### Data Preparation And Configuration File Settings
- Prepare the training data (`.npy`) and the corresponding annotation file (`.pkl`) according to the `Data Preparation` section, and place them under `{root of PaddleVideo}/applications/PPHuman/datasets/`.
- Refer to the [Configuration File](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/applications/PPHuman/configs/stgcn_pphuman.yaml); the settings that need attention are as follows:
```yaml
MODEL:  # MODEL field
  framework:
  backbone:
    name: "STGCN"
    in_channels: 2     # This corresponds to the C dimension in the data format description, representing two-dimensional coordinates.
    dropout: 0.5
    layout: 'coco_keypoint'
    data_bn: True
  head:
    name: "STGCNHead"
    num_classes: 2     # If there are multiple action types in the data, this needs to be modified to match the number of types.
    if_top5: False     # When the number of action types is less than 5, please set it to False, otherwise an error will be raised.
  ...

# Please set the data and label paths of the train/valid/test parts correctly according to the data path
DATASET:  # DATASET field
  batch_size: 64
  num_workers: 4
  test_batch_size: 1
  test_num_workers: 0
  train:
    format: "SkeletonDataset"                                    # Mandatory, indicates the type of dataset
    file_path: "./applications/PPHuman/datasets/train_data.npy"  # Mandatory, train data index file path
  ...
```
In PaddleVideo, export the trained model to obtain the model structure file `STGCN.pdmodel` and the weight file `STGCN.pdiparams`, and add the corresponding configuration file.
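As a hedged sketch only (assuming PaddleVideo's standard `main.py` training entry and `tools/export_model.py`; the weight and output paths are illustrative):
```bash
# Run from the root of PaddleVideo.
# Train with validation enabled.
python main.py --validate -c applications/PPHuman/configs/stgcn_pphuman.yaml

# Export the trained weights to an inference model (STGCN.pdmodel / STGCN.pdiparams).
python tools/export_model.py \
    -c applications/PPHuman/configs/stgcn_pphuman.yaml \
    -p output/STGCN/STGCN_best.pdparams \
    -o output_inference/STGCN
```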
At this point, this model can be used in PP-Human.
**Note**: If the length of the video sequence or the number of keypoints is changed during training, the content of the `INFERENCE` field in the configuration file needs to be modified accordingly to ensure correct prediction.
```yaml
# The dimension of the sequence data is (N,C,T,V,M)
INFERENCE:
  name: 'STGCN_Inference_helper'
  num_channels: 2   # Corresponding to the C dimension
  window_size: 50   # Corresponding to the T dimension; please set it according to the sequence length.
  vertex_nums: 17   # Corresponding to the V dimension; please set it according to the number of keypoints.