[cherry-pick] add en doc for pphuman (#5788)

* add_en_doc_for_pphuman, test=document_fix * update link, test=document_fix * add attr dataset link, test=document_fix * correct link, test=document_fix

002da945 · wangguanzhong · GitHub · 05c33ec8 · 002da945 · 002da945
8 changed files
--- a/deploy/pphuman/README_en.md
+++ b/deploy/pphuman/README_en.md
@@ -149,19 +149,19 @@ The overall solution of PP-Human is as follows:

 ### 1. Object Detection
 - Use PP-YOLOE L as the model of object detection
- For details, please refer to [PP-YOLOE](../../configs/ppyoloe/) and [Detection and Tracking](docs/mot.md)
+- For details, please refer to [PP-YOLOE](../../configs/ppyoloe/) and [Detection and Tracking](docs/mot_en.md)

 ### 2. Multi-Object Tracking
 - Conduct multi-object tracking with the SDE solution
 - Use PP-YOLOE L as the detection model
 - Use the Bytetrack solution to track modules
- For details, refer to [Bytetrack](configs/mot/bytetrack) and [Detection and Tracking](docs/mot.md)
+- For details, refer to [Bytetrack](configs/mot/bytetrack) and [Detection and Tracking](docs/mot_en.md)

 ### 3. Multi-Camera Tracking
 - Use PP-YOLOE + Bytetrack to obtain the tracks of single-camera multi-object tracking
 - Use ReID（centroid network）to extract features of the detection result of each frame
 - Match the features of multi-camera tracks to get the cross-camera tracking result
- For details, please refer to [Multi-Camera Tracking](docs/mtmct.md)
+- For details, please refer to [Multi-Camera Tracking](docs/mtmct_en.md)

 ### 4. Attribute Recognition
 - Use PP-YOLOE + Bytetrack to track humans
@@ -172,4 +172,4 @@ The overall solution of PP-Human is as follows:
 - Use PP-YOLOE + Bytetrack to track humans
 - Use HRNet for keypoint detection and get the information of the 17 key points in the human body
 - According to the changes of the key points of the same person within 50 frames, judge whether the action made by the person within 50 frames is a fall with the help of ST-GCN
- For details, please refer to [Action Recognition](docs/action.md)
+- For details, please refer to [Action Recognition](docs/action_en.md)
--- a/deploy/pphuman/docs/action.md
+++ b/deploy/pphuman/docs/action.md
@@ -56,7 +56,7 @@ python deploy/pphuman/pipeline.py --config deploy/pphuman/config/infer_cfg.yml \
 ```

 ## 方案说明
-1. 使用目标检测与多目标跟踪获取视频输入中的行人检测框及跟踪ID序号，模型方案为PP-YOLOE，详细文档参考[PP-YOLOE](../../../configs/ppyoloe)。
+1. 使用目标检测与多目标跟踪获取视频输入中的行人检测框及跟踪ID序号，模型方案为PP-YOLOE，详细文档参考[PP-YOLOE](../../../configs/ppyoloe/README_cn.md)。
 2. 通过行人检测框的坐标在输入视频的对应帧中截取每个行人，并使用[关键点识别模型](../../../configs/keypoint/hrnet/dark_hrnet_w32_256x192.yml)得到对应的17个骨骼特征点。骨骼特征点的顺序及类型与COCO一致，详见[如何准备关键点数据集](../../../docs/tutorials/PrepareKeypointDataSet_cn.md)中的`COCO数据集`部分。
 3. 每个跟踪ID对应的目标行人各自累计骨骼特征点结果，组成该人物的时序关键点序列。当累计到预定帧数或跟踪丢失后，使用行为识别模型判断时序关键点序列的动作类型。当前版本模型支持摔倒行为的识别，预测得到的`class id`对应关系为：
 ```

--- a/deploy/pphuman/docs/action_en.md
+++ b/deploy/pphuman/docs/action_en.md
+# Action Recognition Module of PP-Human
+
+Action recognition is widely used in intelligent communities, smart cities, and security monitoring. PP-Human provides a skeleton-based action recognition module.
+
+<div align="center">  <img src="./images/action.gif" width='1000'/> <center>Data source and copyright owner：Skyinfor
+Technology. Thanks for the provision of actual scenario data, which are only
+used for academic research here. </center>
+
+</div>
+
+## Model Zoo
+
+There are multiple available pretrained models including pedestrian detection/tracking, keypoint detection, and fall detection models. Users can download and use them directly.
+
+| Task                          | Algorithm | Precision                 | Inference Speed(ms)                 | Download Link                                                                             |
+|:----------------------------- |:---------:|:-------------------------:|:-----------------------------------:|:-----------------------------------------------------------------------------------------:|
+| Pedestrian Detection/Tracking | PP-YOLOE  | mAP: 56.3 <br> MOTA: 72.0 | Detection: 28ms <br>Tracking：33.1ms | [Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/mot_ppyoloe_l_36e_pipeline.zip) |
+| Keypoint Detection            | HRNet     | AP: 87.1                  | Single Person 2.9ms                 | [Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/dark_hrnet_w32_256x192.zip)     |
+| Action Recognition            | ST-GCN    | Precision Rate: 96.43     | Single Person 2.7ms                 | [Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/STGCN.zip)                      |
+
+Note:
+
+1. The precision of the pedestrian detection/tracking model is obtained by training and testing on [MOT17](https://motchallenge.net/), [CrowdHuman](http://www.crowdhuman.org/), [HIEVE](http://humaninevents.org/), and some business data.
+
+2. The keypoint detection model is trained on [COCO](https://cocodataset.org/), [UAV-Human](https://github.com/SUTDCV/UAV-Human), and some business data, and the precision is obtained on test sets of business data.
+
+3. The action recognition model is trained on [NTU-RGB+D](https://rose1.ntu.edu.sg/dataset/actionRecognition/), [UR Fall Detection Dataset](http://fenix.univ.rzeszow.pl/~mkepski/ds/uf.html), and some business data, and the precision is obtained on the testing set of business data.
+
+4. The inference speed is measured with TensorRT FP16 on an NVIDIA T4 and includes the total time of data preprocessing, model inference, and post-processing.
+
+## Description of Configuration
+
+The parameters related to action recognition in the [config file](../config/infer_cfg.yml) are as follows:
+
+```
+ACTION:
+  model_dir: output_inference/STGCN  # Path of the model
+  batch_size: 1 # Inference batch size. Only 1 is currently supported.
+  max_frames: 50 # Number of frames per action segment. When the time-ordered skeleton keypoints accumulated for a pedestrian ID reach this value, the action recognition model judges the action type. Results are best when this matches the training setting.
+  display_frames: 80 # Number of display frames. When a fall is inferred, the result stays displayed on the ID for this many frames.
+  coord_size: [384, 512] # Unified coordinate size. Results are best when this matches the training setting.
+```
+
+
+
+
+## How to Use
+
+- Download the models from the links in the table above and unzip them to ```./output_inference```.
+
+- The action recognition module currently supports video input only. The start command is:
+
+  ```python
+  python deploy/pphuman/pipeline.py --config deploy/pphuman/config/infer_cfg.yml \
+                                                     --video_file=test_video.mp4 \
+                                                     --device=gpu \
+                                                     --enable_action=True
+  ```
+
+- There are two ways to modify the model path:
+
+  - In ```./deploy/pphuman/config/infer_cfg.yml```, you can configure different model paths. The keypoint model and the action recognition model correspond to the `KPT` and `ACTION` fields respectively; change the path in each field to the expected path.
+
+  - Add `--model_dir` in the command line to override the model path:
+
+    ```python
+    python deploy/pphuman/pipeline.py --config deploy/pphuman/config/infer_cfg.yml \
+                                                       --video_file=test_video.mp4 \
+                                                       --device=gpu \
+                                                       --enable_action=True \
+                                                       --model_dir kpt=./dark_hrnet_w32_256x192 action=./STGCN
+    ```
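
The `key=path` form of `--model_dir` can be illustrated with a small sketch. This is not the parser used by `pipeline.py`; the helper names `parse_model_dir` and `apply_overrides`, the upper-casing of keys to config field names, and the default paths in `config` are assumptions made for illustration only.

```python
import argparse


def parse_model_dir(pairs):
    """Turn ["kpt=./a", "action=./b"] into {"KPT": "./a", "ACTION": "./b"}."""
    overrides = {}
    for pair in pairs or []:
        key, sep, path = pair.partition("=")
        if not sep or not key or not path:
            raise ValueError(f"expected key=path, got {pair!r}")
        # Assumption: the key maps to the upper-case config field name.
        overrides[key.upper()] = path
    return overrides


def apply_overrides(config, overrides):
    """Return a copy of config with model_dir replaced for each overridden field."""
    merged = {name: dict(section) for name, section in config.items()}
    for name, path in overrides.items():
        if name not in merged:
            raise KeyError(f"unknown config field: {name}")
        merged[name]["model_dir"] = path
    return merged


parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", nargs="*", default=[])
args = parser.parse_args(
    ["--model_dir", "kpt=./dark_hrnet_w32_256x192", "action=./STGCN"]
)

# Placeholder config; the real values live in infer_cfg.yml.
config = {
    "KPT": {"model_dir": "output_inference/dark_hrnet_w32_256x192"},
    "ACTION": {"model_dir": "output_inference/STGCN"},
}
merged = apply_overrides(config, parse_model_dir(args.model_dir))
print(merged["ACTION"]["model_dir"])
```

Returning a merged copy rather than mutating `config` keeps the values read from the file available for comparison or logging.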
+
+## Introduction to the Solution
+
+1. Get the pedestrian detection box and the tracking ID number of the video input through object detection and multi-object tracking. The adopted model is PP-YOLOE, and for details, please refer to [PP-YOLOE](../../../configs/ppyoloe).
+
+2. Capture every pedestrian in frames of the input video accordingly by using the coordinate of the detection box, and employ the [keypoint detection model](../../../configs/keypoint/hrnet/dark_hrnet_w32_256x192.yml)
+   to obtain 17 skeleton keypoints. Their sequences and types are identical to
+   those of COCO. For details, please refer to the `COCO dataset` part of [how to
+   prepare keypoint datasets](../../../docs/tutorials/PrepareKeypointDataSet_en.md).
+
+3. Each target pedestrian with a tracking ID accumulates its own skeleton keypoints, which form a keypoint sequence in time order. When the number of accumulated frames reaches a preset threshold or the tracking is lost, the action recognition model is applied to judge the action type of the time-ordered keypoint sequence. The current model only supports the recognition of falling down, and the relationship between the action type and `class id` is:
+
+```
+0: Fall down
+
+1: Others
+```
+
+4. The action recognition model uses [ST-GCN](https://arxiv.org/abs/1801.07455), and employs the [PaddleVideo](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/zh-CN/model_zoo/recognition/stgcn.md) toolkit to complete model training.
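
The accumulation logic in step 3 can be sketched as follows. This is a minimal illustration, not the PP-Human implementation: the class `KeypointSequenceBuffer`, its methods, and the stub `classify` function are hypothetical names, and the stub always returns class id 1 instead of running ST-GCN.

```python
from collections import defaultdict

# Mapping taken from the class id table above.
CLASS_NAMES = {0: "Fall down", 1: "Others"}


class KeypointSequenceBuffer:
    """Accumulates per-track keypoints and releases full sequences."""

    def __init__(self, max_frames=50):
        self.max_frames = max_frames
        self._frames = defaultdict(list)

    def update(self, track_id, keypoints):
        """Add one frame; return the sequence once max_frames are collected."""
        frames = self._frames[track_id]
        frames.append(keypoints)
        if len(frames) >= self.max_frames:
            return self._frames.pop(track_id)
        return None

    def flush_lost(self, active_ids):
        """Return and drop the sequences of tracks that are no longer active."""
        lost = [tid for tid in self._frames if tid not in active_ids]
        return {tid: self._frames.pop(tid) for tid in lost}


def classify(sequence):
    """Hypothetical stand-in for the action model: always answers class id 1."""
    return 1


buffer = KeypointSequenceBuffer(max_frames=3)  # small value for the demo
results = {}
for frame_idx in range(3):
    keypoints = [(0.0, float(frame_idx))] * 17  # 17 dummy (x, y) points
    sequence = buffer.update(track_id=7, keypoints=keypoints)
    if sequence is not None:
        results[7] = CLASS_NAMES[classify(sequence)]

# Track 8 disappears after one frame, so its partial sequence is flushed.
buffer.update(track_id=8, keypoints=[(0.0, 0.0)] * 17)
lost = buffer.flush_lost(active_ids={7})
```

In this sketch a full sequence is removed from the buffer as soon as it is returned, so each track starts a fresh segment afterwards; whether segments should overlap instead is a design choice the source does not specify.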
+
+
+## Custom Action Training
+
+Pretrained models for pedestrian detection/tracking, keypoint detection, and fall recognition are provided and can be used directly. To train a custom action or optimize model performance, please refer to the links below.
+
+| Task | Model | Training and Export doc |
+| ---- | ---- | -------- |
+| pedestrian detection/tracking | PP-YOLOE | [doc](../../../configs/ppyoloe/README.md#getting-start) |
+| keypoint detection | HRNet | [doc](../../../configs/keypoint/README_en.md#3training-and-testing) |
+| action recognition |  ST-GCN  | [doc](https://github.com/PaddlePaddle/PaddleVideo/tree/develop/applications/PPHuman) |
+
+
+## Reference
+
+```
+@inproceedings{stgcn2018aaai,
+  title     = {Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition},
+  author    = {Sijie Yan and Yuanjun Xiong and Dahua Lin},
+  booktitle = {AAAI},
+  year      = {2018},
+}
+```
--- a/deploy/pphuman/docs/attribute.md
+++ b/deploy/pphuman/docs/attribute.md
@@ -9,8 +9,8 @@
 | 行人检测/跟踪    |  PP-YOLOE | mAP: 56.3 <br> MOTA: 72.0 | 检测: 28ms <br> 跟踪：33.1ms | [下载链接](https://bj.bcebos.com/v1/paddledet/models/pipeline/mot_ppyoloe_l_36e_pipeline.zip) |
 | 行人属性分析    |  StrongBaseline  |  mA: 94.86  | 单人 2ms | [下载链接](https://bj.bcebos.com/v1/paddledet/models/pipeline/strongbaseline_r50_30e_pa100k.zip) |

-1. 检测/跟踪模型精度为MOT17，CrowdHuman，HIEVE和部分业务数据融合训练测试得到
-2. 行人属性分析精度为PA100k，RAPv2，PETA和部分业务数据融合训练测试得到
+1. 检测/跟踪模型精度为[MOT17](https://motchallenge.net/)，[CrowdHuman](http://www.crowdhuman.org/)，[HIEVE](http://humaninevents.org/)和部分业务数据融合训练测试得到
+2. 行人属性分析精度为[PA100k](https://github.com/xh-liu/HydraPlus-Net#pa-100k-dataset)，[RAPv2](http://www.rapdataset.com/rapv2.html)，[PETA](http://mmlab.ie.cuhk.edu.hk/projects/PETA.html)和部分业务数据融合训练测试得到
 3. 预测速度为T4 机器上使用TensorRT FP16时的速度, 速度包含数据预处理、模型预测、后处理全流程

 ## 使用方法
@@ -52,7 +52,7 @@ python deploy/pphuman/pipeline.py --config deploy/pphuman/config/infer_cfg.yml \

 ## 方案说明

-1. 目标检测/多目标跟踪获取图片/视频输入中的行人检测框，模型方案为PP-YOLOE，详细文档参考[PP-YOLOE](../../../configs/ppyoloe)
+1. 目标检测/多目标跟踪获取图片/视频输入中的行人检测框，模型方案为PP-YOLOE，详细文档参考[PP-YOLOE](../../../configs/ppyoloe/README_cn.md)
 2. 通过行人检测框的坐标在输入图像中截取每个行人
 3. 使用属性识别分析每个行人对应属性，属性类型与PA100k数据集相同，具体属性列表如下：
 ```

--- a/deploy/pphuman/docs/attribute_en.md
+++ b/deploy/pphuman/docs/attribute_en.md
@@ -9,8 +9,8 @@ Pedestrian attribute recognition has been widely used in the intelligent communi
 | Pedestrian Detection/ Tracking    |  PP-YOLOE | mAP: 56.3 <br> MOTA: 72.0 | Detection: 28ms <br> Tracking：33.1ms | [Download Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/mot_ppyoloe_l_36e_pipeline.zip) |
 | Pedestrian Attribute Analysis   |  StrongBaseline  |  ma: 94.86  | Per Person 2ms | [Download Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/strongbaseline_r50_30e_pa100k.tar) |

-1. The precision of detection/ tracking models is obtained by training and testing on the dataset consist of MOT17, CrowdHuman, HIEVE, and some business data.
-2. The precision of pedestiran attribute analysis is obtained by training and testing on the dataset consist of PA100k, RAPv2, PETA, and some business data.
+1. The precision of detection/tracking models is obtained by training and testing on a dataset consisting of [MOT17](https://motchallenge.net/), [CrowdHuman](http://www.crowdhuman.org/), [HIEVE](http://humaninevents.org/), and some business data.
+2. The precision of pedestrian attribute analysis is obtained by training and testing on a dataset consisting of [PA100k](https://github.com/xh-liu/HydraPlus-Net#pa-100k-dataset), [RAPv2](http://www.rapdataset.com/rapv2.html), [PETA](http://mmlab.ie.cuhk.edu.hk/projects/PETA.html), and some business data.
 3. The inference speed is T4, the speed of using TensorRT FP16.

 ## Instruction

--- a/deploy/pphuman/docs/mot.md
+++ b/deploy/pphuman/docs/mot.md
@@ -51,7 +51,7 @@ python deploy/pphuman/pipeline.py --config deploy/pphuman/config/infer_cfg.yml \

 ## 方案说明

-1. 目标检测/多目标跟踪获取图片/视频输入中的行人检测框，模型方案为PP-YOLOE，详细文档参考[PP-YOLOE](../../../configs/ppyoloe)
+1. 目标检测/多目标跟踪获取图片/视频输入中的行人检测框，模型方案为PP-YOLOE，详细文档参考[PP-YOLOE](../../../configs/ppyoloe/README_cn.md)
 2. 多目标跟踪模型方案基于[ByteTrack](https://arxiv.org/pdf/2110.06864.pdf)，采用PP-YOLOE替换原文的YOLOX作为检测器，采用BYTETracker作为跟踪器。

 ## 参考文献

--- a/deploy/pphuman/docs/mot_en.md
+++ b/deploy/pphuman/docs/mot_en.md
+# Detection and Tracking Module of PP-Human
+
+Pedestrian detection and tracking is widely used in intelligent communities, industrial inspection, transportation monitoring, and so on. PP-Human includes a detection and tracking module, which is fundamental to keypoint detection, attribute recognition, action recognition, etc. Pretrained models are available for download here.
+
+| Task                 | Algorithm | Precision | Inference Speed(ms) | Download Link                                                                               |
+|:---------------------|:---------:|:------:|:------:| :---------------------------------------------------------------------------------: |
+| Pedestrian Detection/ Tracking    |  PP-YOLOE | mAP: 56.3 <br> MOTA: 72.0 | Detection: 28ms <br> Tracking：33.1ms | [Download Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/mot_ppyoloe_l_36e_pipeline.zip) |
+
+1. The precision of the pedestrian detection/tracking model is obtained by training and testing on [MOT17](https://motchallenge.net/), [CrowdHuman](http://www.crowdhuman.org/), [HIEVE](http://humaninevents.org/), and some business data.
+2. The inference speed is measured with TensorRT FP16 on T4 and includes the total time of data preprocessing, model inference, and post-processing.
+
+## How to Use
+
+1. Download the models from the links in the table above and unzip them to ```./output_inference```.
+2. For image input, the start command is as follows:
+```python
+python deploy/pphuman/pipeline.py --config deploy/pphuman/config/infer_cfg.yml \
+                                                   --image_file=test_image.jpg \
+                                                   --device=gpu
+```
+3. For video input, the start command is as follows:
+```python
+python deploy/pphuman/pipeline.py --config deploy/pphuman/config/infer_cfg.yml \
+                                                   --video_file=test_video.mp4 \
+                                                   --device=gpu
+```
+4. There are two ways to modify the model path:
+
+    - In `./deploy/pphuman/config/infer_cfg.yml`, you can configure different model paths. The detection model and the tracking model correspond to the `DET` and `MOT` fields respectively; change the path in each field to the expected path.
+    - Add `--model_dir` in the command line to override the model path:
+
+```python
+python deploy/pphuman/pipeline.py --config deploy/pphuman/config/infer_cfg.yml \
+                                                   --video_file=test_video.mp4 \
+                                                   --device=gpu \
+                                                   --model_dir det=ppyoloe/ \
+                                                   --do_entrance_counting \
+                                                   --draw_center_traj
+
+```
+**Note:**
+
+ - `--do_entrance_counting` controls whether to count the flow at the entrance; the default is False.
+ - `--draw_center_traj` controls whether to draw the track; the default is False. Note that the test video for track drawing should be filmed by a still camera.
+
+The test result is:
+
+<div width="1000" align="center">
+  <img src="./images/mot.gif"/>
+</div>
+
+Data source and copyright owner: Skyinfor Technology. Thanks for the provision of actual scenario data, which are only used for academic research here.
+
+
+## Introduction to the Solution
+
+1. Get the pedestrian detection box of the image/video input through object detection and multi-object tracking. The adopted model is PP-YOLOE, and for details, please refer to [PP-YOLOE](../../../configs/ppyoloe).
+
+2. The multi-object tracking solution is based on [ByteTrack](https://arxiv.org/pdf/2110.06864.pdf). It replaces the original YOLOX with PP-YOLOE as the detector and uses BYTETracker as the tracker.
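
The core idea of ByteTrack, associating high-score boxes first and then giving low-score boxes a second chance against the still-unmatched tracks, can be sketched as below. This is a simplified illustration under stated assumptions: it uses greedy IoU matching in place of the Kalman-filter prediction and Hungarian assignment of the real tracker, and the function names and threshold values are illustrative, not those of BYTETracker.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, boxes, iou_thresh):
    """Greedily pair track ids with box indices by descending IoU."""
    pairs = sorted(
        ((iou(tbox, box), tid, j)
         for tid, tbox in tracks.items()
         for j, box in enumerate(boxes)),
        reverse=True,
    )
    matches, used_tracks, used_boxes = {}, set(), set()
    for score, tid, j in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap even less
        if tid in used_tracks or j in used_boxes:
            continue
        matches[tid] = j
        used_tracks.add(tid)
        used_boxes.add(j)
    return matches


def associate(tracks, detections, high_thresh=0.6, low_thresh=0.1, iou_thresh=0.3):
    """Two-stage association: high-score boxes first, low-score boxes second."""
    high = [box for box, score in detections if score >= high_thresh]
    low = [box for box, score in detections if low_thresh <= score < high_thresh]

    # Stage 1: match all tracks against confident detections.
    first = greedy_match(tracks, high, iou_thresh)
    assigned = {tid: high[j] for tid, j in first.items()}

    # Stage 2: match leftover tracks against low-score detections.
    remaining = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(remaining, low, iou_thresh)
    assigned.update({tid: low[j] for tid, j in second.items()})
    return assigned


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [((1, 1, 11, 11), 0.9), ((21, 21, 31, 31), 0.2)]
assigned = associate(tracks, detections)
```

In the example, track 2 is kept alive by a detection scoring only 0.2, which a single-threshold tracker would have discarded; that second stage is what distinguishes this scheme.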
+
+## Reference
+```
+@article{zhang2021bytetrack,
+  title={ByteTrack: Multi-Object Tracking by Associating Every Detection Box},
+  author={Zhang, Yifu and Sun, Peize and Jiang, Yi and Yu, Dongdong and Yuan, Zehuan and Luo, Ping and Liu, Wenyu and Wang, Xinggang},
+  journal={arXiv preprint arXiv:2110.06864},
+  year={2021}
+}
+```
--- a/deploy/pphuman/docs/mtmct_en.md
+++ b/deploy/pphuman/docs/mtmct_en.md
+# Multi-Target Multi-Camera Tracking Module of PP-Human
+
+Multi-target multi-camera tracking, or MTMCT, matches the identity of a person across different cameras based on single-camera tracking. MTMCT is usually applied to security systems and smart retail.
+The MTMCT module of PP-Human aims to provide a simple and efficient multi-target multi-camera pipeline.
+
+## How to Use
+
+1. Download [REID model](https://bj.bcebos.com/v1/paddledet/models/pipeline/reid_model.zip) and unzip it to ```./output_inference```. For the MOT model, please refer to [mot description](./mot.md).
+
+2. In the MTMCT mode, input videos are required to be put in the same directory. The command line is:
+```python
+python3 deploy/pphuman/pipeline.py --config deploy/pphuman/config/infer_cfg.yml --video_dir=[your_video_file_directory] --device=gpu
+```
+
+3. Configuration can be modified in `./deploy/pphuman/config/infer_cfg.yml`, or the model path can be overridden with `--model_dir` on the command line:
+
+```python
+python3 deploy/pphuman/pipeline.py \
+        --config deploy/pphuman/config/infer_cfg.yml \
+        --video_dir=[your_video_file_directory] \
+        --device=gpu \
+        --model_dir reid=reid_best/
+```
+
+## Introduction to the Solution
+
+The MTMCT module consists of the multi-target multi-camera tracking pipeline and the REID model.
+
+1. Multi-Target Multi-Camera Tracking Pipeline
+
+```
+
+single-camera tracking[id+bbox]
+        │
+capture the target in the original image according to bbox——│
+        │            │
+    REID model      quality assessment (covered or not, complete or not, brightness, etc.)
+        │            │
+    [feature]        [quality]
+        │            │
+   datacollector—————│
+        │
+      sort out and filter features
+        │
+ calculate the similarity of IDs in the videos
+        │
+  make the IDs cluster together and rearrange them
+```
+
+2. The model solution is [reid-centroids](https://github.com/mikwieczorek/centroids-reid), with ResNet50 as the backbone. It is worth noting that the solution employs different features of the same ID to enhance the similarity.
+
+Based on this, the REID model used in MTMCT is trained on a combination of open-source datasets and compresses the model features to 128 dimensions to improve generalization. In practice, this makes the model generalize much better.
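
The last pipeline steps (computing the similarity of IDs across videos and merging them) can be illustrated with a small sketch. It is an assumption-laden simplification: it averages the features of each track into a centroid and uses nearest-centroid cosine matching with an illustrative threshold of 0.8, in place of the clustering step in the diagram. The function names are hypothetical, and the 2-dimensional vectors stand in for the 128-dimensional REID features.

```python
import math


def mean_feature(features):
    """Average a list of equal-length feature vectors."""
    n = len(features)
    return [sum(values) / n for values in zip(*features)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_cameras(cam_a, cam_b, threshold=0.8):
    """Map each track id in cam_a to its most similar track id in cam_b."""
    centroids_b = {tid: mean_feature(f) for tid, f in cam_b.items()}
    mapping = {}
    for tid_a, feats in cam_a.items():
        centroid_a = mean_feature(feats)
        best_id, best_sim = None, threshold
        for tid_b, centroid_b in centroids_b.items():
            sim = cosine(centroid_a, centroid_b)
            if sim >= best_sim:
                best_id, best_sim = tid_b, sim
        if best_id is not None:
            mapping[tid_a] = best_id
    return mapping


# Per-camera track id -> list of per-frame feature vectors (toy data).
cam_a = {1: [[1.0, 0.0], [0.9, 0.1]], 2: [[0.0, 1.0]]}
cam_b = {10: [[0.0, 1.0], [0.1, 0.9]], 11: [[1.0, 0.1]]}
mapping = match_cameras(cam_a, cam_b)
```

Note that this nearest-centroid matching does not enforce a one-to-one mapping, so two IDs in one camera could map to the same ID in the other; a clustering or assignment step would be needed to prevent that.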
+
+### Other Suggestions
+
+- The provided REID model is trained on open-source datasets. It is recommended to add your own data to get a more powerful REID model, which can notably improve the MTMCT effect.
+- The quality assessment is based on simple logic plus OpenCV, so its effect is limited. If possible, it is advisable to train a dedicated quality assessment model.
+
+
+### Example
+
+- camera 1:
+<div width="1080" align="center">
+  <img src="./images/c1.gif"/>
+</div>
+
+- camera 2:
+<div width="1080" align="center">
+  <img src="./images/c2.gif"/>
+</div>
+
+
+## Reference
+```
+@article{Wieczorek2021OnTU,
+  title={On the Unreasonable Effectiveness of Centroids in Image Retrieval},
+  author={Mikolaj Wieczorek and Barbara Rychalska and Jacek Dabrowski},
+  journal={ArXiv},
+  year={2021},
+  volume={abs/2104.13643}
+}
+```