Action Recognition is widely used in intelligent communities, smart cities, and security monitoring. PP-Human provides video-classification-based, detection-based, image-classification-based and skeleton-based action recognition modules.
## Model Zoo
There are multiple available pretrained models including pedestrian detection/tracking, keypoint detection, fighting, calling, smoking and fall detection models. Users can download and use them directly.
| Task | Algorithm | Precision | Inference Speed(ms) | Model Weights | Model Inference and Deployment |
|:----:|:---------:|:---------:|:-------------------:|:-------------:|:------------------------------:|
| Calling Recognition | PP-HGNet | Precision Rate: 86.85 | Single Person 2.94ms | [Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/PPHGNet_tiny_calling_halfbody.pdparams) | [Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/PPHGNet_tiny_calling_halfbody.zip) |
| Smoking Recognition | PP-YOLOE | mAP: 39.7 | Single Person 2.0ms | [Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/ppyoloe_crn_s_80e_smoking_visdrone.pdparams) | [Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/ppyoloe_crn_s_80e_smoking_visdrone.zip) |
| Keypoint Detection | HRNet | AP: 87.1 | Single Person 2.9ms |[Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/dark_hrnet_w32_256x192.pdparams) |[Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/dark_hrnet_w32_256x192.zip) |
| Falling Recognition | ST-GCN | Precision Rate: 96.43 | Single Person 2.7ms | - |[Link](https://bj.bcebos.com/v1/paddledet/models/pipeline/STGCN.zip) |
| Fighting Recognition | PP-TSM | Precision Rate: 89.06% | 128ms for a 2sec video | [Link](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/ppTSM_fight.pdparams) | [Link](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/ppTSM_fight.zip) |
Note:
...
2. The keypoint detection model is trained on [COCO](https://cocodataset.org/), [UAV-Human](https://github.com/SUTDCV/UAV-Human), and some business data, and the precision is obtained on test sets of business data.
3. The falling action recognition model is trained on [NTU-RGB+D](https://rose1.ntu.edu.sg/dataset/actionRecognition/), [UR Fall Detection Dataset](http://fenix.univ.rzeszow.pl/~mkepski/ds/uf.html), and some business data, and the precision is obtained on the testing set of business data.
4. The calling action recognition model is trained and tested on [UAV-Human](https://github.com/SUTDCV/UAV-Human), by using video frames of calling in this dataset.
5. The smoking action recognition model is trained and tested on business data.
6. The fighting action recognition model is trained and tested on 6 public datasets, including Surveillance Camera Fight Dataset, A Dataset for Automatic Violence Detection in Videos, Hockey Fight Detection Dataset, Video Fight Detection Dataset, Real Life Violence Situations Dataset, UBI Abnormal Event Detection Dataset.
7. The inference speed is the speed of using TensorRT FP16 on NVIDIA T4, including the total time of data pre-processing, model inference, and post-processing.
<divalign="center"><imgsrc="../images/action.gif"width='1000'/><center>Data source and copyright owner:Skyinfor
Technology. Thanks for the provision of actual scenario data, which are only
used for academic research here. </center>
</div>
### Description of Configuration
Parameters related to action recognition in the [config file](../../config/infer_cfg_pphuman.yml) are as follows:
```
SKELETON_ACTION: # Config for skeleton-based action recognition model
model_dir: output_inference/STGCN # Path of the model
batch_size: 1 # The size of the inference batch. Currently, only 1 is supported.
max_frames: 50 # The number of frames in an action segment. When the time-ordered keypoint sequence of a pedestrian ID reaches this length, the action recognition model judges the action type. Best results are obtained when this matches the training setting.
display_frames: 80 # The number of display frames. When the inferred action type is falling down, the result stays displayed on that ID for this many frames.
coord_size: [384, 512] # The unified size to which keypoint coordinates are rescaled; best when it matches the training setting.
basemode: "skeletonbased" # The type of pipeline this module relies on; "skeletonbased" means skeleton keypoints are required as input.
enable: False # Whether to enable this function
```
### How to Use
- Download models from the links of the above table and unzip them to ```./output_inference```.
- At present, the action recognition module only supports video input. Set `enable: True` for `SKELETON_ACTION` in infer_cfg_pphuman.yml, and then run the command shown after this list.
- In ```./deploy/pipeline/config/infer_cfg_pphuman.yml```, you can configure different model paths: the keypoint model and the skeleton-based action recognition model correspond to the `KPT` and `SKELETON_ACTION` fields respectively; modify the path of each field to the expected model directory.
- Alternatively, add `--model_dir` in the command line to override the model path.
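A typical invocation is sketched below, assuming the PaddleDetection repository root as the working directory and a local test video named `test_video.mp4` (both are illustrative; adjust the paths to your environment):
```
python deploy/pipeline/pipeline.py --config deploy/pipeline/config/infer_cfg_pphuman.yml \
                                   --video_file=test_video.mp4 \
                                   --device=gpu
```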
### Introduction to the Solution
1. Get the pedestrian detection box and the tracking ID number of the video input through object detection and multi-object tracking. The adopted model is PP-YOLOE, and for details, please refer to [PP-YOLOE](../../../../configs/ppyoloe).
2. Capture every pedestrian in frames of the input video accordingly by using the coordinate of the detection box.
3. In this strategy, we use the [keypoint detection model](../../../configs/keypoint/hrnet/dark_hrnet_w32_256x192.yml) to obtain 17 skeleton keypoints. Their sequences and types are identical to those of COCO. For details, please refer to the `COCO dataset` part of [how to prepare keypoint datasets](../../../docs/tutorials/PrepareKeypointDataSet_en.md).
4. Each target pedestrian with a tracking ID has its own accumulation of skeleton keypoints, which forms a keypoint sequence in time order. When the number of accumulated frames reaches a preset threshold or the tracking is lost, the action recognition model is applied to judge the action type of the time-ordered keypoint sequence. The current model only supports the recognition of falling down, and the relationship between the action type and `class id` is:
```
0: Fall down
1: Others
```
5. The falling action recognition model uses [ST-GCN](https://arxiv.org/abs/1801.07455), and employs the [PaddleVideo](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/zh-CN/model_zoo/recognition/stgcn.md) toolkit for model training.
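To make the per-ID accumulation in step 4 concrete, here is a minimal sketch of the buffering logic. It is not the actual PP-Human implementation; all names (`kpt_buffers`, `update_and_maybe_classify`, `action_model`) are invented for illustration, and only the reach-the-threshold case is shown (the tracking-lost case would flush the buffer the same way).
```python
from collections import defaultdict, deque

MAX_FRAMES = 50  # matches max_frames in the SKELETON_ACTION config above

# per-tracking-ID buffer of time-ordered keypoints; each entry is a (17, 2) array of coordinates
kpt_buffers = defaultdict(lambda: deque(maxlen=MAX_FRAMES))

def update_and_maybe_classify(track_id, keypoints, action_model):
    """Append this frame's keypoints for the ID; run recognition once the buffer is full."""
    buf = kpt_buffers[track_id]
    buf.append(keypoints)
    if len(buf) == MAX_FRAMES:
        sequence = list(buf)               # time-ordered keypoint sequence
        class_id = action_model(sequence)  # 0: Fall down, 1: Others
        buf.clear()                        # start accumulating the next segment
        return class_id
    return None                            # not enough frames yet
```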
<divalign="center"><imgsrc="../images/calling.gif"width='1000'/><center>Data source and copyright owner:Skyinfor
Technology. Thanks for the provision of actual scenario data, which are only
used for academic research here. </center>
</div>
### Description of Configuration
Parameters related to action recognition in the [config file](../../config/infer_cfg_pphuman.yml) are as follows:
```
ID_BASED_CLSACTION: # Config for classification-based action recognition model
model_dir: output_inference/PPHGNet_tiny_calling_halfbody # Path of the model
batch_size: 8 # The size of the inference batch
basemode: "idbased" # the models which is based on, whether the id of each object obtained by tracking is needed.
threshold: 0.45 # Threshold for corresponding behavior
display_frames: 80 # The number of display frames. When the corresponding action is detected, the time length of the act will be displayed in the ID.
enable: False # Whether to enable this function
```
### How to Use
1. Download models from the links of the above table and unzip them to ```./output_inference```.
2. At present, the action recognition module only supports video input. Set `enable: True` for `ID_BASED_CLSACTION` in infer_cfg_pphuman.yml.
### Introduction to the Solution
1. Get the pedestrian detection box and the tracking ID number of the video input through object detection and multi-object tracking. The adopted model is PP-YOLOE, and for details, please refer to [PP-YOLOE](../../../../configs/ppyoloe).
2. Capture every pedestrian in frames of the input video accordingly by using the coordinate of the detection box.
3. Pedestrian images are classified at the frame level; when an image is classified as the corresponding behavior, the person is considered to be in that behavior state for a certain period of time. This task is implemented with [PP-HGNet](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/zh_CN/models/PP-HGNet.md). In the current version, the behavior of calling is supported, and the relationship between the action type and `class id` is listed below.
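By analogy with the falling model above, the mapping is expected to be (assumed here, as it is not stated explicitly):
```
0: Calling
1: Others
```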
<divalign="center"><imgsrc="../images/smoking.gif"width='1000'/><center>Data source and copyright owner:Skyinfor
Technology. Thanks for the provision of actual scenario data, which are only
used for academic research here. </center>
</div>
### Description of Configuration
Parameters related to action recognition in the [config file](../../config/infer_cfg_pphuman.yml) are as follows:
```
ID_BASED_DETACTION: # Config for detection-based action recognition model
model_dir: output_inference/ppyoloe_crn_s_80e_smoking_visdrone # Path of the model
batch_size: 8 # The size of the inference batch
basemode: "idbased" # The models which is based on, whether the id of each object obtained by tracking is needed.
threshold: 0.4 # Threshold for corresponding behavior.
display_frames: 80 # The number of display frames. When the corresponding action is detected, the time length of the act will be displayed in the ID.
enable: False # Whether to enable this function
```
### How to Use
1. Download models from the links of the above table and unzip them to ```./output_inference```.
2. At present, the action recognition module only supports video input. Set `enable: True` for `ID_BASED_DETACTION` in infer_cfg_pphuman.yml.
### Introduction to the Solution
1. Get the pedestrian detection box and the tracking ID number of the video input through object detection and multi-object tracking. The adopted model is PP-YOLOE, and for details, please refer to [PP-YOLOE](../../../../configs/ppyoloe).
2. Capture every pedestrian in frames of the input video accordingly by using the coordinate of the detection box.
3. We detect the typical target of this behavior in frame-level pedestrian images. When the specific target (here, a cigarette) is detected, the person is considered to be in that behavior state for a certain period of time. This task is implemented by [PP-YOLOE](../../../../configs/ppyoloe/). In the current version, the behavior of smoking is supported, and the relationship between the action type and `class id` is listed below.
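Again by analogy with the falling model above, the mapping is expected to be (assumed here, as it is not stated explicitly):
```
0: Smoking
1: Others
```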
## Video-Classification-Based Action Recognition -- Fighting Detection
As surveillance cameras are deployed more and more widely, manually checking whether abnormal behaviors such as fighting occur is time-consuming, labor-intensive, and inefficient, and AI can assist security monitoring here. A fighting recognition module is integrated into PP-Human to identify whether there is fighting in the video. We provide pre-trained models that users can download and use directly.
| Task | Model | Acc. | Speed(ms) | Weight | Deploy Model |
|:----:|:-----:|:----:|:---------:|:------:|:------------:|
| Fighting Detection | PP-TSM | 89.06% | 128ms for a 2-sec video| [Link](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/ppTSM_fight.pdparams) | [Link](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/ppTSM_fight.zip) |
The model is trained with 6 public datasets, including Surveillance Camera Fight Dataset, A Dataset for Automatic Violence Detection in Videos, Hockey Fight Detection Dataset, Video Fight Detection Dataset, Real Life Violence Situations Dataset, and UBI Abnormal Event Detection Dataset.
This module focuses on identifying fighting behavior under surveillance cameras. Fighting involves multiple people, whereas skeleton-based methods are better suited to single-person action recognition. In addition, fighting depends strongly on temporal information, so detection-based and image-classification-based schemes are not suitable. Because of the complex backgrounds of surveillance scenes, crowd density, lighting, and filming angle may all affect accuracy. This solution therefore uses a video-classification-based method to determine whether there is fighting in the video.
For cases where the camera is far away from the person, the solution is optimized by increasing the resolution of the input frames. Due to limited training data, data augmentation is used to improve the generalization performance of the model.
### Description of Configuration
Parameters related to action recognition in the [config file](../../config/infer_cfg_pphuman.yml) are as follows:
```
VIDEO_ACTION: # Config for video-classification-based action recognition model
model_dir: output_inference/ppTSM # Path of the model
batch_size: 1 # The size of the inference batch. Currently, only 1 is supported.
frame_len: 8 # Accumulate the number of sampling frames. Inference will be executed when sampled frames reached this value.
sample_freq: 7 # Sampling interval: one frame is sampled every `sample_freq` frames.
short_size: 340 # The shortest length for video frame scaling transforms.
target_size: 320 # Target size for input video
basemode: "videobased" # The models which is based on, whether to use video as model input.
enable: False # Whether to enable this function
```
### How to Use
1. Download models from the links of the above table and unzip them to ```./output_inference```.
2. Modify the file names in the `ppTSM` folder to `model.pdiparams, model.pdiparams.info and model.pdmodel`;
3. At present, the action recognition module only supports video input. Set `enable: True` for `VIDEO_ACTION` in infer_cfg_pphuman.yml.
Data source and copyright owner: Surveillance Camera Fight Dataset.
### Introduction to the Solution
The current fighting recognition model uses [PP-TSM](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/zh-CN/model_zoo/recognition/pp-tsm.md), adapted and fine-tuned for this task. For the input video or video stream, frames are extracted at a certain interval; when the number of accumulated frames reaches the specified value, they are fed into the video classification model to determine whether fighting occurs. A sketch of this sampling-and-accumulation logic follows.
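A minimal sketch of the sampling-and-accumulation logic described above, using the `sample_freq` and `frame_len` values from the `VIDEO_ACTION` config. The function and variable names are invented for illustration and this is not the actual PP-Human implementation:
```python
SAMPLE_FREQ = 7  # one frame is sampled every 7 frames (sample_freq)
FRAME_LEN = 8    # the classifier runs once 8 sampled frames are accumulated (frame_len)

def process_stream(frames, video_classifier):
    """Yield a fight / no-fight decision each time enough sampled frames are collected."""
    sampled = []
    for idx, frame in enumerate(frames):
        if idx % SAMPLE_FREQ != 0:
            continue                         # skip frames between sampling points
        sampled.append(frame)
        if len(sampled) == FRAME_LEN:
            yield video_classifier(sampled)  # e.g. returns "fight" or "no fight"
            sampled = []                     # start accumulating the next clip
```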
## Custom Training
The pretrained models above can be used directly, including pedestrian detection/tracking, keypoint detection, and falling, calling, smoking and fighting recognition. If users need to train a custom action model or optimize model performance, please refer to the link below.