Add fleet tipc doc (#5492)

* update * add test_train_fleet_inference_python.md * add test_train_fleet_inference_python.md * Update README.md * Update test_train_fleet_inference_python.md * update test_train_fleet_inference_python.md * update test_train_fleet_inference_python.md * update test_train_fleet_inference_python.md * update test_train_fleet_infer_python.md * Update README.md * add test_train_fleet_infer_python.md * Update test_train_fleet_infer_python.md * update test_train_fleet_infer_python.md * update test_train_fleet_infer_python.md * Update test_train_fleet_infer_python.md * upate fleet test config

b09ddb48 · Bin Lu · GitHub · cef864d5 · b09ddb48 · b09ddb48
6 changed files
--- a/tutorials/mobilenetv3_prod/Step6/test_tipc/README.md
+++ b/tutorials/mobilenetv3_prod/Step6/test_tipc/README.md
@@ -54,7 +54,7 @@ test_tipc
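    - [Linux GPU/CPU basic training and inference test](docs/test_train_inference_python.md)

 - More training-mode tests (coming soon):
-    - [Linux GPU/CPU multi-machine multi-GPU training and inference test]
+    - [Linux GPU/CPU multi-machine multi-GPU training and inference test](docs/test_train_fleet_inference_python.md)
    - [Linux GPU/CPU mixed-precision training and inference test](docs/test_train_amp_inference_python.md)

 - More deployment-mode tests (coming soon):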

--- a/tutorials/mobilenetv3_prod/Step6/test_tipc/configs/mobilenet_v3_small/train_fleet_infer_python.txt
+++ b/tutorials/mobilenetv3_prod/Step6/test_tipc/configs/mobilenet_v3_small/train_fleet_infer_python.txt
+===========================train_params===========================
+model_name:mobilenet_v3_small
+python:python3.7
+gpu_list:192.168.0.1,192.168.0.2;0,1
+use-gpu:True
+--epochs:lite_train_lite_infer=5|whole_train_whole_infer=90
+--output-dir:./output/
+--batch-size:lite_train_lite_infer=4|whole_train_whole_infer=128
+--pretrained:null
+train_model_name:latest.pdparams
+--data-path:./lite_data
+##
+trainer:fleet_train
+fleet_train:train.py
+##
+===========================eval_params===========================
+eval:train.py --test-only
+##
+===========================infer_params===========================
+--save-inference-dir:./output/mobilenet_v3_small_infer/
+--pretrained:
+norm_export:tools/export_model.py --model=mobilenet_v3_small
+##
+train_model:./pretrain_models/mobilenet_v3_small_pretrained.pdparams
+infer_export:tools/export_model.py --model=mobilenet_v3_small
+##
+inference:deploy/inference_python/infer.py
+--use-gpu:True|False
+--batch-size:1
+--model-dir:./output/mobilenet_v3_small_infer/
+--img-path:./images/demo.jpg
+--benchmark:True
\ No newline at end of file
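
The `gpu_list` value in the config above packs two things into one field: a comma-separated IP list and a comma-separated GPU list, joined by `;`. The following is a minimal sketch of how such a value could be split, assuming that layout. The helper names `split_gpu_list` and `count_items` are hypothetical and are not part of `test_tipc`.

```bash
#!/bin/sh
# Hypothetical helper (not part of test_tipc): split a gpu_list value of the
# form "<ip>,<ip>;<gpu>,<gpu>" into the variables `ips` and `gpus`.
split_gpu_list() {
    value=$1
    case "$value" in
        *";"*)
            ips=${value%%;*}    # text before the first ';'
            gpus=${value#*;}    # text after the first ';'
            ;;
        *)
            ips=""              # no ';' is treated as a single-machine entry
            gpus=$value
            ;;
    esac
}

# Count the comma-separated entries of a list; an empty list counts as 0.
count_items() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    old_ifs=$IFS
    IFS=","
    set -- $1                   # word-split the list on ','
    IFS=$old_ifs
    echo $#
}

split_gpu_list "192.168.0.1,192.168.0.2;0,1"
echo "ips=${ips} gpus=${gpus} nodes=$(count_items "$ips")"
```

Counting entries rather than measuring string length keeps the node count independent of how long each address happens to be.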
--- a/tutorials/mobilenetv3_prod/Step6/test_tipc/docs/test_train_fleet_inference_python.md
+++ b/tutorials/mobilenetv3_prod/Step6/test_tipc/docs/test_train_fleet_inference_python.md
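+# Linux GPU/CPU multi-machine multi-GPU training and inference test
+
+The main program for the Linux GPU/CPU multi-machine multi-GPU training and inference test is `test_train_inference_python.sh`, which tests basic Python-based functionality such as model training, evaluation, and inference.
+
+## 1. Summary of test results
+
+- Training:
+
+| Algorithm | Model | Multi-machine multi-GPU |
+|  :----: |   :----:  |    :----:  |
+|  MobileNetV3  | mobilenet_v3_small | Distributed training |
+
+
+- Inference:
+
+| Algorithm | Model | device_CPU | device_GPU | batchsize |
+|  :----:   |  :----: |   :----:   |  :----:  |   :----:   |
+|  MobileNetV3   |  mobilenet_v3_small |  Supported | Supported | 1 |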
+
+
+## 2. Test procedure
+
+### 2.1 Prepare the environment
+- Prepare at least two machines that can `ping` each other
+
+  Running inside a Docker container is recommended. Taking the Paddle 2.2.2 GPU build with CUDA 10.2 and cuDNN 7 as an example:
+  ```
+  # Pull the image with PaddlePaddle preinstalled:
+  nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7
+
+  # Create a Docker container from the image and enter it:
+  nvidia-docker run --name paddle -it --net=host -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7 /bin/bash
+  ```
+  For other physical machine environments, see the [official installation guide](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html#old-version-anchor-2-%E5%AE%89%E8%A3%85%E6%AD%A5%E9%AA%A4) for installation.
+
+- Clone the code
+  ```
+  git clone https://github.com/PaddlePaddle/models.git
+  cd models/tutorials/mobilenetv3_prod/Step6
+  ```
+
+- Install the dependencies
+    ```
+    pip install  -r requirements.txt
+    ```
+
+- Install AutoLog (a tool for standardized log output)
+    ```
+    pip install  https://paddleocr.bj.bcebos.com/libs/auto_log-1.2.0-py3-none-any.whl
+    ```
+
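+### 2.2 Functional test
+
+First, update the `ip` setting in the configuration file: if the two machines have the `ip` addresses `192.168.0.1` and `192.168.0.2`, set the `gpu_list` field to `gpu_list:192.168.0.1,192.168.0.2;0,1`. Use `ifconfig` to look up a machine's `ip` address.
+
+The test is run as shown below. To test a different model, substitute your own parameter configuration file.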
+
+```bash
+bash test_tipc/test_train_inference_python.sh ${your_params_file} lite_train_lite_infer
+```
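+**Note:** Unlike the single-machine test, the multi-machine multi-GPU training and inference test requires launching the command separately on every node.
+
+Taking the `Linux GPU/CPU multi-machine multi-GPU training and inference test` of `mobilenet_v3_small` as an example, the commands are as follows.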
+
+```bash
+bash test_tipc/prepare.sh test_tipc/configs/mobilenet_v3_small/train_fleet_infer_python.txt lite_train_lite_infer
+```
+
+```bash
+bash test_tipc/test_train_inference_python.sh test_tipc/configs/mobilenet_v3_small/train_fleet_infer_python.txt lite_train_lite_infer
+```
+
+Output like the following indicates that the commands ran successfully.
+
+```bash
+Run successfully with command - python3.7 -m paddle.distributed.launch --ips=192.168.0.1,192.168.0.2 --gpus=0,1 train.py --output-dir=./log/mobilenet_v3_small/lite_train_lite_infer/norm_train_gpus_0,1_nodes_2 --epochs=5   --batch-size=4!
+......
+Run successfully with command - python3.7 deploy/inference_python/infer.py --use-gpu=False --model-dir=./log/mobilenet_v3_small/lite_train_lite_infer/norm_train_gpus_0,1_nodes_2 --batch-size=1   --benchmark=True > ./log/mobilenet_v3_small/lite_train_lite_infer/python_infer_cpu_batchsize_1.log 2>&1 !
+```
+
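+With the benchmark parameter enabled, detailed test data is reported: runtime environment information (OS version, CUDA version, cuDNN version, driver version), Paddle version information, parameter settings (runtime device, number of threads, whether memory optimization is enabled, etc.), model information (model name, precision), data information (batch size, whether the input shape is dynamic, etc.), and performance information (CPU and GPU usage, total run time, preprocessing time, inference time, postprocessing time). The content looks like this: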
+
+```
+[2022/03/22 06:15:51] root INFO: ---------------------- Env info ----------------------
+[2022/03/22 06:15:51] root INFO:  OS_version: Ubuntu 16.04
+[2022/03/22 06:15:51] root INFO:  CUDA_version: 10.2.89
+[2022/03/22 06:15:51] root INFO:  CUDNN_version: 7.6.5
+[2022/03/22 06:15:51] root INFO:  drivier_version: 440.64.00
+[2022/03/22 06:15:51] root INFO: ---------------------- Paddle info ----------------------
+[2022/03/22 06:15:51] root INFO:  paddle_version: 2.2.2
+[2022/03/22 06:15:51] root INFO:  paddle_commit: b031c389938bfa15e15bb20494c76f86289d77b0
+[2022/03/22 06:15:51] root INFO:  log_api_version: 1.0
+[2022/03/22 06:15:51] root INFO: ----------------------- Conf info -----------------------
+[2022/03/22 06:15:51] root INFO:  runtime_device: cpu
+[2022/03/22 06:15:51] root INFO:  ir_optim: True
+[2022/03/22 06:15:51] root INFO:  enable_memory_optim: True
+[2022/03/22 06:15:51] root INFO:  enable_tensorrt: False
+[2022/03/22 06:15:51] root INFO:  enable_mkldnn: False
+[2022/03/22 06:15:51] root INFO:  cpu_math_library_num_threads: 1
+[2022/03/22 06:15:51] root INFO: ----------------------- Model info ----------------------
+[2022/03/22 06:15:51] root INFO:  model_name: classification
+[2022/03/22 06:15:51] root INFO:  precision: fp32
+[2022/03/22 06:15:51] root INFO: ----------------------- Data info -----------------------
+[2022/03/22 06:15:51] root INFO:  batch_size: 1
+[2022/03/22 06:15:51] root INFO:  input_shape: dynamic
+[2022/03/22 06:15:51] root INFO:  data_num: 1
+[2022/03/22 06:15:51] root INFO: ----------------------- Perf info -----------------------
+[2022/03/22 06:15:51] root INFO:  cpu_rss(MB): 227.2812, gpu_rss(MB): None, gpu_util: None%
+[2022/03/22 06:15:51] root INFO:  total time spent(s): 0.1583
+[2022/03/22 06:15:51] root INFO:  preprocess_time(ms): 18.6493, inference_time(ms): 139.591, postprocess_time(ms): 0.0875
+```
+
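+This information can be found in the run log. For `mobilenet_v3_small`, the log is located at `./log/mobilenet_v3_small/lite_train_lite_infer/python_infer_gpu_batchsize_1.log`.
+
+If a run fails, the failure log and the corresponding command are also printed to the terminal. That command can then be used to analyze the cause of the failure.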
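
Section 2.2 of the document above asks for the `gpu_list` field to be edited by hand on each node before the test is launched. As a rough illustration only, the edit could be scripted along the following lines. The function `set_gpu_list` is a hypothetical helper that is not shipped with `test_tipc`, and the sketch assumes the config file contains exactly one line beginning with `gpu_list:`.

```bash
#!/bin/sh
# Hypothetical helper (not part of test_tipc): rewrite the gpu_list line of a
# TIPC config file in place.
#   $1: path to the config file
#   $2: comma-separated node IPs, e.g. "192.168.0.1,192.168.0.2"
#   $3: comma-separated GPU ids, e.g. "0,1"
set_gpu_list() {
    config=$1
    ips=$2
    gpus=$3
    tmp="${config}.tmp"
    # Write to a temporary file first so a failed sed leaves the original intact.
    sed "s/^gpu_list:.*/gpu_list:${ips};${gpus}/" "$config" > "$tmp" \
        && mv "$tmp" "$config"
}
```

A call such as `set_gpu_list test_tipc/configs/mobilenet_v3_small/train_fleet_infer_python.txt "192.168.0.1,192.168.0.2" "0,1"` would then need to be run on every node, since each node launches the test from its own copy of the config.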
--- a/tutorials/mobilenetv3_prod/Step6/test_tipc/test_train_inference_python.sh
+++ b/tutorials/mobilenetv3_prod/Step6/test_tipc/test_train_inference_python.sh
@@ -31,6 +31,9 @@ data_path_value=$(func_parser_value "${lines[10]}")
 trainer_list=$(func_parser_value "${lines[12]}")
 norm_trainer=$(func_parser_key "${lines[13]}")
 trainer_py=$(func_parser_value "${lines[13]}")
+# nodes
+nodes_key=$(func_parser_key "${lines[14]}")
+nodes_value=$(func_parser_value "${lines[14]}")

 # eval params
 eval_py=$(func_parser_value "${lines[16]}")
@@ -152,6 +155,15 @@ else
            array=(${gpu})
            env="export CUDA_VISIBLE_DEVICES=${array[0]}"
            IFS="|"
+        else
+            IFS=";"
+            array=(${gpu})
+            ips=${array[0]}
+            gpu=${array[1]}
+            IFS=","
+            array=(${gpu})
+            env="export CUDA_VISIBLE_DEVICES=${array[0]}"
+            IFS="|"
        fi

        for trainer in ${trainer_list[*]}; do
@@ -165,15 +177,22 @@ else
            set_epoch=$(func_set_params "${epoch_key}" "${epoch_num}")
            set_pretrain=$(func_set_params "${pretrain_model_key}" "${pretrain_model_value}")
            set_batchsize=$(func_set_params "${train_batch_key}" "${train_batch_value}")
-            if [ ${#ips} -le 26 ];then
+            if [ ${#ips} -le 15 ];then
                save_log="${LOG_PATH}/${trainer}_gpus_${gpu}"
-                nodes=1
+            else                  
+                IFS=","
+                ips_array=(${ips})
+                IFS="|"
+                nodes=${#ips_array[@]}
+                save_log="${LOG_PATH}/${trainer}_gpus_${gpu}_nodes_${nodes}"
            fi
            set_save_model=$(func_set_params "${save_model_key}" "${save_log}")
            if [ ${#gpu} -le 2 ];then  # train with single gpu
                cmd="${python} ${run_train} ${set_save_model} ${set_epoch} ${set_pretrain} ${set_batchsize}"
-            elif [ ${#ips} -le 26 ];then  # train with multi-gpu
+            elif [ ${#ips} -le 15 ];then  # train with multi-gpu
                cmd="${python} -m paddle.distributed.launch --gpus=${gpu} ${run_train} ${set_save_model} ${set_epoch} ${set_pretrain} ${set_batchsize}"
+            else     # train with multi-machine
+                cmd="${python} -m paddle.distributed.launch --ips=${ips} --gpus=${gpu} ${run_train} ${set_save_model} ${set_epoch} ${set_pretrain} ${set_batchsize}"
            fi
            # run train
            eval $cmd
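
The hunk above tells the single-machine case from the multi-machine case by the length of the `ips` string (`${#ips} -le 15`, the longest possible single dotted IPv4 address). A minimal alternative sketch, which is not what the script does, is to test for a comma in each list. The function `build_train_cmd` is hypothetical; only the `paddle.distributed.launch` flags `--ips` and `--gpus` are taken from the diff.

```bash
#!/bin/sh
# Hypothetical sketch (not the script's actual logic): choose the launch
# command from the shape of the ips and gpus lists.
#   $1: python binary   $2: comma-separated IPs (may be empty)
#   $3: comma-separated GPU ids   $4: training script
build_train_cmd() {
    python_bin=$1
    ips=$2
    gpus=$3
    train_py=$4
    case "$ips" in
        *,*)    # more than one IP: multi-machine training
            echo "${python_bin} -m paddle.distributed.launch --ips=${ips} --gpus=${gpus} ${train_py}"
            ;;
        *)
            case "$gpus" in
                *,*)    # one machine, several GPUs
                    echo "${python_bin} -m paddle.distributed.launch --gpus=${gpus} ${train_py}"
                    ;;
                *)      # one machine, one GPU
                    echo "${python_bin} ${train_py}"
                    ;;
            esac
            ;;
    esac
}

build_train_cmd python3.7 "192.168.0.1,192.168.0.2" "0,1" train.py
```

Unlike the script, which checks the GPU count first, this sketch checks the IP list first, so a multi-machine entry with a single GPU per node would still go through `paddle.distributed.launch`.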

--- a/tutorials/tipc/train_fleet_infer_python/README.md
+++ b/tutorials/tipc/train_fleet_infer_python/README.md
@@ -54,3 +54,4 @@
 <a name="3"></a>

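 # 3. Multi-machine multi-GPU training and inference test development and specification
+See the [Linux GPU multi-machine multi-GPU training and inference test development document](./test_train_fleet_infer_python.md) to add the configuration file, verify that the full multi-machine multi-GPU training and inference pipeline is correct, and write the test documentation.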
--- a/tutorials/tipc/train_fleet_infer_python/test_train_fleet_infer_python.md
+++ b/tutorials/tipc/train_fleet_infer_python/test_train_fleet_infer_python.md
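-# Linux GPU multi-machine multi-GPU inference test development document
+# Linux GPU/CPU multi-machine multi-GPU training and inference test development document

 # Table of contents
-
 - [1. Introduction](#1)
- [2. Basic multi-machine multi-GPU training and inference functional test development](#2---)
- [3. Advanced multi-machine multi-GPU training and inference functional test development](#3---)
+- [2. Command and configuration file walkthrough](#2)
+- [3. Multi-machine multi-GPU training and inference test development](#3)
 - [4. FAQ](#4)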
+<a name="1"></a>
+
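+## 1. Introduction
+
+This document covers end-to-end functional testing of multi-machine multi-GPU training and inference for models on Linux GPU/CPU. As with the basic training and inference test, the specific test points are:
+
+- Model training: multi-machine multi-GPU training runs successfully
+- Dynamic-to-static conversion: saving the static graph model runs successfully
+- Model inference: the inference process runs successfully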
+
+<a name="2"></a>
+
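+## 2. Command and configuration file walkthrough
+
+For this section, refer to the [basic training and inference test development document](../train_infer_python/test_train_infer_python.md#2). The **main differences** are lines 4, 13, and 14 of the configuration file, as shown below:
+| Line | Reference content                           | Meaning         | Change key? | Change value? | What to change    |
+|----|---------------------------------------------|-----------------|-----------|-------------|-------------------|
+| 4  | gpu_list:xx.xx.xx.xx,yy.yy.yy.yy;0,1     | Node IP addresses and GPU IDs | No | Yes | Set the value to your own IP addresses and GPU IDs |
+| 13 | trainer:fleet_train                          | Training method | No | No | - |
+| 14 | fleet_train:train.py                         | Multi-machine multi-GPU training script | No | Yes | The value can be changed to your own training command |
+
+Take the training command `python3.7 -m paddle.distributed.launch --ips 192.168.0.1,192.168.0.2 --gpus 0,1 train.py --device=gpu --epochs=1 --data-path=./lite_data` as an example. It performs multi-machine multi-GPU training (not pruning, quantization, distillation, or similar) on GPUs `0,1` of the machines with `ip` addresses `192.168.0.1` and `192.168.0.2`.
+Therefore:
+* Line 4 of the configuration file is `gpu_list:192.168.0.1,192.168.0.2;0,1`.
+* Line 13 of the configuration file sets `trainer` to `fleet_train`, as opposed to `normal_train` for basic training and `amp_train` for mixed-precision training.
+* Line 14 of the configuration file is `fleet_train:train.py`.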
+
+<a name="3"></a>
+
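+## 3. Multi-machine multi-GPU training and inference functional test development
+
+Developing the multi-machine multi-GPU training and inference functional test likewise involves the following 6 steps.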
+
+<div align="center">
+    <img src="../train_infer_python/images/test_linux_train_infer_python_pipeline.png" width="600">
+</div>
+
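+The pipeline has 2 checkpoints. The detailed development process is similar to [basic training and inference test development](../train_infer_python/test_train_infer_python.md#3). The **main differences** are the following four:
+
+* ### 1) Prepare the verification environment
+
+This step requires at least two machines that can `ping` each other. Running inside a Docker container is recommended. Taking the Paddle 2.2.2 GPU build with CUDA 10.2 and cuDNN 7 as an example:
+```
+# Pull the image with PaddlePaddle preinstalled:
+nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7
+
+# Create a Docker container from the image and enter it:
+nvidia-docker run --name paddle -it --net=host -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7 /bin/bash
+```
+For other physical machine environments, see the [official installation guide](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html#old-version-anchor-2-%E5%AE%89%E8%A3%85%E6%AD%A5%E9%AA%A4) for creating the Docker container.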
+
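+* ### 2) Add the configuration file
+
+Copy the file [train_fleet_infer_python.txt](../../mobilenetv3_prod/Step6/test_tipc/configs/mobilenet_v3_small/train_fleet_infer_python.txt) into `test_tipc/configs/model_name`, where `model_name` is the name of your own model. Also update the `model_name` field in the `train_fleet_infer_python.txt` template accordingly.
+
+* ### 3) Verify the configuration
+
+First, update the `ip` setting in the configuration file: if the two machines have the `ip` addresses `192.168.0.1` and `192.168.0.2`, set the `gpu_list` field to `gpu_list:192.168.0.1,192.168.0.2;0,1`. Use `ifconfig` to look up a machine's `ip` address.
+
+With the updated configuration, run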
+
+```bash
+bash test_tipc/prepare.sh ${your_params_file} lite_train_lite_infer
+bash test_tipc/test_train_inference_python.sh ${your_params_file} lite_train_lite_infer
+```
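+**Note:** Unlike the single-machine case, verifying multi-machine multi-GPU training and inference requires launching the commands separately on every node.
+
+Taking the `Linux GPU/CPU multi-machine multi-GPU training and inference functional test` of `mobilenet_v3_small` as an example, the command is as follows.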
+
+```bash
+bash test_tipc/test_train_inference_python.sh test_tipc/configs/mobilenet_v3_small/train_fleet_infer_python.txt lite_train_lite_infer
+```
+
+Output like the following indicates that the commands ran successfully.
+```bash
+Run successfully with command - python3.7 -m paddle.distributed.launch --ips=192.168.0.1,192.168.0.2 --gpus=0,1 train.py --output-dir=./log/mobilenet_v3_small/lite_train_lite_infer/norm_train_gpus_0,1_nodes_2 --epochs=5   --batch-size=4!
+......
+Run successfully with command - python3.7 deploy/inference_python/infer.py --use-gpu=False --model-dir=./log/mobilenet_v3_small/lite_train_lite_infer/norm_train_gpus_0,1_nodes_2 --batch-size=1   --benchmark=True > ./log/mobilenet_v3_small/lite_train_lite_infer/python_infer_cpu_batchsize_1.log 2>&1 !
+```
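+If all commands run successfully with the modified configuration file, the verification passes.
+
+* ### 4) Write the documentation
+
+Here you need to add the documentation for the `Linux GPU/CPU multi-machine multi-GPU training and inference functional test`. A template is available at [test_train_fleet_inference_python.md](../../mobilenetv3_prod/Step6/test_tipc/docs/test_train_fleet_inference_python.md); copy it into your own repo and adapt it to your model.
+
+Once both the multi-machine multi-GPU training test and the basic training test have been developed, the final directory structure of the repo looks like this.
+```
+test_tipc
+    |--configs                                  # configuration directory
+    |    |--model_name                          # your model name
+    |           |--train_infer_python.txt       # config for the basic training and inference test
+    |           |--train_fleet_infer_python.txt # config for the multi-machine multi-GPU training and inference test
+    |--docs                                     # documentation directory
+    |   |--test_train_inference_python.md       # doc for the basic training and inference test
+    |   |--test_train_fleet_inference_python.md # doc for the multi-machine multi-GPU training and inference test
+    |----README.md                              # TIPC overview
+    |----prepare.sh                             # data preparation script for the basic and multi-machine multi-GPU tests
+    |----test_train_inference_python.sh         # parsing script for the basic and multi-machine multi-GPU tests
+    |----common_func.sh                         # common functions for the basic and multi-machine multi-GPU tests
+```
+Finally, follow `test_train_fleet_inference_python.md` to run the `Linux GPU/CPU multi-machine multi-GPU training and inference functional test` flow end to end.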
+
+<a name="4"></a>
+
+## 4. FAQ