Unverified commit b0dd4863, authored by W Wenyu, committed by GitHub

dist training doc (#6276)

* dist training doc

* add multi-machine distributed training link
Parent b01af8bd
@@ -49,7 +49,10 @@ Training PP-YOLOE with mixed precision on 8 GPUs with the following command
python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/ppyoloe/ppyoloe_crn_l_300e_coco.yml --amp
```
**Notes:**
- Use `--amp` when training with the default config to avoid running out of memory.
- PaddleDetection supports multi-machine distributed training; see the [DistributedTraining tutorial](../../docs/DistributedTraining_en.md). A minimal multi-machine launch sketch follows these notes.
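For reference, a hedged sketch of extending the same AMP command to multiple machines with `paddle.distributed.launch` (the IPs are placeholders; the identical command must be run on every machine, and the tutorial linked above has the details):

```bash
# Hypothetical IPs; replace with the real addresses of your machines.
python -m paddle.distributed.launch \
    --ips 192.168.1.2,192.168.1.3 \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_l_300e_coco.yml --amp
```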
### Evaluation
......
@@ -48,8 +48,9 @@ PP-YOLOE is composed of the following methods
```bash
python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/ppyoloe/ppyoloe_crn_l_300e_coco.yml --amp
```
**Notes:**
- Use `--amp` when training with the default config to avoid running out of GPU memory.
- PaddleDetection supports multi-machine distributed training; see the [multi-machine training tutorial](../../docs/DistributedTraining_cn.md).
### Evaluation
......
[English](DistributedTraining_en.md) | 简体中文
# Distributed Training
## 1. Introduction
* Distributed training splits a training job across multiple compute nodes, each of which computes gradients and other updates that are then aggregated to update the model. PaddlePaddle's distributed training technology grew out of Baidu's production practice and has been validated on ultra-large-scale workloads in natural language processing, computer vision, search, and recommendation. High-performance distributed training is one of PaddlePaddle's core strengths, and PaddleDetection supports both single-machine and multi-machine training. For more on distributed training methods and documentation, see the [distributed training quick start tutorial](https://fleet-x.readthedocs.io/en/latest/paddle_fleet_rst/parameter_server/ps_quick_start.html).
## 2. Usage
### 2.1 Single-machine training
* Taking PP-YOLOE-s as an example: after preparing the data locally, start the training task with the `paddle.distributed.launch` or `fleetrun` interface. An example launch script follows, with a `paddle.distributed.launch` equivalent sketched after it.
```bash
fleetrun \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml \
    --eval >logs.txt 2>&1 &
```
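The task can equivalently be launched with `paddle.distributed.launch`, as sketched below (same config; the flags mirror the `fleetrun` example above):

```bash
python -m paddle.distributed.launch \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml --eval
```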
### 2.2 Multi-machine training
* Compared with single-machine training, multi-machine training only needs the additional `--ips` argument: a comma-separated list of the IP addresses of the machines participating in the distributed job. An example follows.
```bash
ip_list="10.127.6.17,10.127.5.142,10.127.45.13,10.127.44.151"
fleetrun \
    --ips=${ip_list} \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml \
    --eval >logs.txt 2>&1 &
```
**Notes:**
* The IP addresses of the machines are separated by commas and can be looked up with `ifconfig` or `ipconfig`.
* Passwordless SSH must be set up between the machines, and they must be able to ping each other directly; otherwise communication will fail. A setup sketch follows these notes.
* The code, data, and launch commands or scripts must be identical across machines, and the configured training command or script must be run on every machine. The first device of the first machine in `ip_list` becomes trainer0, and so on.
* The starting port may differ across machines. Before launching a multi-machine job, it is recommended to set the same starting port on every machine via `export FLAGS_START_PORT=17000`; port values between `10000` and `20000` are recommended.
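As a reference for the passwordless-SSH and connectivity requirements above, a minimal sketch run from one machine (the IPs are the example ones from `ip_list`; `ssh-copy-id` assumes an SSH key pair, generated here if absent):

```bash
# Generate a key pair once if none exists, then authorize it on every peer.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for ip in 10.127.6.17 10.127.5.142 10.127.45.13 10.127.44.151; do
    ssh-copy-id ${ip}   # enable passwordless login
    ping -c 1 ${ip}     # verify direct connectivity
done
```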
## 3. Performance
* Training [PP-YOLOE-s](../../configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml) on a single machine and on 4 machines, each with 8 V100 GPUs, gives the training times below: the 4-machine job is roughly 3x faster (39h vs. 13h) at a cost of 0.6 points of mAP.

Machine | mAP | Training time
-|-|-
single machine, 8 GPUs | 42.7% | 39h
4 machines, 8 GPUs each | 42.1% | 13h
English | [简体中文](DistributedTraining_cn.md)
## 1. Usage
### 1.1 Single-machine
* Taking PP-YOLOE-s as an example: after preparing the data locally, start the training task with the `paddle.distributed.launch` or `fleetrun` interface. An example launch script is shown below.
```bash
fleetrun \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml \
    --eval >logs.txt 2>&1 &
```
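The launch command above runs in the background and sends all output to `logs.txt`, so training progress can be followed with:

```bash
tail -f logs.txt
```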
### 1.2 Multi-machine
* Compared with single-machine training, multi-machine training only needs the additional `--ips` argument: a comma-separated list of the IP addresses of the machines participating in the distributed job. An example follows.
```bash
ip_list="10.127.6.17,10.127.5.142,10.127.45.13,10.127.44.151"
fleetrun \
    --ips=${ip_list} \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml \
    --eval >logs.txt 2>&1 &
```
**Notes:**
* The IP addresses of the machines are separated by commas and can be looked up with `ifconfig` or `ipconfig`.
* Passwordless SSH must be set up between the machines, and they must be able to ping each other directly; otherwise communication will fail.
* The code, data, and launch commands or scripts must be identical across machines, and the configured training command or script must be run on every machine. The first device of the first machine in `ip_list` becomes trainer0, and so on. A launch sketch follows these notes.
* The starting port may differ across machines. Before launching a multi-machine job, it is recommended to set the same starting port on every machine via `export FLAGS_START_PORT=17000`; port values between `10000` and `20000` are recommended.
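Because the identical command must be launched on every machine, one hedged way to kick off the whole job from a single node (assuming passwordless SSH is configured and the code and data live at the same path, here `~/PaddleDetection`, on every machine):

```bash
ip_list="10.127.6.17,10.127.5.142,10.127.45.13,10.127.44.151"
for ip in ${ip_list//,/ }; do
    # Start the same fleetrun command on each machine in the list.
    ssh ${ip} "cd ~/PaddleDetection && nohup fleetrun --ips=${ip_list} \
        --gpus 0,1,2,3,4,5,6,7 \
        tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml --eval \
        >logs.txt 2>&1 &"
done
```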
## 2. Performance
* Training [PP-YOLOE-s](../../configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml) on a single machine and on 4 machines, each with 8 V100 GPUs, gives the training times below: the 4-machine job is roughly 3x faster (39h vs. 13h) at a cost of 0.6 points of mAP.

Machine | mAP | Training time
-|-|-
single machine, 8 GPUs | 42.7% | 39h
4 machines, 8 GPUs each | 42.1% | 13h
@@ -14,10 +14,9 @@ instructions](INSTALL_cn.md).
- Please refer to [PrepareDetDataSet](PrepareDetDataSet_en.md) for data preparation
- Please set the data path in the dataset configuration files in ```configs/datasets```
## Training & Evaluation & Inference
PaddleDetection provides scripts for training, evaluation, and inference with various features for different configurations. For more details on distributed training, see [DistributedTraining](./DistributedTraining_en.md).
```bash
# training on single-GPU
@@ -26,6 +25,9 @@ python tools/train.py -c configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
# training on multi-GPU
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
# training on multiple machines with multiple GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
fleetrun --ips="10.127.6.17,10.127.5.142,10.127.45.13,10.127.44.151" --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml
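# NOTE: the same fleetrun command must be run on every machine listed in --ips;
# the first device of the first machine in the list becomes trainer0 (see DistributedTraining_en.md).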
# GPU evaluation
export CUDA_VISIBLE_DEVICES=0
python tools/eval.py -c configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.yml -o weights=https://paddledet.bj.bcebos.com/models/faster_rcnn_r50_fpn_1x_coco.pdparams
......
@@ -99,6 +99,15 @@ python tools/train.py -c configs/yolov3/yolov3_mobilenet_v1_roadsign.yml
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # not needed on Windows or Mac
python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/yolov3/yolov3_mobilenet_v1_roadsign.yml
```
* [Multi-machine multi-GPU training](./DistributedTraining_cn.md)
```bash
fleetrun \
    --ips="10.127.6.17,10.127.5.142,10.127.45.13,10.127.44.151" \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/yolov3/yolov3_mobilenet_v1_roadsign.yml
```
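If the default communication ports collide across machines, the starting port can be aligned on every machine before launching, as described in the [multi-machine training tutorial](./DistributedTraining_cn.md) (the port value is an example; `10000~20000` is the recommended range):

```bash
export FLAGS_START_PORT=17000
```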
* Fine-tuning other tasks
When fine-tuning other tasks with a pretrained model, the pretrained weights can be loaded directly, and parameters with mismatched shapes are automatically ignored, for example:
......