update

94b321a4 · sandyhouse · 9d9d014e · 9d9d014e · 94b321a4 · 94b321a4
2 changed files
--- a/tutorials/tipc/train_fleet_infer_python/images/data_parallel.png
+++ b/tutorials/tipc/train_fleet_infer_python/images/data_parallel.png
--- a/tutorials/tipc/train_fleet_infer_python/train_fleet_infer_python.md
+++ b/tutorials/tipc/train_fleet_infer_python/train_fleet_infer_python.md
@@ -28,92 +28,118 @@

 <img src="images/data_parallel.png" title="" alt="" data-align="center">

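-  As the figure above shows, compared with ordinary single-machine, single-GPU training, distributed training with PaddlePaddle requires adding only three pieces of code:
-
-1. Import the packages required for distributed training
-
-2. Initialize the distributed environment
-
-3. Wrap the user-defined network with DataParallel
-   
-   Each part is explained below.
- 2.1 Import dependencies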
-  
-  ```python
-  import paddle.distributed as dist
-  ```
-
- 2.2 Initialize the distributed environment
-  
-  ```python
-  dist.init_parallel_env()
-  ```
-
- 2.3 Wrap the user-defined network with DataParallel
-  
-  ```python
-  model = paddle.DataParallel(model)
-  ```
-
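-Assume the user's training script is named train.py. The following explains how to launch a distributed training job.
-
-1. Launch a single-machine, multi-GPU job
-   
-   On a single machine with multiple GPUs, launch the distributed training job with the following command: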
-   
-   ```shell
-   python -m paddle.distributed.launch --gpus="0,1" train.py
-   ```
-   
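-   The ``--gpus`` option specifies which GPUs the distributed training job uses.
-
-2. Launch a multi-machine, multi-GPU job
-   
-   We use two machines as an example to show how to launch a multi-machine, multi-GPU distributed training job. Assume the two machines have the IP addresses 192.168.0.1 and 192.168.0.2.
-   
-   First, make sure the two machines can reach each other over the network. The ``ping`` command can verify this, as shown below: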
-   
-   ```shell
-   # On the machine with IP address 192.168.0.1
-   ping 192.168.0.2
-   ```
-   
-   Next, launch the distributed job on each of the two machines:
-   
-   ```shell
-   # On the machine with IP address 192.168.0.1
-   python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
-   ```
-   
-   ```shell
-   # On the machine with IP address 192.168.0.2
-   python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
-   ```
-   
-   After the commands above are launched, the console prints information similar to the following:
-   
-   ```shell
-   WARNING 2021-01-04 17:59:08,725 launch.py:314] Not found distinct arguments and compiled with cuda. Default use collective mode
-   launch train in GPU mode
-   INFO 2021-01-04 17:59:08,727 launch_utils.py:472] Local start 2 processes. First process distributed environment info (Only For Debug):
-       +=======================================================================================+
-       |                        Distributed Envs                      Value                    |
-       +---------------------------------------------------------------------------------------+
-       |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:17901               |
-       |                     PADDLE_TRAINERS_NUM                        2                      |
-       |                PADDLE_TRAINER_ENDPOINTS         127.0.0.1:17901,127.0.0.1:18846       |
-       |                     FLAGS_selected_gpus                        0                      |
-       |                       PADDLE_TRAINER_ID                        0                      |
-       +=======================================================================================+
-   
-   ...
-   W0104 17:59:19.018365 43338 device_context.cc:342] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 9.2
-   W0104 17:59:19.022523 43338 device_context.cc:352] device: 0, cuDNN Version: 7.4.
-   W0104 17:59:23.193490 43338 fuse_all_reduce_op_pass.cc:78] Find all_reduce operators: 161. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 5.
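+As the figure above shows, compared with ordinary single-machine, single-GPU training, distributed training with PaddlePaddle requires only the following six steps:

+1. Prepare the multi-machine, multi-GPU environment
+
+2. Import dependencies
+
+3. Initialize the distributed environment
+
+4. Wrap the model with the data-parallel operator
+
+5. Verify that the multi-machine, multi-GPU program runs correctly
+
+6. Develop the model inference program
+
+Each step is explained below.
+
+### 2.1 Prepare the multi-machine, multi-GPU environment
+
+Prepare two machines, each with at least two GPUs. Assume their IP addresses are 192.168.0.1 and 192.168.0.2. Use the ping command to check that the two machines can reach each other over the network: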
+
+```shell
+# On the machine with IP address 192.168.0.1
+ping 192.168.0.2
+```
+
+If the console prints output similar to the following, the two machines can reach each other.
+
+```shell
+PING 192.168.0.2 (192.168.0.2): 56 data bytes
+64 bytes from 192.168.0.2: icmp_seq=0 ttl=64 time=0.090 ms
+64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.111 ms
+64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.094 ms
+64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.089 ms
+```
+
+Conversely, if the output looks like the following, the two machines cannot reach each other; consult your network administrator.
+
+```shell
+PING 192.168.0.2 (192.168.0.2): 56 data bytes
+Request timeout for icmp_seq 0
+Request timeout for icmp_seq 1
+```
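The connectivity check above can also be scripted. The following is a minimal sketch, not part of the tutorial itself: the helper name `ping_reachable` is hypothetical, and it assumes reply lines follow the `N bytes from <host>: icmp_seq=...` shape shown in the sample output.

```python
import re


def ping_reachable(ping_output: str) -> bool:
    """Return True if the ping output contains at least one echo reply."""
    # A reply line looks like:
    #   64 bytes from 192.168.0.2: icmp_seq=0 ttl=64 time=0.090 ms
    reply = re.compile(r"^\d+ bytes from \S+: icmp_seq=\d+", re.MULTILINE)
    return reply.search(ping_output) is not None


# Sample outputs modeled on the two cases described above.
ok_output = """PING 192.168.0.2 (192.168.0.2): 56 data bytes
64 bytes from 192.168.0.2: icmp_seq=0 ttl=64 time=0.090 ms
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.111 ms
"""
bad_output = """PING 192.168.0.2 (192.168.0.2): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
"""

print(ping_reachable(ok_output))
print(ping_reachable(bad_output))
```

The text of ping output varies between operating systems, so the regular expression may need adjusting for a given platform.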
+
+### 2.2 Import dependencies
+
+```python
+import paddle
+import paddle.distributed as dist
+```
+
+### 2.3 Initialize the distributed environment
+
+```python
+dist.init_parallel_env()
+```
+
+### 2.4 Wrap the model with the data-parallel operator
+
+```python
+model = paddle.DataParallel(model)
+```
+
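+### 2.5 Verify that the multi-machine, multi-GPU program runs correctly
+
+Assume the user's training script is named train.py. The following explains how to launch the distributed training job and verify that the program runs correctly.
+
+Launch the distributed job on each of the two machines: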
+
+```shell
+# On the machine with IP address 192.168.0.1
+python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
 ```
+
+```shell
+# On the machine with IP address 192.168.0.2
+python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
+```
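Both machines run the same launch command, so it can be convenient to build it in one place. The following is a minimal sketch under stated assumptions: `build_launch_command` is a hypothetical helper, not part of Paddle, and it uses only the `--ips` and `--gpus` options shown above. It also validates each address first, so a typo in an IP is caught before launch.

```python
import ipaddress
import shlex


def build_launch_command(ips, gpus, script):
    """Build the paddle.distributed.launch command as an argument list."""
    for ip in ips:
        # Raises ValueError if an address is malformed.
        ipaddress.ip_address(ip)
    return [
        "python", "-m", "paddle.distributed.launch",
        "--ips=" + ",".join(ips),
        "--gpus=" + ",".join(str(g) for g in gpus),
        script,
    ]


cmd = build_launch_command(["192.168.0.1", "192.168.0.2"], [0, 1], "train.py")
# Print the command in a form that can be pasted into a shell on each machine.
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` on each machine. Returning a list rather than a single string avoids shell-quoting problems.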
+
+After the commands above are launched, the console prints information similar to the following:
+
+```shell
+WARNING 2021-01-04 17:59:08,725 launch.py:314] Not found distinct arguments and compiled with cuda. Default use collective mode
+launch train in GPU mode
+INFO 2021-01-04 17:59:08,727 launch_utils.py:472] Local start 4 processes. First process distributed environment info (Only For Debug):
+    +=======================================================================================+
+    |                        Distributed Envs                      Value                    |
+    +---------------------------------------------------------------------------------------+
+    |                 PADDLE_CURRENT_ENDPOINT                192.168.0.1:17901              |
+    |                     PADDLE_TRAINERS_NUM                        2                      |
+    |                PADDLE_TRAINER_ENDPOINTS      192.168.0.1:17901,192.168.0.1:18846...   |
+    |                     FLAGS_selected_gpus                        0                      |
+    |                       PADDLE_TRAINER_ID                        0                      |
+    +=======================================================================================+
+
+...
+W0104 17:59:19.018365 43338 device_context.cc:342] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 9.2
+W0104 17:59:19.022523 43338 device_context.cc:352] device: 0, cuDNN Version: 7.4.
+W0104 17:59:23.193490 43338 fuse_all_reduce_op_pass.cc:78] Find all_reduce operators: 161. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 5.
+```
+
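 When a distributed job is launched with the paddle.distributed.launch module, all logs are saved in the ./log directory. The log files are named workerlog.xx, where xx is an integer; each GPU's training process has its own log file.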

+The --log_dir option can also be used to choose the directory where logs are saved. The following example saves the logs in ./my_log:
+
+```shell
+python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" --log_dir=./my_log train.py
+```
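To check each training process after launch, the per-process log files can be collected programmatically. The following is a minimal sketch: `find_worker_logs` is a hypothetical helper, and it assumes only the `workerlog.xx` naming described above, with `xx` an integer.

```python
import tempfile
from pathlib import Path


def find_worker_logs(log_dir="./log"):
    """Return the workerlog.<n> files in log_dir, sorted numerically by n."""
    logs = []
    for path in Path(log_dir).glob("workerlog.*"):
        suffix = path.name.split(".", 1)[1]
        if suffix.isdigit():  # skip anything that is not workerlog.<integer>
            logs.append((int(suffix), path))
    return [path for _, path in sorted(logs)]


# Demonstration on a temporary directory standing in for ./log or ./my_log.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("workerlog.10", "workerlog.0", "workerlog.1", "notes.txt"):
        (Path(tmp) / name).write_text("")
    names = [p.name for p in find_worker_logs(tmp)]

print(names)
```

Sorting by the integer suffix keeps `workerlog.10` after `workerlog.1`, which a plain string sort would not.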
+
+### 2.6 Develop the model inference program
+
+See Section 3.
+
 ## 3. [Multi-machine, multi-GPU inference development](#3)

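 In data-parallel training, every GPU holds a complete copy of the model, so saving the model from a single GPU is enough for inference. Typically, the model on the first GPU is the one saved.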
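
One way to save from the first GPU only is to guard the save call by trainer ID. The following is a minimal sketch, assuming the launcher sets the `PADDLE_TRAINER_ID` environment variable for each process, as the launch log above suggests, and that ID 0 identifies the first GPU's process. The helper names are hypothetical, and the save action is passed in as a callback so the sketch runs without Paddle installed.

```python
import os


def is_first_trainer(env=None):
    """Return True when this process is trainer 0 (assumed: the first GPU)."""
    env = os.environ if env is None else env
    # Default to "0" so a plain single-process run still saves the model.
    return int(env.get("PADDLE_TRAINER_ID", "0")) == 0


def save_on_first_trainer(save_fn, env=None):
    """Call save_fn only on trainer 0; return whether it was called."""
    if is_first_trainer(env):
        save_fn()
        return True
    return False


# Demonstration with a stand-in save function and explicit environments.
saved = []
print(save_on_first_trainer(lambda: saved.append("model"), env={"PADDLE_TRAINER_ID": "0"}))
print(save_on_first_trainer(lambda: saved.append("model"), env={"PADDLE_TRAINER_ID": "1"}))
```

In a real training script, `save_fn` might be something like `lambda: paddle.save(model.state_dict(), "model.pdparams")`, where the file name is only an example.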