diff --git a/tutorials/tipc/train_fleet_infer_python/images/data_parallel.png b/tutorials/tipc/train_fleet_infer_python/images/data_parallel.png
index 3bbf67714e53458c9bd70bd75e9391b7f91f126c..0a0f75654d65fac70150f06b585874d232f58ce9 100644
Binary files a/tutorials/tipc/train_fleet_infer_python/images/data_parallel.png and b/tutorials/tipc/train_fleet_infer_python/images/data_parallel.png differ
diff --git a/tutorials/tipc/train_fleet_infer_python/train_fleet_infer_python.md b/tutorials/tipc/train_fleet_infer_python/train_fleet_infer_python.md
index b332350ef3a8a125e9a8e852bd26194bd2c6432f..9852af6cad5fa7369d8d3f597ac407a8f9b7e428 100644
--- a/tutorials/tipc/train_fleet_infer_python/train_fleet_infer_python.md
+++ b/tutorials/tipc/train_fleet_infer_python/train_fleet_infer_python.md
@@ -28,92 +28,118 @@
-As shown in the figure above, compared with ordinary single-machine single-card model training, code using Paddle distributed training only needs three additional parts:
-
-1. Import the dependencies needed for distributed training
-
-2. Initialize the distributed environment
-
-3. Wrap the user's network with DataParallel
-
-Each part is explained below.
-
-- 2.1 Import dependencies
-
-  ```python
-  from paddle.distributed as dist
-  ```
-
-- 2.2 Initialize the distributed environment
-
-  ```python
-  dist.init_parallel_env()
-  ```
-
-- 2.3 Wrap the user's network with DataParallel
-
-  ```python
-  model = paddle.DataParallel(model)
-  ```
-
-Assuming the user's training script is named train.py, we now describe how to launch a distributed training job.
-
-1. Launching a single-machine multi-card job
-
-   With a single machine and multiple cards, a distributed training job can be launched with the following command:
-
-   ```shell
-   python -m paddle.distributed.launch --gpus="0,1" train.py
-   ```
-
-   Here, the ``--gpus`` option specifies the GPU cards used for distributed training.
-
-2. Launching a multi-machine multi-card job
-
-   Taking two machines as an example, we describe how to launch a multi-machine multi-card distributed training job. Suppose the IP addresses of the two machines are 192.168.0.1 and 192.168.0.2.
-
-   First, make sure the two machines can reach each other over the network, which can be verified with the ``ping`` command:
-
-   ```shell
-   # On the machine with IP address 192.168.0.1
-   ping 192.168.0.2
-   ```
-
-   Next, launch the distributed job on each of the two machines:
-
-   ```shell
-   # On the machine with IP address 192.168.0.1
-   python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
-   ```
-
-   ```shell
-   # On the machine with IP address 192.168.0.2
-   python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
-   ```
-
-   After running the commands above, the console prints output similar to the following:
-
-   ```shell
-   WARNING 2021-01-04 17:59:08,725 launch.py:314] Not found distinct arguments and compiled with cuda. Default use collective mode
-   launch train in GPU mode
-   INFO 2021-01-04 17:59:08,727 launch_utils.py:472] Local start 2 processes. First process distributed environment info (Only For Debug):
-       +=======================================================================================+
-       |                     Distributed Envs                   Value                          |
-       +---------------------------------------------------------------------------------------+
-       | PADDLE_CURRENT_ENDPOINT                 127.0.0.1:17901                               |
-       | PADDLE_TRAINERS_NUM                     2                                             |
-       | PADDLE_TRAINER_ENDPOINTS                127.0.0.1:17901,127.0.0.1:18846               |
-       | FLAGS_selected_gpus                     0                                             |
-       | PADDLE_TRAINER_ID                       0                                             |
-       +=======================================================================================+
-
-   ...
-   W0104 17:59:19.018365 43338 device_context.cc:342] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 9.2
-   W0104 17:59:19.022523 43338 device_context.cc:352] device: 0, cuDNN Version: 7.4.
-   W0104 17:59:23.193490 43338 fuse_all_reduce_op_pass.cc:78] Find all_reduce operators: 161. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 5.
-   ```
+As shown in the figure above, compared with ordinary single-machine single-card model training, using Paddle distributed training only requires the following steps:
+
+1. Prepare the multi-machine multi-card environment
+
+2. Import dependencies
+
+3. Initialize the distributed environment
+
+4. Wrap the model with the data-parallel operator
+
+5. Verify that the multi-machine multi-card program runs correctly
+
+6. Develop the model inference program
+
+Each step is explained below.
+
+### 2.1 Prepare the multi-machine multi-card environment
+
+Prepare two machines, each with at least two GPU cards. Suppose the IP addresses of the two machines are 192.168.0.1 and 192.168.0.2. Then verify that the two machines can reach each other over the network with the ping command:
+
+```shell
+# On the machine with IP address 192.168.0.1
+ping 192.168.0.2
+```
+
+If the console prints output similar to the following, the two machines can reach each other.
+
+```shell
+PING 192.168.0.2 (192.168.0.2): 56 data bytes
+64 bytes from 192.168.0.2: icmp_seq=0 ttl=64 time=0.090 ms
+64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.111 ms
+64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.094 ms
+64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.089 ms
+```
+
+Otherwise, if the output looks like the following, the two machines cannot reach each other; please consult your network administrator.
+
+```shell
+PING 192.168.0.2 (192.168.0.2): 56 data bytes
+Request timeout for icmp_seq 0
+Request timeout for icmp_seq 1
+```
+
+### 2.2 Import dependencies
+
+```python
+import paddle
+import paddle.distributed as dist
+```
+
+### 2.3 Initialize the distributed environment
+
+```python
+dist.init_parallel_env()
+```
+
+### 2.4 Wrap the model with the data-parallel operator
+
+```python
+model = paddle.DataParallel(model)
+```
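+
+To see how the snippets in Sections 2.2–2.4 fit together, below is a minimal sketch of a hypothetical `train.py`. The network, data, and hyper-parameters are illustrative placeholders, not part of this tutorial's model:
+
+```python
+import paddle
+import paddle.nn as nn
+import paddle.distributed as dist
+
+
+def train():
+    # Step 2.3: initialize the distributed environment.
+    dist.init_parallel_env()
+
+    # Step 2.4: build a placeholder network and wrap it for data parallelism.
+    model = nn.Linear(10, 1)
+    model = paddle.DataParallel(model)
+    opt = paddle.optimizer.SGD(learning_rate=0.01,
+                               parameters=model.parameters())
+
+    # An ordinary training loop: gradients are synchronized across
+    # cards automatically during backward().
+    for step in range(10):
+        x = paddle.randn([8, 10])
+        y = paddle.randn([8, 1])
+        loss = nn.functional.mse_loss(model(x), y)
+        loss.backward()
+        opt.step()
+        opt.clear_grad()
+
+
+if __name__ == "__main__":
+    train()
+```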
+### 2.5 Verify that the multi-machine multi-card program runs correctly
+
+Assuming the user's training script is named train.py, we now describe how to launch the distributed training job and verify that the program runs correctly.
+
+Launch the distributed job on each of the two machines:
+
+```shell
+# On the machine with IP address 192.168.0.1
+python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
+```
+
+```shell
+# On the machine with IP address 192.168.0.2
+python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
+```
+
+After running the commands above, the console prints output similar to the following:
+
+```shell
+WARNING 2021-01-04 17:59:08,725 launch.py:314] Not found distinct arguments and compiled with cuda. Default use collective mode
+launch train in GPU mode
+INFO 2021-01-04 17:59:08,727 launch_utils.py:472] Local start 2 processes. First process distributed environment info (Only For Debug):
+    +=======================================================================================+
+    |                     Distributed Envs                   Value                          |
+    +---------------------------------------------------------------------------------------+
+    | PADDLE_CURRENT_ENDPOINT                 192.168.0.1:17901                             |
+    | PADDLE_TRAINERS_NUM                     4                                             |
+    | PADDLE_TRAINER_ENDPOINTS                192.168.0.1:17901,192.168.0.1:18846...        |
+    | FLAGS_selected_gpus                     0                                             |
+    | PADDLE_TRAINER_ID                       0                                             |
+    +=======================================================================================+
+
+...
+W0104 17:59:19.018365 43338 device_context.cc:342] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 9.2
+W0104 17:59:19.022523 43338 device_context.cc:352] device: 0, cuDNN Version: 7.4.
+W0104 17:59:23.193490 43338 fuse_all_reduce_op_pass.cc:78] Find all_reduce operators: 161. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 5.
+```
+
 When a distributed job is launched with the paddle.distributed.launch module, all logs are saved in the ./log directory, with file names of the form workerlog.xx, where xx is an integer; each card's training process has its own log file.
+Users can also specify the log directory with the --log_dir option; for example, the following command saves the logs in the ./my_log directory:
+
+```shell
+python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" --log_dir=./my_log train.py
+```
+
+### 2.6 Develop the model inference program
+
+Please refer to Section 3.
+
 ## 3. [Developing multi-machine multi-card inference](#3)
 
 Because under data-parallel training every card holds a complete replica of the model, it suffices to save the model from a single card for inference. Typically, the model on the first card is saved for inference.
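+
+As a minimal sketch of this (assuming `model` is the `paddle.DataParallel`-wrapped network from Section 2, and `model.pdparams` is just an example file name), the save can be guarded by the process rank:
+
+```python
+import paddle
+import paddle.distributed as dist
+
+# Every rank holds an identical model replica under data parallelism,
+# so only the first card (rank 0) needs to save the parameters.
+if dist.get_rank() == 0:
+    paddle.save(model.state_dict(), "model.pdparams")
+```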