add launch mp distributed job (#624)

* add launch mp distributed job
* fix comment
* update by comment

d0bd9522 · Yan Xu · Cheerego · a4b46eba · d0bd9522

Showing 1 changed file with 21 additions and 0 deletions

doc/fluid/user_guides/howto/training/cluster_howto.rst +21 -0
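
The data-sharding point in the documentation added by this change could be sketched as follows. This is only an illustration under stated assumptions: `shard` and `shard_for_this_process` are hypothetical helper names, not part of Paddle, and the file names are made up. The total number of processes is passed in explicitly because the text does not say whether `PADDLE_NUM_TRAINERS`, as seen inside a training process, counts nodes or processes.

```python
import os


def shard(items, rank, world_size):
    """Return the items assigned to process `rank` out of `world_size` processes."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Round-robin split: process k takes items k, k + world_size, k + 2 * world_size, ...
    # Every item lands in exactly one shard, so all shards together cover the dataset.
    return items[rank::world_size]


def shard_for_this_process(items, world_size):
    # PADDLE_TRAINER_ID is the per-process index described in the diff.
    # Assumption: defaulting to 0 so the script also runs without the launcher.
    rank = int(os.environ.get("PADDLE_TRAINER_ID", "0"))
    return shard(items, rank, world_size)


if __name__ == "__main__":
    files = ["part-%05d" % i for i in range(10)]  # hypothetical file names
    print(shard_for_this_process(files, world_size=4))
```

A round-robin split is used here because it needs no coordination between processes; a contiguous split would work equally well as long as the shards are disjoint and cover the whole dataset.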

--- a/doc/fluid/user_guides/howto/training/cluster_howto.rst
+++ b/doc/fluid/user_guides/howto/training/cluster_howto.rst
@@ -218,6 +218,27 @@ For distributed training in NCCL2 mode, since there is no parameter server role, it is trainer
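 Distributed training with NCCL2 currently supports only synchronous training. It is best suited to large models that
 need synchronous, GPU-based training; with hardware that supports RDMA and GPU Direct, it can reach very high distributed training performance.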

+Launching NCCL2 distributed training jobs in multi-process mode
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
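+In most cases, launching an NCCL2 distributed training job in multi-process mode gives better training performance. Paddle provides the
+:code:`paddle.distributed.launch` module for conveniently launching multi-process jobs; once launched, each training process uses its own GPU device.
+Note the following when using it: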
+
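+* Number of nodes: set the number of nodes in the job through the environment variable :code:`PADDLE_NUM_TRAINERS`; this variable is also set in every training process.
+* Devices per node: the launch argument :code:`--gpus` sets the number of GPU devices on each node; the index of each process is set automatically in the environment variable
+  :code:`PADDLE_TRAINER_ID`.
+* Data sharding: multi-process mode runs one process per device, so in general each process should handle one part of the training data, and together the processes must cover the complete dataset.
+* Entry point: the entry-point file is the training script that is actually launched.
+* Logs: the log of each training process is saved under :code:`./mylog` by default; you can also specify a directory with the :code:`--log_dir` argument.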
+
+Launch example:
+
+.. code-block:: bash
+
+    > PADDLE_NUM_TRAINERS=<TRAINER_COUNT> python -m paddle.distributed.launch --gpus <NUM_GPUS_ON_HOSTS> <ENTRYPOINT_SCRIPT> --arg1 --arg2 ...
+
+
 Notes on NCCL2 distributed training
 ++++++++++++++++++++++++++++++++++++