Update nccl2 tips (#430)

* update nccl2 dist train tips
* update
* update
* update
* update
* follow comments

468c82a9 · Wu Yi · Cheerego · 928eb2a6 · 468c82a9

Showing 1 changed file with 9 additions and 2 deletions

doc/fluid/user_guides/howto/training/cluster_howto.rst +9 -2

--- a/doc/fluid/user_guides/howto/training/cluster_howto.rst
+++ b/doc/fluid/user_guides/howto/training/cluster_howto.rst
@@ -218,8 +218,15 @@ Distributed training in NCCL2 mode, having no parameter server role, is trainer
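 Distributed training with NCCL2 currently supports only synchronous training. NCCL2 mode is better suited to large models that need
 synchronous training on GPUs. If the hardware supports RDMA and GPU Direct, it can reach very high distributed training performance.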

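-Note that if the system has multiple network devices, you must manually specify the device that NCCL2 uses.
-Assuming :code:`eth2` is to be used as the communication device, set the following environment variable: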
+Notes on NCCL2 Distributed Training
++++++++++++++++++++++++++++++++++++
+
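+**Note:** When training in NCCL2 distributed mode, make sure every node trains on the same amount of data, so that the job does not fail to exit during the final round of training. There are two common ways to do this:
+
+- Randomly sample some data to pad the nodes that were assigned less data. (This approach is recommended, because the full dataset is still trained.)
+- In the Python code, have each node train only a fixed number of batches per pass; if a node has more data, the extra data is not trained.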
+
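+**Note:** If the system has multiple network devices, you must manually specify the device that NCCL2 uses. Assuming :code:`eth2` is to be used as the communication device, set the following environment variable: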

 .. code-block:: bash