Commit 49740db1 authored by mindspore-ci-bot, committed by Gitee

!511 add_restriction_for_parallel_optimizer

Merge pull request !511 from gziyan/add_restriction_for_parallel_optimizer
@@ -54,12 +54,12 @@
### Code Implementation
1. Collective communication
    - management.py: This file provides the `helper` function interfaces commonly used during collective communication, such as obtaining the number of devices in the cluster and the rank ID of a device. When running on Ascend chips, the framework loads the `libhccl.so` library from the environment and uses it to route communication calls from the Python layer to the underlying interfaces.
    - comm_ops.py: MindSpore wraps the supported collective communication operations as operators in this file, including `AllReduce`, `AllGather`, `ReduceScatter`, and `Broadcast`. Besides defining the attributes each operator needs, `PrimitiveWithInfer` also covers the inference of the output `shape` and `dtype` from the inputs during graph construction.
    - [management.py](https://gitee.com/mindspore/mindspore/blob/master/mindspore/communication/management.py): This file provides the `helper` function interfaces commonly used during collective communication, such as obtaining the number of devices in the cluster and the rank ID of a device. When running on Ascend chips, the framework loads the `libhccl.so` library from the environment and uses it to route communication calls from the Python layer to the underlying interfaces.
    - [comm_ops.py](https://gitee.com/mindspore/mindspore/blob/master/mindspore/ops/operations/comm_ops.py): MindSpore wraps the supported collective communication operations as operators in this file, including `AllReduce`, `AllGather`, `ReduceScatter`, and `Broadcast`. Besides defining the attributes each operator needs, `PrimitiveWithInfer` also covers the inference of the output `shape` and `dtype` from the inputs during graph construction. A short usage sketch follows this list.
2. Gradient aggregation
    - grad_reducer.py: This file implements the gradient aggregation process. The input `grads` are expanded with `HyperMap` and an `AllReduce` operator is inserted for each gradient, using the global communication group; users can also follow this module to implement custom aggregation for their own networks. In MindSpore, standalone and distributed execution share the same set of network wrapper interfaces; inside the `Cell`, `ParallelMode` determines whether gradients need to be aggregated. For the wrapper interfaces, refer to the `TrainOneStepCell` implementation.
    - [grad_reducer.py](https://gitee.com/mindspore/mindspore/blob/master/mindspore/nn/wrap/grad_reducer.py): This file implements the gradient aggregation process. The input `grads` are expanded with `HyperMap` and an `AllReduce` operator is inserted for each gradient, using the global communication group; users can also follow this module to implement custom aggregation for their own networks. In MindSpore, standalone and distributed execution share the same set of network wrapper interfaces; inside the `Cell`, `ParallelMode` determines whether gradients need to be aggregated. For the wrapper interfaces, refer to the `TrainOneStepCell` implementation; a gradient aggregation sketch follows this list as well.
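
Below is a minimal usage sketch of the collective communication interfaces described in item 1, assuming an Ascend environment whose HCCL rank table is already configured and one process launched per device; the 2x2 tensor and the `AllReduceNet` name are illustrative only, not taken from the tutorial itself.

```python
import numpy as np
from mindspore import Tensor, context, nn
from mindspore.communication.management import init, get_rank, get_group_size
from mindspore.ops import operations as P

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()                         # loads libhccl.so and joins the collective communication cluster
rank_id = get_rank()           # rank ID of this device in the global group
group_size = get_group_size()  # total number of devices in the cluster

class AllReduceNet(nn.Cell):
    """Wraps the AllReduce operator over the default (global) communication group."""
    def __init__(self):
        super(AllReduceNet, self).__init__()
        self.all_reduce = P.AllReduce()  # the default reduce operation is sum

    def construct(self, x):
        return self.all_reduce(x)

# Every device contributes its own tensor; each one receives the element-wise sum.
net = AllReduceNet()
output = net(Tensor(np.ones([2, 2]).astype(np.float32) * (rank_id + 1)))
```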
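
And a sketch of gradient aggregation with `DistributedGradReducer`, loosely mirroring what `TrainOneStepCell` does internally; the `SimpleTrainStep` cell, the `mean=True` choice, and the `GradOperation('grad', ...)` call (the older API form around the time of this commit) are simplifying assumptions, not a copy of the wrapper's actual code.

```python
from mindspore import nn, ParameterTuple
from mindspore.communication.management import get_group_size
from mindspore.nn.wrap.grad_reducer import DistributedGradReducer
from mindspore.ops import composite as C
from mindspore.ops import functional as F

class SimpleTrainStep(nn.Cell):
    """Simplified one-step training cell with distributed gradient aggregation."""
    def __init__(self, network, optimizer):
        super(SimpleTrainStep, self).__init__(auto_prefix=False)
        self.network = network
        self.optimizer = optimizer
        self.weights = ParameterTuple(network.trainable_params())
        self.grad = C.GradOperation('grad', get_by_list=True)
        # mean=True averages gradients across devices; degree is the cluster size
        self.grad_reducer = DistributedGradReducer(self.weights, mean=True,
                                                   degree=get_group_size())

    def construct(self, data, label):
        loss = self.network(data, label)                   # network already includes the loss
        grads = self.grad(self.network, self.weights)(data, label)
        grads = self.grad_reducer(grads)                   # AllReduce on every gradient in the tuple
        return F.depend(loss, self.optimizer(grads))
```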
### Other Parallelism
@@ -215,7 +215,7 @@ The `Momentum` optimizer is used as the parameter update tool. The definition is
- `parallel_mode`: distributed parallel mode. The default value is `ParallelMode.STAND_ALONE`. The options are `ParallelMode.DATA_PARALLEL` and `ParallelMode.AUTO_PARALLEL`.
- `parameter_broadcast`: whether to broadcast initialized parameters. The default value is `True` in `DATA_PARALLEL` and `HYBRID_PARALLEL` mode.
- `mirror_mean`: During backward computation, the framework collects gradients of parameters in data parallel mode across multiple hosts, obtains the global gradient value, and transfers the global gradient value to the optimizer for update. The default value is `False`, which indicates that the `allreduce_sum` operation is applied. The value `True` indicates that the `allreduce_mean` operation is applied.
- `enable_parallel_optimizer`: a feature under development. Whether to enable optimizer model parallelism, which improves performance by distributing the parameters to be updated across the workers and applying Broadcast among workers to share the updated parameters.
- `enable_parallel_optimizer`: a feature under development. Whether to enable optimizer model parallelism, which improves performance by distributing the parameters to be updated across the workers and applying Broadcast among workers to share the updated parameters. This feature can be used only in data parallel mode and when the number of parameters is larger than the number of devices.
> You are advised to set `device_num` and `global_rank` to their default values. The framework calls the HCCL API to obtain the values.
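
A minimal configuration sketch for the options above, assuming the `context` API around the time of this commit (for example, `mirror_mean` was later renamed in newer releases) and an Ascend target; the string form `"data_parallel"` corresponds to `ParallelMode.DATA_PARALLEL`.

```python
from mindspore import context
from mindspore.communication.management import init

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()  # device_num and global_rank are then obtained through HCCL, so their defaults are kept

context.set_auto_parallel_context(
    parallel_mode="data_parallel",     # i.e. ParallelMode.DATA_PARALLEL
    parameter_broadcast=True,          # broadcast initialized parameters
    mirror_mean=True,                  # apply allreduce_mean instead of allreduce_sum
    enable_parallel_optimizer=True)    # effective only in data parallel mode and when the
                                       # number of parameters exceeds the number of devices
```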
@@ -218,7 +218,7 @@ class SoftmaxCrossEntropyExpand(nn.Cell):
- `parallel_mode`: distributed parallel mode. The default is the standalone mode `ParallelMode.STAND_ALONE`. The options include the data parallel mode `ParallelMode.DATA_PARALLEL` and the auto parallel mode `ParallelMode.AUTO_PARALLEL`.
- `parameter_broadcast`: whether to broadcast initialized parameters. In `DATA_PARALLEL` and `HYBRID_PARALLEL` mode, the default value is `True`.
- `mirror_mean`: during backward computation, the framework collects the gradients of the data parallel parameters scattered across multiple hosts, obtains the global gradient value, and then passes it to the optimizer for update. The default value is `False`; setting it to `True` corresponds to the `allreduce_mean` operation, and `False` corresponds to the `allreduce_sum` operation.
- `enable_parallel_optimizer`: a feature under development. Whether to enable optimizer model parallelism, which improves performance by splitting the weights across devices so that each device updates its own slice and then synchronizes the result.
- `enable_parallel_optimizer`: a feature under development. Whether to enable optimizer model parallelism, which improves performance by splitting the weights across devices so that each device updates its own slice and then synchronizes the result. This feature is effective only in data parallel mode and when the number of parameters is larger than the number of devices.
> You are advised to keep the default values of `device_num` and `global_rank`; the framework obtains them by calling the HCCL interface.