Unverified commit df339c08, authored by gongweibao, committed by GitHub

Fix mistakes in dgc document. (#16731)

Parent b07584dc
@@ -628,16 +628,16 @@ class DGCMomentumOptimizer(MomentumOptimizer):
Original paper is https://arxiv.org/abs/1712.01887
DGC reduces the communication bandwidth by sending only the important gradients (sparse update):\
only gradients larger than a threshold are transmitted.
To avoid losing information, DGC accumulates the rest of the gradients locally.
Eventually, these gradients become large enough to be transmitted.
Thus, DGC sends the large gradients immediately but eventually sends all of the gradients over time.
To ensure no loss of accuracy, DGC employs momentum correction and local gradient clipping on top of the gradient sparsification to maintain model performance.
DGC also uses momentum factor masking and warmup training to overcome the staleness problem caused by reduced communication.
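As an illustration of the sparsification idea described above (not the Paddle implementation; momentum correction and gradient clipping are omitted), a minimal sketch of top-k selection with local accumulation might look like:

.. code-block:: python

    import numpy as np

    def sparsify_with_accumulation(grad, residual, sparsity):
        # Add the locally accumulated (previously unsent) gradients.
        acc = residual + grad
        # Keep only the top (1 - sparsity) fraction of entries by magnitude.
        k = max(1, int(acc.size * (1.0 - sparsity)))
        threshold = np.partition(np.abs(acc).ravel(), -k)[-k]
        mask = np.abs(acc) >= threshold
        sent = np.where(mask, acc, 0.0)       # transmitted (sparse) update
        residual = np.where(mask, 0.0, acc)   # kept locally for later steps
        return sent, residual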
@@ -652,7 +652,7 @@ class DGCMomentumOptimizer(MomentumOptimizer):
learning_rate (float|Variable): the learning rate used to update parameters. \
Can be a float value or a Variable with one float value as data element.
momentum (float): Momentum factor.
rampup_begin_step (int): The beginning step from which gradient compression is implemented.
rampup_step (int): The number of steps used in the sparsity ramp-up period (see the sketch after this parameter list). Default is 1.
For example: If the sparsity is [0.75, 0.9375, 0.984375, 0.996, 0.999], and the rampup_step is 5, \
it will use 0.75 at step 0, 0.9375 at step 1, and so on. When the end of the sparsity array is reached, \
@@ -660,9 +660,9 @@ class DGCMomentumOptimizer(MomentumOptimizer):
sparsity (list[float]): Get the top important elements from the gradient tensor; the ratio kept is (1 - current sparsity).
use_nesterov (bool): Enables Nesterov momentum. True means use Nesterov momentum.
local_grad_clip_norm (float): Clip norm value if needed.
num_trainers: The number of training nodes.
regularization: A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
name: An optional name prefix.
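A rough reading of the ramp-up schedule described by rampup_begin_step, rampup_step and sparsity above; the indexing here is an assumption for illustration, not the actual implementation:

.. code-block:: python

    def current_sparsity(step, rampup_begin_step, rampup_step, sparsity):
        # Before rampup_begin_step, no compression is applied.
        if step < rampup_begin_step:
            return 0.0
        # Spread the sparsity values evenly over rampup_step steps,
        # then stay at the last (highest) value.
        interval = max(1, rampup_step // len(sparsity))
        idx = min((step - rampup_begin_step) // interval, len(sparsity) - 1)
        return sparsity[idx]

    # With sparsity=[0.75, 0.9375, 0.984375, 0.996, 0.999] and rampup_step=5,
    # steps 0, 1, ... after rampup_begin_step use 0.75, 0.9375, ... in turn.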
Examples:
.. code-block:: python
......
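The example body is collapsed in this diff view ("......" above); a minimal usage sketch, assuming the constructor parameters documented in this docstring (argument values are illustrative only):

.. code-block:: python

    import paddle.fluid as fluid

    # Illustrative values only; see the parameter descriptions above.
    optimizer = fluid.optimizer.DGCMomentumOptimizer(
        learning_rate=0.0001,
        momentum=0.9,
        rampup_begin_step=1252,
        rampup_step=5,
        sparsity=[0.75, 0.9375, 0.984375, 0.996, 0.999])
    # Afterwards it is used like any other optimizer, e.g. optimizer.minimize(loss).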