Here, :math:`\eta_{l}` and :math:`\eta_{0}` are the learning rate of the current
round and the initial learning rate, respectively. :math:`F\left(\mathbf{x}_{t=l}\right)` and :math:`F\left(\mathbf{x}_{t=0}\right)` are
the current training loss and the initial loss of the first round, respectively.
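
These quantities drive the adaptive rule of [2]: as the loss and the learning rate decrease, workers synchronize more often. The sketch below illustrates one plausible square-root form of such a rule; the function name, arguments, and the exact formula are illustrative assumptions, not the framework's implementation.

.. code-block:: python

    import math

    def adaptive_local_sgd_steps(current_lr, current_loss,
                                 init_lr, init_loss, base_steps):
        """Hypothetical adaptive rule in the spirit of [2]: shrink the
        local-update interval as the loss and learning rate decrease,
        so that later rounds communicate more frequently."""
        ratio = (current_lr * current_loss) / (init_lr * init_loss)
        # Keep at least one local step per synchronization.
        return max(1, int(round(base_steps * math.sqrt(ratio))))

    # Example: the loss has fallen from 2.0 to 0.8 at the same learning
    # rate, starting from 4 local steps per synchronization.
    print(adaptive_local_sgd_steps(0.1, 0.8, 0.1, 2.0, 4))  # -> 3
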
User-defined Step Size for LocalSGD
-----------------------------------
The user-defined step size LocalSGD training mode has three main parameters:

.. csv-table::
   :header: "Option", "Type", "Value", "Description"

   ":code:`use_local_sgd`", "bool", "False/True", "Whether to enable LocalSGD. Default: False"
   ":code:`local_sgd_is_warm_steps`", "int", "Greater than 0", "Number of training steps to run before switching to LocalSGD"
   ":code:`local_sgd_steps`", "int", "Greater than 0", "Step size of LocalSGD, i.e. the number of local updates between synchronizations"

Notes:

- The LocalSGD warmup step count :code:`local_sgd_is_warm_steps` affects the generalization ability of the final model. In general, the model parameters should be allowed to stabilize under synchronous training before switching to LocalSGD. As a rule of thumb, use the epoch at which the learning rate is first decayed as the warmup step, and start LocalSGD training after that (a short sketch of this heuristic follows the list).
- The LocalSGD step size :code:`local_sgd_steps`: in general, the larger the value, the fewer the synchronizations and the faster the training, but model accuracy decreases accordingly. Empirical values are 2 or 4.
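
As an illustration of the warmup heuristic above, the sketch below converts the epoch of the first learning-rate decay into an iteration count, assuming the warmup parameter is counted in iterations rather than epochs; the names and numbers (:code:`first_decay_epoch`, :code:`steps_per_epoch`) are illustrative assumptions, not framework parameters.

.. code-block:: python

    # Illustrative values: the learning rate is first decayed at epoch 30,
    # and one epoch takes 1250 iterations.
    first_decay_epoch = 30
    steps_per_epoch = 1250

    # Train with fully synchronous SGD until the first decay, then switch
    # to LocalSGD for the remaining steps.
    local_sgd_is_warm_steps = first_decay_epoch * steps_per_epoch
    print(local_sgd_is_warm_steps)  # -> 37500
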
LocalSGD training is enabled by setting the above three parameters;
only a few additions to the original distributed training code are required:
**Setting Up Distribution Strategy**

Turn on the :code:`use_local_sgd` switch in the distributed training strategy.
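
A minimal sketch of this step is shown below. It assumes a Fleet collective-mode :code:`DistributedStrategy` object on which the three options documented above are plain attributes; the exact import path and attribute placement may differ between framework versions, so treat this as an illustration rather than the definitive API.

.. code-block:: python

    # Sketch only: the import path and attribute names are assumptions based
    # on the parameter names documented above; check your framework version.
    from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy

    dist_strategy = DistributedStrategy()
    dist_strategy.use_local_sgd = True           # enable LocalSGD
    dist_strategy.local_sgd_is_warm_steps = 30   # warmup before switching to LocalSGD
    dist_strategy.local_sgd_steps = 2            # local updates per synchronization

    # The strategy is then passed to the distributed optimizer as usual, e.g.:
    # optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
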
[1] Lin T, Stich S U, Patel K K, et al. Don't Use Large Mini-Batches, Use Local SGD[J]. arXiv preprint arXiv:1808.07217, 2018.
[2] Wang J, Joshi G. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD[J]. arXiv preprint arXiv:1810.08313, 2018.