diff --git a/tutorials/source_en/advanced_use/distributed_training_tutorials.rst b/tutorials/source_en/advanced_use/distributed_training_tutorials.rst index 723c85928fd1b4240ff078b5bb2566eb5be14cc5..4807338b07f88a8ef709abf861f1af9334deb256 100644 --- a/tutorials/source_en/advanced_use/distributed_training_tutorials.rst +++ b/tutorials/source_en/advanced_use/distributed_training_tutorials.rst @@ -19,3 +19,4 @@ MindSpore also provides the parallel distributed training function. It supports distributed_training_ascend host_device_training checkpoint_for_hybrid_parallel + parameter_server_training diff --git a/tutorials/source_en/advanced_use/parameter_server_training.md b/tutorials/source_en/advanced_use/parameter_server_training.md new file mode 100644 index 0000000000000000000000000000000000000000..2a72815057b8ea1a4fdd26aa44fea6fbc0b4b8b1 --- /dev/null +++ b/tutorials/source_en/advanced_use/parameter_server_training.md @@ -0,0 +1,132 @@ +# Parameter Server Training + + + +- [Parameter Server Training](#parameter-server-training) + - [Overview](#overview) + - [Preparations](#preparations) + - [Training Script Preparation](#training-script-preparation) + - [Parameter Setting](#parameter-setting) + - [Environment Variable Setting](#environment-variable-setting) + - [Training](#training) + + + + + +## Overview +A parameter server is a widely used architecture in distributed training. Compared with the synchronous AllReduce training method, a parameter server has better flexibility, scalability, and node failover capabilities. Specifically, the parameter server supports both synchronous and asynchronous SGD training algorithms. In terms of scalability, model computing and update are separately deployed in the worker and server processes, so that resources of the worker and server can be independently scaled out and in horizontally. In addition, in an environment of a large-scale data center, various failures often occur in a computing device, a network, and a storage device, and consequently some nodes are abnormal. However, in an architecture of a parameter server, such a failure can be relatively easily handled without affecting a training job. + +In the parameter server implementation of MindSpore, the open-source [ps-lite](https://github.com/dmlc/ps-lite) is used as the basic architecture. Based on the remote communication capability provided by the [ps-lite](https://github.com/dmlc/ps-lite) and abstract Push/Pull primitives, the distributed training algorithm of the synchronous SGD is implemented. In addition, with the high-performance Huawei collective communication library (HCCL) in Ascend, MindSpore also provides the hybrid training mode of parameter server and AllReduce. Some weights can be stored and updated through the parameter server, and other weights are still trained through the AllReduce algorithm. + +The ps-lite architecture consists of three independent components: server, worker, and scheduler. Their functions are as follows: + +- Server: saves model weights and backward computation gradients, and updates the model using gradients pushed by workers. (In the current version, only a single server is supported.) +- Worker: performs forward and backward computation on the network. The gradient value for forward computation is uploaded to a server through the `Push` API, and the model updated by the server is downloaded to the worker through the `Pull` API. + +- Scheduler: establishes the communication relationship between the server and worker. + +> The current version supports only the Ascend 910 AI Processor. The support for GPU platform is under development. + +## Preparations +The following describes how to use parameter server to train LeNet on Ascend 910: + +### Training Script Preparation + +Learn how to train a LeNet using the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) by referring to . + +### Parameter Setting + +In this training mode, you can use either of the following methods to control whether the training parameters are updated through the parameter server: + +- Use `mindspore.nn.Cell.set_param_ps()` to set all weight recursions of `nn.Cell`. +- Use `mindspore.common.Parameter.set_param_ps()` to set the weight. + +On the basis of the [original training script](https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/cv/lenet/train.py), set all LeNet model weights to be trained on the parameter server: +```python +network = LeNet5(cfg.num_classes) +network.set_param_ps() +``` + +### Environment Variable Setting + +MindSpore reads environment variables to control parameter server training. The environment variables include the following options (all scripts of `MS_SCHED_HOST` and `MS_SCHED_POST` must be consistent): + +``` +export PS_VERBOSE=1 # Print ps-lite log +export MS_SERVER_NUM=1 # Server number +export MS_WORKER_NUM=1 # Worker number +export MS_SCHED_HOST=XXX.XXX.XXX.XXX # Scheduler IP address +export MS_SCHED_PORT=XXXX # Scheduler port +export MS_ROLE=MS_SCHED # The role of this process: MS_SCHED represents the scheduler, MS_WORKER represents the worker, MS_PSERVER represents the Server +``` + +## Training + +1. Shell scripts + + Provide the shell scripts corresponding to the worker, server, and scheduler roles to start training: + + `Scheduler.sh`: + ```bash + #!/bin/bash + export PS_VERBOSE=1 + export MS_SERVER_NUM=1 + export MS_WORKER_NUM=1 + export MS_SCHED_HOST=XXX.XXX.XXX.XXX + export MS_SCHED_PORT=XXXX + export MS_ROLE=MS_SCHED + python train.py + ``` + + `Server.sh`: + ```bash + #!/bin/bash + export PS_VERBOSE=1 + export MS_SERVER_NUM=1 + export MS_WORKER_NUM=1 + export MS_SCHED_HOST=XXX.XXX.XXX.XXX + export MS_SCHED_PORT=XXXX + export MS_ROLE=MS_PSERVER + python train.py + ``` + + `Worker.sh`: + ```bash + #!/bin/bash + export PS_VERBOSE=1 + export MS_SERVER_NUM=1 + export MS_WORKER_NUM=1 + export MS_SCHED_HOST=XXX.XXX.XXX.XXX + export MS_SCHED_PORT=XXXX + export MS_ROLE=MS_WORKER + python train.py + ``` + + Run the following commands separately: + ```bash + sh Scheduler.sh > scheduler.log 2>&1 & + sh Server.sh > server.log 2>&1 & + sh Worker.sh > worker.log 2>&1 & + ``` + Start training. + +2. Viewing result + + Run the following command to view the communication logs between the server and worker in the `scheduler.log` file: + ``` + Bind to role=scheduler, id=1, ip=XXX.XXX.XXX.XXX, port=XXXX + Assign rank=8 to node role=server, ip=XXX.XXX.XXX.XXX, port=XXXX + Assign rank=9 to node role=worker, ip=XXX.XXX.XXX.XXX, port=XXXX + the scheduler is connected to 1 workers and 1 servers + ``` + The preceding information indicates that the communication between the server, worker, and scheduler is established successfully. + + Check the training result in the `worker.log` file: + ``` + epoch: 1 step: 1, loss is 2.302287 + epoch: 1 step: 2, loss is 2.304071 + epoch: 1 step: 3, loss is 2.308778 + epoch: 1 step: 4, loss is 2.301943 + ... + ```