Commit 7c4869f5 authored by: Z ZPaC

Update parameter-server for r0.7

Parent 6c513001
# Parameter Server Training
`Ascend` `Model Training` `Intermediate` `Expert`
`Ascend` `GPU` `Model Training` `Intermediate` `Expert`
<!-- TOC -->
......@@ -19,16 +19,16 @@
## Overview
A parameter server is a widely used architecture in distributed training. Compared with the synchronous AllReduce training method, a parameter server offers better flexibility, scalability, and node failover capabilities. Specifically, the parameter server supports both synchronous and asynchronous SGD training algorithms. In terms of scalability, model computation and model update are deployed separately in the worker and server processes, so that worker and server resources can be scaled out and in horizontally and independently. In addition, in a large-scale data center environment, failures of computing devices, the network, and storage frequently cause some nodes to become abnormal; under a parameter server architecture, such failures can be handled relatively easily without affecting the training job.
In the parameter server implementation of MindSpore, the open-source [ps-lite](https://github.com/dmlc/ps-lite) is used as the basic architecture. Based on the remote communication capability and the abstract Push/Pull primitives provided by [ps-lite](https://github.com/dmlc/ps-lite), the distributed training algorithm of synchronous SGD is implemented. In addition, with the high-performance Huawei Collective Communication Library (HCCL) on Ascend, MindSpore also provides a hybrid training mode of parameter server and AllReduce: some weights can be stored and updated through the parameter server, while the other weights are still trained through the AllReduce algorithm.
In the parameter server implementation of MindSpore, the open-source [ps-lite](https://github.com/dmlc/ps-lite) is used as the basic architecture. Based on the remote communication capability and the abstract Push/Pull primitives provided by [ps-lite](https://github.com/dmlc/ps-lite), the distributed training algorithm of synchronous SGD is implemented. In addition, with the high-performance collective communication libraries on Ascend and GPU (HCCL and NCCL), MindSpore also provides a hybrid training mode of parameter server and AllReduce: some weights can be stored and updated through the parameter server, while the other weights are still trained through the AllReduce algorithm.
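The hybrid mode can be selected per weight. The snippet below is a minimal sketch of how that selection might look, assuming the API referenced later in this tutorial in which `Cell.set_param_ps()` marks all weights of a network for parameter-server update, and assuming an analogous `Parameter.set_param_ps()` for a single weight; `TinyNet` and the selection rule are purely illustrative.
```python
import mindspore.nn as nn
from mindspore import context

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

class TinyNet(nn.Cell):
    """A small stand-in network used only to illustrate the API."""
    def __init__(self):
        super(TinyNet, self).__init__()
        self.fc1 = nn.Dense(32, 16)
        self.fc2 = nn.Dense(16, 10)

    def construct(self, x):
        return self.fc2(self.fc1(x))

network = TinyNet()

# Either store and update every trainable weight through the parameter server ...
# network.set_param_ps()

# ... or, for the hybrid mode, mark only selected weights for the parameter
# server and keep training the remaining weights through AllReduce.
for param in network.trainable_params():
    if "fc2" in param.name:        # illustrative selection rule (assumption)
        param.set_param_ps()
```
A typical reason to use the hybrid mode is to place large, sparsely updated weights such as embedding tables on the parameter server while the dense weights stay on the AllReduce path.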
The ps-lite architecture consists of three independent components: server, worker, and scheduler. Their functions are as follows:
- Server: saves model weights and backward computation gradients, and updates the model using gradients pushed by workers. (In the current version, only a single server is supported.)
- Server: saves model weights and backward computation gradients, and updates the model using gradients pushed by workers.
- Worker: performs forward and backward computation on the network. The gradients obtained from the backward computation are uploaded to the server through the `Push` API, and the model updated by the server is downloaded to the worker through the `Pull` API.
- Scheduler: establishes the communication relationship between the server and worker.
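The three roles normally run as separate processes that are distinguished only by environment variables. The launcher below is a minimal sketch of that setup: the `MS_ROLE` variable with the values `MS_SCHED`, `MS_PSERVER`, and `MS_WORKER`, as well as `MS_SERVER_NUM`, `MS_WORKER_NUM`, and the `train.py` script, are assumptions made for illustration, while `MS_SCHED_HOST`, `MS_SCHED_PORT`, and `PS_VERBOSE` come from the environment-variable section later in this tutorial.
```python
import os
import subprocess

# Variables that every process must share; MS_SCHED_HOST and MS_SCHED_PORT in
# particular have to be identical across all scripts.
COMMON_ENV = {
    "MS_SCHED_HOST": "127.0.0.1",
    "MS_SCHED_PORT": "8081",
    "MS_SERVER_NUM": "1",
    "MS_WORKER_NUM": "1",
    "PS_VERBOSE": "1",          # print ps-lite logs
}

def launch(role):
    """Start one process for the given role: scheduler, server, or worker."""
    env = dict(os.environ, **COMMON_ENV, MS_ROLE=role)
    return subprocess.Popen(["python", "train.py"], env=env)

if __name__ == "__main__":
    processes = [launch("MS_SCHED"), launch("MS_PSERVER"), launch("MS_WORKER")]
    for proc in processes:
        proc.wait()
```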
> The current version supports only the Ascend 910 AI Processor. Support for the GPU platform is under development.
## Preparations
The following describes how to use a parameter server to train LeNet on Ascend 910:
......
# Parameter Server Training
`Ascend` `Model Training` `Intermediate` `Expert`
`Ascend` `GPU` `Model Training` `Intermediate` `Expert`
<!-- TOC -->
......@@ -19,18 +19,16 @@
## Overview
A parameter server is a widely used architecture in distributed training. Compared with the synchronous AllReduce training method, a parameter server offers better flexibility, scalability, and node failover capabilities. Specifically, the parameter server supports both synchronous and asynchronous SGD training algorithms. In terms of scalability, model computation and model update are deployed separately in the worker and server processes, so that worker and server resources can be scaled out and in horizontally and independently. In addition, in a large-scale data center environment, failures of computing devices, the network, and storage frequently cause some nodes to become abnormal; under a parameter server architecture, such failures can be handled relatively easily without affecting the training job.
In the parameter server implementation of MindSpore, the open-source [ps-lite](https://github.com/dmlc/ps-lite) is used as the basic architecture. Based on the remote communication capability and the abstract Push/Pull primitives it provides, the distributed training algorithm of synchronous SGD is implemented. In addition, combined with the high-performance collective communication library HCCL on Ascend, MindSpore also provides a hybrid training mode of parameter server and AllReduce: some weights can be stored and updated through the parameter server, while the remaining weights are still trained through the AllReduce algorithm.
In the parameter server implementation of MindSpore, the open-source [ps-lite](https://github.com/dmlc/ps-lite) is used as the basic architecture. Based on the remote communication capability and the abstract Push/Pull primitives it provides, the distributed training algorithm of synchronous SGD is implemented. In addition, combined with the high-performance collective communication libraries on Ascend and GPU (HCCL and NCCL), MindSpore also provides a hybrid training mode of parameter server and AllReduce: some weights can be stored and updated through the parameter server, while the remaining weights are still trained through the AllReduce algorithm.
The ps-lite architecture consists of three independent components: Server, Worker, and Scheduler. Their functions are as follows:
- Server: saves the model weights and the gradients from backward computation, and uses the optimizer to update the model with the gradients uploaded by the workers. (The current version supports only a single server.)
- Server: saves the model weights and the gradients from backward computation, and uses the optimizer to update the model with the gradients uploaded by the workers.
- Worker: performs forward and backward computation on the network. The gradients obtained from the backward computation are uploaded to the server through the `Push` API, and the model updated by the server is downloaded to the worker through the `Pull` API.
- Scheduler: establishes the communication relationship between the server and worker.
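As a purely conceptual illustration of the Push/Pull interaction described above, the toy, framework-free sketch below lets two workers push gradients that a server averages and applies as one synchronous SGD step, after which the workers pull the refreshed weights. None of the class or variable names correspond to the ps-lite or MindSpore implementation.
```python
import numpy as np

class ToyServer:
    """Holds the weights and applies a synchronous SGD update with pushed gradients."""
    def __init__(self, weight, lr=0.1):
        self.weight = weight
        self.lr = lr

    def push(self, gradients):
        # Average the gradients uploaded by all workers, then update the model.
        self.weight -= self.lr * np.mean(gradients, axis=0)

    def pull(self):
        # Hand the latest model back to the workers.
        return self.weight.copy()

class ToyWorker:
    """Computes a gradient locally on its own shard of the data."""
    def __init__(self, data, target):
        self.data, self.target = data, target

    def compute_grad(self, weight):
        # Gradient of the mean-squared-error objective 0.5/n * ||X w - y||^2.
        return self.data.T @ (self.data @ weight - self.target) / len(self.target)

rng = np.random.default_rng(0)
server = ToyServer(weight=np.zeros(3))
workers = [ToyWorker(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(2)]

for _ in range(100):                        # synchronous training rounds
    weight = server.pull()                  # Pull: download the latest model
    grads = [w.compute_grad(weight) for w in workers]
    server.push(grads)                      # Push: upload gradients to the server
```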
> The current version supports only the Ascend 910 AI Processor. Support for the GPU platform is under development.
## Preparations
The following uses LeNet trained with a parameter server on Ascend 910 as an example:
......@@ -53,7 +51,7 @@ network.set_param_ps()
### Setting Environment Variables
Mindspore reads environment variables to control parameter server training. The environment variables include the following options (MS_SCHED_HOST and MS_SCHED_PORT must be kept consistent across all scripts):
MindSpore reads environment variables to control parameter server training. The environment variables include the following options (MS_SCHED_HOST and MS_SCHED_PORT must be kept consistent across all scripts):
```
export PS_VERBOSE=1 # Print ps-lite log
......