Commit a795de42 authored by liqingping

Merge branch 'docs/arch' into 'develop'

docs: add architecture

See merge request platform/CloudNative4AI/cluster-lifecycle/nervex-operator!28
......@@ -4,7 +4,7 @@
Refers to [developer-guide](./docs/developer-guide.md)
## user guide
Refers to [user-guide](./docs/architecture.md)
### prerequisites
- a well-prepared Kubernetes cluster. Follow the [instructions](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/) to create a Kubernetes cluster, or create a local Kubernetes node with [kind](https://kind.sigs.k8s.io/docs/user/quick-start/) or [minikube](https://minikube.sigs.k8s.io/docs/start/)
- cert-manager. Install it on Kubernetes as described in the [cert-manager docs](https://cert-manager.io/docs/installation/kubernetes/), or install it with the following command.
......
# NerveX Orchestrator architecture
The nerveX framework consists of three important modules, namely coordinator, collector, and learner. In general, a nerveX training job has only one coordinator, while the number of learners and collectors can vary. The roles of the three modules are:
- Coordinator. Maintains connections with collectors and learners, handles their requests to fetch or push meta-info, and issues tasks to them.
- Collector. Requests the storage path of the RL model from the coordinator, loads the model from the storage middleware, and then generates data frames by stepping through its environments according to the model's decisions. It stores the data frames back into the storage middleware and reports their meta-info (storage path, size, etc.) to the coordinator.
- Learner. Requests the data frames' storage path from the coordinator and loads them from the storage middleware to train the RL model. After training completes, it stores the model into the storage middleware and reports the model's meta-info (storage path, size, etc.) to the coordinator. Because the learner side often adds an extra distributed mechanism, data parallel training, we distinguish two notions to avoid confusion: the module that interacts with the coordinator is called the logic learner, which is the basic unit to which the coordinator issues tasks; a single learner process in data parallel training is called a ddp learner, and multiple ddp learner processes together provide the data parallel service. One logic learner can correspond to one ddp learner (single GPU) or multiple ddp learners (multiple GPUs). In addition, data parallel training requires an extra aggregator module: the aggregator summarizes the training results of multiple ddp learners and sends them to the coordinator. That is, the aggregator and its ddp learners together form a logic learner, and the coordinator only interacts with logic learners.
For a detailed introduction to nerveX, please refer to the [nerveX developer tutorial](http://open-xlab.pages.gitlab.bj.sensetime.com/cell/nerveX/tutorial_dev/index.html).
To support running nerveX in Kubernetes (K8s), we designed `NerveX Orchestrator`. This article explains how, with NerveX Orchestrator, each module of nerveX is created on K8s, how the modules discover each other, how training starts, and so on. The architecture of NerveX Orchestrator is shown in the figure below:
![](images/nervex-arch.png)
There are two main modules: `nervex-server` and `nervex-operator`.
`DDPL` denotes a ddp learner, `Lm` a logic learner, `Cn` a collector, and `Aggregator+DDPL` together form a logic learner. In the following sections, we first describe how `NerveX Orchestrator` creates and starts each module of nerveX (each running as a [pod](https://kubernetes.io/docs/concepts/workloads/pods/) in K8s) after a nerveX job is submitted, and then introduce `nervex-server` and `nervex-operator`.
## Job creation process
Here is a description of the job creation process, illustrating the entire life cycle of a nerveX job in K8s from creation to completion.
- Edit the AggregatorConfig yaml file to define the aggregator template, which will later be used to create aggregators when a NerveXJob is created. Aggregators provide data parallel training services.
- Edit the NerveXJob yaml file to define the templates of coordinator, collector, and learner, and submit it to the K8s cluster.
- When nervex-operator observes the NerveXJob submission event, it creates the coordinator and an accessible domain name for it.
- Once the coordinator starts, it asks nervex-server to create a certain number of collectors and learners according to the coordinator's default configuration.
- After nervex-server receives the coordinator's creation request, it reads the collector and learner templates from the NerveXJob object, creates the corresponding number of collectors (Cn in the figure above) and learners (Lm in the figure above), and returns the URLs at which they can be accessed. At the same time, nervex-server decides whether to create an aggregator for each learner based on the number of GPUs the learner requests: when a learner requests more than one GPU, an aggregator is created for it; otherwise none is created.
- The coordinator waits for collectors and learners (an aggregator and its ddp learners are regarded as one logic learner) to connect, and then starts issuing tasks to begin training.
- Users can manually send requests to nervex-server to add or delete collectors and learners; the coordinator periodically queries the number of available collectors and learners and decides to establish new connections or disconnect existing ones.
- When training completes, nervex-operator deletes all collectors, learners, and aggregators by default, while the coordinator is kept so users can view logs and perform other operations.
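To make the flow above concrete, a minimal NerveXJob manifest might look like the sketch below. The `apiVersion`/`kind` values and the job name are assumptions for illustration; the field names follow the `NerveXJobSpec` definition shown later, and the image tag is taken from the sample configs in this merge request.

```yaml
# Hypothetical NerveXJob manifest; apiVersion/kind and the metadata name
# are illustrative assumptions -- consult the project's sample configs.
apiVersion: nervex.sensetime.com/v1alpha1
kind: NerveXJob
metadata:
  name: nervexjob-example
spec:
  cleanPodPolicy: Running
  coordinator:
    template:
      spec:
        containers:
        - name: coordinator
          image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.4-torch1.4-cuda10.1-cudnn7-devel-987bacf7
  collector:
    template:
      spec:
        containers:
        - name: collector
          image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.4-torch1.4-cuda10.1-cudnn7-devel-987bacf7
  learner:
    template:
      spec:
        containers:
        - name: learner
          image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.4-torch1.4-cuda10.1-cudnn7-devel-987bacf7
```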
## NerveX Operator
Nervex-operator is a component responsible for orchestrating NerveXJobs in K8s. Following the K8s [operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), it watches the status of NerveXJob objects in the K8s cluster through the control loop of the [controller pattern](https://kubernetes.io/docs/concepts/architecture/controller/) and updates their status when necessary, keeping the actual status of each NerveXJob as consistent as possible with the predefined desired status.
### API definition
According to the characteristics of each nerveX module, we define two Custom Resources, namely NerveXJob and AggregatorConfig. The former defines the prerequisites for the coordinator, collector, and learner of an RL job to run, including docker images, startup commands, computing and storage resources, environment variables, etc. The latter defines the prerequisites for the aggregator of an RL job.
The NerveXJob definition is as follows:
```go
type NerveXJobSpec struct {
	// Group is a collection of NerveXJobs
	Group string `json:"group,omitempty"`
	// PriorityClassName labels the priority of NerveXJob
	PriorityClassName PriorityClassName `json:"priorityClassName,omitempty"`
	// CleanPodPolicy defines the policy to clean pods after NerveXJob completed
	CleanPodPolicy CleanPodPolicy `json:"cleanPodPolicy,omitempty"`
	// Volumes defines the shared volumes for nerveX components
	Volumes []corev1.Volume `json:"volumes,omitempty"`
	// Coordinator defines the coordinator template
	Coordinator CoordinatorSpec `json:"coordinator"`
	// Collector defines the collector template
	Collector CollectorSpec `json:"collector"`
	// Learner defines the learner template
	Learner LearnerSpec `json:"learner"`
}
```
The AggregatorConfig definition is as follows:
```go
type AggregatorConfigSpec struct {
	// Aggregator defines the aggregator template
	Aggregator AggregatorSpec `json:"aggregator"`
}
```
> **Why is the aggregator defined separately?**
> The aggregator is a common module for all RL training jobs that use the nerveX framework, so we define it as a global, shared resource named AggregatorConfig. After RL jobs are submitted, nervex-server reads the cluster's single AggregatorConfig to create aggregators for them. Note that the aggregator only covers the most common case, data parallel training; other parallel training methods require defining a new Custom Resource.
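Correspondingly, the cluster-wide AggregatorConfig might be sketched as below. The `apiVersion`/`kind` values and the object name are assumptions for illustration; the container spec mirrors the aggregator sample in this merge request, and the container port matches `DefaultAggregatorPort` listed later.

```yaml
# Hypothetical AggregatorConfig; apiVersion/kind and the metadata name
# are illustrative assumptions.
apiVersion: nervex.sensetime.com/v1alpha1
kind: AggregatorConfig
metadata:
  name: aggregator-config
spec:
  aggregator:
    template:
      spec:
        containers:
        - name: aggregator
          image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.4-torch1.4-cuda10.1-cudnn7-devel-987bacf7
          ports:
          - name: aggregator
            containerPort: 22272
```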
### Status definition
After a NerveXJob is submitted, nervex-operator takes over the management of its life cycle. To give users a clear view of the NerveXJob's status, we define the following phases:
```go
const (
	// JobCreated means the job has been submitted to the cluster,
	// but not all the pods and services have been created,
	// or no pods are running
	JobCreated Phase = "Created"
	// JobRunning means all the pods are in running state
	JobRunning Phase = "Running"
	// JobSucceeded means job completed without error
	JobSucceeded Phase = "Succeeded"
	// JobFailed means some pods failed, job is also considered failed
	JobFailed Phase = "Failed"
	// JobUnknown means the job is in unknown state
	JobUnknown Phase = "Unknown"
)
```
A NerveXJob that runs normally and ends successfully goes through three phases: Created, Running, and Succeeded:
- When a NerveXJob is submitted, nervex-operator creates the coordinator, and the job enters the Created phase.
- When the coordinator pod is in the Running phase, the NerveXJob enters the Running phase.
- When the coordinator pod is in the Completed phase, the NerveXJob enters the Succeeded phase.
In addition, when the coordinator pod is in the Failed phase, the NerveXJob also enters the Failed phase. Aggregators, collectors, and learners restart immediately after a failure and do not affect the NerveXJob's phase.
The Unknown phase is currently not used.
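The transitions above can be pictured as a pure function mapping the coordinator pod's phase to the job phase. This is a simplified illustration, not the operator's actual code; the real controller tracks more state (service creation, restarts of other replicas, and so on).

```go
package main

import "fmt"

// jobPhaseFor maps the coordinator pod's phase to the NerveXJob phase,
// following the transitions described above. Failures of aggregators,
// collectors, and learners trigger restarts and never change the job phase.
func jobPhaseFor(coordinatorPodPhase string) string {
	switch coordinatorPodPhase {
	case "Pending":
		return "Created" // coordinator created, not yet running
	case "Running":
		return "Running"
	case "Succeeded":
		return "Succeeded" // coordinator pod completed
	case "Failed":
		return "Failed"
	default:
		return "Unknown"
	}
}

func main() {
	for _, p := range []string{"Pending", "Running", "Succeeded", "Failed"} {
		fmt.Printf("coordinator pod %s => NerveXJob %s\n", p, jobPhaseFor(p))
	}
}
```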
### Control loop
The project is built with [kubebuilder v3](https://github.com/kubernetes-sigs/kubebuilder/releases/download/v3.0.0/kubebuilder_linux_amd64). The components an operator needs, such as [reflectors, informers, indexers](https://github.com/kubernetes/sample-controller/blob/master/docs/controller-client-go.md) and controllers, are all encapsulated by [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) in its [manager](https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/manager.go); kubebuilder exposes a single `Reconcile` function in which we implement the reconcile logic for NerveXJob:
```go
func (r *NerveXJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// your reconcile logic here
return ctrl.Result{}, nil
}
```
When a NerveXJob is submitted, the informer receives the submission event and triggers its handler, after which the `Reconcile` function is called. In `Reconcile`, we first list the pods belonging to the NerveXJob and find that the coordinator has not been created. We then read the coordinator template defined in the NerveXJob and create the corresponding coordinator pod (which runs the coordinator main process) and service (used for inter-pod communication), writing environment variables into the pod, including the pod's name, the pod's namespace, the port the coordinator listens on, and the URL for accessing the coordinator.
Each module of the nerveX framework occupies a port with a default value, as shown below:
```go
const (
	DefaultCollectorPort   = 22270
	DefaultLearnerPort     = 22271
	DefaultAggregatorPort  = 22272
	DefaultCoordinatorPort = 22273
)
```
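Given these defaults, the in-cluster URL written into the coordinator's environment can be derived from the service name and namespace. The sketch below assumes the conventional `<service>.<namespace>` DNS form and an `http` scheme; the exact URL format nervex-operator uses may differ.

```go
package main

import "fmt"

const DefaultCoordinatorPort = 22273

// coordinatorURL builds a cluster-internal address for the coordinator
// service, assuming the usual "<service>.<namespace>" in-cluster DNS name.
func coordinatorURL(service, namespace string, port int) string {
	return fmt.Sprintf("http://%s.%s:%d", service, namespace, port)
}

func main() {
	// "nervexjob-example-coordinator" is a hypothetical service name.
	fmt.Println(coordinatorURL("nervexjob-example-coordinator", "default", DefaultCoordinatorPort))
}
```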
After the coordinator is created, nervex-operator monitors the pod's status and updates the NerveXJob's status accordingly. After the NerveXJob completes (Succeeded or Failed), nervex-operator by default deletes all of the NerveXJob's services and all of its pods still in the Running phase, while the coordinator pod is kept.
### Webhook
There may be mistakes when submitting a NerveXJob, such as misspelled field names or field values that do not match the predefined ones, which lead to errors when managing the NerveXJob's life cycle and make troubleshooting harder. On the other hand, it is useful to set default values for some NerveXJob fields. If defaults can be filled in and a correctness check performed before the NerveXJob is admitted to the K8s cluster, problems are surfaced in advance.
To achieve this, we can configure webhooks in K8s. K8s webhooks come in two kinds: MutatingWebhook, which modifies the values of a K8s resource object, and ValidatingWebhook, which verifies the correctness of a K8s resource object.
The webhook verification is implemented in nervex-operator: a MutatingWebhook sets default values for NerveXJob, and a ValidatingWebhook verifies its correctness. For example, for the `CleanPodPolicy` field, the MutatingWebhook sets its default value to `Running`, meaning that all running pods are deleted after the NerveXJob completes; the ValidatingWebhook checks the value of `CleanPodPolicy`, and if the value set by the user is not one of `None`, `ALL`, or `Running`, the NerveXJob is rejected.
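The defaulting and validation described above can be pictured with two small functions. This is an illustrative sketch of the logic only, not the operator's actual webhook code, which operates on the full NerveXJob object via admission requests.

```go
package main

import "fmt"

// defaultCleanPodPolicy mirrors what the MutatingWebhook does: an unset
// CleanPodPolicy is defaulted to "Running".
func defaultCleanPodPolicy(policy string) string {
	if policy == "" {
		return "Running"
	}
	return policy
}

// validateCleanPodPolicy mirrors what the ValidatingWebhook does: only the
// three documented values are accepted; anything else rejects the NerveXJob.
func validateCleanPodPolicy(policy string) error {
	switch policy {
	case "None", "ALL", "Running":
		return nil
	}
	return fmt.Errorf("invalid CleanPodPolicy %q: must be one of None, ALL, Running", policy)
}

func main() {
	p := defaultCleanPodPolicy("") // unset field gets the default
	fmt.Println(p, validateCleanPodPolicy(p) == nil)
	fmt.Println(validateCleanPodPolicy("Sometimes") != nil)
}
```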
## NerveX Server
Nervex-server is an HTTP server customized for the nerveX framework, providing APIs for adding, deleting, and querying collectors, learners, and aggregators. Through these APIs, nervex-server gives a NerveXJob the ability to dynamically scale its collectors and learners. The following briefly introduces the design of nervex-server: the local cache storing AggregatorConfig, NerveXJob, and all of the NerveXJob's pods, and the HTTP interface for dynamically adding, deleting, and querying collectors, learners, and aggregators.
### Local cache
In order to reduce the frequency of queries from nervex-server to the K8s api server, and thereby its load, we use the informer mechanism of [client-go](https://github.com/kubernetes/client-go) to store AggregatorConfig, NerveXJob, and all of the NerveXJob's pods in a local cache, as shown in the following [schematic diagram](https://github.com/kubernetes/sample-controller/blob/master/docs/controller-client-go.md):
![](images/client-go-controller-interaction.jpeg)
In the figure above we only need the upper half: the reflector receives notifications of new resource instances through the list & watch API and puts them into the Delta FIFO queue; the informer pops each new resource instance from the queue and stores it in the local cache through the indexer. Queries can then be served from the local cache, reducing the number of requests to the K8s api server, for example:
```go
genericInformer.Informer().GetIndexer().GetByKey(key)
```
When a resource object changes, the reflector likewise receives a notification and the local cache is updated. In addition, the informer periodically resyncs the local cache with the api server, keeping it consistent with the resource objects in the K8s cluster.
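The `GetByKey` lookup shown above is keyed by `<namespace>/<name>`. A toy, stdlib-only stand-in for such an indexer illustrates the idea; the real implementation is client-go's thread-safe store, which this sketch only approximates.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a minimal stand-in for client-go's indexer: a thread-safe map
// keyed by "<namespace>/<name>".
type cache struct {
	mu    sync.RWMutex
	items map[string]interface{}
}

func newCache() *cache { return &cache{items: map[string]interface{}{}} }

func key(namespace, name string) string { return namespace + "/" + name }

// Add is what the informer does when the reflector delivers a new object.
func (c *cache) Add(namespace, name string, obj interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key(namespace, name)] = obj
}

// GetByKey serves queries locally, without touching the K8s api server.
func (c *cache) GetByKey(k string) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	obj, ok := c.items[k]
	return obj, ok
}

func main() {
	c := newCache()
	c.Add("default", "nervexjob-example", "NerveXJob object")
	obj, ok := c.GetByKey("default/nervexjob-example")
	fmt.Println(obj, ok)
}
```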
### HTTP interface
In order to support dynamic scaling of collectors/learners for a NerveXJob, nervex-server implements HTTP interfaces for adding, deleting, and querying collectors/learners, as shown in the following figure:
![](images/nervex-api.png)
The following interfaces are provided:
| method | path | description |
|---|---|---|
| GET | /v1alpha1/replicas | list all collectors and learners |
| GET | /v1alpha1/replicas?namespace=xxx | list all collectors and learners in the namespace |
| GET | /v1alpha1/replicas?namespace=xxx&coordinator=xxx | list all replicas belonging to the coordinator |
| GET | /v1alpha1/replicas?namespace=xxx&aggregator=xxx | get learners belonging to the aggregator |
| DELETE | /v1alpha1/replicas | delete replicas specified in the request body |
| POST | /v1alpha1/replicas | create replicas specified in the request body |
| POST | /v1alpha1/replicas/failed | report failed replicas and request recreation, specified in the request body |
For the request format, request parameters, request body, and return value of each interface, please refer to [http interface definition](https://gitlab.bj.sensetime.com/platform/CloudNative4AI/cluster-lifecycle/nervex-operator/issues/6)
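A call against these interfaces can be sketched as below, with an `httptest` stand-in in place of a live nervex-server. The query-building helper follows the paths in the table; the handler and its echoed response are fabricated for illustration and do not represent nervex-server's actual reply format.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// replicasURL builds the query URL for listing replicas, following the
// GET routes in the table above. Empty parameters are omitted.
func replicasURL(base, namespace, coordinator string) string {
	q := url.Values{}
	if namespace != "" {
		q.Set("namespace", namespace)
	}
	if coordinator != "" {
		q.Set("coordinator", coordinator)
	}
	u := base + "/v1alpha1/replicas"
	if enc := q.Encode(); enc != "" {
		u += "?" + enc
	}
	return u
}

func main() {
	// Stand-in for nervex-server: echoes the request line back.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "GET %s", r.URL.RequestURI())
	}))
	defer srv.Close()

	resp, err := http.Get(replicasURL(srv.URL, "default", "nervexjob-example-coordinator"))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```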
## Advantages of NerveX Orchestrator
NerveX Orchestrator provides a K8s-based container orchestration solution for running the nerveX framework in distributed scenarios. For a submitted NerveXJob, nervex-operator orchestrates the nerveX modules so that each of them can run normally and carry out training tasks. Through nervex-server's HTTP interfaces, the coordinator gains the ability to add, delete, and query all of its collectors, learners, and aggregators, improving the nerveX framework's capacity for dynamic resource allocation. In summary, NerveX Orchestrator provides the following advantages:
1. Encapsulation. Relying on nervex-operator's orchestration, the details of deploying nerveX distributed RL training (including pod creation and service discovery) are transparent to users. Following the nerveX framework's deployment requirements, nervex-operator creates the coordinator, the coordinator then asks nervex-server to create the other modules, and nervex-operator records each module's pod status in the NerveXJob's status. The NerveXJob's life cycle is likewise maintained by nervex-operator, which surfaces the job's state across its phases.
2. Ease of use. Users only need to define the coordinator, collector, and learner configuration in the NerveXJob yaml file and submit it to the K8s cluster with one click; nervex-operator takes care of deployment, freeing users from the complexity of deploying distributed RL training on a K8s cluster.
3. Robustness. Relying on K8s's pod restart mechanism, pods restart automatically after an unexpected exit, and the coordinator responds quickly and reconnects.
4. Dynamic scaling. The collectors/learners/aggregators a NerveXJob needs change dynamically, so nervex-server provides HTTP interfaces to adjust the number of collectors/learners at runtime, letting a NerveXJob tune the ratio of collectors to learners according to its own needs and optimize throughput.
......@@ -7,48 +7,35 @@ spec:
aggregator:
template:
spec:
# nodeSelector:
# kubernetes.io/hostname: "liqingping-gpu.novalocal"
containers:
- name: aggregator
image: registry.sensetime.com/cloudnative4ai/nervex-parallel-linklink-migration:v0.10
image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.4-torch1.4-cuda10.1-cudnn7-devel-987bacf7
imagePullPolicy: Always
resources:
requests:
cpu: 2
memory: 5Gi
limits:
cpu: 2
memory: 5Gi
env:
- name: LC_ALL
value: "en_US.utf-8"
- name: LANG
value: "en_US.utf-8"
- name: NCCL_SHM_DISABLE
- name: PYTHONUNBUFFERED
value: "1"
resources:
requests:
cpu: 3
memory: "10Gi"
limits:
cpu: 3
memory: "10Gi"
command: ["/bin/bash", "-c",]
args:
- >
until curl $HOSTNAME:22279 &>/dev/null ; do sleep 1 ; done ; echo succeed;
until curl $HOSTNAME:22279 &>/dev/null ; do sleep 1 ; done ; echo succeed;
cd /data/nervex/test_atari/test3;
python3 -u -c 'import os; import nervex.entry.parallel_entry as pe; pe.launch_learner_aggregator(filename="atari_impala_default_config.py.pkl", name="aggregator{}".format(os.environ["HOSTNAME"].split("-")[-1]) )';
- |
# if code has been changed in the mount path, we have to reinstall nervex cli
# pip install --no-cache-dir -e .;
# pip install --no-cache-dir -e .[common_env]
nervex -m dist --module config -p k8s -c app_zoo/atari/config/parallel/qbert_dqn_config_k8s.py -s 0;
nervex -m dist --module learner_aggregator -c app_zoo/atari/config/parallel/qbert_dqn_config_k8s.py.pkl -s 0
ports:
- name: aggregator
containerPort: 22272
volumeMounts:
- name: config
mountPath: /data/nervex
volumes:
- name: config
nfs:
path: /data/nfs/nervex
server: 10.152.197.14
status:
actors:
total: 4
active: 3
idle: 1
learners:
total: 5
active: 5
idle: 0
\ No newline at end of file
containerPort: 22270
\ No newline at end of file
......@@ -19,7 +19,7 @@ spec:
spec:
containers:
- name: coordinator
image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.1-torch1.4-cuda10.1-cudnn7-devel-9472640d
image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.4-torch1.4-cuda10.1-cudnn7-devel-987bacf7
imagePullPolicy: Always
env:
- name: LC_ALL
......@@ -162,7 +162,7 @@ spec:
spec:
containers:
- name: collector
image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.1-torch1.4-cuda10.1-cudnn7-devel-9472640d
image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.4-torch1.4-cuda10.1-cudnn7-devel-987bacf7
imagePullPolicy: Always
env:
- name: LC_ALL
......@@ -304,7 +304,7 @@ spec:
spec:
containers:
- name: learner
image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.1-torch1.4-cuda10.1-cudnn7-devel-9472640d
image: registry.sensetime.com/cloudnative4ai/nervex:v0.0.4-torch1.4-cuda10.1-cudnn7-devel-987bacf7
imagePullPolicy: Always
env:
- name: LC_ALL
......