diff --git a/README.md b/README.md
index c4814f43e2b762dd5f0a5f60b2bff76816a6638f..70921a2579e442c7f9b829d58781714262f53a27 100644
--- a/README.md
+++ b/README.md
@@ -24,19 +24,19 @@ di-server-7b86ff8df4-jfgmp 1/1 Running 0 59s
 
 Install global components of DIJob defined in AggregatorConfig:
 ```bash
-kubectl create -f examples/di_v1alpha1_agconfig.yaml -n di-system
+kubectl create -f config/samples/agconfig.yaml -n di-system
 ```
 ### Submit DIJob
 ```bash
 # submit DIJob
-$ kubectl create -f examples/di_v1alpha1_dijob.yaml
+$ kubectl create -f config/samples/dijob-cartpole.yaml
 
 # get pod and you will see coordinator is created by di-operator
 # a few seconds later, you will see collectors and learners created by di-server
 $ kubectl get pod
 
 # get logs of coordinator
-$ kubectl logs dijob-example-coordinator
+$ kubectl logs cartpole-dqn-coordinator
 ```
 
 ## User Guide
diff --git a/docs/architecture-cn.md b/docs/architecture-cn.md
index 35acd3bd48299f2cc75834462e2bc0004e3a1718..b528135b87ca7470a84e0f31973e68dd568925c9 100644
--- a/docs/architecture-cn.md
+++ b/docs/architecture-cn.md
@@ -8,7 +8,7 @@ DI框架分为3个重要的模块,分别是coordinator、collector和learner
 
 为了提供DI在Kubernetes(K8s)中运行的支持,我们设计了DI Orchestrator,本文将说明利用DI Orchestrator,DI各个模块在K8s系统上如何被创建、如何相互发现、如何开始训练等。DI Orchestrator的架构如下图所示:
 
-![](images/di-arch.png)
+![](images/di-arch.svg)
 
 整体分为两大模块:`di-server`和`di-operator`,`DDPL`指ddp learner,`Lm`指Learner,`Cn`指Collector,`Aggregator+DDPL`构成一个logic learner。接下来将首先介绍一个DI任务提交到K8s之后DI Orchestrator如何将DI的各个模块(在K8s中就是一个[pod](https://kubernetes.io/docs/concepts/workloads/pods/))创建并启动,然后将对di-server和di-operator进行介绍。
 
diff --git a/docs/architecture.md b/docs/architecture.md
index 064e4e1c93d16620d2ce0a5a770aa73b37c97bf8..06b33c590eceef97e1113fcfa7b66f067ba1e67d 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -8,7 +8,7 @@ For the introduction of DI, please refer to [DI developer tutorial](https://open
 
 In order to provide running support for DI in Kubernetes (K8s), we designed `DI Orchestrator`. This article will explain how to use DI Orchestrator, how each module of DI is created on K8s and discovers each other, how to start training, etc. The architecture of DI Orchestrator is shown in the figure below:
 
-![](images/di-arch.png)
+![](images/di-arch.svg)
 
 There are two main modules that is `di-server` and `di-operator`. `DDPL` represents ddp learner, `Lm` represents logic learner, `Cn` represents collector, and `Aggregator+DDPL` constructs a logic learner. In the following pages, we will first introduce how `DI Orchestrator` creates and starts each module of DI after a DI job is submitted to K8s, and then introduces the architecture of `di-server` and `di-operator`.
 
diff --git a/docs/developer-guide.md b/docs/developer-guide.md
index 6a8b4ea780be02734fe35f5105291ba45c04fa89..25c61a12e3df11ea484066c9d117aaddf8fcbffb 100644
--- a/docs/developer-guide.md
+++ b/docs/developer-guide.md
@@ -1,6 +1,6 @@
-# developer guide
+# Developer Guide
 
-## prerequisites
+## Prerequisites
 - a well prepared kubernetes cluster. Follow the [instructions](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/) to create a kubernetes cluster, or create a local kubernetes node referring to [kind](https://kind.sigs.k8s.io/docs/user/quick-start/) or [minikube](https://minikube.sigs.k8s.io/docs/start/)
 - kustomize. Installed by the following command
 ```bash
@@ -11,7 +11,7 @@ kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
 ```bash
 kubectl create -f ./config/certmanager/cert-manager.yaml
 ```
-## project initialization
+## Project Initialization
 This project is based on [kubebuilder v3](https://github.com/kubernetes-sigs/kubebuilder/releases/download/v3.0.0/kubebuilder_linux_amd64), since CRDs generated by kubebuilder v2 is not compatible in kubernetes v1.20.
 ```bash
 kubebuilder init --domain opendilab.org --license apache2 --owner "The OpenDILab authors"
@@ -32,26 +32,28 @@ make manifests
 ```
 New CRD files will be generated in [./config/crd/bases](./config/crd/bases)
 
-## controller logic
+## Controller Logic
 Referenced to [controllers](./controllers)
 
-## di-server logic
+## DI Server Logic
 Referenced to [server](./server)
 
 ## Installation
 Run the following command in the project root directory.
 ```bash
-# build images. If you are not working in Linux, here you should use `make docker-build`
-make dev-images
+# build images.
+make docker-build
 make docker-push
 # deploy di-operator and server to cluster
 make dev-deploy
 ```
 
 Since the CustomResourceDefinitions are too long, you will probably find the following error:
-![](docs/images/deploy-failed.png)
+```bash
+The CustomResourceDefinition "dijobs.diengine.opendilab.org" is invalid: metadata.annotations: Too long: must have at most 262144 bytes
+```
 
-Then run the following command will solve the problem:
+Then running the following command will solve the problem:
 ```bash
 kustomize build config/crd | kubectl create -f -
 ```
@@ -66,5 +68,5 @@ di-server-7b86ff8df4-jfgmp 1/1 Running 0 59s
 
 Install global components of DIJob defined in AggregatorConfig:
 ```bash
-kubectl create -f examples/di-mock-agconfig.yaml -n di-system
+kubectl create -f config/samples/agconfig.yaml -n di-system
 ```
diff --git a/docs/images/di-arch.png b/docs/images/di-arch.png
deleted file mode 100644
index 60a14406077902e0486bb4297d99b20bb5bbada3..0000000000000000000000000000000000000000
Binary files a/docs/images/di-arch.png and /dev/null differ
diff --git a/docs/images/di-arch.svg b/docs/images/di-arch.svg
new file mode 100644
index 0000000000000000000000000000000000000000..0eb4076d4f7f995480c27b3a636cf4da9fc016bd
--- /dev/null
+++ b/docs/images/di-arch.svg
@@ -0,0 +1,3 @@
+
+
+
[SVG markup of docs/images/di-arch.svg not shown. Text labels in the diagram: nervex-operator, nervex-server, Storage Middleware, Coordinator, "create coordinator", "request scale", "scale", step numbers 1-4, collectors C0/Ck/Cn, learners L0/Lj/Lm, Aggregator+DDPL, and abbreviated manifests for two DIJobs (di1, di2) and an AggregatorConfig (aggregator-config).]
\ No newline at end of file