@@ -17,7 +17,7 @@ DI Operator is responsible for orchestrating DIJob in K8s system, using K8s [ope
...
### API Definitions
According to the characteristics of DI-engine framework, we use K8s Custom Resource to define the DIJob resource, which is used to define the desired state of a DI-engine job, including images, startup commands, mount volumes, and the number of workers, etc..
Based on the characteristics of the DI-engine framework, we use a K8s Custom Resource to define the DIJob resource, which describes the desired state of a DI-engine Reinforcement Learning (RL) job, including images, startup commands, mounted volumes, and the number of workers.
Definition and meaning of each field in DIJobSpec is as follows:
...
@@ -171,8 +171,8 @@ Jobs submitted run in the cluster according to the process in the following figu
...
DI Orchestrator provides a K8s-based container-orchestration solution for DI-engine framework in a distributed scenario. For a DIJob, Operator is responsible for orchestrating DI-engine workers so that each worker can run normally and perform training tasks. The sub-module Allocator in Operator provides DI-engine framework with the ability to dynamically allocate and schedule resources. By calling Server's HTTP interface, users are given the functions of adding, deleting, and querying workers for each job. In summary, DI Orchestrator provides the following advantages:
1. Encapsulation. Depending on the orchestration capabilities of Operator, details of deploying DI-engine distributed RL training jobs(including pod creation, service discovery) are transparent to users. According to the deployment requirements of DI-engine jobs for distributed RL training, Operator creates workers for jobs, and write the status of each worker to DIJob status. The life cycle of DIJob is also maintained by Operator, providing us with status of DIJob in different stages.
1. Encapsulation. Relying on the orchestration capabilities of Operator, the details of deploying DI-engine distributed RL training jobs (including pod creation and service discovery) are transparent to users. According to the deployment requirements of DI-engine jobs for distributed RL training, Operator creates workers for jobs and writes the status of each worker to the DIJob status. The life cycle of DIJob is also maintained by Operator, which exposes the status of DIJob at each stage.
2. Ease of use. The user only needs to define the configuration of the task in the yaml file of DIJob and submit it to the K8s cluster with one click, and the operator will be responsible for completing the deployment work, freeing the user from the complex distributed RL training deployment in the K8s cluster. At the same time, DIJob can be submitted with one click with the help of command line tools.
2. Ease of use. Users only need to define the configuration of a DI-engine job in the YAML file of DIJob and submit it to the K8s cluster with one click. Operator is then responsible for completing the deployment, freeing users from the complexity of deploying distributed RL training in a K8s cluster. DIJob can also be submitted with one click using command line tools.
3. Robustness. Rely on the operator's restart mechanism to ensure that workers can automatically restart in the case of unexpected exit.
3. Robustness. Operator's restart mechanism ensures that workers restart automatically after an unexpected exit.
4. Dynamic expansion. The workers required by DIJob change dynamically, so the server provides the http interface to dynamically adjust the number of workers, so that DIJob can adjust the number of workers according to its own needs and optimize throughput.
4. Dynamic expansion. The number of workers required by DIJob changes dynamically, so users can directly modify DIJob through the K8s client to change the number of workers; at the same time, Server provides HTTP interfaces to dynamically adjust the number of workers. Dynamic expansion allows users to adjust the number of workers according to their own needs and optimize throughput.
5. Dynamic scheduling. By relying on the Operator subcomponent Allocator, dynamic scheduling for DI-engine tasks becomes simple. Allocator provides scheduling strategies for single-task and multi-task, which can optimize the global task completion time without affecting normal training.
5. Dynamic scheduling. By relying on Operator's sub-module Allocator, dynamic scheduling for DI-engine jobs becomes simple. Allocator provides scheduling strategies for both single-job and multi-job scenarios, which can optimize the global job completion time without affecting normal training.
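
The dynamic expansion point above mentions that Server exposes HTTP interfaces for adjusting the number of workers. As a rough illustration only, the sketch below builds (but does not send) such a request. The server address, the route, and the payload field names are all hypothetical placeholders, since this document does not specify Server's actual API; substitute the real values from the Server documentation before use.

```python
import json
from urllib import request

# Hypothetical values: the real Server address and route are not given here.
SERVER_URL = "http://di-server.example:8080"
REPLICAS_ROUTE = "/hypothetical/replicas"


def build_scale_request(namespace: str, job_name: str, replicas: int) -> request.Request:
    """Build (without sending) a request asking Server to set a DIJob's worker count.

    The payload keys (namespace, name, replicas) are assumptions for illustration.
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    payload = {"namespace": namespace, "name": job_name, "replicas": replicas}
    return request.Request(
        url=SERVER_URL + REPLICAS_ROUTE,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_scale_request("default", "example-job", 4)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending is deliberately left out; with a reachable Server it could be:
    # with request.urlopen(req, timeout=5) as resp:
    #     print(resp.status)
```

Keeping request construction separate from sending makes the sketch easy to check without a running cluster, and the same payload could instead be applied by modifying the DIJob through a K8s client, which is the other path the list describes.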