Parl Cluster ============ Cluster Structure Overview ########################## | There are three core concepts in a Parl cluster: master, worker and client. - **Master:** The master node is the control center of a parl cluster, which provides connections to workers and clients. It receives tasks from clients and allocate vacant workers to run the tasks. - **Worker:** A worker provides the cpu computation resources for the cluster. It will initiate separate job subprocesses waiting for tasks from the master. - **Client:** For each training program, there is a unique global client which submits tasks to the master node. .. image:: ./parallel_training/cluster_structure.png :width: 600px :align: center Master ###### | There is only one master node in each parl cluster, we can start a master by calling ``xparl start --port 1234`` with a assigned port number. This command will also simultaneously start a local worker which connects to the new master. | **master socket** will receive all kinds of message from workers, clients or cluster monitor, such as: - A new worker connects the cluster. The master will start a heartbeat to check worker's status, and worker's jobs will be added to master's job center. - A new client connects the cluster: The master will start a heartbeat to check client's status, and wait for client to submit a task. - A worker updates its job buffer: The master will replace the new jobs for the killed old jobs in the job center. - Cluster monitor query cluster status: The master will return the detailed status of the cluster (i.e. total cpu number, used cpu number, load average ...) to the monitor. .. image:: ./parallel_training/master.png :width: 600px :align: center Worker ###### | We can add more computation resources to an existed cluster by calling ``xparl --connect master_address`` command. This command will create a local **Worker** object and then connect to the cluster. | When we start a new worker, it will first initiate separate job subprocesses in a job buffer. And then send the initialized worker to the master node. | The worker will send a heartbeat signal to each job to check if it's still alive. When the worker find a job subprocess is dead, it will drop the dead job from the job buffer, start a new job and update worker information to the master node. .. image:: ./parallel_training/worker.png :width: 600px :align: center Client ###### | We have a global client for each training program, it submits training tasks to the master node. User do not need to interact with client object directly. We can create a new global client and connect it to the cluster by calling ``parl.connect(master_address)``. | The global client will read local python scripts and configuration files, which will later be sent to remote jobs. .. image:: ./parallel_training/client.png :width: 600px :align: center Actor ##### | **Actor** is an object defined by users which aims to solve a specific task. We use ``@parl.remote_class`` decorator to convert an actor to a remote class object, and each actor is connected to the global client. .. code-block:: python # connect global client to the master node parl.connect(master_address) @parl.remote_class class Actor(object) def __init__(self): ... | When a decorated actor class object is instantiated, the global client will submit a task to the master node. Then the master node will pick a vacant job from the job center and send the job back to the client. The actor will make a connection with the job and send local files, class definition and initialization arguments to the job. Then the job will instantiate a local actor in the job process. | When the actor call a function, the real computation will be executed in the job process by job's local actor. .. image:: ./parallel_training/actor.png :width: 600px :align: center