We introduced how to create a PaddlePaddle Job with a single node on Kubernetes in the
previous document.
In this article, we will introduce how to create a PaddlePaddle job with multiple nodes
on a Kubernetes cluster.
## Overall Architecture
Before creating a training job, users need to deploy the Python scripts and the pre-split
training data to a predefined path in the distributed file system (we can use different types
of Kubernetes Volumes to mount different distributed file systems). Before training starts,
the program copies the training data into the container, and during training it saves the
models to the same path. The overall architecture is as follows:
![PaddlePaddle on Kubernetes Architecture](src/k8s-paddle-arch.png)
The above figure describes a distributed training architecture with 3 nodes. Each Pod mounts
a folder of the distributed file system through a Kubernetes Volume to save training data and
models. Kubernetes creates 3 Pods for this training phase and schedules them onto 3 nodes;
each Pod has a PaddlePaddle container. After the containers are created, PaddlePaddle starts
up the communication between the PServer and Trainer processes and reads the training data
for this training job.
As described above, we can start a PaddlePaddle distributed training job on a running
Kubernetes cluster with the following steps:
1. [Build PaddlePaddle Docker Image](#Build a Docker Image)
1. [Split training data and upload to the distributed file system](#Prepare Training Data)
1. [Edit a YAML file and create a Kubernetes Job](#Create a Job)
1. [Check the output](#Check The Output)
We will walk through these steps in the following sections.
### Build a Docker Image
The PaddlePaddle Docker image needs to support the runtime environment of the `Paddle PServer` and
`Paddle Trainer` processes, and it has two important features:
- Copy the training data into the container.
- Generate the startup arguments of the `Paddle PServer` and `Paddle Trainer` processes.
The official Docker image `paddlepaddle/paddle:latest` already includes the PaddlePaddle
executable but not the two features above, so we can use it as a base image and add some
additional scripts to build a new image.
You can refer to the [Dockerfile](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/usage/cluster/src/k8s_train/Dockerfile).
```bash
$ cd doc/howto/usage/k8s/src/k8s_train
$ docker build -t [YOUR_REPO]/paddle:mypaddle .
```
Then push the new Docker image to a Docker registry:
```bash
$ docker push [YOUR_REPO]/paddle:mypaddle
```
**[NOTE]**: in the above commands, `[YOUR_REPO]` represents your Docker repository; replace it
with your own repository name. We will use `[YOUR_REPO]/paddle:mypaddle` to refer to the Docker
image built in this step.
### Prepare Training Data
We can download and split the training data by creating a Kubernetes Job, or customize the image
by editing [k8s_train](./src/k8s_train/README.md).
Before creating the Job, we need to bind a [PersistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) backed by the appropriate type of
distributed file system; the generated dataset will be saved on this volume.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-data
spec:
  template:
    metadata:
      name: pi
    spec:
      hostNetwork: true
      containers:
      - name: paddle-data
        image: paddlepaddle/paddle-tutorial:k8s_data
        imagePullPolicy: Always
        volumeMounts:
        - mountPath: "/mnt"
          name: nfs
        env:
        - name: OUT_DIR
          value: /home/work/mfs/paddle-cluster-job
        - name: SPLIT_COUNT
          value: "3"
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: mfs
      restartPolicy: Never
```
Create this Job with `kubectl create -f <yaml-file>`; if it succeeds, you can see a directory
layout like the following on the volume:
```bash
[root@paddle-kubernetes-node0 nfsdir]$ tree -d
.
`-- paddle-cluster-job
|-- 0
| `-- data
|-- 1
| `-- data
|-- 2
| `-- data
|-- output
|-- quick_start
```
The `paddle-cluster-job` directory above is named after this training job. We need 3
PaddlePaddle training nodes, and the split training data is saved under the `paddle-cluster-job` path:
the folders `0`, `1`, and `2` correspond to the `trainer_id` of each node, the `quick_start` folder stores the training data, and the `output` folder stores the models and logs.
### Create a Job
Kubernetes allows users to create objects from YAML files, and we can use the `kubectl`
command-line tool to create them.
The Job YAML file describes which Docker image is used in this training job, how many nodes are created, the startup arguments of the `Paddle PServer/Trainer` processes, and the type of Volumes. You can find the details of the YAML fields in
Then we query the status of all the other Pods of this Job with the function `getPodList()`, and fetch the `trainer_id` with the function `getIdMap(podlist)` once all Pods are in the `Running` state.
```python
podlist = getPodList()
# need to wait until all pods are running
while not isPodAllRunning(podlist):
    time.sleep(10)
    podlist = getPodList()
idMap = getIdMap(podlist)
```
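The function `isPodAllRunning(podlist)` is not shown in this document; below is a minimal sketch of what it could look like, assuming `podlist` is the pod list object returned by the Kubernetes API (the actual start script may implement it differently):
```python
def isPodAllRunning(podlist):
    '''
    Return True only when every Pod in the list has reached the "Running" phase.
    '''
    if not podlist["items"]:
        return False
    for pod in podlist["items"]:
        if pod["status"]["phase"] != "Running":
            return False
    return True
```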
**NOTE**: `getPodList()` fetches all Pods in the current namespace, so unrelated Pods running there may cause errors. We will use [StatefulSets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of
Kubernetes Pods or ReplicaSets in the future.
As for the implementation of `getIdMap(podlist)`: this function fetches the IP address of each Pod in
`podlist` and then sorts the addresses to generate a `trainer_id` for each trainer.
```python
def getIdMap(podlist):
    '''
    generate trainer_id by ip
    '''
    ips = []
    for pod in podlist["items"]:
        ips.append(pod["status"]["podIP"])
    ips.sort()
    idMap = {}
    for i in range(len(ips)):
        idMap[ips[i]] = i
    return idMap
```
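For example, with three illustrative Pod IPs (the values below are made up), the resulting map assigns ids by the lexicographic order of the IP strings:
```python
podlist = {"items": [
    {"status": {"podIP": "10.0.0.3"}},
    {"status": {"podIP": "10.0.0.12"}},
    {"status": {"podIP": "10.0.0.7"}},
]}
print(getIdMap(podlist))
# {'10.0.0.12': 0, '10.0.0.3': 1, '10.0.0.7': 2}
# note: the IPs are sorted as strings, so "10.0.0.12" comes before "10.0.0.3"
```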
After getting `idMap`, we can generate the startup arguments of `Paddle PServer` and `Paddle Trainer`
and then start them with `startPaddle(idMap, train_args_dict)`.
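Part of generating the `Paddle Trainer` arguments is finding the current Pod's own `trainerId` in `idMap`. A rough sketch of that lookup is shown below; the `POD_IP` environment variable and the `getTrainerId` helper are assumptions for illustration, not the actual script's API:
```python
import os
import socket

def getTrainerId(idMap):
    # Hypothetical helper: look up this Pod's trainer id by its own IP.
    podIP = os.environ.get("POD_IP")
    if podIP is None:
        # Fallback: inside a Pod the hostname normally resolves to the Pod IP.
        podIP = socket.gethostbyname(socket.gethostname())
    return idMap[podIP]
```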
### Start PaddlePaddle Processes
The main goal of `startPaddle` is to generate the startup arguments of the `Paddle PServer` and `Paddle Trainer` processes. Taking `Paddle Trainer` as an example, we parse the environment variables to get
`PADDLE_NIC`, `PADDLE_PORT`, `PADDLE_PORTS_NUM`, etc., and finally find `trainerId` from