Now you can use the administration tool `kubectl` to operate the cluster.
By default, `kubectl` uses the kubeconfig file generated during cluster startup to authenticate against the API server; the file is located at `~/.kube/config`.
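For example, you can verify that `kubectl` can reach the cluster and that all nodes are ready:
```
kubectl cluster-info
kubectl get nodes
```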
For running PaddlePaddle training with Kubernetes on AWS, you can refer to [this article](https://github.com/drinktee/Paddle/blob/k8s/doc_cn/cluster/k8s/distributed_training_on_kubernetes.md).
### Setup PaddlePaddle Environment on AWS
For the design of running PaddlePaddle on Kubernetes, you should read [this article](https://github.com/drinktee/Paddle/blob/k8s/doc_cn/cluster/k8s/distributed_training_on_kubernetes.md) first.
For sharing the training data across all the Kubernetes nodes, we use EFS (Elastic File System) on AWS. Ceph might be a better solution, but it requires a recent Linux kernel version that may not be stable enough yet. We have not automated the EFS setup yet, so please follow these steps:
1. Make sure the `AmazonElasticFileSystemFullAccess` policy is added to your AWS account.
2. Create an Elastic File System in the AWS console and attach it to the Kubernetes VPC.
3. Modify the Kubernetes default VPC security group by adding an additional inbound rule "All TCP TCP 0 - 65535 0.0.0.0/0".
4. Follow the EC2 mount instructions to mount the file system onto all the Kubernetes nodes; we recommend mounting the EFS disk onto `~/efs` (see the example mount command after this list).
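As a sketch of step 4, the EC2 mount instructions boil down to an NFSv4.1 mount of the file system on each node; the file system ID and region below are placeholders for your own values:
```
# Create the mount point and mount the EFS file system over NFSv4.1.
# Replace fs-xxxxxxxx and us-west-2 with your own file system ID and region.
mkdir -p ~/efs
sudo mount -t nfs4 -o nfsvers=4.1 fs-xxxxxxxx.efs.us-west-2.amazonaws.com:/ ~/efs
```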
Now you can place your training data onto EFS; follow [this article](https://github.com/drinktee/Paddle/blob/k8s/doc_cn/cluster/k8s/distributed_training_on_kubernetes.md) to see where the data should go.
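For illustration only: the job definition below mounts the hosts' `/home/admin/efs` into each pod at `/home/jobpath` and sets `JOB_NAME=paddle-cluster-job` and `TRAIN_CONFIG_DIR=recommendation`, so a plausible (hypothetical) layout is a matching directory on EFS; the exact layout is defined in the article above:
```
# Hypothetical data layout -- confirm against the referenced article.
mkdir -p ~/efs/paddle-cluster-job/recommendation
cp -r <your-training-config-and-data> ~/efs/paddle-cluster-job/recommendation/
```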
### Start PaddlePaddle Training Demo on AWS
After finishing all the steps above, we can start the PaddlePaddle recommendation training demo with:
```
kubectl create -f job.yaml
```
The content of `job.yaml` is as follows:
```
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-cluster-job
spec:
  parallelism: 3
  completions: 3
  template:
    metadata:
      name: paddle-cluster-job
    spec:
      volumes:
      - name: jobpath
        hostPath:
          path: /home/admin/efs
      containers:
      - name: trainer
        image: drinkcode/paddle:k8s-job
        command: ["/bin/bash", "-c", "/root/start.sh"]
        env:
        - name: JOB_NAME
          value: paddle-cluster-job
        - name: JOB_PATH
          value: /home/jobpath
        - name: JOB_NAMESPACE
          value: default
        - name: TRAIN_CONFIG_DIR
          value: recommendation
        - name: CONF_PADDLE_NIC
          value: eth0
        - name: CONF_PADDLE_PORT
          value: "7164"
        - name: CONF_PADDLE_PORTS_NUM
          value: "2"
        - name: CONF_PADDLE_PORTS_NUM_SPARSE
          value: "2"
        - name: CONF_PADDLE_GRADIENT_NUM
          value: "3"
        volumeMounts:
        - name: jobpath
          mountPath: /home/jobpath
        ports:
        - name: jobport
          hostPort: 30001
          containerPort: 30001
      restartPolicy: Never
```
This starts a job with three PaddlePaddle trainer pods running across the Kubernetes nodes, and all the training results will be written to EFS.
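You can watch the progress of the job with standard `kubectl` commands, for example:
```
# Show the job and how many of its 3 completions have finished.
kubectl get jobs
# List the trainer pods created by the job.
kubectl get pods
# Inspect the output of one trainer (replace <pod-name> with a real pod name).
kubectl logs <pod-name>
```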
In our experiment, running this PaddlePaddle recommendation training demo on three 2-core, 8 GB machines (m3.large) took 8 hours to generate 10 models.