diff --git a/doc/cluster/aws/add_security_group.png b/doc/cluster/aws/add_security_group.png
new file mode 100644
index 0000000000000000000000000000000000000000..50eed4c6573a18d6ae0f9df9bd6a3cae05493e3c
Binary files /dev/null and b/doc/cluster/aws/add_security_group.png differ
diff --git a/doc/cluster/aws/create_efs.png b/doc/cluster/aws/create_efs.png
new file mode 100644
index 0000000000000000000000000000000000000000..f4d448d1518e11a11d535efb9c3a78b56cc13149
Binary files /dev/null and b/doc/cluster/aws/create_efs.png differ
diff --git a/doc/cluster/aws/efs_mount.png b/doc/cluster/aws/efs_mount.png
new file mode 100644
index 0000000000000000000000000000000000000000..0f9e3cab98445707e5e9baa18ddabe15cdf04576
Binary files /dev/null and b/doc/cluster/aws/efs_mount.png differ
diff --git a/doc/cluster/aws/managed_policy.png b/doc/cluster/aws/managed_policy.png
new file mode 100644
index 0000000000000000000000000000000000000000..c7ecda555b81d7750e9292a9ab72d2f517f76a2a
Binary files /dev/null and b/doc/cluster/aws/managed_policy.png differ
diff --git a/doc/cluster/aws/paddlepaddle_on_aws_with_kubernetes.md b/doc/cluster/aws/paddlepaddle_on_aws_with_kubernetes.md
index 920608c562b03dc1510536a96f709ced9856f36a..b4e88f6e5c9edddbbde278c179b5d020a03ebb09 100644
--- a/doc/cluster/aws/paddlepaddle_on_aws_with_kubernetes.md
+++ b/doc/cluster/aws/paddlepaddle_on_aws_with_kubernetes.md
@@ -1,4 +1,4 @@
-#PaddlePaddle on AWS with Kubernetes
+#PaddlePaddle on AWS with Kubernetes

 ##Prerequisites

@@ -14,10 +14,13 @@ You need an Amazon account and your user account needs the following privileges
 * IAMFullAccess
 * NetworkAdministrator

-If you are not in Unites States, we also recommend creating a jump server instance with default amazon AMI in the same available zone as your cluster, otherwise there will be some issue on creating the cluster.
+![managed_policy](managed_policy.png =800x)

-##For people new to Kubernetes and AWS
+If you are not in the United States, we also recommend creating a jump server VM instance with the default Amazon AMI in the same availability zone as your cluster, and logging in to that jump server for the following operations; otherwise there may be some issues related to account authentication.
+
+
+##PaddlePaddle on AWS

 If you are new to Kubernetes or AWS and just want to run PaddlePaddle, you can follow these steps to start up a new cluster.

@@ -47,16 +50,25 @@ export KUBERNETES_PROVIDER=aws; curl -sS https://get.k8s.io | bash
 ```

+By default, the script will provision a new VPC and a 4 node k8s cluster in us-west-2a (Oregon) with EC2 instances running on Debian. You can override the variables defined in `/cluster/config-default.sh` to change this behavior as follows:

-This process takes about 5 to 10 minutes.
+```
+export KUBE_AWS_ZONE=us-west-2a
+export NUM_NODES=3
+export MASTER_SIZE=m3.medium
+export NODE_SIZE=m3.large
+export AWS_S3_REGION=us-west-2
+export AWS_S3_BUCKET=mycompany-kubernetes-artifacts
+export KUBE_AWS_INSTANCE_PREFIX=k8s
+...

-Once the cluster is up, the IP addresses of your master and node(s) will be printed, as well as information about the default services running in the cluster (monitoring, logging, dns).
+```

-User credentials and security tokens are written in `~/.kube/config`, they will be necessary to use the CLI or the HTTP Basic Auth.
+This process takes about 5 to 10 minutes.
``` -[ec2-user@ip-172-31-24-50 ~]$ export KUBERNETES_PROVIDER=aws; curl -sS https://get.k8s.io | bash +[ec2-user@ip-172-31-27-229 ~]$ export KUBERNETES_PROVIDER=aws; curl -sS https://get.k8s.io | bash 'kubernetes' directory already exist. Should we skip download step and start to create cluster based on it? [Y]/n Skipping download step. Creating a kubernetes on aws... @@ -65,74 +77,72 @@ Creating a kubernetes on aws... ... calling kube-up Starting cluster using os distro: jessie Uploading to Amazon S3 -+++ Staging server tars to S3 Storage: kubernetes-staging-98b0b8ae5c8ea0e33a0faa67722948f1/devel -upload: ../../../tmp/kubernetes.7nMCAR/s3/bootstrap-script to s3://kubernetes-staging-98b0b8ae5c8ea0e33a0faa67722948f1/devel/bootstrap-script ++++ Staging server tars to S3 Storage: kubernetes-staging-9996f910edd9ec30ed3f8e3a9db7466c/devel +upload: ../../../tmp/kubernetes.KsacFg/s3/bootstrap-script to s3://kubernetes-staging-9996f910edd9ec30ed3f8e3a9db7466c/devel/bootstrap-script Uploaded server tars: - SERVER_BINARY_TAR_URL: https://s3.amazonaws.com/kubernetes-staging-98b0b8ae5c8ea0e33a0faa67722948f1/devel/kubernetes-server-linux-amd64.tar.gz - SALT_TAR_URL: https://s3.amazonaws.com/kubernetes-staging-98b0b8ae5c8ea0e33a0faa67722948f1/devel/kubernetes-salt.tar.gz - BOOTSTRAP_SCRIPT_URL: https://s3.amazonaws.com/kubernetes-staging-98b0b8ae5c8ea0e33a0faa67722948f1/devel/bootstrap-script -INSTANCEPROFILE arn:aws:iam::525016323257:instance-profile/kubernetes-master 2016-11-22T05:20:41Z AIPAJWBAGNSEHM4CILHDY kubernetes-master / -ROLES arn:aws:iam::525016323257:role/kubernetes-master 2016-11-22T05:20:39Z / AROAJW3VKVVQ5MZSTTJ5O kubernetes-master + SERVER_BINARY_TAR_URL: https://s3.amazonaws.com/kubernetes-staging-9996f910edd9ec30ed3f8e3a9db7466c/devel/kubernetes-server-linux-amd64.tar.gz + SALT_TAR_URL: https://s3.amazonaws.com/kubernetes-staging-9996f910edd9ec30ed3f8e3a9db7466c/devel/kubernetes-salt.tar.gz + BOOTSTRAP_SCRIPT_URL: https://s3.amazonaws.com/kubernetes-staging-9996f910edd9ec30ed3f8e3a9db7466c/devel/bootstrap-script +INSTANCEPROFILE arn:aws:iam::330323714104:instance-profile/kubernetes-master 2016-12-01T03:19:54Z AIPAIQDDLSMLWJ2QDXM6I kubernetes-master / +ROLES arn:aws:iam::330323714104:role/kubernetes-master 2016-12-01T03:19:52Z / AROAJDKKDIYHJTTEJM73M kubernetes-master ASSUMEROLEPOLICYDOCUMENT 2012-10-17 STATEMENT sts:AssumeRole Allow PRINCIPAL ec2.amazonaws.com -INSTANCEPROFILE arn:aws:iam::525016323257:instance-profile/kubernetes-minion 2016-11-22T05:20:45Z AIPAIYVABOPWQZZX5EN5W kubernetes-minion / -ROLES arn:aws:iam::525016323257:role/kubernetes-minion 2016-11-22T05:20:43Z / AROAJKDVM7XQNZ4JGVKNO kubernetes-minion +INSTANCEPROFILE arn:aws:iam::330323714104:instance-profile/kubernetes-minion 2016-12-01T03:19:57Z AIPAJGNG4GYTNVP3UQU4S kubernetes-minion / +ROLES arn:aws:iam::330323714104:role/kubernetes-minion 2016-12-01T03:19:55Z / AROAIZVAWWBIVUENE5XB4 kubernetes-minion ASSUMEROLEPOLICYDOCUMENT 2012-10-17 STATEMENT sts:AssumeRole Allow PRINCIPAL ec2.amazonaws.com -Using SSH key with (AWS) fingerprint: 08:9f:6b:82:3d:b5:ba:a0:f3:db:ab:94:1b:a7:a4:c7 +Using SSH key with (AWS) fingerprint: 70:66:c6:3d:53:3b:e5:3d:1d:7f:cd:c9:d1:87:35:81 Creating vpc. 
-Adding tag to vpc-fad1139d: Name=kubernetes-vpc -Adding tag to vpc-fad1139d: KubernetesCluster=kubernetes -Using VPC vpc-fad1139d -Adding tag to dopt-e43a7180: Name=kubernetes-dhcp-option-set -Adding tag to dopt-e43a7180: KubernetesCluster=kubernetes -Using DHCP option set dopt-e43a7180 +Adding tag to vpc-e01fc087: Name=kubernetes-vpc +Adding tag to vpc-e01fc087: KubernetesCluster=kubernetes +Using VPC vpc-e01fc087 +Adding tag to dopt-807151e4: Name=kubernetes-dhcp-option-set +Adding tag to dopt-807151e4: KubernetesCluster=kubernetes +Using DHCP option set dopt-807151e4 Creating subnet. -Adding tag to subnet-fc16fa9b: KubernetesCluster=kubernetes -Using subnet subnet-fc16fa9b +Adding tag to subnet-4a9a642d: KubernetesCluster=kubernetes +Using subnet subnet-4a9a642d Creating Internet Gateway. -Using Internet Gateway igw-fc0d9398 +Using Internet Gateway igw-821a73e6 Associating route table. Creating route table -Adding tag to rtb-bd8512da: KubernetesCluster=kubernetes -Associating route table rtb-bd8512da to subnet subnet-fc16fa9b -Adding route to route table rtb-bd8512da -Using Route Table rtb-bd8512da +Adding tag to rtb-0d96fa6a: KubernetesCluster=kubernetes +Associating route table rtb-0d96fa6a to subnet subnet-4a9a642d +Adding route to route table rtb-0d96fa6a +Using Route Table rtb-0d96fa6a Creating master security group. Creating security group kubernetes-master-kubernetes. -Adding tag to sg-d9280ba0: KubernetesCluster=kubernetes +Adding tag to sg-a47564dd: KubernetesCluster=kubernetes Creating minion security group. Creating security group kubernetes-minion-kubernetes. -Adding tag to sg-dc280ba5: KubernetesCluster=kubernetes -Using master security group: kubernetes-master-kubernetes sg-d9280ba0 -Using minion security group: kubernetes-minion-kubernetes sg-dc280ba5 +Adding tag to sg-9a7564e3: KubernetesCluster=kubernetes +Using master security group: kubernetes-master-kubernetes sg-a47564dd +Using minion security group: kubernetes-minion-kubernetes sg-9a7564e3 Creating master disk: size 20GB, type gp2 -Adding tag to vol-04d71a810478dec0d: Name=kubernetes-master-pd -Adding tag to vol-04d71a810478dec0d: KubernetesCluster=kubernetes -Allocated Elastic IP for master: 35.162.175.115 -Adding tag to vol-04d71a810478dec0d: kubernetes.io/master-ip=35.162.175.115 -Generating certs for alternate-names: IP:35.162.175.115,IP:172.20.0.9,IP:10.0.0.1,DNS:kubernetes,DNS:kubernetes.default,DNS:kubernetes.default.svc,DNS:kubernetes.default.svc.cluster.local,DNS:kubernetes-master +Adding tag to vol-0eba023cc1874c790: Name=kubernetes-master-pd +Adding tag to vol-0eba023cc1874c790: KubernetesCluster=kubernetes +Allocated Elastic IP for master: 35.165.155.60 +Adding tag to vol-0eba023cc1874c790: kubernetes.io/master-ip=35.165.155.60 +Generating certs for alternate-names: IP:35.165.155.60,IP:172.20.0.9,IP:10.0.0.1,DNS:kubernetes,DNS:kubernetes.default,DNS:kubernetes.default.svc,DNS:kubernetes.default.svc.cluster.local,DNS:kubernetes-master Starting Master -Adding tag to i-042488375c2ca1e3e: Name=kubernetes-master -Adding tag to i-042488375c2ca1e3e: Role=kubernetes-master -Adding tag to i-042488375c2ca1e3e: KubernetesCluster=kubernetes +Adding tag to i-097f358631739e01c: Name=kubernetes-master +Adding tag to i-097f358631739e01c: Role=kubernetes-master +Adding tag to i-097f358631739e01c: KubernetesCluster=kubernetes Waiting for master to be ready -Attempt 1 to check for master nodeWaiting for instance i-042488375c2ca1e3e to be running (currently pending) -Sleeping for 3 seconds... 
-Waiting for instance i-042488375c2ca1e3e to be running (currently pending) +Attempt 1 to check for master nodeWaiting for instance i-097f358631739e01c to be running (currently pending) Sleeping for 3 seconds... -Waiting for instance i-042488375c2ca1e3e to be running (currently pending) +Waiting for instance i-097f358631739e01c to be running (currently pending) Sleeping for 3 seconds... -Waiting for instance i-042488375c2ca1e3e to be running (currently pending) +Waiting for instance i-097f358631739e01c to be running (currently pending) Sleeping for 3 seconds... -Waiting for instance i-042488375c2ca1e3e to be running (currently pending) +Waiting for instance i-097f358631739e01c to be running (currently pending) Sleeping for 3 seconds... [master running] -Attaching IP 35.162.175.115 to instance i-042488375c2ca1e3e -Attaching persistent data volume (vol-04d71a810478dec0d) to master -2016-11-23T02:14:59.645Z /dev/sdb i-042488375c2ca1e3e attaching vol-04d71a810478dec0d +Attaching IP 35.165.155.60 to instance i-097f358631739e01c +Attaching persistent data volume (vol-0eba023cc1874c790) to master +2016-12-13T10:56:50.378Z /dev/sdb i-097f358631739e01c attaching vol-0eba023cc1874c790 cluster "aws_kubernetes" set. user "aws_kubernetes" set. context "aws_kubernetes" set. @@ -145,32 +155,34 @@ Creating autoscaling group 0 minions started; waiting 0 minions started; waiting 0 minions started; waiting - 2 minions started; ready + 0 minions started; waiting + 3 minions started; ready Waiting for cluster initialization. This will continually check to see if the API for kubernetes is reachable. This might loop forever if there was some uncaught error during start up. -.......................................................................................................................................................................................................................Kubernetes cluster created. +...........................................................................................................................................................................Kubernetes cluster created. Sanity checking cluster... -Attempt 1 to check Docker on node @ 35.164.79.249 ...working -Attempt 1 to check Docker on node @ 35.164.83.190 ...working +Attempt 1 to check Docker on node @ 35.165.35.181 ...working +Attempt 1 to check Docker on node @ 35.165.79.208 ...working +Attempt 1 to check Docker on node @ 35.163.90.67 ...working Kubernetes cluster is running. The master is running at: - https://35.162.175.115 + https://35.165.155.60 The user name and password to use is located in /home/ec2-user/.kube/config. ... calling validate-cluster -Waiting for 2 ready nodes. 0 ready nodes, 2 registered. Retrying. -Waiting for 2 ready nodes. 1 ready nodes, 2 registered. Retrying. -Waiting for 2 ready nodes. 1 ready nodes, 2 registered. Retrying. -Found 2 node(s). -NAME STATUS AGE -ip-172-20-0-23.us-west-2.compute.internal Ready 54s -ip-172-20-0-24.us-west-2.compute.internal Ready 52s +Waiting for 3 ready nodes. 0 ready nodes, 3 registered. Retrying. +Waiting for 3 ready nodes. 0 ready nodes, 3 registered. Retrying. +Found 3 node(s). 
+NAME STATUS AGE
+ip-172-20-0-186.us-west-2.compute.internal Ready 33s
+ip-172-20-0-187.us-west-2.compute.internal Ready 34s
+ip-172-20-0-188.us-west-2.compute.internal Ready 34s
 Validate output:
 NAME STATUS MESSAGE ERROR
 scheduler Healthy ok
@@ -180,14 +192,14 @@ etcd-0 Healthy {"health": "true"}
 Cluster validation succeeded
 Done, listing cluster services:

-Kubernetes master is running at https://35.162.175.115
-Elasticsearch is running at https://35.162.175.115/api/v1/proxy/namespaces/kube-system/services/elasticsearch-logging
-Heapster is running at https://35.162.175.115/api/v1/proxy/namespaces/kube-system/services/heapster
-Kibana is running at https://35.162.175.115/api/v1/proxy/namespaces/kube-system/services/kibana-logging
-KubeDNS is running at https://35.162.175.115/api/v1/proxy/namespaces/kube-system/services/kube-dns
-kubernetes-dashboard is running at https://35.162.175.115/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard
-Grafana is running at https://35.162.175.115/api/v1/proxy/namespaces/kube-system/services/monitoring-grafana
-InfluxDB is running at https://35.162.175.115/api/v1/proxy/namespaces/kube-system/services/monitoring-influxdb
+Kubernetes master is running at https://35.165.155.60
+Elasticsearch is running at https://35.165.155.60/api/v1/proxy/namespaces/kube-system/services/elasticsearch-logging
+Heapster is running at https://35.165.155.60/api/v1/proxy/namespaces/kube-system/services/heapster
+Kibana is running at https://35.165.155.60/api/v1/proxy/namespaces/kube-system/services/kibana-logging
+KubeDNS is running at https://35.165.155.60/api/v1/proxy/namespaces/kube-system/services/kube-dns
+kubernetes-dashboard is running at https://35.165.155.60/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard
+Grafana is running at https://35.165.155.60/api/v1/proxy/namespaces/kube-system/services/monitoring-grafana
+InfluxDB is running at https://35.165.155.60/api/v1/proxy/namespaces/kube-system/services/monitoring-influxdb

 To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

@@ -197,31 +209,36 @@ Installation successful!
 ```

-By default, the script will provision a new VPC and a 4 node k8s cluster in us-west-2a (Oregon) with EC2 instances running on Debian. You can override the variables defined in `/cluster/config-default.sh` to change this behavior as follows:
+Once the cluster is up, the IP addresses of your master and node(s) will be printed, as well as information about the default services running in the cluster (monitoring, logging, dns).
+
+User credentials and security tokens are written in `~/.kube/config`; they will be necessary to use the CLI or HTTP Basic Auth.
+
-```
-export KUBE_AWS_ZONE=us-west-2a
-export NUM_NODES=2
-export MASTER_SIZE=m3.medium
-export NODE_SIZE=m3.large
-export AWS_S3_REGION=us-west-2a
-export AWS_S3_BUCKET=mycompany-kubernetes-artifacts
-export KUBE_AWS_INSTANCE_PREFIX=k8s
-...
-```

 And then concate the kubernetes binaries directory into PATH:
+
 ```
 export PATH=/platforms/linux/amd64:$PATH
 ```

-Now you can use administration tool kubectl to operate the cluster.
-By default, kubectl will use the kubeconfig file generated during the cluster startup for authenticating against the API, the location is in `~/.kube/config`.
+
+
+Now you can use the administration tool `kubectl` to operate the cluster.
+By default, `kubectl` will use the kubeconfig file generated during cluster startup to authenticate against the API; it is located at `~/.kube/config`.
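+
+For example, a quick sanity check that `kubectl` can reach the new cluster might look like this (a minimal sketch; it assumes only the kubeconfig generated above):
+
+```
+kubectl get nodes
+kubectl cluster-info
+```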
+
 ###Setup PaddlePaddle Environment on AWS

-For the design of running PaddlePaddle on Kubernetes, you really need to read [this article](https://github.com/drinktee/Paddle/blob/k8s/doc_cn/cluster/k8s/distributed_training_on_kubernetes.md) first.
+Now we've created a cluster with the following network capabilities:
+
+1. All Kubernetes nodes can communicate with each other.
+
+1. All Docker containers on Kubernetes nodes can communicate with each other.
+
+1. All Kubernetes nodes can communicate with all Docker containers on Kubernetes nodes.
+
+1. Traffic from outside the Kubernetes nodes cannot reach the Docker containers on the nodes unless services are created for those containers.

 For sharing the training data across all the Kubernetes nodes, we use EFS (Elastic File System) in AWS. Ceph might be a better solution, but it requires high version of Linux kernel that might not be stable enough at this moment. We haven't automated the EFS setup at this moment, so please do the following steps:

@@ -229,23 +246,182 @@ For sharing the training data across all the Kubernetes nodes, we use EFS (Elast

 1. Make sure you add the AmazonElasticFileSystemFullAccess policy into your AWS account.

-2. Create the Elastic File System in AWS console, and attach the Kubernetes VPC with it.
+1. Create the Elastic File System in the AWS console, and attach it to the Kubernetes VPC.
+![create_efs](create_efs.png =800x)
+
+1. Modify the Kubernetes security group under EC2 / Security Groups: add an additional inbound rule "All TCP TCP 0 - 65535 0.0.0.0/0" to the Kubernetes default VPC security group.
+![add_security_group](add_security_group.png =800x)
+
+
+1. Follow the EC2 mount instructions to mount the EFS volume onto all the Kubernetes nodes; we recommend mounting it at ~/efs.
+![efs_mount](efs_mount.png =800x)
+
+
+Before starting the training, you should place your user config and pre-divided training data onto EFS. When the training starts, each task will copy the related files from EFS into its container, and it will also write the training results back onto EFS; we will show how to place the data later in this article.
+
+
+
+###Core Concepts of PaddlePaddle Training on AWS
+
+We've now set up a 3-node distributed training cluster and attached the EFS volume to each node. In this training demo, we will create three Kubernetes pods and schedule them onto the 3 nodes. Each pod contains a PaddlePaddle container. When a container is created, it starts the pserver and trainer processes, loads the training data from the EFS volume and starts the distributed training task.
+
+####Use Kubernetes Job
+
+We use a Kubernetes job to represent one run of distributed training. After the job finishes, Kubernetes will destroy the job's containers and release all related resources.

-We can write a yaml file to describe the Kubernetes job. The file contains lots of configuration information, for example PaddlePaddle's node number, `paddle pserver` open port number, the network card info etc., these information are passed into container for processes to use as environment variables.

-3. Modify the Kubernetes security group, add additional inbound policy "All TCP TCP 0 - 65535 0.0.0.0/0" for Kubernetes default VPC security group.

+We can write a yaml file to describe the Kubernetes job. The file contains lots of configuration information, for example PaddlePaddle's node number, `paddle pserver` open port number, the network card info etc., these information are passed into container for processes to use as environment variables.

-4. Follow the EC2 mount instruction to mount the disk onto all the Kubernetes nodes, we recommend to mount EFS disk onto ~/efs.

+For each run of distributed training, the user first decides the number of PaddlePaddle nodes, then uploads the pre-divided training data and configuration files onto the EFS volume, and finally writes the Kubernetes job yaml file and submits it to the Kubernetes cluster to start the training job.
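+
+Once the job has been submitted, its status and pods can be inspected with `kubectl`, for example (a minimal sketch; `paddle-cluster-job` is the job name used later in this article):
+
+```
+kubectl get jobs
+kubectl describe job paddle-cluster-job
+kubectl get pods
+```
+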
+####Create PaddlePaddle Node
+
+After the Kubernetes master receives the request, it will parse the yaml file and create several pods (as many as the PaddlePaddle node number), and Kubernetes will schedule these pods onto the cluster's nodes. A pod represents a PaddlePaddle node; when a pod is successfully scheduled onto a physical/virtual machine, Kubernetes will start the container in the pod, and this container will use the environment variables in the yaml file to start up the `paddle pserver` and `paddle trainer` processes.
+
+
+####Start up Training
+
+After the container starts, it launches the distributed training with a startup script. The `paddle train` process needs to know the other nodes' IP addresses and its own trainer_id; since PaddlePaddle currently has no built-in service discovery, in the startup script each node uses the job's pod name to query the information of all the job's pods from the Kubernetes apiserver (the apiserver's endpoint is available in the container as an environment variable by default).
+
+With the pod information, we can assign each pod a unique trainer_id. Here we sort all the pods by pod IP, and assign each pod's index to its PaddlePaddle node as the trainer_id. The workflow of the startup script is as follows:
+
+1. Query the apiserver to get the pod information, and assign the trainer_id by sorting the IPs.
+1. Copy the training data from the EFS sharing volume into the container.
+1. Parse the `paddle pserver` and `paddle trainer` startup parameters from the environment variables, and then start up the processes.
+1. PaddlePaddle will automatically write its results on the PaddlePaddle node with trainer_id:0; we set the output path to the EFS volume so that the result data is saved there.

-And now you can place your training data onto the EFS, you should follow [this article](https://github.com/drinktee/Paddle/blob/k8s/doc_cn/cluster/k8s/distributed_training_on_kubernetes.md) to locate your data.

 ###Start PaddlePaddle Training Demo on AWS

-After setting up all the steps on AWS, We can start up our PaddlePaddle training recommendation demo by using:
+Now we'll start a PaddlePaddle training demo on AWS; the steps are as follows:
+
+1. Build the PaddlePaddle Docker image.
+1. Divide the training data file and upload it onto the EFS sharing volume.
+1. Create the training job yaml file, and start up the job.
+1. Check the results after training.
+
+####Build PaddlePaddle Docker Image
+
+The PaddlePaddle Docker image needs to provide the runtime environment for `paddle pserver` and `paddle train`, so a container using this image should have two main functions:
+
+1. Copy the training data into the container.
+1. Generate the startup parameters for the `paddle pserver` and `paddle train` processes, and start up the training.
+
+
+The official `paddledev/paddle:cpu-latest` image already includes the PaddlePaddle binary but lacks the above functionalities, so we will create a startup script on top of this image to do that work.
+The detailed Dockerfile is as follows:

 ```
-kubectl create -f job.yaml
+FROM paddledev/paddle:cpu-latest
+
+MAINTAINER zjsxzong89@gmail.com
+
+COPY start.sh /root/
+COPY start_paddle.py /root/
+CMD ["bash","-c","/root/start.sh"]
 ```

+The Dockerfile copies our `start.sh` and `start_paddle.py` files into the container and then executes the `start_paddle.py` script to start up the training; all the steps such as assigning the trainer_id and getting the other nodes' IP addresses are implemented in `start_paddle.py`.
+
+`start_paddle.py` starts by parsing the parameters:
+
+```
+parser = argparse.ArgumentParser(prog="start_paddle.py",
+                                 description='simple tool for k8s')
+    args, train_args_list = parser.parse_known_args()
+    train_args = refine_unknown_args(train_args_list)
+    train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2]))
+    podlist = getPodList()
+```
+
+It then uses the function `getPodList()` to query the information of all the job's pods from the Kubernetes apiserver. When all the pods are in the running state, it uses `getIdMap(podlist)` to get the trainer_id:
+
+```
+    podlist = getPodList()
+    # need to wait until all pods are running
+    while not isPodAllRunning(podlist):
+        time.sleep(10)
+        podlist = getPodList()
+    idMap = getIdMap(podlist)
+```
+
+In the function `getIdMap(podlist)`, we use `podlist` to get the IP address of each pod, sort the IPs, and use each pod's index as its trainer_id:
+
+```
+def getIdMap(podlist):
+    '''
+    generate trainer_id by ip
+    '''
+    ips = []
+    for pod in podlist["items"]:
+        ips.append(pod["status"]["podIP"])
+    ips.sort()
+    idMap = {}
+    for i in range(len(ips)):
+        idMap[ips[i]] = i
+    return idMap
+```
+
+After getting `idMap`, we use the function `startPaddle(idMap, train_args_dict)` to generate the `paddle pserver` and `paddle train` startup parameters and then start up the processes.
+
+In the function `startPaddle`, the most important work is to generate the `paddle pserver` and `paddle train` startup parameters. Taking `paddle train` as an example, it reads values such as `PADDLE_NIC`, `PADDLE_PORT` and `PADDLE_PORTS_NUM` from the environment, and gets the `trainer_id` from `idMap`:
+
+```
+    program = 'paddle train'
+    args = " --nics=" + PADDLE_NIC
+    args += " --port=" + str(PADDLE_PORT)
+    args += " --ports_num=" + str(PADDLE_PORTS_NUM)
+    args += " --comment=" + "paddle_process_by_paddle"
+    ip_string = ""
+    for ip in idMap.keys():
+        ip_string += (ip + ",")
+    ip_string = ip_string.rstrip(",")
+    args += " --pservers=" + ip_string
+    args_ext = ""
+    for key, value in train_args_dict.items():
+        args_ext += (' --' + key + '=' + value)
+    localIP = socket.gethostbyname(socket.gethostname())
+    trainerId = idMap[localIP]
+    args += " " + args_ext + " --trainer_id=" + \
+        str(trainerId) + " --save_dir=" + JOB_PATH_OUTPUT
+```
+
+Use `docker build` to build the Docker image:
+
+```
+docker build -t your_repo/paddle:mypaddle .
+```
+
+And then push the built image to your Docker registry:
+
+```
+docker push your_repo/paddle:mypaddle
+```
+
+####Upload Training Data File
+
+Here we will use PaddlePaddle's official recommendation demo as the content for this training. We put the training data files into a directory named after the job, located on the EFS sharing volume; the tree structure of the directory looks like:
+
+```
+efs
+└── paddle-cluster-job
+    ├── data
+    │   ├── 0
+    │   │
+    │   ├── 1
+    │   │
+    │   └── 2
+    ├── output
+    └── recommendation
+```
+
+The `paddle-cluster-job` directory is named after the job for this training, which uses 3 PaddlePaddle nodes. We store the pre-divided data under the `paddle-cluster-job/data` directory, where the directories 0, 1 and 2 correspond to the trainer_id of the 3 nodes. The training data is in the recommendation directory, and the training results and logs will be in the output directory.
+
+
+####Create Kubernetes Job
+
+Kubernetes uses a yaml file to describe the job's details, which is then submitted with the command line tool to create the job in the Kubernetes cluster.
+
+In the yaml file, we describe the Docker image used for this training, the number of nodes to start up, the volume mounting information and all the necessary parameters for the `paddle pserver` and `paddle train` processes.
+
 The yaml file content is as follows:

 ```
@@ -297,14 +473,93 @@ spec:
       restartPolicy: Never
 ```

-It will generate three PaddlePaddle job runing on distributed Kubernetes nodes, and all the training result will be written into EFS.

-We've made an experiment of running this PaddlePaddle recommendation training demo on three 2 core 8 GB machine (m3.large), and it took 8 hours to generate 10 models.
+In the yaml file, the metadata's name is the job's name. `parallelism` and `completions` mean that this job will simultaneously start up 3 PaddlePaddle nodes, and that the job will be finished when there are 3 finished pods. For the data volume, we declare the path jobpath, which mounts /home/admin/efs on the host machine into the container at /home/jobpath; so inside the container, /home/jobpath actually stores its data on the EFS sharing volume.
+
+The `env` field defines the container's environment variables; we use it to pass the PaddlePaddle parameters into the containers.
+
+`JOB_PATH` represents the sharing volume path, `JOB_NAME` represents the job name and `TRAIN_CONFIG_DIR` represents the training data file directory; we can use these three parameters to derive the file paths for this training.
+
+`CONF_PADDLE_NIC` represents the `paddle pserver` process's `--nics` parameter, the NIC name.
+`CONF_PADDLE_PORT` represents the `paddle pserver` process's `--port` parameter, and `CONF_PADDLE_PORTS_NUM` represents the `--ports_num` parameter.
+
+`CONF_PADDLE_PORTS_NUM_SPARSE` represents the number of ports for sparse updates, the `--ports_num_for_sparse` parameter.
+
+`CONF_PADDLE_GRADIENT_NUM` represents the number of training nodes, the `--num_gradient_servers` parameter.
+
+After we create the yaml file, we can use the Kubernetes command line tool to create the job in the cluster:
+
+```
+kubectl create -f job.yaml
+```
+
+After we execute the above command, Kubernetes will create 3 pods, pull the PaddlePaddle image and then start up the containers for training.
+
+
+
+####Check Training Results
+
+During the training, we can see the logs and models on the EFS sharing volume; the output directory contains the training results. (Caution: the node_0, node_1 and node_2 directories represent the PaddlePaddle nodes and their trainer_id, not the Kubernetes nodes.)
+
+```
+[root@paddle-kubernetes-node0 output]# tree -d
+.
+├── node_0
+│ ├── server.log
+│ └── train.log
+├── node_1
+│ ├── server.log
+│ └── train.log
+├── node_2
+......
+├── pass-00002
+│ ├── done
+│ ├── ___embedding_0__.w0
+│ ├── ___embedding_1__.w0
+......
+```
+
+We can always check a container's training status through its logs, for example:
+
+```
+[root@paddle-kubernetes-node0 node_0]# cat train.log
+I1116 09:10:17.123121 50 Util.cpp:155] commandline:
+ /usr/local/bin/../opt/paddle/bin/paddle_trainer
+ --nics=eth0 --port=7164
+ --ports_num=2 --comment=paddle_process_by_paddle
+ --pservers=192.168.129.66,192.168.223.143,192.168.129.71
+ --ports_num_for_sparse=2 --config=./trainer_config.py
+ --trainer_count=4 --num_passes=10 --use_gpu=0
+ --log_period=50 --dot_period=10 --saving_period=1
+ --local=0 --trainer_id=0
+ --save_dir=/home/jobpath/paddle-cluster-job/output
+I1116 09:10:17.123440 50 Util.cpp:130] Calling runInitFunctions
+I1116 09:10:17.123764 50 Util.cpp:143] Call runInitFunctions done.
+[WARNING 2016-11-16 09:10:17,227 default_decorators.py:40] please use keyword arguments in paddle config.
+[INFO 2016-11-16 09:10:17,239 networks.py:1282] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating]
+[INFO 2016-11-16 09:10:17,239 networks.py:1289] The output order is [__regression_cost_0__]
+I1116 09:10:17.392917 50 Trainer.cpp:170] trainer mode: Normal
+I1116 09:10:17.613910 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process
+I1116 09:10:17.680917 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process
+I1116 09:10:17.681543 50 GradientMachine.cpp:134] Initing parameters..
+I1116 09:10:18.012390 50 GradientMachine.cpp:141] Init parameters done.
+I1116 09:10:18.018641 50 ParameterClient2.cpp:122] pserver 0 192.168.129.66:7164
+I1116 09:10:18.018950 50 ParameterClient2.cpp:122] pserver 1 192.168.129.66:7165
+I1116 09:10:18.019069 50 ParameterClient2.cpp:122] pserver 2 192.168.223.143:7164
+I1116 09:10:18.019492 50 ParameterClient2.cpp:122] pserver 3 192.168.223.143:7165
+I1116 09:10:18.019716 50 ParameterClient2.cpp:122] pserver 4 192.168.129.71:7164
+I1116 09:10:18.019836 50 ParameterClient2.cpp:122] pserver 5 192.168.129.71:7165
+```
+
+It takes around 8 hours to run this PaddlePaddle recommendation training demo on three 2-core 8 GB EC2 machines (m3.large), and the result is 8 trained models.

 ###Kubernetes Cluster Tear Down

-
-If you want to tear down the running cluster:
+
+
+If you want to tear down the running cluster, make sure to *delete* the EFS volume first, and then use the following command:
+
 ```
 export KUBERNETES_PROVIDER=aws; /cluster/kube-down.sh
 ```

 This process takes about 2 to 5 minutes.
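+
+The EFS clean-up mentioned above can be done in the AWS console, or with the AWS CLI along these lines (a sketch; the file-system and mount-target IDs are placeholders for your own volume):
+
+```
+aws efs describe-mount-targets --file-system-id fs-12345678
+aws efs delete-mount-target --mount-target-id fsmt-12345678
+aws efs delete-file-system --file-system-id fs-12345678
+```
+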
``` -[ec2-user@ip-172-31-24-50 ~]$ export KUBERNETES_PROVIDER=aws; ./kubernetes/cluster/kube-down.sh +ec2-user@ip-172-31-27-229 ~]$ export KUBERNETES_PROVIDER=aws; ./kubernetes/cluster/kube-down.sh Bringing down cluster using provider: aws -Deleting instances in VPC: vpc-fad1139d +Deleting instances in VPC: vpc-e01fc087 Deleting auto-scaling group: kubernetes-minion-group-us-west-2a Deleting auto-scaling launch configuration: kubernetes-minion-group-us-west-2a Deleting auto-scaling group: kubernetes-minion-group-us-west-2a +Deleting auto-scaling group: kubernetes-minion-group-us-west-2a Waiting for instances to be deleted -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) +Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down) Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) +Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down) Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) +Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down) Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) +Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down) Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) +Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down) Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) +Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down) Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) +Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down) Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) +Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down) Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) -Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) -Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) -Sleeping for 3 seconds... -Waiting for instance i-09d7e8824ef1f8384 to be terminated (currently shutting-down) +Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down) Sleeping for 3 seconds... 
All instances deleted -Releasing Elastic IP: 35.162.175.115 -Deleting volume vol-04d71a810478dec0d -Cleaning up resources in VPC: vpc-fad1139d -Cleaning up security group: sg-d9280ba0 -Cleaning up security group: sg-dc280ba5 -Deleting security group: sg-d9280ba0 -Deleting security group: sg-dc280ba5 -Deleting VPC: vpc-fad1139d +Releasing Elastic IP: 35.165.155.60 +Deleting volume vol-0eba023cc1874c790 +Cleaning up resources in VPC: vpc-e01fc087 +Cleaning up security group: sg-9a7564e3 +Cleaning up security group: sg-a47564dd +Deleting security group: sg-9a7564e3 +Deleting security group: sg-a47564dd +Deleting VPC: vpc-e01fc087 Done ``` -## For experts with Kubernetes and AWS +## For Experts with Kubernetes and AWS Sometimes we might need to create or manage the cluster on AWS manually with limited privileges, so here we will explain more on what’s going on with the Kubernetes setup script. @@ -386,3 +636,4 @@ Sometimes we might need to create or manage the cluster on AWS manually with lim +