Here we will show you, step by step, how to run PaddlePaddle training on an AWS cluster.
###AWS Login
###Download kube-aws and kubectl
Import the CoreOS Application Signing Public Key:
gpg2 --keyserver --recv-key FC8A365E
Validate the key fingerprint:
gpg2 --fingerprint FC8A365E
The correct key fingerprint is `18AD 5014 C99E F7E3 BA5F 6CE9 50BD D3E0 FC8A 365E`
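The fingerprint check can also be scripted so that a mismatch stops the setup early. The following is a minimal sketch, not part of gpg or kube-aws: the helper name `fingerprint_matches` is hypothetical, and it assumes the key has already been imported. It strips whitespace before comparing because gpg versions group the hex digits differently.

```shell
# Expected fingerprint, copied from the text above.
EXPECTED_FPR="18AD 5014 C99E F7E3 BA5F 6CE9 50BD D3E0 FC8A 365E"

# Hypothetical helper: succeeds only if the output of
# `gpg2 --fingerprint <key id>` contains the expected fingerprint
# (all whitespace is ignored on both sides).
fingerprint_matches() {
  key_id="$1"
  expected="$(printf '%s' "$2" | tr -d '[:space:]')"
  gpg2 --fingerprint "$key_id" | tr -d '[:space:]' | grep -qF "$expected"
}

# Usage (after importing the key):
#   fingerprint_matches FC8A365E "$EXPECTED_FPR" || echo "fingerprint mismatch" >&2
```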
Go to the [releases]( and download the latest release tarball and detached signature (.sig) for your architecture.
User credentials and security tokens will be generated later in the user directory, not in `~/.kube/config`; they are needed to use the CLI or HTTP Basic Auth.
###Configure AWS Credentials
First check out [this]( for installing the AWS command line interface. If you use an EC2 instance with the default Amazon AMI, the CLI tool is already installed on your machine.
@@ -58,7 +115,7 @@ aws configure
Fill in the required fields (you can get your AWS access key ID and AWS secret access key by following [this]( instruction):
Test that your credentials work by describing any instances you may already have running on your account:
aws ec2 describe-instances
###Define Cluster Parameters
####EC2 key pair
This is the key pair that will authenticate SSH access to your EC2 instances. The public half of this key pair will be configured on each CoreOS node.
After creating a key pair, you will use the name you gave the keys to configure the cluster. Key pairs are only available to EC2 instances in the same region. More info in the [EC2 Keypair docs](
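If you prefer the command line to the console, a key pair can be created with `aws ec2 create-key-pair`. The sketch below is only an illustration: the helper name `create_key_pair`, the key name and the region are hypothetical placeholders, and it assumes your AWS credentials are already configured.

```shell
# Hypothetical helper: create an EC2 key pair in the given region and
# save the private key to <key name>.pem in the current directory.
create_key_pair() {
  key_name="$1"
  region="$2"
  aws ec2 create-key-pair \
    --region "$region" \
    --key-name "$key_name" \
    --query 'KeyMaterial' \
    --output text > "${key_name}.pem" &&
  chmod 400 "${key_name}.pem"   # ssh refuses private keys readable by others
}

# Usage (placeholder values):
#   create_key_pair my-key us-west-2
```

Remember that the key pair must be created in the same region as the cluster.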
####KMS key
And then execute the following command after your aws login:
Amazon KMS keys are used to encrypt and decrypt cluster TLS assets. If you already have a KMS Key that you would like to use, you can skip creating a new key and provide the Arn string for your existing key.
By default, this command will download and unzip the latest Kubernetes release package and execute the script inside to provision a new VPC (virtual private cloud) and a four t2.micro node cluster in us-west-2a (Oregon) under that VPC.
You can override the variables defined in `<path/to/kubernetes-directory>/cluster/` as follows:
Using SSH key with (AWS) fingerprint: 70:66:c6:3d:53:3b:e5:3d:1d:7f:cd:c9:d1:87:35:81
Creating vpc.
Adding tag to vpc-e01fc087: Name=kubernetes-vpc
Adding tag to vpc-e01fc087: KubernetesCluster=kubernetes
Using VPC vpc-e01fc087
Adding tag to dopt-807151e4: Name=kubernetes-dhcp-option-set
Adding tag to dopt-807151e4: KubernetesCluster=kubernetes
Using DHCP option set dopt-807151e4
Creating subnet.
Adding tag to subnet-4a9a642d: KubernetesCluster=kubernetes
Using subnet subnet-4a9a642d
Creating Internet Gateway.
Using Internet Gateway igw-821a73e6
Associating route table.
Creating route table
Adding tag to rtb-0d96fa6a: KubernetesCluster=kubernetes
Associating route table rtb-0d96fa6a to subnet subnet-4a9a642d
Adding route to route table rtb-0d96fa6a
Using Route Table rtb-0d96fa6a
Creating master security group.
Creating security group kubernetes-master-kubernetes.
Adding tag to sg-a47564dd: KubernetesCluster=kubernetes
Creating minion security group.
Creating security group kubernetes-minion-kubernetes.
Adding tag to sg-9a7564e3: KubernetesCluster=kubernetes
Using master security group: kubernetes-master-kubernetes sg-a47564dd
Using minion security group: kubernetes-minion-kubernetes sg-9a7564e3
Creating master disk: size 20GB, type gp2
Adding tag to vol-0eba023cc1874c790: Name=kubernetes-master-pd
Adding tag to vol-0eba023cc1874c790: KubernetesCluster=kubernetes
Allocated Elastic IP for master:
Adding tag to vol-0eba023cc1874c790:
Generating certs for alternate-names: IP:,IP:,IP:,DNS:kubernetes,DNS:kubernetes.default,DNS:kubernetes.default.svc,DNS:kubernetes.default.svc.cluster.local,DNS:kubernetes-master
Starting Master
Adding tag to i-097f358631739e01c: Name=kubernetes-master
Adding tag to i-097f358631739e01c: Role=kubernetes-master
Adding tag to i-097f358631739e01c: KubernetesCluster=kubernetes
Waiting for master to be ready
Attempt 1 to check for master nodeWaiting for instance i-097f358631739e01c to be running (currently pending)
Sleeping for 3 seconds...
Waiting for instance i-097f358631739e01c to be running (currently pending)
Sleeping for 3 seconds...
Waiting for instance i-097f358631739e01c to be running (currently pending)
Sleeping for 3 seconds...
Waiting for instance i-097f358631739e01c to be running (currently pending)
Sleeping for 3 seconds...
[master running]
Attaching IP to instance i-097f358631739e01c
Attaching persistent data volume (vol-0eba023cc1874c790) to master
Kubernetes master is running at
Elasticsearch is running at
Heapster is running at
Kibana is running at
KubeDNS is running at
kubernetes-dashboard is running at
Grafana is running at
InfluxDB is running at
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Kubernetes binaries at /home/ec2-user/kubernetes/cluster/
You may want to add this directory to your PATH in $HOME/.profile
Installation successful!
Once the cluster is up, the IP addresses of your master and node(s) will be printed, as well as information about the default services running in the cluster (monitoring, logging, dns).
User credentials and security tokens are written in `~/.kube/config`, they will be necessary to use the CLI or the HTTP Basic Auth.
Then append the Kubernetes binaries directory to your PATH:
You can create a KMS key in the AWS console, or with the aws command line tool:
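As a sketch of the command line route, the helper below wraps `aws kms create-key` and prints only the ARN of the new key, which is the value needed later. The helper name `create_kms_key`, the description text and the region are assumptions made for illustration.

```shell
# Hypothetical helper: create a KMS key in the given region and print its ARN.
create_kms_key() {
  region="$1"
  aws kms create-key \
    --region "$region" \
    --description "kube-aws assets" \
    --query 'KeyMetadata.Arn' \
    --output text
}

# Usage (placeholder region):
#   KMS_KEY_ARN="$(create_kms_key us-west-2)"
```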
When the cluster is created, the controller will expose the TLS-secured API on a public IP address. You will need to create an A record for the external DNS hostname you want to point to this IP address. You can find the API external IP address after the cluster is created by invoking `kube-aws status`.
####S3 bucket
You need to create an S3 bucket before starting up the Kubernetes cluster.
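One way to create the bucket is with `aws s3api create-bucket`. The sketch below uses a hypothetical helper name and placeholder bucket and region values. It handles one known quirk: for `us-east-1` the `LocationConstraint` must be omitted, while every other region requires it.

```shell
# Hypothetical helper: create an S3 bucket in the given region.
create_bucket() {
  bucket="$1"
  region="$2"
  if [ "$region" = "us-east-1" ]; then
    # us-east-1 is the default location and rejects a LocationConstraint.
    aws s3api create-bucket --region "$region" --bucket "$bucket"
  else
    aws s3api create-bucket --region "$region" --bucket "$bucket" \
      --create-bucket-configuration "LocationConstraint=$region"
  fi
}

# Usage (placeholder values):
#   create_bucket my-kube-aws-bucket us-west-2
```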
Now you can use the Kubernetes administration tool `kubectl` to operate the cluster. Let's give `kubectl get nodes` a try.
####Initialize an asset directory
Create a directory on your local machine to hold the generated assets:
$ mkdir my-cluster
$ cd my-cluster
Initialize the cluster CloudFormation stack with the KMS Arn, key pair name, and DNS name from the previous step:
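A sketch of the initialization call is shown below. The flag names follow the kube-aws documentation of this period, but they vary between releases, so check `kube-aws init --help` for the version you downloaded. All values are hypothetical placeholders, and the wrapper function `init_cluster` exists only for illustration.

```shell
# Placeholder values; replace them with the ones from the previous steps.
CLUSTER_NAME="my-cluster"
EXTERNAL_DNS_NAME="k8s.example.com"
AWS_REGION="us-west-2"
AVAILABILITY_ZONE="us-west-2a"
KEY_NAME="my-key"
KMS_KEY_ARN="arn:aws:kms:us-west-2:123456789012:key/EXAMPLE"

# Hypothetical wrapper: run inside the asset directory to generate cluster.yaml.
init_cluster() {
  kube-aws init \
    --cluster-name="$CLUSTER_NAME" \
    --external-dns-name="$EXTERNAL_DNS_NAME" \
    --region="$AWS_REGION" \
    --availability-zone="$AVAILABILITY_ZONE" \
    --key-name="$KEY_NAME" \
    --kms-key-arn="$KMS_KEY_ARN"
}

# Usage:
#   init_cluster
```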
There will now be a cluster.yaml file in the asset directory. This is the main configuration file for your cluster.
####Render contents of the asset directory
In the simplest case, you can have kube-aws generate both your TLS identities and certificate authority for you.
$ kube-aws render credentials --generate-ca
The next command generates the default set of cluster assets in your asset directory.
$ kube-aws render stack
Here's what the directory structure looks like:
$ tree
├── cluster.yaml
├── credentials
│   ├── admin-key.pem
│   ├── admin.pem
│   ├── apiserver-key.pem
│   ├── apiserver.pem
│   ├── ca-key.pem
│   ├── ca.pem
│   ├── worker-key.pem
│   ├── worker.pem
│   ├── etcd-key.pem
│   ├── etcd.pem
│   ├── etcd-client-key.pem
│   └── etcd-client.pem
├── kubeconfig
├── stack-template.json
└── userdata
    ├── cloud-config-controller
    └── cloud-config-worker
These assets (templates and credentials) are used to create, update and interact with your Kubernetes cluster.
###Kubernetes Cluster Start Up
####Create the instances defined in the CloudFormation template
Now for the exciting part, creating your cluster:
$ kube-aws up --s3-uri s3://<your-bucket-name>/<prefix>
####Configure DNS
If necessary, you can invoke `kube-aws status` to get the cluster API endpoint after the cluster is created. This command can take a while. Then run `dig` on the load balancer hostname to get its IP address, and use this IP to set up an A record for your external DNS name.
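The lookup step can be sketched as follows. The helper name `resolve_endpoint_ips` and the hostname are hypothetical. Because `dig +short` may print intermediate CNAME targets before the addresses, the helper keeps only lines that look like IPv4 addresses.

```shell
# Hypothetical helper: print the IPv4 addresses behind a hostname.
resolve_endpoint_ips() {
  dig +short "$1" | grep -E '^[0-9]+(\.[0-9]+){3}$'
}

# Usage (placeholder hostname, taken from the `kube-aws status` output):
#   resolve_endpoint_ips my-cluster-elb.us-west-2.elb.amazonaws.com
```

Note that the addresses of an AWS load balancer can change over time, so an A record set up this way may need to be refreshed.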
@@ -277,7 +330,7 @@ For sharing the training data across all the Kubernetes nodes, we use EFS (Elast
1. Make sure you added AmazonElasticFileSystemFullAccess policy in your group.
1. Create the Elastic File System in AWS console, and attach the Kubernetes VPC with it.
1. Create the Elastic File System in AWS console, and attach the new VPC with it.
@@ -295,7 +348,7 @@ Before starting the training, you should place your user config and divided trai
###Core Concept of PaddlePaddle Training on AWS
Now we've set up a 3-node distributed training cluster, with the EFS volume attached on each node. In this training demo, we will create three Kubernetes pods and schedule them on the 3 nodes. Each pod contains a PaddlePaddle container. When the container is created, it starts the pserver and trainer processes, loads the training data from the EFS volume, and starts the distributed training task.
Now we've set up a 3-node distributed Kubernetes cluster, with the EFS volume attached on each node. In this training demo, we will create three Kubernetes pods and schedule them on the 3 nodes. Each pod contains a PaddlePaddle container. When the container is created, it starts the pserver and trainer processes, loads the training data from the EFS volume, and starts the distributed training task.
####Use Kubernetes Job
@@ -307,7 +360,7 @@ In one time of distributed training, user will confirm the PaddlePaddle node num
####Create PaddlePaddle Node
After the Kubernetes master gets the request, it parses the yaml file and creates several pods (PaddlePaddle's node number), which Kubernetes allocates onto the cluster's nodes. A pod represents a PaddlePaddle node. When a pod is successfully allocated onto a physical/virtual machine, Kubernetes starts up the container in the pod, and this container uses the environment variables in the yaml file to start up the `paddle pserver` and `paddle trainer` processes.
After the Kubernetes master gets the request, it parses the yaml file and creates several pods (defined by PaddlePaddle's node number), which Kubernetes allocates onto the cluster's nodes. A pod represents a PaddlePaddle node. When a pod is successfully allocated onto a physical/virtual machine, Kubernetes starts up the container in the pod, and this container uses the environment variables in the yaml file to start up the `paddle pserver` and `paddle trainer` processes.
It'll take around 8 hours to run this PaddlePaddle recommendation training demo on three 2-core, 8 GB EC2 machines (m3.large), and the result will be 10 trained models.
It'll take around 8 hours to finish this PaddlePaddle recommendation training demo on three 2-core, 8 GB EC2 machines (m3.large).
###Kubernetes Cluster Tear Down
@@ -592,51 +645,13 @@ It'll take around 8 hours to run this PaddlePaddle recommendation training demo
If you want to tear down the whole Kubernetes cluster, make sure to *delete* the EFS volume first (otherwise, you will get stuck on the following steps), and then use the following command:
Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down)
Sleeping for 3 seconds...
Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down)
Sleeping for 3 seconds...
Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down)
Sleeping for 3 seconds...
Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down)
Sleeping for 3 seconds...
Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down)
Sleeping for 3 seconds...
Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down)
Sleeping for 3 seconds...
Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down)
Sleeping for 3 seconds...
Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down)
Sleeping for 3 seconds...
Waiting for instance i-04e973f1d6d56d580 to be terminated (currently shutting-down)
Sleeping for 3 seconds...
All instances deleted
Releasing Elastic IP:
Deleting volume vol-0eba023cc1874c790
Cleaning up resources in VPC: vpc-e01fc087
Cleaning up security group: sg-9a7564e3
Cleaning up security group: sg-a47564dd
Deleting security group: sg-9a7564e3
Deleting security group: sg-a47564dd
Deleting VPC: vpc-e01fc087
It's an async call, and it might take 5 minutes to tear down the whole cluster.
If you created any Kubernetes Services of type LoadBalancer, you must delete these first, as the CloudFormation stack cannot be fully destroyed if any externally-managed resources still exist.
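To check for such services before tearing down, something like the sketch below may help. The helper name `list_loadbalancer_services` is hypothetical. It asks `kubectl` for a fixed three-column listing and prints `namespace/name` for every Service of type LoadBalancer.

```shell
# Hypothetical helper: list all Services of type LoadBalancer as namespace/name.
list_loadbalancer_services() {
  kubectl get svc --all-namespaces --no-headers \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,TYPE:.spec.type |
    awk '$3 == "LoadBalancer" { print $1 "/" $2 }'
}

# Usage: delete whatever is listed, then tear down the cluster, e.g.
#   list_loadbalancer_services
#   kubectl delete svc <name> --namespace <namespace>
#   kube-aws destroy
```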
## For Experts with Kubernetes and AWS
@@ -645,23 +660,7 @@ Sometimes we might need to create or manage the cluster on AWS manually with lim
### Some Presumptions
* Instances run on Debian (the official AMI), and the filesystem is aufs instead of ext4.
* Instances run on CoreOS (the official AMI).
* Kubernetes nodes use instance storage; no EBS volume is mounted. The master uses a persistent volume for etcd.
* Kubernetes nodes use instance storage; no EBS volume is mounted. Etcd runs on an additional node.
* Nodes run in an Auto Scaling Group on AWS. Auto-scaling itself is disabled, but if a node gets terminated, the group launches another node to replace it.
* For networking, we use the Flannel network at the moment; we will use the Calico solution later on.
* For networking, we use the ip-per-pod model here; each pod gets assigned a /24 CIDR, and the whole VPC is a /16 CIDR. There is no overlay network at the moment; we will use the Calico solution later on.
* When you create a service with Type=LoadBalancer, Kubernetes will create an ELB, and create a security group for the ELB.
* Kube-proxy sets up two IAM roles, one for master called kubernetes-master, one for nodes called kubernetes-node.
* All AWS resources are tagged with a tag named "KubernetesCluster", with a value that is the unique cluster-id.
###Script Detailed Steps
* Create an s3 bucket for binaries and scripts.
* Create two iam roles: kubernetes-master, kubernetes-node.
* Create an AWS SSH key named kubernetes-YOUR_RSA_FINGERPRINT.
* Create a vpc with CIDR, and enable the dns-support and dns-hostnames options in the vpc settings.
* Create Internet gateway, route table, a subnet with CIDR of, and associate the subnet to the route table.
* Create and configure security group for master and nodes.
* Create an EBS volume for the master; it will be attached after the master node comes up.
* Launch the master with a fixed ip address. The node is initialized with a Salt script, and all the components are started as docker containers.
* Create an auto-scaling group with a min and max size, which can be changed using the aws api or console. It automatically launches the kubernetes nodes; each node configures itself, connects to the master, and is assigned an internal CIDR, and the master configures the route table with the assigned CIDR.