diff --git a/doc/howto/usage/k8s/k8s_aws_en.md b/doc/howto/usage/k8s/k8s_aws_en.md index 8a15a9583eb4ff96ae4a56bb6bd622490ba1d499..ce5ccbca5ee33b334872d9bdde42ac2ede5d041c 100644 --- a/doc/howto/usage/k8s/k8s_aws_en.md +++ b/doc/howto/usage/k8s/k8s_aws_en.md @@ -1,3 +1,4 @@ + # Distributed PaddlePaddle Training on AWS with Kubernetes We will show you step by step on how to run distributed PaddlePaddle training on AWS cluster with Kubernetes. Let's start from core concepts. @@ -43,6 +44,12 @@ We rank each pod by sorting them by their ips. The rank of each pod could be the ## PaddlePaddle on AWS with Kubernetes +### Choose AWS Service Region +This tutorial requires several AWS services work in the same region. Before we create anything in AWS, please check the following link +https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/ +Choose a region which has the following services available: EC2, EFS, VPS, CloudFormation, KMS, VPC, S3. +In this tutorial, we use "Oregon(us-west-2)" as example. + ### Create AWS Account and IAM Account Under each AWS account, we can create multiple [IAM](http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html) users. This allows us to grant some privileges to each IAM user and to create/operate AWS clusters as an IAM user. @@ -73,7 +80,8 @@ Please be aware that this tutorial needs the following privileges for the user i #### kube-aws [kube-aws](https://github.com/coreos/kube-aws) is a CLI tool to automate cluster deployment to AWS. - +##### Verify kube-aws integrity +Note: if you are using a non-official release (e.g RC release) kube-aws, you can skip this setp. Import the CoreOS Application Signing Public Key: ``` @@ -98,7 +106,7 @@ PLATFORM=darwin-amd64 gpg2 --verify kube-aws-${PLATFORM}.tar.gz.sig kube-aws-${PLATFORM}.tar.gz ``` - +##### Install kube-aws Extract the binary: ``` @@ -241,14 +249,15 @@ Paste into following inline policies: ] } ``` - +`Version` : Its value has to be exactly "2012-10-17". `AWS_ACCOUNT_ID`: You can get it from following command line: ``` aws sts get-caller-identity --output text --query Account ``` -`MY_CLUSTER_NAME`: Pick a MY_CLUSTER_NAME that you like, you will use it later as well. +`MY_CLUSTER_NAME`: Pick a MY_CLUSTER_NAME that you like, you will use it later as well. +Please note, stack name must satisfy regular expression pattern: [a-zA-Z][-a-zA-Z0-9*]*, which means no "_" or "-" in stack name, or kube-aws will throw error in later steps. #### External DNS name @@ -256,7 +265,7 @@ When the cluster is created, the controller will expose the TLS-secured API on a DNS name should have a CNAME points to cluster DNS name or an A record points to the cluster IP address. -We will need to use DNS name later in tutorial. +We will need to use DNS name later in tutorial. If you don't already own one, you can choose any DNS name (e.g., `paddle`) and modify `/etc/hosts` to associate cluster IP with that DNS name for your local machine. And add name service (route53) in aws to associate the IP to paddle for cluster. We will find the cluster IP in later steps. #### S3 bucket @@ -364,6 +373,26 @@ paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. 59 IN A 5 In the above output, both ip `54.241.164.52`, `54.67.102.112` will work. +*If you own a DNS name*, set the A record to any of the above ip. Then you can skip to the step "Access the cluster". + +*If you do not own a DNS name*: +##### Update local DNS association +Edit `/etc/hosts` to associate above ip with the DNS name. +##### Add Route53 private name service in VPC + - Open [Route53 Console](https://console.aws.amazon.com/route53/home) + - Create hosted zone with following config + - Domain name: "paddle" + - Type: "Private hosted zone for amazon VPC" + - VPC ID: + - Add A record + - Click on the zone "paddle" just created + - Click the button "Create record set" + - Name : leave blank + - type: "A" + - Value: + - Verify name service + - Connect to any instance created by kube-aws via ssh + - Run command "host paddle", see if the ip returned is the private ip of kube-controller #### Access the cluster