diff --git a/doc/howto/usage/k8s/k8s_aws_en.md b/doc/howto/usage/k8s/k8s_aws_en.md index a6422b9be00e210a6a305260585520acd72fb2f1..8a15a9583eb4ff96ae4a56bb6bd622490ba1d499 100644 --- a/doc/howto/usage/k8s/k8s_aws_en.md +++ b/doc/howto/usage/k8s/k8s_aws_en.md @@ -1,6 +1,49 @@ -# Kubernetes on AWS +# Distributed PaddlePaddle Training on AWS with Kubernetes -## Create AWS Account and IAM Account +We will show you step by step how to run distributed PaddlePaddle training on an AWS cluster with Kubernetes. Let's start with the core concepts. + +## Distributed PaddlePaddle Training Core Concepts + +### Distributed Training Job + +A distributed training job is represented by a [Kubernetes job](https://kubernetes.io/docs/user-guide/jobs/#what-is-a-job). + +Each Kubernetes job is described by a job config file, which specifies information such as the number of [pods](https://kubernetes.io/docs/user-guide/pods/#what-is-a-pod) in the job and environment variables. + +In a distributed training job, we would: + +1. prepare partitioned training data and configuration file on a distributed file system (in this tutorial we use Amazon Elastic File System), and +1. create and submit the Kubernetes job config to the Kubernetes cluster to start the training job. + +### Parameter Servers and Trainers + +There are two roles in a PaddlePaddle cluster: *parameter server (pserver)* and *trainer*. Each parameter server process maintains a shard of the global model. Each trainer has its local copy of the model, and uses its local data to update the model. During the training process, trainers send model updates to parameter servers, and parameter servers are responsible for aggregating these updates, so that trainers can synchronize their local copy with the global model. + +
![Model is partitioned into two shards. Managed by two parameter servers respectively.](src/pserver_and_trainer.png)
+ +In order to communicate with pserver, trainer needs to know the ip address of each pserver. In kubernetes it's better to use a service discovery mechanism (e.g., DNS hostname) rather than static ip address, since any pserver's pod may be killed and a new pod could be scheduled onto another node of different ip address. However, now we are using static ip. This will be improved. + +Parameter server and trainer are packaged into the same docker image. They will run once the pod is scheduled by the kubernetes job. + +### Trainer ID + +Each trainer process requires a trainer ID, a zero-based index value, passed in as a command-line parameter. The trainer process thus reads the data partition indexed by this ID. + +### Training + +The entry-point of a container is a shell script. It can see some environment variables pre-defined by Kubernetes. This includes one that gives the job's identity, which can be used in a remote call to the Kubernetes apiserver that lists all pods in the job. + +We rank each pod by sorting them by their ips. The rank of each pod could be the "pod ID". Because we run one trainer and one parameter server in each pod, we can use this "pod ID" as the trainer ID. A detailed workflow of the entry-point script is as follows: + +1. Query the api server to get pod information, and assign the `trainer_id` by sorting the ip. +1. Copy the training data from EFS persistent volume into container. +1. Parse the `paddle pserver` and `paddle trainer` startup parameters from environment variables, and then start up the processes. +1. Trainer with `trainer_id` 0 will automatically write results onto EFS volume. + + +## PaddlePaddle on AWS with Kubernetes + +### Create AWS Account and IAM Account Under each AWS account, we can create multiple [IAM](http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html) users. This allows us to grant some privileges to each IAM user and to create/operate AWS clusters as an IAM user. 
@@ -25,11 +68,6 @@ Please be aware that this tutorial needs the following privileges for the user i - AWSKeyManagementServicePowerUser -## PaddlePaddle on AWS - -Here we will show you step by step on how to run PaddlePaddle training on AWS cluster. - - ### Download kube-aws and kubectl #### kube-aws @@ -103,7 +141,6 @@ And then configure your AWS account information: ``` aws configure - ``` @@ -113,7 +150,7 @@ Fill in the required fields: ``` AWS Access Key ID: YOUR_ACCESS_KEY_ID AWS Secrete Access Key: YOUR_SECRETE_ACCESS_KEY -Default region name: us-west-1 +Default region name: us-west-2 Default output format: json ``` @@ -131,25 +168,28 @@ aws ec2 describe-instances The keypair that will authenticate SSH access to your EC2 instances. The public half of this key pair will be configured on each CoreOS node. -Follow [EC2 Keypair docs](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) to create a EC2 key pair +Follow [EC2 Keypair User Guide](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) to create an EC2 key pair After creating a key pair, you will use the key pair name to configure the cluster. -Key pairs are only available to EC2 instances in the same region. We are using us-west-1 in our tutorial, so make sure to creat key pairs in that region (N. California). +Key pairs are only available to EC2 instances in the same region. We are using us-west-2 in our tutorial, so make sure to create key pairs in that region (Oregon). + +Your browser will download a `key-name.pem` file which is the key to access the EC2 instances. We will use it later. + #### KMS key Amazon KMS keys are used to encrypt and decrypt cluster TLS assets. If you already have a KMS Key that you would like to use, you can skip creating a new key and provide the Arn string for your existing key. 
-You can create a KMS key in the AWS console, or with the aws command line tool: +You can create a KMS key with the aws command line tool: ``` -aws kms --region=us-west-1 create-key --description="kube-aws assets" +aws kms --region=us-west-2 create-key --description="kube-aws assets" { "KeyMetadata": { "CreationDate": 1458235139.724, "KeyState": "Enabled", - "Arn": "arn:aws:kms:us-west-1:aaaaaaaaaaaaa:key/xxxxxxxxxxxxxxxxxxx", + "Arn": "arn:aws:kms:us-west-2:aaaaaaaaaaaaa:key/xxxxxxxxxxxxxxxxxxx", "AWSAccountId": "xxxxxxxxxxxxx", "Enabled": true, "KeyUsage": "ENCRYPT_DECRYPT", @@ -161,14 +201,14 @@ aws kms --region=us-west-1 create-key --description="kube-aws assets" We will need to use the value of `Arn` later. -And then you need to add several inline policies in your user permission. +And then let's add several inline policies in your IAM user permission. -Go to IAM user page, click on `Add inline policy` button, and then select `Custom Policy` +Go to [IAM Console](https://console.aws.amazon.com/iam/home?region=us-west-2#/home). Click the `Users` button, click the user that we just created, then click the `Add inline policy` button and select `Custom Policy`. -paste into following inline policies: +Paste in the following inline policies: ``` { "Version": "2012-10-17", "Statement": [ { @@ -195,7 +235,7 @@ paste into following inline policies: "cloudformation:DescribeStackEvents" ], "Resource": [ - "arn:aws:cloudformation:us-west-1:AWS_ACCOUNT_ID:stack/MY_CLUSTER_NAME/*" + "arn:aws:cloudformation:us-west-2:AWS_ACCOUNT_ID:stack/MY_CLUSTER_NAME/*" ] } ] @@ -214,20 +254,20 @@ aws sts get-caller-identity --output text --query Account When the cluster is created, the controller will expose the TLS-secured API on a DNS name. -The A record of that DNS name needs to be point to the cluster ip address. 
+The DNS name should have either a CNAME record that points to the cluster DNS name or an A record that points to the cluster IP address. -We will need to use DNS name later in tutorial. If you don't already own one, you can choose any DNS name (e.g., `paddle`) and modify `/etc/hosts` to associate cluster ip with that DNS name. +We will need to use this DNS name later in the tutorial. #### S3 bucket You need to create an S3 bucket before startup the Kubernetes cluster. -There are some bugs in aws cli in creating S3 bucket, so let's use the [Web console](https://console.aws.amazon.com/s3/home?region=us-west-1). +There are some bugs in the aws cli when creating an S3 bucket, so let's use the [S3 Console](https://console.aws.amazon.com/s3/home?region=us-west-2). -Click on `Create Bucket`, fill in a unique BUCKET_NAME, and make sure region is us-west-1 (Northern California). +Click on `Create Bucket`, fill in a unique BUCKET_NAME, and make sure region is us-west-2 (Oregon). -#### Initialize an asset directory +#### Initialize Assets Create a directory on your local machine to hold the generated assets: @@ -242,10 +282,10 @@ Initialize the cluster CloudFormation stack with the KMS Arn, key pair name, and kube-aws init \ --cluster-name=MY_CLUSTER_NAME \ --external-dns-name=MY_EXTERNAL_DNS_NAME \ ---region=us-west-1 \ ---availability-zone=us-west-1a \ +--region=us-west-2 \ +--availability-zone=us-west-2a \ --key-name=KEY_PAIR_NAME \ ---kms-key-arn="arn:aws:kms:us-west-1:xxxxxxxxxx:key/xxxxxxxxxxxxxxxxxxx" +--kms-key-arn="arn:aws:kms:us-west-2:xxxxxxxxxx:key/xxxxxxxxxxxxxxxxxxx" ``` `MY_CLUSTER_NAME`: the one you picked in [KMS key](#kms-key) @@ -256,14 +296,15 @@ kube-aws init \ `--kms-key-arn`: the "Arn" in [KMS key](#kms-key) -Here `us-west-1a` is used for parameter `--availability-zone`, but supported availability zone varies among AWS accounts. 
-Please check if `us-west-1a` is supported by `aws ec2 --region us-west-1 describe-availability-zones`, if not switch to other supported availability zone. (e.g., `us-west-1a`, or `us-west-1b`) +Please check if `us-west-2a` is supported by `aws ec2 --region us-west-2 describe-availability-zones`, if not switch to other supported availability zone. (e.g., `us-west-2a`, or `us-west-2b`) -Note: please don't use `us-west-1c`. Subnets can currently only be created in the following availability zones: us-west-1b, us-west-1a. There will now be a cluster.yaml file in the asset directory. This is the main configuration file for your cluster. +By default `kube-aws` will only create one worker node. Let's edit `cluster.yaml` and change `workerCount` from 1 to 3. + #### Render contents of the asset directory @@ -278,41 +319,14 @@ The next command generates the default set of cluster assets in your asset direc ``` kube-aws render stack ``` - -Here's what the directory structure looks like: - -``` -$ tree -. -├── cluster.yaml -├── credentials -│ ├── admin-key.pem -│ ├── admin.pem -│ ├── apiserver-key.pem -│ ├── apiserver.pem -│ ├── ca-key.pem -│ ├── ca.pem -│ ├── worker-key.pem -│ └── worker.pem -│ ├── etcd-key.pem -│ └── etcd.pem -│ ├── etcd-client-key.pem -│ └── etcd-client.pem -├── kubeconfig -├── stack-template.json -└── userdata - ├── cloud-config-controller - └── cloud-config-worker -``` - -These assets (templates and credentials) are used to create, update and interact with your Kubernetes cluster. +Assets (templates and credentials) that are used to create, update and interact with your Kubernetes cluster will be created under your current folder. 
### Kubernetes Cluster Start Up #### Create the instances defined in the CloudFormation template -Now let's create your cluster (choose any PREFIX for the command below): +Now let's create your cluster (choose any `PREFIX` for the command below): ``` kube-aws up --s3-uri s3://BUCKET_NAME/PREFIX @@ -328,239 +342,158 @@ You can invoke `kube-aws status` to get the cluster API endpoint after cluster c ``` $ kube-aws status Cluster Name: paddle-cluster -Controller DNS Name: paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-1.elb.amazonaws.com +Controller DNS Name: paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com ``` +If you own a DNS name, set the A record to any of the ip addresses found below. __Or__ you can set up a CNAME that points to the `Controller DNS Name` (`paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com`) + +##### Find IP address + Use command `dig` to check the load balancer hostname to get the ip address. ``` -$ dig paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-1.elb.amazonaws.com +$ dig paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com ;; QUESTION SECTION: -;paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-1.elb.amazonaws.com. IN A +;paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. IN A ;; ANSWER SECTION: -paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-1.elb.amazonaws.com. 59 IN A 54.241.164.52 -paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-1.elb.amazonaws.com. 59 IN A 54.67.102.112 +paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. 59 IN A 54.241.164.52 +paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. 59 IN A 54.67.102.112 ``` In the above output, both ip `54.241.164.52`, `54.67.102.112` will work. -If you own a DNS name, set the A record to any of the above ip. Otherwise you can edit `/etc/hosts` to associate ip with the DNS name. 
#### Access the cluster Once the API server is running, you should see: ``` -$ kubectl --kubeconfig=kubeconfig get nodes -NAME STATUS AGE -ip-10-0-0-xxx.us-west-1.compute.internal Ready 5m -ip-10-0-0-xxx.us-west-1.compute.internal Ready 5m -ip-10-0-0-xx.us-west-1.compute.internal Ready,SchedulingDisabled 5m +$ kubectl --kubeconfig=kubeconfig get nodes +NAME STATUS AGE +ip-10-0-0-134.us-west-2.compute.internal Ready 6m +ip-10-0-0-238.us-west-2.compute.internal Ready 6m +ip-10-0-0-50.us-west-2.compute.internal Ready 6m +ip-10-0-0-55.us-west-2.compute.internal Ready 6m ``` ### Setup Elastic File System for Cluster -Training data is usually served on a distributed filesystem, we use Elastic File System (EFS) on AWS. Ceph might be a better solution, but it requires high version of Linux kernel that might not be stable enough at this moment. We haven't automated the EFS setup at this moment, so please do the following steps: - +Training data is usually served on a distributed filesystem, we use Elastic File System (EFS) on AWS. -1. Make sure you added AmazonElasticFileSystemFullAccess policy in your group. +1. Create security group for EFS in [security group console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#SecurityGroups:sort=groupId) + 1. Look up security group id for `paddle-cluster-sg-worker` (`sg-055ee37d` in the image below) +
![](src/worker_security_group.png)
+ 2. Add security group `paddle-efs` in VPC `paddle-cluster-vpc`, with an `ALL TCP` inbound rule whose custom source is the group id of `paddle-cluster-sg-worker`. Make sure the availability zone is the same as the one you used in [Initialize Assets](#initialize-assets). +
![](src/add_security_group.png)
-1. Create the Elastic File System in AWS console, and attach the new VPC with it. +2. Create the Elastic File System in [EFS console](https://us-west-2.console.aws.amazon.com/efs/home?region=us-west-2#/wizard/1) with `paddle-cluster-vpc` VPC. Make sure the subnet is `paddle-cluster-Subnet0` and the security group is `paddle-efs`.
![](src/create_efs.png)
-1. Modify the Kubernetes security group under ec2/Security Groups, add additional inbound policy "All TCP TCP 0 - 65535 0.0.0.0/0" for Kubernetes default VPC security group. -
![](src/add_security_group.png)
- - -1. Follow the EC2 mount instruction to mount the disk onto all the Kubernetes nodes, we recommend to mount EFS disk onto ~/efs. -
![](src/efs_mount.png)
- - -We will place user config and divided training data onto EFS. Training task will cache related files by copying them from EFS into container. It will also write the training results back onto EFS. We will show you how to place the data later in this article. - - - -### Core Concepts of PaddlePaddle Training on AWS - -Now we've already setup a 3 nodes distributed Kubernetes cluster, and on each node we've attached the EFS volume. In this training demo, we will create three Kubernetes pods and schedule them on three nodes. Each pod contains a PaddlePaddle container. When container gets created, it will start parameter server (pserver) and trainer process, load the training data from EFS volume and start the distributed training task. - -#### Distributed Training Job - -A distributed training job is represented by a [kubernetes job](https://kubernetes.io/docs/user-guide/jobs/#what-is-a-job). - -Each Kuberentes job is described by a job config file, which specifies the information like the number of pods in the job and environment variables. - -In a distributed training job, we would: - -1. upload the partitioned training data and configuration file onto EFS volume, and -1. create and submit the Kubernetes job config to the Kubernetes cluster to start the training job. - -#### Parameter Servers and Trainers - -There are two roles in a PaddlePaddle cluster: `parameter server` and `trainer`. Each parameter server process maintains a shard of the global model. Each trainer has its local copy of the model, and uses its local data to update the model. During the training process, trainers send model updates to parameter servers, parameter servers are responsible for aggregating these updates, so that trainers can synchronize their local copy with the global model. - -
![Model is partitioned into two shards. Managed by two parameter servers respectively.](src/pserver_and_trainer.png)
- -In order to communicate with pserver, trainer needs to know the ip address of each pserver. In kubernetes it's better to use a service discovery mechanism (e.g., DNS hostname) rather than static ip address, since any pserver's pod may be killed and a new pod could be schduled onto another node of different ip address. We will improve paddlepaddle's service discovery ability. For now we will use static ip. - -Parameter server and trainer are packaged into a same docker image. They will run once pod is scheduled by kubernetes job. - -#### Trainer ID - -Each trainer process requires a trainer ID, a zero-based index value, passed in as a command-line parameter. The trainer process thus reads the data partition indexed by this ID. - -#### Training - -The entry-point of a container is a Python script. As it runs in a pod, it can see some environment variables pre-defined by Kubernetes. This includes one that gives the job's identity, which can be used in a remote call to the Kubernetes apiserver that lists all pods in the job. - -We rank each pod by sorting them by their ips. The rank of each pod could be the "pod ID". Because we run one trainer and one parameter server in each pod, we can use this "pod ID" as the trainer ID. A detailed workflow of the entry-point script is as follows: - -1. Query the api server to get pod information, and assign the `trainer_id` by sorting the ip. -1. Copy the training data from EFS sharing volume into container. -1. Parse the `paddle pserver` and `paddle trainer` startup parameters from environment variables, and then start up the processes. -1. Trainer with `train_id` 0 will automatically write results onto EFS volume. - - ### Start PaddlePaddle Training Demo on AWS -Now we'll start a PaddlePaddle training demo on AWS, steps are as follows: - -1. Build PaddlePaddle Docker image. -1. Divide the training data file and upload it onto the EFS sharing volume. -1. Create the training job config file, and start up the job. -1. 
Check the result after training. - -#### Build PaddlePaddle Docker Image - -PaddlePaddle docker image need to provide the runtime environment for `pserver` and `trainer`, so the container use this image should have two main function: - -1. Copy the training data into container. -1. Generate the startup parameter for `pserver` and `trainer` process, and startup the training. - +#### Configure Kubernetes Volume that Points to EFS -We need to create a new image since official `paddledev/paddle:cpu-latest` only have PaddlePaddle binary, but lack of the above functionalities. - -Dockerfile for creating the new image is as follows: +First we need to create a [PersistentVolume](https://kubernetes.io/docs/user-guide/persistent-volumes/) to provision the EFS volume. +Save the following snippet as `pv.yaml` ``` -FROM paddledev/paddle:cpu-latest - -MAINTAINER zjsxzong89@gmail.com - -COPY start.sh /root/ -COPY start_paddle.py /root/ -CMD ["bash"," -c","/root/start.sh"] +apiVersion: v1 +kind: PersistentVolume +metadata: + name: efsvol +spec: + capacity: + storage: 100Gi + accessModes: + - ReadWriteMany + nfs: + server: EFS_DNS_NAME + path: "/" ``` -At this point, we will copy our `start.sh` and `start_paddle.py` file into container, and then exec `start_paddle.py` script to start up the training, all the steps like assigning trainer_id, getting other nodes' ip are implemented in `start_paddle.py`. - -`start_paddle.py` will start parsing the parameters. +`EFS_DNS_NAME`: DNS name as shown in description of `paddle-efs` that we created. 
Looks similar to `fs-2cbf7385.efs.us-west-2.amazonaws.com` +Run the following command to create a persistent volume: ``` -parser = argparse.ArgumentParser(prog="start_paddle.py", - description='simple tool for k8s') - args, train_args_list = parser.parse_known_args() - train_args = refine_unknown_args(train_args_list) - train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2])) - podlist = getPodList() +kubectl --kubeconfig=kubeconfig create -f pv.yaml ``` -And then using function `getPodList()` to query all the pod information from the job name through Kubernetes api server. When all the pods are in the running status, using `getIdMap(podlist)` to get the trainer_id. +Next let's create a [PersistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes/) to claim the persistent volume. +Save the following snippet as `pvc.yaml`. ``` - podlist = getPodList() - # need to wait until all pods are running - while not isPodAllRunning(podlist): - time.sleep(10) - podlist = getPodList() - idMap = getIdMap(podlist) +kind: PersistentVolumeClaim +apiVersion: v1 +metadata: + name: efsvol +spec: + accessModes: + - ReadWriteMany + resources: + requests: + storage: 50Gi ``` -In function `getIdMap(podlist)`, we use podlist to get the ip address for each pod and sort them, use the index as the trainer_id. - +Run the following command to create a persistent volume claim: ``` -def getIdMap(podlist): - ''' - generate tainer_id by ip - ''' - ips = [] - for pod in podlist["items"]: - ips.append(pod["status"]["podIP"]) - ips.sort() - idMap = {} - for i in range(len(ips)): - idMap[ips[i]] = i - return idMap +kubectl --kubeconfig=kubeconfig create -f pvc.yaml ``` -After getting `idMap`, we use function `startPaddle(idMap, train_args_dict)` to generate `paddle pserver` and `paddle train` start up parameters and then start up the processes. 
+#### Prepare Training Data -In function `startPaddle`, the most important work is to generate `paddle pserver` and `paddle train` start up parameters. For example, `paddle train` parameter parsing, we will get parameters like `PADDLE_NIC`, `PADDLE_PORT`, `PADDLE_PORTS_NUM`, and get the `trainer_id` from `idMap`. +We will now launch a kubernetes job that downloads, saves and evenly splits training data into 3 shards on the persistent volume that we just created. +Save the following snippet as `paddle-data-job.yaml` ``` - program = 'paddle train' - args = " --nics=" + PADDLE_NIC - args += " --port=" + str(PADDLE_PORT) - args += " --ports_num=" + str(PADDLE_PORTS_NUM) - args += " --comment=" + "paddle_process_by_paddle" - ip_string = "" - for ip in idMap.keys(): - ip_string += (ip + ",") - ip_string = ip_string.rstrip(",") - args += " --pservers=" + ip_string - args_ext = "" - for key, value in train_args_dict.items(): - args_ext += (' --' + key + '=' + value) - localIP = socket.gethostbyname(socket.gethostname()) - trainerId = idMap[localIP] - args += " " + args_ext + " --trainer_id=" + \ - str(trainerId) + " --save_dir=" + JOB_PATH_OUTPUT -``` - -Use `docker build` to build toe Docker Image: - -``` -docker build -t your_repo/paddle:mypaddle . +apiVersion: batch/v1 +kind: Job +metadata: + name: paddle-data +spec: + template: + metadata: + name: pi + spec: + containers: + - name: paddle-data + image: paddledev/paddle-tutorial:k8s_data + imagePullPolicy: Always + volumeMounts: + - mountPath: "/efs" + name: efs + env: + - name: OUT_DIR + value: /efs/paddle-cluster-job + - name: SPLIT_COUNT + value: "3" + volumes: + - name: efs + persistentVolumeClaim: + claimName: efsvol + restartPolicy: Never ``` -And then push the built image onto docker registry. 
- +Run following command to launch the job: ``` -docker push your_repo/paddle:mypaddle +kubectl --kubeconfig=kubeconfig create -f paddle-data-job.yaml ``` -#### Upload Training Data File - -Here we will use PaddlePaddle's official recommendation demo as the content for this training, we put the training data file into a directory named by job name, which located in EFS sharing volume, the tree structure for the directory looks like: - +Job may take 7 min to finish, use following command to check job status. Do not proceed until `SUCCESSFUL` for `paddle-data` job is `1` ``` -efs -└── paddle-cluster-job - ├── data - │ ├── 0 - │ │ - │ ├── 1 - │ │ - │ └── 2 - ├── output - └── recommendation +$ kubectl --kubeconfig=kubeconfig get jobs +NAME DESIRED SUCCESSFUL AGE +paddle-data 1 1 6m ``` -The `paddle-cluster-job` directory is the job name for this training, this training includes 3 PaddlePaddle node, we store the partitioned data under `paddle-cluster-job/data` directory, directory 0, 1, 2 each represent 3 nodes' trainer_id. the training data in in recommendation directory, the training results and logs will be in the output directory. - +Data preparation is done by docker image `paddledev/paddle-tutorial:k8s_data`, see [here](src/k8s_data/README.md) for how to build this docker image and source code. -#### Create Kubernetes Job - -Kubernetes use yaml file to describe job details, and then use command line tool to create the job in Kubernetes cluster. - -In yaml file, we describe the Docker image we use for this training, the node number we need to startup, the volume mounting information and all the necessary parameters we need for `paddle pserver` and `paddle train` processes. - -The yaml file content is as follows: +#### Start Training +Now we are ready to start paddle training job. 
Save following snippet as `paddle-cluster-job.yaml` ``` apiVersion: batch/v1 kind: Job @@ -574,12 +507,12 @@ spec: name: paddle-cluster-job spec: volumes: - - name: jobpath - hostPath: - path: /home/admin/efs + - name: efs + persistentVolumeClaim: + claimName: efsvol containers: - name: trainer - image: drinkcode/paddle:k8s-job + image: paddledev/paddle-tutorial:k8s_train command: ["bin/bash", "-c", "/root/start.sh"] env: - name: JOB_NAME @@ -589,7 +522,7 @@ spec: - name: JOB_NAMESPACE value: default - name: TRAIN_CONFIG_DIR - value: recommendation + value: quick_start - name: CONF_PADDLE_NIC value: eth0 - name: CONF_PADDLE_PORT @@ -600,106 +533,124 @@ spec: value: "2" - name: CONF_PADDLE_GRADIENT_NUM value: "3" + - name: TRAINER_COUNT + value: "3" volumeMounts: - - name: jobpath - mountPath: /home/jobpath + - mountPath: "/home/jobpath" + name: efs ports: - - name: jobport - hostPort: 30001 - containerPort: 30001 + - name: jobport0 + hostPort: 7164 + containerPort: 7164 + - name: jobport1 + hostPort: 7165 + containerPort: 7165 + - name: jobport2 + hostPort: 7166 + containerPort: 7166 + - name: jobport3 + hostPort: 7167 + containerPort: 7167 restartPolicy: Never - ``` -In yaml file, the metadata's name is the job's name. `parallelism, completions` means this job will simultaneously start up 3 PaddlePaddle nodes, and this job will be finished when there are 3 finished pods. For the data store volume, we declare the path jobpath, it mount the /home/admin/efs on host machine into the container with path /home/jobpath. So in container, the /home/jobpath actually stores the data onto EFS sharing volume. - -`env` field represents container's environment variables, we pass the PaddlePaddle parameters into containers by using the `env` field. +`parallelism: 3, completions: 3` means this job will simultaneously start 3 PaddlePaddle pods, and this job will be finished when there are 3 finished pods. 
-`JOB_PATH` represents the sharing volume path, `JOB_NAME` represents job name, `TRAIN_CONFIG_DIR` represents the training data file directory, we can these three parameters to get the file path for this training. +`env` field represents container's environment variables, we specify PaddlePaddle parameters by environment variables. -`CONF_PADDLE_NIC` represents `paddle pserver` process's `--nics` parameters, the NIC name. +`ports` indicates that TCP ports 7164 - 7167 are exposed for communication between `pserver` and trainer. Ports run continuously from `CONF_PADDLE_PORT` (7164) to `CONF_PADDLE_PORT + CONF_PADDLE_PORTS_NUM + CONF_PADDLE_PORTS_NUM_SPARSE - 1` (7167). We use multiple ports for dense and sparse parameter updates to improve latency. -`CONF_PADDLE_PORT` represents `paddle pserver` process's `--port` parameters, `CONF_PADDLE_PORTS_NUM` represents `--port_num` parameter. - -`CONF_PADDLE_PORTS_NUM_SPARSE` represents the sparse updated port number, `--ports_num_for_sparse` parameter. +Run the following command to launch the job. +``` +kubectl --kubeconfig=kubeconfig create -f paddle-cluster-job.yaml +``` -`CONF_PADDLE_GRADIENT_NUM` represents the training node number, `--num_gradient_servers` parameter. +Inspect individual pods -After we create the yaml file, we can use Kubernetes command line tool to create the job onto the cluster. +``` +$ kubectl --kubeconfig=kubeconfig get pods +NAME READY STATUS RESTARTS AGE +paddle-cluster-job-cm469 1/1 Running 0 9m +paddle-cluster-job-fnt03 1/1 Running 0 9m +paddle-cluster-job-jx4xr 1/1 Running 0 9m +``` +Inspect individual console output ``` -kubectl create -f job.yaml +kubectl --kubeconfig=kubeconfig logs -f POD_NAME ``` -After we execute the above command, Kubernetes will create 3 pods and then pull the PaddlePaddle image, then start up the containers for training. +`POD_NAME`: name of any pod (e.g., `paddle-cluster-job-cm469`). 
+Run `kubectl --kubeconfig=kubeconfig describe job paddle-cluster-job` to check training job status. It will complete in around 20 minutes. +The details of starting `pserver` and `trainer` are hidden inside docker image `paddledev/paddle-tutorial:k8s_train`, see [here](src/k8s_train/README.md) for how to build the docker image and source code. -#### Check Training Results +#### Inspect Training Output -During the training, we can see the logs and models on EFS sharing volume, the output directory contains the training results. (Caution: node_0, node_1, node_2 directories represents PaddlePaddle node and train_id, not the Kubernetes node) +Training output (model snapshot and logs) will be saved in EFS. We can ssh into a worker EC2 instance, mount EFS and check training output. +1. SSH into a worker EC2 instance ``` -[root@paddle-kubernetes-node0 output]# tree -d -. -├── node_0 -│ ├── server.log -│ └── train.log -├── node_1 -│ ├── server.log -│ └── train.log -├── node_2 -...... -├── pass-00002 -│ ├── done -│ ├── ___embedding_0__.w0 -│ ├── ___embedding_1__.w0 -...... +chmod 400 key-name.pem +ssh -i key-name.pem core@INSTANCE_IP ``` -We can always check the container training status through logs, for example: +`INSTANCE_IP`: public IP address of EC2 kubernetes worker node. Go to [EC2 console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Instances:sort=instanceId) and check `public IP` of any `paddle-cluster-kube-aws-worker` instance. +2. 
Mount EFS ``` -[root@paddle-kubernetes-node0 node_0]# cat train.log -I1116 09:10:17.123121 50 Util.cpp:155] commandline: - /usr/local/bin/../opt/paddle/bin/paddle_trainer - --nics=eth0 --port=7164 - --ports_num=2 --comment=paddle_process_by_paddle - --pservers=192.168.129.66,192.168.223.143,192.168.129.71 - --ports_num_for_sparse=2 --config=./trainer_config.py - --trainer_count=4 --num_passes=10 --use_gpu=0 - --log_period=50 --dot_period=10 --saving_period=1 - --local=0 --trainer_id=0 - --save_dir=/home/jobpath/paddle-cluster-job/output -I1116 09:10:17.123440 50 Util.cpp:130] Calling runInitFunctions -I1116 09:10:17.123764 50 Util.cpp:143] Call runInitFunctions done. -[WARNING 2016-11-16 09:10:17,227 default_decorators.py:40] please use keyword arguments in paddle config. -[INFO 2016-11-16 09:10:17,239 networks.py:1282] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating] -[INFO 2016-11-16 09:10:17,239 networks.py:1289] The output order is [__regression_cost_0__] -I1116 09:10:17.392917 50 Trainer.cpp:170] trainer mode: Normal -I1116 09:10:17.613910 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process -I1116 09:10:17.680917 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process -I1116 09:10:17.681543 50 GradientMachine.cpp:134] Initing parameters.. -I1116 09:10:18.012390 50 GradientMachine.cpp:141] Init parameters done. 
-I1116 09:10:18.018641 50 ParameterClient2.cpp:122] pserver 0 192.168.129.66:7164 -I1116 09:10:18.018950 50 ParameterClient2.cpp:122] pserver 1 192.168.129.66:7165 -I1116 09:10:18.019069 50 ParameterClient2.cpp:122] pserver 2 192.168.223.143:7164 -I1116 09:10:18.019492 50 ParameterClient2.cpp:122] pserver 3 192.168.223.143:7165 -I1116 09:10:18.019716 50 ParameterClient2.cpp:122] pserver 4 192.168.129.71:7164 -I1116 09:10:18.019836 50 ParameterClient2.cpp:122] pserver 5 192.168.129.71:7165 +mkdir efs +sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 EFS_DNS_NAME:/ efs ``` -It'll take around 8 hours to finish this PaddlePaddle recommendation training demo on three 2 core 8 GB EC2 machine (m3.large). +`EFS_DNS_NAME`: the DNS name shown in the description of the `paddle-efs` file system that we created. It looks similar to `fs-2cbf7385.efs.us-west-2.amazonaws.com`. +Now the folder `efs` will have a structure similar to: +``` +-- paddle-cluster-job + |-- ... + |-- output + | |-- node_0 + | | |-- server.log + | | `-- train.log + | |-- node_1 + | | |-- server.log + | | `-- train.log + | |-- node_2 + | | |-- server.log + | | `-- train.log + | |-- pass-00000 + | | |-- ___fc_layer_0__.w0 + | | |-- ___fc_layer_0__.wbias + | | |-- done + | | |-- path.txt + | | `-- trainer_config.lr.py + | |-- pass-00001... +``` +`server.log` contains the log for `pserver`. `train.log` contains the log for `trainer`. The model description and snapshots are stored in `pass-0000*`. ### Kubernetes Cluster Tear Down +#### Delete EFS + +Go to [EFS Console](https://us-west-2.console.aws.amazon.com/efs/home?region=us-west-2) and delete the EFS volume that we created. + +#### Delete security group + +Go to [Security Group Console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#SecurityGroups:sort=groupId) and delete the security group `paddle-efs`. 
-If you want to tear down the whole Kubernetes cluster, make sure to *delete* the EFS volume first (otherwise, you will get stucked on following steps), and then use the following command: + +#### Delete S3 Bucket + +Go to [S3 Console](https://console.aws.amazon.com/s3/home?region=us-west-2#) and delete the S3 bucket that we created. + +#### Destroy Cluster ``` kube-aws destroy ``` -It's an async call, it might take 5 min to tear down the whole cluster. -If you created any Kubernetes Services of type LoadBalancer, you must delete these first, as the CloudFormation cannot be fully destroyed if any externally-managed resources still exist. +The command will return immediately, but it might take 5 minutes to tear down the whole cluster. + +You can go to [CloudFormation Console](https://us-west-2.console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks?filter=active) to check the progress of the teardown. diff --git a/doc/howto/usage/k8s/src/add_security_group.png b/doc/howto/usage/k8s/src/add_security_group.png index 50eed4c6573a18d6ae0f9df9bd6a3cae05493e3c..bd34f46c9b0ada7027fd53e553e7d033255d25fc 100644 Binary files a/doc/howto/usage/k8s/src/add_security_group.png and b/doc/howto/usage/k8s/src/add_security_group.png differ diff --git a/doc/howto/usage/k8s/src/create_efs.png b/doc/howto/usage/k8s/src/create_efs.png index f4d448d1518e11a11d535efb9c3a78b56cc13149..e5f1526033d1daf401700989af1d25919bcb7675 100644 Binary files a/doc/howto/usage/k8s/src/create_efs.png and b/doc/howto/usage/k8s/src/create_efs.png differ diff --git a/doc/howto/usage/k8s/src/job.yaml b/doc/howto/usage/k8s/src/job.yaml deleted file mode 100644 index 488aad0bede4f940b25c7be04259f209c3de9f52..0000000000000000000000000000000000000000 --- a/doc/howto/usage/k8s/src/job.yaml +++ /dev/null @@ -1,43 +0,0 @@ -apiVersion: batch/v1 -kind: Job -metadata: - name: paddle-cluster-job -spec: - parallelism: 3 - completions: 3 - template: - metadata: - name: paddle-cluster-job - spec: - volumes: - - name: jobpath - 
hostPath: - path: /home/work/paddle_output - containers: - - name: trainer - image: registry.baidu.com/public/paddle:mypaddle - command: ["bin/bash", "-c", "/root/start.sh"] - env: - - name: JOB_NAME - value: paddle-cluster-job - - name: JOB_PATH - value: /home/jobpath - - name: JOB_NAMESPACE - value: default - - name: TRAIN_CONFIG_DIR - value: recommendation - - name: CONF_PADDLE_NIC - value: eth0 - - name: CONF_PADDLE_PORT - value: "7164" - - name: CONF_PADDLE_PORTS_NUM - value: "2" - - name: CONF_PADDLE_PORTS_NUM_SPARSE - value: "2" - - name: CONF_PADDLE_GRADIENT_NUM - value: "3" - volumeMounts: - - name: jobpath - mountPath: /home/jobpath - restartPolicy: Never - diff --git a/doc/howto/usage/k8s/src/k8s_data/Dockerfile b/doc/howto/usage/k8s/src/k8s_data/Dockerfile new file mode 100644 index 0000000000000000000000000000000000000000..6d3a12ae393aa594b8e6e9a5f726109426937284 --- /dev/null +++ b/doc/howto/usage/k8s/src/k8s_data/Dockerfile @@ -0,0 +1,7 @@ +FROM alpine + +RUN apk update && apk upgrade && apk add coreutils +ADD quick_start /quick_start +ADD get_data.sh /bin/ +RUN chmod +x /bin/get_data.sh +ENTRYPOINT ["/bin/get_data.sh"] diff --git a/doc/howto/usage/k8s/src/k8s_data/README.md b/doc/howto/usage/k8s/src/k8s_data/README.md new file mode 100644 index 0000000000000000000000000000000000000000..83cef7affd0ac4d3a1ca08ea5b046fa81e1bc630 --- /dev/null +++ b/doc/howto/usage/k8s/src/k8s_data/README.md @@ -0,0 +1,6 @@ +To build the PaddlePaddle data preparation image used in the tutorial [Distributed PaddlePaddle Training on AWS with Kubernetes](../../k8s_aws_en.md), run the following commands: + +``` +cp -r ../../../../../../demo/quick_start . +docker build . 
-t prepare-data-image-name +``` diff --git a/doc/howto/usage/k8s/src/k8s_data/get_data.sh b/doc/howto/usage/k8s/src/k8s_data/get_data.sh new file mode 100755 index 0000000000000000000000000000000000000000..d187ba5ac8d03f69dfdefd4f63610ed7921575be --- /dev/null +++ b/doc/howto/usage/k8s/src/k8s_data/get_data.sh @@ -0,0 +1,26 @@ +#!/bin/sh + +out_dir=$OUT_DIR +split_count=$SPLIT_COUNT + +set -e + +mkdir -p $out_dir +cp -r /quick_start $out_dir/ + +mkdir -p $out_dir/0/data +cd $out_dir/0/data +wget http://paddlepaddle.bj.bcebos.com/demo/quick_start_preprocessed_data/preprocessed_data.tar.gz +tar zxvf preprocessed_data.tar.gz +rm preprocessed_data.tar.gz + +split -d --number=l/$split_count -a 5 train.txt train. +mv train.00000 train.txt + +cd $out_dir +end=$(expr $split_count - 1) +for i in $(seq 1 $end); do + mkdir -p $i/data + cp -r 0/data/* $i/data + mv $i/data/train.`printf %05d $i` $i/data/train.txt +done; diff --git a/doc/howto/usage/k8s/src/k8s_train/Dockerfile b/doc/howto/usage/k8s/src/k8s_train/Dockerfile new file mode 100644 index 0000000000000000000000000000000000000000..c0fca1f9a945921e6e8899fee2db8845e66136a1 --- /dev/null +++ b/doc/howto/usage/k8s/src/k8s_train/Dockerfile @@ -0,0 +1,6 @@ +FROM paddledev/paddle:cpu-latest + +COPY start.sh /root/ +COPY start_paddle.py /root/ +RUN chmod +x /root/start.sh +CMD ["bash","-c","/root/start.sh"] diff --git a/doc/howto/usage/k8s/src/k8s_train/README.md b/doc/howto/usage/k8s/src/k8s_train/README.md new file mode 100644 index 0000000000000000000000000000000000000000..96bf65497ffa23e90c4c9350504f86367b48daf2 --- /dev/null +++ b/doc/howto/usage/k8s/src/k8s_train/README.md @@ -0,0 +1,5 @@ +To build the PaddlePaddle training image used in the tutorial [Distributed PaddlePaddle Training on AWS with Kubernetes](../../k8s_aws_en.md), run the following command: + +``` +docker build . 
-t train-image-name +``` diff --git a/doc/howto/usage/k8s/src/start.sh b/doc/howto/usage/k8s/src/k8s_train/start.sh similarity index 55% rename from doc/howto/usage/k8s/src/start.sh rename to doc/howto/usage/k8s/src/k8s_train/start.sh index b3a1334174a20b018d35de3b01b149fc5b10d49d..12dfe1e6386885a6989d3887f21c6922f137a9ae 100755 --- a/doc/howto/usage/k8s/src/start.sh +++ b/doc/howto/usage/k8s/src/k8s_train/start.sh @@ -1,19 +1,19 @@ #!/bin/sh + set -eu jobconfig=${JOB_PATH}"/"${JOB_NAME}"/"${TRAIN_CONFIG_DIR} cd /root -cp -rf $jobconfig . -cd $TRAIN_CONFIG_DIR - +cp -rf $jobconfig/* . python /root/start_paddle.py \ --dot_period=10 \ - --ports_num_for_sparse=$CONF_PADDLE_PORTS_NUM \ + --ports_num=$CONF_PADDLE_PORTS_NUM \ + --ports_num_for_sparse=$CONF_PADDLE_PORTS_NUM_SPARSE \ --log_period=50 \ --num_passes=10 \ - --trainer_count=4 \ + --trainer_count=$TRAINER_COUNT \ --saving_period=1 \ --local=0 \ - --config=./trainer_config.py \ + --config=trainer_config.lr.py \ --use_gpu=0 diff --git a/doc/howto/usage/k8s/src/start_paddle.py b/doc/howto/usage/k8s/src/k8s_train/start_paddle.py similarity index 84% rename from doc/howto/usage/k8s/src/start_paddle.py rename to doc/howto/usage/k8s/src/k8s_train/start_paddle.py index df00d82919faa2acecc79c28e3d773ba3de9672a..f1a770ccb54fbd7d4c3cf6bf134d00d7bf5961ca 100755 --- a/doc/howto/usage/k8s/src/start_paddle.py +++ b/doc/howto/usage/k8s/src/k8s_train/start_paddle.py @@ -23,7 +23,6 @@ import argparse API = "/api/v1/namespaces/" JOBSELECTOR = "labelSelector=job-name=" JOB_PATH = os.getenv("JOB_PATH") + "/" + os.getenv("JOB_NAME") -JOB_PATH_DATA = JOB_PATH + "/data" JOB_PATH_OUTPUT = JOB_PATH + "/output" JOBNAME = os.getenv("JOB_NAME") NAMESPACE = os.getenv("JOB_NAMESPACE") @@ -33,6 +32,8 @@ PADDLE_PORTS_NUM = os.getenv("CONF_PADDLE_PORTS_NUM") PADDLE_PORTS_NUM_SPARSE = os.getenv("CONF_PADDLE_PORTS_NUM_SPARSE") PADDLE_SERVER_NUM = os.getenv("CONF_PADDLE_GRADIENT_NUM") +tokenpath = 
'/var/run/secrets/kubernetes.io/serviceaccount/token' + def refine_unknown_args(cmd_args): ''' @@ -64,6 +65,7 @@ def isPodAllRunning(podlist): for pod in podlist["items"]: if pod["status"]["phase"] == "Running": running += 1 + print "waiting for pods running, require:", require, "running:", running if require == running: return True return False @@ -79,8 +81,17 @@ def getPodList(): pod = API + NAMESPACE + "/pods?" job = JOBNAME - return requests.get(apiserver + pod + JOBSELECTOR + job, - verify=False).json() + if os.path.isfile(tokenpath): + tokenfile = open(tokenpath, mode='r') + token = tokenfile.read() + Bearer = "Bearer " + token + headers = {"Authorization": Bearer} + return requests.get(apiserver + pod + JOBSELECTOR + job, + headers=headers, + verify=False).json() + else: + return requests.get(apiserver + pod + JOBSELECTOR + job, + verify=False).json() def getIdMap(podlist): @@ -122,8 +133,8 @@ def startPaddle(idMap={}, train_args_dict=None): if not os.path.exists(JOB_PATH_OUTPUT): os.makedirs(JOB_PATH_OUTPUT) os.mkdir(logDir) - copyCommand = 'cp -rf ' + JOB_PATH_DATA + \ - "/" + str(trainerId) + " ./data" + copyCommand = 'cp -rf ' + JOB_PATH + \ + "/" + str(trainerId) + "/data/*" + " ./data/" os.system(copyCommand) startPserver = 'nohup paddle pserver' + \ " --port=" + str(PADDLE_PORT) + \ @@ -136,9 +147,9 @@ def startPaddle(idMap={}, train_args_dict=None): print startPserver os.system(startPserver) # wait until pservers completely start - time.sleep(10) - startTrainer = program + args + " > " + \ - logDir + "/train.log 2>&1 < /dev/null" + time.sleep(20) + startTrainer = program + args + " 2>&1 | tee " + \ + logDir + "/train.log" print startTrainer os.system(startTrainer) @@ -152,7 +163,7 @@ if __name__ == '__main__': podlist = getPodList() # need to wait until all pods are running while not isPodAllRunning(podlist): - time.sleep(10) + time.sleep(20) podlist = getPodList() idMap = getIdMap(podlist) startPaddle(idMap, train_args_dict) diff --git 
a/doc/howto/usage/k8s/src/worker_security_group.png b/doc/howto/usage/k8s/src/worker_security_group.png new file mode 100644 index 0000000000000000000000000000000000000000..57eb0265a34ad4223b69600d2a3dd355482e0bf5 Binary files /dev/null and b/doc/howto/usage/k8s/src/worker_security_group.png differ
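
The `getPodList` change in `start_paddle.py` reads the service-account token and sends it to the apiserver as a bearer token, but it opens the token file without closing it and duplicates the `requests.get` call across two branches. As a minimal sketch only (not the code in this change), the header construction could be factored into a helper. The name `build_auth_headers` is hypothetical, and stripping surrounding whitespace from the token is an assumption made for illustration:

```python
import os

# Path where Kubernetes mounts the service-account token (same path as in the diff).
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_auth_headers(token_path=TOKEN_PATH):
    """Return headers carrying the bearer token, or {} if no token file exists.

    Hypothetical helper for illustration; it is not part of start_paddle.py.
    """
    if not os.path.isfile(token_path):
        return {}
    # The 'with' block closes the file even if read() raises.
    with open(token_path, mode="r") as token_file:
        # Assumption: surrounding whitespace is not part of the token.
        token = token_file.read().strip()
    return {"Authorization": "Bearer " + token}
```

With a helper like this, the two branches in `getPodList` could likely collapse into a single `requests.get(url, headers=build_auth_headers(), verify=False)` call, since an empty header dictionary adds nothing to the request.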