diff --git a/doc/v2/howto/cluster/multi_cluster/k8s_aws_cn.md b/doc/v2/howto/cluster/multi_cluster/k8s_aws_cn.md
deleted file mode 120000
index c44cd9a731bed7067cdf19aa2f714abdce6c736a..0000000000000000000000000000000000000000
--- a/doc/v2/howto/cluster/multi_cluster/k8s_aws_cn.md
+++ /dev/null
@@ -1 +0,0 @@
-k8s_aws_en.md
\ No newline at end of file
diff --git a/doc/v2/howto/cluster/multi_cluster/k8s_aws_cn.md b/doc/v2/howto/cluster/multi_cluster/k8s_aws_cn.md
new file mode 100644
index 0000000000000000000000000000000000000000..afc753aa42f19631c49a451a797f28365e65ed1d
--- /dev/null
+++ b/doc/v2/howto/cluster/multi_cluster/k8s_aws_cn.md
@@ -0,0 +1,672 @@
+# Kubernetes on AWS
+
+This guide shows how to run distributed PaddlePaddle training on a Kubernetes cluster on AWS. We start with the core concepts.
+
+## Core concepts of distributed PaddlePaddle training
+
+### Distributed training job
+
+A distributed training job can be viewed as a Kubernetes job.
+Each Kubernetes job has a configuration file that specifies information such as the number of pods in the job and its environment variables.
+
+In a distributed training job, we proceed as follows:
+
+1. Prepare the sharded data and the configuration file on a distributed file system (this tutorial uses Amazon Elastic File System, EFS).
+2.
+   Create a Kubernetes job configuration and submit it to the cluster to start training.
+
+### Parameter Server and Trainer
+
+A PaddlePaddle cluster has two roles: parameter server (pserver) and trainer. Each parameter server process holds one shard of the model's parameters. Each trainer holds a complete copy of the model parameters and can update the model with its local data. During training, trainers send model updates to the parameter servers, which aggregate these updates so that each trainer can synchronize its local copy with the global model.
+
+To communicate with the pservers, a trainer needs the IP address of every pserver. In Kubernetes, a service discovery mechanism (e.g., DNS or hostname) is preferable to static IP addresses, because any pod can be killed and a replacement pod can be started on a different node with a different IP address. For now we use static IP addresses; this may change later.
+
+The parameter server and the trainer are packaged together into one Docker image, which runs in pods scheduled by the Kubernetes cluster.
+
+### Trainer ID
+
+Each trainer process needs a trainer ID, a zero-based index passed as a command-line argument. The trainer process uses this ID to decide which data shard to read.
+
+### Training
+
+The entry point of the PaddlePaddle container is a shell script that can read the environment variables preset by Kubernetes. One of them gives the job's identity, which can be used in a remote call to the Kubernetes apiserver to list all pods in the job.
+
+The pods are sorted by IP address, and each pod's position in that order is its "pod id". Because every pod runs both a trainer and a parameter server, the "pod id" can be used as the trainer ID. The entry-point script works as follows:
+
+1. Query the apiserver for pod information and assign a trainer_id by sorting the pod IPs.
+2. Copy the training data from the EFS persistent volume into the container.
+3. Parse the startup arguments of paddle pserver and paddle trainer from environment variables, then start both processes.
+4.
+   Train with the assigned trainer_id; the results are written to the EFS volume automatically.
+
+
+## PaddlePaddle on AWS with Kubernetes
+
+### Choose an AWS service region
+This tutorial requires several AWS services to be available in the same region. Before creating anything in AWS, check https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/ and choose a region that offers the following services: EC2, EFS, CloudFormation, KMS, VPC, S3. This tutorial uses "Oregon (us-west-2)" as the example.
+
+### Create AWS Account and IAM Account
+
+Multiple IAM users can be created under each AWS account. Permissions are granted per IAM user, and an IAM user with the right permissions can create and operate AWS clusters.
+
+To sign up for an AWS account, follow the AWS user guide. To create IAM users and user groups under your AWS account, follow the IAM user guide.
+
+Note that this tutorial requires the following IAM user permissions:
+
+- AmazonEC2FullAccess
+- AmazonS3FullAccess
+- AmazonRoute53FullAccess
+- AmazonRoute53DomainsFullAccess
+- AmazonElasticFileSystemFullAccess
+- AmazonVPCFullAccess
+- IAMUserSSHKeys
+- IAMFullAccess
+- NetworkAdministrator
+- AWSKeyManagementServicePowerUser
+
+
+### Download kube-aws and kubectl
+
+#### kube-aws
+
+[kube-aws](https://github.com/coreos/kube-aws) is a CLI tool that automates cluster deployment on AWS.
+
+##### Verify kube-aws integrity
+Note: if you are using a non-official release of kube-aws (e.g., an RC release), you can skip this step. Import the CoreOS application signing public key:
+
+```
+gpg2 --keyserver pgp.mit.edu --recv-key FC8A365E
+```
+
+Validate the key fingerprint:
+
+```
+gpg2 --fingerprint FC8A365E
+```
+The correct key fingerprint is `18AD 5014 C99E F7E3 BA5F 6CE9 50BD D3E0 FC8A 365E`.
+
+Download kube-aws from the [release page](https://github.com/coreos/kube-aws/releases). This tutorial uses version 0.9.1.
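The manual fingerprint comparison described above can also be scripted. The following is only a sketch: the helper name `same_fingerprint` is hypothetical, and it assumes you paste or extract the fingerprint string printed by `gpg2 --fingerprint` yourself. It ignores spacing and letter case, since the printed grouping can vary.

```shell
# Expected CoreOS application signing key fingerprint, as stated in this tutorial.
EXPECTED_FPR="18AD 5014 C99E F7E3 BA5F 6CE9 50BD D3E0 FC8A 365E"

# Hypothetical helper: succeed if two fingerprints are equal,
# ignoring spaces and upper/lower case.
same_fingerprint() {
  a="$(printf '%s' "$1" | tr -d ' ' | tr '[:lower:]' '[:upper:]')"
  b="$(printf '%s' "$2" | tr -d ' ' | tr '[:lower:]' '[:upper:]')"
  [ "$a" = "$b" ]
}

# Example usage (the second argument is whatever gpg2 printed on your machine):
# same_fingerprint "$EXPECTED_FPR" "18AD 5014 C99E F7E3 BA5F  6CE9 50BD D3E0 FC8A 365E" \
#   && echo "fingerprint matches" || echo "fingerprint MISMATCH"
```

If the comparison fails, do not continue with the download verification.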
+
+Validate the GPG signature of the tarball:
+
+```
+PLATFORM=linux-amd64
+ # Or
+PLATFORM=darwin-amd64
+
+gpg2 --verify kube-aws-${PLATFORM}.tar.gz.sig kube-aws-${PLATFORM}.tar.gz
+```
+##### Install kube-aws
+Extract the binary:
+
+```
+tar zxvf kube-aws-${PLATFORM}.tar.gz
+```
+
+Add kube-aws to your PATH:
+
+```
+mv ${PLATFORM}/kube-aws /usr/local/bin
+```
+
+
+#### kubectl
+
+[kubectl](https://Kubernetes.io/docs/user-guide/kubectl-overview/) is a command-line interface for operating Kubernetes clusters.
+
+Download `kubectl` from the Kubernetes release site with `curl`:
+
+```
+# OS X
+curl -O https://storage.googleapis.com/kubernetes-release/release/"$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)"/bin/darwin/amd64/kubectl
+
+# Linux
+curl -O https://storage.googleapis.com/kubernetes-release/release/"$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)"/bin/linux/amd64/kubectl
+```
+
+Make the kubectl binary executable and move it to a directory on your PATH (e.g., `/usr/local/bin`):
+
+```
+chmod +x ./kubectl
+sudo mv ./kubectl /usr/local/bin/kubectl
+```
+
+### Configure AWS credentials
+
+First, follow [this guide](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) to install the AWS command line tool.
+
+Then configure your AWS account information:
+
+```
+aws configure
+```
+
+
+Fill in the following fields:
+
+
+```
+AWS Access Key ID: YOUR_ACCESS_KEY_ID
+AWS Secret Access Key: YOUR_SECRET_ACCESS_KEY
+Default region name: us-west-2
+Default output format: json
+```
+
+`YOUR_ACCESS_KEY_ID` and `YOUR_SECRET_ACCESS_KEY` are the access key ID and secret key of the IAM user created in [Create AWS Account and IAM Account](#create-aws-account-and-iam-account).
+
+Verify that the credentials work by describing the instances running in your account:
+
+```
+aws ec2 describe-instances
+```
+
+### Define cluster parameters
+
+#### EC2 key pair
+
+The key pair authenticates SSH access to your EC2 instances. The public half of the key pair is configured on each CoreOS node.
+
+Follow the [EC2 Keypair User Guide](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) to create an EC2 key pair.
+
+You will use the name of the created key pair to configure the cluster.
+
+A key pair is only available to EC2 instances in the same region. This tutorial uses us-west-2, so make sure to create the key pair in that region (Oregon).
+
+Your browser will download a `key-name.pem` file, which is the key for accessing the EC2 instances. We will use it later.
+
+
+#### KMS key
+
+Amazon KMS keys are used to encrypt and decrypt the cluster's TLS assets. If you already have a KMS key that you would like to use, you can skip creating a new key and provide the ARN string of the existing key.
+
+Create a KMS key with the AWS command line tool:
+
+```
+aws kms --region=us-west-2 create-key --description="kube-aws assets"
+{
+    "KeyMetadata": {
+        "CreationDate": 1458235139.724,
+        "KeyState": "Enabled",
+        "Arn": "arn:aws:kms:us-west-2:aaaaaaaaaaaaa:key/xxxxxxxxxxxxxxxxxxx",
+        "AWSAccountId": "xxxxxxxxxxxxx",
+        "Enabled": true,
+        "KeyUsage": "ENCRYPT_DECRYPT",
+        "KeyId": "xxxxxxxxx",
+        "Description": "kube-aws assets"
+    }
+}
+```
+
+We will need the value of `Arn` later.
+
+Next, add the inline policies below to your IAM user's permissions.
+
+Go to the [IAM Console](https://console.aws.amazon.com/iam/home?region=us-west-2#/home). Click `Users`, click the user you just created, then click `Add inline policy` and choose `Custom Policy`.
+
+Paste the following inline policy:
+
+```
+{
+    "Version": "2012-10-17",
+    "Statement": [
+        {
+            "Sid": "Stmt1482205552000",
+            "Effect": "Allow",
+            "Action": [
+                "kms:Decrypt",
+                "kms:Encrypt"
+            ],
+            "Resource": [
+                "arn:aws:kms:*:AWS_ACCOUNT_ID:key/*"
+            ]
+        },
+        {
+            "Sid": "Stmt1482205746000",
+            "Effect": "Allow",
+            "Action": [
+                "cloudformation:CreateStack",
+                "cloudformation:UpdateStack",
+                "cloudformation:DeleteStack",
+                "cloudformation:DescribeStacks",
+                "cloudformation:DescribeStackResource",
+                "cloudformation:GetTemplate",
+                "cloudformation:DescribeStackEvents"
+            ],
+            "Resource": [
+                "arn:aws:cloudformation:us-west-2:AWS_ACCOUNT_ID:stack/MY_CLUSTER_NAME/*"
+            ]
+        }
+    ]
+}
+```
+`Version`: the value must be "2012-10-17".
+
+`AWS_ACCOUNT_ID`: you can get it with the following command:
+
+```
+aws sts get-caller-identity --output text --query Account
+```
+
+`MY_CLUSTER_NAME`: pick a cluster name of your choice; we will use it later.
+Note that the stack name must match the regular expression `[a-zA-Z][-a-zA-Z0-9*]*`: it must start with a letter and must not contain "_" (the pattern itself does allow "-"). Otherwise kube-aws will fail in later steps.
+
+#### External DNS name
+
+After the cluster is created, the controller exposes the TLS-secured API at this DNS name.
+
+The DNS name should have a CNAME record pointing to the cluster's DNS name, or an A record pointing to the cluster's IP address.
+
+We will need this DNS name later. If you do not own one, you can pick any name (e.g., `paddle`) and then either edit `/etc/hosts` to associate that name with the cluster IP on your machine, or add a name service on AWS that associates the name with the cluster IP. We will look up the cluster IP in a later step.
+
+#### S3 bucket
+
+An S3 bucket must be created before launching the Kubernetes cluster.
+
+Because creating an S3 bucket on AWS by other means can be buggy, use the [s3 console](https://console.aws.amazon.com/s3/home?region=us-west-2).
+
+Click `Create Bucket`, choose a unique BUCKET_NAME, and make sure the bucket is created in us-west-2 (Oregon).
+
+#### Initialize assets
+
+Create a directory on your local machine to hold the generated assets:
+
+```
+$ mkdir my-cluster
+$ cd my-cluster
+```
+
+Initialize the cluster's CloudFormation stack with the KMS Arn, the key pair name, and the DNS name from the previous steps:
+
+```
+kube-aws init \
+--cluster-name=MY_CLUSTER_NAME \
+--external-dns-name=MY_EXTERNAL_DNS_NAME \
+--region=us-west-2 \
+--availability-zone=us-west-2a \
+--key-name=KEY_PAIR_NAME \
+--kms-key-arn="arn:aws:kms:us-west-2:xxxxxxxxxx:key/xxxxxxxxxxxxxxxxxxx"
+```
+
+`MY_CLUSTER_NAME`: the one you picked in [KMS key](#kms-key)
+
+`MY_EXTERNAL_DNS_NAME`: see [External DNS name](#external-dns-name)
+
+`KEY_PAIR_NAME`: see [EC2 key pair](#ec2-key-pair)
+
+`--kms-key-arn`: the "Arn" in [KMS key](#kms-key)
+
+Here `us-west-2a` is used for `--availability-zone`, but the zone must be one that is available to your AWS account.
+
+Run `aws ec2 --region us-west-2 describe-availability-zones` to check whether `us-west-2a` is supported. If it is not, switch to another supported availability zone (e.g., `us-west-2b`).
+
+The asset directory now contains the cluster's main configuration file, `cluster.yaml`.
+
+By default kube-aws creates one worker node. Edit `cluster.yaml` and change `workerCount` from 1 to 3.
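If you prefer to script the `workerCount` change instead of editing the file by hand, a sketch along these lines may help. It assumes `cluster.yaml` has a top-level `workerCount:` line, possibly commented out as `#workerCount: 1` (check your generated file, since the layout can differ between kube-aws versions). The helper name `set_worker_count` is hypothetical.

```shell
# Hypothetical helper: set_worker_count FILE COUNT
# Rewrites a "workerCount: N" line (optionally commented out with '#')
# to "workerCount: COUNT". Assumes the key is a top-level entry.
set_worker_count() {
  file="$1"
  count="$2"
  tmp="$(mktemp)"
  # Write to a temporary file first so the same command works with GNU and BSD sed.
  sed -E "s/^#?[[:space:]]*workerCount:.*/workerCount: ${count}/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example usage, run inside the asset directory:
# set_worker_count cluster.yaml 3
# grep workerCount cluster.yaml
```

Review the resulting file before rendering the stack, because a pattern-based edit cannot tell whether the matched line is the one you intended.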
+
+#### Render contents of the asset directory
+
+In the simplest case, you can let kube-aws generate the TLS identities and certificates:
+
+```
+kube-aws render credentials --generate-ca
+```
+
+Next, generate the set of cluster assets in the asset directory:
+
+```
+kube-aws render stack
+```
+These assets (templates and credentials) are used to create, update, and interact with the Kubernetes cluster created from the current directory.
+
+### Start the Kubernetes cluster
+
+#### Create the instances defined in the CloudFormation template
+
+Now create the cluster (choose any `PREFIX` in the command):
+
+```
+kube-aws up --s3-uri s3://BUCKET_NAME/PREFIX
+```
+
+`BUCKET_NAME`: the bucket name you used in [S3 bucket](#s3-bucket)
+
+
+#### Configure DNS
+
+Run `kube-aws status` to see the API endpoint of the created cluster:
+
+```
+$ kube-aws status
+Cluster Name:		paddle-cluster
+Controller DNS Name:	paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com
+```
+If you own a DNS name, set up an A record pointing to the cluster IP, or a CNAME record pointing to the `Controller DNS Name` (`paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com`).
+
+##### Find the IP address
+
+Use `dig` to look up the load balancer's hostname and get its IP addresses:
+
+```
+$ dig paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com
+
+;; QUESTION SECTION:
+;paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. IN A
+
+;; ANSWER SECTION:
+paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com. 59 IN A 54.241.164.52
+paddle-cl-ElbAPISe-EEOI3EZPR86C-531251350.us-west-2.elb.amazonaws.com.
+    59 IN A 54.67.102.112
+```
+
+In the example above, both `54.241.164.52` and `54.67.102.112` will work.
+
+*If you own a DNS name*, set the A record to one of these IPs, then skip ahead to the "Access the cluster" section.
+
+*If you do not own a DNS name*:
+
+##### Update the local DNS association
+Edit `/etc/hosts` to associate one of the IPs above with the DNS name.
+##### Add a Route53 private name service in the VPC
+ - Open the [Route53 Console](https://console.aws.amazon.com/route53/home)
+ - Create a hosted zone with the following configuration
+   - Domain name: "paddle"
+   - Type: "Private hosted zone for amazon VPC"
+   - VPC ID: `<Your VPC ID>`
+
+   ![route53 zone setting](src/route53_create_zone.png)
+ - Add a record
+   - Click the "paddle" zone you just created
+   - Click the "Create record set" button
+     - Name: leave blank
+     - Type: "A"
+     - Value: `<kube-controller ec2 private ip>`
+
+   ![route53 create recordset](src/route53_create_recordset.png)
+ - Verify the name service
+   - Connect via ssh to any instance created by kube-aws
+   - Run `host paddle` and check that the returned IP is the private IP of kube-controller
+
+#### Access the cluster
+
+Once the cluster is running, the following command lists its nodes:
+
+```
+$ kubectl --kubeconfig=kubeconfig get nodes
+NAME                                       STATUS    AGE
+ip-10-0-0-134.us-west-2.compute.internal   Ready     6m
+ip-10-0-0-238.us-west-2.compute.internal   Ready     6m
+ip-10-0-0-50.us-west-2.compute.internal    Ready     6m
+ip-10-0-0-55.us-west-2.compute.internal    Ready     6m
+```
+
+
+### Set up Elastic File System for the cluster
+
+Training data is stored on EFS, the AWS distributed file system.
+
+1. Create a security group for EFS in the [security group console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#SecurityGroups:sort=groupId)
+  1. Look up the group ID of the `paddle-cluster-sg-worker` security group (`sg-055ee37d` in the image below)
+  <center>![](src/worker_security_group.png)</center>
+
+  2. Add a security group `paddle-efs` with an `ALL TCP` inbound rule whose custom source is the group ID of `paddle-cluster-sg-worker`, and with VPC `paddle-cluster-vpc`. Make sure the availability zone is the one used in [Initialize Assets](#initialize-assets).
+  <center>![](src/add_security_group.png)</center>
+
+2. Create the Elastic File System in the [EFS console](https://us-west-2.console.aws.amazon.com/efs/home?region=us-west-2#/wizard/1) with the `paddle-cluster-vpc` VPC. Make sure the subnet is `paddle-cluster-Subnet0` and the security group is `paddle-efs`.
+<center></center> + + +### 开始在AWS上进行paddlepaddleçš„è®ç»ƒ + +#### é…ç½®Kuberneteså·æŒ‡å‘EFS + +首先需è¦åˆ›å»ºä¸€ä¸ªæŒä¹…å·[PersistentVolume](https://kubernetes.io/docs/user-guide/persistent-volumes/) 到EFS上 + +用 `pv.yaml`å½¢å¼æ¥ä¿å˜ +``` +apiVersion: v1 +kind: PersistentVolume +metadata: + name: efsvol +spec: + capacity: + storage: 100Gi + accessModes: + - ReadWriteMany + nfs: + server: EFS_DNS_NAME + path: "/" +``` + +`EFS_DNS_NAME`: DNSå称最好能æ述我们创建的`paddle-efs`,看起æ¥åƒ`fs-2cbf7385.efs.us-west-2.amazonaws.com` + +è¿è¡Œä¸‹é¢çš„命令æ¥åˆ›å»ºæŒä¹…å·: +``` +kubectl --kubeconfig=kubeconfig create -f pv.yaml +``` +下一æ¥åˆ›å»º [PersistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes/)æ¥å£°æ˜ŽæŒä¹…å· + +用`pvc.yaml`æ¥ä¿å˜. +``` +kind: PersistentVolumeClaim +apiVersion: v1 +metadata: + name: efsvol +spec: + accessModes: + - ReadWriteMany + resources: + requests: + storage: 50Gi +``` + +行下é¢å‘½ä»¤æ¥åˆ›å»ºæŒä¹…å·å£°æ˜Ž: +``` +kubectl --kubeconfig=kubeconfig create -f pvc.yaml +``` + +#### 准备è®ç»ƒæ•°æ® + +å¯åŠ¨Kubernetes job在我们创建的æŒä¹…层上进行下载ã€ä¿å˜å¹¶å‡åŒ€æ‹†åˆ†è®ç»ƒæ•°æ®ä¸º3份. 
+ +用`paddle-data-job.yaml`ä¿å˜ +``` +apiVersion: batch/v1 +kind: Job +metadata: + name: paddle-data +spec: + template: + metadata: + name: pi + spec: + containers: + - name: paddle-data + image: paddlepaddle/paddle-tutorial:k8s_data + imagePullPolicy: Always + volumeMounts: + - mountPath: "/efs" + name: efs + env: + - name: OUT_DIR + value: /efs/paddle-cluster-job + - name: SPLIT_COUNT + value: "3" + volumes: + - name: efs + persistentVolumeClaim: + claimName: efsvol + restartPolicy: Never +``` + +è¿è¡Œä¸‹é¢çš„命令æ¥å¯åŠ¨ä»»åŠ¡: +``` +kubectl --kubeconfig=kubeconfig create -f paddle-data-job.yaml +``` +任务è¿è¡Œå¤§æ¦‚需è¦7分钟,å¯ä»¥ä½¿ç”¨ä¸‹é¢å‘½ä»¤æŸ¥çœ‹ä»»åŠ¡çŠ¶æ€ï¼Œç›´åˆ°`paddle-data`任务的`SUCCESSFUL`状æ€ä¸º`1`æ—¶æˆåŠŸï¼Œè¿™é‡Œhereæœ‰æ€Žæ ·åˆ›å»ºé•œåƒçš„æºç +``` +$ kubectl --kubeconfig=kubeconfig get jobs +NAME DESIRED SUCCESSFUL AGE +paddle-data 1 1 6m +``` +æ•°æ®å‡†å¤‡å®ŒæˆåŽçš„结果是以镜åƒ`paddlepaddle/paddle-tutorial:k8s_data`å˜æ”¾ï¼Œå¯ä»¥ç‚¹å‡»è¿™é‡Œ[here](src/k8s_data/README.md)查看如何创建dockeré•œåƒæºç + +#### 开始è®ç»ƒ + +现在å¯ä»¥å¼€å§‹è¿è¡Œpaddleçš„è®ç»ƒä»»åŠ¡ï¼Œç”¨`paddle-cluster-job.yaml`进行ä¿å˜ +``` +apiVersion: batch/v1 +kind: Job +metadata: + name: paddle-cluster-job +spec: + parallelism: 3 + completions: 3 + template: + metadata: + name: paddle-cluster-job + spec: + volumes: + - name: efs + persistentVolumeClaim: + claimName: efsvol + containers: + - name: trainer + image: paddlepaddle/paddle-tutorial:k8s_train + command: ["bin/bash", "-c", "/root/start.sh"] + env: + - name: JOB_NAME + value: paddle-cluster-job + - name: JOB_PATH + value: /home/jobpath + - name: JOB_NAMESPACE + value: default + - name: TRAIN_CONFIG_DIR + value: quick_start + - name: CONF_PADDLE_NIC + value: eth0 + - name: CONF_PADDLE_PORT + value: "7164" + - name: CONF_PADDLE_PORTS_NUM + value: "2" + - name: CONF_PADDLE_PORTS_NUM_SPARSE + value: "2" + - name: CONF_PADDLE_GRADIENT_NUM + value: "3" + - name: TRAINER_COUNT + value: "3" + volumeMounts: + - mountPath: "/home/jobpath" + name: efs + ports: + - 
+          name: jobport0
+          hostPort: 7164
+          containerPort: 7164
+        - name: jobport1
+          hostPort: 7165
+          containerPort: 7165
+        - name: jobport2
+          hostPort: 7166
+          containerPort: 7166
+        - name: jobport3
+          hostPort: 7167
+          containerPort: 7167
+      restartPolicy: Never
+```
+
+`parallelism: 3, completions: 3` means that the job starts 3 PaddlePaddle pods at the same time and is finished once 3 pods have completed.
+
+The `env` field sets the container's environment variables, which is where the PaddlePaddle parameters are specified.
+
+`ports` opens TCP ports 7164 to 7167 for connections to `pserver`. The range runs from `CONF_PADDLE_PORT` (7164) to `CONF_PADDLE_PORT + CONF_PADDLE_PORTS_NUM + CONF_PADDLE_PORTS_NUM_SPARSE - 1` (7164 + 2 + 2 - 1 = 7167). Multiple ports are used for dense and sparse parameter updates to improve latency.
+
+Run the following command to start the job:
+```
+kubectl --kubeconfig=kubeconfig create -f paddle-cluster-job.yaml
+```
+
+Check the pods:
+
+```
+$ kubectl --kubeconfig=kubeconfig get pods
+NAME                       READY     STATUS    RESTARTS   AGE
+paddle-cluster-job-cm469   1/1       Running   0          9m
+paddle-cluster-job-fnt03   1/1       Running   0          9m
+paddle-cluster-job-jx4xr   1/1       Running   0          9m
+```
+
+Check the console output of a specific pod:
+```
+kubectl --kubeconfig=kubeconfig logs -f POD_NAME
+```
+
+`POD_NAME`: the name of any pod (e.g., `paddle-cluster-job-cm469`).
+
+Run `kubectl --kubeconfig=kubeconfig describe job paddle-cluster-job` to check the status of the training job. It completes in about 20 minutes.
+
+The details of starting `pserver` and `trainer` are encapsulated in the Docker image `paddlepaddle/paddle-tutorial:k8s_train`. See [here](src/k8s_train/README.md) for its source code and how to build the image.
+
+#### Inspect training output
+
+Training output (model snapshots and logs) is saved on EFS. We can ssh into an EC2 worker node, mount EFS, and inspect the training output.
+
+1. ssh to an EC2 worker node
+```
+chmod 400 key-name.pem
+ssh -i key-name.pem core@INSTANCE_IP
+```
+
+`INSTANCE_IP`: the public IP address of a Kubernetes worker node on EC2. Go to the [EC2 console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Instances:sort=instanceId) and check the `public IP` of any `paddle-cluster-kube-aws-worker` instance.
+
+2.
+   Mount EFS
+```
+mkdir efs
+sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 EFS_DNS_NAME:/ efs
+```
+
+`EFS_DNS_NAME`: the DNS name shown in the description of the `paddle-efs` file system we created. It looks similar to `fs-2cbf7385.efs.us-west-2.amazonaws.com`.
+
+The `efs` folder now has a structure similar to:
+```
+-- paddle-cluster-job
+    |-- ...
+    |-- output
+    |   |-- node_0
+    |   |   |-- server.log
+    |   |   `-- train.log
+    |   |-- node_1
+    |   |   |-- server.log
+    |   |   `-- train.log
+    |   |-- node_2
+    |   |   |-- server.log
+    |   |   `-- train.log
+    |   |-- pass-00000
+    |   |   |-- ___fc_layer_0__.w0
+    |   |   |-- ___fc_layer_0__.wbias
+    |   |   |-- done
+    |   |   |-- path.txt
+    |   |   `-- trainer_config.lr.py
+    |   |-- pass-00001...
+```
+`server.log` is the log from `pserver`, `train.log` is the log from `trainer`, and the model snapshots and descriptions are stored in the `pass-0000*` directories. Note that the `node_0`, `node_1`, and `node_2` directories represent PaddlePaddle nodes and trainer IDs, not Kubernetes nodes.
+
+### Tear down or delete the Kubernetes cluster
+
+#### Delete EFS
+
+Go to the [EFS Console](https://us-west-2.console.aws.amazon.com/efs/home?region=us-west-2) and delete the EFS volume we created.
+
+#### Delete the security group
+
+Go to the [Security Group Console](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#SecurityGroups:sort=groupId) and delete the security group `paddle-efs`.
+
+#### Delete the S3 bucket
+
+Go to the [S3 Console](https://console.aws.amazon.com/s3/home?region=us-west-2#) and delete the S3 bucket we created.
+
+#### Destroy the cluster
+
+```
+kube-aws destroy
+```
+
+The command returns immediately, but destroying the cluster takes about 5 minutes.
+
+You can follow the progress in the [CloudFormation Console](https://us-west-2.console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks?filter=active).