diff --git a/tools/aws_benchmarking/README.md b/tools/aws_benchmarking/README.md
index 5fd586cc1537eabe3c45ec9c2a364e53c3e2d000..dfa2a5f478400fb80b438ddb1ebd0ee613671b0f 100644
--- a/tools/aws_benchmarking/README.md
+++ b/tools/aws_benchmarking/README.md
@@ -50,7 +50,10 @@ Training nodes will run your `ENTRYPOINT` script with the following environment
  - `TASK_NAME`: unique name to identify this training process.
  - `TRAINING_ROLE`: current node's role in this training process, either "PSERVER" or "TRAINER"
  - `PSERVER_HOSTS`: comma separated value of pserver end points, I.E. "192.168.1.2:5436,192.168.1.3:5436"
- - `TRAINER_INDEX`: an integer to identify the index of current trainer
+ - `PSERVERS`: same as above
+ - `TRAINERS`: trainer count
+ - `SERVER_ENDPOINT`: current server end point if the node role is a pserver
+ - `TRAINER_INDEX`: an integer to identify the index of current trainer if the node role is a trainer.
 
 Now we have a working distributed training script which takes advantage of node environment variables and docker file to generate the training image.
 Run the following command:
@@ -73,11 +76,15 @@ Training nodes will run your `ENTRYPOINT` script with the following environment
 Now let's start the training process:
 
 ```bash
-docker run -i -v $HOME/.aws:/root/.aws -v :/.pem \
+docker run -i -v $HOME/.aws:/root/.aws -v :/root/.pem \
 putcn/paddle_aws_client \
 --action create \
 --key_name \
---security_group_id
+--security_group_id \
+--pserver_image_id \
+--trainer_image_id \
+--pserver_count 2 \
+--trainer_count 2
 ```
 
 Now just wait until you see this:
@@ -91,7 +98,7 @@ That means you can turn off your laptop and your cluster is creating instances,
 To access the master log:
 
 ```bash
-docker run -i -v $HOME/.aws:/root/.aws -v :/.pem \
+docker run -i -v $HOME/.aws:/root/.aws \
 putcn/paddle_aws_client \
 --action status \
 --master_server_public_ip \
@@ -101,7 +108,7 @@ putcn/paddle_aws_client \
 To tear down the training setup:
 
 ```bash
-docker run -i -v $HOME/.aws:/root/.aws -v :/.pem \
+docker run -i -v $HOME/.aws:/root/.aws \
 putcn/paddle_aws_client \
 --action cleanup \
 --master_server_public_ip \
@@ -111,7 +118,7 @@ putcn/paddle_aws_client \
 To retrieve training logs
 TBD
 
-### Tech details
+### Tech details
 
 *What to expect in this step*
 
diff --git a/tools/aws_benchmarking/diagram.png b/tools/aws_benchmarking/diagram.png
index 9dd656c9b4719fc6a96eb3d68c796daa0aaf7b98..b97909c5fe78b59d0e636ff73c2ed3e63a0be722 100644
Binary files a/tools/aws_benchmarking/diagram.png and b/tools/aws_benchmarking/diagram.png differ
diff --git a/tools/aws_benchmarking/server/logs/master.log b/tools/aws_benchmarking/server/logs/master.log
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/tools/aws_benchmarking/server/pserver.sh.template b/tools/aws_benchmarking/server/pserver.sh.template
index fe2360ed2063753805e86114857ede9be66be705..5e46a4246f1a31261654dc52d4ab30ccd89f1957 100644
--- a/tools/aws_benchmarking/server/pserver.sh.template
+++ b/tools/aws_benchmarking/server/pserver.sh.template
@@ -1,2 +1,2 @@
 #!/bin/bash
-nvidia-docker run -i -p {PSERVER_PORT}:{PSERVER_PORT} -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINING_ROLE=PSERVER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE}
\ No newline at end of file
+nvidia-docker run -i -p {PSERVER_PORT}:{PSERVER_PORT} -e "SERVER_ENDPOINT={SERVER_ENDPOINT}" -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINING_ROLE=PSERVER" -e "TRAINERS={TRAINER_COUNT}" -e "PSERVER_HOSTS={PSERVER_HOSTS}" -e "PSERVERS={PSERVER_HOSTS}" {DOCKER_IMAGE} {COMMAND}
\ No newline at end of file
diff --git a/tools/aws_benchmarking/server/trainer.sh.template b/tools/aws_benchmarking/server/trainer.sh.template
index 89f405811e768f2d8d4b74e75f818b4eb5df160d..56405a8e31d0c89c90168ce5126afc5b3206da0f 100644
--- a/tools/aws_benchmarking/server/trainer.sh.template
+++ b/tools/aws_benchmarking/server/trainer.sh.template
@@ -1,2 +1,2 @@
 #!/bin/bash
-nvidia-docker run -i -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINER_INDEX={TRAINER_INDEX}" -e "TRAINING_ROLE=TRAINER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE}
\ No newline at end of file
+nvidia-docker run -i -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINER_COUNT={TRAINER_COUNT}" -e "TRAINER_INDEX={TRAINER_INDEX}" -e "TRAINING_ROLE=TRAINER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE} {COMMAND}
\ No newline at end of file
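
Note on consuming these variables: the README hunk documents `TRAINERS` as the trainer count, and `pserver.sh.template` does pass `TRAINERS`, but `trainer.sh.template` passes `TRAINER_COUNT` instead. A minimal sketch of how an `ENTRYPOINT` script might read the node environment while tolerating both names is shown below. The names `NodeConfig` and `load_node_config` are hypothetical and not part of this patch, and the fallback order between the two spellings is an assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class NodeConfig:
    """Hypothetical container for the variables the templates pass via `-e`."""
    task_name: str
    role: str                       # "PSERVER" or "TRAINER"
    pserver_hosts: List[str]        # e.g. ["192.168.1.2:5436", "192.168.1.3:5436"]
    trainer_count: Optional[int]
    server_endpoint: Optional[str]  # only meaningful on a pserver node
    trainer_index: Optional[int]    # only meaningful on a trainer node


def _optional_int(value: Optional[str]) -> Optional[int]:
    """Convert a possibly missing or empty variable to an int."""
    return int(value) if value not in (None, "") else None


def load_node_config(env: Mapping[str, str] = os.environ) -> NodeConfig:
    """Build a NodeConfig from a mapping of environment variables."""
    role = env.get("TRAINING_ROLE", "")
    if role not in ("PSERVER", "TRAINER"):
        raise ValueError("TRAINING_ROLE must be PSERVER or TRAINER, got %r" % role)

    # PSERVERS is documented as "same as above", i.e. the same value as PSERVER_HOSTS.
    hosts_raw = env.get("PSERVER_HOSTS") or env.get("PSERVERS") or ""
    hosts = [h.strip() for h in hosts_raw.split(",") if h.strip()]

    # Assumption: accept either spelling, since the two templates differ.
    count = _optional_int(env.get("TRAINERS") or env.get("TRAINER_COUNT"))

    return NodeConfig(
        task_name=env.get("TASK_NAME", ""),
        role=role,
        pserver_hosts=hosts,
        trainer_count=count,
        server_endpoint=env.get("SERVER_ENDPOINT") if role == "PSERVER" else None,
        trainer_index=_optional_int(env.get("TRAINER_INDEX")) if role == "TRAINER" else None,
    )
```

Inside a container started by either template, calling `load_node_config()` with no arguments reads the live process environment; passing a plain dict instead makes the parsing easy to check without starting a node.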