@@ -50,7 +50,10 @@ Training nodes will run your `ENTRYPOINT` script with the following environment
- `TASK_NAME`: unique name to identify this training process.
- `TRAINING_ROLE`: the current node's role in this training process, either "PSERVER" or "TRAINER".
- `PSERVER_HOSTS`: comma-separated list of pserver endpoints, e.g. "192.168.1.2:5436,192.168.1.3:5436".
- `PSERVERS`: same value as `PSERVER_HOSTS`.
- `TRAINERS`: the number of trainers.
- `SERVER_ENDPOINT`: the current server's endpoint, if the node's role is pserver.
- `TRAINER_INDEX`: an integer identifying the index of the current trainer, if the node's role is trainer.
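
As a rough illustration of how an `ENTRYPOINT` script might consume the variables listed above, here is a minimal Python sketch. The names `NodeConfig` and `load_node_config` are hypothetical and not part of the tool. The fallback from `TRAINERS` to `TRAINER_COUNT` is an assumption, made because the trainer launch command in this change passes `TRAINER_COUNT` rather than `TRAINERS`.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class NodeConfig:
    """Hypothetical container for the per-node settings (not part of the tool)."""
    task_name: str
    training_role: str              # "PSERVER" or "TRAINER"
    pserver_hosts: List[str]        # parsed from the comma-separated PSERVER_HOSTS
    trainers: Optional[int]         # trainer count, if the node was given one
    server_endpoint: Optional[str]  # only populated on a pserver
    trainer_index: Optional[int]    # only populated on a trainer


def load_node_config(env: Mapping[str, str] = os.environ) -> NodeConfig:
    """Read the documented environment variables into a NodeConfig."""
    role = env["TRAINING_ROLE"]
    if role not in ("PSERVER", "TRAINER"):
        raise ValueError(f"unexpected TRAINING_ROLE: {role!r}")

    # "192.168.1.2:5436,192.168.1.3:5436" -> ["192.168.1.2:5436", "192.168.1.3:5436"]
    hosts = [h.strip() for h in env["PSERVER_HOSTS"].split(",") if h.strip()]

    # Assumption: accept either TRAINERS or TRAINER_COUNT for the trainer count.
    trainers_raw = env.get("TRAINERS", env.get("TRAINER_COUNT"))

    return NodeConfig(
        task_name=env["TASK_NAME"],
        training_role=role,
        pserver_hosts=hosts,
        trainers=int(trainers_raw) if trainers_raw is not None else None,
        server_endpoint=env.get("SERVER_ENDPOINT") if role == "PSERVER" else None,
        trainer_index=int(env["TRAINER_INDEX"]) if role == "TRAINER" else None,
    )
```

A training script could call `load_node_config()` once at startup and branch on `training_role` to decide whether to start a parameter server or a trainer. Passing a plain dict instead of `os.environ` makes the parsing easy to check locally.
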
Now we have a working distributed training script that takes advantage of the node environment variables, and a Dockerfile to build the training image. Run the following command:
...
@@ -73,11 +76,15 @@ Training nodes will run your `ENTRYPOINT` script with the following environment
Now let's start the training process:
```bash
docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/root/<key pair name>.pem \
putcn/paddle_aws_client \
--action create \
--key_name <your key pair name> \
--security_group_id <your security group id> \
--pserver_image_id <your pserver image id> \
--trainer_image_id <your trainer image id> \
--pserver_count 2 \
--trainer_count 2
```
Now just wait until you see this:
...
@@ -91,7 +98,7 @@ That means you can turn off your laptop and your cluster is creating instances,
To access the master log:
```bash
docker run -i -v $HOME/.aws:/root/.aws \
putcn/paddle_aws_client \
--action status \
--master_server_public_ip <master ip> \
...
@@ -101,7 +108,7 @@ putcn/paddle_aws_client \
To tear down the training setup:
```bash
docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/<key pair name>.pem \
...
nvidia-docker run -i -p {PSERVER_PORT}:{PSERVER_PORT} -e "SERVER_ENDPOINT={SERVER_ENDPOINT}" -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINING_ROLE=PSERVER" -e "TRAINERS={TRAINER_COUNT}" -e "PSERVER_HOSTS={PSERVER_HOSTS}" -e "PSERVERS={PSERVER_HOSTS}" {DOCKER_IMAGE} {COMMAND}
nvidia-docker run -i -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINER_COUNT={TRAINER_COUNT}" -e "TRAINER_INDEX={TRAINER_INDEX}" -e "TRAINING_ROLE=TRAINER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE} {COMMAND}