Commit 94ad30e5, author: Xi Chen

fix log service, add docker run command

Parent d44895e6
@@ -50,7 +50,10 @@ Training nodes will run your `ENTRYPOINT` script with the following environment
 - `TASK_NAME`: unique name to identify this training process.
 - `TRAINING_ROLE`: current node's role in this training process, either "PSERVER" or "TRAINER"
 - `PSERVER_HOSTS`: comma separated value of pserver end points, I.E. "192.168.1.2:5436,192.168.1.3:5436"
-- `TRAINER_INDEX`: an integer to identify the index of current trainer
+- `PSERVERS`: same as above
+- `TRAINERS`: trainer count
+- `SERVER_ENDPOINT`: current server end point if the node role is a pserver
+- `TRAINER_INDEX`: an integer to identify the index of current trainer if the node role is a trainer.
 Now we have a working distributed training script which takes advantage of node environment variables and docker file to generate the training image. Run the following command:
@@ -73,11 +76,15 @@ Training nodes will run your `ENTRYPOINT` script with the following environment
 Now let's start the training process:
 ```bash
-docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/<key pare name>.pem \
+docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/root/<key pare name>.pem \
 putcn/paddle_aws_client \
 --action create \
 --key_name <your key pare name> \
---security_group_id <your security group id>
+--security_group_id <your security group id> \
+--pserver_image_id <your pserver image id> \
+--trainer_image_id <your trainer images id> \
+--pserver_count 2 \
+--trainer_count 2
 ```
 Now just wait until you see this:
@@ -91,7 +98,7 @@ That means you can turn off your laptop and your cluster is creating instances,
 To access the master log:
 ```bash
-docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/<key pare name>.pem \
+docker run -i -v $HOME/.aws:/root/.aws \
 putcn/paddle_aws_client \
 --action status \
 --master_server_public_ip <master ip> \
@@ -101,7 +108,7 @@ putcn/paddle_aws_client \
 To tear down the training setup:
 ```bash
-docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/<key pare name>.pem \
+docker run -i -v $HOME/.aws:/root/.aws \
 putcn/paddle_aws_client \
 --action cleanup \
 --master_server_public_ip <master ip> \
@@ -111,7 +118,7 @@ putcn/paddle_aws_client \
 To retrieve training logs
 TBD
 ### Tech details
 *What to expect in this step*
......
tools/aws_benchmarking/diagram.png (binary image changed: 40.8 KB before, 39.8 KB after)
 #!/bin/bash
-nvidia-docker run -i -p {PSERVER_PORT}:{PSERVER_PORT} -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINING_ROLE=PSERVER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE}
\ No newline at end of file
+nvidia-docker run -i -p {PSERVER_PORT}:{PSERVER_PORT} -e "SERVER_ENDPOINT={SERVER_ENDPOINT}" -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINING_ROLE=PSERVER" -e "TRAINERS={TRAINER_COUNT}" -e "PSERVER_HOSTS={PSERVER_HOSTS}" -e "PSERVERS={PSERVER_HOSTS}" {DOCKER_IMAGE} {COMMAND}
\ No newline at end of file
 #!/bin/bash
-nvidia-docker run -i -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINER_INDEX={TRAINER_INDEX}" -e "TRAINING_ROLE=TRAINER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE}
\ No newline at end of file
+nvidia-docker run -i -e "MASTER_ENDPOINT={MASTER_ENDPOINT}" -e "TASK_NAME={TASK_NAME}" -e "TRAINER_COUNT={TRAINER_COUNT}" -e "TRAINER_INDEX={TRAINER_INDEX}" -e "TRAINING_ROLE=TRAINER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE} {COMMAND}
\ No newline at end of file
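The two launch templates pass the node's role and endpoints into the container as environment variables. As a rough illustration of how an `ENTRYPOINT` script might branch on them, here is a minimal sketch. The function name `describe_role` and the messages it prints are hypothetical and not part of this commit. It only relies on variables that the templates set: `TRAINING_ROLE`, `SERVER_ENDPOINT` and `TRAINERS` on pserver nodes, and `TRAINER_INDEX` and `PSERVER_HOSTS` on trainer nodes.

```bash
#!/bin/bash
# Hypothetical helper (not from the commit): report what this node would do,
# based on the environment variables set by the launch templates.
describe_role() {
  case "${TRAINING_ROLE:-}" in
    PSERVER)
      # SERVER_ENDPOINT and TRAINERS are passed only to pserver containers.
      echo "pserver ${SERVER_ENDPOINT:-unset} serving ${TRAINERS:-?} trainers"
      ;;
    TRAINER)
      # TRAINER_INDEX is passed only to trainer containers.
      echo "trainer ${TRAINER_INDEX:-?} using pservers ${PSERVER_HOSTS:-unset}"
      ;;
    *)
      # Any other value (or an unset variable) is treated as an error.
      echo "unknown TRAINING_ROLE: '${TRAINING_ROLE:-}'" >&2
      return 1
      ;;
  esac
}
```

A real entrypoint would presumably replace the `echo` lines with the actual pserver or trainer start command; calling `describe_role` first can serve as a quick check that the container received the expected variables.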