Azure and local node documentation updates (#48)

2f22ed2e · Jeff Rasley · GitHub · 0ca56d4b

Showing 2 changed files with 42 additions and 25 deletions

README.md +14 -5

docs/azure.md +28 -20

--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ compared to the state-of-art.
 | --------------------------------------- | ------------------------------------------- |
 | [Why DeepSpeed?](#why-deepspeed)        |  DeepSpeed overview                         |
 | [Getting Started](#getting-started)     |  DeepSpeed first steps                      |
-| [Further Reading](#further-reading)     |  Additional DeepSpeed documentation         |
+| [Further Reading](#further-reading)     |  DeepSpeed features, tutorials, etc.        |
 | [Testing](#testing)                     |  Instructions for testing DeepSpeed         |
 | [Contributing](#contributing)           |  Instructions for contributing to DeepSpeed |

@@ -306,10 +306,10 @@ We illustrate an example usage of DeepSpeed with the following assumptions:
 4. `ds_config.json` is the configuration file for DeepSpeed


-### Resource Configuration
-DeepSpeed configures compute resources with hostfiles that are compatible with
+### Resource Configuration (multi-node)
+DeepSpeed configures multi-node compute resources with hostfiles that are compatible with
 [OpenMPI](https://www.open-mpi.org/) and [Horovod](https://github.com/horovod/horovod).
-A hostfile is a list of *hostnames*, which are machines accessible via passwordless
+A hostfile is a list of *hostnames* (or SSH aliases), which are machines accessible via passwordless
 SSH, and *slot counts*, which specify the number of GPUs available on the system. For
 example,
 ```
@@ -321,7 +321,8 @@ for training.

 Hostfiles are specified with the `--hostfile` command line option. If no hostfile is
 specified, DeepSpeed searches for `/job/hostfile`. If no hostfile is specified or found,
-DeepSpeed queries the number of GPUs on the local machine.
+DeepSpeed queries the number of GPUs on the local machine to discover the number of local
+slots available.


 The following command launches a PyTorch training job across all available nodes and GPUs
@@ -355,6 +356,14 @@ deepspeed --include="worker-2:0,1" \
 	--deepspeed --deepspeed_config ds_config.json
 ```

+### Resource Configuration (single-node)
+In the case that we are only running on a single node (with one or more GPUs)
+DeepSpeed *does not* require a hostfile as described above. If a hostfile is
+not detected or passed in then DeepSpeed will query the number of GPUs on the
+local machine to discover the number of slots available. The `--include` and
+`--exclude` arguments work as normal, but the user should specify 'localhost'
+as the hostname.
+

 ## Further Reading


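The resource-resolution behavior described in the README hunks above (use a hostfile if one is given, otherwise fall back to the local GPU count under the hostname `localhost`) can be illustrated with a minimal sketch. This is not DeepSpeed's launcher code: the function names `parse_hostfile` and `resolve_resources` are hypothetical, the `hostname slots=N` line format is assumed from the OpenMPI-style hostfiles the README refers to, and the local GPU count is passed in as a plain integer rather than queried from the machine.

```python
def parse_hostfile(text):
    """Parse hostfile text into a {hostname: slot_count} mapping.

    Assumes one `hostname slots=N` entry per line (hypothetical helper,
    not part of DeepSpeed's public API).
    """
    resources = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 2 or not parts[1].startswith("slots="):
            raise ValueError(f"malformed hostfile line: {raw!r}")
        count = parts[1][len("slots="):]
        if not count.isdigit():
            raise ValueError(f"slot count is not an integer: {raw!r}")
        resources[parts[0]] = int(count)
    return resources


def resolve_resources(hostfile_text, local_gpu_count):
    """Return the hostfile resources, or a single 'localhost' entry.

    `hostfile_text` is None when no hostfile was passed or found, which
    models the single-node case described in the README.
    """
    if hostfile_text is not None:
        return parse_hostfile(hostfile_text)
    return {"localhost": local_gpu_count}


if __name__ == "__main__":
    print(resolve_resources("worker-1 slots=4\nworker-2 slots=4\n", 0))
    print(resolve_resources(None, 2))
```

In the single-node case the only key is `localhost`, which is why the README tells users to name `localhost` when passing `--include` or `--exclude` without a hostfile.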
--- a/docs/azure.md
+++ b/docs/azure.md
 # DeepSpeed with Azure

 This tutorial will help you get started running DeepSpeed on [Azure virtual
-machines](https://azure.microsoft.com/en-us/services/virtual-machines/). Support for
-[Azure ML](https://azure.microsoft.com/en-us/services/machine-learning/) will be coming
-soon!
+machines](https://azure.microsoft.com/en-us/services/virtual-machines/).
+Looking forward, we will be integrating these techniques and additional enhancements
+into the [Azure ML](https://azure.microsoft.com/en-us/services/machine-learning/) platform to
+benefit all your large model training jobs.
+
+If you don't already have an Azure account please see more details here: [https://azure.microsoft.com/](https://azure.microsoft.com/).

 To help with launching Azure instances we suggest using the [Azure
 CLI](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest). We have created
@@ -17,9 +20,15 @@ between Docker containers. `ssh-keygen` is the recommended way of doing this. Ou
 assume your key is located inside the same directory as the Azure scripts.

 ## Azure Config JSON
-Our helper scripts depend on the following a configuration JSON for deployment and setup.
-We have provided a simple example JSON in `azure_config.json` that sets up a basic
-environment with two VMs. See the example below:
+Our helper scripts depend on the following a configuration JSON for deployment
+and setup.  We have provided a simple example JSON in `azure_config.json` that
+sets up a basic environment with two VMs. This config uses the NV6_Promo
+instance type which has one NVIDIA Tesla M60 GPU per VM. You can read more
+details about the VM on the [Linux Virtual Machines
+Pricing](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)
+page.
+
+See the example below:
 ```json
 {
  "num_vms": 2,
@@ -73,9 +82,8 @@ public IP address of your VM and use the SSH key provided in the Azure configura
 JSON.

 ## Access DeepSpeed container
-Everything should be up and running at this point. Set's access the running DeepSpeed
-container on the first VM and make sure we can talk to the other containers in our setup.
-Let's complete the following steps:
+Everything should be up and running at this point. Let's access the running DeepSpeed
+container on the first VM and make sure we can talk to the other containers in our deployment.

 * SSH into the first VM via: `./azure_ssh.sh 0`
 * Change directories into the azure folder of this repo via: `cd ~/workdir/DeepSpeed/azure`
@@ -83,8 +91,14 @@ Let's complete the following steps:
 * You should now be able to `ssh` into any other docker container, the containers can be
   accessed via their SSH alias of `worker-N`, where `N` is the VM number between `0`
   and `num_vms-1`. In this example we should be able to successfully run `ssh worker-1
-   hostname`. You can also use `ds_ssh` to execute a command in parallel on all of your
-   worker containers.
+   hostname` which will return the hostname of worker-1.
+
+## Parallel SSH across containers
+ DeepSpeed comes installed with a helper script `ds_ssh` which is a wrapper around
+ the [pdsh](https://linux.die.net/man/1/pdsh) command that lets you issue commands
+ to groups of hosts (via SSH) in parallel. This wrapper simply connects with the
+ hostfile that defines all the containers in your deployment. For example if you run
+ `ds_ssh hostname` you should see a list of all the hostnames in your deployment.

 ## Run CIFAR-10 example model
 We will now run the DeepSpeed CIFAR-10 model example to test the VM setup. From inside
@@ -105,17 +119,11 @@ the first DeepSpeed container:
  ```bash
  deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
  ```
-  Alternatively, we provide a helper script `./run_ds.sh`.
-
-This will train a simple CIFAR-10 example model. The accuracy that you will achieve will
-be dependent on the number of GPUs you are training with, we are using this example
-simply to demonstrate that everything is setup correctly and less on training a suitable
-CIFAR-10 model.
-

 ## Megatron-LM GPT2
 DeepSpeed includes an example model using Megatron-LM's GPT2. Please refer to the full
 [Megatron tutorial](tutorials/MegatronGPT2Tutorial.md) for more details.
 * In order to fully train GPT2 with DeepSpeed and ZeRO we recommend using 8 instances of
-   Azure's Standard_ND40rs_v2 SKU for a total of 64 NVIDIA V100 GPUs. With this setup you
-   should be able to train 153.6 million samples in less than 2 weeks of training.
+   Azure's Standard_ND40rs_v2 SKU for a total of 64 NVIDIA V100 GPUs. With this setup and
+   a batch size of 1536 you should be able to complete 100k training steps (153.6 million
+   samples) in less than 2 weeks of training.
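
The figures in the revised Megatron-LM bullet can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers from the diff (batch size 1536, 100k steps, 8 instances, 64 GPUs) and treats "2 weeks" as exactly 14 days, which is an assumption; the helper name `required_samples_per_second` is hypothetical, and the resulting rate is a derived lower bound, not a measured throughput.

```python
BATCH_SIZE = 1536          # from the updated docs/azure.md text
TRAINING_STEPS = 100_000   # "100k training steps"
NUM_INSTANCES = 8          # Standard_ND40rs_v2 instances
TOTAL_GPUS = 64            # NVIDIA V100 GPUs across all instances


def total_samples(batch_size, steps):
    """Samples processed when every step consumes one full batch."""
    return batch_size * steps


def required_samples_per_second(samples, days):
    """Average throughput needed to process `samples` within `days`."""
    return samples / (days * 24 * 3600)


if __name__ == "__main__":
    samples = total_samples(BATCH_SIZE, TRAINING_STEPS)
    print(f"total samples: {samples:,}")
    print(f"GPUs per instance: {TOTAL_GPUS // NUM_INSTANCES}")
    # Assumes a 14-day budget; the source only says "less than 2 weeks".
    rate = required_samples_per_second(samples, days=14)
    print(f"required average rate: {rate:.1f} samples/s")
```

The product of batch size and step count matches the 153.6 million samples quoted in both the old and new wording, so the added text clarifies the claim without changing it.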