Commit 2f22ed2e authored by Jeff Rasley, committed by GitHub

Azure and local node documentation updates (#48)

Parent 0ca56d4b
......@@ -18,7 +18,7 @@ compared to the state-of-art.
| --------------------------------------- | ------------------------------------------- |
| [Why DeepSpeed?](#why-deepspeed) | DeepSpeed overview |
| [Getting Started](#getting-started) | DeepSpeed first steps |
| [Further Reading](#further-reading) | Additional DeepSpeed documentation |
| [Further Reading](#further-reading) | DeepSpeed features, tutorials, etc. |
| [Testing](#testing) | Instructions for testing DeepSpeed |
| [Contributing](#contributing) | Instructions for contributing to DeepSpeed |
......@@ -306,10 +306,10 @@ We illustrate an example usage of DeepSpeed with the following assumptions:
4. `ds_config.json` is the configuration file for DeepSpeed
### Resource Configuration
DeepSpeed configures compute resources with hostfiles that are compatible with
### Resource Configuration (multi-node)
DeepSpeed configures multi-node compute resources with hostfiles that are compatible with
[OpenMPI](https://www.open-mpi.org/) and [Horovod](https://github.com/horovod/horovod).
A hostfile is a list of *hostnames*, which are machines accessible via passwordless
A hostfile is a list of *hostnames* (or SSH aliases), which are machines accessible via passwordless
SSH, and *slot counts*, which specify the number of GPUs available on the system. For
example,
```
......@@ -321,7 +321,8 @@ for training.
Hostfiles are specified with the `--hostfile` command line option. If no hostfile is
specified, DeepSpeed searches for `/job/hostfile`. If no hostfile is specified or found,
DeepSpeed queries the number of GPUs on the local machine.
DeepSpeed queries the number of GPUs on the local machine to discover the number of local
slots available.
The following command launches a PyTorch training job across all available nodes and GPUs
......@@ -355,6 +356,14 @@ deepspeed --include="worker-2:0,1" \
--deepspeed --deepspeed_config ds_config.json
```
### Resource Configuration (single-node)
When running on a single node (with one or more GPUs), DeepSpeed *does not*
require a hostfile as described above. If a hostfile is not detected or passed
in, DeepSpeed queries the number of GPUs on the local machine to discover the
number of slots available. The `--include` and `--exclude` arguments work as
normal, but the user should specify `localhost` as the hostname.
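The lookup order described in the two resource configuration sections can be sketched as follows. This is a minimal illustration, not DeepSpeed's launcher code: `parse_hostfile`, `resolve_resources`, and the `local_gpu_count` parameter are hypothetical names, the `hostname slots=N` line format is assumed from the OpenMPI hostfile convention, and the behavior for an explicitly passed but missing hostfile is an assumption.

```python
import os
import re
from collections import OrderedDict

# Default location named in the text above.
DEFAULT_HOSTFILE = "/job/hostfile"


def parse_hostfile(text):
    """Parse 'hostname slots=N' lines (assumed OpenMPI-style format)
    into an ordered mapping of hostname -> slot count."""
    resources = OrderedDict()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = re.fullmatch(r"(\S+)\s+slots=(\d+)", line)
        if match is None:
            raise ValueError(f"malformed hostfile line: {raw!r}")
        host, slots = match.group(1), int(match.group(2))
        if host in resources:
            raise ValueError(f"duplicate host in hostfile: {host}")
        resources[host] = slots
    return resources


def resolve_resources(hostfile_path=None, local_gpu_count=1):
    """Use the given hostfile, else the default path, else fall back to
    a single 'localhost' entry. local_gpu_count stands in for the GPU
    query that the real launcher performs on the local machine."""
    path = hostfile_path or DEFAULT_HOSTFILE
    if os.path.isfile(path):
        with open(path) as handle:
            return parse_hostfile(handle.read())
    # Assumption: a missing hostfile is treated as the single-node case.
    return OrderedDict([("localhost", local_gpu_count)])


if __name__ == "__main__":
    sample = "worker-1 slots=4\nworker-2 slots=4\n"  # hypothetical hosts
    print(dict(parse_hostfile(sample)))
    print(dict(resolve_resources("/nonexistent/hostfile", local_gpu_count=2)))
```

In the fallback case the only hostname is `localhost`, which is why `--include` and `--exclude` filters on a single node need to name `localhost` rather than a worker alias.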
## Further Reading
......
# DeepSpeed with Azure
This tutorial will help you get started running DeepSpeed on [Azure virtual
machines](https://azure.microsoft.com/en-us/services/virtual-machines/). Support for
[Azure ML](https://azure.microsoft.com/en-us/services/machine-learning/) will be coming
soon!
machines](https://azure.microsoft.com/en-us/services/virtual-machines/).
Looking forward, we will be integrating these techniques and additional enhancements
into the [Azure ML](https://azure.microsoft.com/en-us/services/machine-learning/) platform to
benefit all your large model training jobs.
If you don't already have an Azure account please see more details here: [https://azure.microsoft.com/](https://azure.microsoft.com/).
To help with launching Azure instances we suggest using the [Azure
CLI](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest). We have created
......@@ -17,9 +20,15 @@ between Docker containers. `ssh-keygen` is the recommended way of doing this. Ou
assume your key is located inside the same directory as the Azure scripts.
## Azure Config JSON
Our helper scripts depend on the following a configuration JSON for deployment and setup.
We have provided a simple example JSON in `azure_config.json` that sets up a basic
environment with two VMs. See the example below:
Our helper scripts depend on the following configuration JSON for deployment
and setup. We have provided a simple example JSON in `azure_config.json` that
sets up a basic environment with two VMs. This config uses the NV6_Promo
instance type which has one NVIDIA Tesla M60 GPU per VM. You can read more
details about the VM on the [Linux Virtual Machines
Pricing](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)
page.
See the example below:
```json
{
"num_vms": 2,
......@@ -73,9 +82,8 @@ public IP address of your VM and use the SSH key provided in the Azure configura
JSON.
## Access DeepSpeed container
Everything should be up and running at this point. Set's access the running DeepSpeed
container on the first VM and make sure we can talk to the other containers in our setup.
Let's complete the following steps:
Everything should be up and running at this point. Let's access the running DeepSpeed
container on the first VM and make sure we can talk to the other containers in our deployment.
* SSH into the first VM via: `./azure_ssh.sh 0`
* Change directories into the azure folder of this repo via: `cd ~/workdir/DeepSpeed/azure`
......@@ -83,8 +91,14 @@ Let's complete the following steps:
* You should now be able to `ssh` into any other Docker container. The containers can be
accessed via their SSH alias of `worker-N`, where `N` is the VM number between `0`
and `num_vms-1`. In this example we should be able to successfully run `ssh worker-1
hostname`. You can also use `ds_ssh` to execute a command in parallel on all of your
worker containers.
hostname` which will return the hostname of worker-1.
## Parallel SSH across containers
DeepSpeed comes installed with a helper script `ds_ssh` which is a wrapper around
the [pdsh](https://linux.die.net/man/1/pdsh) command that lets you issue commands
to groups of hosts (via SSH) in parallel. This wrapper simply connects with the
hostfile that defines all the containers in your deployment. For example, if you run
`ds_ssh hostname` you should see a list of all the hostnames in your deployment.
## Run CIFAR-10 example model
We will now run the DeepSpeed CIFAR-10 model example to test the VM setup. From inside
......@@ -105,17 +119,11 @@ the first DeepSpeed container:
```bash
deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
```
Alternatively, we provide a helper script `./run_ds.sh`.
This will train a simple CIFAR-10 example model. The accuracy you achieve will
depend on the number of GPUs you train with. We use this example simply to
demonstrate that everything is set up correctly, not to train a suitable
CIFAR-10 model.
## Megatron-LM GPT2
DeepSpeed includes an example model using Megatron-LM's GPT2. Please refer to the full
[Megatron tutorial](tutorials/MegatronGPT2Tutorial.md) for more details.
* In order to fully train GPT2 with DeepSpeed and ZeRO we recommend using 8 instances of
Azure's Standard_ND40rs_v2 SKU for a total of 64 NVIDIA V100 GPUs. With this setup you
should be able to train 153.6 million samples in less than 2 weeks of training.
Azure's Standard_ND40rs_v2 SKU for a total of 64 NVIDIA V100 GPUs. With this setup and
a batch size of 1536 you should be able to complete 100k training steps (153.6 million
samples) in less than 2 weeks of training.
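As a quick sanity check of the figures above, the following sketch reproduces the sample count from the stated batch size and step count, and derives the minimum sustained throughput implied by the two-week bound. The variable names are illustrative, and the throughput figures are a lower bound inferred from the stated numbers, not measured results.

```python
# Figures taken from the text above.
batch_size = 1536          # samples per training step
training_steps = 100_000   # total optimizer steps
num_gpus = 64              # 8 instances x 8 V100 GPUs

total_samples = batch_size * training_steps

# Two weeks expressed in seconds.
seconds_in_two_weeks = 14 * 24 * 60 * 60

# Minimum sustained rate needed to finish within two weeks.
min_samples_per_second = total_samples / seconds_in_two_weeks
min_samples_per_gpu_second = min_samples_per_second / num_gpus

print(f"total samples: {total_samples:,}")
print(f"required throughput: {min_samples_per_second:.1f} samples/s overall")
print(f"required throughput: {min_samples_per_gpu_second:.2f} samples/s per GPU")
```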