# Distributed Training with NCCL2 and RDMA

When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. This document introduces a way to run such training jobs with NCCL2
over RDMA to achieve the best performance.

## Prepare Hardware with RDMA and Multiple GPUs

I'm using two Linux servers, each with 8 GPUs and
one 100Gb RDMA card installed.
The base environment is:

* OS: CentOS 7.4
* RDMA device: "Mellanox Technologies MT27700 Family [ConnectX-4]"
* Kernel version: `4.4.88-1.el7.elrepo.x86_64`
* Docker version: `1.12.6`
* Docker storage driver: `overlay2`
* IP addresses: 192.168.16.30, 192.168.16.34

In general, the steps include:

1. Install GPU drivers
1. Install RDMA drivers
1. Install "InfiniBand Support"
1. Use docker to run tests and make sure GPUs and RDMA can work inside
   the container (a quick check is sketched after this list).
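
A quick check like the following helps confirm that the GPU and RDMA drivers
are visible. This is a sketch only; it assumes `nvidia-smi` and `ibv_devinfo`
from the driver packages are on the `PATH`, and the same commands can be run
inside the container once the mounts described later are in place:

```bash
nvidia-smi                 # all GPUs should be listed with the expected driver version
ibv_devinfo                # the ConnectX-4 HCA should appear with state PORT_ACTIVE
lsmod | grep -E 'mlx|ib_'  # Mellanox / InfiniBand kernel modules should be loaded
```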

I'll omit the "Install GPU drivers" section because instructions can easily be
found elsewhere.

### Install RDMA drivers

In my case, I've got two machines with the device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
CentOS 7.4, and I updated the kernel to version 4.4 so that Docker can
work with the latest overlay2 filesystem.
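
A quick way to confirm the kernel version and Docker storage driver on each
host (plain standard commands, nothing specific to this setup):

```bash
uname -r                                             # expect a 4.4.x kernel after the upgrade
docker info 2>/dev/null | grep -i 'storage driver'   # expect: Storage Driver: overlay2
```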

***NOTE: before you start, make sure you have a way to get a console
of the server other than ssh because we may need to re-configure the
network device.***

1. Go to http://www.mellanox.com/page/products_dyn?product_family=26,
   download the `MLNX_OFED` software at the bottom of the page, and upload it
   to the server.
1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
1. Run `/etc/init.d/openibd restart` to make everything work. Note that
   this operation may bring the network down if you are using this
   RDMA device as the default network device and are logged in to the server via ssh.
W
Wu Yi 已提交
49 50 51 52 53 54
1. Re-configure the network interface, for example:
   `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
   `ip route add default via 192.168.16.1 dev eth2`.
1. Do the same thing on the other node.
1. Use `ping` to check that the two nodes have a working ICMP connection.
1. Use either `udaddy` or `ib_write_bw` to verify that the RDMA connection is
   ready and delivers the desired bandwidth (see the sketch after this list).
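
For the bandwidth test in the last step, a typical `ib_write_bw` run looks
like the sketch below; the device name `mlx5_0` is an assumption, list the
actual device name with `ibv_devices`:

```bash
# On node 1 (192.168.16.30): start ib_write_bw in server mode
ib_write_bw -d mlx5_0

# On node 2 (192.168.16.34): connect to node 1 as the client and report bandwidth
ib_write_bw -d mlx5_0 192.168.16.30
```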

### Prepare Docker Image to Run RDMA Programs

1. Build a Docker image using a CUDA base image like `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install the PaddlePaddle whl
   package in it.
1. Start a docker container and mount GPU driver libs into it (you can
   skip this step if you are using nvidia-docker).
1. Mount RDMA drivers and libs into the container (see the section below),
   and also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`,
   or just use privileged mode `--privileged`.
1. Start the container using host network mode: `--net=host` (an example
   command is sketched after this list).
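
Putting the container steps together, a start command might look like the
sketch below; the image tag `paddle:rdma` and the host directory
`/opt/rdma-libs` (holding the RDMA libs listed in the next section) are
assumptions, adjust them to your setup:

```bash
docker run -it --privileged --net=host \
    -v /opt/rdma-libs:/opt/rdma-libs:ro \
    paddle:rdma /bin/bash
# inside the container, make the mounted libs visible to the loader:
#   export LD_LIBRARY_PATH=/opt/rdma-libs:$LD_LIBRARY_PATH
```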

### RDMA Library Files Needed

Usually, `MLNX_OFED` installs the latest supported libs under
`/usr/lib64/mlnx_ofed/valgrind`. Other libs needed to run RDMA programs
are listed below. These libs must be mounted into the docker container;
one way to collect them is sketched after the list.

* Libs under `/usr/lib64/mlnx_ofed/valgrind`
  * libibcm.so
  * libibverbs.so
  * libmlx4.so
  * libmlx5.so
  * libmlx5-rdmav2.so
  * librdmacm.so
* Other libs:
  * libnl-3.so.200
  * libnl-route-3.so.200
  * libnuma.so.1
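
One simple way to handle the mounts is to copy all of the libs above into a
single host directory and mount just that directory. The path `/opt/rdma-libs`
and the locations of the libnl/libnuma libs are assumptions; locate them with
`ldconfig -p` if they differ on your system:

```bash
mkdir -p /opt/rdma-libs
# libs installed by MLNX_OFED
cp -L /usr/lib64/mlnx_ofed/valgrind/lib{ibcm,ibverbs,mlx4,mlx5,mlx5-rdmav2,rdmacm}.so /opt/rdma-libs/
# other required libs (host paths assumed; check with `ldconfig -p | grep libnl`)
cp -L /usr/lib64/libnl-3.so.200 /usr/lib64/libnl-route-3.so.200 /usr/lib64/libnuma.so.1 /opt/rdma-libs/
```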

## Start to Run the Training Job

Set the following NCCL environment variables to turn NCCL switches on and off:


| Env Name | Description |
| --- | --- |
| NCCL_SOCKET_IFNAME | The network interface of the RDMA card, e.g. `eth2` |
| NCCL_P2P_DISABLE | Set to 1 to disable P2P transfer between GPUs |
| NCCL_IB_DISABLE | Set to 1 to disable using RDMA |
| NCCL_IB_CUDA_SUPPORT | Set to 1 to enable GPU Direct if supported |
| NCCL_DEBUG | Set debug level: VERSION, WARN, INFO |
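
For example, a configuration matching this setup could be exported on both
nodes before launching the trainer (the values are illustrative; `eth2` is the
interface configured earlier):

```bash
export NCCL_SOCKET_IFNAME=eth2    # interface of the RDMA card configured above
export NCCL_IB_DISABLE=0          # keep the RDMA transport enabled
export NCCL_IB_CUDA_SUPPORT=1     # enable GPU Direct RDMA if the hardware supports it
export NCCL_DEBUG=INFO            # print transport selection details at startup
```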

My two servers are `192.168.16.30` and `192.168.16.34`. On node 1, run:

```bash
PADDLE_TRAINER_ID=0 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.30 stdbuf -oL python vgg16.py
```

On node 2, run:

```bash
PADDLE_TRAINER_ID=1 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.34 stdbuf -oL python vgg16.py
```