diff --git a/doc/fluid/design/dist_train/dist_train_nccl2.md b/doc/fluid/design/dist_train/dist_train_nccl2.md
index aa7455ec5de0d46d7c2b0cef3b7ebf4754af3cb1..b8b8427811cddcddf872db5badfd37c96a76c3e3 100644
--- a/doc/fluid/design/dist_train/dist_train_nccl2.md
+++ b/doc/fluid/design/dist_train/dist_train_nccl2.md
@@ -1,7 +1,7 @@
 # Distributed Training with NCCL2
 
 We design a pattern that can enable training with `ParallelExecutor` and
-using [NCCL2](https://developer.nvidia.com/nccl) as it's collective
+use [NCCL2](https://developer.nvidia.com/nccl) as its collective
 communication library.
 
 In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
@@ -9,14 +9,14 @@ to do multi GPU training. And if we initialize NCCL2 communicators as
 ranks in a distributed environment, we can simply run the `ParallelExecutor`
 as a distributed program! The only thing that may be different than in
 the single node version is that we need to broadcast the NCCL unique ID
-to all the nodes, and initialize communicators using that ID, so NCCL2
-will know each other as ranks.
+to all the nodes and initialize communicators using that ID, so NCCL2
+ranks can recognize each other.
 
 To achieve this feature, we introduce a new operator: `gen_nccl_id` op,
 so we are ***not*** "bind to" running NCCL2 with MPI, we can run it in
-what ever platform you like.
+whatever platform you like.
 
-It have two running modes:
+It has two running modes:
 
 1. Generate and broadcast mode, which should be used on trainer 0;
 1. Listen and fetch mode, which should be used on trainers other than 0.
@@ -29,7 +29,7 @@ initialize NCCL communicator objects.
 
 
 The above figure indicates the general process when training with NCCL2
-distributed. Each trainer have the number of communicators equal to the
+distributed. Each trainer has a number of communicators equal to the
 number of GPUs, but the ranks should match the global ranks number: here
 we have total 8 GPUs, so `nranks==8`, for each trainer, the ranks should
 be from 0 ~ 3 on trainer 0 and 4 ~ 7 on trainer 1.
diff --git a/doc/fluid/howto/cluster/nccl2_rdma_training.md b/doc/fluid/howto/cluster/nccl2_rdma_training.md
index cecd5c3a7a7339e3be6772543a534728ec132105..8adaf324fccb4cda7af16b9bace559c0642ae444 100644
--- a/doc/fluid/howto/cluster/nccl2_rdma_training.md
+++ b/doc/fluid/howto/cluster/nccl2_rdma_training.md
@@ -1,12 +1,12 @@
 # Distributed Training with NCCL2 and RDMA
 
-When doing distributed multi-GPU training, network bandwith often becomes the
-bottle neck. We introduce a way to use NCCL2 to do such training job to
-achieve best performace.
+When doing distributed multi-GPU training, network bandwidth often becomes the
+bottleneck. We introduce a way to use NCCL2 for such training jobs to
+achieve the best performance.
 
-## Prepare Hardwares with RDMA and Multiple GPUs
+## Prepare Hardware with RDMA and Multiple GPUs
 
-I'm using two Linux servers each of them is installed with 8 GPUs and
+I'm using two Linux servers, each installed with 8 GPUs and
 one 100Gb RDMA card.
 Base environment is:
 
@@ -25,7 +25,7 @@ In general, the steps including:
 1. Use docker to run tests and make sure GPUs and RDMA can work inside
    the container.
 
-I'll ommit section "Install GPU drivers" because we can find it easily
+I'll omit the section "Install GPU drivers" because we can find it easily
 somewhere else.
 
 ### Install RDMA drivers
@@ -33,7 +33,7 @@ somewhere else.
 For my case, I've got two machines with device "Mellanox Technologies
 MT27700 Family [ConnectX-4]" installed.
 The OS was "CentOS 7.4" and I updated the kernel to version 4.4 so that docker can
-work with latest overlay2 filesystem.
+work with the latest overlay2 filesystem.
 ***NOTE: before you start, make sure you have a way to get a console
 of the server other than ssh because we may need to re-configure the
 network device.***
@@ -45,14 +45,14 @@ network device.***
 1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
 1. Run `/etc/init.d/openibd restart` to make everything work, note that this
    operation may cause the network goes down if you are using this
-   RDMA device as default network device and use ssh to login the server.
+   RDMA device as the default network device and are using ssh to log in to the server.
 1. Re-configure the network interface, for example: `ifconfig eth2
    192.168.16.30/20 up`, then add routes if needed: `ip route add default via
    192.168.16.1 dev eth2`.
 1. Do the same thing on the other node.
 1. Use `ping` to test if the two nodes have typical ICMP connection.
 1. Use either `udaddy` or `ib_write_bw` to test the network connection is
-   ready and have the desired bandwith.
+   ready and has the desired bandwidth.
 
 ### Prepare Docker Image to Run RDMA Programs
 
@@ -60,7 +60,7 @@ network device.***
    package in it.
 1. Start a docker container and mount GPU driver libs into it (you can
    skip this step if you are using nvidia-docker).
-1. Mount RDMA dirvers and libs into the docker image (see below section),
+1. Mount RDMA drivers and libs into the docker image (see the section below),
    also `udaddy` and `ib_write_bw` if needed.
 1. Mount GPU devices and RDMA devices into the container using `--device`
    or just use privileged mode `--privileged`.
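
For reviewers who want to try the steps touched by the `nccl2_rdma_training.md` hunks above, here is a minimal shell sketch of the "verify the RDMA link, then start a container with RDMA and GPU devices mounted" flow the doc describes. The image name, library paths, and device nodes are illustrative assumptions, not anything specified in the patched docs; adjust them to your driver installation. The `192.168.16.30` address reuses the example IP from the doc.

```bash
# 1. Check RDMA bandwidth between the two nodes with perftest's ib_write_bw.
#    On the server node, start it with no arguments:
ib_write_bw
#    On the client node, point it at the server's RDMA interface IP:
ib_write_bw 192.168.16.30

# 2. Start the training container with GPU and RDMA devices mounted.
#    Device and library paths below are typical but installation-specific;
#    the image name my_paddle_rdma_image is a placeholder.
docker run -it --rm \
    --device=/dev/infiniband/uverbs0 \
    --device=/dev/infiniband/rdma_cm \
    --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 \
    -v /usr/local/nvidia:/usr/local/nvidia:ro \
    -v /usr/lib64/libibverbs.so.1:/usr/lib64/libibverbs.so.1:ro \
    my_paddle_rdma_image /bin/bash

# Alternatively, as the doc notes, privileged mode exposes all host devices at once:
# docker run -it --rm --privileged my_paddle_rdma_image /bin/bash
```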