# Distributed Training with NCCL2

We design a pattern that enables training with `ParallelExecutor` while
using [NCCL2](https://developer.nvidia.com/nccl) as its collective
communication library.

In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
to do multi-GPU training. If we initialize the NCCL2 communicators as
ranks in a distributed environment, we can simply run the `ParallelExecutor`
as a distributed program! The only thing that may differ from the
single-node version is that we need to broadcast the NCCL unique ID
to all the nodes and initialize the communicators using that ID, so that
the communicators can recognize each other as ranks.
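
As a rough sketch of what the trainer-side code could look like under this
design (the small network is illustrative only, and the `num_trainers` /
`trainer_id` arguments to `ParallelExecutor` are assumptions about how the
global rank information might be passed in):

```python
import paddle.fluid as fluid

# A minimal, single-node-looking fluid program (illustrative only).
x = fluid.layers.data(name="x", shape=[13], dtype="float32")
y = fluid.layers.data(name="y", shape=[1], dtype="float32")
y_predict = fluid.layers.fc(input=x, size=1)
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
loss = fluid.layers.mean(cost)
fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

# The startup program is where the `gen_nccl_id` op described below would
# be appended, so every worker ends up with the same NCCL unique ID.
fluid.Executor(fluid.CUDAPlace(0)).run(fluid.default_startup_program())

pe = fluid.ParallelExecutor(
    use_cuda=True,
    loss_name=loss.name,
    num_trainers=2,  # assumed argument: total number of trainer nodes
    trainer_id=0)    # assumed argument: this node's index (0 on trainer 0)
```

Everything before the `ParallelExecutor` line is an ordinary single-node
fluid program; in this sketch only the two extra arguments distinguish the
distributed case.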

To achieve this, we introduce a new operator, the `gen_nccl_id` op, so that
we are ***not*** "bound to" running NCCL2 with MPI; we can run it on
whatever platform you like.

The `gen_nccl_id` op has two running modes:

1. Generate and broadcast mode, which should be used on trainer 0;
1. Listen and fetch mode, which should be used on trainers other than 0.

In both modes, this op saves the NCCL ID into the current scope as a
persistable variable. We can then insert this op at the end of the
fluid "startup program", so that all workers get the same ID to
initialize their NCCL communicator objects.
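
To make the two modes concrete, the following is a conceptual sketch in
plain Python (sockets stand in for the RPC transport; the function names
and endpoint handling are hypothetical, and `128` is the size of an NCCL2
unique ID, `NCCL_UNIQUE_ID_BYTES`):

```python
import socket

NCCL_UNIQUE_ID_BYTES = 128  # size of an NCCL2 unique ID


def broadcast_nccl_id(nccl_id, peer_endpoints):
    """Generate-and-broadcast mode (trainer 0): push the ID to every peer."""
    for host, port in peer_endpoints:
        with socket.create_connection((host, port)) as conn:
            conn.sendall(nccl_id)


def fetch_nccl_id(listen_port):
    """Listen-and-fetch mode (trainers other than 0): wait for the ID."""
    with socket.socket() as server:
        server.bind(("0.0.0.0", listen_port))
        server.listen(1)
        conn, _ = server.accept()
        with conn:
            buf = b""
            while len(buf) < NCCL_UNIQUE_ID_BYTES:
                chunk = conn.recv(NCCL_UNIQUE_ID_BYTES - len(buf))
                if not chunk:
                    break
                buf += chunk
    # The real op would write the received ID into the scope as a
    # persistable variable for the communicator initialization step.
    return buf
```

Trainer 0 would call `broadcast_nccl_id` with the ID produced by NCCL's
`ncclGetUniqueId()`, while every other trainer calls `fetch_nccl_id`
before its communicators are initialized.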

<img src="src/ncc2_design.png">

The above figure shows the general process of distributed training with
NCCL2. Each trainer has as many communicators as it has GPUs, but the
ranks of those communicators must follow the global rank numbering: here
we have 8 GPUs in total, so `nranks==8`; the ranks are 0 ~ 3 on
trainer 0 and 4 ~ 7 on trainer 1.
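
A small worked example of this rank layout, with illustrative variable
names:

```python
gpus_per_trainer = 4
num_trainers = 2
nranks = gpus_per_trainer * num_trainers  # 8 GPUs in total

for trainer_id in range(num_trainers):
    for gpu_id in range(gpus_per_trainer):
        rank = trainer_id * gpus_per_trainer + gpu_id
        print("trainer %d, GPU %d -> rank %d of %d"
              % (trainer_id, gpu_id, rank, nranks))
# trainer 0 holds ranks 0 ~ 3, trainer 1 holds ranks 4 ~ 7
```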