# Design Doc: NCCL support in Paddle Fluid

## Abstract

This design doc describes NCCL support in Paddle Fluid. We propose an approach to supporting the NCCL library both on a single machine and across multiple machines. We wrap the NCCL primitives `Broadcast`, `AllReduce`, and `Reduce` as operators so that multiple GPUs can be used from a single script.


## Motivation

[NCCL](https://developer.nvidia.com/nccl) is an NVIDIA library that supports multi-GPU communication. With the NCCL library, we can easily accelerate training in parallel.

- The optimizer sub-graph can easily be moved to a parameter server, so the multi-GPU feature is compatible with the distributed support design.
- The design plugs in easily with the [NCCL2](https://developer.nvidia.com/nccl) library.
- GPU model parallelism becomes easier to implement: we only need to give each GPU's sub-graph a different part of the whole graph.
- GPU Data Parallelism

  Suppose we have `n` GPUs. Every GPU holds `1/n` of the training data and stores a complete copy of the model in GPU memory.

- GPU Model Parallelism

  Every GPU holds one part of the complete model in GPU memory.
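The difference between the two modes can be sketched in plain Python. This is only an illustration: `shard_batch` and `partition_layers` are hypothetical helper names, not Paddle APIs, and the layer names are made up.

```python
def shard_batch(items, n_gpus):
    """Split a list into n_gpus contiguous shards of near-equal size."""
    base, extra = divmod(len(items), n_gpus)
    shards, start = [], 0
    for rank in range(n_gpus):
        # The first `extra` ranks take one additional item.
        size = base + (1 if rank < extra else 0)
        shards.append(items[start:start + size])
        start += size
    return shards


def partition_layers(layers, n_gpus):
    """Model parallelism: give each GPU a contiguous group of layers."""
    return shard_batch(layers, n_gpus)


# Data parallelism: each GPU sees 1/n of the batch and the whole model.
batch = list(range(8))
data_shards = shard_batch(batch, 4)

# Model parallelism: each GPU sees one part of the model.
layer_parts = partition_layers(["fc1", "fc2", "fc3", "fc4", "softmax"], 2)
```

In data parallelism only the input is split; in model parallelism the graph itself is split.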

At the beginning of training, the framework needs to issue the same sub-graph to every GPU in Data Parallelism, or a different sub-graph to each GPU in Model Parallelism.

During training, we need peer-to-peer copies between GPUs, aggregation of gradients/parameters from GPUs, and broadcasting of parameters to GPUs. Every GPU only needs to run its sub-graph with the correct place information.

Besides, the framework needs interfaces to synchronize model updates across GPUs, and to issue/merge models from different GPU cards.

## Implementation

As mentioned above, several kinds of operators are needed. Currently, we need to issue parameters to different GPUs, which we call the Broadcast operator, and to synchronize parameters between GPUs, which we call the AllReduce operator.

### Graph Converter

To be compatible with the [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the graph converter converts the user-defined operator graph into sub-graphs to be executed on different devices.

1. The user-defined operator graph will be partitioned into sub-graphs.

2. Control operators between GPUs will be inserted into the graph.

   *Broadcast and AllReduce on a single machine; Broadcast, AllReduce, [Send, Recv](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md#graph-converter) on multiple machines.*

   <img src="images/multigpu_before_convert.png" width="300"/>

After conversion, the graph looks as follows:

<img src="images/multigpu_allreduce.png" width="1000"/>
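The two conversion steps can be sketched as follows for the single-machine, data-parallel case. This is a toy model under stated assumptions, not the actual converter: ops are represented as hypothetical `(type, variable)` tuples, `convert_graph` is an illustrative name, and the placement of the inserted operators is a guess based on the picture.

```python
def convert_graph(ops, n_gpus, params):
    """Replicate a user-defined op list per GPU and insert control ops.

    Hypothetical representation: each op is a (type, variable) tuple.
    """
    sub_graphs = {}
    for rank in range(n_gpus):
        # Every replica starts by receiving the initial parameters.
        graph = [("broadcast", p) for p in params]
        for op_type, var in ops:
            graph.append((op_type, var))
            if op_type == "backward":
                # Synchronize the gradient right after it is computed.
                graph.append(("allreduce", "d" + var))
        sub_graphs["rank%d" % rank] = graph
    return sub_graphs


user_ops = [("forward", "W"), ("backward", "W"), ("optimize", "W")]
sub_graphs = convert_graph(user_ops, n_gpus=2, params=["W"])
```

Each rank ends up with the same sub-graph, which matches the data-parallel case described earlier; a multi-machine converter would additionally insert Send and Recv operators.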

Operators are added to the sub-graphs. Every GPU is assigned a role: `rank0`, `rank1`, etc.

- **Broadcast**. The Broadcast operator distributes initialized parameters to all GPUs from the GPU that owns them, e.g. from the `rank0` GPU.
- **AllReduce**. The AllReduce operator synchronizes parameters/gradients between GPUs. AllReduce is implemented with the ring-based communication method, which avoids a bottleneck on any single GPU.
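To illustrate why a ring avoids a single-GPU bottleneck, here is a minimal CPU-only simulation of the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is a sketch of the general technique, not of NCCL's internals; `ring_allreduce` is a hypothetical name, and the buffer length is assumed to be divisible by the number of ranks to keep the chunking simple.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over ranks arranged in a ring.

    buffers[r] is the list held by rank r. Returns the per-rank results.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "sketch assumes length divisible by rank count"
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies

    def span(idx):
        return slice(idx * chunk, (idx + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so the sends are simultaneous.
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [data[r][span(idx)] for r, idx in sends]
        for (r, idx), payload in zip(sends, payloads):
            dst = (r + 1) % n
            seg = span(idx)
            data[dst][seg] = [a + b for a, b in zip(data[dst][seg], payload)]

    # Phase 2, all-gather: pass the reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [data[r][span(idx)] for r, idx in sends]
        for (r, idx), payload in zip(sends, payloads):
            data[(r + 1) % n][span(idx)] = payload

    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

In each step every rank sends one chunk to its neighbor only, so the traffic is spread evenly over the ring instead of converging on one GPU.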

These two operators need multi-GPU context support.

Note that the AllReduce operator forces the GPUs to synchronize at that point. Every device only needs to run its sub-graph in a loop forever; whether the whole training process is asynchronous or synchronous depends on where the AllReduce point is placed in the graph.

As shown in the picture, each GPU computes the gradient of `W`, followed by an `AllReduce` operator that accumulates `dW` over the full batch of data. Each GPU then runs the optimization process individually and applies the gradient to its own `W`.
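A toy version of this training step, assuming a scalar weight, a squared-error loss, and plain SGD (none of which are specified in this doc), looks like the following. `allreduce_sum` is a stand-in for the AllReduce operator, not a real API.

```python
def local_gradient(w, shard):
    """Gradient w.r.t. w of sum(0.5 * (w * x - y) ** 2) over one shard."""
    return sum((w * x - y) * x for x, y in shard)


def allreduce_sum(values):
    """Stand-in for AllReduce: every rank receives the sum of all inputs."""
    total = sum(values)
    return [total for _ in values]


def train_step(weights, shards, lr):
    # Each GPU computes dW on its own shard of the batch.
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    # AllReduce accumulates dW over the full batch on every GPU.
    grads = allreduce_sum(grads)
    # Each GPU applies the same update to its own copy of W.
    return [w - lr * g for w, g in zip(weights, grads)]


shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0, 0.0]  # identical copies of W after Broadcast
weights = train_step(weights, shards, lr=0.01)
```

Because every replica applies the same accumulated gradient, the copies of `W` stay identical without any further parameter exchange.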

In fact, because every GPU runs the optimization for the full batch of data, the compute of (n-1) GPUs is wasted. We will improve this in the next stage.

### Benefits