@@ -7,36 +7,34 @@ This Design Doc refers to the NCCL feature in paddle. We propose an approach t
...
@@ -7,36 +7,34 @@ This Design Doc refers to the NCCL feature in paddle. We propose an approach t
## Motivation
## Motivation
NCCL is a Nvidia library support Multi-GPU communicating. [NCCL](https://developer.nvidia.com/nccl). With NCCL library, we can easily accelerate the training in parallel.
NCCL is a NVIDIA library support Multi-GPU communicating and optimized for NVIDIA GPUs, it provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter, that can achieve high bandwidth over PCIe and NVLink high-speed interconnect. [NCCL](https://developer.nvidia.com/nccl). With NCCL library, we can easily accelerate the training in parallel.
-can easily move the optimize sub-graph to parameter server, multi-GPU feature can be compatible with distributed support design.
-Pros
- easily plug-in with [NCCL2](https://developer.nvidia.com/nccl) library.
1. easily plug-in with [NCCL2](https://developer.nvidia.com/nccl) library.
- GPU Model parallelism becomes easier to implement. we only need to replace different GPU's sub-graph with different part of the whole graph.
1. high performance in NVIDIA GPUs.
- GPU Data Parallelism
1. MPI like primitives, which have low learning cost for users.
Suppose to we have `n`GPUs, every GPU has `1/n`part of training data, and store a complete model in GPU memory.
- Cons
1. Only design for NVIDIA GPUs, not a general multi-device solution.
1. Although NCCL1 is opensourced under BSD license, but NCCL2 is not opensourced anymore.
- GPU Model Parallelism
At the beginning of training, the framework needs to distribute the same parameters to every GPU, and merge the gradients at any time user interests.
every GPU have part of a complete model in GPU memory.
As a result, during training, we need the operations of peer to peer copy between different GPUs, aggregating gradients/parameters from GPUs, and broadcasting parameters to GPUs. Every GPU only need to run the operator with correct place information.
At the beginning of training, the framework needs to issue the same sub-graph to every GPU in Data Parallelism, or different sub-graph in Model Parallelism.
Besides, it needs interfaces to synchronize model update with each different GPU Cards.
During training, we need the operations of peer to peer copy between different GPUs, aggregating gradients/parameters from GPUs, and broadcasting parameters to GPUs. Every GPU only need to run the sub-graph with correct place information.
Besides, it needs interfaces to synchronize model update with each other, and issue/merge model from different GPU Cards.
## Implementation
## Implementation
As mentioned above, we summarise that several kinds of operators are needed. Currently, we need to issue parameters to different GPUs, named it with Broadcast operator. And also synchronize parameters between GPUs, called it with AllReduce.
As mentioned above, we wrap the NCCL routines as several kinds of operators. Need to note that NCCL need to create Communicator between gpu at the beginning, so there is a NCCLInit operator created.
### Graph Converter
### Graph Converter
To be compatible with [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the graph converter converts the user defined operation graph into sub-graphs to be executed on different devices.
To be compatible with [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the graph converter converts the user defined operation graph into sub-graphs to be executed on different devices.
1. The user-defined operator graph will be partitioned into sub-graph.
1. The user-defined model will be a single device program
2.Control operators between GPUs will be inserted into the graph.
2.Broadcast/Reduce operators between GPUs will be inserted into the program, even for the multi-node, may insert the `Send`, `Recv` operator.
*Broadcast, AllReduce in a single machine. And Broadcast, AllReduce, [Send, Recv](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md#graph-converter) in multiple machines*
*Broadcast, AllReduce in a single machine. And Broadcast, AllReduce, [Send, Recv](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md#graph-converter) in multiple machines*
...
@@ -49,14 +47,14 @@ After convert, the graph as shows
...
@@ -49,14 +47,14 @@ After convert, the graph as shows
Operators are added to the sub-graphs. Every GPU assigned a role of `rank0`, `rank1` etc.
Operators are added to the sub-graphs. Every GPU assigned a role of `rank0`, `rank1` etc.
-**Broadcast**. Broadcast operator distribute initialized parameter to all the GPUs from the GPU who owns it. e.g. from`rank0` GPU.
-**Broadcast**. Broadcast operator distribute initialized parameter to all the GPUs from the GPU who owns it. e.g. from`rank0` GPU.
-**Allreduce**. Allreduce operator synchronizes parameters/gradients between GPUs. AllReduce implemented in the Ring-Based communicating method, avoid of the bottle neck in a single GPU.
-**AllReduce**. AllReduce operator synchronizes parameters/gradients between GPUs. AllReduce implemented in the Ring-Based communicating method, avoid of the bottle neck in a single GPU.
These two operators need the Multi-GPU context support.
Need to notice that AllReduce operator force GPUs synchronized at that point. The whole training process in asynchronous or synchronous mode depends on the AllReduce point in the graph.
Need to notice that Allreduce operator force GPUs synchronized at that point. Every device only need runs sub-graph in a loop style forever, the whole training process in asynchronous or synchronous mode depends on the Allreduce point in the graph.
As it shown in the picture, when each GPU compute the gradient of `W`, followed with a `AllReduce` operator, accumulate the `dW` to full batch of data, then run the optimize process individually and apply the gradient to its `W`.
As it shown in the picture, when each GPU compute the gradient of `W`, followed with a `AllReduce` operator, accumulate the `dW` to full batch of data, then run the optimize process individually and apply the gradient to its `W`.
In fact, in the way of every GPU optimized full batch of data, wasted (n-1) GPU compute resources. We will enhance it in the next stage.
-**AllReduce2**
If we use the NCCL2 AllReduce primitive, every GPU optimized full batch of data, wasted (n-1) GPU compute resources. In addition, AllReduce will only utilize the communicate resource during synchronization, then update the gradient will be a seperated phase. In fact, we can amortize the update gradient time cost into the communicating phase.
### Benefits
- Every parameter has its root card. That card will call **Reduce** operator and collect the gradients from GPUs.
- The whole model's parameter will be hashed to different root card, ensure the load balance between GPUs.
Then we have another version AllReduce operator. Other part keep the same with before.