From 9b1675048561e2708af40948b192f24a199bf70a Mon Sep 17 00:00:00 2001
From: dzhwinter
Date: Wed, 13 Dec 2017 22:44:33 -0800
Subject: [PATCH] "remove AllReduce2 comments"

---
 doc/design/paddle_nccl.md | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/doc/design/paddle_nccl.md b/doc/design/paddle_nccl.md
index dba451b9b..5e28144b9 100644
--- a/doc/design/paddle_nccl.md
+++ b/doc/design/paddle_nccl.md
@@ -53,8 +53,13 @@ Need to notice that AllReduce operator force GPUs synchronized at that point. Th
 
 As it shown in the picture, when each GPU compute the gradient of `W`, followed with a `AllReduce` operator, accumulate the `dW` to full batch of data, then run the optimize process individually and apply the gradient to its `W`.
 
-- **AllReduce2**
-If we use the NCCL2 AllReduce primitive, every GPU optimized full batch of data, wasted (n-1) GPU compute resources. In addition, AllReduce will only utilize the communicate resource during synchronization, then update the gradient will be a seperated phase. In fact, we can amortize the update gradient time cost into the communicating phase.
-- Every parameter has its root card. That card will call **Reduce** operator and collect the gradients from GPUs.
-- The whole model's parameter will be hashed to different root card, ensure the load balance between GPUs.
-Then we have another version AllReduce operator. Other part keep the same with before.
+- **AllReduce**
+  Note that our AllReduce operator is a ring-based AllReduce implementation. If we used the NCCL2 AllReduce primitive directly, every GPU would optimize on the full batch of data, wasting (n-1) GPUs' worth of compute. In addition, the built-in NCCL2 AllReduce only uses the communication resources during synchronization, leaving the gradient update as a separate subsequent phase. In fact, we can amortize the gradient update cost into the communication phase. The process is:
+1. Every parameter has a root card. That card is responsible for aggregating the gradients of that parameter from all GPUs.
+2. The whole model's parameters are hashed to different root cards to ensure load balance between GPUs.
+3. Logically neighboring cards send the parameter's gradient to the next card in the ring. After one round, the parameter's root card has aggregated the full gradient.
+4. The root card then optimizes the parameter.
+5. The root card sends the optimized result to its neighbor, and each neighbor forwards it to the next card in the ring.
+6. This finishes the synchronization round.
+
+The total time cost is 2 * (n-1) * per-parameter-send-time, so we reach the goal of amortizing the update time into the communication phase.
--
GitLab
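The scheme the patch describes can be illustrated with a small simulation. Below is a minimal, hypothetical Python sketch, not part of the patch and not a PaddlePaddle or NCCL API: it assumes n GPUs on a logical ring with a uniform per-parameter send cost, and the names `root_card` and `sync_one_parameter` are illustrative only. It models the message flow for one parameter and checks the stated 2 * (n-1) send count.

```python
# Hypothetical simulation of the ring-based scheme from the patch: each
# parameter is hashed to a root card, gradients travel around the ring to the
# root, the root applies the update, and the result travels back around the
# ring. Only send steps are counted; no real GPU communication happens here.

def root_card(param_id, num_gpus):
    # Hash parameters to different root cards to balance the update work.
    return hash(param_id) % num_gpus


def sync_one_parameter(param_id, weight, per_gpu_grads, lr=0.01):
    n = len(per_gpu_grads)
    root = root_card(param_id, n)
    sends = 0

    # Reduce phase: the gradient is accumulated as it is forwarded once
    # around the ring toward the root card (n-1 sends).
    total_grad = per_gpu_grads[root]
    card = (root + 1) % n
    for _ in range(n - 1):
        total_grad += per_gpu_grads[card]
        card = (card + 1) % n
        sends += 1

    # Update phase: only the root card optimizes this parameter
    # (plain SGD here, purely for illustration).
    new_weight = weight - lr * total_grad

    # Broadcast phase: the updated value is forwarded around the ring (n-1 sends).
    sends += n - 1

    assert sends == 2 * (n - 1)  # matches the 2 * (n-1) * per-parameter-send-time cost
    return new_weight, sends


if __name__ == "__main__":
    # 4 GPUs, one scalar parameter "w0" with a gradient of 1.0 on each card.
    new_w, sends = sync_one_parameter("w0", weight=0.5, per_gpu_grads=[1.0] * 4)
    print(new_w, sends)  # ~0.46 after one SGD step, 6 sends = 2 * (4 - 1)
```

Hashing parameters to root cards (step 2 in the patch) is what spreads the optimization work across GPUs instead of repeating the same full-batch update on every card, while the two ring passes keep the communication pattern identical for every parameter.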