backward.md 5.1 KB
Newer Older
F
fengjiayi 已提交
1
# Operator/expression 's Backward
D
dongzhihong 已提交
2

F
fengjiayi 已提交
3
## Motivation
D
dongzhihong 已提交
4

D
dongzhihong 已提交
5
In Neural Network, many model is solved by the the backpropagation algorithm(known as BP) at present. Technically it caculates the gradient of the loss function, then distributed back through the networks. Follows the chain rule, so we need a module chains the gradient operators/expressions together with to construct the backward pass. Every forward network needs a backward network to construct the full computation graph, the operator/expression's backward pass will be generated respect to forward pass. 
D
dongzhihong 已提交
6

D
dongzhihong 已提交
7 8 9 10 11 12 13 14 15
## Implementation

In this design doc, we exported only one API for generating the backward pass.

```c++
std::unique_ptr<OperatorBase> Backward(const OperatorBase& forwardOp,
    const std::unordered_set<std::string>& no_grad_vars);
```

D
dongzhihong 已提交
16
The implementation behind it can be divided into two parts, **Backward Operator Creating** and **Backward Operator Building**.
D
dongzhihong 已提交
17

D
dongzhihong 已提交
18
### Backward Operator Registry
F
fengjiayi 已提交
19

D
dongzhihong 已提交
20
A backward network is built up with several backward operators. Backward operators take forward operators' inputs, outputs, and output gradients and then calculate its input gradients.
F
fengjiayi 已提交
21

F
fengjiayi 已提交
22 23 24 25
|                        | forward operator | backward operator 
| ---------------------- | ---------------- |------------------------- |		
| **Operator::inputs_**  | Inputs       | Inputs, Outputs, OutputGradients |	
| **Operator::outputs_** | Outputs          | InputGradients            |
F
fengjiayi 已提交
26

D
dongzhihong 已提交
27
 In most cases, there is a one-to-one correspondence between the forward and backward operators. These correspondences are recorded by a global hash map(`OpInfoMap`). To follow the philosophy of minimum core and make operators pluggable, the registry mechanism is introduced.
F
fengjiayi 已提交
28

D
dongzhihong 已提交
29
For example, we have got a `mul_op`, and we can register its information and corresponding backward operator by the following macro:
F
fengjiayi 已提交
30 31

```cpp
32
REGISTER_OP(mul, MulOp, MulOpMaker, mul_grad, MulOpGrad);
F
fengjiayi 已提交
33 34
```

F
fengjiayi 已提交
35
`mul` is the operator's type. `MulOp` and `MulOpMaker` are the operator class and the operator maker class respectively.
F
fengjiayi 已提交
36

F
fengjiayi 已提交
37
`mul_grad` is the type of backward operator, and `MulOpGrad` is its class name.
D
dongzhihong 已提交
38

D
dongzhihong 已提交
39
### Backward Opeartor Creating
D
dongzhihong 已提交
40

D
dongzhihong 已提交
41
Given a certain forward operator, we can get its corresponding backward operator by calling:
D
dongzhihong 已提交
42

F
fengjiayi 已提交
43 44
```cpp
OperatorBase* bwd_op = BuildGradOp(const OperatorBase* fwd_op);
D
dongzhihong 已提交
45
```
F
fengjiayi 已提交
46 47 48

The function `BuildGradOp` will sequentially execute following processes:

F
fengjiayi 已提交
49
1. Get the `type_` of given forward operator, and then get the corresponding backward operator's type by looking up the `OpInfoMap`.
F
fengjiayi 已提交
50

D
dongzhihong 已提交
51
2. Build two maps named `inputs` and `outputs` to temporary storage backward operator's inputs and outputs. Copy forward operator's `inputs_` and `outputs_` to map `inputs`, except these, are not necessary for gradient computing.
F
fengjiayi 已提交
52

F
fengjiayi 已提交
53
3. Add forward inputs' gradient variables into map `output`, adding forward outputs' gradient variables into map `input`.
F
fengjiayi 已提交
54

F
fengjiayi 已提交
55
4. Building backward operator with `inputs`, `outputs` and forward operator's attributes.
F
fengjiayi 已提交
56

D
dongzhihong 已提交
57
### Backward Network Building
D
dongzhihong 已提交
58

D
dongzhihong 已提交
59
A backward network is a series of backward operators. The main idea of building a backward network is creating backward operators in the inverted sequence and append them together one by one. There is some corner case need to process specially.
D
dongzhihong 已提交
60

D
dongzhihong 已提交
61 62
1. Op 

D
dongzhihong 已提交
63
   When the input forward network is an Op, return its gradient Operator Immediately. If all of its outputs are in no gradient set, then return a special `NOP`.
D
dongzhihong 已提交
64 65 66

2. NetOp 

D
dongzhihong 已提交
67 68 69 70 71 72
   In our design, the network itself is also a kind of operator(**NetOp**). So the operators contained by a big network may be some small network. When the input forward network is a NetOp, it needs to call the sub NetOp/Operators backward function recursively. During the process, we need to collect the `OutputGradients` name according to the forward NetOp.

3. RnnOp

   RnnOp is a nested stepnet operator.  Backward module need to recusively call `Backward` for every stepnet.

D
dongzhihong 已提交
73
4. Sharing Variables
D
dongzhihong 已提交
74

D
dongzhihong 已提交
75
   **sharing variables**. As illustrated in the pictures, two operator's share the same variable name of W@GRAD, which will overwrite their sharing input variable. 
D
dongzhihong 已提交
76

D
dongzhihong 已提交
77
<p align="center">
D
dongzhihong 已提交
78
<img src="./images/duplicate_op.png" width="50%" ><br/>
D
dongzhihong 已提交
79

D
dongzhihong 已提交
80
​	pic 1. Sharing variables in operators. 
D
dongzhihong 已提交
81 82

</p>
D
dongzhihong 已提交
83

D
dongzhihong 已提交
84
​	Sharing variable between operators or same input variable used in multiple operators leads to a duplicate gradient variable. As demo show above, we need to rename gradient name recursively and add a generic add operator to replace the overwrite links. 
D
dongzhihong 已提交
85

D
dongzhihong 已提交
86 87
<p align="center">
<img src="images/duplicate_op2.png" width="40%" ><br/>
D
dongzhihong 已提交
88

D
dongzhihong 已提交
89
​	pic 2. Replace sharing variable's gradient with `Add` operator.
D
dongzhihong 已提交
90

D
dongzhihong 已提交
91
</p>
D
dongzhihong 已提交
92

D
dongzhihong 已提交
93
​	Because our framework finds variables accord to their names, we need to rename the output links. We add a suffix of number to represent its position in clockwise. 
D
dongzhihong 已提交
94

D
dongzhihong 已提交
95
5. Part of Gradient is Zero.
D
dongzhihong 已提交
96

D
dongzhihong 已提交
97
   In the whole graph, there is some case of that one operator's gradient is not needed, but its input's gradient is a dependency link of other operator,  we need to fill a same shape gradient matrix in the position. In our implement, we insert a special `fillZeroLike` operator.
D
dongzhihong 已提交
98 99


D
dongzhihong 已提交
100
Follow these rules above, then collect the sub graph `OutputGradients`/`InputGradients` as the NetOp's and return it.