diff --git a/paddle/framework/backward.md b/paddle/framework/backward.md
index 0a6d762bc8be5201ac196b4bc6107c06d07a31d7..ac60be572419b62f4beb644ff192d413c35e19bb 100644
--- a/paddle/framework/backward.md
+++ b/paddle/framework/backward.md
@@ -2,7 +2,7 @@

 ## Motivation

-In Neural Network, many model is solved by the the backpropagation algorithm(known as BP) at present. Technically it caculates the gradient of the loss function, then distributed back through the networks. Follows the chain rule, so we need a module chains the gradient operators/expressions together with to construct the backward pass. Every forward network needs a backward network to construct the full computation graph, the operator/expression's backward pass will be generated respect to forward pass.
+In neural networks, most models are currently solved by the backpropagation algorithm (known as **BP**). Technically, BP calculates the gradient of the loss function and propagates it back through the network following the chain rule. Hence we need a module that chains the gradient operators/expressions together to construct the backward pass. Every forward network needs a backward network to construct the full computation graph; the operator/expression's backward pass is generated with respect to its forward pass.

 ## Implementation

@@ -24,9 +24,9 @@ A backward network is built up with several backward operators. Backward operato
 | **Operator::inputs_**  | Inputs  | Inputs, Outputs, OutputGradients |
 | **Operator::outputs_** | Outputs | InputGradients                   |

- In most cases, there is a one-to-one correspondence between the forward and backward operators. These correspondences are recorded by a global hash map(`OpInfoMap`). To follow the philosophy of minimum core and make operators pluggable, the registry mechanism is introduced.
+ In most cases, there is a one-to-one relation between the forward and backward operators. These relations are recorded in a global hash map (`OpInfoMap`). To follow the philosophy of a minimum core and to make operators pluggable, the registry mechanism is introduced.

-For example, we have got a `mul_op`, and we can register its information and corresponding backward operator by the following macro:
+For example, we have `mul_op`, and we can register its information and its corresponding backward operator with the following macro:

 ```cpp
 REGISTER_OP(mul, MulOp, MulOpMaker, mul_grad, MulOpGrad);
@@ -48,7 +48,7 @@ The function `BuildGradOp` will sequentially execute following processes:

 1. Get the `type_` of given forward operator, and then get the corresponding backward operator's type by looking up the `OpInfoMap`.

-2. Build two maps named `inputs` and `outputs` to temporary storage backward operator's inputs and outputs. Copy forward operator's `inputs_` and `outputs_` to map `inputs`, except these, are not necessary for gradient computing.
+2. Build two maps named `inputs` and `outputs` to temporarily store the backward operator's inputs and outputs. Copy the forward operator's `inputs_` and `outputs_` into map `inputs`, except those that are not necessary for gradient computing.

 3. Add forward inputs' gradient variables into map `output`, adding forward outputs' gradient variables into map `input`.

@@ -56,11 +56,11 @@ The function `BuildGradOp` will sequentially execute following processes:

 ### Backward Network Building

-A backward network is a series of backward operators. The main idea of building a backward network is creating backward operators in the inverted sequence and append them together one by one. There is some corner case need to process specially.
+A backward network is a series of backward operators. The main idea of building a backward network is creating backward operators in the inverted sequence and appending them together one by one. There are some corner cases that need special processing.

 1. Op

-   When the input forward network is an Op, return its gradient Operator Immediately. If all of its outputs are in no gradient set, then return a special `NOP`.
+   When the input forward network is an Op, return its gradient Operator immediately. If all of its outputs are in the no-gradient set, then return a special `NOP`.

 2. NetOp

@@ -68,33 +68,33 @@ A backward network is a series of backward operators. The main idea of building

 3. RnnOp

-   RnnOp is a nested stepnet operator. Backward module need to recusively call `Backward` for every stepnet.
+   RnnOp is a nested stepnet operator. The backward module needs to recursively call `Backward` for every stepnet.

 4. Sharing Variables

-   **sharing variables**. As illustrated in the pictures, two operator's share the same variable name of W@GRAD, which will overwrite their sharing input variable.
+   As illustrated in Figure 1 and Figure 2, two operators share the same variable name **W@GRAD**, which will overwrite their shared input variable.


-​ pic 1. Sharing variables in operators.
+​ Figure 1. Sharing variables in operators.

-​ Sharing variable between operators or same input variable used in multiple operators leads to a duplicate gradient variable. As demo show above, we need to rename gradient name recursively and add a generic add operator to replace the overwrite links.
+​ Sharing variables between operators, or using the same input variable in multiple operators, can lead to duplicate gradient variables. As illustrated in Figure 2, we need to rename the gradient names recursively and add a generic add operator to prevent overwriting.


-​ pic 2. Replace sharing variable's gradient with `Add` operator.
+​ Figure 2. Replace the shared variable's gradient with an `Add` operator.

-​ Because our framework finds variables accord to their names, we need to rename the output links. We add a suffix of number to represent its position in clockwise.
+​ Because the framework finds variables according to their names, we need to rename the output links. We add an integer suffix to represent each output's position, counted clockwise.

-5. Part of Gradient is Zero.
+5. Part of the Gradient is Zero.

-   In the whole graph, there is some case of that one operator's gradient is not needed, but its input's gradient is a dependency link of other operator, we need to fill a same shape gradient matrix in the position. In our implement, we insert a special `fillZeroLike` operator.
+   In the whole graph, there are cases where one operator's gradient is not needed, but its input's gradient is a dependency of another operator. In such a position we need to fill in a gradient matrix of the same shape; in our implementation, we insert a special `fillZeroLike` operator.

 Follow these rules above, then collect the sub graph `OutputGradients`/`InputGradients` as the NetOp's and return it.
diff --git a/paddle/framework/variable.md b/paddle/framework/variable.md
index f44d5ea46e7ce98dd443d684ad42308496bc4179..442ef6b718b227d79ca73031efcbb55817558252 100644
--- a/paddle/framework/variable.md
+++ b/paddle/framework/variable.md
@@ -7,7 +7,7 @@ Variable is also known as *blob* in MxNet and Caffe2. It is the input and outpu

 For the flexibility of a DL system, a variable should be able to contain any typed value -- a tensor in most cases, but could also be some integer IDs or a scope of other variables in the case of RNN.

-To use the minimum amount of memory, we'd like that a variable to allocate memory when it has to, or, lazy memory allocation. Let's take the following example:
+To use the minimum amount of memory, we would like a variable to allocate memory only when it has to, i.e., lazy memory allocation. Let's take the following example:

 ```cpp
 Variable vr, v1, v2;
@@ -38,7 +38,7 @@ This syntax for lazy memory allocation when we call `Randomize` and `Mult`, thos

 To make memory allocation lazy, we cannot assume that we know the type held by a variable at definition time. In other words, `class Variable` cannot be a template `template <T> class Variable`.

-Because we don't know the type `T`, we cannot save a `T*` as `Variable's` data member. Instead, we save an interface object `Placeholder`, who can return the pointer to the saved object via `Placeholder::Ptr()` as `void*`.
+Because we don't know the type `T`, we cannot save a `T*` as `Variable`'s data member. Instead, we save an interface object `Placeholder`, which can return the pointer to the saved object via `Placeholder::Ptr()` as `void*`.

 But anyway, Variable needs to know `T` so could it `delete(ptr)` and so could `Variable::Get` checks the expected type and the saved object's type.

@@ -49,4 +49,4 @@ Because `PlaceholderImpl` knows `T`, it can save and return `typeid(T)` for the

 ## Conclusion

-The technique type hiding utilizes C++ class templates, interface and derivation, and C++ RTTI (typeid). This combination saves us from definition something like `caffe2::TypeMata`, which takes hundreds of lines of C++ code.
+The technique of type hiding utilizes C++ class templates, interface and derivation, and C++ RTTI (`typeid`). This combination saves us from defining something like `caffe2::TypeMeta`, which takes hundreds of lines of C++ code.
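To make the `BuildGradOp` steps described in backward.md above concrete, here is a minimal sketch of how the backward operator's `inputs_` and `outputs_` could be assembled. Only `OpInfoMap`, `inputs_`/`outputs_`, `mul`/`mul_grad`, and the `@GRAD` suffix come from the document; `OpDesc`, the map types, and the `GradVarName` helper are simplifications assumed for illustration, and the "not necessary for gradient computing" filtering of step 2 is omitted.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified stand-ins for the framework types discussed in backward.md.
using VarList = std::vector<std::string>;
using VarMap = std::map<std::string, VarList>;

// Appends a gradient suffix to a variable name (assuming the "@GRAD"
// convention suggested by W@GRAD in the document).
std::string GradVarName(const std::string& name) { return name + "@GRAD"; }

struct OpDesc {
  std::string type_;
  VarMap inputs_;
  VarMap outputs_;
};

// Sketch of BuildGradOp: the backward op takes the forward op's Inputs,
// Outputs, and OutputGradients as inputs, and produces InputGradients.
OpDesc BuildGradOp(const OpDesc& fwd,
                   const std::map<std::string, std::string>& op_info_map) {
  OpDesc grad;
  // Step 1: look up the backward operator's type, e.g. "mul" -> "mul_grad".
  grad.type_ = op_info_map.at(fwd.type_);

  // Step 2: copy the forward operator's inputs and outputs into the backward
  // operator's inputs (filtering of unneeded variables is omitted here).
  VarMap inputs = fwd.inputs_;
  for (const auto& kv : fwd.outputs_) inputs[kv.first] = kv.second;

  // Step 3: forward outputs' gradients become backward inputs, and forward
  // inputs' gradients become backward outputs.
  VarMap outputs;
  for (const auto& kv : fwd.outputs_) {
    VarList grads;
    for (const auto& name : kv.second) grads.push_back(GradVarName(name));
    inputs[GradVarName(kv.first)] = grads;
  }
  for (const auto& kv : fwd.inputs_) {
    VarList grads;
    for (const auto& name : kv.second) grads.push_back(GradVarName(name));
    outputs[GradVarName(kv.first)] = grads;
  }

  grad.inputs_ = inputs;
  grad.outputs_ = outputs;
  return grad;
}

int main() {
  OpDesc mul{"mul", {{"X", {"x"}}, {"Y", {"w"}}}, {{"Out", {"out"}}}};
  std::map<std::string, std::string> op_info_map{{"mul", "mul_grad"}};
  OpDesc mul_grad = BuildGradOp(mul, op_info_map);
  // mul_grad now reads x, w, out, and out@GRAD, and writes x@GRAD and w@GRAD.
  return 0;
}
```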
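The corner cases for Op, NetOp, and RnnOp describe a recursion, which the rough sketch below tries to capture. `NetOp`, `NOP`, and `Backward` are names taken from the document; the class layout and the two helper stubs are assumptions made only so the example compiles, not the real paddle/framework code.

```cpp
#include <memory>
#include <vector>

// A plain Op returns its gradient operator (or a NOP when all of its outputs
// are in the no-gradient set), while a NetOp walks its children in the
// inverted sequence and appends each child's backward pass.
struct OperatorBase {
  virtual ~OperatorBase() {}
};

struct NOP : public OperatorBase {};  // special op that performs no computation

struct NetOp : public OperatorBase {
  std::vector<std::shared_ptr<OperatorBase>> ops_;
};

// Stubs standing in for framework facilities described elsewhere in the doc.
std::shared_ptr<OperatorBase> BuildGradOp(const OperatorBase&) {
  return std::make_shared<OperatorBase>();  // would build the real grad op
}
bool AllOutputsInNoGradSet(const OperatorBase&) { return false; }  // stub

std::shared_ptr<OperatorBase> Backward(const OperatorBase& fwd) {
  if (const auto* net = dynamic_cast<const NetOp*>(&fwd)) {
    // NetOp (and, analogously, a nested RnnOp stepnet): recurse over the
    // children in reverse order and chain their backward operators.
    auto grad_net = std::make_shared<NetOp>();
    for (auto it = net->ops_.rbegin(); it != net->ops_.rend(); ++it) {
      grad_net->ops_.push_back(Backward(**it));
    }
    return grad_net;
  }
  // Plain Op: return its gradient operator immediately, or a NOP if none of
  // its outputs need gradients.
  if (AllOutputsInNoGradSet(fwd)) return std::make_shared<NOP>();
  return BuildGradOp(fwd);
}

int main() {
  NetOp forward;
  forward.ops_.push_back(std::make_shared<OperatorBase>());
  auto backward = Backward(forward);
  return 0;
}
```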
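For corner case 4 (Sharing Variables), the rewrite shown in Figure 2 amounts to renaming the duplicated gradient outputs and summing them back with a generic add operator. A minimal sketch of that transformation follows; the `GradOp` struct, the operator names in `main`, and the `@RENAME@i` suffix scheme are illustrative assumptions, while the `W@GRAD` / `Add` idea comes from the document.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Minimal stand-in for a backward operator that reads/writes variables by name.
struct GradOp {
  std::string type;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// If several backward operators all write the same gradient name (e.g. W@GRAD),
// rename each occurrence with an integer suffix and append a generic add
// operator that sums the renamed gradients back into the original name, so
// downstream operators still find it.
void DeDuplicateGradient(std::vector<GradOp>* net, const std::string& dup) {
  int count = 0;
  for (const auto& op : *net)
    for (const auto& out : op.outputs)
      if (out == dup) ++count;
  if (count <= 1) return;  // no duplication, nothing to rewrite

  std::vector<std::string> renamed;
  int idx = 0;
  for (auto& op : *net)
    for (auto& out : op.outputs)
      if (out == dup) {
        out = dup + "@RENAME@" + std::to_string(idx++);
        renamed.push_back(out);
      }

  // The add operator collects the renamed gradients and writes W@GRAD again.
  net->push_back({"add", renamed, {dup}});
}

int main() {
  // Two backward operators both writing W@GRAD, as in Figure 1.
  std::vector<GradOp> backward_net = {
      {"op1_grad", {"h@GRAD"}, {"W@GRAD"}},
      {"op2_grad", {"y@GRAD"}, {"W@GRAD"}},
  };
  DeDuplicateGradient(&backward_net, "W@GRAD");
  for (const auto& op : backward_net) std::cout << op.type << "\n";
  // Prints: op1_grad, op2_grad, add (one per line)
}
```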
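Finally, the `Placeholder`/`PlaceholderImpl` design described in variable.md can be sketched as below. This is a simplified illustration of the technique (class template + interface and derivation + `typeid`), not a copy of the real `paddle/framework/variable.h`; it assumes the stored type is default-constructible.

```cpp
#include <cassert>
#include <memory>
#include <typeinfo>

class Variable {
 public:
  template <typename T>
  const T& Get() const {
    // RTTI lets us check that the caller asks for the type actually stored.
    assert(holder_ != nullptr && holder_->Type() == typeid(T));
    return *static_cast<const T*>(holder_->Ptr());
  }

  template <typename T>
  T* GetMutable() {
    // Lazy allocation: the object is created on the first GetMutable<T>() call.
    if (holder_ == nullptr) {
      holder_.reset(new PlaceholderImpl<T>(new T()));
    }
    assert(holder_->Type() == typeid(T));
    return static_cast<T*>(holder_->Ptr());
  }

 private:
  // The interface knows nothing about T; it only exposes a void* and a type.
  struct Placeholder {
    virtual ~Placeholder() {}
    virtual const std::type_info& Type() const = 0;
    virtual void* Ptr() const = 0;
  };

  // The derived class template knows T, so it can delete the object and
  // report typeid(T) for runtime type checks.
  template <typename T>
  struct PlaceholderImpl : public Placeholder {
    explicit PlaceholderImpl(T* ptr) : ptr_(ptr) {}
    const std::type_info& Type() const override { return typeid(T); }
    void* Ptr() const override { return static_cast<void*>(ptr_.get()); }
    std::unique_ptr<T> ptr_;
  };

  std::unique_ptr<Placeholder> holder_;  // empty until GetMutable is called
};

int main() {
  Variable v;
  *v.GetMutable<int>() = 42;  // memory for the int is allocated here, lazily
  return v.Get<int>() == 42 ? 0 : 1;
}
```

With this layout, the object is allocated only on the first `GetMutable<T>()` call, matching the lazy-allocation example in variable.md, and `Get<T>()` uses RTTI to check that the requested type matches the saved one.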