# Design Doc: Refactorization Overview

The goals of the refactoring include:

1. Make it easy for external contributors to write new elementary computation operations.
1. Make the codebase clean and readable.
1. Introduce a new design of computation representation -- a computation graph of operators and variables.
1. The graph representation helps implement auto-scalable and auto fault-recoverable distributed computing.

## Computation Graphs

1. PaddlePaddle represents the computation, training and inference of DL models by computation graphs.
1. Please dig into [computation graphs](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md) for a solid example.
1. Users write Python programs to describe the graphs and run them (locally or remotely).
1. A graph is composed of *variables* and *operators*.
1. The description of graphs must be serializable/deserializable, so that it
   1. can be sent to the cloud for distributed execution, and
   1. can be sent to clients for mobile or enterprise deployment.
1. The Python program does two things:
   1. *compilation*: run a Python program to generate a protobuf message representation of the graph and send it to
      1. the C++ library `libpaddle.so` for local execution,
      1. the master process of a distributed training job for training, or
      1. the server process of a Kubernetes serving job for distributed serving;
   1. *execution*: construct instances of class `Variable` and `OperatorBase` according to the protobuf message, and run them.

## Description and Realization

At compile time, the Python program generates a protobuf message representation of the graph, i.e., a description of the graph.

At runtime, the C++ program realizes the graph and runs it.

The word *graph* is interchangeable with *block* in this document. A graph represents computation steps and local variables, similar to a C++/Java program block or a pair of curly braces (`{` and `}`).
## Compilation and Execution
1. Run an application Python program to describe the graph. In particular,
   1. create `VarDesc` messages to represent local/intermediate variables,
   1. create operators and set their attributes,
   1. validate attribute values,
   1. infer the types and shapes of variables,
   1. plan memory reuse for variables,
   1. generate the backward and optimization parts of the graph,
   1. possibly split the graph for distributed training.
1. The invocation of `train` or `infer` in the application Python program does the following (a minimal sketch in code follows this list):
   1. create a new Scope instance in the [scope hierarchy](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md) for each run of a block,
      1. realize local variables defined in the BlockDesc message in the new scope,
      1. a scope is similar to the stack frame in programming languages,
   1. create an instance of class `Block`, in which it
      1. realizes operators in the BlockDesc message,
   1. run the Block by calling
      1. `Block::Eval(vector<Variable>* targets)` for forward and backward computations, or
      1. `Block::Eval(vector<Operator>* targets)` for optimization.
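The following is a minimal, hypothetical C++ sketch of the execution phase above: create a scope, realize the variables described by a `BlockDesc`, and evaluate the requested targets. The class shapes here are invented for illustration and are not the real Paddle interfaces.

```cpp
#include <map>
#include <string>
#include <vector>

struct Variable {};  // placeholder for a realized variable

// A scope maps variable names to realized variables, like a stack frame.
struct Scope {
  Variable* Var(const std::string& name) { return &vars_[name]; }
  std::map<std::string, Variable> vars_;
};

struct VarDesc { std::string name; };
struct OpDesc  { std::string type; };
struct BlockDesc {
  std::vector<VarDesc> vars;
  std::vector<OpDesc>  ops;
};

struct Block {
  Block(const BlockDesc& desc, Scope* scope) : desc_(desc), scope_(scope) {
    for (const VarDesc& v : desc.vars) scope_->Var(v.name);  // realize local variables
    // ... realize operators from desc.ops here ...
  }
  // Run only the operators needed to compute *targets.
  void Eval(std::vector<Variable*>* targets) { (void)targets; }

  const BlockDesc& desc_;
  Scope* scope_;
};
```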
## Intermediate Representation (IR)
```text
Compile Time -> IR -> Runtime
```
### Benefit
- Optimization
```text
Compile Time -> IR -> Optimized IR -> Runtime
```
- Send the automatically partitioned IR to different nodes.
  - Automatic Data Parallelism
    ```text
    Compile Time
    |-> Single GPU IR
        |-> [trainer-IR-0, trainer-IR-1, pserver-IR]
            |-> Node-0 (runs trainer-IR-0)
            |-> Node-1 (runs trainer-IR-1)
            |-> Node-2 (runs pserver-IR)
    ```
  - Automatic Model Parallelism (planned for the future)
---
# Operator/OpWithKernel/OpKernel

* `Operator` is the fundamental building block of the user interface.
  * An operator stores input/output variable names and attributes.
  * The `InferShape` interface is used to infer the shapes of the output variables from the shapes of the inputs.
  * Use `Run` to compute the `output` variables from the `input` variables.

---
# OpWithKernel/Kernel

* `OpWithKernel` contains a kernel map (a sketch follows below).
  * `OpWithKernel::Run` gets the device's kernel and invokes `OpKernel::Compute`.
  * `OpKernelKey` is the map key. It currently contains only the device place, but may include the data type later.
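To make the kernel-map idea concrete, here is a minimal, hypothetical sketch (not the real Paddle types): an `OpKernelKey` keyed only by the device place, a per-operator map from key to kernel, and a `Run` that dispatches to `Compute`.

```cpp
#include <map>
#include <memory>
#include <stdexcept>

enum class Place { kCPU, kGPU };  // the device place; a data type field could be added later

struct OpKernelKey {
  Place place;
  bool operator<(const OpKernelKey& o) const { return place < o.place; }
};

struct ExecutionContext { /* inputs, outputs, device handles, ... */ };

struct OpKernel {
  virtual ~OpKernel() = default;
  virtual void Compute(const ExecutionContext& ctx) const = 0;
};

struct OpWithKernel {
  // One operator, many kernels: the key selects the implementation.
  std::map<OpKernelKey, std::unique_ptr<OpKernel>> kernels;

  void Run(const ExecutionContext& ctx, Place place) const {
    auto it = kernels.find(OpKernelKey{place});
    if (it == kernels.end()) throw std::runtime_error("no kernel registered for this place");
    it->second->Compute(ctx);  // dispatch to the device-specific kernel
  }
};
```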
---
# Why separate Kernel and Operator
* Separate GPU and CPU code.
  * Make it possible for Paddle to run without a GPU.
* Make one operator (the user interface) able to contain many implementations.
  * The same `mul` op can have different FP16 and FP32 kernels, and different MKL and Eigen kernels.
---
# Libraries for Kernel development
* `Eigen::Tensor` provides basic math and element-wise functions.
  * Note that `Eigen::Tensor` has a broadcast implementation.
  * Limit the number of `tensor.device(dev) = ` assignments in your code.
* `thrust::transform` and `std::transform` (see the sketch after this list).
  * `thrust` has the same API as the C++ standard library. Using `transform`, one can quickly implement a customized element-wise kernel.
  * `thrust` also has more complex APIs, like `scan`, `reduce` and `reduce_by_key`.
* Hand-written `GPUKernel` and `CPU` code.
  * Do not put kernels in `.h` files. CPU kernels should be in `.cc` files and GPU kernels in `.cu` files, because `GCC` cannot compile GPU code.
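As an illustration of the `transform`-based approach, here is a minimal element-wise kernel sketch. The function names are hypothetical; the point is that a CPU kernel written with `std::transform` maps almost one-to-one onto a GPU kernel written with `thrust::transform`.

```cpp
// Hypothetical element-wise "scale" kernel: y[i] = scale * x[i].
#include <algorithm>
#include <vector>

void ScaleCPU(const std::vector<float>& x, float scale, std::vector<float>* y) {
  y->resize(x.size());
  std::transform(x.begin(), x.end(), y->begin(),
                 [scale](float v) { return scale * v; });
}

// In a .cu file (compiled by nvcc rather than GCC), the GPU version could look like:
//   thrust::transform(thrust::device, x_ptr, x_ptr + n, y_ptr,
//                     [scale] __device__ (float v) { return scale * v; });
```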
---
# Operator Register
## Why is registration necessary?
We need a method to build mappings between Op type names and Op classes.

## How is registration done?
Maintain a map whose key is the type name and whose value is the corresponding Op constructor.
---
# The Registry Map
### `OpInfoMap`
`op_type(string)` -> `OpInfo`
`OpInfo`:
- **`creator`**: The Op constructor.
- **`grad_op_type`**: The type of the gradient Op.
- **`proto`**: The Op's Protobuf, including inputs, outputs and required attributes.
- **`checker`**: Used to check attributes.
---
# Related Concepts
### Op_Maker
Its constructor takes `proto` and `checker`, and completes them during the Op_Maker's construction (a hypothetical sketch follows). ([ScaleOpMaker](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/scale_op.cc#L37))
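A hedged sketch of what such a maker might look like: `OpProto` and `OpAttrChecker` here are simplified stand-ins invented for illustration, not the real Paddle classes; see the linked `ScaleOpMaker` for the actual code.

```cpp
#include <string>
#include <vector>

struct OpProto {           // simplified stand-in for the Op's protobuf description
  std::vector<std::string> inputs, outputs, attrs;
};
struct OpAttrChecker {     // simplified stand-in for the attribute checker
  std::vector<std::string> checked_attrs;
};

class ScaleOpMakerSketch {
 public:
  // The constructor fills in proto (inputs/outputs/attributes) and checker.
  ScaleOpMakerSketch(OpProto* proto, OpAttrChecker* checker) {
    proto->inputs.push_back("X");                 // declare an input
    proto->outputs.push_back("Out");              // declare an output
    proto->attrs.push_back("scale");              // declare an attribute
    checker->checked_attrs.push_back("scale");    // e.g. validate or default it
  }
};
```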
<liclass="toctree-l2"><aclass="reference internal"href="../getstarted/build_and_install/index_en.html">Install and Build</a><ul>
<liclass="toctree-l3"><aclass="reference internal"href="../getstarted/build_and_install/docker_install_en.html">PaddlePaddle in Docker Containers</a></li>
<liclass="toctree-l3"><aclass="reference internal"href="../getstarted/build_and_install/build_from_source_en.html">Installing from Sources</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/usage/k8s/k8s_en.html">Paddle On Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/usage/k8s/k8s_aws_en.html">Distributed PaddlePaddle Training on AWS with Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/dev/build_en.html">Build PaddlePaddle from Source Code and Run Unit Test</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/dev/new_layer_en.html">Write New Layers</a></li>
<spanid="design-doc-refactorization-overview"></span><h1>Design Doc: Refactorization Overview<aclass="headerlink"href="#design-doc-refactorization-overview"title="Permalink to this headline">¶</a></h1>
<p>The goal of refactorizaiton include:</p>
<olclass="simple">
<li>Make it easy for external contributors to write new elementory computaiton operations.</li>
<li>Make the codebase clean and readable.</li>
<li>Introduce a new design of computation representation – a computation graph of operators and variables.</li>
<li>The graph representation helps implementing auto-scalable and auto fault recoverable distributed computing.</li>
</ol>
<divclass="section"id="computation-graphs">
<spanid="computation-graphs"></span><h2>Computation Graphs<aclass="headerlink"href="#computation-graphs"title="Permalink to this headline">¶</a></h2>
<olclass="simple">
<li>PaddlePaddle represent the computation, training and inference of DL models, by computation graphs.</li>
<li>Please dig into <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md">computation graphs</a> for a solid example.</li>
<li>Users write Python programs to describe the graphs and run it (locally or remotely).</li>
<li>A graph is composed of <em>variabels</em> and <em>operators</em>.</li>
<li>The description of graphs must be able to be serialized/deserialized, so it<ol>
<li>could to be sent to the cloud for distributed execution, and</li>
<li>be sent to clients for mobile or enterprise deployment.</li>
</ol>
</li>
<li>The Python program do<ol>
<li><em>compilation</em>: runs a Python program to generate a protobuf message representation of the graph and send it to<ol>
<li>the C++ library <codeclass="docutils literal"><spanclass="pre">libpaddle.so</span></code> for local execution,</li>
<li>the master process of a distributed training job for training, or</li>
<li>the server process of a Kubernetes serving job for distributed serving.</li>
</ol>
</li>
<li><em>execution</em>: according to the protobuf message, constructs instances of class <codeclass="docutils literal"><spanclass="pre">Variable</span></code> and <codeclass="docutils literal"><spanclass="pre">OperatorBase</span></code>, and run them.</li>
<spanid="description-and-realization"></span><h2>Description and Realization<aclass="headerlink"href="#description-and-realization"title="Permalink to this headline">¶</a></h2>
<p>At compile time, the Python program generates protobuf message representation of the graph, or the description of the graph.</p>
<p>At runtime, the C++ program realizes the graph and run it.</p>
<p>The word <em>graph</em> is exchangable with <em>block</em> in this document. A graph represent computation steps and local variables as a C++/Java program block, or a pair of { and }.</p>
<spanid="compilation-and-execution"></span><h2>Compilation and Execution<aclass="headerlink"href="#compilation-and-execution"title="Permalink to this headline">¶</a></h2>
<olclass="simple">
<li>Run an applicaton Python program to describe the graph. In particular,<ol>
<li>create VarDesc to represent local/intermediate variables,</li>
<li>create operators and set attributes,</li>
<li>validate attribute values,</li>
<li>inference the type and the shape of variables,</li>
<li>plan for memory-reuse for variables,</li>
<li>generate backward and optimization part of the Graph.</li>
<li>possiblly split the graph for distributed training.</li>
</ol>
</li>
<li>The invocation of <codeclass="docutils literal"><spanclass="pre">train</span></code> or <codeclass="docutils literal"><spanclass="pre">infer</span></code> in the application Python program:<ol>
<li>create a new Scope instance in the <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md">scope hierarchy</a> for each run of a block,<ol>
<li>realize local variables defined in the BlockDesc message in the new scope,</li>
<li>a scope is similar to the stack frame in programming languages,</li>
</ol>
</li>
<li>create an instance of class <codeclass="docutils literal"><spanclass="pre">Block</span></code>, in which,<ol>
<li>realize operators in the BlockDesc message,</li>
</ol>
</li>
<li>run the Block by calling<ol>
<li><codeclass="docutils literal"><spanclass="pre">Block::Eval(vector<Variable>*</span><spanclass="pre">targets)</span></code> for forward and backward computations, or</li>
<li><codeclass="docutils literal"><spanclass="pre">Block::Eval(vector<Operator>*</span><spanclass="pre">targets)</span></code> for optimization.</li>
<spanid="intermediate-representation-ir"></span><h2>Intermediate Representation (IR)<aclass="headerlink"href="#intermediate-representation-ir"title="Permalink to this headline">¶</a></h2>
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time -> IR -> Runtime
</pre></div>
</div>
<divclass="section"id="benefit">
<spanid="benefit"></span><h3>Benefit<aclass="headerlink"href="#benefit"title="Permalink to this headline">¶</a></h3>
<ul>
<li><pclass="first">Optimization</p>
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time -> IR -> Optimized IR -> Runtime
</pre></div>
</div>
</li>
<li><pclass="first">Send automatically partitioned IR to different nodes.</p>
<ul>
<li><pclass="first">Automatic data parallel</p>
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time
|-> Single GPU IR
|-> [trainer-IR-0, trainer-IR-1, pserver-IR]
|-> Node-0 (runs trainer-IR-0)
|-> Node-1 (runs trainer-IR-1)
|-> Node-2 (runs pserver-IR)
</pre></div>
</div>
</li>
<li><pclass="first">Automatic model parallel (planned for future)</p>
<spanid="operator-opwithkernel-opkernel"></span><h1>Operator/OpWithKernel/OpKernel<aclass="headerlink"href="#operator-opwithkernel-opkernel"title="Permalink to this headline">¶</a></h1>
<li><codeclass="docutils literal"><spanclass="pre">Operator</span></code> is the fundamental building block as the user interface.<ul>
<li>Operator stores input/output variable name, and attributes.</li>
<li>The <codeclass="docutils literal"><spanclass="pre">InferShape</span></code> interface is used to infer output variable shapes by its input shapes.</li>
<li>Use <codeclass="docutils literal"><spanclass="pre">Run</span></code> to compute <codeclass="docutils literal"><spanclass="pre">input</span><spanclass="pre">variables</span></code> to <codeclass="docutils literal"><spanclass="pre">output</span><spanclass="pre">variables</span></code>.</li>
</ul>
</li>
</ul>
</div>
<hrclass="docutils"/>
<divclass="section"id="opwithkernel-kernel">
<spanid="opwithkernel-kernel"></span><h1>OpWithKernel/Kernel<aclass="headerlink"href="#opwithkernel-kernel"title="Permalink to this headline">¶</a></h1>
<li><codeclass="docutils literal"><spanclass="pre">OpWithKernel</span></code> contains a Kernel map.<ul>
<li><codeclass="docutils literal"><spanclass="pre">OpWithKernel::Run</span></code> get device’s kernel, and invoke <codeclass="docutils literal"><spanclass="pre">OpKernel::Compute</span></code>.</li>
<li><codeclass="docutils literal"><spanclass="pre">OpKernelKey</span></code> is the map key. Only device place now, but may be data type later.</li>
<spanid="why-separate-kernel-and-operator"></span><h1>Why separate Kernel and Operator<aclass="headerlink"href="#why-separate-kernel-and-operator"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li>Separate GPU and CPU code.<ul>
<li>Make Paddle can run without GPU.</li>
</ul>
</li>
<li>Make one operator (which is user interface) can contain many implementations.<ul>
<li>Same mul op, different FP16, FP32 Kernel. different MKL, eigen kernel.</li>
<spanid="libraries-for-kernel-development"></span><h1>Libraries for Kernel development<aclass="headerlink"href="#libraries-for-kernel-development"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">Eigen::Tensor</span></code> contains basic math and element-wise functions.<ul>
<li>Note that <codeclass="docutils literal"><spanclass="pre">Eigen::Tensor</span></code> has broadcast implementation.</li>
<li>Limit number of <codeclass="docutils literal"><spanclass="pre">tensor.device(dev)</span><spanclass="pre">=</span></code> in your code.</li>
</ul>
</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust::tranform</span></code> and <codeclass="docutils literal"><spanclass="pre">std::transform</span></code>.<ul>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> has the same API as C++ standard library. Using <codeclass="docutils literal"><spanclass="pre">transform</span></code> can quickly implement a customized elementwise kernel.</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> has more complex API, like <codeclass="docutils literal"><spanclass="pre">scan</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce_by_key</span></code>.</li>
</ul>
</li>
<li>Hand-writing <codeclass="docutils literal"><spanclass="pre">GPUKernel</span></code> and <codeclass="docutils literal"><spanclass="pre">CPU</span></code> code<ul>
<li>Do not write <codeclass="docutils literal"><spanclass="pre">.h</span></code>. CPU Kernel should be in <codeclass="docutils literal"><spanclass="pre">.cc</span></code>. CPU kernel should be in <codeclass="docutils literal"><spanclass="pre">.cu</span></code>. (<codeclass="docutils literal"><spanclass="pre">GCC</span></code> cannot compile GPU code.)</li>
</ul>
</li>
</ul>
</div>
<hrclass="docutils"/>
<divclass="section"id="operator-register">
<spanid="operator-register"></span><h1>Operator Register<aclass="headerlink"href="#operator-register"title="Permalink to this headline">¶</a></h1>
<spanid="why-register-is-necessary"></span><h2>Why register is necessary?<aclass="headerlink"href="#why-register-is-necessary"title="Permalink to this headline">¶</a></h2>
<p>We need a method to build mappings between Op type names and Op classes.</p>
</div>
<divclass="section"id="how-to-do-the-register">
<spanid="how-to-do-the-register"></span><h2>How to do the register?<aclass="headerlink"href="#how-to-do-the-register"title="Permalink to this headline">¶</a></h2>
<p>Maintain a map, whose key is the type name and value is corresponding Op constructor.</p>
</div>
</div>
<hrclass="docutils"/>
<divclass="section"id="the-registry-map">
<spanid="the-registry-map"></span><h1>The Registry Map<aclass="headerlink"href="#the-registry-map"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="opinfomap">
<spanid="opinfomap"></span><h2><codeclass="docutils literal"><spanclass="pre">OpInfoMap</span></code><aclass="headerlink"href="#opinfomap"title="Permalink to this headline">¶</a></h2>
<li><strong><codeclass="docutils literal"><spanclass="pre">creator</span></code></strong>: The Op constructor.</li>
<li><strong><codeclass="docutils literal"><spanclass="pre">grad_op_type</span></code></strong>: The type of the gradient Op.</li>
<li><strong><codeclass="docutils literal"><spanclass="pre">proto</span></code></strong>: The Op’s Protobuf, including inputs, outputs and required attributes.</li>
<li><strong><codeclass="docutils literal"><spanclass="pre">checker</span></code></strong>: Used to check attributes.</li>
</ul>
</div>
</div>
<hrclass="docutils"/>
<divclass="section"id="related-concepts">
<spanid="related-concepts"></span><h1>Related Concepts<aclass="headerlink"href="#related-concepts"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="op-maker">
<spanid="op-maker"></span><h2>Op_Maker<aclass="headerlink"href="#op-maker"title="Permalink to this headline">¶</a></h2>
<p>It’s constructor takes <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code>. They are compeleted during Op_Maker’s construction. (<aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/scale_op.cc#L37">ScaleOpMaker</a>)</p>
</div>
<divclass="section"id="register-macros">
<spanid="register-macros"></span><h2>Register Macros<aclass="headerlink"href="#register-macros"title="Permalink to this headline">¶</a></h2>
<spanid="use-macros"></span><h2><codeclass="docutils literal"><spanclass="pre">USE</span></code> Macros<aclass="headerlink"href="#use-macros"title="Permalink to this headline">¶</a></h2>
<p>make sure the registration process is executed and linked.</p>
</div>
</div>
<hrclass="docutils"/>
<divclass="section"id="register-process">
<spanid="register-process"></span><h1>Register Process<aclass="headerlink"href="#register-process"title="Permalink to this headline">¶</a></h1>
<olclass="simple">
<li>Write Op class, as well as its gradient Op class if there is.</li>
<li>Write Op maker class. In the constructor, describe its inputs, outputs, and attributes.</li>
<li>Invoke macro <codeclass="docutils literal"><spanclass="pre">REGISTER_OP</span></code>. The macro will<ol>
<li>call maker class to complete <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code></li>
<li>with the completed <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code>, build a new key-value pair in the <codeclass="docutils literal"><spanclass="pre">OpInfoMap</span></code></li>
</ol>
</li>
<li>Invoke <codeclass="docutils literal"><spanclass="pre">USE</span></code> macro in where the Op is used to make sure it is linked.</li>
</ol>
</div>
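A minimal sketch of this registry, assuming invented names (`MY_REGISTER_OP`, `MY_USE_OP`, `OpRegistrar`) rather than the real Paddle macros: an `OpInfo` record, a global `OpInfoMap` from op type name to `OpInfo`, a macro that inserts a key-value pair at static-initialization time, and a `USE`-style macro that forces the registration object to be linked.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct OperatorBase { virtual ~OperatorBase() = default; };

struct OpInfo {
  std::function<std::unique_ptr<OperatorBase>()> creator;  // the Op constructor
  std::string grad_op_type;                                // type of the gradient Op
  std::string proto;     // stand-in for the Op's protobuf description
  std::string checker;   // stand-in for the attribute checker
};

inline std::map<std::string, OpInfo>& OpInfoMap() {
  static std::map<std::string, OpInfo> instance;
  return instance;
}

// The registrar's constructor runs before main() and inserts the key-value pair.
struct OpRegistrar {
  OpRegistrar(const std::string& type, OpInfo info) {
    OpInfoMap()[type] = std::move(info);
  }
};

#define MY_REGISTER_OP(type, OpClass, grad_type)                              \
  static OpRegistrar registrar_##type(                                        \
      #type,                                                                  \
      OpInfo{[] { return std::unique_ptr<OperatorBase>(new OpClass); },       \
             grad_type, "proto of " #type, "checker of " #type});             \
  int TouchOpRegistrar_##type() { return 0; }

// Referencing the Touch function from another translation unit guarantees
// that the object file containing the registration is linked in.
#define MY_USE_OP(type)                        \
  extern int TouchOpRegistrar_##type();        \
  static int use_op_##type = TouchOpRegistrar_##type();
```

For example, `MY_REGISTER_OP(scale, ScaleOp, "scale_grad")` (with a hypothetical `ScaleOp` class) would add the `"scale"` entry, and a translation unit that invokes `MY_USE_OP(scale)` guarantees that entry is present at runtime.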
<hrclass="docutils"/>
<divclass="section"id="backward-module-1-2">
<spanid="backward-module-1-2"></span><h1>Backward Module (1/2)<aclass="headerlink"href="#backward-module-1-2"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="create-backward-operator">
<spanid="create-backward-operator"></span><h2>Create Backward Operator<aclass="headerlink"href="#create-backward-operator"title="Permalink to this headline">¶</a></h2>
<spanid="backward-module-2-2"></span><h1>Backward Module (2/2)<aclass="headerlink"href="#backward-module-2-2"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="build-backward-network">
<spanid="build-backward-network"></span><h2>Build Backward Network<aclass="headerlink"href="#build-backward-network"title="Permalink to this headline">¶</a></h2>
<ulclass="simple">
<li><strong>Input</strong> graph of forwarding operators</li>
<li><strong>Output</strong> graph of backward operators</li>
<li><strong>corner case in construction</strong><ul>
<li>RNN Op => recursively call <codeclass="docutils literal"><spanclass="pre">Backward</span></code> on stepnet</li>
</ul>
</li>
</ul>
</div>
</div>
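A sketch of building the backward graph under the rules above, with invented names: walk the forward ops in reverse and emit a gradient op for each; for an RNN-like op, recurse into its stepnet.

```cpp
#include <memory>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::shared_ptr<OpDesc>> stepnet;  // non-empty only for RNN-like ops
};

std::vector<OpDesc> Backward(const std::vector<OpDesc>& forward) {
  std::vector<OpDesc> backward;
  for (auto it = forward.rbegin(); it != forward.rend(); ++it) {
    OpDesc grad;
    grad.type = it->type + "_grad";
    if (!it->stepnet.empty()) {
      // Corner case: recursively build the backward pass of the stepnet.
      std::vector<OpDesc> forward_step;
      for (const auto& op : it->stepnet) forward_step.push_back(*op);
      for (const OpDesc& g : Backward(forward_step)) {
        grad.stepnet.push_back(std::make_shared<OpDesc>(g));
      }
    }
    backward.push_back(std::move(grad));
  }
  return backward;
}
```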
<hrclass="docutils"/>
<divclass="section"id="scope-variable-tensor">
<spanid="scope-variable-tensor"></span><h1>Scope, Variable, Tensor<aclass="headerlink"href="#scope-variable-tensor"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">Tensor</span></code> is an n-dimension array with type.<ul>
<li>Only dims and data pointers are stored in <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.</li>
<li>All operators on <codeclass="docutils literal"><spanclass="pre">Tensor</span></code> is written in <codeclass="docutils literal"><spanclass="pre">Operator</span></code> or global functions.</li>
<li><codeclass="docutils literal"><spanclass="pre">Variable</span></code> is the inputs and outputs of an operator. Not just <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.<ul>
<li>step_scopes in RNN is a variable and not a tensor.</li>
</ul>
</li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> is where variables store at.<ul>
<li>map<string/*var name */, Variable></li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> has a hierarchical structure. The local scope can get variable from its parent scope.</li>
</ul>
</li>
</ul>
</div>
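A sketch of the hierarchical scope described above: variables live in a `map<string, Variable>`, and lookups fall back to the parent scope. This is simplified for illustration, not the exact Paddle interface.

```cpp
#include <map>
#include <string>

struct Variable { /* holds a Tensor, step_scopes, ... */ };

class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  // Create (or get) a variable in the local scope.
  Variable* Var(const std::string& name) { return &vars_[name]; }

  // Look up locally first, then in the parent scopes.
  const Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return &it->second;
    return parent_ ? parent_->FindVar(name) : nullptr;
  }

 private:
  const Scope* parent_;
  std::map<std::string, Variable> vars_;  // var name -> Variable
};
```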
<hrclass="docutils"/>
<divclass="section"id="block-in-design">
<spanid="block-in-design"></span><h1>Block (in design)<aclass="headerlink"href="#block-in-design"title="Permalink to this headline">¶</a></h1>
<spanid="the-difference-with-original-rnnop"></span><h2>the difference with original RNNOp<aclass="headerlink"href="#the-difference-with-original-rnnop"title="Permalink to this headline">¶</a></h2>
<ulclass="simple">
<li>as an operator is more intuitive than <codeclass="docutils literal"><spanclass="pre">RNNOp</span></code>,</li>
<li>offers new interface <codeclass="docutils literal"><spanclass="pre">Eval(targets)</span></code> to deduce the minimal block to <codeclass="docutils literal"><spanclass="pre">Run</span></code>,</li>
<li>fits the compile-time/ runtime separation design.<ul>
<li>during the compilation, <codeclass="docutils literal"><spanclass="pre">SymbolTable</span></code> stores <codeclass="docutils literal"><spanclass="pre">VarDesc</span></code>s and <codeclass="docutils literal"><spanclass="pre">OpDesc</span></code>s and serialize to a <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code></li>
<li>when graph executes, a Block with <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code> passed in creates <codeclass="docutils literal"><spanclass="pre">Op</span></code> and <codeclass="docutils literal"><spanclass="pre">Var</span></code> then <codeclass="docutils literal"><spanclass="pre">Run</span></code></li>
</ul>
</li>
</ul>
</div>
</div>
<hrclass="docutils"/>
<divclass="section"id="milestone">
<spanid="milestone"></span><h1>Milestone<aclass="headerlink"href="#milestone"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li>take Paddle/books as the main line, the requirement of the models motivates framework refactoring,</li>
<li>model migration<ul>
<li>framework development gives <strong>priority support</strong> to model migration, for example,<ul>
<li>the MNIST demo needs a Python interface,</li>
<li>the RNN models require the framework to support <codeclass="docutils literal"><spanclass="pre">LoDTensor</span></code>.</li>
</ul>
</li>
<li>determine some timelines,</li>
<li>heavily-relied Ops need to be migrated first,</li>
<li>different models can be migrated parallelly.</li>
</ul>
</li>
<li>improve the framework at the same time</li>
<li>accept imperfection, concentrated on solving the specific problem at the right price.</li>
<spanid="control-the-migration-quality"></span><h1>Control the migration quality<aclass="headerlink"href="#control-the-migration-quality"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li>compare the performance of migrated models with old ones.</li>
<li>follow google C style</li>
<li>build the automatic workflow of generating Python/C++ documentations<ul>
<li>the documentation of layers and ops should be written inside the code</li>
<li>take the documentation quality into account when doing PR</li>
<li>preview the documentations, read and improve them from users’ perspective</li>
Built with <ahref="http://sphinx-doc.org/">Sphinx</a> using a <ahref="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <ahref="https://readthedocs.org">Read the Docs</a>.