# Design Doc: Refactorization Overview

The goals of the refactoring include:

1. Make it easy for external contributors to write new elementary computation operations.
1. Make the codebase clean and readable.
1. Introduce a new design of computation representation -- a computation graph of operators and variables.
1. The graph representation helps implement auto-scalable and auto fault-recoverable distributed computing.

## Computation Graphs

1. PaddlePaddle represents the computation, training and inference of DL models by computation graphs.
1. Please dig into [computation graphs](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md) for a solid example.
1. Users write Python programs to describe the graphs and run them (locally or remotely).
1. A graph is composed of *variables* and *operators*.
1. The description of graphs must be serializable/deserializable, so that it
   1. can be sent to the cloud for distributed execution, and
   1. can be sent to clients for mobile or enterprise deployment.
1. The Python program does two things:
   1. *compilation*: run a Python program to generate a protobuf message representation of the graph and send it to
      1. the C++ library `libpaddle.so` for local execution,
      1. the master process of a distributed training job for training, or
      1. the server process of a Kubernetes serving job for distributed serving;
   1. *execution*: construct instances of class `Variable` and `OperatorBase` according to the protobuf message, and run them.

## Description and Realization

At compile time, the Python program generates a protobuf message representation of the graph, i.e., a description of the graph.

At runtime, the C++ program realizes the graph and runs it.

The word *graph* is interchangeable with *block* in this document. A graph represents computation steps and local variables, similar to a C++/Java program block or a pair of curly braces (`{` and `}`).
## Compilation and Execution
1. Run an application Python program to describe the graph. In particular,
   1. create `VarDesc` messages to represent local/intermediate variables,
   1. create operators and set their attributes,
   1. validate attribute values,
   1. infer the types and shapes of variables,
   1. plan memory reuse for variables,
   1. generate the backward and optimization parts of the graph,
   1. possibly split the graph for distributed training.
1. The invocation of `train` or `infer` in the application Python program does the following (a minimal sketch in code follows this list):
   1. create a new Scope instance in the [scope hierarchy](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md) for each run of a block,
      1. realize local variables defined in the BlockDesc message in the new scope,
      1. a scope is similar to the stack frame in programming languages,
   1. create an instance of class `Block`, in which it
      1. realizes operators in the BlockDesc message,
   1. run the Block by calling
      1. `Block::Eval(vector<Variable>* targets)` for forward and backward computations, or
      1. `Block::Eval(vector<Operator>* targets)` for optimization.
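The following is a minimal, hypothetical C++ sketch of the execution phase above: create a scope, realize the variables described by a `BlockDesc`, and evaluate the requested targets. The class shapes here are invented for illustration and are not the real Paddle interfaces.

```cpp
#include <map>
#include <string>
#include <vector>

struct Variable {};  // placeholder for a realized variable

// A scope maps variable names to realized variables, like a stack frame.
struct Scope {
  Variable* Var(const std::string& name) { return &vars_[name]; }
  std::map<std::string, Variable> vars_;
};

struct VarDesc { std::string name; };
struct OpDesc  { std::string type; };
struct BlockDesc {
  std::vector<VarDesc> vars;
  std::vector<OpDesc>  ops;
};

struct Block {
  Block(const BlockDesc& desc, Scope* scope) : desc_(desc), scope_(scope) {
    for (const VarDesc& v : desc.vars) scope_->Var(v.name);  // realize local variables
    // ... realize operators from desc.ops here ...
  }
  // Run only the operators needed to compute *targets.
  void Eval(std::vector<Variable*>* targets) { (void)targets; }

  const BlockDesc& desc_;
  Scope* scope_;
};
```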
## Intermediate Representation (IR)
```text
Compile Time -> IR -> Runtime
```
### Benefit
- Optimization
```text
Compile Time -> IR -> Optimized IR -> Runtime
```
- Send the automatically partitioned IR to different nodes.
  - Automatic Data Parallelism
    ```text
    Compile Time
    |-> Single GPU IR
        |-> [trainer-IR-0, trainer-IR-1, pserver-IR]
            |-> Node-0 (runs trainer-IR-0)
            |-> Node-1 (runs trainer-IR-1)
            |-> Node-2 (runs pserver-IR)
    ```
  - Automatic Model Parallelism (planned for the future)
---
# Operator/OpWithKernel/OpKernel

* `Operator` is the fundamental building block of the user interface.
  * An operator stores input/output variable names and attributes.
  * The `InferShape` interface is used to infer the shapes of the output variables from the shapes of the inputs.
  * Use `Run` to compute the `output` variables from the `input` variables.

---
# OpWithKernel/Kernel

* `OpWithKernel` contains a kernel map (a sketch follows below).
  * `OpWithKernel::Run` gets the device's kernel and invokes `OpKernel::Compute`.
  * `OpKernelKey` is the map key. It currently contains only the device place, but may include the data type later.
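To make the kernel-map idea concrete, here is a minimal, hypothetical sketch (not the real Paddle types): an `OpKernelKey` keyed only by the device place, a per-operator map from key to kernel, and a `Run` that dispatches to `Compute`.

```cpp
#include <map>
#include <memory>
#include <stdexcept>

enum class Place { kCPU, kGPU };  // the device place; a data type field could be added later

struct OpKernelKey {
  Place place;
  bool operator<(const OpKernelKey& o) const { return place < o.place; }
};

struct ExecutionContext { /* inputs, outputs, device handles, ... */ };

struct OpKernel {
  virtual ~OpKernel() = default;
  virtual void Compute(const ExecutionContext& ctx) const = 0;
};

struct OpWithKernel {
  // One operator, many kernels: the key selects the implementation.
  std::map<OpKernelKey, std::unique_ptr<OpKernel>> kernels;

  void Run(const ExecutionContext& ctx, Place place) const {
    auto it = kernels.find(OpKernelKey{place});
    if (it == kernels.end()) throw std::runtime_error("no kernel registered for this place");
    it->second->Compute(ctx);  // dispatch to the device-specific kernel
  }
};
```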
---
# Why separate Kernel and Operator
* Separate GPU and CPU code.
  * Make it possible for Paddle to run without a GPU.
* Make one operator (the user interface) able to contain many implementations.
  * The same `mul` op can have different FP16 and FP32 kernels, and different MKL and Eigen kernels.
---
# Libraries for Kernel development
* `Eigen::Tensor` provides basic math and element-wise functions.
  * Note that `Eigen::Tensor` has a broadcast implementation.
  * Limit the number of `tensor.device(dev) = ` assignments in your code.
* `thrust::transform` and `std::transform` (see the sketch after this list).
  * `thrust` has the same API as the C++ standard library. Using `transform`, one can quickly implement a customized element-wise kernel.
  * `thrust` also has more complex APIs, like `scan`, `reduce` and `reduce_by_key`.
* Hand-written `GPUKernel` and `CPU` code.
  * Do not put kernels in `.h` files. CPU kernels should be in `.cc` files and GPU kernels in `.cu` files, because `GCC` cannot compile GPU code.
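As an illustration of the `transform`-based approach, here is a minimal element-wise kernel sketch. The function names are hypothetical; the point is that a CPU kernel written with `std::transform` maps almost one-to-one onto a GPU kernel written with `thrust::transform`.

```cpp
// Hypothetical element-wise "scale" kernel: y[i] = scale * x[i].
#include <algorithm>
#include <vector>

void ScaleCPU(const std::vector<float>& x, float scale, std::vector<float>* y) {
  y->resize(x.size());
  std::transform(x.begin(), x.end(), y->begin(),
                 [scale](float v) { return scale * v; });
}

// In a .cu file (compiled by nvcc rather than GCC), the GPU version could look like:
//   thrust::transform(thrust::device, x_ptr, x_ptr + n, y_ptr,
//                     [scale] __device__ (float v) { return scale * v; });
```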
---
# Operator Register
## Why is registration necessary?
We need a method to build mappings between Op type names and Op classes.

## How is registration done?
Maintain a map whose key is the type name and whose value is the corresponding Op constructor.
---
# The Registry Map
### `OpInfoMap`
`op_type(string)` -> `OpInfo`
`OpInfo`:
- **`creator`**: The Op constructor.
- **`grad_op_type`**: The type of the gradient Op.
- **`proto`**: The Op's Protobuf, including inputs, outputs and required attributes.
- **`checker`**: Used to check attributes.
---
# Related Concepts
### Op_Maker
Its constructor takes `proto` and `checker`, and completes them during the Op_Maker's construction (a hypothetical sketch follows). ([ScaleOpMaker](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/scale_op.cc#L37))
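A hedged sketch of what such a maker might look like: `OpProto` and `OpAttrChecker` here are simplified stand-ins invented for illustration, not the real Paddle classes; see the linked `ScaleOpMaker` for the actual code.

```cpp
#include <string>
#include <vector>

struct OpProto {           // simplified stand-in for the Op's protobuf description
  std::vector<std::string> inputs, outputs, attrs;
};
struct OpAttrChecker {     // simplified stand-in for the attribute checker
  std::vector<std::string> checked_attrs;
};

class ScaleOpMakerSketch {
 public:
  // The constructor fills in proto (inputs/outputs/attributes) and checker.
  ScaleOpMakerSketch(OpProto* proto, OpAttrChecker* checker) {
    proto->inputs.push_back("X");                 // declare an input
    proto->outputs.push_back("Out");              // declare an output
    proto->attrs.push_back("scale");              // declare an attribute
    checker->checked_attrs.push_back("scale");    // e.g. validate or default it
  }
};
```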
<liclass="toctree-l2"><aclass="reference internal"href="../getstarted/build_and_install/index_en.html">Install and Build</a><ul>
<liclass="toctree-l3"><aclass="reference internal"href="../getstarted/build_and_install/docker_install_en.html">PaddlePaddle in Docker Containers</a></li>
<liclass="toctree-l3"><aclass="reference internal"href="../getstarted/build_and_install/build_from_source_en.html">Installing from Sources</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/usage/k8s/k8s_en.html">Paddle On Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/usage/k8s/k8s_aws_en.html">Distributed PaddlePaddle Training on AWS with Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/dev/build_en.html">Build PaddlePaddle from Source Code and Run Unit Test</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/dev/new_layer_en.html">Write New Layers</a></li>
<spanid="design-doc-refactorization-overview"></span><h1>Design Doc: Refactorization Overview<aclass="headerlink"href="#design-doc-refactorization-overview"title="Permalink to this headline">¶</a></h1>
<p>The goal of refactorizaiton include:</p>
<olclass="simple">
<li>Make it easy for external contributors to write new elementory computaiton operations.</li>
<li>Make the codebase clean and readable.</li>
<li>Introduce a new design of computation representation – a computation graph of operators and variables.</li>
<li>The graph representation helps implementing auto-scalable and auto fault recoverable distributed computing.</li>
</ol>
<divclass="section"id="computation-graphs">
<spanid="computation-graphs"></span><h2>Computation Graphs<aclass="headerlink"href="#computation-graphs"title="Permalink to this headline">¶</a></h2>
<olclass="simple">
<li>PaddlePaddle represent the computation, training and inference of DL models, by computation graphs.</li>
<li>Please dig into <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md">computation graphs</a> for a solid example.</li>
<li>Users write Python programs to describe the graphs and run it (locally or remotely).</li>
<li>A graph is composed of <em>variabels</em> and <em>operators</em>.</li>
<li>The description of graphs must be able to be serialized/deserialized, so it<ol>
<li>could to be sent to the cloud for distributed execution, and</li>
<li>be sent to clients for mobile or enterprise deployment.</li>
</ol>
</li>
<li>The Python program do<ol>
<li><em>compilation</em>: runs a Python program to generate a protobuf message representation of the graph and send it to<ol>
<li>the C++ library <codeclass="docutils literal"><spanclass="pre">libpaddle.so</span></code> for local execution,</li>
<li>the master process of a distributed training job for training, or</li>
<li>the server process of a Kubernetes serving job for distributed serving.</li>
</ol>
</li>
<li><em>execution</em>: according to the protobuf message, constructs instances of class <codeclass="docutils literal"><spanclass="pre">Variable</span></code> and <codeclass="docutils literal"><spanclass="pre">OperatorBase</span></code>, and run them.</li>
<spanid="description-and-realization"></span><h2>Description and Realization<aclass="headerlink"href="#description-and-realization"title="Permalink to this headline">¶</a></h2>
<p>At compile time, the Python program generates protobuf message representation of the graph, or the description of the graph.</p>
<p>At runtime, the C++ program realizes the graph and run it.</p>
<p>The word <em>graph</em> is exchangable with <em>block</em> in this document. A graph represent computation steps and local variables as a C++/Java program block, or a pair of { and }.</p>
<spanid="compilation-and-execution"></span><h2>Compilation and Execution<aclass="headerlink"href="#compilation-and-execution"title="Permalink to this headline">¶</a></h2>
<olclass="simple">
<li>Run an applicaton Python program to describe the graph. In particular,<ol>
<li>create VarDesc to represent local/intermediate variables,</li>
<li>create operators and set attributes,</li>
<li>validate attribute values,</li>
<li>inference the type and the shape of variables,</li>
<li>plan for memory-reuse for variables,</li>
<li>generate backward and optimization part of the Graph.</li>
<li>possiblly split the graph for distributed training.</li>
</ol>
</li>
<li>The invocation of <codeclass="docutils literal"><spanclass="pre">train</span></code> or <codeclass="docutils literal"><spanclass="pre">infer</span></code> in the application Python program:<ol>
<li>create a new Scope instance in the <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md">scope hierarchy</a> for each run of a block,<ol>
<li>realize local variables defined in the BlockDesc message in the new scope,</li>
<li>a scope is similar to the stack frame in programming languages,</li>
</ol>
</li>
<li>create an instance of class <codeclass="docutils literal"><spanclass="pre">Block</span></code>, in which,<ol>
<li>realize operators in the BlockDesc message,</li>
</ol>
</li>
<li>run the Block by calling<ol>
<li><codeclass="docutils literal"><spanclass="pre">Block::Eval(vector<Variable>*</span><spanclass="pre">targets)</span></code> for forward and backward computations, or</li>
<li><codeclass="docutils literal"><spanclass="pre">Block::Eval(vector<Operator>*</span><spanclass="pre">targets)</span></code> for optimization.</li>
<spanid="intermediate-representation-ir"></span><h2>Intermediate Representation (IR)<aclass="headerlink"href="#intermediate-representation-ir"title="Permalink to this headline">¶</a></h2>
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time -> IR -> Runtime
</pre></div>
</div>
<divclass="section"id="benefit">
<spanid="benefit"></span><h3>Benefit<aclass="headerlink"href="#benefit"title="Permalink to this headline">¶</a></h3>
<ul>
<li><pclass="first">Optimization</p>
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time -> IR -> Optimized IR -> Runtime
</pre></div>
</div>
</li>
<li><pclass="first">Send automatically partitioned IR to different nodes.</p>
<ul>
<li><pclass="first">Automatic data parallel</p>
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time
|-> Single GPU IR
|-> [trainer-IR-0, trainer-IR-1, pserver-IR]
|-> Node-0 (runs trainer-IR-0)
|-> Node-1 (runs trainer-IR-1)
|-> Node-2 (runs pserver-IR)
</pre></div>
</div>
</li>
<li><pclass="first">Automatic model parallel (planned for future)</p>
<spanid="operator-opwithkernel-opkernel"></span><h1>Operator/OpWithKernel/OpKernel<aclass="headerlink"href="#operator-opwithkernel-opkernel"title="Permalink to this headline">¶</a></h1>
<li><codeclass="docutils literal"><spanclass="pre">Operator</span></code> is the fundamental building block as the user interface.<ul>
<li>Operator stores input/output variable name, and attributes.</li>
<li>The <codeclass="docutils literal"><spanclass="pre">InferShape</span></code> interface is used to infer output variable shapes by its input shapes.</li>
<li>Use <codeclass="docutils literal"><spanclass="pre">Run</span></code> to compute <codeclass="docutils literal"><spanclass="pre">input</span><spanclass="pre">variables</span></code> to <codeclass="docutils literal"><spanclass="pre">output</span><spanclass="pre">variables</span></code>.</li>
</ul>
</li>
</ul>
</div>
<hrclass="docutils"/>
<divclass="section"id="opwithkernel-kernel">
<spanid="opwithkernel-kernel"></span><h1>OpWithKernel/Kernel<aclass="headerlink"href="#opwithkernel-kernel"title="Permalink to this headline">¶</a></h1>
<li><codeclass="docutils literal"><spanclass="pre">OpWithKernel</span></code> contains a Kernel map.<ul>
<li><codeclass="docutils literal"><spanclass="pre">OpWithKernel::Run</span></code> get device’s kernel, and invoke <codeclass="docutils literal"><spanclass="pre">OpKernel::Compute</span></code>.</li>
<li><codeclass="docutils literal"><spanclass="pre">OpKernelKey</span></code> is the map key. Only device place now, but may be data type later.</li>
<spanid="why-separate-kernel-and-operator"></span><h1>Why separate Kernel and Operator<aclass="headerlink"href="#why-separate-kernel-and-operator"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li>Separate GPU and CPU code.<ul>
<li>Make Paddle can run without GPU.</li>
</ul>
</li>
<li>Make one operator (which is user interface) can contain many implementations.<ul>
<li>Same mul op, different FP16, FP32 Kernel. different MKL, eigen kernel.</li>
<spanid="libraries-for-kernel-development"></span><h1>Libraries for Kernel development<aclass="headerlink"href="#libraries-for-kernel-development"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">Eigen::Tensor</span></code> contains basic math and element-wise functions.<ul>
<li>Note that <codeclass="docutils literal"><spanclass="pre">Eigen::Tensor</span></code> has broadcast implementation.</li>
<li>Limit number of <codeclass="docutils literal"><spanclass="pre">tensor.device(dev)</span><spanclass="pre">=</span></code> in your code.</li>
</ul>
</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust::tranform</span></code> and <codeclass="docutils literal"><spanclass="pre">std::transform</span></code>.<ul>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> has the same API as C++ standard library. Using <codeclass="docutils literal"><spanclass="pre">transform</span></code> can quickly implement a customized elementwise kernel.</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> has more complex API, like <codeclass="docutils literal"><spanclass="pre">scan</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce_by_key</span></code>.</li>
</ul>
</li>
<li>Hand-writing <codeclass="docutils literal"><spanclass="pre">GPUKernel</span></code> and <codeclass="docutils literal"><spanclass="pre">CPU</span></code> code<ul>
<li>Do not write <codeclass="docutils literal"><spanclass="pre">.h</span></code>. CPU Kernel should be in <codeclass="docutils literal"><spanclass="pre">.cc</span></code>. CPU kernel should be in <codeclass="docutils literal"><spanclass="pre">.cu</span></code>. (<codeclass="docutils literal"><spanclass="pre">GCC</span></code> cannot compile GPU code.)</li>
</ul>
</li>
</ul>
</div>
<hrclass="docutils"/>
<divclass="section"id="operator-register">
<spanid="operator-register"></span><h1>Operator Register<aclass="headerlink"href="#operator-register"title="Permalink to this headline">¶</a></h1>
<spanid="why-register-is-necessary"></span><h2>Why register is necessary?<aclass="headerlink"href="#why-register-is-necessary"title="Permalink to this headline">¶</a></h2>
<p>We need a method to build mappings between Op type names and Op classes.</p>
</div>
<divclass="section"id="how-to-do-the-register">
<spanid="how-to-do-the-register"></span><h2>How to do the register?<aclass="headerlink"href="#how-to-do-the-register"title="Permalink to this headline">¶</a></h2>
<p>Maintain a map, whose key is the type name and value is corresponding Op constructor.</p>
</div>
</div>
<hrclass="docutils"/>
<divclass="section"id="the-registry-map">
<spanid="the-registry-map"></span><h1>The Registry Map<aclass="headerlink"href="#the-registry-map"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="opinfomap">
<spanid="opinfomap"></span><h2><codeclass="docutils literal"><spanclass="pre">OpInfoMap</span></code><aclass="headerlink"href="#opinfomap"title="Permalink to this headline">¶</a></h2>
<li><strong><codeclass="docutils literal"><spanclass="pre">creator</span></code></strong>: The Op constructor.</li>
<li><strong><codeclass="docutils literal"><spanclass="pre">grad_op_type</span></code></strong>: The type of the gradient Op.</li>
<li><strong><codeclass="docutils literal"><spanclass="pre">proto</span></code></strong>: The Op’s Protobuf, including inputs, outputs and required attributes.</li>
<li><strong><codeclass="docutils literal"><spanclass="pre">checker</span></code></strong>: Used to check attributes.</li>
</ul>
</div>
</div>
<hrclass="docutils"/>
<divclass="section"id="related-concepts">
<spanid="related-concepts"></span><h1>Related Concepts<aclass="headerlink"href="#related-concepts"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="op-maker">
<spanid="op-maker"></span><h2>Op_Maker<aclass="headerlink"href="#op-maker"title="Permalink to this headline">¶</a></h2>
<p>It’s constructor takes <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code>. They are compeleted during Op_Maker’s construction. (<aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/scale_op.cc#L37">ScaleOpMaker</a>)</p>
</div>
<divclass="section"id="register-macros">
<spanid="register-macros"></span><h2>Register Macros<aclass="headerlink"href="#register-macros"title="Permalink to this headline">¶</a></h2>
<spanid="use-macros"></span><h2><codeclass="docutils literal"><spanclass="pre">USE</span></code> Macros<aclass="headerlink"href="#use-macros"title="Permalink to this headline">¶</a></h2>
<p>make sure the registration process is executed and linked.</p>
</div>
</div>
<hrclass="docutils"/>
<divclass="section"id="register-process">
<spanid="register-process"></span><h1>Register Process<aclass="headerlink"href="#register-process"title="Permalink to this headline">¶</a></h1>
<olclass="simple">
<li>Write Op class, as well as its gradient Op class if there is.</li>
<li>Write Op maker class. In the constructor, describe its inputs, outputs, and attributes.</li>
<li>Invoke macro <codeclass="docutils literal"><spanclass="pre">REGISTER_OP</span></code>. The macro will<ol>
<li>call maker class to complete <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code></li>
<li>with the completed <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code>, build a new key-value pair in the <codeclass="docutils literal"><spanclass="pre">OpInfoMap</span></code></li>
</ol>
</li>
<li>Invoke <codeclass="docutils literal"><spanclass="pre">USE</span></code> macro in where the Op is used to make sure it is linked.</li>
</ol>
</div>
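A minimal sketch of this registry, assuming invented names (`MY_REGISTER_OP`, `MY_USE_OP`, `OpRegistrar`) rather than the real Paddle macros: an `OpInfo` record, a global `OpInfoMap` from op type name to `OpInfo`, a macro that inserts a key-value pair at static-initialization time, and a `USE`-style macro that forces the registration object to be linked.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct OperatorBase { virtual ~OperatorBase() = default; };

struct OpInfo {
  std::function<std::unique_ptr<OperatorBase>()> creator;  // the Op constructor
  std::string grad_op_type;                                // type of the gradient Op
  std::string proto;     // stand-in for the Op's protobuf description
  std::string checker;   // stand-in for the attribute checker
};

inline std::map<std::string, OpInfo>& OpInfoMap() {
  static std::map<std::string, OpInfo> instance;
  return instance;
}

// The registrar's constructor runs before main() and inserts the key-value pair.
struct OpRegistrar {
  OpRegistrar(const std::string& type, OpInfo info) {
    OpInfoMap()[type] = std::move(info);
  }
};

#define MY_REGISTER_OP(type, OpClass, grad_type)                              \
  static OpRegistrar registrar_##type(                                        \
      #type,                                                                  \
      OpInfo{[] { return std::unique_ptr<OperatorBase>(new OpClass); },       \
             grad_type, "proto of " #type, "checker of " #type});             \
  int TouchOpRegistrar_##type() { return 0; }

// Referencing the Touch function from another translation unit guarantees
// that the object file containing the registration is linked in.
#define MY_USE_OP(type)                        \
  extern int TouchOpRegistrar_##type();        \
  static int use_op_##type = TouchOpRegistrar_##type();
```

For example, `MY_REGISTER_OP(scale, ScaleOp, "scale_grad")` (with a hypothetical `ScaleOp` class) would add the `"scale"` entry, and a translation unit that invokes `MY_USE_OP(scale)` guarantees that entry is present at runtime.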
<hrclass="docutils"/>
<divclass="section"id="backward-module-1-2">
<spanid="backward-module-1-2"></span><h1>Backward Module (1/2)<aclass="headerlink"href="#backward-module-1-2"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="create-backward-operator">
<spanid="create-backward-operator"></span><h2>Create Backward Operator<aclass="headerlink"href="#create-backward-operator"title="Permalink to this headline">¶</a></h2>
<spanid="backward-module-2-2"></span><h1>Backward Module (2/2)<aclass="headerlink"href="#backward-module-2-2"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="build-backward-network">
<spanid="build-backward-network"></span><h2>Build Backward Network<aclass="headerlink"href="#build-backward-network"title="Permalink to this headline">¶</a></h2>
<ulclass="simple">
<li><strong>Input</strong> graph of forwarding operators</li>
<li><strong>Output</strong> graph of backward operators</li>
<li><strong>corner case in construction</strong><ul>
<li>RNN Op => recursively call <codeclass="docutils literal"><spanclass="pre">Backward</span></code> on stepnet</li>
</ul>
</li>
</ul>
</div>
</div>
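A sketch of building the backward graph under the rules above, with invented names: walk the forward ops in reverse and emit a gradient op for each; for an RNN-like op, recurse into its stepnet.

```cpp
#include <memory>
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::shared_ptr<OpDesc>> stepnet;  // non-empty only for RNN-like ops
};

std::vector<OpDesc> Backward(const std::vector<OpDesc>& forward) {
  std::vector<OpDesc> backward;
  for (auto it = forward.rbegin(); it != forward.rend(); ++it) {
    OpDesc grad;
    grad.type = it->type + "_grad";
    if (!it->stepnet.empty()) {
      // Corner case: recursively build the backward pass of the stepnet.
      std::vector<OpDesc> forward_step;
      for (const auto& op : it->stepnet) forward_step.push_back(*op);
      for (const OpDesc& g : Backward(forward_step)) {
        grad.stepnet.push_back(std::make_shared<OpDesc>(g));
      }
    }
    backward.push_back(std::move(grad));
  }
  return backward;
}
```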
<hrclass="docutils"/>
<divclass="section"id="scope-variable-tensor">
<spanid="scope-variable-tensor"></span><h1>Scope, Variable, Tensor<aclass="headerlink"href="#scope-variable-tensor"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">Tensor</span></code> is an n-dimension array with type.<ul>
<li>Only dims and data pointers are stored in <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.</li>
<li>All operators on <codeclass="docutils literal"><spanclass="pre">Tensor</span></code> is written in <codeclass="docutils literal"><spanclass="pre">Operator</span></code> or global functions.</li>
<li><codeclass="docutils literal"><spanclass="pre">Variable</span></code> is the inputs and outputs of an operator. Not just <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.<ul>
<li>step_scopes in RNN is a variable and not a tensor.</li>
</ul>
</li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> is where variables store at.<ul>
<li>map<string/*var name */, Variable></li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> has a hierarchical structure. The local scope can get variable from its parent scope.</li>
</ul>
</li>
</ul>
</div>
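A sketch of the hierarchical scope described above: variables live in a `map<string, Variable>`, and lookups fall back to the parent scope. This is simplified for illustration, not the exact Paddle interface.

```cpp
#include <map>
#include <string>

struct Variable { /* holds a Tensor, step_scopes, ... */ };

class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  // Create (or get) a variable in the local scope.
  Variable* Var(const std::string& name) { return &vars_[name]; }

  // Look up locally first, then in the parent scopes.
  const Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return &it->second;
    return parent_ ? parent_->FindVar(name) : nullptr;
  }

 private:
  const Scope* parent_;
  std::map<std::string, Variable> vars_;  // var name -> Variable
};
```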
<hrclass="docutils"/>
<divclass="section"id="block-in-design">
<spanid="block-in-design"></span><h1>Block (in design)<aclass="headerlink"href="#block-in-design"title="Permalink to this headline">¶</a></h1>
<spanid="the-difference-with-original-rnnop"></span><h2>the difference with original RNNOp<aclass="headerlink"href="#the-difference-with-original-rnnop"title="Permalink to this headline">¶</a></h2>
<ulclass="simple">
<li>as an operator is more intuitive than <codeclass="docutils literal"><spanclass="pre">RNNOp</span></code>,</li>
<li>offers new interface <codeclass="docutils literal"><spanclass="pre">Eval(targets)</span></code> to deduce the minimal block to <codeclass="docutils literal"><spanclass="pre">Run</span></code>,</li>
<li>fits the compile-time/ runtime separation design.<ul>
<li>during the compilation, <codeclass="docutils literal"><spanclass="pre">SymbolTable</span></code> stores <codeclass="docutils literal"><spanclass="pre">VarDesc</span></code>s and <codeclass="docutils literal"><spanclass="pre">OpDesc</span></code>s and serialize to a <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code></li>
<li>when graph executes, a Block with <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code> passed in creates <codeclass="docutils literal"><spanclass="pre">Op</span></code> and <codeclass="docutils literal"><spanclass="pre">Var</span></code> then <codeclass="docutils literal"><spanclass="pre">Run</span></code></li>
</ul>
</li>
</ul>
</div>
</div>
<hrclass="docutils"/>
<divclass="section"id="milestone">
<spanid="milestone"></span><h1>Milestone<aclass="headerlink"href="#milestone"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li>take Paddle/books as the main line, the requirement of the models motivates framework refactoring,</li>
<li>model migration<ul>
<li>framework development gives <strong>priority support</strong> to model migration, for example,<ul>
<li>the MNIST demo needs a Python interface,</li>
<li>the RNN models require the framework to support <codeclass="docutils literal"><spanclass="pre">LoDTensor</span></code>.</li>
</ul>
</li>
<li>determine some timelines,</li>
<li>heavily-relied Ops need to be migrated first,</li>
<li>different models can be migrated parallelly.</li>
</ul>
</li>
<li>improve the framework at the same time</li>
<li>accept imperfection, concentrated on solving the specific problem at the right price.</li>
<spanid="control-the-migration-quality"></span><h1>Control the migration quality<aclass="headerlink"href="#control-the-migration-quality"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li>compare the performance of migrated models with old ones.</li>
<li>follow google C style</li>
<li>build the automatic workflow of generating Python/C++ documentations<ul>
<li>the documentation of layers and ops should be written inside the code</li>
<li>take the documentation quality into account when doing PR</li>
<li>preview the documentations, read and improve them from users’ perspective</li>
Built with <ahref="http://sphinx-doc.org/">Sphinx</a> using a <ahref="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <ahref="https://readthedocs.org">Read the Docs</a>.