1. Make it easy for external contributors to write new elementory computaiton operations.
1. Make the codebase clean and readable.
1. Introduce a new design of computation representation -- a computation graph of operators and variables.
1. The graph representation helps implementing auto-scalable and auto fault recoverable distributed computing.
1. Making it easy for external contributors to write new elementary computation operations.
1. Making the codebase clean and readable.
1. Designing a new computation representation -- a computation graph of operators and variables.
1. Implementing auto-scalability and auto fault recoverable distributed computing with the help of computation graphs.
## Computation Graphs
1. PaddlePaddle represent the computation, training and inference of DL models, by computation graphs.
1. PaddlePaddle represents the computation, training, and inference of Deep Learning models as computation graphs.
1. Please dig into [computation graphs](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md) for a solid example.
1. Please refer to [computation graphs](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md) for a concrete example.
1. Users write Python programs to describe the graphs and run it (locally or remotely).
1. Users write Python programs to describe the graphs and run them (locally or remotely).
1. A graph is composed of *variables* and *operators*.
1. The description of graphs must be able to be serialized/deserialized, so it
1. The description of graphs must be capable of being serialized/deserialized, so that
1. could to be sent to the cloud for distributed execution, and
1. be sent to clients for mobile or enterprise deployment.
1. It can be sent to the cloud for distributed execution, and
1. It can be sent to clients for mobile or enterprise deployment.
1. The Python program do
1. The Python program does the following steps
1. *compilation*: runs a Python program to generate a protobuf message representation of the graph and send it to
1. *compilation*: run a Python program to generate a protobuf message representation of the graph and send it to
1. the C++ library `libpaddle.so` for local execution,
1. the master process of a distributed training job for training, or
1. the server process of a Kubernetes serving job for distributed serving.
1. *execution*: according to the protobuf message, constructs instances of class `Variable` and `OperatorBase`, and run them.
1. *execution*: execute the graph by constructing instances of class [`Variable`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h#L24) and [`OperatorBase`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h#L70), according to the protobuf message.
## Description and Realization
## Description and Realization of Computation Graph
At compile time, the Python program generates protobuf message representation of the graph, or the description of the graph.
At compile time, the Python program generates a protobuf message representation of the graph, or the description of the graph.
At runtime, the C++ program realizes the graph and run it.
At runtime, the C++ program realizes the graph and runs it.
The word *graph* is exchangable with *block* in this document. A graph represent computation steps and local variables as a C++/Java program block, or a pair of { and }.
The word *graph* is interchangeable with *block* in this document. A graph represents computation steps and local variables similar to a C++/Java program block, or a pair of curly braces (`{` and `}`).
## Compilation and Execution
1. Run an applicaton Python program to describe the graph. In particular,
1. Run an application Python program to describe the graph. In particular, the Python application program does the following:
1. create VarDesc to represent local/intermediate variables,
1. create operators and set attributes,
1. validate attribute values,
1. inference the type and the shape of variables,
1. plan for memory-reuse for variables,
1. generate backward and optimization part of the Graph.
1. possiblly split the graph for distributed training.
1. Create `VarDesc` to represent local/intermediate variables,
1. Create operators and set attributes,
1. Validate attribute values,
1. Infer the type and the shape of variables,
1. Plan memory-reuse for variables,
1. Generate the backward graph,
1. Optimize the computation graph,
1. Potentially, split the graph for distributed training.
1. The invocation of `train` or `infer` in the application Python program:
1. The invocation of `train` or [`infer`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/inference.py#L108) methods in the application Python program does the following:
1. create a new Scope instance in the [scope hierarchy](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md) for each run of a block,
1. Create a new Scope instance in the [scope hierarchy](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md) for each run of a block,
1. realize local variables defined in the BlockDesc message in the new scope,
1. a scope is similar to the stack frame in programming languages,
1. create an instance of class `Block`, in which,
1. Create an instance of class `Block`, in which,
1. realize operators in the BlockDesc message,
1. run the Block by calling
1. Run the Block by calling
1. `Block::Eval(vector<Variable>* targets)` for forward and backward computations, or
1. `Block::Eval(vector<Operator>* targets)` for optimization.
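
The scope handling in the steps above can be illustrated with a minimal sketch. It assumes a `Scope` that owns its local variables and falls back to its parent on lookup, much like a stack frame. The names `NewVar`, `FindVar`, and `RunBlockOnce` and the single-float `Variable` are hypothetical simplifications, not the real PaddlePaddle interfaces.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a runtime variable; the real class holds typed data.
struct Variable {
  float value = 0.0f;
};

// A scope owns its local variables and can see those of enclosing scopes.
class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  // Create (or fetch) a variable that is local to this scope.
  Variable* NewVar(const std::string& name) { return &vars_[name]; }

  // Look up a variable here first, then in each enclosing scope in turn.
  const Variable* FindVar(const std::string& name) const {
    for (const Scope* s = this; s != nullptr; s = s->parent_) {
      auto it = s->vars_.find(name);
      if (it != s->vars_.end()) return &it->second;
    }
    return nullptr;
  }

 private:
  const Scope* parent_;
  std::map<std::string, Variable> vars_;
};

// One run of a block: a fresh local scope is created, used, and discarded,
// similar to pushing and popping a stack frame.
float RunBlockOnce(const Scope& global) {
  Scope local(&global);
  const Variable* w = local.FindVar("w");  // resolved in the parent scope
  assert(w != nullptr);
  Variable* tmp = local.NewVar("tmp");     // lives only for this run
  tmp->value = 2.0f * w->value;
  return tmp->value;
}
```

Because the local scope is destroyed when `RunBlockOnce` returns, intermediate variables such as `tmp` never leak into the parent scope.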
...
...
@@ -76,14 +77,14 @@ The word *graph* is exchangable with *block* in this document. A graph represen
Compile Time -> IR -> Runtime
```
### Benefit
### Benefits of IR
- Optimization
```text
Compile Time -> IR -> Optimized IR -> Runtime
```
- Send automatically partitioned IR to different nodes.
- Automatic data parallel
- Automatically send partitioned IR to different nodes.
- Automatic Data Parallelism
```text
Compile Time
|-> Single GPU IR
...
...
@@ -92,7 +93,7 @@ Compile Time -> IR -> Runtime
|-> Node-1 (runs trainer-IR-1)
|-> Node-2 (runs pserver-IR)
```
- Automatic model parallel (planned for future)
- Automatic Model Parallelism (planned for future)
---
...
...
@@ -105,10 +106,10 @@ Compile Time -> IR -> Runtime
* `Operator` is the fundamental building block as the user interface.
* Operator stores input/output variable name, and attributes.
* The `InferShape` interface is used to infer output variable shapes by its input shapes.
* Use `Run` to compute `input variables` to `output variables`.
* `Operator` is the fundamental building block of the user interface.
* Operator stores input/output variable names, and attributes.
* The `InferShape` interface is used to infer the shapes of the output variables based on the shapes of the input variables.
* Use `Run` to compute the `output` variables from the `input` variables.
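
As a rough illustration of this interface, the sketch below shows an operator that stores only variable names, infers its output shape from its input shapes, and then computes the output. `Var`, `VarMap`, and `AddOp` are hypothetical, simplified stand-ins introduced for this example; the real `Operator` class has a richer interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; not the real PaddlePaddle classes.
using Shape = std::vector<int>;

struct Var {
  Shape shape;
  std::vector<float> data;
};

using VarMap = std::map<std::string, Var>;

// The operator stores variable *names* only; the data lives elsewhere.
class AddOp {
 public:
  AddOp(std::string x, std::string y, std::string out)
      : x_(std::move(x)), y_(std::move(y)), out_(std::move(out)) {}

  // Infer the output shape from the input shapes without touching any data.
  void InferShape(VarMap* vars) const {
    const Shape& x_shape = vars->at(x_).shape;
    assert(x_shape == vars->at(y_).shape);
    (*vars)[out_].shape = x_shape;
  }

  // Compute the output variable from the input variables.
  void Run(VarMap* vars) const {
    const Var& x = vars->at(x_);
    const Var& y = vars->at(y_);
    Var& out = (*vars)[out_];
    out.data.resize(x.data.size());
    for (std::size_t i = 0; i < x.data.size(); ++i) {
      out.data[i] = x.data[i] + y.data[i];
    }
  }

 private:
  std::string x_, y_, out_;
};
```

Keeping shape inference separate from `Run` means shapes can be checked before any computation happens.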
---
...
...
@@ -126,30 +127,30 @@ Compile Time -> IR -> Runtime
# Why separate Kernel and Operator
* Separate GPU and CPU code.
* Make Paddle can run without GPU.
* Make one operator (which is user interface) can contain many implementations.
* Same mul op, different FP16, FP32 Kernel. different MKL, eigen kernel.
* Make Paddle capable of running without GPU.
* Allow one operator (which is a user interface) to have many implementations.
* For example, the same multiplication op can have different kernel implementations, such as FP16, FP32, MKL, and Eigen kernels.
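
One way to picture this separation is an operator that holds a table of kernels and dispatches to one of them at run time. The sketch below is an assumption-laden illustration: the `(place, library)` key, `AddKernel`, and both kernels are hypothetical names, and both kernels run on the CPU so that the example stays self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Vec = std::vector<float>;
// Hypothetical kernel key: (place, library). The real key type may differ.
using KernelKey = std::pair<std::string, std::string>;
using Kernel = std::function<Vec(const Vec&, const Vec&)>;

// One kernel: a hand-written loop.
Vec MulLoopKernel(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  Vec out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * b[i];
  return out;
}

// Another kernel for the same op, built on std::transform.
Vec MulTransformKernel(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  Vec out(a.size());
  std::transform(a.begin(), a.end(), b.begin(), out.begin(),
                 std::multiplies<float>());
  return out;
}

// The operator is the user-facing interface; it only dispatches to a kernel.
class MulOp {
 public:
  void AddKernel(const KernelKey& key, Kernel kernel) {
    kernels_[key] = std::move(kernel);
  }

  Vec Run(const KernelKey& key, const Vec& a, const Vec& b) const {
    auto it = kernels_.find(key);
    if (it == kernels_.end()) throw std::runtime_error("no kernel for this key");
    return it->second(a, b);
  }

 private:
  std::map<KernelKey, Kernel> kernels_;
};

// Builds a multiplication op with two interchangeable kernels attached.
MulOp MakeMulOp() {
  MulOp op;
  op.AddKernel({"CPU", "loop"}, MulLoopKernel);
  op.AddKernel({"CPU", "transform"}, MulTransformKernel);
  return op;
}
```

A GPU kernel would be attached under its own key in the same way, so a build without GPU support simply omits that entry.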
---
# Libraries for Kernel development
* `Eigen::Tensor` contains basic math and element-wise functions.
* Note that `Eigen::Tensor` has broadcast implementation.
* Limit number of `tensor.device(dev) = ` in your code.
* Limit the number of `tensor.device(dev) = ` in your code.
* `thrust::transform` and `std::transform`.
* `thrust` has the same API as C++ standard library. Using `transform` can quickly implement a customized elementwise kernel.
* `thrust` has more complex API, like `scan`, `reduce`, `reduce_by_key`.
* `thrust` has an API modeled on the C++ standard library. Using `transform`, one can quickly implement customized elementwise kernels.
* `thrust` also has more complex APIs, like `scan`, `reduce`, `reduce_by_key`.
* Hand-writing `GPUKernel` and `CPU` code
* Do not write `.h`. CPU Kernel should be in `.cc`. GPU kernel should be in `.cu`. (`GCC` cannot compile GPU code.)
* Do not write kernels in header (`.h`) files. CPU kernels should be in C++ source (`.cc`) files and GPU kernels should be in CUDA (`.cu`) files. (GCC cannot compile GPU code.)
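
To make the `transform` approach concrete, here is a minimal CPU-only sketch of a customized elementwise kernel built on `std::transform`. The `LeakyRelu` functor and the `ElementwiseKernel` helper are hypothetical names chosen for this example. A `thrust::transform` version would take the same iterator-plus-functor arguments, although on the GPU the functor's call operator would also need `__host__ __device__` qualifiers, which are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A custom element-wise functor (illustrative only).
struct LeakyRelu {
  float alpha;
  float operator()(float v) const { return v > 0.0f ? v : alpha * v; }
};

// Applies `op` to every element of `x` and writes the results into `out`.
template <typename UnaryOp>
void ElementwiseKernel(const std::vector<float>& x, std::vector<float>* out,
                       UnaryOp op) {
  assert(out != nullptr);
  out->resize(x.size());
  std::transform(x.begin(), x.end(), out->begin(), op);
}
```

Swapping in a different functor is enough to obtain a different elementwise kernel, which is why this style is quick to write.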
---
# Operator Register
# Operator Registration
## Why register is necessary?
## Why is registration necessary?
We need a method to build mappings between Op type names and Op classes.
## How to do the register?
## How is registration implemented?
Maintain a map, whose key is the type name and value is corresponding Op constructor.
By maintaining a map whose key is the type name and whose value is the corresponding Op constructor.
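
A minimal sketch of such a map is shown below, under the assumption that each Op can be default-constructed. `OpBase`, `OpRegistry`, `RegisterOp`, and `CreateOp` are hypothetical names used only for illustration; the real registration is done through macros and stores more than a constructor.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical operator base class, reduced to what the example needs.
struct OpBase {
  virtual ~OpBase() = default;
  virtual std::string Type() const = 0;
};

struct MulOp : OpBase {
  std::string Type() const override { return "mul"; }
};

using OpCreator = std::function<std::unique_ptr<OpBase>()>;

// A function-local static avoids static-initialization-order problems.
std::map<std::string, OpCreator>& OpRegistry() {
  static std::map<std::string, OpCreator> registry;
  return registry;
}

// Maps a type name to a constructor; returns false if the name is taken.
template <typename OpType>
bool RegisterOp(const std::string& type) {
  OpCreator creator = [] { return std::unique_ptr<OpBase>(new OpType()); };
  return OpRegistry().emplace(type, creator).second;
}

// Looks up the constructor by type name and builds a new Op instance.
std::unique_ptr<OpBase> CreateOp(const std::string& type) {
  auto it = OpRegistry().find(type);
  if (it == OpRegistry().end()) return nullptr;
  return it->second();
}

// Registration runs during static initialization, which is roughly what a
// registration macro would expand to in this simplified model.
static const bool kMulRegistered = RegisterOp<MulOp>("mul");
```

With this in place, code that only knows the string `"mul"` can obtain a new operator instance through `CreateOp("mul")`.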
<spanid="design-doc-refactorization-overview"></span><h1>Design Doc: Refactorization Overview<aclass="headerlink"href="#design-doc-refactorization-overview"title="Permalink to this headline">¶</a></h1>
<p>The goal of refactorizaiton include:</p>
<p>The goals of refactoring include:</p>
<olclass="simple">
<li>Make it easy for external contributors to write new elementory computaiton operations.</li>
<li>Make the codebase clean and readable.</li>
<li>Introduce a new design of computation representation – a computation graph of operators and variables.</li>
<li>The graph representation helps implementing auto-scalable and auto fault recoverable distributed computing.</li>
<li>Making it easy for external contributors to write new elementary computation operations.</li>
<li>Making the codebase clean and readable.</li>
<li>Designing a new computation representation – a computation graph of operators and variables.</li>
<li>Implementing auto-scalability and auto fault recoverable distributed computing with the help of computation graphs.</li>
</ol>
<divclass="section"id="computation-graphs">
<spanid="computation-graphs"></span><h2>Computation Graphs<aclass="headerlink"href="#computation-graphs"title="Permalink to this headline">¶</a></h2>
<olclass="simple">
<li>PaddlePaddle represent the computation, training and inference of DL models, by computation graphs.</li>
<li>Please dig into <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md">computation graphs</a> for a solid example.</li>
<li>Users write Python programs to describe the graphs and run it (locally or remotely).</li>
<li>PaddlePaddle represents the computation, training, and inference of Deep Learning models as computation graphs.</li>
<li>Please refer to <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md">computation graphs</a> for a concrete example.</li>
<li>Users write Python programs to describe the graphs and run them (locally or remotely).</li>
<li>A graph is composed of <em>variables</em> and <em>operators</em>.</li>
<li>The description of graphs must be able to be serialized/deserialized, so it<ol>
<li>could to be sent to the cloud for distributed execution, and</li>
<li>be sent to clients for mobile or enterprise deployment.</li>
<li>The description of graphs must be capable of being serialized/deserialized, so that<ol>
<li>It can be sent to the cloud for distributed execution, and</li>
<li>It can be sent to clients for mobile or enterprise deployment.</li>
</ol>
</li>
<li>The Python program do<ol>
<li><em>compilation</em>: runs a Python program to generate a protobuf message representation of the graph and send it to<ol>
<li>The Python program does the following steps<ol>
<li><em>compilation</em>: run a Python program to generate a protobuf message representation of the graph and send it to<ol>
<li>the C++ library <codeclass="docutils literal"><spanclass="pre">libpaddle.so</span></code> for local execution,</li>
<li>the master process of a distributed training job for training, or</li>
<li>the server process of a Kubernetes serving job for distributed serving.</li>
</ol>
</li>
<li><em>execution</em>: according to the protobuf message, constructs instances of class <codeclass="docutils literal"><spanclass="pre">Variable</span></code> and <codeclass="docutils literal"><spanclass="pre">OperatorBase</span></code>, and run them.</li>
<li><em>execution</em>: execute the graph by constructing instances of class <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h#L24"><codeclass="docutils literal"><spanclass="pre">Variable</span></code></a> and <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h#L70"><codeclass="docutils literal"><spanclass="pre">OperatorBase</span></code></a>, according to the protobuf message.</li>
<spanid="description-and-realization"></span><h2>Description and Realization<aclass="headerlink"href="#description-and-realization"title="Permalink to this headline">¶</a></h2>
<p>At compile time, the Python program generates protobuf message representation of the graph, or the description of the graph.</p>
<p>At runtime, the C++ program realizes the graph and run it.</p>
<spanid="description-and-realization-of-computation-graph"></span><h2>Description and Realization of Computation Graph<aclass="headerlink"href="#description-and-realization-of-computation-graph"title="Permalink to this headline">¶</a></h2>
<p>At compile time, the Python program generates a protobuf message representation of the graph, or the description of the graph.</p>
<p>At runtime, the C++ program realizes the graph and runs it.</p>
<p>The word <em>graph</em> is exchangable with <em>block</em> in this document. A graph represent computation steps and local variables as a C++/Java program block, or a pair of { and }.</p>
<p>The word <em>graph</em> is interchangeable with <em>block</em> in this document. A graph represents computation steps and local variables similar to a C++/Java program block, or a pair of curly braces (<codeclass="docutils literal"><spanclass="pre">{</span></code> and <codeclass="docutils literal"><spanclass="pre">}</span></code>).</p>
<spanid="compilation-and-execution"></span><h2>Compilation and Execution<aclass="headerlink"href="#compilation-and-execution"title="Permalink to this headline">¶</a></h2>
<olclass="simple">
<li>Run an applicaton Python program to describe the graph. In particular,<ol>
<li>create VarDesc to represent local/intermediate variables,</li>
<li>create operators and set attributes,</li>
<li>validate attribute values,</li>
<li>inference the type and the shape of variables,</li>
<li>plan for memory-reuse for variables,</li>
<li>generate backward and optimization part of the Graph.</li>
<li>possiblly split the graph for distributed training.</li>
<li>Run an application Python program to describe the graph. In particular, the Python application program does the following:<ol>
<li>Create <codeclass="docutils literal"><spanclass="pre">VarDesc</span></code> to represent local/intermediate variables,</li>
<li>Create operators and set attributes,</li>
<li>Validate attribute values,</li>
<li>Infer the type and the shape of variables,</li>
<li>Plan memory-reuse for variables,</li>
<li>Generate the backward graph,</li>
<li>Optimize the computation graph,</li>
<li>Potentially, split the graph for distributed training.</li>
</ol>
</li>
<li>The invocation of <codeclass="docutils literal"><spanclass="pre">train</span></code> or <codeclass="docutils literal"><spanclass="pre">infer</span></code> in the application Python program:<ol>
<li>create a new Scope instance in the <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md">scope hierarchy</a> for each run of a block,<ol>
<li>The invocation of <codeclass="docutils literal"><spanclass="pre">train</span></code> or <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/inference.py#L108"><codeclass="docutils literal"><spanclass="pre">infer</span></code></a> methods in the application Python program does the following:<ol>
<li>Create a new Scope instance in the <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md">scope hierarchy</a> for each run of a block,<ol>
<li>realize local variables defined in the BlockDesc message in the new scope,</li>
<li>a scope is similar to the stack frame in programming languages,</li>
</ol>
</li>
<li>create an instance of class <codeclass="docutils literal"><spanclass="pre">Block</span></code>, in which,<ol>
<li>Create an instance of class <codeclass="docutils literal"><spanclass="pre">Block</span></code>, in which,<ol>
<li>realize operators in the BlockDesc message,</li>
</ol>
</li>
<li>run the Block by calling<ol>
<li>Run the Block by calling<ol>
<li><codeclass="docutils literal"><spanclass="pre">Block::Eval(vector<Variable>*</span><spanclass="pre">targets)</span></code> for forward and backward computations, or</li>
<li><codeclass="docutils literal"><spanclass="pre">Block::Eval(vector<Operator>*</span><spanclass="pre">targets)</span></code> for optimization.</li>
</ol>
...
...
@@ -258,17 +259,17 @@
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time -> IR -> Runtime
</pre></div>
</div>
<divclass="section"id="benefit">
<spanid="benefit"></span><h3>Benefit<aclass="headerlink"href="#benefit"title="Permalink to this headline">¶</a></h3>
<divclass="section"id="benefits-of-ir">
<spanid="benefits-of-ir"></span><h3>Benefits of IR<aclass="headerlink"href="#benefits-of-ir"title="Permalink to this headline">¶</a></h3>
<ul>
<li><pclass="first">Optimization</p>
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time -> IR -> Optimized IR -> Runtime
</pre></div>
</div>
</li>
<li><pclass="first">Send automatically partitioned IR to different nodes.</p>
<li><pclass="first">Automatically send partitioned IR to different nodes.</p>
<ul>
<li><pclass="first">Automatic data parallel</p>
<li><pclass="first">Automatic Data Parallelism</p>
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time
|-> Single GPU IR
|-> [trainer-IR-0, trainer-IR-1, pserver-IR]
...
...
@@ -278,7 +279,7 @@
</pre></div>
</div>
</li>
<li><pclass="first">Automatic model parallel (planned for future)</p>
<li><pclass="first">Automatic Model Parallelism (planned for future)</p>
</li>
</ul>
</li>
...
...
@@ -296,10 +297,10 @@
<spanid="operator"></span><h1>Operator<aclass="headerlink"href="#operator"title="Permalink to this headline">¶</a></h1>
<li><codeclass="docutils literal"><spanclass="pre">Operator</span></code> is the fundamental building block as the user interface.<ul>
<li>Operator stores input/output variable name, and attributes.</li>
<li>The <codeclass="docutils literal"><spanclass="pre">InferShape</span></code> interface is used to infer output variable shapes by its input shapes.</li>
<li>Use <codeclass="docutils literal"><spanclass="pre">Run</span></code> to compute <codeclass="docutils literal"><spanclass="pre">input</span><spanclass="pre">variables</span></code> to <codeclass="docutils literal"><spanclass="pre">output</span><spanclass="pre">variables</span></code>.</li>
<li><codeclass="docutils literal"><spanclass="pre">Operator</span></code> is the fundamental building block of the user interface.<ul>
<li>Operator stores input/output variable names, and attributes.</li>
<li>The <codeclass="docutils literal"><spanclass="pre">InferShape</span></code> interface is used to infer the shapes of the output variables based on the shapes of the input variables.</li>
<li>Use <codeclass="docutils literal"><spanclass="pre">Run</span></code> to compute the <codeclass="docutils literal"><spanclass="pre">output</span></code> variables from the <codeclass="docutils literal"><spanclass="pre">input</span></code> variables.</li>
</ul>
</li>
</ul>
...
...
@@ -322,11 +323,11 @@
<spanid="why-separate-kernel-and-operator"></span><h1>Why separate Kernel and Operator<aclass="headerlink"href="#why-separate-kernel-and-operator"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li>Separate GPU and CPU code.<ul>
<li>Make Paddle can run without GPU.</li>
<li>Make Paddle capable of running without GPU.</li>
</ul>
</li>
<li>Make one operator (which is user interface) can contain many implementations.<ul>
<li>Same mul op, different FP16, FP32 Kernel. different MKL, eigen kernel.</li>
<li>Allow one operator (which is a user interface) to have many implementations.<ul>
<li>For example, the same multiplication op can have different kernel implementations, such as FP16, FP32, MKL, and Eigen kernels.</li>
</ul>
</li>
</ul>
...
...
@@ -337,30 +338,30 @@
<ulclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">Eigen::Tensor</span></code> contains basic math and element-wise functions.<ul>
<li>Note that <codeclass="docutils literal"><spanclass="pre">Eigen::Tensor</span></code> has broadcast implementation.</li>
<li>Limit number of <codeclass="docutils literal"><spanclass="pre">tensor.device(dev)</span><spanclass="pre">=</span></code> in your code.</li>
<li>Limit the number of <codeclass="docutils literal"><spanclass="pre">tensor.device(dev)</span><spanclass="pre">=</span></code> in your code.</li>
</ul>
</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust::transform</span></code> and <codeclass="docutils literal"><spanclass="pre">std::transform</span></code>.<ul>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> has the same API as C++ standard library. Using <codeclass="docutils literal"><spanclass="pre">transform</span></code> can quickly implement a customized elementwise kernel.</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code>has more complex API, like <codeclass="docutils literal"><spanclass="pre">scan</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce_by_key</span></code>.</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> has an API modeled on the C++ standard library. Using <codeclass="docutils literal"><spanclass="pre">transform</span></code>, one can quickly implement customized elementwise kernels.</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> also has more complex APIs, like <codeclass="docutils literal"><spanclass="pre">scan</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce_by_key</span></code>.</li>
</ul>
</li>
<li>Hand-writing <codeclass="docutils literal"><spanclass="pre">GPUKernel</span></code> and <codeclass="docutils literal"><spanclass="pre">CPU</span></code> code<ul>
<li>Do not write <codeclass="docutils literal"><spanclass="pre">.h</span></code>. CPU Kernel should be in <codeclass="docutils literal"><spanclass="pre">.cc</span></code>. GPU kernel should be in <codeclass="docutils literal"><spanclass="pre">.cu</span></code>. (<codeclass="docutils literal"><spanclass="pre">GCC</span></code> cannot compile GPU code.)</li>
<li>Do not write kernels in header (<codeclass="docutils literal"><spanclass="pre">.h</span></code>) files. CPU kernels should be in C++ source (<codeclass="docutils literal"><spanclass="pre">.cc</span></code>) files and GPU kernels should be in CUDA (<codeclass="docutils literal"><spanclass="pre">.cu</span></code>) files. (GCC cannot compile GPU code.)</li>
</ul>
</li>
</ul>
</div>
<hrclass="docutils"/>
<divclass="section"id="operator-register">
<spanid="operator-register"></span><h1>Operator Register<aclass="headerlink"href="#operator-register"title="Permalink to this headline">¶</a></h1>
<spanid="why-register-is-necessary"></span><h2>Why register is necessary?<aclass="headerlink"href="#why-register-is-necessary"title="Permalink to this headline">¶</a></h2>
<divclass="section"id="operator-registration">
<spanid="operator-registration"></span><h1>Operator Registration<aclass="headerlink"href="#operator-registration"title="Permalink to this headline">¶</a></h1>
<spanid="why-registration-is-necessary"></span><h2>Why is registration necessary?<aclass="headerlink"href="#why-registration-is-necessary"title="Permalink to this headline">¶</a></h2>
<p>We need a method to build mappings between Op type names and Op classes.</p>
</div>
<divclass="section"id="how-to-do-the-register">
<spanid="how-to-do-the-register"></span><h2>How to do the register?<aclass="headerlink"href="#how-to-do-the-register"title="Permalink to this headline">¶</a></h2>
<p>Maintain a map, whose key is the type name and value is corresponding Op constructor.</p>
<spanid="how-is-registration-implemented"></span><h2>How is registration implemented?<aclass="headerlink"href="#how-is-registration-implemented"title="Permalink to this headline">¶</a></h2>
<p>By maintaining a map whose key is the type name and whose value is the corresponding Op constructor.</p>
</div>
</div>
<hrclass="docutils"/>
...
...
@@ -393,22 +394,22 @@
</div>
</div>
<divclass="section"id="use-macros">
<spanid="use-macros"></span><h2><codeclass="docutils literal"><spanclass="pre">USE</span></code> Macros<aclass="headerlink"href="#use-macros"title="Permalink to this headline">¶</a></h2>
<p>make sure the registration process is executed and linked.</p>
<spanid="use-macros"></span><h2>USE Macros<aclass="headerlink"href="#use-macros"title="Permalink to this headline">¶</a></h2>
<p>Make sure the registration process is executed and linked.</p>
</div>
</div>
<hrclass="docutils"/>
<divclass="section"id="register-process">
<spanid="register-process"></span><h1>Register Process<aclass="headerlink"href="#register-process"title="Permalink to this headline">¶</a></h1>
<divclass="section"id="registration-process">
<spanid="registration-process"></span><h1>Registration Process<aclass="headerlink"href="#registration-process"title="Permalink to this headline">¶</a></h1>
<olclass="simple">
<li>Write Op class, as well as its gradient Op class if there is.</li>
<li>Write Op maker class. In the constructor, describe its inputs, outputs, and attributes.</li>
<li>Invoke macro <codeclass="docutils literal"><spanclass="pre">REGISTER_OP</span></code>. The macro will<ol>
<li>call maker class to complete <codeclass="docutils literal"><spanclass="pre">proto</span></code> and<codeclass="docutils literal"><spanclass="pre">checker</span></code></li>
<li>with the completed <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code>, build a new key-value pair in the <codeclass="docutils literal"><spanclass="pre">OpInfoMap</span></code></li>
<li>Write an Op class and its gradient Op class, if required.</li>
<li>Write an Op maker class. In the constructor of this class, describe the inputs, outputs and attributes of the operator.</li>
<li>Invoke the macro <codeclass="docutils literal"><spanclass="pre">REGISTER_OP</span></code>. This macro will<ol>
<li>Call the maker class to complete the <codeclass="docutils literal"><spanclass="pre">proto</span></code> and the <codeclass="docutils literal"><spanclass="pre">checker</span></code></li>
<li>Using the completed <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code>, it will add a new key-value pair to the <codeclass="docutils literal"><spanclass="pre">OpInfoMap</span></code></li>
</ol>
</li>
<li>Invoke <codeclass="docutils literal"><spanclass="pre">USE</span></code> macro in where the Op is used to make sure it is linked.</li>
<li>Invoke the <codeclass="docutils literal"><spanclass="pre">USE</span></code> macro where the Op is used, to make sure that it is linked.</li>
</ol>
</div>
<hrclass="docutils"/>
...
...
@@ -417,7 +418,7 @@
<divclass="section"id="create-backward-operator">
<spanid="create-backward-operator"></span><h2>Create Backward Operator<aclass="headerlink"href="#create-backward-operator"title="Permalink to this headline">¶</a></h2>
<spanid="build-backward-network"></span><h2>Build Backward Network<aclass="headerlink"href="#build-backward-network"title="Permalink to this headline">¶</a></h2>
<ulclass="simple">
<li><strong>Input</strong> graph of forwarding operators</li>
<li><strong>Output</strong> graph of backward operators</li>
<li><strong>corner case in construction</strong><ul>
<li>RNN Op => recursively call <codeclass="docutils literal"><spanclass="pre">Backward</span></code> on stepnet</li>
</ul>
</li>
...
...
@@ -446,17 +447,17 @@
<ulclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">Tensor</span></code> is an n-dimension array with type.<ul>
<li>Only dims and data pointers are stored in <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.</li>
<li>All operators on <codeclass="docutils literal"><spanclass="pre">Tensor</span></code> is written in <codeclass="docutils literal"><spanclass="pre">Operator</span></code> or global functions.</li>
<li>All operations on <codeclass="docutils literal"><spanclass="pre">Tensor</span></code> are written in <codeclass="docutils literal"><spanclass="pre">Operator</span></code> or global functions.</li>
<li><codeclass="docutils literal"><spanclass="pre">Variable</span></code> is the inputs and outputs of an operator. Not just <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.<ul>
<li>step_scopes in RNN is a variable and not a tensor.</li>
<li><codeclass="docutils literal"><spanclass="pre">Variable</span></code> instances are the inputs and the outputs of an operator. Not just <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.<ul>
<li><codeclass="docutils literal"><spanclass="pre">step_scopes</span></code> in RNN is a variable and not a tensor.</li>
</ul>
</li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> is where variables store at.<ul>
<li>map<string/*var name */, Variable></li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> has a hierarchical structure. The local scope can get variable from its parent scope.</li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> is where variables are stored.<ul>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> has a hierarchical structure. The local scope can get variables from its parent scope.</li>
<spanid="the-difference-with-original-rnnop"></span><h2>the difference with original RNNOp<aclass="headerlink"href="#the-difference-with-original-rnnop"title="Permalink to this headline">¶</a></h2>
<ulclass="simple">
<li>as an operator is more intuitive than <codeclass="docutils literal"><spanclass="pre">RNNOp</span></code>,</li>
<li>offers new interface <codeclass="docutils literal"><spanclass="pre">Eval(targets)</span></code> to deduce the minimal block to <codeclass="docutils literal"><spanclass="pre">Run</span></code>,</li>
<li>fits the compile-time/ runtime separation design.<ul>
<li>during the compilation, <codeclass="docutils literal"><spanclass="pre">SymbolTable</span></code> stores <codeclass="docutils literal"><spanclass="pre">VarDesc</span></code>s and <codeclass="docutils literal"><spanclass="pre">OpDesc</span></code>s and serialize to a <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code></li>
<li>when graph executes, a Block with <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code> passed in creates <codeclass="docutils literal"><spanclass="pre">Op</span></code> and <codeclass="docutils literal"><spanclass="pre">Var</span></code> then <codeclass="docutils literal"><spanclass="pre">Run</span></code></li>
<li>As an operator is more intuitive than <codeclass="docutils literal"><spanclass="pre">RNNOp</span></code>,</li>
<li>Offers a new interface <codeclass="docutils literal"><spanclass="pre">Eval(targets)</span></code> to deduce the minimal block to <codeclass="docutils literal"><spanclass="pre">Run</span></code>,</li>
<li>Fits the compile-time/ runtime separation design paradigm.<ul>
<li>During the compilation, <codeclass="docutils literal"><spanclass="pre">SymbolTable</span></code> stores <codeclass="docutils literal"><spanclass="pre">VarDesc</span></code>s and <codeclass="docutils literal"><spanclass="pre">OpDesc</span></code>s and serializes them to a <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code>.</li>
<li>When the graph executes, a Block is constructed with the <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code> passed in. It then creates <codeclass="docutils literal"><spanclass="pre">Op</span></code> and <codeclass="docutils literal"><spanclass="pre">Var</span></code> instances and invokes <codeclass="docutils literal"><spanclass="pre">Run</span></code>.</li>
</ul>
</li>
</ul>
...
...
@@ -481,32 +482,32 @@
<divclass="section"id="milestone">
<spanid="milestone"></span><h1>Milestone<aclass="headerlink"href="#milestone"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li>take Paddle/books as the main line, the requirement of the models motivates framework refactoring,</li>
<li>model migration<ul>
<li>framework development gives <strong>priority support</strong> to model migration, for example,<ul>
<li>Take Paddle/books as the main line; the requirements of the models motivate framework refactoring,</li>
<li>Model migration<ul>
<li>Framework development gives <strong>priority support</strong> to model migration, for example,<ul>
<li>the MNIST demo needs a Python interface,</li>
<li>the RNN models require the framework to support <codeclass="docutils literal"><spanclass="pre">LoDTensor</span></code>.</li>
</ul>
</li>
<li>determine some timelines,</li>
<li>heavily-relied Ops need to be migrated first,</li>
<li>different models can be migrated parallelly.</li>
<li>Determine some timelines,</li>
<li>Frequently used Ops need to be migrated first,</li>
<li>Different models can be migrated in parallel.</li>
</ul>
</li>
<li>improve the framework at the same time</li>
<li>accept imperfection, concentrated on solving the specific problem at the right price.</li>
<li>Improve the framework at the same time</li>
<li>Accept imperfection and concentrate on solving the specific problem at the right price.</li>
<spanid="control-the-migration-quality"></span><h1>Control the migration quality<aclass="headerlink"href="#control-the-migration-quality"title="Permalink to this headline">¶</a></h1>
<ulclass="simple">
<li>compare the performance of migrated models with old ones.</li>
<li>follow google C style</li>
<li>build the automatic workflow of generating Python/C++ documentations<ul>
<li>the documentation of layers and ops should be written inside the code</li>
<li>take the documentation quality into account when doing PR</li>
<li>preview the documentations, read and improve them from users’ perspective</li>
<li>Compare the performance of migrated models with old ones.</li>
<li>Follow the Google C++ style guide.</li>
<li>Build the automatic workflow of generating Python/C++ documentation.<ul>
<li>The documentation of layers and ops should be written inside the code.</li>
<li>Take the documentation quality into account when submitting pull requests.</li>
<li>Preview the documentation, then read and improve it from a user’s perspective.</li>
1. Make it easy for external contributors to write new elementory computaiton operations.
1. Make the codebase clean and readable.
1. Introduce a new design of computation representation -- a computation graph of operators and variables.
1. The graph representation helps implementing auto-scalable and auto fault recoverable distributed computing.
1. Making it easy for external contributors to write new elementary computation operations.
1. Making the codebase clean and readable.
1. Designing a new computation representation -- a computation graph of operators and variables.
1. Implementing auto-scalable and auto fault-recoverable distributed computing with the help of computation graphs.
## Computation Graphs
1. PaddlePaddle represent the computation, training and inference of DL models, by computation graphs.
1. PaddlePaddle represents the computation, training and inference of Deep Learning models by computation graphs.
1. Please dig into [computation graphs](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md) for a solid example.
1. Please refer to [computation graphs](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md) for a concrete example.
1. Users write Python programs to describe the graphs and run it (locally or remotely).
1. Users write Python programs to describe the graphs and run them (locally or remotely).
1. A graph is composed of *variables* and *operators*.
1. The description of graphs must be able to be serialized/deserialized, so it
1. The description of graphs must be capable of being serialized/deserialized, so that
1. could to be sent to the cloud for distributed execution, and
1. be sent to clients for mobile or enterprise deployment.
1. It can be sent to the cloud for distributed execution, and
1. It can be sent to clients for mobile or enterprise deployment.
1. The Python program do
1. The Python program does the following steps
1. *compilation*: runs a Python program to generate a protobuf message representation of the graph and send it to
1. *compilation*: run a Python program to generate a protobuf message representation of the graph and send it to
1. the C++ library `libpaddle.so` for local execution,
1. the master process of a distributed training job for training, or
1. the server process of a Kubernetes serving job for distributed serving.
1. *execution*: according to the protobuf message, constructs instances of class `Variable` and `OperatorBase`, and run them.
1. *execution*: execute the graph by constructing instances of class [`Variable`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h#L24) and [`OperatorBase`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h#L70), according to the protobuf message.
## Description and Realization
## Description and Realization of Computation Graph
At compile time, the Python program generates protobuf message representation of the graph, or the description of the graph.
At compile time, the Python program generates a protobuf message representation of the graph, or the description of the graph.
At runtime, the C++ program realizes the graph and run it.
At runtime, the C++ program realizes the graph and runs it.
The word *graph* is exchangable with *block* in this document. A graph represent computation steps and local variables as a C++/Java program block, or a pair of { and }.
The word *graph* is interchangeable with *block* in this document. A graph represents computation steps and local variables similar to a C++/Java program block, or a pair of curly braces (`{` and `}`).
## Compilation and Execution
1. Run an applicaton Python program to describe the graph. In particular,
1. Run an application Python program to describe the graph. In particular, the Python application program does the following:
1. create VarDesc to represent local/intermediate variables,
1. create operators and set attributes,
1. validate attribute values,
1. inference the type and the shape of variables,
1. plan for memory-reuse for variables,
1. generate backward and optimization part of the Graph.
1. possiblly split the graph for distributed training.
1. Create `VarDesc` to represent local/intermediate variables,
1. Create operators and set attributes,
1. Validate attribute values,
1. Infer the type and the shape of variables,
1. Plan memory-reuse for variables,
1. Generate the backward graph,
1. Optimize the computation graph,
1. Potentially, split the graph for distributed training.
1. The invocation of `train` or `infer` in the application Python program:
1. The invocation of `train` or [`infer`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/inference.py#L108) methods in the application Python program does the following:
1. create a new Scope instance in the [scope hierarchy](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md) for each run of a block,
1. Create a new Scope instance in the [scope hierarchy](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md) for each run of a block,
1. realize local variables defined in the BlockDesc message in the new scope,
1. a scope is similar to the stack frame in programming languages,
1. create an instance of class `Block`, in which,
1. Create an instance of class `Block`, in which,
1. realize operators in the BlockDesc message,
1. run the Block by calling
1. Run the Block by calling
1. `Block::Eval(vector<Variable>* targets)` for forward and backward computations, or
1. `Block::Eval(vector<Operator>* targets)` for optimization.
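
The scope behaviour described in the steps above (a fresh scope per block run, acting like a stack frame that can still see its parent's variables) can be sketched roughly as follows. This is an illustration only: the `Variable` struct and the `NewVar`/`FindVar` method names are assumptions made for this sketch, not the framework's actual interface.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the framework's Variable; it only holds an int.
struct Variable {
  int value = 0;
};

// A scope maps variable names to variables and falls back to its parent,
// much like a stack frame that can still read enclosing frames.
class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  // Create the variable in this scope, or return it if it already exists here.
  Variable* NewVar(const std::string& name) {
    assert(!name.empty());
    std::unique_ptr<Variable>& slot = vars_[name];
    if (!slot) slot = std::make_unique<Variable>();
    return slot.get();
  }

  // Look up locally first, then walk up the parent chain.
  // Returns nullptr when no scope in the chain holds the name.
  Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return it->second.get();
    return parent_ != nullptr ? parent_->FindVar(name) : nullptr;
  }

 private:
  std::unordered_map<std::string, std::unique_ptr<Variable>> vars_;
  const Scope* parent_;
};
```

In this sketch a local scope created with `Scope local(&global)` resolves names defined in `global`, while `global` never sees variables created in `local`.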
...
...
@@ -76,14 +77,14 @@ The word *graph* is exchangable with *block* in this document. A graph represen
Compile Time -> IR -> Runtime
```
### Benefit
### Benefits of IR
- Optimization
```text
Compile Time -> IR -> Optimized IR -> Runtime
```
- Send automatically partitioned IR to different nodes.
- Automatic data parallel
- Automatically send partitioned IR to different nodes.
- Automatic Data Parallelism
```text
Compile Time
|-> Single GPU IR
...
...
@@ -92,7 +93,7 @@ Compile Time -> IR -> Runtime
|-> Node-1 (runs trainer-IR-1)
|-> Node-2 (runs pserver-IR)
```
- Automatic model parallel (planned for future)
- Automatic Model Parallelism (planned for future)
---
...
...
@@ -105,10 +106,10 @@ Compile Time -> IR -> Runtime
* `Operator` is the fundamental building block as the user interface.
* Operator stores input/output variable name, and attributes.
* The `InferShape` interface is used to infer output variable shapes by its input shapes.
* Use `Run` to compute `input variables` to `output variables`.
* `Operator` is the fundamental building block of the user interface.
* Operator stores input/output variable names, and attributes.
* The `InferShape` interface is used to infer the shapes of the output variables based on the shapes of the input variables.
* Use `Run` to compute the `output` variables from the `input` variables.
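
A rough sketch of the operator contract listed above: the operator keeps only variable names and attributes, `InferShape` derives the output shape, and `Run` fills the output. The `Tensor`, `VarTable`, and `ScaleOp` types are hypothetical and exist only for this illustration; the real `OperatorBase` interface differs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical minimal tensor: a shape plus flat float data.
struct Tensor {
  std::vector<std::size_t> dims;
  std::vector<float> data;
};

// Hypothetical flat name -> tensor table standing in for a scope.
using VarTable = std::map<std::string, Tensor>;

// The operator stores variable names and attributes, not the data itself.
struct ScaleOp {
  std::string input;   // input variable name
  std::string output;  // output variable name
  float scale;         // attribute

  // Infer the output shape from the input shape (identical for scaling).
  void InferShape(VarTable* vars) const {
    (*vars)[output].dims = vars->at(input).dims;
  }

  // Compute the output variable from the input variable.
  void Run(VarTable* vars) const {
    const Tensor& in = vars->at(input);
    Tensor& out = (*vars)[output];
    assert(out.dims == in.dims);  // InferShape is expected to have run first
    out.data.resize(in.data.size());
    for (std::size_t i = 0; i < in.data.size(); ++i) {
      out.data[i] = in.data[i] * scale;
    }
  }
};
```

Keeping shape inference separate from `Run` lets shapes be checked before any data is touched, which matches the compile-time/runtime split described in this document.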
---
...
...
@@ -126,30 +127,30 @@ Compile Time -> IR -> Runtime
# Why separate Kernel and Operator
* Separate GPU and CPU code.
* Make Paddle can run without GPU.
* Make one operator (which is user interface) can contain many implementations.
* Same mul op, different FP16, FP32 Kernel. different MKL, eigen kernel.
* Make Paddle capable of running without GPU.
* Allow one operator (which is a user interface) to have many implementations.
* For example, the same multiplication op can have different kernel implementations, such as FP16, FP32, MKL, and Eigen kernels.
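
One way to picture the operator/kernel split is a single op that owns a table of kernels keyed by device and data type and dispatches at run time. The `Place`, `DataType`, and `MulOp` names below are assumptions for this sketch, and scalar arithmetic stands in for real tensor kernels.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <utility>

// Hypothetical device and precision tags used as the kernel lookup key.
enum class Place { kCPU, kGPU };
enum class DataType { kFP32, kFP64 };

using KernelKey = std::pair<Place, DataType>;
using Kernel = std::function<double(double, double)>;

// One user-facing operator, many interchangeable kernels.
class MulOp {
 public:
  void RegisterKernel(Place place, DataType type, Kernel kernel) {
    assert(static_cast<bool>(kernel));  // reject empty callables
    kernels_[{place, type}] = std::move(kernel);
  }

  bool HasKernel(Place place, DataType type) const {
    return kernels_.count({place, type}) > 0;
  }

  // Dispatch to the kernel registered for the requested place and type.
  double Run(Place place, DataType type, double a, double b) const {
    auto it = kernels_.find({place, type});
    if (it == kernels_.end()) throw std::runtime_error("no kernel registered");
    return it->second(a, b);
  }

 private:
  std::map<KernelKey, Kernel> kernels_;
};

// A build without GPU support simply registers no GPU kernels.
inline MulOp MakeCpuOnlyMulOp() {
  MulOp op;
  op.RegisterKernel(Place::kCPU, DataType::kFP32, [](double a, double b) {
    return static_cast<double>(static_cast<float>(a) * static_cast<float>(b));
  });
  op.RegisterKernel(Place::kCPU, DataType::kFP64,
                    [](double a, double b) { return a * b; });
  return op;
}
```

Because the GPU entries are just absent from the table, the same operator definition works in a CPU-only build.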
---
# Libraries for Kernel development
* `Eigen::Tensor` contains basic math and element-wise functions.
* Note that `Eigen::Tensor` has broadcast implementation.
* Limit number of `tensor.device(dev) = ` in your code.
* Limit the number of `tensor.device(dev) = ` in your code.
* `thrust::transform` and `std::transform`.
* `thrust` has the same API as C++ standard library. Using `transform` can quickly implement a customized elementwise kernel.
* `thrust` has more complex API, like `scan`, `reduce`, `reduce_by_key`.
* `thrust` has the same API as the C++ standard library. Using `transform`, one can quickly implement customized elementwise kernels.
* `thrust` also has more complex APIs, like `scan`, `reduce`, `reduce_by_key`.
* Hand-writing `GPUKernel` and `CPU` code
* Do not write `.h`. CPU Kernel should be in `.cc`. GPU kernel should be in `.cu`. (`GCC` cannot compile GPU code.)
* Do not write in header (`.h`) files. CPU kernels should be in C++ source (`.cc`) files and GPU kernels should be in CUDA (`.cu`) files. (GCC cannot compile GPU code.)
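
As a minimal CPU-side illustration of the `transform` approach mentioned above, the sketch below writes a unary and a binary element-wise kernel with `std::transform`. The function names `ReluCPU` and `AddCPU` are made up for this example; a GPU version would, per the note above, use `thrust::transform` with the same argument shape inside a `.cu` file.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Unary element-wise kernel: out[i] = max(in[i], 0).
inline void ReluCPU(const std::vector<float>& in, std::vector<float>* out) {
  assert(out != nullptr);
  out->resize(in.size());
  std::transform(in.begin(), in.end(), out->begin(),
                 [](float x) { return x > 0.0f ? x : 0.0f; });
}

// Binary element-wise kernel: out[i] = a[i] + b[i] for equally sized inputs.
inline void AddCPU(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>* out) {
  assert(out != nullptr);
  assert(a.size() == b.size());
  out->resize(a.size());
  std::transform(a.begin(), a.end(), b.begin(), out->begin(),
                 [](float x, float y) { return x + y; });
}
```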
---
# Operator Register
# Operator Registration
## Why register is necessary?
## Why registration is necessary?
We need a method to build mappings between Op type names and Op classes.
## How to do the register?
## How is registration implemented?
Maintain a map, whose key is the type name and value is corresponding Op constructor.
Maintain a map whose key is the type name and whose value is the corresponding Op constructor.
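
The map from type name to constructor could look roughly like the sketch below. It is not the actual `REGISTER_OP` machinery: `OpRegistry`, `OpRegistrar`, `CreateOp`, and `CosineOp` are hypothetical names, and the `OperatorBase` here is a stripped-down stand-in defined only for this example.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Stripped-down stand-in for the framework's operator base class.
struct OperatorBase {
  virtual ~OperatorBase() = default;
  virtual std::string Type() const = 0;
};

using OpCreator = std::function<std::unique_ptr<OperatorBase>()>;

// A function-local static avoids static-initialization-order problems
// when registrars in different translation units run before main().
inline std::map<std::string, OpCreator>& OpRegistry() {
  static std::map<std::string, OpCreator> registry;
  return registry;
}

// Constructing a registrar adds one type-name -> constructor entry.
template <typename OpType>
struct OpRegistrar {
  explicit OpRegistrar(const std::string& type) {
    bool inserted =
        OpRegistry()
            .emplace(type,
                     [] { return std::unique_ptr<OperatorBase>(new OpType()); })
            .second;
    assert(inserted);  // each type name should be registered only once
    (void)inserted;
  }
};

// Build an operator from its type name; nullptr if the name is unknown.
inline std::unique_ptr<OperatorBase> CreateOp(const std::string& type) {
  auto it = OpRegistry().find(type);
  if (it == OpRegistry().end()) return nullptr;
  return it->second();
}

// Example operator plus the global registrar a registration macro might expand to.
struct CosineOp : OperatorBase {
  std::string Type() const override { return "cosine"; }
};
static OpRegistrar<CosineOp> cosine_registrar("cosine");
```

With this in place, code that only knows the string `"cosine"` (for example, from a deserialized graph description) can obtain an operator instance through `CreateOp("cosine")`.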
<li>PaddlePaddle represent the computation, training and inference of DL models, by computation graphs.</li>
<li>Please dig into <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md">computation graphs</a> for a solid example.</li>
<li>Users write Python programs to describe the graphs and run it (locally or remotely).</li>
<li>PaddlePaddle represents the computation, training and inference of Deep Learning models by computation graphs.</li>
<li>Please refer to <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/graph.md">computation graphs</a> for a concrete example.</li>
<li>Users write Python programs to describe the graphs and run them (locally or remotely).</li>
<li>A graph is composed of <em>variables</em> and <em>operators</em>.</li>
<li>The description of graphs must be able to be serialized/deserialized, so it<ol>
<li>could to be sent to the cloud for distributed execution, and</li>
<li>be sent to clients for mobile or enterprise deployment.</li>
<li>The description of graphs must be capable of being serialized/deserialized, so that<ol>
<li>It can be sent to the cloud for distributed execution, and</li>
<li>It can be sent to clients for mobile or enterprise deployment.</li>
</ol>
</li>
<li>The Python program do<ol>
<li><em>compilation</em>: runs a Python program to generate a protobuf message representation of the graph and send it to<ol>
<li>The Python program does the following steps<ol>
<li><em>compilation</em>: run a Python program to generate a protobuf message representation of the graph and send it to<ol>
<li>the C++ library <codeclass="docutils literal"><spanclass="pre">libpaddle.so</span></code> for local execution,</li>
<li>the master process of a distributed training job for training, or</li>
<li>the server process of a Kubernetes serving job for distributed serving.</li>
</ol>
</li>
<li><em>execution</em>: according to the protobuf message, constructs instances of class <codeclass="docutils literal"><spanclass="pre">Variable</span></code> and <codeclass="docutils literal"><spanclass="pre">OperatorBase</span></code>, and run them.</li>
<li><em>execution</em>: execute the graph by constructing instances of class <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h#L24"><codeclass="docutils literal"><spanclass="pre">Variable</span></code></a> and <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h#L70"><codeclass="docutils literal"><spanclass="pre">OperatorBase</span></code></a>, according to the protobuf message.</li>
<spanid="description-and-realization"></span><h2>Description and Realization<aclass="headerlink"href="#description-and-realization"title="永久链接至标题">¶</a></h2>
<p>At compile time, the Python program generates protobuf message representation of the graph, or the description of the graph.</p>
<p>At runtime, the C++ program realizes the graph and run it.</p>
<spanid="description-and-realization-of-computation-graph"></span><h2>Description and Realization of Computation Graph<aclass="headerlink"href="#description-and-realization-of-computation-graph"title="永久链接至标题">¶</a></h2>
<p>At compile time, the Python program generates a protobuf message representation of the graph, or the description of the graph.</p>
<p>At runtime, the C++ program realizes the graph and runs it.</p>
<p>The word <em>graph</em> is exchangable with <em>block</em> in this document. A graph represent computation steps and local variables as a C++/Java program block, or a pair of { and }.</p>
<p>The word <em>graph</em> is interchangeable with <em>block</em> in this document. A graph represents computation steps and local variables similar to a C++/Java program block, or a pair of curly braces (<codeclass="docutils literal"><spanclass="pre">{</span></code> and <codeclass="docutils literal"><spanclass="pre">}</span></code>).</p>
<spanid="compilation-and-execution"></span><h2>Compilation and Execution<aclass="headerlink"href="#compilation-and-execution"title="永久链接至标题">¶</a></h2>
<olclass="simple">
<li>Run an applicaton Python program to describe the graph. In particular,<ol>
<li>create VarDesc to represent local/intermediate variables,</li>
<li>create operators and set attributes,</li>
<li>validate attribute values,</li>
<li>inference the type and the shape of variables,</li>
<li>plan for memory-reuse for variables,</li>
<li>generate backward and optimization part of the Graph.</li>
<li>possiblly split the graph for distributed training.</li>
<li>Run an application Python program to describe the graph. In particular, the Python application program does the following:<ol>
<li>Create <codeclass="docutils literal"><spanclass="pre">VarDesc</span></code> to represent local/intermediate variables,</li>
<li>Create operators and set attributes,</li>
<li>Validate attribute values,</li>
<li>Infer the type and the shape of variables,</li>
<li>Plan memory-reuse for variables,</li>
<li>Generate the backward graph,</li>
<li>Optimize the computation graph,</li>
<li>Potentially, split the graph for distributed training.</li>
</ol>
</li>
<li>The invocation of <codeclass="docutils literal"><spanclass="pre">train</span></code> or <codeclass="docutils literal"><spanclass="pre">infer</span></code> in the application Python program:<ol>
<li>create a new Scope instance in the <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md">scope hierarchy</a> for each run of a block,<ol>
<li>The invocation of <codeclass="docutils literal"><spanclass="pre">train</span></code> or <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/inference.py#L108"><codeclass="docutils literal"><spanclass="pre">infer</span></code></a> methods in the application Python program does the following:<ol>
<li>Create a new Scope instance in the <aclass="reference external"href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/scope.md">scope hierarchy</a> for each run of a block,<ol>
<li>realize local variables defined in the BlockDesc message in the new scope,</li>
<li>a scope is similar to the stack frame in programming languages,</li>
</ol>
</li>
<li>create an instance of class <codeclass="docutils literal"><spanclass="pre">Block</span></code>, in which,<ol>
<li>Create an instance of class <codeclass="docutils literal"><spanclass="pre">Block</span></code>, in which,<ol>
<li>realize operators in the BlockDesc message,</li>
</ol>
</li>
<li>run the Block by calling<ol>
<li>Run the Block by calling<ol>
<li><codeclass="docutils literal"><spanclass="pre">Block::Eval(vector<Variable>*</span><spanclass="pre">targets)</span></code> for forward and backward computations, or</li>
<li><codeclass="docutils literal"><spanclass="pre">Block::Eval(vector<Operator>*</span><spanclass="pre">targets)</span></code> for optimization.</li>
</ol>
...
...
@@ -272,17 +273,17 @@
<divclass="highlight-text"><divclass="highlight"><pre><span></span>Compile Time -> IR -> Runtime
<li><codeclass="docutils literal"><spanclass="pre">Operator</span></code> is the fundamental building block as the user interface.<ul>
<li>Operator stores input/output variable name, and attributes.</li>
<li>The <codeclass="docutils literal"><spanclass="pre">InferShape</span></code> interface is used to infer output variable shapes by its input shapes.</li>
<li>Use <codeclass="docutils literal"><spanclass="pre">Run</span></code> to compute <codeclass="docutils literal"><spanclass="pre">input</span><spanclass="pre">variables</span></code> to <codeclass="docutils literal"><spanclass="pre">output</span><spanclass="pre">variables</span></code>.</li>
<li><codeclass="docutils literal"><spanclass="pre">Operator</span></code> is the fundamental building block of the user interface.<ul>
<li>Operator stores input/output variable names, and attributes.</li>
<li>The <codeclass="docutils literal"><spanclass="pre">InferShape</span></code> interface is used to infer the shapes of the output variables based on the shapes of the input variables.</li>
<li>Use <codeclass="docutils literal"><spanclass="pre">Run</span></code> to compute the <codeclass="docutils literal"><spanclass="pre">output</span></code> variables from the <codeclass="docutils literal"><spanclass="pre">input</span></code> variables.</li>
</ul>
</li>
</ul>
...
...
@@ -336,11 +337,11 @@
<spanid="why-separate-kernel-and-operator"></span><h1>Why separate Kernel and Operator<aclass="headerlink"href="#why-separate-kernel-and-operator"title="永久链接至标题">¶</a></h1>
<ulclass="simple">
<li>Separate GPU and CPU code.<ul>
<li>Make Paddle can run without GPU.</li>
<li>Make Paddle capable of running without GPU.</li>
</ul>
</li>
<li>Make one operator (which is user interface) can contain many implementations.<ul>
<li>Same mul op, different FP16, FP32 Kernel. different MKL, eigen kernel.</li>
<li>Allow one operator (which is a user interface) to have many implementations.<ul>
<li>For example, the same multiplication op can have different kernel implementations, such as FP16, FP32, MKL, and Eigen kernels.</li>
</ul>
</li>
</ul>
...
...
@@ -351,30 +352,30 @@
<ulclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">Eigen::Tensor</span></code> contains basic math and element-wise functions.<ul>
<li>Note that <codeclass="docutils literal"><spanclass="pre">Eigen::Tensor</span></code> has broadcast implementation.</li>
<li>Limit number of <codeclass="docutils literal"><spanclass="pre">tensor.device(dev)</span><spanclass="pre">=</span></code> in your code.</li>
<li>Limit the number of <codeclass="docutils literal"><spanclass="pre">tensor.device(dev)</span><spanclass="pre">=</span></code> in your code.</li>
</ul>
</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust::transform</span></code> and <codeclass="docutils literal"><spanclass="pre">std::transform</span></code>.<ul>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> has the same API as C++ standard library. Using <codeclass="docutils literal"><spanclass="pre">transform</span></code> can quickly implement a customized elementwise kernel.</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code>has more complex API, like <codeclass="docutils literal"><spanclass="pre">scan</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce_by_key</span></code>.</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> has the same API as the C++ standard library. Using <codeclass="docutils literal"><spanclass="pre">transform</span></code>, one can quickly implement customized elementwise kernels.</li>
<li><codeclass="docutils literal"><spanclass="pre">thrust</span></code> also has more complex APIs, like <codeclass="docutils literal"><spanclass="pre">scan</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce</span></code>, <codeclass="docutils literal"><spanclass="pre">reduce_by_key</span></code>.</li>
</ul>
</li>
<li>Hand-writing <codeclass="docutils literal"><spanclass="pre">GPUKernel</span></code> and <codeclass="docutils literal"><spanclass="pre">CPU</span></code> code<ul>
<li>Do not write <codeclass="docutils literal"><spanclass="pre">.h</span></code>. CPU Kernel should be in <codeclass="docutils literal"><spanclass="pre">.cc</span></code>. GPU kernel should be in <codeclass="docutils literal"><spanclass="pre">.cu</span></code>. (<codeclass="docutils literal"><spanclass="pre">GCC</span></code> cannot compile GPU code.)</li>
<li>Do not write in header (<codeclass="docutils literal"><spanclass="pre">.h</span></code>) files. CPU kernels should be in C++ source (<codeclass="docutils literal"><spanclass="pre">.cc</span></code>) files and GPU kernels should be in CUDA (<codeclass="docutils literal"><spanclass="pre">.cu</span></code>) files. (GCC cannot compile GPU code.)</li>
<spanid="why-register-is-necessary"></span><h2>Why register is necessary?<aclass="headerlink"href="#why-register-is-necessary"title="永久链接至标题">¶</a></h2>
<spanid="why-registration-is-necessary"></span><h2>Why registration is necessary?<aclass="headerlink"href="#why-registration-is-necessary"title="永久链接至标题">¶</a></h2>
<p>We need a method to build mappings between Op type names and Op classes.</p>
</div>
<divclass="section"id="how-to-do-the-register">
<spanid="how-to-do-the-register"></span><h2>How to do the register?<aclass="headerlink"href="#how-to-do-the-register"title="永久链接至标题">¶</a></h2>
<p>Maintain a map, whose key is the type name and value is corresponding Op constructor.</p>
<spanid="how-is-registration-implemented"></span><h2>How is registration implemented?<aclass="headerlink"href="#how-is-registration-implemented"title="永久链接至标题">¶</a></h2>
<p>Maintain a map whose key is the type name and whose value is the corresponding Op constructor.</p>
<li>Write Op class, as well as its gradient Op class if there is.</li>
<li>Write Op maker class. In the constructor, describe its inputs, outputs, and attributes.</li>
<li>Invoke macro <codeclass="docutils literal"><spanclass="pre">REGISTER_OP</span></code>. The macro will<ol>
<li>call maker class to complete <codeclass="docutils literal"><spanclass="pre">proto</span></code> and<codeclass="docutils literal"><spanclass="pre">checker</span></code></li>
<li>with the completed <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code>, build a new key-value pair in the <codeclass="docutils literal"><spanclass="pre">OpInfoMap</span></code></li>
<li>Write an Op class and its gradient Op class, if required.</li>
<li>Write an Op maker class. In the constructor of this class, describe the inputs, outputs and attributes of the operator.</li>
<li>Invoke the macro <codeclass="docutils literal"><spanclass="pre">REGISTER_OP</span></code>. This macro will<ol>
<li>Call the maker class to complete the <codeclass="docutils literal"><spanclass="pre">proto</span></code> and the <codeclass="docutils literal"><spanclass="pre">checker</span></code></li>
<li>Using the completed <codeclass="docutils literal"><spanclass="pre">proto</span></code> and <codeclass="docutils literal"><spanclass="pre">checker</span></code>, it will add a new key-value pair to the <codeclass="docutils literal"><spanclass="pre">OpInfoMap</span></code></li>
</ol>
</li>
<li>Invoke <codeclass="docutils literal"><spanclass="pre">USE</span></code> macro in where the Op is used to make sure it is linked.</li>
<li>Invoke the <codeclass="docutils literal"><spanclass="pre">USE</span></code> macro where the Op is used, to make sure that it is linked.</li>
<li>RNN Op => recursively call <codeclass="docutils literal"><spanclass="pre">Backward</span></code> on stepnet</li>
</ul>
</li>
...
...
@@ -460,17 +461,17 @@
<ulclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">Tensor</span></code> is an n-dimension array with type.<ul>
<li>Only dims and data pointers are stored in <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.</li>
<li>All operators on <codeclass="docutils literal"><spanclass="pre">Tensor</span></code> is written in <codeclass="docutils literal"><spanclass="pre">Operator</span></code> or global functions.</li>
<li>All operations on <codeclass="docutils literal"><spanclass="pre">Tensor</span></code> are written in <codeclass="docutils literal"><spanclass="pre">Operator</span></code> or global functions.</li>
<li><codeclass="docutils literal"><spanclass="pre">Variable</span></code> is the inputs and outputs of an operator. Not just <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.<ul>
<li>step_scopes in RNN is a variable and not a tensor.</li>
<li><codeclass="docutils literal"><spanclass="pre">Variable</span></code> instances are the inputs and the outputs of an operator. Not just <codeclass="docutils literal"><spanclass="pre">Tensor</span></code>.<ul>
<li><codeclass="docutils literal"><spanclass="pre">step_scopes</span></code> in RNN is a variable and not a tensor.</li>
</ul>
</li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> is where variables store at.<ul>
<li>map<string/*var name */, Variable></li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> has a hierarchical structure. The local scope can get variable from its parent scope.</li>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> is where variables are stored.<ul>
<li><codeclass="docutils literal"><spanclass="pre">Scope</span></code> has a hierarchical structure. The local scope can get variables from its parent scope.</li>
<spanid="the-difference-with-original-rnnop"></span><h2>the difference with original RNNOp<aclass="headerlink"href="#the-difference-with-original-rnnop"title="永久链接至标题">¶</a></h2>
<ulclass="simple">
<li>as an operator is more intuitive than <codeclass="docutils literal"><spanclass="pre">RNNOp</span></code>,</li>
<li>offers new interface <codeclass="docutils literal"><spanclass="pre">Eval(targets)</span></code> to deduce the minimal block to <codeclass="docutils literal"><spanclass="pre">Run</span></code>,</li>
<li>fits the compile-time/ runtime separation design.<ul>
<li>during the compilation, <codeclass="docutils literal"><spanclass="pre">SymbolTable</span></code> stores <codeclass="docutils literal"><spanclass="pre">VarDesc</span></code>s and <codeclass="docutils literal"><spanclass="pre">OpDesc</span></code>s and serialize to a <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code></li>
<li>when graph executes, a Block with <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code> passed in creates <codeclass="docutils literal"><spanclass="pre">Op</span></code> and <codeclass="docutils literal"><spanclass="pre">Var</span></code> then <codeclass="docutils literal"><spanclass="pre">Run</span></code></li>
<li>As an operator is more intuitive than <codeclass="docutils literal"><spanclass="pre">RNNOp</span></code>,</li>
<li>Offers a new interface <codeclass="docutils literal"><spanclass="pre">Eval(targets)</span></code> to deduce the minimal block to <codeclass="docutils literal"><spanclass="pre">Run</span></code>,</li>
<li>Fits the compile-time/ runtime separation design paradigm.<ul>
<li>During the compilation, <codeclass="docutils literal"><spanclass="pre">SymbolTable</span></code> stores <codeclass="docutils literal"><spanclass="pre">VarDesc</span></code>s and <codeclass="docutils literal"><spanclass="pre">OpDesc</span></code>s and serializes them to a <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code>.</li>
<li>When the graph executes, a Block is constructed with the <codeclass="docutils literal"><spanclass="pre">BlockDesc</span></code> passed in. It then creates <codeclass="docutils literal"><spanclass="pre">Op</span></code> and <codeclass="docutils literal"><spanclass="pre">Var</span></code> instances and invokes <codeclass="docutils literal"><spanclass="pre">Run</span></code>.</li>
<spanid="control-the-migration-quality"></span><h1>Control the migration quality<aclass="headerlink"href="#control-the-migration-quality"title="永久链接至标题">¶</a></h1>
<ulclass="simple">
<li>compare the performance of migrated models with old ones.</li>
<li>follow google C style</li>
<li>build the automatic workflow of generating Python/C++ documentations<ul>
<li>the documentation of layers and ops should be written inside the code</li>
<li>take the documentation quality into account when doing PR</li>
<li>preview the documentations, read and improve them from users’ perspective</li>
<li>Compare the performance of migrated models with old ones.</li>
<li>Follow the Google C++ style guide.</li>
<li>Build the automatic workflow of generating Python/C++ documentation.<ul>
<li>The documentation of layers and ops should be written inside the code.</li>
<li>Take the documentation quality into account when submitting pull requests.</li>
<li>Preview the documentation, then read and improve it from a user’s perspective.</li>