Deploy to GitHub Pages: 0071b5f7

34087742 · Travis CI · 864af933 · 34087742 · 34087742 · 34087742
5 changed file
--- a/develop/doc/_sources/howto/dev/new_op_kernel_en.md.txt
+++ b/develop/doc/_sources/howto/dev/new_op_kernel_en.md.txt
+## Add Kernels for a New Device
+### Background
+PaddlePaddle Fluid has hundreds of operators.  Each operator can have one or more kernels.  A kernel is an implementation of the operator for a certain device, which could be a hardware device, e.g., the CUDA GPU, or a library that utilizes a device, e.g., Intel MKL that makes full use of the Xeon CPU.
+[This document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/new_op_en.md) explains how to add an operator, and its kernels.  The kernels of an operator are indexed by a C++ type [`OpKernelType`](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/operator_kernel_type.md).  An operator chooses the right kernel at runtime.  This choosing mechanism is described [here](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/switch_kernel.md).
+### Write Kernels for A New Device 
+#### Add A New Device
+  For historical reasons, we misuse the word *library* for *device*.  For example, we call the device type the *library type*.  An example is the header file [`library_type.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/library_type.h#L24).  We will correct this ASAP.
+To register a new device, we need to add an enum value to `LibraryType`:
+```cpp
+enum class LibraryType {
+  kPlain = 0,
+  kMKLDNN = 1,
+  kCUDNN = 2,
+};
+```
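As a minimal sketch of this step, assume a hypothetical FPGA back end. The `kFPGA` value and the `LibraryTypeToString` helper below are illustrative names only and are not claimed to match the actual contents of `library_type.h`.

```cpp
#include <cassert>
#include <string>

// Copy of the enum shown above, extended with one hypothetical value.
enum class LibraryType {
  kPlain = 0,
  kMKLDNN = 1,
  kCUDNN = 2,
  kFPGA = 3,  // hypothetical new device, for illustration only
};

// Illustrative helper (an assumption, not the framework's API):
// map each enum value to a readable name.
inline std::string LibraryTypeToString(LibraryType type) {
  switch (type) {
    case LibraryType::kPlain:
      return "PLAIN";
    case LibraryType::kMKLDNN:
      return "MKLDNN";
    case LibraryType::kCUDNN:
      return "CUDNN";
    case LibraryType::kFPGA:
      return "FPGA";
  }
  assert(false && "unknown LibraryType");
  return "";
}
```

Giving the new value the next unused integer leaves the existing values unchanged, so code that already stores or compares them would not need to change.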
+#### Add A New [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L53)
+If you have a new kind of device, you first need to add a new kind of [`Place`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L53). For example, `CUDAPlace`:
+```cpp
+struct CUDAPlace {
+  CUDAPlace() : CUDAPlace(0) {}
+  explicit CUDAPlace(int d) : device(d) {}
+  inline int GetDeviceId() const { return device; }
+  // needed for variant equality comparison
+  inline bool operator==(const CUDAPlace &o) const {
+    return device == o.device;
+  }
+  inline bool operator!=(const CUDAPlace &o) const { return !(*this == o); }
+  int device;
+};
+typedef boost::variant<CUDAPlace, CPUPlace> Place;
+```
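A new place could follow the same pattern as `CUDAPlace`. The sketch below adds a hypothetical `FPGAPlace` and appends it to the variant. To stay self-contained it swaps `boost::variant` for C++17 `std::variant`, and `CPUPlace` is a simplified stand-in. `FPGAPlace` and `IsFPGAPlace` are assumed names and do not exist in `place.h`.

```cpp
#include <cassert>
#include <variant>

// Simplified stand-in for the framework's CPUPlace.
struct CPUPlace {
  bool operator==(const CPUPlace &) const { return true; }
  bool operator!=(const CPUPlace &) const { return false; }
};

// Same shape as the CUDAPlace shown above.
struct CUDAPlace {
  CUDAPlace() : CUDAPlace(0) {}
  explicit CUDAPlace(int d) : device(d) {}
  int GetDeviceId() const { return device; }
  bool operator==(const CUDAPlace &o) const { return device == o.device; }
  bool operator!=(const CUDAPlace &o) const { return !(*this == o); }
  int device;
};

// Hypothetical place for a new device, modelled on CUDAPlace.
struct FPGAPlace {
  FPGAPlace() : FPGAPlace(0) {}
  // Device ids are assumed to be non-negative in this sketch.
  explicit FPGAPlace(int d) : device(d) { assert(d >= 0); }
  int GetDeviceId() const { return device; }
  // Needed for variant equality comparison.
  bool operator==(const FPGAPlace &o) const { return device == o.device; }
  bool operator!=(const FPGAPlace &o) const { return !(*this == o); }
  int device;
};

// The new place is appended to the list of alternatives.
using Place = std::variant<CUDAPlace, CPUPlace, FPGAPlace>;

// Illustrative query helper (an assumption, not the framework's API).
inline bool IsFPGAPlace(const Place &place) {
  return std::holds_alternative<FPGAPlace>(place);
}
```

Appending the new alternative at the end keeps the index of the existing alternatives stable.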
+#### Add [device context](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L37)
+After a new kind of Device is added, you should add a corresponding [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L37) for it.
+```cpp
+class DeviceContext {
+ public:
+  virtual ~DeviceContext() {}
+  virtual Place GetPlace() const = 0;
+  virtual void Wait() const {}
+};
+```
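A device-specific context would derive from this interface. The following is a sketch under stated assumptions: `Place` is reduced to a single stand-in struct to keep the example short, and `FPGADeviceContext`, `Enqueue`, and `pending` are hypothetical names that model queued work with a plain counter rather than a real device queue.

```cpp
#include <cassert>

// Simplified stand-in for the framework's Place variant.
struct Place {
  int device;
};

// Interface as shown above.
class DeviceContext {
 public:
  virtual ~DeviceContext() {}
  virtual Place GetPlace() const = 0;
  virtual void Wait() const {}
};

// Hypothetical context for a new device.
class FPGADeviceContext : public DeviceContext {
 public:
  explicit FPGADeviceContext(Place place) : place_(place), pending_(0) {}

  Place GetPlace() const override { return place_; }

  // A real context would block until the device finishes its queued work.
  // This toy version only clears the counter of pending tasks.
  void Wait() const override { pending_ = 0; }

  // Toy stand-in for submitting work to the device.
  void Enqueue() { ++pending_; }
  int pending() const { return pending_; }

 private:
  Place place_;
  mutable int pending_;  // mutable because Wait() is const in the interface
};
```

Because `Wait()` is declared `const` in the base class, any state it synchronizes has to be `mutable` or live outside the object, which is why the counter is marked `mutable` here.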
+#### Implement new [OpKernel](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h#L351) for your Device.
+Detailed documentation can be found in [`new_op_and_kernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/new_op_en.md).
+```cpp
+class OpKernelBase {
+ public:
+  /**
+   * ExecutionContext is the only parameter of Kernel Run function.
+   * Run will get input/output variables, state such as momentum and
+   * device resource such as CUDA stream, cublas handle, etc. from
+   * ExecutionContext. User should construct it before run the Operator.
+   */
+  virtual void Compute(const ExecutionContext& context) const = 0;
+  virtual ~OpKernelBase() = default;
+};
+template <typename T>
+class OpKernel : public OpKernelBase {
+ public:
+  using ELEMENT_TYPE = T;
+};
+```
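To make the interface concrete, here is a minimal sketch of a kernel that overrides `Compute`. The `ExecutionContext` struct is a stand-in that only carries an input and an output buffer, and `ToyScaleKernel` is a hypothetical kernel that doubles every element. Neither reflects the framework's real types.

```cpp
#include <cassert>
#include <vector>

// Stand-in for the framework's ExecutionContext: it only carries
// pointers to one input buffer and one output buffer.
struct ExecutionContext {
  const std::vector<float> *input;
  std::vector<float> *output;
};

// Interfaces as shown above.
class OpKernelBase {
 public:
  virtual void Compute(const ExecutionContext &context) const = 0;
  virtual ~OpKernelBase() = default;
};

template <typename T>
class OpKernel : public OpKernelBase {
 public:
  using ELEMENT_TYPE = T;
};

// Hypothetical kernel: computes in type T and writes 2 * x for each input x.
template <typename T>
class ToyScaleKernel : public OpKernel<T> {
 public:
  void Compute(const ExecutionContext &context) const override {
    assert(context.input != nullptr && context.output != nullptr);
    context.output->clear();
    for (float x : *context.input) {
      T doubled = static_cast<T>(x) * static_cast<T>(2);
      context.output->push_back(static_cast<float>(doubled));
    }
  }
};
```

The kernel keeps no state of its own. Everything it needs arrives through the context, which matches the comment on `Compute` above.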
+#### Register the OpKernel to framework
+After writing the components described above, we should register the kernel to the framework.
+We use `REGISTER_OP_KERNEL` to do the registration.
+```cpp
+REGISTER_OP_KERNEL(
+	op_type,
+	library_type,
+	place_type,
+	kernel0, kernel1, ...)
+```
+`kernel0` and `kernel1` are kernels that have the same `op_type`, `library_type`, and `place_type` but different `data_types`.
+Take [`conv2d`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/conv_cudnn_op.cu.cc#L318) as an example:
+```cpp
+REGISTER_OP_KERNEL(conv2d, CPU, paddle::platform::CPUPlace,
+    paddle::operators::GemmConvKernel<paddle::platform::CPUDeviceContext, float>,
+    paddle::operators::GemmConvKernel<paddle::platform::CPUDeviceContext, double>);
+REGISTER_OP_KERNEL(conv2d, CUDNN, ::paddle::platform::CUDAPlace,
+    paddle::operators::CUDNNConvOpKernel<float>,
+    paddle::operators::CUDNNConvOpKernel<double>);
+```
+In the code above:
+ - `conv2d` is the type/name of the operator
+ - `CUDNN/CPU` is `library`
+ - `paddle::platform::CUDAPlace/CPUPlace` is `place`
+ - template parameter `float/double` on `CUDNNConvOpKernel<T>` is `data_type`.
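The four parts listed above can be thought of as a lookup key. The toy registry below is only a sketch of that idea and is not how `REGISTER_OP_KERNEL` is implemented. `ToyKernelRegistry`, `KernelKey`, `MakeKey`, and `MakeConv2dRegistry` are hypothetical names, and each kernel is represented by a function that returns its own name.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical key: (op type, library, place, data type).
using KernelKey =
    std::tuple<std::string, std::string, std::string, std::string>;
// Toy kernel: a callable that reports which kernel ran.
using KernelFn = std::function<std::string()>;

inline KernelKey MakeKey(const std::string &op, const std::string &library,
                         const std::string &place,
                         const std::string &data_type) {
  return KernelKey(op, library, place, data_type);
}

class ToyKernelRegistry {
 public:
  void Register(const KernelKey &key, KernelFn fn) {
    kernels_[key] = std::move(fn);
  }
  bool Has(const KernelKey &key) const { return kernels_.count(key) > 0; }
  std::string Run(const KernelKey &key) const {
    auto it = kernels_.find(key);
    assert(it != kernels_.end() && "no kernel registered for this key");
    return it->second();
  }

 private:
  std::map<KernelKey, KernelFn> kernels_;
};

// Mirrors the two conv2d registrations above: one entry per data type.
inline ToyKernelRegistry MakeConv2dRegistry() {
  ToyKernelRegistry registry;
  registry.Register(MakeKey("conv2d", "CPU", "CPUPlace", "float"),
                    [] { return std::string("GemmConvKernel<float>"); });
  registry.Register(MakeKey("conv2d", "CPU", "CPUPlace", "double"),
                    [] { return std::string("GemmConvKernel<double>"); });
  registry.Register(MakeKey("conv2d", "CUDNN", "CUDAPlace", "float"),
                    [] { return std::string("CUDNNConvOpKernel<float>"); });
  registry.Register(MakeKey("conv2d", "CUDNN", "CUDAPlace", "double"),
                    [] { return std::string("CUDNNConvOpKernel<double>"); });
  return registry;
}
```

In this sketch, a combination that was never registered, such as `CUDNN` with `CPUPlace`, is simply absent from the map.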
--- a/develop/doc/howto/dev/new_op_kernel_en.html
+++ b/develop/doc/howto/dev/new_op_kernel_en.html
+<!DOCTYPE html>
+<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
+<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>Add Kernels for a New Device &mdash; PaddlePaddle  documentation</title>
+    <link rel="stylesheet" href="../../_static/css/theme.css" type="text/css" />
+        <link rel="index" title="Index"
+              href="../../genindex.html"/>
+        <link rel="search" title="Search" href="../../search.html"/>
+    <link rel="top" title="PaddlePaddle  documentation" href="../../index.html"/> 
+  <link rel="stylesheet" href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" type="text/css" />
+  <link rel="stylesheet" href="../../_static/css/override.css" type="text/css" />
+  <script>
+  var _hmt = _hmt || [];
+  (function() {
+    var hm = document.createElement("script");
+    hm.src = "//hm.baidu.com/hm.js?b9a314ab40d04d805655aab1deee08ba";
+    var s = document.getElementsByTagName("script")[0]; 
+    s.parentNode.insertBefore(hm, s);
+  })();
+  </script>
+  <script src="../../_static/js/modernizr.min.js"></script>
+</head>
+<body class="wy-body-for-nav" role="document">
+  <header class="site-header">
+    <div class="site-logo">
+      <a href="/"><img src="../../_static/images/PP_w.png"></a>
+    </div>
+    <div class="site-nav-links">
+      <div class="site-menu">
+        <a class="fork-on-github" href="https://github.com/PaddlePaddle/Paddle" target="_blank"><i class="fa fa-github"></i>Fork me on Github</a>
+        <div class="language-switcher dropdown">
+          <a type="button" data-toggle="dropdown">
+            <span>English</span>
+            <i class="fa fa-angle-up"></i>
+            <i class="fa fa-angle-down"></i>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/doc_cn">中文</a></li>
+            <li><a href="/doc">English</a></li>
+          </ul>
+        </div>
+        <ul class="site-page-links">
+          <li><a href="/">Home</a></li>
+        </ul>
+      </div>
+      <div class="doc-module">
+        <ul>
+<li class="toctree-l1"><a class="reference internal" href="../../getstarted/index_en.html">GET STARTED</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../index_en.html">HOW TO</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../../api/index_en.html">API</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../../mobile/index_en.html">MOBILE</a></li>
+</ul>
+<div role="search">
+  <form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
+    <input type="text" name="q" placeholder="Search docs" />
+    <input type="hidden" name="check_keywords" value="yes" />
+    <input type="hidden" name="area" value="default" />
+  </form>
+</div>        
+      </div>
+    </div>
+  </header>
+  <div class="main-content-wrap">
+    <nav class="doc-menu-vertical" role="navigation">
+          <ul>
+<li class="toctree-l1"><a class="reference internal" href="../../getstarted/index_en.html">GET STARTED</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../../getstarted/build_and_install/index_en.html">Install and Build</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../../getstarted/build_and_install/pip_install_en.html">Install Using pip</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../getstarted/build_and_install/docker_install_en.html">Run in Docker Containers</a></li>
+<li class="toctree-l3"><a class="reference internal" href="build_en.html">Build using Docker</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../getstarted/build_and_install/build_from_source_en.html">Build from Sources</a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../index_en.html">HOW TO</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../usage/cmd_parameter/index_en.html">Set Command-line Parameters</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../usage/cmd_parameter/use_case_en.html">Use Case</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../usage/cmd_parameter/arguments_en.html">Argument Outline</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../usage/cmd_parameter/detail_introduction_en.html">Detail Description</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../usage/cluster/cluster_train_en.html">Distributed Training</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../usage/cluster/fabric_en.html">fabric</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../usage/cluster/openmpi_en.html">openmpi</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../usage/cluster/k8s_en.html">kubernetes</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../usage/cluster/k8s_aws_en.html">kubernetes on AWS</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="new_layer_en.html">Write New Layers</a></li>
+<li class="toctree-l2"><a class="reference internal" href="contribute_to_paddle_en.html">Contribute Code</a></li>
+<li class="toctree-l2"><a class="reference internal" href="write_docs_en.html">Contribute Documentation</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../deep_model/rnn/index_en.html">RNN Models</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../deep_model/rnn/rnn_config_en.html">RNN Configuration</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../optimization/gpu_profiling_en.html">Tune GPU Performance</a></li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../../api/index_en.html">API</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../../api/v2/model_configs.html">Model Configuration</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/config/activation.html">Activation</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/config/layer.html">Layers</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/config/evaluators.html">Evaluators</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/config/optimizer.html">Optimizer</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/config/pooling.html">Pooling</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/config/networks.html">Networks</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/config/attr.html">Parameter Attribute</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../../api/v2/data.html">Data Reader Interface and DataSets</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/data/data_reader.html">Data Reader Interface</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/data/image.html">Image Interface</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/data/dataset.html">Dataset</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../../api/v2/run_logic.html">Training and Inference</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../../api/v2/fluid.html">Fluid</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/layers.html">Layers</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/data_feeder.html">DataFeeder</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/executor.html">Executor</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/initializer.html">Initializer</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/evaluator.html">Evaluator</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/nets.html">Nets</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/optimizer.html">Optimizer</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/param_attr.html">ParamAttr</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/profiler.html">Profiler</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/regularizer.html">Regularizer</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../../api/v2/fluid/io.html">IO</a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../../mobile/index_en.html">MOBILE</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../../mobile/cross_compiling_for_android_en.html">Build PaddlePaddle for Android</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../../mobile/cross_compiling_for_ios_en.html">Build PaddlePaddle for iOS</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../../mobile/cross_compiling_for_raspberry_en.html">Build PaddlePaddle for Raspberry Pi</a></li>
+</ul>
+</li>
+</ul>
+    </nav>
+    <section class="doc-content-wrap">
+<div role="navigation" aria-label="breadcrumbs navigation">
+  <ul class="wy-breadcrumbs">
+    <li>Add Kernels for a New Device</li>
+  </ul>
+</div>
+      <div class="wy-nav-content" id="doc-content">
+        <div class="rst-content">
+          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
+           <div itemprop="articleBody">
+  <div class="section" id="add-kernels-for-a-new-device">
+<span id="add-kernels-for-a-new-device"></span><h1>Add Kernels for a New Device<a class="headerlink" href="#add-kernels-for-a-new-device" title="Permalink to this headline">¶</a></h1>
+<div class="section" id="background">
+<span id="background"></span><h2>Background<a class="headerlink" href="#background" title="Permalink to this headline">¶</a></h2>
+<p>PaddlePaddle Fluid has hundreds of operators.  Each operator can have one or more kernels.  A kernel is an implementation of the operator for a certain device, which could be a hardware device, e.g., the CUDA GPU, or a library that utilizes a device, e.g., Intel MKL that makes full use of the Xeon CPU.</p>
+<p><a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/new_op_en.md">This document</a> explains how to add an operator, and its kernels.  The kernels of an operator are indexed by a C++ type <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/operator_kernel_type.md"><code class="docutils literal"><span class="pre">OpKernelType</span></code></a>.  An operator chooses the right kernel at runtime.  This choosing mechanism is described <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/switch_kernel.md">here</a>.</p>
+</div>
+<div class="section" id="write-kernels-for-a-new-device">
+<span id="write-kernels-for-a-new-device"></span><h2>Write Kernels for A New Device<a class="headerlink" href="#write-kernels-for-a-new-device" title="Permalink to this headline">¶</a></h2>
+<div class="section" id="add-a-new-device">
+<span id="add-a-new-device"></span><h3>Add A New Device<a class="headerlink" href="#add-a-new-device" title="Permalink to this headline">¶</a></h3>
+<p>For historical reasons, we misuse the word <em>library</em> for <em>device</em>.  For example, we call the device type the <em>library type</em>.  An example is the header file <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/library_type.h#L24"><code class="docutils literal"><span class="pre">library_type.h</span></code></a>.  We will correct this ASAP.</p>
+<p>To register a new device, we need to add an enum value to <code class="docutils literal"><span class="pre">LibraryType</span></code>:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">enum</span> <span class="k">class</span> <span class="nc">LibraryType</span> <span class="p">{</span>
+  <span class="n">kPlain</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
+  <span class="n">kMKLDNN</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
+  <span class="n">kCUDNN</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="add-a-new-place">
+<span id="add-a-new-place"></span><h3>Add A New <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L53">Place</a><a class="headerlink" href="#add-a-new-place" title="Permalink to this headline">¶</a></h3>
+<p>If you have a new kind of device, you first need to add a new kind of <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L53"><code class="docutils literal"><span class="pre">Place</span></code></a>. For example, <code class="docutils literal"><span class="pre">CUDAPlace</span></code>:</p>
+<div class="highlight-cpp"><div class="highlight"><pre><span></span><span class="k">struct</span> <span class="n">CUDAPlace</span> <span class="p">{</span>
+  <span class="n">CUDAPlace</span><span class="p">()</span> <span class="o">:</span> <span class="n">CUDAPlace</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="p">{}</span>
+  <span class="k">explicit</span> <span class="n">CUDAPlace</span><span class="p">(</span><span class="kt">int</span> <span class="n">d</span><span class="p">)</span> <span class="o">:</span> <span class="n">device</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="p">{}</span>
+  <span class="kr">inline</span> <span class="kt">int</span> <span class="n">GetDeviceId</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">device</span><span class="p">;</span> <span class="p">}</span>
+  <span class="c1">// needed for variant equality comparison</span>
+  <span class="kr">inline</span> <span class="kt">bool</span> <span class="k">operator</span><span class="o">==</span><span class="p">(</span><span class="k">const</span> <span class="n">CUDAPlace</span> <span class="o">&amp;</span><span class="n">o</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span>
+    <span class="k">return</span> <span class="n">device</span> <span class="o">==</span> <span class="n">o</span><span class="p">.</span><span class="n">device</span><span class="p">;</span>
+  <span class="p">}</span>
+  <span class="kr">inline</span> <span class="kt">bool</span> <span class="k">operator</span><span class="o">!=</span><span class="p">(</span><span class="k">const</span> <span class="n">CUDAPlace</span> <span class="o">&amp;</span><span class="n">o</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="o">!</span><span class="p">(</span><span class="o">*</span><span class="k">this</span> <span class="o">==</span> <span class="n">o</span><span class="p">);</span> <span class="p">}</span>
+  <span class="kt">int</span> <span class="n">device</span><span class="p">;</span>
+<span class="p">};</span>
+<span class="k">typedef</span> <span class="n">boost</span><span class="o">::</span><span class="n">variant</span><span class="o">&lt;</span><span class="n">CUDAPlace</span><span class="p">,</span> <span class="n">CPUPlace</span><span class="o">&gt;</span> <span class="n">Place</span><span class="p">;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="add-device-context">
+<span id="add-device-context"></span><h3>Add <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L37">device context</a><a class="headerlink" href="#add-device-context" title="Permalink to this headline">¶</a></h3>
+<p>After a new kind of Device is added, you should add a corresponding <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L37">DeviceContext</a> for it.</p>
+<div class="highlight-cpp"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">DeviceContext</span> <span class="p">{</span>
+ <span class="k">public</span><span class="o">:</span>
+  <span class="k">virtual</span> <span class="o">~</span><span class="n">DeviceContext</span><span class="p">()</span> <span class="p">{}</span>
+  <span class="k">virtual</span> <span class="n">Place</span> <span class="n">GetPlace</span><span class="p">()</span> <span class="k">const</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
+  <span class="k">virtual</span> <span class="kt">void</span> <span class="nf">Wait</span><span class="p">()</span> <span class="k">const</span> <span class="p">{}</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="implement-new-opkernel-for-your-device">
+<span id="implement-new-opkernel-for-your-device"></span><h3>Implement new <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h#L351">OpKernel</a> for your Device.<a class="headerlink" href="#implement-new-opkernel-for-your-device" title="Permalink to this headline">¶</a></h3>
+<p>Detailed documentation can be found in <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/new_op_en.md"><code class="docutils literal"><span class="pre">new_op_and_kernel</span></code></a>.</p>
+<div class="highlight-cpp"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">OpKernelBase</span> <span class="p">{</span>
+ <span class="k">public</span><span class="o">:</span>
+  <span class="cm">/**</span>
+<span class="cm">   * ExecutionContext is the only parameter of Kernel Run function.</span>
+<span class="cm">   * Run will get input/output variables, state such as momentum and</span>
+<span class="cm">   * device resource such as CUDA stream, cublas handle, etc. from</span>
+<span class="cm">   * ExecutionContext. User should construct it before run the Operator.</span>
+<span class="cm">   */</span>
+  <span class="k">virtual</span> <span class="kt">void</span> <span class="n">Compute</span><span class="p">(</span><span class="k">const</span> <span class="n">ExecutionContext</span><span class="o">&amp;</span> <span class="n">context</span><span class="p">)</span> <span class="k">const</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
+  <span class="k">virtual</span> <span class="o">~</span><span class="n">OpKernelBase</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>
+<span class="p">};</span>
+<span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+<span class="k">class</span> <span class="nc">OpKernel</span> <span class="o">:</span> <span class="k">public</span> <span class="n">OpKernelBase</span> <span class="p">{</span>
+ <span class="k">public</span><span class="o">:</span>
+  <span class="k">using</span> <span class="n">ELEMENT_TYPE</span> <span class="o">=</span> <span class="n">T</span><span class="p">;</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="register-the-opkernel-to-framework">
+<span id="register-the-opkernel-to-framework"></span><h3>Register the OpKernel to framework<a class="headerlink" href="#register-the-opkernel-to-framework" title="Permalink to this headline">¶</a></h3>
+<p>After writing the components described above, we should register the kernel to the framework.</p>
+<p>We use <code class="docutils literal"><span class="pre">REGISTER_OP_KERNEL</span></code> to do the registration.</p>
+<div class="highlight-cpp"><div class="highlight"><pre><span></span><span class="n">REGISTER_OP_KERNEL</span><span class="p">(</span>
+    <span class="n">op_type</span><span class="p">,</span>
+    <span class="n">library_type</span><span class="p">,</span>
+    <span class="n">place_type</span><span class="p">,</span>
+    <span class="n">kernel0</span><span class="p">,</span> <span class="n">kernel1</span><span class="p">,</span> <span class="p">...)</span>
+</pre></div>
+</div>
+<p><code class="docutils literal"><span class="pre">kernel0</span></code> and <code class="docutils literal"><span class="pre">kernel1</span></code> are kernels that have the same <code class="docutils literal"><span class="pre">op_type</span></code>, <code class="docutils literal"><span class="pre">library_type</span></code>, and <code class="docutils literal"><span class="pre">place_type</span></code> but different <code class="docutils literal"><span class="pre">data_types</span></code>.</p>
+<p>Take <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/conv_cudnn_op.cu.cc#L318"><code class="docutils literal"><span class="pre">conv2d</span></code></a> as an example:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span>REGISTER_OP_KERNEL(conv2d, CPU, paddle::platform::CPUPlace,
+        paddle::operators::GemmConvKernel&lt;paddle::platform::CPUDeviceContext, float&gt;,
+        paddle::operators::GemmConvKernel&lt;paddle::platform::CPUDeviceContext, double&gt;);
+REGISTER_OP_KERNEL(conv2d, CUDNN, ::paddle::platform::CUDAPlace,
+       paddle::operators::CUDNNConvOpKernel&lt;float&gt;,
+       paddle::operators::CUDNNConvOpKernel&lt;double&gt;);
+</pre></div>
+</div>
+<p>In the code above:</p>
+<ul class="simple">
+<li><code class="docutils literal"><span class="pre">conv2d</span></code> is the type/name of the operator</li>
+<li><code class="docutils literal"><span class="pre">CUDNN/CPU</span></code> is <code class="docutils literal"><span class="pre">library</span></code></li>
+<li><code class="docutils literal"><span class="pre">paddle::platform::CUDAPlace/CPUPlace</span></code> is <code class="docutils literal"><span class="pre">place</span></code></li>
+<li>template parameter <code class="docutils literal"><span class="pre">float/double</span></code> on <code class="docutils literal"><span class="pre">CUDNNConvOpKernel&lt;T&gt;</span></code> is <code class="docutils literal"><span class="pre">data_type</span></code>.</li>
+</ul>
+</div>
+</div>
+</div>
+           </div>
+          </div>
+          <footer>
+  <hr/>
+  <div role="contentinfo">
+    <p>
+        &copy; Copyright 2016, PaddlePaddle developers.
+    </p>
+  </div>
+  Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. 
+</footer>
+        </div>
+      </div>
+    </section>
+  </div>
+    <script type="text/javascript">
+        var DOCUMENTATION_OPTIONS = {
+            URL_ROOT:'../../',
+            VERSION:'',
+            COLLAPSE_INDEX:false,
+            FILE_SUFFIX:'.html',
+            HAS_SOURCE:  true,
+            SOURCELINK_SUFFIX: ".txt",
+        };
+    </script>
+      <script type="text/javascript" src="../../_static/jquery.js"></script>
+      <script type="text/javascript" src="../../_static/underscore.js"></script>
+      <script type="text/javascript" src="../../_static/doctools.js"></script>
+      <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
+    <script type="text/javascript" src="../../_static/js/theme.js"></script>
+  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
+  <script src="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/js/perfect-scrollbar.jquery.min.js"></script>
+  <script src="../../_static/js/paddle_doc_init.js"></script> 
+</body>
+</html>
\ No newline at end of file
--- a/develop/doc/objects.inv
+++ b/develop/doc/objects.inv
--- a/develop/doc/operators.json
+++ b/develop/doc/operators.json
@@ -349,70 +349,6 @@
   "comment" : "(string, default: tanh)The activation for candidate hidden state, `tanh` by default.",
   "generated" : 0
 } ] 
-},{
- "type" : "gru",
- "comment" : "\nGRU Operator implements part calculations of the complete GRU as following:\n\n\\f[\nupdate \\ gate: u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\\\\nreset \\ gate: r_t = actGate(xr_t + W_r * h_{t-1} + b_r)  \\\\\noutput \\ candidate: {h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\\\\noutput: h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, {h}_t)\n\\f]\n\n@note To implement the complete GRU, fully-connected operator must be used  \nbefore to feed xu, xr and xc as the Input of GRU operator.\n",
- "inputs" : [ 
- { 
-   "name" : "Input",
-   "comment" : "(LoDTensor) The first input is a LodTensor, which supports variable-time length input sequence. The underlying tensor in this LoDTenosr is a matrix with shape (T X 3D), where, T is the total time steps in this mini-batch, D is the hidden size.",
-   "duplicable" : 0,
-   "intermediate" : 0
- }, { 
-   "name" : "H0",
-   "comment" : "(Tensor, optional) The initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size, D is the hidden size.",
-   "duplicable" : 0,
-   "intermediate" : 0
- }, { 
-   "name" : "Weight",
-   "comment" : "(Tensor) The learnable hidden-hidden weight matrix with shape (D x 3D), where D is the hidden size. The elements continuous in memory can be divided into two parts. The first part are weights of the update gate and reset gate with shape (D x 2D), and the second part are weights of output candidate with shape (D x D).",
-   "duplicable" : 0,
-   "intermediate" : 0
- }, { 
-   "name" : "Bias",
-   "comment" : "(Tensor, optional) Bias vector with shape (1 x 3D) concating bias of the update gate, reset gate and output candidate.",
-   "duplicable" : 0,
-   "intermediate" : 0
- } ], 
- "outputs" : [ 
- { 
-   "name" : "BatchGate",
-   "comment" : "(LoDTensor) To compute with batches, sequence data will be reorganized into several successive batches each containing data from the same time step. The LoDTensor BatchGate contains the update gate, reset gate and output candidate values organized in batches. The LoD size is 2. The first LoD contains the batch offsets and the second LoD contains the indexes in the raw sequence data.",
-   "duplicable" : 0,
-   "intermediate" : 1
- }, { 
-   "name" : "BatchResetHiddenPrev",
-   "comment" : "(LoDTensor) The reset hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T x D) and has the same LoD as `BatchGate`.",
-   "duplicable" : 0,
-   "intermediate" : 1
- }, { 
-   "name" : "BatchHidden",
-   "comment" : "(LoDTensor) The hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T x D) and has the same LoD as `BatchGate`.",
-   "duplicable" : 0,
-   "intermediate" : 1
- }, { 
-   "name" : "Hidden",
-   "comment" : "(LoDTensor) The hidden state LoDTensor organized in sequences. This LoDTensor is a matrix with shape (T x D) and has the same LoD as `BatchGate`.",
-   "duplicable" : 0,
-   "intermediate" : 0
- } ], 
- "attrs" : [ 
- { 
-   "name" : "activation",
-   "type" : "string",
-   "comment" : "(string, default tanh) The activation type used for output candidate {h}_t.",
-   "generated" : 0
- }, { 
-   "name" : "gate_activation",
-   "type" : "string",
-   "comment" : "(string, default sigmoid) The activation type used in update gate and reset gate.",
-   "generated" : 0
- }, { 
-   "name" : "is_reverse",
-   "type" : "bool",
-   "comment" : "(bool, default: False) Whether to compute the GRU in reverse.",
-   "generated" : 0
- } ] 
 },{
 "type" : "warpctc",
 "comment" : "\nAn operator integrating the open-source\n[warp-ctc](https://github.com/baidu-research/warp-ctc) library, which is used in\n[Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](\nhttps://arxiv.org/pdf/1512.02595v1.pdf),\nto compute Connectionist Temporal Classification (CTC) loss.\nIt can be aliased as softmax with CTC, since a native softmax activation is\nintegrated into the warp-ctc library to normalize values for each row of the\ninput tensor.\n\nMore detail of CTC loss can be found by referring to\n[Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with\nRecurrent Neural Networks](\nhttp://machinelearning.wustl.edu/mlpapers/paper_files/icml2006_GravesFGS06.pdf).\n",
@@ -1938,7 +1874,7 @@
 "attrs" : [  ] 
 },{
 "type" : "sequence_expand",
- "comment" : "\nSequence Expand Operator.\n\nThis operator expands input(X) according to LOD of input(Y).\nFollowing are cases to better explain how this works:\nCase 1:\n\nGiven 2-level a LoDTensor input(X)\n    X.lod = [[0,       2, 3],\n             [0, 1,    3, 4]]\n    X.data = [a, b, c, d]\n    X.dims = [4, 1]\nand input(Y)\n    Y.lod = [[0,    2,    4],\n             [0, 3, 6, 7, 8]]\nwith condition len(Y.lod[-1]) -1 == X.dims[0]\nthen we get 2-level LoDTensor\n    Out.lod = [[0,                2,    4],\n               [0,       3,       6, 7, 8]]\n    Out.data = [a, a, a, b, b, b, c, d]\n    Out.dims = [8, 1]\n\nCase 2:\n\nGiven a 0-level LoDTensor input(X)\n    X.data = [a, b, c]\n    X.lod = NULL\n    X.dims = [3, 1]\nand input(Y)\n    Y.lod = [[0, 2, 3, 6]]\nwith condition len(Y.lod[-1]) -1 == X.dims[0]\nthen we get 1-level LoDTensor\n    Out.lod = [[0,    2, 3,      6]]\n    Out.data = [a, a, b, c, c, c]\n    Out.dims = [6, 1]\n\nCase 3:\n\nGiven a 0-level LoDTensor input(X)\n    X.data = [[a, b], [c, d], [e, f]]\n    X.lod = NULL\n    X.dims = [3, 2]\nand input(Y)\n    Y.lod = [[0, 2, 3, 6]]\nwith condition len(Y.lod[-1]) -1 == X.dims[0]\nthen we get 1-level LoDTensor\n    Out.lod = [[0,           2,     3,                     6]]\n    Out.data = [[a,b], [a,b] [c,d], [e, f], [e, f], [e, f]]\n    Out.dims = [6, 2]\n\nCase 4:\n\nGiven 2-level a LoDTensor input(X)\n    X.lod = [[0,       2, 3],\n             [0, 1,    3, 4]]\n    X.data = [a, b, c, d]\n    X.dims = [4, 1]\nand input(Y)\n    Y.lod = [[0,    2,    4],\n             [0, 3, 6, 6, 8]]\nwith condition len(Y.lod[-1]) -1 == X.dims[0]\nthen we get 2-level LoDTensor\n    Out.lod = [[0,                2,    4],\n               [0,       3,       6, 6, 8]]\n    Out.data = [a, a, a, b, b, b, d, d]\n    Out.dims = [8, 1]\n\n\n",
+ "comment" : "\nSequence Expand Operator.\n\nThis operator expands input(X) according to LoD of input(Y).\nThe following cases illustrate how this works:\nCase 1:\n\nGiven a 2-level LoDTensor input(X)\n    X.lod = [[0,       2, 3],\n             [0, 1,    3, 4]]\n    X.data = [a, b, c, d]\n    X.dims = [4, 1]\nand input(Y)\n    Y.lod = [[0,    2,    4],\n             [0, 3, 6, 7, 8]]\nwith condition len(Y.lod[-1]) - 1 == X.dims[0]\nthen we get a 2-level LoDTensor\n    Out.lod = [[0,                2,    4],\n               [0,       3,       6, 7, 8]]\n    Out.data = [a, a, a, b, b, b, c, d]\n    Out.dims = [8, 1]\n\nCase 2:\n\nGiven a common Tensor input(X)\n    X.data = [a, b, c]\n    X.dims = [3, 1]\nand input(Y)\n    Y.lod = [[0, 2, 3, 6]]\nwith condition len(Y.lod[-1]) - 1 == X.dims[0]\nthen we get a 1-level LoDTensor\n    Out.lod = [[0,    2, 3,      6]]\n    Out.data = [a, a, b, c, c, c]\n    Out.dims = [6, 1]\n\nCase 3:\n\nGiven a common Tensor input(X)\n    X.data = [[a, b], [c, d], [e, f]]\n    X.dims = [3, 2]\nand input(Y)\n    Y.lod = [[0, 2, 3, 6]]\nwith condition len(Y.lod[-1]) - 1 == X.dims[0]\nthen we get a 1-level LoDTensor\n    Out.lod = [[0,           2,     3,                     6]]\n    Out.data = [[a,b], [a,b], [c,d], [e, f], [e, f], [e, f]]\n    Out.dims = [6, 2]\n\nCase 4:\n\nGiven a 2-level LoDTensor input(X)\n    X.lod = [[0,       2, 3],\n             [0, 1,    3, 4]]\n    X.data = [a, b, c, d]\n    X.dims = [4, 1]\nand input(Y)\n    Y.lod = [[0,    2,    4],\n             [0, 3, 6, 6, 8]]\nwith condition len(Y.lod[-1]) - 1 == X.dims[0]\nthen we get a 2-level LoDTensor\n    Out.lod = [[0,                2,    4],\n               [0,       3,       6, 6, 8]]\n    Out.data = [a, a, a, b, b, b, d, d]\n    Out.dims = [8, 1]\n\n\n",
 "inputs" : [ 
 { 
   "name" : "X",
@@ -4366,6 +4302,99 @@
   "comment" : "(float, default 1.0e-6) Constant for numerical stability",
   "generated" : 0
 } ] 
+},{
+ "type" : "gru",
+ "comment" : "\nGRU Operator implements part of the calculations of the complete GRU as follows:\n\n\\f[\nupdate \\ gate: u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\\\\nreset \\ gate: r_t = actGate(xr_t + W_r * h_{t-1} + b_r)  \\\\\noutput \\ candidate: {h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\\\\noutput: h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, {h}_t)\n\\f]\n\n@note To implement the complete GRU, a fully-connected operator must be applied  \nbeforehand to feed xu, xr and xc as the Input of the GRU operator.\n",
+ "inputs" : [ 
+ { 
+   "name" : "Input",
+   "comment" : "(LoDTensor) The first input is a LoDTensor, which supports variable-length input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T x 3D), where T is the total number of time steps in this mini-batch and D is the hidden size.",
+   "duplicable" : 0,
+   "intermediate" : 0
+ }, { 
+   "name" : "H0",
+   "comment" : "(Tensor, optional) The initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size, D is the hidden size.",
+   "duplicable" : 0,
+   "intermediate" : 0
+ }, { 
+   "name" : "Weight",
+   "comment" : "(Tensor) The learnable hidden-hidden weight matrix with shape (D x 3D), where D is the hidden size. The elements, which are contiguous in memory, can be divided into two parts. The first part holds the weights of the update gate and reset gate with shape (D x 2D), and the second part holds the weights of the output candidate with shape (D x D).",
+   "duplicable" : 0,
+   "intermediate" : 0
+ }, { 
+   "name" : "Bias",
+   "comment" : "(Tensor, optional) Bias vector with shape (1 x 3D), concatenating the biases of the update gate, reset gate and output candidate.",
+   "duplicable" : 0,
+   "intermediate" : 0
+ } ], 
+ "outputs" : [ 
+ { 
+   "name" : "BatchGate",
+   "comment" : "(LoDTensor) To compute with batches, sequence data will be reorganized into several successive batches each containing data from the same time step. The LoDTensor BatchGate contains the update gate, reset gate and output candidate values organized in batches. The LoD size is 2. The first LoD contains the batch offsets and the second LoD contains the indexes in the raw sequence data.",
+   "duplicable" : 0,
+   "intermediate" : 1
+ }, { 
+   "name" : "BatchResetHiddenPrev",
+   "comment" : "(LoDTensor) The reset hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T x D) and has the same LoD as `BatchGate`.",
+   "duplicable" : 0,
+   "intermediate" : 1
+ }, { 
+   "name" : "BatchHidden",
+   "comment" : "(LoDTensor) The hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T x D) and has the same LoD as `BatchGate`.",
+   "duplicable" : 0,
+   "intermediate" : 1
+ }, { 
+   "name" : "Hidden",
+   "comment" : "(LoDTensor) The hidden state LoDTensor organized in sequences. This LoDTensor is a matrix with shape (T x D) and has the same LoD as `BatchGate`.",
+   "duplicable" : 0,
+   "intermediate" : 0
+ } ], 
+ "attrs" : [ 
+ { 
+   "name" : "activation",
+   "type" : "string",
+   "comment" : "(string, default tanh) The activation type used for output candidate {h}_t.",
+   "generated" : 0
+ }, { 
+   "name" : "gate_activation",
+   "type" : "string",
+   "comment" : "(string, default sigmoid) The activation type used in update gate and reset gate.",
+   "generated" : 0
+ }, { 
+   "name" : "is_reverse",
+   "type" : "bool",
+   "comment" : "(bool, default: False) Whether to compute the GRU in reverse.",
+   "generated" : 0
+ } ] 
+},{
+ "type" : "ctc_align",
+ "comment" : "\nCTCAlign op is used to merge repeated elements between two blanks\nand then delete all blanks in sequence.\n\nGiven:\n    Input.data = [0, 1, 2, 2, 0, 4, 0, 4, 5, 0, 6,\n                  6, 0, 0, 7, 7, 7, 0]\n    Input.dims = {18, 1}\n    Input.LoD = [[0, 11, 18]]\n\nAnd:\n    blank = 0\n    merge_repeated = True\n\nThen:\n    Output.data = [1, 2, 4, 4, 5, 6,\n                   6, 7]\n    Output.dims = {8, 1}\n    Output.LoD = [[0, 6, 8]]\n\n",
+ "inputs" : [ 
+ { 
+   "name" : "Input",
+   "comment" : "(LoDTensor, default: LoDTensor<int>) Its shape is [Lp, 1], where Lp is the sum of the lengths of all input sequences.",
+   "duplicable" : 0,
+   "intermediate" : 0
+ } ], 
+ "outputs" : [ 
+ { 
+   "name" : "Output",
+   "comment" : "(Tensor, default: Tensor<int>) The alignment result.",
+   "duplicable" : 0,
+   "intermediate" : 0
+ } ], 
+ "attrs" : [ 
+ { 
+   "name" : "blank",
+   "type" : "int",
+   "comment" : "(int, default: 0) The blank label set in the Connectionist Temporal Classification (CTC) op.",
+   "generated" : 0
+ }, { 
+   "name" : "merge_repeated",
+   "type" : "bool",
+   "comment" : "(bool, default: true) Whether to merge repeated elements between two blanks.",
+   "generated" : 0
+ } ] 
 },{
 "type" : "beam_search",
 "comment" : "This is a beam search operator that helps to generate sequences.",
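
The `ctc_align` example documented in this hunk can be checked with a short pure-Python sketch. This is not the operator's kernel: the function name `ctc_align_reference` and its signature are assumptions made for illustration, and the LoD is passed as a single flat list of sequence offsets.

```python
def ctc_align_reference(data, lod, blank=0, merge_repeated=True):
    """Hypothetical reference for the documented ctc_align behaviour.

    data: flat list of integer labels for all sequences.
    lod:  one LoD level, i.e. offsets such as [0, 11, 18].
    Returns (out_data, out_lod) in the same flat-plus-offsets layout.
    """
    out_data = []
    out_lod = [0]
    for start, end in zip(lod[:-1], lod[1:]):
        prev = None  # reset per sequence so merging never crosses a boundary
        for token in data[start:end]:
            is_repeat = merge_repeated and token == prev
            # Keep a token only if it is not blank and not a merged repeat.
            if token != blank and not is_repeat:
                out_data.append(token)
            prev = token
        out_lod.append(len(out_data))
    return out_data, out_lod


data = [0, 1, 2, 2, 0, 4, 0, 4, 5, 0, 6,
        6, 0, 0, 7, 7, 7, 0]
out_data, out_lod = ctc_align_reference(data, [0, 11, 18])
print(out_data)  # [1, 2, 4, 4, 5, 6, 6, 7]
print(out_lod)   # [0, 6, 8]
```

Resetting `prev` at each sequence boundary is what keeps the trailing `6` of the first sequence and the leading `6` of the second as separate outputs, matching `Output.LoD = [[0, 6, 8]]` in the operator comment.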

--- a/develop/doc/searchindex.js
+++ b/develop/doc/searchindex.js