Deploy to GitHub Pages: fd12776c

5f13bcd7 · Travis CI · 752bfd30 · 5f13bcd7 · 5f13bcd7 · 5f13bcd7
8 changed file
--- a/develop/doc/_sources/design/support_new_device.md.txt
+++ b/develop/doc/_sources/design/support_new_device.md.txt
+# Design Doc: Support new Device/Library
+## Background
+Deep learning has a high demand for computing resources. New high-performance device and computing library are coming constantly. The deep learning framework has to integrate these high-performance device and computing library flexibly.
+On the one hand, hardware and computing library are not usually one-to-one coresponding relations. For example, in Intel CPU, there are Eigen and MKL computing library. And in Nvidia GPU, there are Eigen and cuDNN computing library. We have to implement specific kernels for an operator for each computing library.
+On the other hand, users usually do not want to care about the low-level hardware and computing library when writing a neural network configuration. In Fluid, `Layer` is exposed in `Python`, and `Operator` is exposed in `C++`. Both `Layer` and `Operator` are independent on hardwares.
+So, how to support a new Device/Library in Fluid becomes a challenge.
+## Basic: Integrate A New Device/Library
+For a general overview of fluid, please refer to [overview doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/read_source.md).
+There are mainly there parts we have to consider in integrating a new device/library:
+- Place and DeviceContext: indicates the device id and manages hardware resources
+- Memory and Tensor: malloc/free data on certain device
+- Math Functor and OpKernel: implement computing unit on certain device/library
+### Place and DeviceContext
+#### Place
+Fluid use class [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L55) to represent specific device and computing library. There are inheritance relationships between different kinds of `Place`.
+```
+        |   CPUPlace   --> MKLDNNPlace
+Place --|   CUDAPlace  --> CUDNNPlace
+        |   FPGAPlace
+```
+And `Place` is defined as follows:
+```
+typedef boost::variant<CUDAPlace, CPUPlace, FPGAPlace> Place;
+```
+#### DeviceContext
+Fluid use class [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L30) to manage the resources in certain hardware, such as CUDA stream in `CDUADeviceContext`. There are also inheritance relationships between different kinds of `DeviceContext`.
+```
+                /->  CPUDeviceContext   --> MKLDeviceContext
+DeviceContext ---->  CUDADeviceContext  --> CUDNNDeviceContext
+                \->  FPGADeviceContext
+```
+A example of Nvidia GPU is as follows:
+- DeviceContext
+```
+class DeviceContext {
+  virtual Place GetPlace() const = 0;
+};  
+```
+- CUDADeviceContext
+```
+class CUDADeviceContext : public DeviceContext {
+  Place GetPlace() const override { return place_; }
+private:
+  CUDAPlace place_;
+  cudaStream_t stream_; 
+  cublasHandle_t cublas_handle_;
+  std::unique_ptr<Eigen::GpuDevice> eigen_device_;  // binds with stream_
+};
+```
+- CUDNNDeviceContext
+```
+class CUDNNDeviceContext : public CUDADeviceContext {
+  private:
+    cudnnHandle_t cudnn_handle_;
+};
+```
+### Memory and Tensor
+#### memory module
+Fluid provide following [memory interfaces](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memory.h#L36):
+```
+template <typename Place>
+void* Alloc(Place place, size_t size);
+template <typename Place>
+void Free(Place place, void* ptr);
+template <typename Place>
+size_t Used(Place place);
+```
+To implementing these interfaces, we have to implement MemoryAllocator for specific Device
+#### Tensor
+[Tensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/tensor.h#L36) holds data with some shape in certain Place.
+```cpp
+class Tensor {
+ public:
+  /*! Return a pointer to mutable memory block. */
+  template <typename T>
+  inline T* data();
+  /**
+   * @brief   Return a pointer to mutable memory block.
+   * @note    If not exist, then allocation.
+   */
+  template <typename T>
+  inline T* mutable_data(platform::Place place);
+  /**
+   * @brief     Return a pointer to mutable memory block.
+   *
+   * @param[in] dims    The dimensions of the memory block.
+   * @param[in] place   The place of the memory block.
+   *
+   * @note      If not exist, then allocation.
+   */
+  template <typename T>
+  inline T* mutable_data(DDim dims, platform::Place place);
+  /*! Resize the dimensions of the memory block. */
+  inline Tensor& Resize(const DDim& dims);
+  /*! Return the dimensions of the memory block. */
+  inline const DDim& dims() const;
+ private:
+  /*! holds the memory block if allocated. */
+  std::shared_ptr<Placeholder> holder_;
+  /*! points to dimensions of memory block. */
+  DDim dim_;
+};
+```
+`Placeholder` is used to delay memory allocation; that is, we can first define a tensor, using `Resize` to configure its shape, and then call `mutuable_data` to allocate the actual memory.
+```cpp
+paddle::framework::Tensor t;
+paddle::platform::CPUPlace place;
+// set size first
+t.Resize({2, 3});
+// allocate memory on CPU later
+t.mutable_data(place);
+```
+### Math Functor and OpKernel
+Fluid implements computing unit based on different DeviceContext. Some computing unit is shared between operators. These common part will be put in operators/math directory as basic Functors.
+Let's take [MaxOutFunctor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/maxouting.h#L27) as an example:
+The interface is defined in header file.
+```
+template <typename DeviceContext, typename T>
+class MaxOutFunctor {
+ public:
+  void operator()(const DeviceContext& context, const framework::Tensor& input,
+                  framework::Tensor* output, int groups);
+};
+```
+CPU implement in .cc file
+```
+template <typename T>
+class MaxOutFunctor<platform::CPUDeviceContext, T> {
+  public:
+  void operator()(const platform::CPUDeviceContext& context,
+                  const framework::Tensor& input, framework::Tensor* output,
+                  int groups) {
+                  ...
+                  }
+};
+```
+CUDA implement in .cu file
+```
+template <typename T>
+class MaxOutFunctor<platform::CUDADeviceContext, T> {
+ public:
+  void operator()(const platform::CUDADeviceContext& context,
+                  const framework::Tensor& input, framework::Tensor* output,
+                  int groups) {
+                  ...
+                  }
+};                  
+```
+We get computing handle from concrete DeviceContext, and make compution on tensors.
+The implement of `OpKernel` is similar to math functors, the extra thing we need to do is registering the OpKernel to global map.
+Fluid provides different register interface in op_registry.h
+Let's take [Crop](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/crop_op.cc#L134) operator as an example:
+In .cc file:
+```
+REGISTER_OP_CPU_KERNEL(crop, ops::CropKernel<float>);
+REGISTER_OP_CPU_KERNEL(
+    crop_grad, ops::CropGradKernel<paddle::platform::CPUDeviceContext, float>);
+```
+In .cu file:
+```
+REGISTER_OP_CUDA_KERNEL(crop, ops::CropKernel<float>);
+REGISTER_OP_CUDA_KERNEL(
+    crop_grad, ops::CropGradKernel<paddle::platform::CUDADeviceContext, float>);
+```
+## Advanced topics: How to switch between different Device/Library
+Generally, we will impelement OpKernel for all Device/Library of an Operator. We can easily train a Convolutional Neural Network in GPU. However, some OpKernel is not sutibale in a specific Device. For example, crf operator can be only run at CPU, whereas most other operators can be run at GPU. To achieve high performance in such circumstance, we have to switch between different Device/Library.
+We will discuss how to implement an efficient OpKernel switch policy. 
+- TBD
--- a/develop/doc/design/support_new_device.html
+++ b/develop/doc/design/support_new_device.html
+<!DOCTYPE html>
+<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
+<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>Design Doc: Support new Device/Library &mdash; PaddlePaddle  documentation</title>
+    <link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
+        <link rel="index" title="Index"
+              href="../genindex.html"/>
+        <link rel="search" title="Search" href="../search.html"/>
+    <link rel="top" title="PaddlePaddle  documentation" href="../index.html"/> 
+  <link rel="stylesheet" href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" type="text/css" />
+  <link rel="stylesheet" href="../_static/css/override.css" type="text/css" />
+  <script>
+  var _hmt = _hmt || [];
+  (function() {
+    var hm = document.createElement("script");
+    hm.src = "//hm.baidu.com/hm.js?b9a314ab40d04d805655aab1deee08ba";
+    var s = document.getElementsByTagName("script")[0]; 
+    s.parentNode.insertBefore(hm, s);
+  })();
+  </script>
+  <script src="../_static/js/modernizr.min.js"></script>
+</head>
+<body class="wy-body-for-nav" role="document">
+  <header class="site-header">
+    <div class="site-logo">
+      <a href="/"><img src="../_static/images/PP_w.png"></a>
+    </div>
+    <div class="site-nav-links">
+      <div class="site-menu">
+        <a class="fork-on-github" href="https://github.com/PaddlePaddle/Paddle" target="_blank"><i class="fa fa-github"></i>Fork me on Github</a>
+        <div class="language-switcher dropdown">
+          <a type="button" data-toggle="dropdown">
+            <span>English</span>
+            <i class="fa fa-angle-up"></i>
+            <i class="fa fa-angle-down"></i>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/doc_cn">中文</a></li>
+            <li><a href="/doc">English</a></li>
+          </ul>
+        </div>
+        <ul class="site-page-links">
+          <li><a href="/">Home</a></li>
+        </ul>
+      </div>
+      <div class="doc-module">
+        <ul>
+<li class="toctree-l1"><a class="reference internal" href="../getstarted/index_en.html">GET STARTED</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../howto/index_en.html">HOW TO</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../api/index_en.html">API</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../mobile/index_en.html">MOBILE</a></li>
+</ul>
+<div role="search">
+  <form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
+    <input type="text" name="q" placeholder="Search docs" />
+    <input type="hidden" name="check_keywords" value="yes" />
+    <input type="hidden" name="area" value="default" />
+  </form>
+</div>        
+      </div>
+    </div>
+  </header>
+  <div class="main-content-wrap">
+    <nav class="doc-menu-vertical" role="navigation">
+          <ul>
+<li class="toctree-l1"><a class="reference internal" href="../getstarted/index_en.html">GET STARTED</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../getstarted/build_and_install/index_en.html">Install and Build</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/pip_install_en.html">Install Using pip</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/docker_install_en.html">Run in Docker Containers</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../howto/dev/build_en.html">Build using Docker</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/build_from_source_en.html">Build from Sources</a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../howto/index_en.html">HOW TO</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../howto/usage/cmd_parameter/index_en.html">Set Command-line Parameters</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/use_case_en.html">Use Case</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/arguments_en.html">Argument Outline</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/detail_introduction_en.html">Detail Description</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/usage/cluster/cluster_train_en.html">PaddlePaddle Distributed Training</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/usage/k8s/k8s_en.html">Paddle On Kubernetes</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/usage/k8s/k8s_aws_en.html">Distributed PaddlePaddle Training on AWS with Kubernetes</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/dev/new_layer_en.html">Write New Layers</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/dev/contribute_to_paddle_en.html">Contribute Code</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/dev/write_docs_en.html">Contribute Documentation</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/deep_model/rnn/index_en.html">RNN Models</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/rnn_config_en.html">RNN Configuration</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/optimization/gpu_profiling_en.html">Tune GPU Performance</a></li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../api/index_en.html">API</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../api/v2/model_configs.html">Model Configuration</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/activation.html">Activation</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/layer.html">Layers</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/evaluators.html">Evaluators</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/optimizer.html">Optimizer</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/pooling.html">Pooling</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/networks.html">Networks</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/attr.html">Parameter Attribute</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../api/v2/data.html">Data Reader Interface and DataSets</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/data_reader.html">Data Reader Interface</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/image.html">Image Interface</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/dataset.html">Dataset</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../api/v2/run_logic.html">Training and Inference</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../api/v2/fluid.html">Fluid</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/layers.html">Layers</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/data_feeder.html">DataFeeder</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/executor.html">Executor</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/initializer.html">Initializer</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/evaluator.html">Evaluator</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/nets.html">Nets</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/optimizer.html">Optimizer</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/param_attr.html">ParamAttr</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/profiler.html">Profiler</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/fluid/regularizer.html">Regularizer</a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../mobile/index_en.html">MOBILE</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_android_en.html">Build PaddlePaddle for Android</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_raspberry_en.html">Build PaddlePaddle for Raspberry Pi</a></li>
+</ul>
+</li>
+</ul>
+    </nav>
+    <section class="doc-content-wrap">
+<div role="navigation" aria-label="breadcrumbs navigation">
+  <ul class="wy-breadcrumbs">
+    <li>Design Doc: Support new Device/Library</li>
+  </ul>
+</div>
+      <div class="wy-nav-content" id="doc-content">
+        <div class="rst-content">
+          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
+           <div itemprop="articleBody">
+  <div class="section" id="design-doc-support-new-device-library">
+<span id="design-doc-support-new-device-library"></span><h1>Design Doc: Support new Device/Library<a class="headerlink" href="#design-doc-support-new-device-library" title="Permalink to this headline">¶</a></h1>
+<div class="section" id="background">
+<span id="background"></span><h2>Background<a class="headerlink" href="#background" title="Permalink to this headline">¶</a></h2>
+<p>Deep learning has a high demand for computing resources. New high-performance device and computing library are coming constantly. The deep learning framework has to integrate these high-performance device and computing library flexibly.</p>
+<p>On the one hand, hardware and computing library are not usually one-to-one coresponding relations. For example, in Intel CPU, there are Eigen and MKL computing library. And in Nvidia GPU, there are Eigen and cuDNN computing library. We have to implement specific kernels for an operator for each computing library.</p>
+<p>On the other hand, users usually do not want to care about the low-level hardware and computing library when writing a neural network configuration. In Fluid, <code class="docutils literal"><span class="pre">Layer</span></code> is exposed in <code class="docutils literal"><span class="pre">Python</span></code>, and <code class="docutils literal"><span class="pre">Operator</span></code> is exposed in <code class="docutils literal"><span class="pre">C++</span></code>. Both <code class="docutils literal"><span class="pre">Layer</span></code> and <code class="docutils literal"><span class="pre">Operator</span></code> are independent on hardwares.</p>
+<p>So, how to support a new Device/Library in Fluid becomes a challenge.</p>
+</div>
+<div class="section" id="basic-integrate-a-new-device-library">
+<span id="basic-integrate-a-new-device-library"></span><h2>Basic: Integrate A New Device/Library<a class="headerlink" href="#basic-integrate-a-new-device-library" title="Permalink to this headline">¶</a></h2>
+<p>For a general overview of fluid, please refer to <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/read_source.md">overview doc</a>.</p>
+<p>There are mainly there parts we have to consider in integrating a new device/library:</p>
+<ul class="simple">
+<li>Place and DeviceContext: indicates the device id and manages hardware resources</li>
+<li>Memory and Tensor: malloc/free data on certain device</li>
+<li>Math Functor and OpKernel: implement computing unit on certain device/library</li>
+</ul>
+<div class="section" id="place-and-devicecontext">
+<span id="place-and-devicecontext"></span><h3>Place and DeviceContext<a class="headerlink" href="#place-and-devicecontext" title="Permalink to this headline">¶</a></h3>
+<div class="section" id="place">
+<span id="place"></span><h4>Place<a class="headerlink" href="#place" title="Permalink to this headline">¶</a></h4>
+<p>Fluid use class <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L55">Place</a> to represent specific device and computing library. There are inheritance relationships between different kinds of <code class="docutils literal"><span class="pre">Place</span></code>.</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span>        <span class="o">|</span>   <span class="n">CPUPlace</span>   <span class="o">--&gt;</span> <span class="n">MKLDNNPlace</span>
+<span class="n">Place</span> <span class="o">--|</span>   <span class="n">CUDAPlace</span>  <span class="o">--&gt;</span> <span class="n">CUDNNPlace</span>
+        <span class="o">|</span>   <span class="n">FPGAPlace</span>
+</pre></div>
+</div>
+<p>And <code class="docutils literal"><span class="pre">Place</span></code> is defined as follows:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">typedef</span> <span class="n">boost</span><span class="p">::</span><span class="n">variant</span><span class="o">&lt;</span><span class="n">CUDAPlace</span><span class="p">,</span> <span class="n">CPUPlace</span><span class="p">,</span> <span class="n">FPGAPlace</span><span class="o">&gt;</span> <span class="n">Place</span><span class="p">;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="devicecontext">
+<span id="devicecontext"></span><h4>DeviceContext<a class="headerlink" href="#devicecontext" title="Permalink to this headline">¶</a></h4>
+<p>Fluid use class <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L30">DeviceContext</a> to manage the resources in certain hardware, such as CUDA stream in <code class="docutils literal"><span class="pre">CDUADeviceContext</span></code>. There are also inheritance relationships between different kinds of <code class="docutils literal"><span class="pre">DeviceContext</span></code>.</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span>                <span class="o">/-&gt;</span>  <span class="n">CPUDeviceContext</span>   <span class="o">--&gt;</span> <span class="n">MKLDeviceContext</span>
+<span class="n">DeviceContext</span> <span class="o">----&gt;</span>  <span class="n">CUDADeviceContext</span>  <span class="o">--&gt;</span> <span class="n">CUDNNDeviceContext</span>
+                \<span class="o">-&gt;</span>  <span class="n">FPGADeviceContext</span>
+</pre></div>
+</div>
+<p>A example of Nvidia GPU is as follows:</p>
+<ul class="simple">
+<li>DeviceContext</li>
+</ul>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">DeviceContext</span> <span class="p">{</span>
+  <span class="n">virtual</span> <span class="n">Place</span> <span class="n">GetPlace</span><span class="p">()</span> <span class="n">const</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
+<span class="p">};</span>  
+</pre></div>
+</div>
+<ul class="simple">
+<li>CUDADeviceContext</li>
+</ul>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">CUDADeviceContext</span> <span class="p">:</span> <span class="n">public</span> <span class="n">DeviceContext</span> <span class="p">{</span>
+  <span class="n">Place</span> <span class="n">GetPlace</span><span class="p">()</span> <span class="n">const</span> <span class="n">override</span> <span class="p">{</span> <span class="k">return</span> <span class="n">place_</span><span class="p">;</span> <span class="p">}</span>
+<span class="n">private</span><span class="p">:</span>
+  <span class="n">CUDAPlace</span> <span class="n">place_</span><span class="p">;</span>
+  <span class="n">cudaStream_t</span> <span class="n">stream_</span><span class="p">;</span> 
+  <span class="n">cublasHandle_t</span> <span class="n">cublas_handle_</span><span class="p">;</span>
+  <span class="n">std</span><span class="p">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Eigen</span><span class="p">::</span><span class="n">GpuDevice</span><span class="o">&gt;</span> <span class="n">eigen_device_</span><span class="p">;</span>  <span class="o">//</span> <span class="n">binds</span> <span class="k">with</span> <span class="n">stream_</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+<ul class="simple">
+<li>CUDNNDeviceContext</li>
+</ul>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">CUDNNDeviceContext</span> <span class="p">:</span> <span class="n">public</span> <span class="n">CUDADeviceContext</span> <span class="p">{</span>
+  <span class="n">private</span><span class="p">:</span>
+    <span class="n">cudnnHandle_t</span> <span class="n">cudnn_handle_</span><span class="p">;</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="memory-and-tensor">
+<span id="memory-and-tensor"></span><h3>Memory and Tensor<a class="headerlink" href="#memory-and-tensor" title="Permalink to this headline">¶</a></h3>
+<div class="section" id="memory-module">
+<span id="memory-module"></span><h4>memory module<a class="headerlink" href="#memory-module" title="Permalink to this headline">¶</a></h4>
+<p>Fluid provide following <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memory.h#L36">memory interfaces</a>:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">Place</span><span class="o">&gt;</span>
+<span class="n">void</span><span class="o">*</span> <span class="n">Alloc</span><span class="p">(</span><span class="n">Place</span> <span class="n">place</span><span class="p">,</span> <span class="n">size_t</span> <span class="n">size</span><span class="p">);</span>
+<span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">Place</span><span class="o">&gt;</span>
+<span class="n">void</span> <span class="n">Free</span><span class="p">(</span><span class="n">Place</span> <span class="n">place</span><span class="p">,</span> <span class="n">void</span><span class="o">*</span> <span class="n">ptr</span><span class="p">);</span>
+<span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">Place</span><span class="o">&gt;</span>
+<span class="n">size_t</span> <span class="n">Used</span><span class="p">(</span><span class="n">Place</span> <span class="n">place</span><span class="p">);</span>
+</pre></div>
+</div>
+<p>To implementing these interfaces, we have to implement MemoryAllocator for specific Device</p>
+</div>
+<div class="section" id="tensor">
+<span id="tensor"></span><h4>Tensor<a class="headerlink" href="#tensor" title="Permalink to this headline">¶</a></h4>
+<p><a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/tensor.h#L36">Tensor</a> holds data with some shape in certain Place.</p>
+<div class="highlight-cpp"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">Tensor</span> <span class="p">{</span>
+ <span class="k">public</span><span class="o">:</span>
+  <span class="cm">/*! Return a pointer to mutable memory block. */</span>
+  <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+  <span class="kr">inline</span> <span class="n">T</span><span class="o">*</span> <span class="n">data</span><span class="p">();</span>
+  <span class="cm">/**</span>
+<span class="cm">   * @brief   Return a pointer to mutable memory block.</span>
+<span class="cm">   * @note    If not exist, then allocation.</span>
+<span class="cm">   */</span>
+  <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+  <span class="kr">inline</span> <span class="n">T</span><span class="o">*</span> <span class="n">mutable_data</span><span class="p">(</span><span class="n">platform</span><span class="o">::</span><span class="n">Place</span> <span class="n">place</span><span class="p">);</span>
+  <span class="cm">/**</span>
+<span class="cm">   * @brief     Return a pointer to mutable memory block.</span>
+<span class="cm">   *</span>
+<span class="cm">   * @param[in] dims    The dimensions of the memory block.</span>
+<span class="cm">   * @param[in] place   The place of the memory block.</span>
+<span class="cm">   *</span>
+<span class="cm">   * @note      If not exist, then allocation.</span>
+<span class="cm">   */</span>
+  <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+  <span class="kr">inline</span> <span class="n">T</span><span class="o">*</span> <span class="n">mutable_data</span><span class="p">(</span><span class="n">DDim</span> <span class="n">dims</span><span class="p">,</span> <span class="n">platform</span><span class="o">::</span><span class="n">Place</span> <span class="n">place</span><span class="p">);</span>
+  <span class="cm">/*! Resize the dimensions of the memory block. */</span>
+  <span class="kr">inline</span> <span class="n">Tensor</span><span class="o">&amp;</span> <span class="n">Resize</span><span class="p">(</span><span class="k">const</span> <span class="n">DDim</span><span class="o">&amp;</span> <span class="n">dims</span><span class="p">);</span>
+  <span class="cm">/*! Return the dimensions of the memory block. */</span>
+  <span class="kr">inline</span> <span class="k">const</span> <span class="n">DDim</span><span class="o">&amp;</span> <span class="n">dims</span><span class="p">()</span> <span class="k">const</span><span class="p">;</span>
+ <span class="k">private</span><span class="o">:</span>
+  <span class="cm">/*! holds the memory block if allocated. */</span>
+  <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">Placeholder</span><span class="o">&gt;</span> <span class="n">holder_</span><span class="p">;</span>
+  <span class="cm">/*! points to dimensions of memory block. */</span>
+  <span class="n">DDim</span> <span class="n">dim_</span><span class="p">;</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+<p><code class="docutils literal"><span class="pre">Placeholder</span></code> is used to delay memory allocation; that is, we can first define a tensor, using <code class="docutils literal"><span class="pre">Resize</span></code> to configure its shape, and then call <code class="docutils literal"><span class="pre">mutuable_data</span></code> to allocate the actual memory.</p>
+<div class="highlight-cpp"><div class="highlight"><pre><span></span><span class="n">paddle</span><span class="o">::</span><span class="n">framework</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">t</span><span class="p">;</span>
+<span class="n">paddle</span><span class="o">::</span><span class="n">platform</span><span class="o">::</span><span class="n">CPUPlace</span> <span class="n">place</span><span class="p">;</span>
+<span class="c1">// set size first</span>
+<span class="n">t</span><span class="p">.</span><span class="n">Resize</span><span class="p">({</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">});</span>
+<span class="c1">// allocate memory on CPU later</span>
+<span class="n">t</span><span class="p">.</span><span class="n">mutable_data</span><span class="p">(</span><span class="n">place</span><span class="p">);</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="math-functor-and-opkernel">
+<span id="math-functor-and-opkernel"></span><h3>Math Functor and OpKernel<a class="headerlink" href="#math-functor-and-opkernel" title="Permalink to this headline">¶</a></h3>
+<p>Fluid implements computing unit based on different DeviceContext. Some computing unit is shared between operators. These common part will be put in operators/math directory as basic Functors.</p>
+<p>Let&#8217;s take <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/maxouting.h#L27">MaxOutFunctor</a> as an example:</p>
+<p>The interface is defined in header file.</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">DeviceContext</span><span class="p">,</span> <span class="n">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+<span class="k">class</span> <span class="nc">MaxOutFunctor</span> <span class="p">{</span>
+ <span class="n">public</span><span class="p">:</span>
+  <span class="n">void</span> <span class="n">operator</span><span class="p">()(</span><span class="n">const</span> <span class="n">DeviceContext</span><span class="o">&amp;</span> <span class="n">context</span><span class="p">,</span> <span class="n">const</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="nb">input</span><span class="p">,</span>
+                  <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">*</span> <span class="n">output</span><span class="p">,</span> <span class="nb">int</span> <span class="n">groups</span><span class="p">);</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+<p>CPU implement in .cc file</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+<span class="k">class</span> <span class="nc">MaxOutFunctor</span><span class="o">&lt;</span><span class="n">platform</span><span class="p">::</span><span class="n">CPUDeviceContext</span><span class="p">,</span> <span class="n">T</span><span class="o">&gt;</span> <span class="p">{</span>
+  <span class="n">public</span><span class="p">:</span>
+  <span class="n">void</span> <span class="n">operator</span><span class="p">()(</span><span class="n">const</span> <span class="n">platform</span><span class="p">::</span><span class="n">CPUDeviceContext</span><span class="o">&amp;</span> <span class="n">context</span><span class="p">,</span>
+                  <span class="n">const</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="nb">input</span><span class="p">,</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">*</span> <span class="n">output</span><span class="p">,</span>
+                  <span class="nb">int</span> <span class="n">groups</span><span class="p">)</span> <span class="p">{</span>
+                  <span class="o">...</span>
+                  <span class="p">}</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+<p>CUDA implement in .cu file</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+<span class="k">class</span> <span class="nc">MaxOutFunctor</span><span class="o">&lt;</span><span class="n">platform</span><span class="p">::</span><span class="n">CUDADeviceContext</span><span class="p">,</span> <span class="n">T</span><span class="o">&gt;</span> <span class="p">{</span>
+ <span class="n">public</span><span class="p">:</span>
+  <span class="n">void</span> <span class="n">operator</span><span class="p">()(</span><span class="n">const</span> <span class="n">platform</span><span class="p">::</span><span class="n">CUDADeviceContext</span><span class="o">&amp;</span> <span class="n">context</span><span class="p">,</span>
+                  <span class="n">const</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="nb">input</span><span class="p">,</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">*</span> <span class="n">output</span><span class="p">,</span>
+                  <span class="nb">int</span> <span class="n">groups</span><span class="p">)</span> <span class="p">{</span>
+                  <span class="o">...</span>
+                  <span class="p">}</span>
+<span class="p">};</span>                  
+</pre></div>
+</div>
+<p>We get computing handle from concrete DeviceContext, and make compution on tensors.</p>
+<p>The implement of <code class="docutils literal"><span class="pre">OpKernel</span></code> is similar to math functors, the extra thing we need to do is registering the OpKernel to global map.</p>
+<p>Fluid provides different register interface in op_registry.h</p>
+<p>Let&#8217;s take <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/crop_op.cc#L134">Crop</a> operator as an example:</p>
+<p>In .cc file:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">REGISTER_OP_CPU_KERNEL</span><span class="p">(</span><span class="n">crop</span><span class="p">,</span> <span class="n">ops</span><span class="p">::</span><span class="n">CropKernel</span><span class="o">&lt;</span><span class="nb">float</span><span class="o">&gt;</span><span class="p">);</span>
+<span class="n">REGISTER_OP_CPU_KERNEL</span><span class="p">(</span>
+    <span class="n">crop_grad</span><span class="p">,</span> <span class="n">ops</span><span class="p">::</span><span class="n">CropGradKernel</span><span class="o">&lt;</span><span class="n">paddle</span><span class="p">::</span><span class="n">platform</span><span class="p">::</span><span class="n">CPUDeviceContext</span><span class="p">,</span> <span class="nb">float</span><span class="o">&gt;</span><span class="p">);</span>
+</pre></div>
+</div>
+<p>In .cu file:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">REGISTER_OP_CUDA_KERNEL</span><span class="p">(</span><span class="n">crop</span><span class="p">,</span> <span class="n">ops</span><span class="p">::</span><span class="n">CropKernel</span><span class="o">&lt;</span><span class="nb">float</span><span class="o">&gt;</span><span class="p">);</span>
+<span class="n">REGISTER_OP_CUDA_KERNEL</span><span class="p">(</span>
+    <span class="n">crop_grad</span><span class="p">,</span> <span class="n">ops</span><span class="p">::</span><span class="n">CropGradKernel</span><span class="o">&lt;</span><span class="n">paddle</span><span class="p">::</span><span class="n">platform</span><span class="p">::</span><span class="n">CUDADeviceContext</span><span class="p">,</span> <span class="nb">float</span><span class="o">&gt;</span><span class="p">);</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="advanced-topics-how-to-switch-between-different-device-library">
+<span id="advanced-topics-how-to-switch-between-different-device-library"></span><h2>Advanced topics: How to switch between different Device/Library<a class="headerlink" href="#advanced-topics-how-to-switch-between-different-device-library" title="Permalink to this headline">¶</a></h2>
+<p>Generally, we will impelement OpKernel for all Device/Library of an Operator. We can easily train a Convolutional Neural Network in GPU. However, some OpKernel is not sutibale in a specific Device. For example, crf operator can be only run at CPU, whereas most other operators can be run at GPU. To achieve high performance in such circumstance, we have to switch between different Device/Library.</p>
+<p>We will discuss how to implement an efficient OpKernel switch policy.</p>
+<ul class="simple">
+<li>TBD</li>
+</ul>
+</div>
+</div>
+           </div>
+          </div>
+          <footer>
+  <hr/>
+  <div role="contentinfo">
+    <p>
+        &copy; Copyright 2016, PaddlePaddle developers.
+    </p>
+  </div>
+  Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. 
+</footer>
+        </div>
+      </div>
+    </section>
+  </div>
+    <script type="text/javascript">
+        var DOCUMENTATION_OPTIONS = {
+            URL_ROOT:'../',
+            VERSION:'',
+            COLLAPSE_INDEX:false,
+            FILE_SUFFIX:'.html',
+            HAS_SOURCE:  true,
+            SOURCELINK_SUFFIX: ".txt",
+        };
+    </script>
+      <script type="text/javascript" src="../_static/jquery.js"></script>
+      <script type="text/javascript" src="../_static/underscore.js"></script>
+      <script type="text/javascript" src="../_static/doctools.js"></script>
+      <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
+    <script type="text/javascript" src="../_static/js/theme.js"></script>
+  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
+  <script src="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/js/perfect-scrollbar.jquery.min.js"></script>
+  <script src="../_static/js/paddle_doc_init.js"></script> 
+</body>
+</html>
\ No newline at end of file
--- a/develop/doc/objects.inv
+++ b/develop/doc/objects.inv
--- a/develop/doc/searchindex.js
+++ b/develop/doc/searchindex.js
--- a/develop/doc_cn/_sources/design/support_new_device.md.txt
+++ b/develop/doc_cn/_sources/design/support_new_device.md.txt
+# Design Doc: Support new Device/Library
+## Background
+Deep learning has a high demand for computing resources. New high-performance device and computing library are coming constantly. The deep learning framework has to integrate these high-performance device and computing library flexibly.
+On the one hand, hardware and computing library are not usually one-to-one coresponding relations. For example, in Intel CPU, there are Eigen and MKL computing library. And in Nvidia GPU, there are Eigen and cuDNN computing library. We have to implement specific kernels for an operator for each computing library.
+On the other hand, users usually do not want to care about the low-level hardware and computing library when writing a neural network configuration. In Fluid, `Layer` is exposed in `Python`, and `Operator` is exposed in `C++`. Both `Layer` and `Operator` are independent on hardwares.
+So, how to support a new Device/Library in Fluid becomes a challenge.
+## Basic: Integrate A New Device/Library
+For a general overview of fluid, please refer to [overview doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/read_source.md).
+There are mainly there parts we have to consider in integrating a new device/library:
+- Place and DeviceContext: indicates the device id and manages hardware resources
+- Memory and Tensor: malloc/free data on certain device
+- Math Functor and OpKernel: implement computing unit on certain device/library
+### Place and DeviceContext
+#### Place
+Fluid use class [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L55) to represent specific device and computing library. There are inheritance relationships between different kinds of `Place`.
+```
+        |   CPUPlace   --> MKLDNNPlace
+Place --|   CUDAPlace  --> CUDNNPlace
+        |   FPGAPlace
+```
+And `Place` is defined as follows:
+```
+typedef boost::variant<CUDAPlace, CPUPlace, FPGAPlace> Place;
+```
+#### DeviceContext
+Fluid use class [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L30) to manage the resources in certain hardware, such as CUDA stream in `CDUADeviceContext`. There are also inheritance relationships between different kinds of `DeviceContext`.
+```
+                /->  CPUDeviceContext   --> MKLDeviceContext
+DeviceContext ---->  CUDADeviceContext  --> CUDNNDeviceContext
+                \->  FPGADeviceContext
+```
+A example of Nvidia GPU is as follows:
+- DeviceContext
+```
+class DeviceContext {
+  virtual Place GetPlace() const = 0;
+};  
+```
+- CUDADeviceContext
+```
+class CUDADeviceContext : public DeviceContext {
+  Place GetPlace() const override { return place_; }
+private:
+  CUDAPlace place_;
+  cudaStream_t stream_; 
+  cublasHandle_t cublas_handle_;
+  std::unique_ptr<Eigen::GpuDevice> eigen_device_;  // binds with stream_
+};
+```
+- CUDNNDeviceContext
+```
+class CUDNNDeviceContext : public CUDADeviceContext {
+  private:
+    cudnnHandle_t cudnn_handle_;
+};
+```
+### Memory and Tensor
+#### memory module
+Fluid provide following [memory interfaces](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memory.h#L36):
+```
+template <typename Place>
+void* Alloc(Place place, size_t size);
+template <typename Place>
+void Free(Place place, void* ptr);
+template <typename Place>
+size_t Used(Place place);
+```
+To implementing these interfaces, we have to implement MemoryAllocator for specific Device
+#### Tensor
+[Tensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/tensor.h#L36) holds data with some shape in certain Place.
+```cpp
+class Tensor {
+ public:
+  /*! Return a pointer to mutable memory block. */
+  template <typename T>
+  inline T* data();
+  /**
+   * @brief   Return a pointer to mutable memory block.
+   * @note    If not exist, then allocation.
+   */
+  template <typename T>
+  inline T* mutable_data(platform::Place place);
+  /**
+   * @brief     Return a pointer to mutable memory block.
+   *
+   * @param[in] dims    The dimensions of the memory block.
+   * @param[in] place   The place of the memory block.
+   *
+   * @note      If not exist, then allocation.
+   */
+  template <typename T>
+  inline T* mutable_data(DDim dims, platform::Place place);
+  /*! Resize the dimensions of the memory block. */
+  inline Tensor& Resize(const DDim& dims);
+  /*! Return the dimensions of the memory block. */
+  inline const DDim& dims() const;
+ private:
+  /*! holds the memory block if allocated. */
+  std::shared_ptr<Placeholder> holder_;
+  /*! points to dimensions of memory block. */
+  DDim dim_;
+};
+```
+`Placeholder` is used to delay memory allocation; that is, we can first define a tensor, using `Resize` to configure its shape, and then call `mutuable_data` to allocate the actual memory.
+```cpp
+paddle::framework::Tensor t;
+paddle::platform::CPUPlace place;
+// set size first
+t.Resize({2, 3});
+// allocate memory on CPU later
+t.mutable_data(place);
+```
+### Math Functor and OpKernel
+Fluid implements computing unit based on different DeviceContext. Some computing unit is shared between operators. These common part will be put in operators/math directory as basic Functors.
+Let's take [MaxOutFunctor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/maxouting.h#L27) as an example:
+The interface is defined in header file.
+```
+template <typename DeviceContext, typename T>
+class MaxOutFunctor {
+ public:
+  void operator()(const DeviceContext& context, const framework::Tensor& input,
+                  framework::Tensor* output, int groups);
+};
+```
+CPU implement in .cc file
+```
+template <typename T>
+class MaxOutFunctor<platform::CPUDeviceContext, T> {
+  public:
+  void operator()(const platform::CPUDeviceContext& context,
+                  const framework::Tensor& input, framework::Tensor* output,
+                  int groups) {
+                  ...
+                  }
+};
+```
+CUDA implement in .cu file
+```
+template <typename T>
+class MaxOutFunctor<platform::CUDADeviceContext, T> {
+ public:
+  void operator()(const platform::CUDADeviceContext& context,
+                  const framework::Tensor& input, framework::Tensor* output,
+                  int groups) {
+                  ...
+                  }
+};                  
+```
+We get computing handle from concrete DeviceContext, and make compution on tensors.
+The implement of `OpKernel` is similar to math functors, the extra thing we need to do is registering the OpKernel to global map.
+Fluid provides different register interface in op_registry.h
+Let's take [Crop](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/crop_op.cc#L134) operator as an example:
+In .cc file:
+```
+REGISTER_OP_CPU_KERNEL(crop, ops::CropKernel<float>);
+REGISTER_OP_CPU_KERNEL(
+    crop_grad, ops::CropGradKernel<paddle::platform::CPUDeviceContext, float>);
+```
+In .cu file:
+```
+REGISTER_OP_CUDA_KERNEL(crop, ops::CropKernel<float>);
+REGISTER_OP_CUDA_KERNEL(
+    crop_grad, ops::CropGradKernel<paddle::platform::CUDADeviceContext, float>);
+```
+## Advanced topics: How to switch between different Device/Library
+Generally, we will impelement OpKernel for all Device/Library of an Operator. We can easily train a Convolutional Neural Network in GPU. However, some OpKernel is not sutibale in a specific Device. For example, crf operator can be only run at CPU, whereas most other operators can be run at GPU. To achieve high performance in such circumstance, we have to switch between different Device/Library.
+We will discuss how to implement an efficient OpKernel switch policy. 
+- TBD
--- a/develop/doc_cn/design/support_new_device.html
+++ b/develop/doc_cn/design/support_new_device.html
+<!DOCTYPE html>
+<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
+<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>Design Doc: Support new Device/Library &mdash; PaddlePaddle  文档</title>
+    <link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
+        <link rel="index" title="索引"
+              href="../genindex.html"/>
+        <link rel="search" title="搜索" href="../search.html"/>
+    <link rel="top" title="PaddlePaddle  文档" href="../index.html"/> 
+  <link rel="stylesheet" href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" type="text/css" />
+  <link rel="stylesheet" href="../_static/css/override.css" type="text/css" />
+  <script>
+  var _hmt = _hmt || [];
+  (function() {
+    var hm = document.createElement("script");
+    hm.src = "//hm.baidu.com/hm.js?b9a314ab40d04d805655aab1deee08ba";
+    var s = document.getElementsByTagName("script")[0]; 
+    s.parentNode.insertBefore(hm, s);
+  })();
+  </script>
+  <script src="../_static/js/modernizr.min.js"></script>
+</head>
+<body class="wy-body-for-nav" role="document">
+  <header class="site-header">
+    <div class="site-logo">
+      <a href="/"><img src="../_static/images/PP_w.png"></a>
+    </div>
+    <div class="site-nav-links">
+      <div class="site-menu">
+        <a class="fork-on-github" href="https://github.com/PaddlePaddle/Paddle" target="_blank"><i class="fa fa-github"></i>Fork me on Github</a>
+        <div class="language-switcher dropdown">
+          <a type="button" data-toggle="dropdown">
+            <span>English</span>
+            <i class="fa fa-angle-up"></i>
+            <i class="fa fa-angle-down"></i>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/doc_cn">中文</a></li>
+            <li><a href="/doc">English</a></li>
+          </ul>
+        </div>
+        <ul class="site-page-links">
+          <li><a href="/">Home</a></li>
+        </ul>
+      </div>
+      <div class="doc-module">
+        <ul>
+<li class="toctree-l1"><a class="reference internal" href="../getstarted/index_cn.html">新手入门</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../howto/index_cn.html">进阶指南</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../api/index_cn.html">API</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../faq/index_cn.html">FAQ</a></li>
+<li class="toctree-l1"><a class="reference internal" href="../mobile/index_cn.html">MOBILE</a></li>
+</ul>
+<div role="search">
+  <form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
+    <input type="text" name="q" placeholder="Search docs" />
+    <input type="hidden" name="check_keywords" value="yes" />
+    <input type="hidden" name="area" value="default" />
+  </form>
+</div>        
+      </div>
+    </div>
+  </header>
+  <div class="main-content-wrap">
+    <nav class="doc-menu-vertical" role="navigation">
+          <ul>
+<li class="toctree-l1"><a class="reference internal" href="../getstarted/index_cn.html">新手入门</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../getstarted/build_and_install/index_cn.html">安装与编译</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/pip_install_cn.html">使用pip安装</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/docker_install_cn.html">使用Docker安装运行</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../howto/dev/build_cn.html">用Docker编译和测试PaddlePaddle</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../getstarted/build_and_install/build_from_source_cn.html">从源码编译</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../getstarted/concepts/use_concepts_cn.html">基本使用概念</a></li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../howto/index_cn.html">进阶指南</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../howto/usage/cmd_parameter/index_cn.html">设置命令行参数</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/use_case_cn.html">使用案例</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/arguments_cn.html">参数概述</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../howto/usage/cmd_parameter/detail_introduction_cn.html">细节描述</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/usage/cluster/cluster_train_cn.html">PaddlePaddle分布式训练</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/usage/k8s/k8s_basis_cn.html">Kubernetes 简介</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/usage/k8s/k8s_cn.html">Kubernetes单机训练</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/usage/k8s/k8s_distributed_cn.html">Kubernetes分布式训练</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/dev/contribute_to_paddle_cn.html">如何贡献代码</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/dev/write_docs_cn.html">如何贡献/修改文档</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/deep_model/rnn/index_cn.html">RNN相关模型</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/rnn_config_cn.html">RNN配置</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/recurrent_group_cn.html">Recurrent Group教程</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/hierarchical_layer_cn.html">支持双层序列作为输入的Layer</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../howto/deep_model/rnn/hrnn_rnn_api_compare_cn.html">单双层RNN API对比介绍</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../howto/optimization/gpu_profiling_cn.html">GPU性能分析与调优</a></li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../api/index_cn.html">API</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../api/v2/model_configs.html">模型配置</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/activation.html">Activation</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/layer.html">Layers</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/evaluators.html">Evaluators</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/optimizer.html">Optimizer</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/pooling.html">Pooling</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/networks.html">Networks</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/config/attr.html">Parameter Attribute</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../api/v2/data.html">数据访问</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/data_reader.html">Data Reader Interface</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/image.html">Image Interface</a></li>
+<li class="toctree-l3"><a class="reference internal" href="../api/v2/data/dataset.html">Dataset</a></li>
+</ul>
+</li>
+<li class="toctree-l2"><a class="reference internal" href="../api/v2/run_logic.html">训练与应用</a></li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../faq/index_cn.html">FAQ</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../faq/build_and_install/index_cn.html">编译安装与单元测试</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../faq/model/index_cn.html">模型配置</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../faq/parameter/index_cn.html">参数设置</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../faq/local/index_cn.html">本地训练与预测</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../faq/cluster/index_cn.html">集群训练与预测</a></li>
+</ul>
+</li>
+<li class="toctree-l1"><a class="reference internal" href="../mobile/index_cn.html">MOBILE</a><ul>
+<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_android_cn.html">Android平台编译指南</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_ios_cn.html">iOS平台编译指南</a></li>
+<li class="toctree-l2"><a class="reference internal" href="../mobile/cross_compiling_for_raspberry_cn.html">Raspberry Pi平台编译指南</a></li>
+</ul>
+</li>
+</ul>
+    </nav>
+    <section class="doc-content-wrap">
+<div role="navigation" aria-label="breadcrumbs navigation">
+  <ul class="wy-breadcrumbs">
+    <li>Design Doc: Support new Device/Library</li>
+  </ul>
+</div>
+      <div class="wy-nav-content" id="doc-content">
+        <div class="rst-content">
+          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
+           <div itemprop="articleBody">
+  <div class="section" id="design-doc-support-new-device-library">
+<span id="design-doc-support-new-device-library"></span><h1>Design Doc: Support new Device/Library<a class="headerlink" href="#design-doc-support-new-device-library" title="永久链接至标题">¶</a></h1>
+<div class="section" id="background">
+<span id="background"></span><h2>Background<a class="headerlink" href="#background" title="永久链接至标题">¶</a></h2>
+<p>Deep learning has a high demand for computing resources. New high-performance device and computing library are coming constantly. The deep learning framework has to integrate these high-performance device and computing library flexibly.</p>
+<p>On the one hand, hardware and computing library are not usually one-to-one coresponding relations. For example, in Intel CPU, there are Eigen and MKL computing library. And in Nvidia GPU, there are Eigen and cuDNN computing library. We have to implement specific kernels for an operator for each computing library.</p>
+<p>On the other hand, users usually do not want to care about the low-level hardware and computing library when writing a neural network configuration. In Fluid, <code class="docutils literal"><span class="pre">Layer</span></code> is exposed in <code class="docutils literal"><span class="pre">Python</span></code>, and <code class="docutils literal"><span class="pre">Operator</span></code> is exposed in <code class="docutils literal"><span class="pre">C++</span></code>. Both <code class="docutils literal"><span class="pre">Layer</span></code> and <code class="docutils literal"><span class="pre">Operator</span></code> are independent on hardwares.</p>
+<p>So, how to support a new Device/Library in Fluid becomes a challenge.</p>
+</div>
+<div class="section" id="basic-integrate-a-new-device-library">
+<span id="basic-integrate-a-new-device-library"></span><h2>Basic: Integrate A New Device/Library<a class="headerlink" href="#basic-integrate-a-new-device-library" title="永久链接至标题">¶</a></h2>
+<p>For a general overview of fluid, please refer to <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/read_source.md">overview doc</a>.</p>
+<p>There are mainly there parts we have to consider in integrating a new device/library:</p>
+<ul class="simple">
+<li>Place and DeviceContext: indicates the device id and manages hardware resources</li>
+<li>Memory and Tensor: malloc/free data on certain device</li>
+<li>Math Functor and OpKernel: implement computing unit on certain device/library</li>
+</ul>
+<div class="section" id="place-and-devicecontext">
+<span id="place-and-devicecontext"></span><h3>Place and DeviceContext<a class="headerlink" href="#place-and-devicecontext" title="永久链接至标题">¶</a></h3>
+<div class="section" id="place">
+<span id="place"></span><h4>Place<a class="headerlink" href="#place" title="永久链接至标题">¶</a></h4>
+<p>Fluid use class <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L55">Place</a> to represent specific device and computing library. There are inheritance relationships between different kinds of <code class="docutils literal"><span class="pre">Place</span></code>.</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span>        <span class="o">|</span>   <span class="n">CPUPlace</span>   <span class="o">--&gt;</span> <span class="n">MKLDNNPlace</span>
+<span class="n">Place</span> <span class="o">--|</span>   <span class="n">CUDAPlace</span>  <span class="o">--&gt;</span> <span class="n">CUDNNPlace</span>
+        <span class="o">|</span>   <span class="n">FPGAPlace</span>
+</pre></div>
+</div>
+<p>And <code class="docutils literal"><span class="pre">Place</span></code> is defined as follows:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">typedef</span> <span class="n">boost</span><span class="p">::</span><span class="n">variant</span><span class="o">&lt;</span><span class="n">CUDAPlace</span><span class="p">,</span> <span class="n">CPUPlace</span><span class="p">,</span> <span class="n">FPGAPlace</span><span class="o">&gt;</span> <span class="n">Place</span><span class="p">;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="devicecontext">
+<span id="devicecontext"></span><h4>DeviceContext<a class="headerlink" href="#devicecontext" title="永久链接至标题">¶</a></h4>
+<p>Fluid use class <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L30">DeviceContext</a> to manage the resources in certain hardware, such as CUDA stream in <code class="docutils literal"><span class="pre">CDUADeviceContext</span></code>. There are also inheritance relationships between different kinds of <code class="docutils literal"><span class="pre">DeviceContext</span></code>.</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span>                <span class="o">/-&gt;</span>  <span class="n">CPUDeviceContext</span>   <span class="o">--&gt;</span> <span class="n">MKLDeviceContext</span>
+<span class="n">DeviceContext</span> <span class="o">----&gt;</span>  <span class="n">CUDADeviceContext</span>  <span class="o">--&gt;</span> <span class="n">CUDNNDeviceContext</span>
+                \<span class="o">-&gt;</span>  <span class="n">FPGADeviceContext</span>
+</pre></div>
+</div>
+<p>A example of Nvidia GPU is as follows:</p>
+<ul class="simple">
+<li>DeviceContext</li>
+</ul>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">DeviceContext</span> <span class="p">{</span>
+  <span class="n">virtual</span> <span class="n">Place</span> <span class="n">GetPlace</span><span class="p">()</span> <span class="n">const</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
+<span class="p">};</span>  
+</pre></div>
+</div>
+<ul class="simple">
+<li>CUDADeviceContext</li>
+</ul>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">CUDADeviceContext</span> <span class="p">:</span> <span class="n">public</span> <span class="n">DeviceContext</span> <span class="p">{</span>
+  <span class="n">Place</span> <span class="n">GetPlace</span><span class="p">()</span> <span class="n">const</span> <span class="n">override</span> <span class="p">{</span> <span class="k">return</span> <span class="n">place_</span><span class="p">;</span> <span class="p">}</span>
+<span class="n">private</span><span class="p">:</span>
+  <span class="n">CUDAPlace</span> <span class="n">place_</span><span class="p">;</span>
+  <span class="n">cudaStream_t</span> <span class="n">stream_</span><span class="p">;</span> 
+  <span class="n">cublasHandle_t</span> <span class="n">cublas_handle_</span><span class="p">;</span>
+  <span class="n">std</span><span class="p">::</span><span class="n">unique_ptr</span><span class="o">&lt;</span><span class="n">Eigen</span><span class="p">::</span><span class="n">GpuDevice</span><span class="o">&gt;</span> <span class="n">eigen_device_</span><span class="p">;</span>  <span class="o">//</span> <span class="n">binds</span> <span class="k">with</span> <span class="n">stream_</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+<ul class="simple">
+<li>CUDNNDeviceContext</li>
+</ul>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">CUDNNDeviceContext</span> <span class="p">:</span> <span class="n">public</span> <span class="n">CUDADeviceContext</span> <span class="p">{</span>
+  <span class="n">private</span><span class="p">:</span>
+    <span class="n">cudnnHandle_t</span> <span class="n">cudnn_handle_</span><span class="p">;</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="memory-and-tensor">
+<span id="memory-and-tensor"></span><h3>Memory and Tensor<a class="headerlink" href="#memory-and-tensor" title="永久链接至标题">¶</a></h3>
+<div class="section" id="memory-module">
+<span id="memory-module"></span><h4>memory module<a class="headerlink" href="#memory-module" title="永久链接至标题">¶</a></h4>
+<p>Fluid provide following <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/memory/memory.h#L36">memory interfaces</a>:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">Place</span><span class="o">&gt;</span>
+<span class="n">void</span><span class="o">*</span> <span class="n">Alloc</span><span class="p">(</span><span class="n">Place</span> <span class="n">place</span><span class="p">,</span> <span class="n">size_t</span> <span class="n">size</span><span class="p">);</span>
+<span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">Place</span><span class="o">&gt;</span>
+<span class="n">void</span> <span class="n">Free</span><span class="p">(</span><span class="n">Place</span> <span class="n">place</span><span class="p">,</span> <span class="n">void</span><span class="o">*</span> <span class="n">ptr</span><span class="p">);</span>
+<span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">Place</span><span class="o">&gt;</span>
+<span class="n">size_t</span> <span class="n">Used</span><span class="p">(</span><span class="n">Place</span> <span class="n">place</span><span class="p">);</span>
+</pre></div>
+</div>
+<p>To implementing these interfaces, we have to implement MemoryAllocator for specific Device</p>
+</div>
+<div class="section" id="tensor">
+<span id="tensor"></span><h4>Tensor<a class="headerlink" href="#tensor" title="永久链接至标题">¶</a></h4>
+<p><a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/tensor.h#L36">Tensor</a> holds data with some shape in certain Place.</p>
+<div class="highlight-cpp"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">Tensor</span> <span class="p">{</span>
+ <span class="k">public</span><span class="o">:</span>
+  <span class="cm">/*! Return a pointer to mutable memory block. */</span>
+  <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+  <span class="kr">inline</span> <span class="n">T</span><span class="o">*</span> <span class="n">data</span><span class="p">();</span>
+  <span class="cm">/**</span>
+<span class="cm">   * @brief   Return a pointer to mutable memory block.</span>
+<span class="cm">   * @note    If not exist, then allocation.</span>
+<span class="cm">   */</span>
+  <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+  <span class="kr">inline</span> <span class="n">T</span><span class="o">*</span> <span class="n">mutable_data</span><span class="p">(</span><span class="n">platform</span><span class="o">::</span><span class="n">Place</span> <span class="n">place</span><span class="p">);</span>
+  <span class="cm">/**</span>
+<span class="cm">   * @brief     Return a pointer to mutable memory block.</span>
+<span class="cm">   *</span>
+<span class="cm">   * @param[in] dims    The dimensions of the memory block.</span>
+<span class="cm">   * @param[in] place   The place of the memory block.</span>
+<span class="cm">   *</span>
+<span class="cm">   * @note      If not exist, then allocation.</span>
+<span class="cm">   */</span>
+  <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+  <span class="kr">inline</span> <span class="n">T</span><span class="o">*</span> <span class="n">mutable_data</span><span class="p">(</span><span class="n">DDim</span> <span class="n">dims</span><span class="p">,</span> <span class="n">platform</span><span class="o">::</span><span class="n">Place</span> <span class="n">place</span><span class="p">);</span>
+  <span class="cm">/*! Resize the dimensions of the memory block. */</span>
+  <span class="kr">inline</span> <span class="n">Tensor</span><span class="o">&amp;</span> <span class="n">Resize</span><span class="p">(</span><span class="k">const</span> <span class="n">DDim</span><span class="o">&amp;</span> <span class="n">dims</span><span class="p">);</span>
+  <span class="cm">/*! Return the dimensions of the memory block. */</span>
+  <span class="kr">inline</span> <span class="k">const</span> <span class="n">DDim</span><span class="o">&amp;</span> <span class="n">dims</span><span class="p">()</span> <span class="k">const</span><span class="p">;</span>
+ <span class="k">private</span><span class="o">:</span>
+  <span class="cm">/*! holds the memory block if allocated. */</span>
+  <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">Placeholder</span><span class="o">&gt;</span> <span class="n">holder_</span><span class="p">;</span>
+  <span class="cm">/*! points to dimensions of memory block. */</span>
+  <span class="n">DDim</span> <span class="n">dim_</span><span class="p">;</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+<p><code class="docutils literal"><span class="pre">Placeholder</span></code> is used to delay memory allocation; that is, we can first define a tensor, using <code class="docutils literal"><span class="pre">Resize</span></code> to configure its shape, and then call <code class="docutils literal"><span class="pre">mutuable_data</span></code> to allocate the actual memory.</p>
+<div class="highlight-cpp"><div class="highlight"><pre><span></span><span class="n">paddle</span><span class="o">::</span><span class="n">framework</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">t</span><span class="p">;</span>
+<span class="n">paddle</span><span class="o">::</span><span class="n">platform</span><span class="o">::</span><span class="n">CPUPlace</span> <span class="n">place</span><span class="p">;</span>
+<span class="c1">// set size first</span>
+<span class="n">t</span><span class="p">.</span><span class="n">Resize</span><span class="p">({</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">});</span>
+<span class="c1">// allocate memory on CPU later</span>
+<span class="n">t</span><span class="p">.</span><span class="n">mutable_data</span><span class="p">(</span><span class="n">place</span><span class="p">);</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="math-functor-and-opkernel">
+<span id="math-functor-and-opkernel"></span><h3>Math Functor and OpKernel<a class="headerlink" href="#math-functor-and-opkernel" title="永久链接至标题">¶</a></h3>
+<p>Fluid implements computing unit based on different DeviceContext. Some computing unit is shared between operators. These common part will be put in operators/math directory as basic Functors.</p>
+<p>Let&#8217;s take <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/math/maxouting.h#L27">MaxOutFunctor</a> as an example:</p>
+<p>The interface is defined in header file.</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">DeviceContext</span><span class="p">,</span> <span class="n">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+<span class="k">class</span> <span class="nc">MaxOutFunctor</span> <span class="p">{</span>
+ <span class="n">public</span><span class="p">:</span>
+  <span class="n">void</span> <span class="n">operator</span><span class="p">()(</span><span class="n">const</span> <span class="n">DeviceContext</span><span class="o">&amp;</span> <span class="n">context</span><span class="p">,</span> <span class="n">const</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="nb">input</span><span class="p">,</span>
+                  <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">*</span> <span class="n">output</span><span class="p">,</span> <span class="nb">int</span> <span class="n">groups</span><span class="p">);</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+<p>CPU implement in .cc file</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+<span class="k">class</span> <span class="nc">MaxOutFunctor</span><span class="o">&lt;</span><span class="n">platform</span><span class="p">::</span><span class="n">CPUDeviceContext</span><span class="p">,</span> <span class="n">T</span><span class="o">&gt;</span> <span class="p">{</span>
+  <span class="n">public</span><span class="p">:</span>
+  <span class="n">void</span> <span class="n">operator</span><span class="p">()(</span><span class="n">const</span> <span class="n">platform</span><span class="p">::</span><span class="n">CPUDeviceContext</span><span class="o">&amp;</span> <span class="n">context</span><span class="p">,</span>
+                  <span class="n">const</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="nb">input</span><span class="p">,</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">*</span> <span class="n">output</span><span class="p">,</span>
+                  <span class="nb">int</span> <span class="n">groups</span><span class="p">)</span> <span class="p">{</span>
+                  <span class="o">...</span>
+                  <span class="p">}</span>
+<span class="p">};</span>
+</pre></div>
+</div>
+<p>CUDA implement in .cu file</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">template</span> <span class="o">&lt;</span><span class="n">typename</span> <span class="n">T</span><span class="o">&gt;</span>
+<span class="k">class</span> <span class="nc">MaxOutFunctor</span><span class="o">&lt;</span><span class="n">platform</span><span class="p">::</span><span class="n">CUDADeviceContext</span><span class="p">,</span> <span class="n">T</span><span class="o">&gt;</span> <span class="p">{</span>
+ <span class="n">public</span><span class="p">:</span>
+  <span class="n">void</span> <span class="n">operator</span><span class="p">()(</span><span class="n">const</span> <span class="n">platform</span><span class="p">::</span><span class="n">CUDADeviceContext</span><span class="o">&amp;</span> <span class="n">context</span><span class="p">,</span>
+                  <span class="n">const</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="nb">input</span><span class="p">,</span> <span class="n">framework</span><span class="p">::</span><span class="n">Tensor</span><span class="o">*</span> <span class="n">output</span><span class="p">,</span>
+                  <span class="nb">int</span> <span class="n">groups</span><span class="p">)</span> <span class="p">{</span>
+                  <span class="o">...</span>
+                  <span class="p">}</span>
+<span class="p">};</span>                  
+</pre></div>
+</div>
+<p>We get computing handle from concrete DeviceContext, and make compution on tensors.</p>
+<p>The implement of <code class="docutils literal"><span class="pre">OpKernel</span></code> is similar to math functors, the extra thing we need to do is registering the OpKernel to global map.</p>
+<p>Fluid provides different register interface in op_registry.h</p>
+<p>Let&#8217;s take <a class="reference external" href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/crop_op.cc#L134">Crop</a> operator as an example:</p>
+<p>In .cc file:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">REGISTER_OP_CPU_KERNEL</span><span class="p">(</span><span class="n">crop</span><span class="p">,</span> <span class="n">ops</span><span class="p">::</span><span class="n">CropKernel</span><span class="o">&lt;</span><span class="nb">float</span><span class="o">&gt;</span><span class="p">);</span>
+<span class="n">REGISTER_OP_CPU_KERNEL</span><span class="p">(</span>
+    <span class="n">crop_grad</span><span class="p">,</span> <span class="n">ops</span><span class="p">::</span><span class="n">CropGradKernel</span><span class="o">&lt;</span><span class="n">paddle</span><span class="p">::</span><span class="n">platform</span><span class="p">::</span><span class="n">CPUDeviceContext</span><span class="p">,</span> <span class="nb">float</span><span class="o">&gt;</span><span class="p">);</span>
+</pre></div>
+</div>
+<p>In .cu file:</p>
+<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">REGISTER_OP_CUDA_KERNEL</span><span class="p">(</span><span class="n">crop</span><span class="p">,</span> <span class="n">ops</span><span class="p">::</span><span class="n">CropKernel</span><span class="o">&lt;</span><span class="nb">float</span><span class="o">&gt;</span><span class="p">);</span>
+<span class="n">REGISTER_OP_CUDA_KERNEL</span><span class="p">(</span>
+    <span class="n">crop_grad</span><span class="p">,</span> <span class="n">ops</span><span class="p">::</span><span class="n">CropGradKernel</span><span class="o">&lt;</span><span class="n">paddle</span><span class="p">::</span><span class="n">platform</span><span class="p">::</span><span class="n">CUDADeviceContext</span><span class="p">,</span> <span class="nb">float</span><span class="o">&gt;</span><span class="p">);</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="advanced-topics-how-to-switch-between-different-device-library">
+<span id="advanced-topics-how-to-switch-between-different-device-library"></span><h2>Advanced topics: How to switch between different Device/Library<a class="headerlink" href="#advanced-topics-how-to-switch-between-different-device-library" title="永久链接至标题">¶</a></h2>
+<p>Generally, we will impelement OpKernel for all Device/Library of an Operator. We can easily train a Convolutional Neural Network in GPU. However, some OpKernel is not sutibale in a specific Device. For example, crf operator can be only run at CPU, whereas most other operators can be run at GPU. To achieve high performance in such circumstance, we have to switch between different Device/Library.</p>
+<p>We will discuss how to implement an efficient OpKernel switch policy.</p>
+<ul class="simple">
+<li>TBD</li>
+</ul>
+</div>
+</div>
+           </div>
+          </div>
+          <footer>
+  <hr/>
+  <div role="contentinfo">
+    <p>
+        &copy; Copyright 2016, PaddlePaddle developers.
+    </p>
+  </div>
+  Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. 
+</footer>
+        </div>
+      </div>
+    </section>
+  </div>
+    <script type="text/javascript">
+        var DOCUMENTATION_OPTIONS = {
+            URL_ROOT:'../',
+            VERSION:'',
+            COLLAPSE_INDEX:false,
+            FILE_SUFFIX:'.html',
+            HAS_SOURCE:  true,
+            SOURCELINK_SUFFIX: ".txt",
+        };
+    </script>
+      <script type="text/javascript" src="../_static/jquery.js"></script>
+      <script type="text/javascript" src="../_static/underscore.js"></script>
+      <script type="text/javascript" src="../_static/doctools.js"></script>
+      <script type="text/javascript" src="../_static/translations.js"></script>
+      <script type="text/javascript" src="https://cdn.bootcss.com/mathjax/2.7.0/MathJax.js"></script>
+    <script type="text/javascript" src="../_static/js/theme.js"></script>
+  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
+  <script src="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/js/perfect-scrollbar.jquery.min.js"></script>
+  <script src="../_static/js/paddle_doc_init.js"></script> 
+</body>
+</html>
\ No newline at end of file
--- a/develop/doc_cn/objects.inv
+++ b/develop/doc_cn/objects.inv
--- a/develop/doc_cn/searchindex.js
+++ b/develop/doc_cn/searchindex.js