PaddlePaddle Fluid has hundreds of operators. Each operator could have one or more kernels. A kernel is an implementation of the operator for a certain device, which could be a hardware device, e.g., the CUDA GPU, or a library that utilizes a device, e.g., Intel MKL that makes full use of the Xeon CPU.
[This document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/new_op_en.md) explains how to add an operator, and its kernels. The kernels of an operator are indexed by a C++ type [`OpKernelType`](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/operator_kernel_type.md). An operator chooses the right kernel at runtime. This choosing mechanism is described [here](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/switch_kernel.md).
### Write Kernels for A New Device
#### Add A New Device
For some historical reasons, we misuse the word *library* for *device*. For example, we refer to the device type as the *library type*. An example is the header file [`library_type.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/library_type.h#L24). We will correct this ASAP.
To register a new device, we need to add an enum value to `LibraryType`:
```cpp
enum class LibraryType {
  kPlain = 0,
  kMKLDNN = 1,
  kCUDNN = 2,
};
```
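As a sketch of what registering a new device involves, suppose we were adding FPGA support (the `kFPGA` value and the `LibraryTypeToString` helper below are illustrative assumptions, not existing Paddle code):

```cpp
#include <string>

// Sketch only: kFPGA is a hypothetical new device/library value.
enum class LibraryType {
  kPlain = 0,
  kMKLDNN = 1,
  kCUDNN = 2,
  kFPGA = 3,  // the newly registered device
};

// Any string-conversion helper keyed on the enum usually needs
// extending as well, or the new device cannot be printed in logs.
inline std::string LibraryTypeToString(LibraryType t) {
  switch (t) {
    case LibraryType::kPlain:  return "PLAIN";
    case LibraryType::kMKLDNN: return "MKLDNN";
    case LibraryType::kCUDNN:  return "CUDNN";
    case LibraryType::kFPGA:   return "FPGA";
  }
  return "UNKNOWN";
}
```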
#### Add A New [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L53)
If you have a new kind of device, you first need to add a new kind of [`Place`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L53). For example, `CUDAPlace`:
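A minimal sketch of such a `Place` struct, modeled on `CUDAPlace` (the real definition lives in `paddle/platform/place.h` and may differ in detail):

```cpp
// Sketch of a Place for a device type: it only records which
// physical device an operation runs on.
struct CUDAPlace {
  CUDAPlace() : device(0) {}
  explicit CUDAPlace(int d) : device(d) {}

  // Places are compared when the framework picks a kernel at runtime.
  bool operator==(const CUDAPlace& o) const { return device == o.device; }
  bool operator!=(const CUDAPlace& o) const { return !(*this == o); }

  int device;  // device id, e.g. which GPU card
};
```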
#### Add a New [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L37)
After a new kind of device is added, you should add a corresponding [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h#L37) for it.
```cpp
class DeviceContext {
 public:
  virtual ~DeviceContext() {}
  virtual Place GetPlace() const = 0;
  virtual void Wait() const {}
};
```
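A hedged sketch of a concrete `DeviceContext` for a hypothetical new device follows; `MyDeviceContext` and the simplified `Place` stand-in are assumptions made so the example is self-contained (Paddle's actual `Place` is a variant over per-device place structs):

```cpp
// Stand-in Place so the sketch compiles on its own; see
// paddle/platform/place.h for the real type.
struct Place {
  int device_id = 0;
};

class DeviceContext {
 public:
  virtual ~DeviceContext() {}
  virtual Place GetPlace() const = 0;
  virtual void Wait() const {}
};

// Hypothetical context for a new device: it owns per-device
// resources (streams, handles) and reports which Place it serves.
class MyDeviceContext : public DeviceContext {
 public:
  explicit MyDeviceContext(Place p) : place_(p) {}
  Place GetPlace() const override { return place_; }
  void Wait() const override { /* block until queued work finishes */ }

 private:
  Place place_;
};
```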
#### Implement a New [OpKernel](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h#L351) for Your Device
Detailed documentation can be found in [`new_op_and_kernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/new_op_en.md).
```cpp
class OpKernelBase {
 public:
  /**
   * ExecutionContext is the only parameter of the kernel's Compute function.
   * Compute will get input/output variables, state such as momentum, and
   * device resources such as the CUDA stream and cublas handle from the
   * ExecutionContext. Users should construct it before running the Operator.
   */
  virtual void Compute(const ExecutionContext& context) const = 0;

  virtual ~OpKernelBase() = default;
};
```
#### Register the OpKernel to the Framework
After writing the components described above, we should register the kernel to the framework.

We use `REGISTER_OP_KERNEL` to do the registration. kernel0 and kernel1 are kernels that have the same `op_type`, `library_type`, and `place_type` but different `data_type`s.

Take [`conv2d`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/conv_cudnn_op.cu.cc#L318) as an example:

- `conv2d` is the type/name of the operator
- `CUDNN/CPU` is the `library`
- `paddle::platform::CUDAPlace/CPUPlace` is the `place`
- the template parameter `float/double` on `CUDNNConvOpKernel<T>` is the `data_type`
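The registration mechanism can be sketched as a registry keyed by operator type, library, and data type. Everything below (`KernelRegistry`, `RegisterKernel`, the `REGISTER_OP_KERNEL_SKETCH` macro) is an illustrative stand-in, not Paddle's actual implementation; the real macro is `REGISTER_OP_KERNEL` and the real registry keys kernels by `OpKernelType`:

```cpp
#include <functional>
#include <map>
#include <string>
#include <tuple>

// Stand-in kernel type: a kernel is just a callable here.
using Kernel = std::function<void()>;
// (op_type, library, data_type) -- a simplified OpKernelType.
using KernelKey = std::tuple<std::string, std::string, std::string>;

std::map<KernelKey, Kernel>& KernelRegistry() {
  static std::map<KernelKey, Kernel> registry;
  return registry;
}

bool RegisterKernel(const std::string& op, const std::string& lib,
                    const std::string& dtype, Kernel k) {
  KernelRegistry()[KernelKey(op, lib, dtype)] = std::move(k);
  return true;
}

// Mimics how a registration macro expands into a static initializer
// that runs before main().
#define REGISTER_OP_KERNEL_SKETCH(op, lib, dtype, fn) \
  static bool op##_##lib##_##dtype##_reg = RegisterKernel(#op, #lib, #dtype, fn)

// Same op_type and library, two data_types (cf. kernel0/kernel1 above).
REGISTER_OP_KERNEL_SKETCH(conv2d, CUDNN, float, [] {});
REGISTER_OP_KERNEL_SKETCH(conv2d, CUDNN, double, [] {});
```

At runtime the framework can then look up the matching kernel by building the key from the operator and the current device, which is the choosing mechanism described in the background section.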