<li><strong>input</strong> (<em>Variable|list</em>) – a 2-D tensor with shape [N x D], where N is the
batch size and D is the number of classes. This input is a probability
computed by the previous operator, which is almost always the result
of a softmax operator.</li>
<li><strong>label</strong> (<em>Variable|list</em>) – the ground truth which is a 2-D tensor. When
<cite>soft_label</cite> is set to <cite>False</cite>, <cite>label</cite> is a tensor<int64> with shape
[N x 1]. When <cite>soft_label</cite> is set to <cite>True</cite>, <cite>label</cite> is a
tensor<float/double> with shape [N x D].</li>
<li><strong>soft_label</strong> (bool, via <cite>**kwargs</cite>) – a flag indicating whether to interpret
the given labels as soft labels, default <cite>False</cite>.</li>
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">Returns:</th><tdclass="field-body"><pclass="first">A 2-D tensor with shape [N x 1], the cross entropy loss.</p>
</td>
</tr>
<trclass="field-odd field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><cite>ValueError</cite>– 1) the 1st dimension of <cite>input</cite> and <cite>label</cite> are not equal; 2) when <cite>soft_label == True</cite>, and the 2nd dimension of <cite>input</cite> and <cite>label</cite> are not equal; 3) when <cite>soft_label == False</cite>, and the 2nd dimension of <cite>label</cite> is not 1.</p>
<trclass="field-odd field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first"><cite>ValueError</cite>– 1) the 1st dimension of <cite>input</cite> and <cite>label</cite> are not equal.
2) when <cite>soft_label == True</cite>, and the 2nd dimension of</p>
<blockquote>
<div><p><cite>input</cite> and <cite>label</cite> are not equal.</p>
</div></blockquote>
<olclass="last arabic simple"start="3">
<li>when <cite>soft_label == False</cite>, and the 2nd dimension of
<cite>label</cite> is not 1.</li>
</ol>
</td>
</tr>
</tbody>
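<p>A minimal usage sketch (the variable names, shapes and surrounding calls below are
illustrative assumptions, not part of this layer's specification):</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>import paddle.fluid as fluid

# Hypothetical setup: predictions over 10 classes and matching integer labels.
predict = fluid.layers.data(name='predict', shape=[10], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
prob = fluid.layers.softmax(predict)
# Hard labels (soft_label=False, the default); the result has shape [N x 1].
cost = fluid.layers.cross_entropy(input=prob, label=label)
</pre></div></div>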
...
...
@@ -1277,8 +1286,9 @@ the given labels as soft labels, default <cite>False</cite>.</li>
<p>This layer accepts input predictions and target label and returns the squared error cost.
For predictions, <span class="math">\(X\)</span>, and target labels, <span class="math">\(Y\)</span>, the equation is:</p>
<divclass="math">
\[Out = (X - Y)^2\]</div>
<p>In the above equation:</p>
...
...
@@ -1299,7 +1309,12 @@ For predictions, <span class="math">\(X\)</span>, and target labels, <span class
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">Returns:</th><tdclass="field-body"><pclass="first">The tensor variable storing the element-wise squared error difference of input and label.</p>
<li><spanclass="math">\(Out\)</span>: Output value, the shape of <spanclass="math">\(Out\)</span> and <spanclass="math">\(X\)</span> may be different.</li>
<li><dlclass="first docutils">
<dt><spanclass="math">\(Out\)</span>: Output value, the shape of <spanclass="math">\(Out\)</span> and <spanclass="math">\(X\)</span> may be</dt>
<dd>different.</dd>
</dl>
</li>
</ul>
<pclass="rubric">Example</p>
<ul>
...
...
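<p>A minimal sketch of using the squared error cost described above (variable names
and shapes are illustrative assumptions):</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[1], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
# Element-wise (x - y)^2.
cost = fluid.layers.square_error_cost(input=x, label=y)
</pre></div></div>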
@@ -1407,20 +1427,28 @@ library is installed. Default: True</li>
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">Returns:</th><tdclass="field-body"><pclass="first">The tensor variable storing the convolution and non-linearity activation result.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If the shapes of input, filter_size, stride, padding and groups mismatch.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If the shapes of input, filter_size, stride, padding and
@@ -2158,7 +2186,8 @@ are in NCHW format. Where N is batch size, C is the number of channels,
H is the height of the feature, and W is the width of the feature.
Parameters (dilations, strides, paddings) each contain two elements, which
represent height and width, respectively. For details of the convolution transpose
layer, please refer to the following explanation and the references <a class="reference external" href="http://www.matthewzeiler.com/wp-content/uploads/2017/07/cvpr2010.pdf">therein</a>.</p>
<li><spanclass="math">\(Out\)</span>: Output value, the shape of <spanclass="math">\(Out\)</span> and <spanclass="math">\(X\)</span> may be different.</li>
<li><dlclass="first docutils">
<dt><spanclass="math">\(Out\)</span>: Output value, the shape of <spanclass="math">\(Out\)</span> and <spanclass="math">\(X\)</span> may be</dt>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If the shapes of input, filter_size, stride, padding and groups mismatch.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If the shapes of input, filter_size, stride, padding and
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– The ranks of <strong>x_t</strong>, <strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong> not be 2 or the 1st dimensions of <strong>x_t</strong>, <strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong> not be the same or the 2nd dimensions of <strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong> not be the same.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– The ranks of <strong>x_t</strong>, <strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong>
not be 2 or the 1st dimensions of <strong>x_t</strong>, <strong>hidden_t_prev</strong>
and <strong>cell_t_prev</strong> not be the same or the 2nd dimensions of
<strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong> not be the same.</p>
</td>
</tr>
</tbody>
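<p>A minimal sketch of calling the LSTM unit constrained as above (the dimensions and
variable names are illustrative assumptions):</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>import paddle.fluid as fluid

# x_t: input of the current step; hidden/cell carried over from the previous step.
x_t = fluid.layers.data(name='x_t', shape=[32], dtype='float32')
prev_hidden = fluid.layers.data(name='prev_hidden', shape=[64], dtype='float32')
prev_cell = fluid.layers.data(name='prev_cell', shape=[64], dtype='float32')
hidden_t, cell_t = fluid.layers.lstm_unit(x_t=x_t,
                                          hidden_t_prev=prev_hidden,
                                          cell_t_prev=prev_cell)
</pre></div></div>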
...
...
@@ -2706,9 +2748,9 @@ will be named automatically.</li>
<spanclass="n">fluid</span><spanclass="o">.</span><spanclass="n">layers</span><spanclass="o">.</span><spanclass="n">matmul</span><spanclass="p">(</span><spanclass="n">x</span><spanclass="p">,</span><spanclass="n">y</span><spanclass="p">)</span><spanclass="c1"># out: [B, ..., M, N]</span>
<spanclass="c1"># x: [B, M, K], y: [B, K, N]</span>
<spanclass="n">fluid</span><spanclass="o">.</span><spanclass="n">layers</span><spanclass="o">.</span><spanclass="n">matmul</span><spanclass="p">(</span><spanclass="n">x</span><spanclass="p">,</span><spanclass="n">y</span><spanclass="p">)</span><spanclass="c1"># out: [B, M, N]</span>
<spanclass="c1"># x: [B, M, K], y: [K, N]</span>
<spanclass="n">fluid</span><spanclass="o">.</span><spanclass="n">layers</span><spanclass="o">.</span><spanclass="n">matmul</span><spanclass="p">(</span><spanclass="n">x</span><spanclass="p">,</span><spanclass="n">y</span><spanclass="p">)</span><spanclass="c1"># out: [B, M, N]</span>
<li><strong>input</strong> (<em>Variable</em>) – (LoDTensor<float>), the probabilities of variable-length sequences, which is a 2-D Tensor with LoD information. Its shape is [Lp, num_classes + 1], where Lp is the sum of all input sequences’ length and num_classes is the true number of classes (not including the blank label).</li>
<li><strong>blank</strong> (<em>int</em>) – the blank label index of Connectionist Temporal Classification (CTC) loss, which is in the half-open interval [0, num_classes + 1).</li>
</ul>
</td>
</tr>
...
...
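<p>A minimal sketch of feeding the CTC inputs described above (the shapes, names and
data layout are illustrative assumptions):</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>import paddle.fluid as fluid

# 8 real classes plus one blank label; both inputs carry LoD information.
probs = fluid.layers.data(name='probs', shape=[9], dtype='float32', lod_level=1)
label = fluid.layers.data(name='label', shape=[1], dtype='int32', lod_level=1)
cost = fluid.layers.warpctc(input=probs, label=label, blank=0)
</pre></div></div>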
@@ -3609,7 +3669,7 @@ will be named automatically.</li>
<li><strong>queries</strong> (<em>Variable</em>) – The input variable which should be a 3-D Tensor.</li>
<li><strong>keys</strong> (<em>Variable</em>) – The input variable which should be a 3-D Tensor.</li>
<li><strong>values</strong> (<em>Variable</em>) – The input variable which should be a 3-D Tensor.</li>
<li><strong>num_heads</strong> (<em>int</em>) – Head number to compute the scaled dot product
attention. Default value is 1.</li>
<li><strong>dropout_rate</strong> (<em>float</em>) – The dropout rate to drop the attention weight.
Default value is 0.</li>
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">Returns:</th><tdclass="field-body"><pclass="first">The Tensor variables representing the output and attention scores.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If input queries, keys, values are not 3-D Tensors.</p>
</td>
</tr>
</tbody>
</table>
<divclass="admonition note">
<pclass="first admonition-title">Note</p>
<p>1. When num_heads > 1, three linear projections are learned respectively
to map input queries, keys and values into queries’, keys’ and values’.
queries’, keys’ and values’ have the same shapes as queries, keys
and values.</p>
<pclass="last">1. When num_heads == 1, scaled_dot_product_attention has no learnable
parameters.</p>
</div>
<pclass="rubric">Examples</p>
<divclass="highlight-python"><divclass="highlight"><pre><span></span><spanclass="c1"># Suppose q, k, v are tensor variables with the following shape:</span>
<divclass="highlight-python"><divclass="highlight"><pre><span></span><spanclass="c1"># Suppose q, k, v are Tensors with the following shape:</span>
"comment":"\nReshape Operator.\n\nReshape Input(X) into the shape specified by Attr(shape).\n\nAn example:\nGiven a 2-D tensor X with 2 rows and 2 columns\n\n [[1, 2], [3, 4]]\n\nand target shape = [1, 4], the reshape operator will transform\nthe tensor X into a 2-D tensor:\n\n [[1, 2, 3, 4]]\n\nOne dimension in the target shape can be set -1, representing that its\nsize is unknown. In this case, the real dimension will be infered from \nthe original shape of Input(X) and other dimensions in the target shape.\n",
"comment":"\nReshape Operator.\n\nReshape Input(X) into the shape specified by Attr(shape).\n\nAn example:\nGiven a 2-D tensor X with 2 rows and 2 columns : [[1, 2], [3, 4]]\n\nand target shape = [1, 4], the reshape operator will transform\nthe tensor X into a 2-D tensor: [[1, 2, 3, 4]]\n\nOne dimension in the target shape can be set -1, representing that its\nsize is unknown. In this case, the real dimension will be infered from \nthe original shape of Input(X) and other dimensions in the target shape.\n",
<li><strong>input</strong> (<em>Variable|list</em>) – a 2-D tensor with shape [N x D], where N is the
batch size and D is the number of classes. This input is a probability
computed by the previous operator, which is almost always the result
of a softmax operator.</li>
<li><strong>label</strong> (<em>Variable|list</em>) – the ground truth which is a 2-D tensor. When
<cite>soft_label</cite> is set to <cite>False</cite>, <cite>label</cite> is a tensor<int64> with shape
[N x 1]. When <cite>soft_label</cite> is set to <cite>True</cite>, <cite>label</cite> is a
tensor<float/double> with shape [N x D].</li>
<li><strong>soft_label</strong> (bool, via <cite>**kwargs</cite>) – a flag indicating whether to interpret
the given labels as soft labels, default <cite>False</cite>.</li>
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">返回:</th><tdclass="field-body"><pclass="first">A 2-D tensor with shape [N x 1], the cross entropy loss.</p>
</td>
</tr>
<trclass="field-odd field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><cite>ValueError</cite>– 1) the 1st dimension of <cite>input</cite> and <cite>label</cite> are not equal; 2) when <cite>soft_label == True</cite>, and the 2nd dimension of <cite>input</cite> and <cite>label</cite> are not equal; 3) when <cite>soft_label == False</cite>, and the 2nd dimension of <cite>label</cite> is not 1.</p>
<trclass="field-odd field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first"><cite>ValueError</cite>– 1) the 1st dimension of <cite>input</cite> and <cite>label</cite> are not equal.
2) when <cite>soft_label == True</cite>, and the 2nd dimension of</p>
<blockquote>
<div><p><cite>input</cite> and <cite>label</cite> are not equal.</p>
</div></blockquote>
<olclass="last arabic simple"start="3">
<li>when <cite>soft_label == False</cite>, and the 2nd dimension of
<cite>label</cite> is not 1.</li>
</ol>
</td>
</tr>
</tbody>
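<p>A minimal usage sketch (the variable names, shapes and surrounding calls below are
illustrative assumptions, not part of this layer's specification):</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>import paddle.fluid as fluid

# Hypothetical setup: predictions over 10 classes and matching integer labels.
predict = fluid.layers.data(name='predict', shape=[10], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
prob = fluid.layers.softmax(predict)
# Hard labels (soft_label=False, the default); the result has shape [N x 1].
cost = fluid.layers.cross_entropy(input=prob, label=label)
</pre></div></div>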
...
...
@@ -1296,8 +1305,9 @@ the given labels as soft labels, default <cite>False</cite>.</li>
<p>This layer accepts input predictions and target label and returns the squared error cost.
For predictions, <span class="math">\(X\)</span>, and target labels, <span class="math">\(Y\)</span>, the equation is:</p>
<divclass="math">
\[Out = (X - Y)^2\]</div>
<p>In the above equation:</p>
...
...
@@ -1318,7 +1328,12 @@ For predictions, <span class="math">\(X\)</span>, and target labels, <span class
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">返回:</th><tdclass="field-body"><pclass="first">The tensor variable storing the element-wise squared error difference of input and label.</p>
<li><spanclass="math">\(Out\)</span>: Output value, the shape of <spanclass="math">\(Out\)</span> and <spanclass="math">\(X\)</span> may be different.</li>
<li><dlclass="first docutils">
<dt><spanclass="math">\(Out\)</span>: Output value, the shape of <spanclass="math">\(Out\)</span> and <spanclass="math">\(X\)</span> may be</dt>
<dd>different.</dd>
</dl>
</li>
</ul>
<pclass="rubric">Example</p>
<ul>
...
...
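<p>A minimal sketch of using the squared error cost described above (variable names
and shapes are illustrative assumptions):</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[1], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
# Element-wise (x - y)^2.
cost = fluid.layers.square_error_cost(input=x, label=y)
</pre></div></div>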
@@ -1426,20 +1446,28 @@ library is installed. Default: True</li>
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">返回:</th><tdclass="field-body"><pclass="first">The tensor variable storing the convolution and non-linearity activation result.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If the shapes of input, filter_size, stride, padding and groups mismatch.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If the shapes of input, filter_size, stride, padding and
@@ -2177,7 +2205,8 @@ are in NCHW format. Where N is batch size, C is the number of channels,
H is the height of the feature, and W is the width of the feature.
Parameters (dilations, strides, paddings) each contain two elements, which
represent height and width, respectively. For details of the convolution transpose
layer, please refer to the following explanation and the references <a class="reference external" href="http://www.matthewzeiler.com/wp-content/uploads/2017/07/cvpr2010.pdf">therein</a>.</p>
<li><spanclass="math">\(Out\)</span>: Output value, the shape of <spanclass="math">\(Out\)</span> and <spanclass="math">\(X\)</span> may be different.</li>
<li><dlclass="first docutils">
<dt><spanclass="math">\(Out\)</span>: Output value, the shape of <spanclass="math">\(Out\)</span> and <spanclass="math">\(X\)</span> may be</dt>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If the shapes of input, filter_size, stride, padding and groups mismatch.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If the shapes of input, filter_size, stride, padding and
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– The ranks of <strong>x_t</strong>, <strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong> not be 2 or the 1st dimensions of <strong>x_t</strong>, <strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong> not be the same or the 2nd dimensions of <strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong> not be the same.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– The ranks of <strong>x_t</strong>, <strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong>
not be 2 or the 1st dimensions of <strong>x_t</strong>, <strong>hidden_t_prev</strong>
and <strong>cell_t_prev</strong> not be the same or the 2nd dimensions of
<strong>hidden_t_prev</strong> and <strong>cell_t_prev</strong> not be the same.</p>
</td>
</tr>
</tbody>
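<p>A minimal sketch of calling the LSTM unit constrained as above (the dimensions and
variable names are illustrative assumptions):</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>import paddle.fluid as fluid

# x_t: input of the current step; hidden/cell carried over from the previous step.
x_t = fluid.layers.data(name='x_t', shape=[32], dtype='float32')
prev_hidden = fluid.layers.data(name='prev_hidden', shape=[64], dtype='float32')
prev_cell = fluid.layers.data(name='prev_cell', shape=[64], dtype='float32')
hidden_t, cell_t = fluid.layers.lstm_unit(x_t=x_t,
                                          hidden_t_prev=prev_hidden,
                                          cell_t_prev=prev_cell)
</pre></div></div>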
...
...
@@ -2725,9 +2767,9 @@ will be named automatically.</li>
<spanclass="n">fluid</span><spanclass="o">.</span><spanclass="n">layers</span><spanclass="o">.</span><spanclass="n">matmul</span><spanclass="p">(</span><spanclass="n">x</span><spanclass="p">,</span><spanclass="n">y</span><spanclass="p">)</span><spanclass="c1"># out: [B, ..., M, N]</span>
<spanclass="c1"># x: [B, M, K], y: [B, K, N]</span>
<spanclass="n">fluid</span><spanclass="o">.</span><spanclass="n">layers</span><spanclass="o">.</span><spanclass="n">matmul</span><spanclass="p">(</span><spanclass="n">x</span><spanclass="p">,</span><spanclass="n">y</span><spanclass="p">)</span><spanclass="c1"># out: [B, M, N]</span>
<spanclass="c1"># x: [B, M, K], y: [K, N]</span>
<spanclass="n">fluid</span><spanclass="o">.</span><spanclass="n">layers</span><spanclass="o">.</span><spanclass="n">matmul</span><spanclass="p">(</span><spanclass="n">x</span><spanclass="p">,</span><spanclass="n">y</span><spanclass="p">)</span><spanclass="c1"># out: [B, M, N]</span>
<li><strong>input</strong> (<em>Variable</em>) – (LoDTensor<float>), the probabilities of variable-length sequences, which is a 2-D Tensor with LoD information. Its shape is [Lp, num_classes + 1], where Lp is the sum of all input sequences’ length and num_classes is the true number of classes (not including the blank label).</li>
<li><strong>blank</strong> (<em>int</em>) – the blank label index of Connectionist Temporal Classification (CTC) loss, which is in the half-open interval [0, num_classes + 1).</li>
</ul>
</td>
</tr>
...
...
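<p>A minimal sketch of feeding the CTC inputs described above (the shapes, names and
data layout are illustrative assumptions):</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>import paddle.fluid as fluid

# 8 real classes plus one blank label; both inputs carry LoD information.
probs = fluid.layers.data(name='probs', shape=[9], dtype='float32', lod_level=1)
label = fluid.layers.data(name='label', shape=[1], dtype='int32', lod_level=1)
cost = fluid.layers.warpctc(input=probs, label=label, blank=0)
</pre></div></div>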
@@ -3628,7 +3688,7 @@ will be named automatically.</li>
<li><strong>queries</strong> (<em>Variable</em>) – The input variable which should be a 3-D Tensor.</li>
<li><strong>keys</strong> (<em>Variable</em>) – The input variable which should be a 3-D Tensor.</li>
<li><strong>values</strong> (<em>Variable</em>) – The input variable which should be a 3-D Tensor.</li>
<li><strong>num_heads</strong> (<em>int</em>) – Head number to compute the scaled dot product
attention. Default value is 1.</li>
<li><strong>dropout_rate</strong> (<em>float</em>) – The dropout rate to drop the attention weight.
Default value is 0.</li>
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">返回:</th><tdclass="field-body"><pclass="first">The Tensor variables representing the output and attention scores.</p>
<trclass="field-even field"><thclass="field-name">Raises:</th><tdclass="field-body"><pclass="first last"><codeclass="xref py py-exc docutils literal"><spanclass="pre">ValueError</span></code>– If input queries, keys, values are not 3-D Tensors.</p>
</td>
</tr>
</tbody>
</table>
<divclass="admonition note">
<pclass="first admonition-title">注解</p>
<p>1. When num_heads > 1, three linear projections are learned respectively
to map input queries, keys and values into queries’, keys’ and values’.
queries’, keys’ and values’ have the same shapes as queries, keys
and values.</p>
<pclass="last">1. When num_heads == 1, scaled_dot_product_attention has no learnable
parameters.</p>
</div>
<pclass="rubric">Examples</p>
<divclass="highlight-python"><divclass="highlight"><pre><span></span><spanclass="c1"># Suppose q, k, v are tensor variables with the following shape:</span>
<divclass="highlight-python"><divclass="highlight"><pre><span></span><spanclass="c1"># Suppose q, k, v are Tensors with the following shape:</span>