Deploy to GitHub Pages: d011514e

96bbb20e · Travis CI · b603c182
6 changed files
--- a/develop/doc/api/v2/config/layer.html
+++ b/develop/doc/api/v2/config/layer.html
@@ -1027,6 +1027,7 @@ more details about LSTM.</p>
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
 <li><strong>name</strong> (<em>basestring</em>) &#8211; The lstmemory layer name.</li>
+<li><strong>size</strong> (<em>int</em>) &#8211; DEPRECATED. size of the lstm cell</li>
 <li><strong>input</strong> (<em>paddle.v2.config_base.Layer</em>) &#8211; input layer name.</li>
 <li><strong>reverse</strong> (<em>bool</em>) &#8211; is sequence process reversed or not.</li>
 <li><strong>act</strong> (<em>paddle.v2.activation.Base</em>) &#8211; activation type, paddle.v2.activation.Tanh by default. <span class="math">\(h_t\)</span></li>
@@ -1093,6 +1094,7 @@ Recurrent Neural Networks on Sequence Modeling.</a></p>
 <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
 <li><strong>name</strong> (<em>None|basestring</em>) &#8211; The gru layer name.</li>
 <li><strong>input</strong> (<em>paddle.v2.config_base.Layer.</em>) &#8211; input layer.</li>
+<li><strong>size</strong> (<em>int</em>) &#8211; DEPRECATED. size of the gru cell</li>
 <li><strong>reverse</strong> (<em>bool</em>) &#8211; Whether sequence process is reversed or not.</li>
 <li><strong>act</strong> (<em>paddle.v2.activation.Base</em>) &#8211; activation type, paddle.v2.activation.Tanh by default. This activation
 affects the <span class="math">\({\tilde{h_t}}\)</span>.</li>
@@ -1103,8 +1105,6 @@ This activation affects the <span class="math">\(z_t\)</span> and <span class="m
 bias.</li>
 <li><strong>param_attr</strong> (<em>paddle.v2.attr.ParameterAttribute|None|False</em>) &#8211; Parameter Attribute.</li>
 <li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttributeNone</em>) &#8211; Extra Layer attribute</li>
-<li><strong>size</strong> (<em>None</em>) &#8211; Stub parameter of size, but actually not used. If set this size
-will get a warning.</li>
 </ul>
 </td>
 </tr>
@@ -1259,18 +1259,18 @@ be paddle.v2.config_base.Layer.</li>
 <dl class="class">
 <dt>
 <em class="property">class </em><code class="descclassname">paddle.v2.layer.</code><code class="descname">lstm_step</code></dt>
-<dd><p>LSTM Step Layer. It used in recurrent_group. The lstm equations are shown
-as follow.</p>
+<dd><p>LSTM Step Layer. This function is used only in recurrent_group.
+The lstm equations are shown as follows.</p>
 <div class="math">
-\[ \begin{align}\begin{aligned}i_t &amp; = \sigma(W_{xi}x_{t} + W_{hi}h_{t-1} + W_{ci}c_{t-1} + b_i)\\f_t &amp; = \sigma(W_{xf}x_{t} + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)\\c_t &amp; = f_tc_{t-1} + i_t tanh (W_{xc}x_t+W_{hc}h_{t-1} + b_c)\\o_t &amp; = \sigma(W_{xo}x_{t} + W_{ho}h_{t-1} + W_{co}c_t + b_o)\\h_t &amp; = o_t tanh(c_t)\end{aligned}\end{align} \]</div>
+\[ \begin{align}\begin{aligned}i_t &amp; = \sigma(W_{x_i}x_{t} + W_{h_i}h_{t-1} + W_{c_i}c_{t-1} + b_i)\\f_t &amp; = \sigma(W_{x_f}x_{t} + W_{h_f}h_{t-1} + W_{c_f}c_{t-1} + b_f)\\c_t &amp; = f_tc_{t-1} + i_t tanh (W_{x_c}x_t+W_{h_c}h_{t-1} + b_c)\\o_t &amp; = \sigma(W_{x_o}x_{t} + W_{h_o}h_{t-1} + W_{c_o}c_t + b_o)\\h_t &amp; = o_t tanh(c_t)\end{aligned}\end{align} \]</div>
 <p>The input of lstm step is <span class="math">\(Wx_t + Wh_{t-1}\)</span>, and user should use
 <code class="code docutils literal"><span class="pre">mixed</span></code> and <code class="code docutils literal"><span class="pre">full_matrix_projection</span></code> to calculate these
-input vector.</p>
+input vectors.</p>
 <p>The state of lstm step is <span class="math">\(c_{t-1}\)</span>. And lstm step layer will do</p>
 <div class="math">
 \[ \begin{align}\begin{aligned}i_t = \sigma(input + W_{ci}c_{t-1} + b_i)\\...\end{aligned}\end{align} \]</div>
-<p>This layer contains two outputs. Default output is <span class="math">\(h_t\)</span>. The other
-output is <span class="math">\(o_t\)</span>, which name is &#8216;state&#8217; and can use
+<p>This layer has two outputs. Default output is <span class="math">\(h_t\)</span>. The other
+output is <span class="math">\(o_t\)</span>, whose name is &#8216;state&#8217; and can use
 <code class="code docutils literal"><span class="pre">get_output</span></code> to extract this output.</p>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
@@ -1278,8 +1278,8 @@ output is <span class="math">\(o_t\)</span>, which name is &#8216;state&#8217; a
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
 <li><strong>name</strong> (<em>basestring</em>) &#8211; Layer&#8217;s name.</li>
-<li><strong>size</strong> (<em>int</em>) &#8211; Layer&#8217;s size. NOTE: lstm layer&#8217;s size, should be equal as
-<code class="code docutils literal"><span class="pre">input.size/4</span></code>, and should be equal as
+<li><strong>size</strong> (<em>int</em>) &#8211; Layer&#8217;s size. NOTE: lstm layer&#8217;s size, should be equal to
+<code class="code docutils literal"><span class="pre">input.size/4</span></code>, and should be equal to
 <code class="code docutils literal"><span class="pre">state.size</span></code>.</li>
 <li><strong>input</strong> (<em>paddle.v2.config_base.Layer</em>) &#8211; input layer. <span class="math">\(Wx_t + Wh_{t-1}\)</span></li>
 <li><strong>state</strong> (<em>paddle.v2.config_base.Layer</em>) &#8211; State Layer. <span class="math">\(c_{t-1}\)</span></li>
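
Reviewer note: the size constraint documented above (size equal to input.size/4 and to state.size) is easier to see in code. The following is a minimal pure-Python sketch of the lstm_step equations from this hunk, not PaddlePaddle's implementation. The function name, the argument names, the gate ordering inside the projected input, and the use of per-unit (diagonal) peephole weights are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step_sketch(projected_input, prev_cell, peephole, bias):
    """One LSTM time step on plain Python lists (illustrative only).

    projected_input: length 4 * size, the precomputed W x_t + W h_{t-1}.
        Assumed layout: [input gate | forget gate | candidate | output gate].
    prev_cell: c_{t-1}, length size.
    peephole: dict with keys 'i', 'f', 'o' holding per-unit peephole weights.
    bias: dict with keys 'i', 'f', 'c', 'o' holding per-unit biases.
    Returns (h_t, c_t).
    """
    size = len(prev_cell)
    if len(projected_input) != 4 * size:
        raise ValueError("projected_input must have length 4 * size")

    # Split the projected input into the four per-gate slices.
    in_i = projected_input[0:size]
    in_f = projected_input[size:2 * size]
    in_c = projected_input[2 * size:3 * size]
    in_o = projected_input[3 * size:4 * size]

    h_t, c_t = [], []
    for k in range(size):
        i = sigmoid(in_i[k] + peephole['i'][k] * prev_cell[k] + bias['i'][k])
        f = sigmoid(in_f[k] + peephole['f'][k] * prev_cell[k] + bias['f'][k])
        c = f * prev_cell[k] + i * math.tanh(in_c[k] + bias['c'][k])
        # The output gate peeks at the new cell state c_t, not c_{t-1}.
        o = sigmoid(in_o[k] + peephole['o'][k] * c + bias['o'][k])
        h_t.append(o * math.tanh(c))
        c_t.append(c)
    return h_t, c_t
```

With size = 2 the projected input must have 8 entries, which mirrors the input.size/4 rule in the parameter description. The input-to-hidden and hidden-to-hidden products are deliberately left outside the function, matching the documented expectation that they are computed beforehand with mixed and full_matrix_projection.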

--- a/develop/doc/api/v2/config/networks.html
+++ b/develop/doc/api/v2/config/networks.html
@@ -452,16 +452,16 @@ False if no bias.</li>
 <dl class="function">
 <dt>
 <code class="descclassname">paddle.v2.networks.</code><code class="descname">lstmemory_unit</code><span class="sig-paren">(</span><em>*args</em>, <em>**kwargs</em><span class="sig-paren">)</span></dt>
-<dd><p>Define calculations that a LSTM unit performs in a single time step.
-This function itself is not a recurrent layer, so that it can not be
-directly applied to sequence input. This function is always used in
+<dd><p>Define calculations that a LSTM unit performs during a single time step.
+This function itself is not a recurrent layer, so it can not be
+directly used to process sequence inputs. This function is always used in
 recurrent_group (see layers.py for more details) to implement attention
 mechanism.</p>
 <p>Please refer to  <strong>Generating Sequences With Recurrent Neural Networks</strong>
 for more details about LSTM. The link goes as follows:
 .. _Link: <a class="reference external" href="https://arxiv.org/abs/1308.0850">https://arxiv.org/abs/1308.0850</a></p>
 <div class="math">
-\[ \begin{align}\begin{aligned}i_t &amp; = \sigma(W_{xi}x_{t} + W_{hi}h_{t-1} + W_{ci}c_{t-1} + b_i)\\f_t &amp; = \sigma(W_{xf}x_{t} + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)\\c_t &amp; = f_tc_{t-1} + i_t tanh (W_{xc}x_t+W_{hc}h_{t-1} + b_c)\\o_t &amp; = \sigma(W_{xo}x_{t} + W_{ho}h_{t-1} + W_{co}c_t + b_o)\\h_t &amp; = o_t tanh(c_t)\end{aligned}\end{align} \]</div>
+\[ \begin{align}\begin{aligned}i_t &amp; = \sigma(W_{x_i}x_{t} + W_{h_i}h_{t-1} + W_{c_i}c_{t-1} + b_i)\\f_t &amp; = \sigma(W_{x_f}x_{t} + W_{h_f}h_{t-1} + W_{c_f}c_{t-1} + b_f)\\c_t &amp; = f_tc_{t-1} + i_t tanh (W_{x_c}x_t+W_{h_c}h_{t-1} + b_c)\\o_t &amp; = \sigma(W_{x_o}x_{t} + W_{h_o}h_{t-1} + W_{c_o}c_t + b_o)\\h_t &amp; = o_t tanh(c_t)\end{aligned}\end{align} \]</div>
 <p>The example usage is:</p>
 <div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">lstm_step</span> <span class="o">=</span> <span class="n">lstmemory_unit</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="p">[</span><span class="n">layer1</span><span class="p">],</span>
                           <span class="n">size</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span>
@@ -476,6 +476,7 @@ for more details about LSTM. The link goes as follows:
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
 <li><strong>input</strong> (<em>LayerOutput</em>) &#8211; input layer name.</li>
+<li><strong>memory_boot</strong> (<em>LayerOutput | None</em>) &#8211; the initialization state of the LSTM cell.</li>
 <li><strong>name</strong> (<em>basestring</em>) &#8211; lstmemory unit name.</li>
 <li><strong>size</strong> (<em>int</em>) &#8211; lstmemory unit size.</li>
 <li><strong>param_attr</strong> (<em>ParameterAttribute</em>) &#8211; Parameter config, None if use default.</li>
@@ -508,7 +509,7 @@ False means no bias, None means default bias.</li>
 <dl class="function">
 <dt>
 <code class="descclassname">paddle.v2.networks.</code><code class="descname">lstmemory_group</code><span class="sig-paren">(</span><em>*args</em>, <em>**kwargs</em><span class="sig-paren">)</span></dt>
-<dd><p>lstm_group is a recurrent layer group version of Long Short Term Memory. It
+<dd><p>lstm_group is a recurrent_group version of Long Short Term Memory. It
 does exactly the same calculation as the lstmemory layer (see lstmemory in
 layers.py for the maths) does. A promising benefit is that LSTM memory
 cell states, or hidden states in every time step are accessible to the
@@ -518,8 +519,8 @@ it is recommended to use the lstmemory, which is relatively faster than
 lstmemory_group.</p>
 <p>NOTE: In PaddlePaddle&#8217;s implementation, the following input-to-hidden
 multiplications:
-<span class="math">\(W_{xi}x_{t}\)</span> , <span class="math">\(W_{xf}x_{t}\)</span>,
-<span class="math">\(W_{xc}x_t\)</span>, <span class="math">\(W_{xo}x_{t}\)</span> are not done in lstmemory_unit to
+<span class="math">\(W_{x_i}x_{t}\)</span> , <span class="math">\(W_{x_f}x_{t}\)</span>,
+<span class="math">\(W_{x_c}x_t\)</span>, <span class="math">\(W_{x_o}x_{t}\)</span> are not done in lstmemory_unit to
 speed up the calculations. Consequently, an additional mixed_layer with
 full_matrix_projection must be included before lstmemory_unit is called.</p>
 <p>The example usage is:</p>
@@ -536,8 +537,9 @@ full_matrix_projection must be included before lstmemory_unit is called.</p>
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
 <li><strong>input</strong> (<em>LayerOutput</em>) &#8211; input layer name.</li>
-<li><strong>name</strong> (<em>basestring</em>) &#8211; lstmemory group name.</li>
 <li><strong>size</strong> (<em>int</em>) &#8211; lstmemory group size.</li>
+<li><strong>name</strong> (<em>basestring</em>) &#8211; name of the lstmemory group.</li>
+<li><strong>memory_boot</strong> (<em>LayerOutput | None</em>) &#8211; the initialization state of LSTM cell.</li>
 <li><strong>reverse</strong> (<em>bool</em>) &#8211; is lstm reversed</li>
 <li><strong>param_attr</strong> (<em>ParameterAttribute</em>) &#8211; Parameter config, None if use default.</li>
 <li><strong>act</strong> (<em>BaseActivation</em>) &#8211; lstm final activiation type</li>
@@ -662,8 +664,8 @@ concatenated and returned.</li>
 <dt>
 <code class="descclassname">paddle.v2.networks.</code><code class="descname">gru_unit</code><span class="sig-paren">(</span><em>*args</em>, <em>**kwargs</em><span class="sig-paren">)</span></dt>
 <dd><p>Define calculations that a gated recurrent unit performs in a single time
-step. This function itself is not a recurrent layer, so that it can not be
-directly applied to sequence input. This function is almost always used in
+step. This function itself is not a recurrent layer, so it can not be
+directly used to process sequence inputs. This function is always used in
 the recurrent_group (see layers.py for more details) to implement attention
 mechanism.</p>
 <p>Please see grumemory in layers.py for the details about the maths.</p>
@@ -673,6 +675,7 @@ mechanism.</p>
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
 <li><strong>input</strong> (<em>LayerOutput</em>) &#8211; input layer name.</li>
+<li><strong>memory_boot</strong> (<em>LayerOutput | None</em>) &#8211; the initial hidden state of the GRU.</li>
 <li><strong>name</strong> (<em>basestring</em>) &#8211; name of the gru group.</li>
 <li><strong>size</strong> (<em>int</em>) &#8211; hidden size of the gru.</li>
 <li><strong>act</strong> (<em>BaseActivation</em>) &#8211; type of the activation</li>
@@ -697,7 +700,7 @@ mechanism.</p>
 <dl class="function">
 <dt>
 <code class="descclassname">paddle.v2.networks.</code><code class="descname">gru_group</code><span class="sig-paren">(</span><em>*args</em>, <em>**kwargs</em><span class="sig-paren">)</span></dt>
-<dd><p>gru_group is a recurrent layer group version of Gated Recurrent Unit. It
+<dd><p>gru_group is a recurrent_group version of Gated Recurrent Unit. It
 does exactly the same calculation as the grumemory layer does. A promising
 benefit is that gru hidden states are accessible to the user. This is
 especially useful in attention model. If you do not need to access
@@ -717,6 +720,7 @@ to use the grumemory, which is relatively faster.</p>
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
 <li><strong>input</strong> (<em>LayerOutput</em>) &#8211; input layer name.</li>
+<li><strong>memory_boot</strong> (<em>LayerOutput | None</em>) &#8211; the initial hidden state of the GRU.</li>
 <li><strong>name</strong> (<em>basestring</em>) &#8211; name of the gru group.</li>
 <li><strong>size</strong> (<em>int</em>) &#8211; hidden size of the gru.</li>
 <li><strong>reverse</strong> (<em>bool</em>) &#8211; whether to process the input data in a reverse order</li>

--- a/develop/doc/searchindex.js
+++ b/develop/doc/searchindex.js
--- a/develop/doc_cn/api/v2/config/layer.html
+++ b/develop/doc_cn/api/v2/config/layer.html
@@ -1032,6 +1032,7 @@ more details about LSTM.</p>
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">参数:</th><td class="field-body"><ul class="first simple">
 <li><strong>name</strong> (<em>basestring</em>) &#8211; The lstmemory layer name.</li>
+<li><strong>size</strong> (<em>int</em>) &#8211; DEPRECATED. size of the lstm cell</li>
 <li><strong>input</strong> (<em>paddle.v2.config_base.Layer</em>) &#8211; input layer name.</li>
 <li><strong>reverse</strong> (<em>bool</em>) &#8211; is sequence process reversed or not.</li>
 <li><strong>act</strong> (<em>paddle.v2.activation.Base</em>) &#8211; activation type, paddle.v2.activation.Tanh by default. <span class="math">\(h_t\)</span></li>
@@ -1098,6 +1099,7 @@ Recurrent Neural Networks on Sequence Modeling.</a></p>
 <tr class="field-odd field"><th class="field-name">参数:</th><td class="field-body"><ul class="first simple">
 <li><strong>name</strong> (<em>None|basestring</em>) &#8211; The gru layer name.</li>
 <li><strong>input</strong> (<em>paddle.v2.config_base.Layer.</em>) &#8211; input layer.</li>
+<li><strong>size</strong> (<em>int</em>) &#8211; DEPRECATED. size of the gru cell</li>
 <li><strong>reverse</strong> (<em>bool</em>) &#8211; Whether sequence process is reversed or not.</li>
 <li><strong>act</strong> (<em>paddle.v2.activation.Base</em>) &#8211; activation type, paddle.v2.activation.Tanh by default. This activation
 affects the <span class="math">\({\tilde{h_t}}\)</span>.</li>
@@ -1108,8 +1110,6 @@ This activation affects the <span class="math">\(z_t\)</span> and <span class="m
 bias.</li>
 <li><strong>param_attr</strong> (<em>paddle.v2.attr.ParameterAttribute|None|False</em>) &#8211; Parameter Attribute.</li>
 <li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttributeNone</em>) &#8211; Extra Layer attribute</li>
-<li><strong>size</strong> (<em>None</em>) &#8211; Stub parameter of size, but actually not used. If set this size
-will get a warning.</li>
 </ul>
 </td>
 </tr>
@@ -1264,18 +1264,18 @@ be paddle.v2.config_base.Layer.</li>
 <dl class="class">
 <dt>
 <em class="property">class </em><code class="descclassname">paddle.v2.layer.</code><code class="descname">lstm_step</code></dt>
-<dd><p>LSTM Step Layer. It used in recurrent_group. The lstm equations are shown
-as follow.</p>
+<dd><p>LSTM Step Layer. This function is used only in recurrent_group.
+The lstm equations are shown as follows.</p>
 <div class="math">
-\[ \begin{align}\begin{aligned}i_t &amp; = \sigma(W_{xi}x_{t} + W_{hi}h_{t-1} + W_{ci}c_{t-1} + b_i)\\f_t &amp; = \sigma(W_{xf}x_{t} + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)\\c_t &amp; = f_tc_{t-1} + i_t tanh (W_{xc}x_t+W_{hc}h_{t-1} + b_c)\\o_t &amp; = \sigma(W_{xo}x_{t} + W_{ho}h_{t-1} + W_{co}c_t + b_o)\\h_t &amp; = o_t tanh(c_t)\end{aligned}\end{align} \]</div>
+\[ \begin{align}\begin{aligned}i_t &amp; = \sigma(W_{x_i}x_{t} + W_{h_i}h_{t-1} + W_{c_i}c_{t-1} + b_i)\\f_t &amp; = \sigma(W_{x_f}x_{t} + W_{h_f}h_{t-1} + W_{c_f}c_{t-1} + b_f)\\c_t &amp; = f_tc_{t-1} + i_t tanh (W_{x_c}x_t+W_{h_c}h_{t-1} + b_c)\\o_t &amp; = \sigma(W_{x_o}x_{t} + W_{h_o}h_{t-1} + W_{c_o}c_t + b_o)\\h_t &amp; = o_t tanh(c_t)\end{aligned}\end{align} \]</div>
 <p>The input of lstm step is <span class="math">\(Wx_t + Wh_{t-1}\)</span>, and user should use
 <code class="code docutils literal"><span class="pre">mixed</span></code> and <code class="code docutils literal"><span class="pre">full_matrix_projection</span></code> to calculate these
-input vector.</p>
+input vectors.</p>
 <p>The state of lstm step is <span class="math">\(c_{t-1}\)</span>. And lstm step layer will do</p>
 <div class="math">
 \[ \begin{align}\begin{aligned}i_t = \sigma(input + W_{ci}c_{t-1} + b_i)\\...\end{aligned}\end{align} \]</div>
-<p>This layer contains two outputs. Default output is <span class="math">\(h_t\)</span>. The other
-output is <span class="math">\(o_t\)</span>, which name is &#8216;state&#8217; and can use
+<p>This layer has two outputs. Default output is <span class="math">\(h_t\)</span>. The other
+output is <span class="math">\(o_t\)</span>, whose name is &#8216;state&#8217; and can use
 <code class="code docutils literal"><span class="pre">get_output</span></code> to extract this output.</p>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
@@ -1283,8 +1283,8 @@ output is <span class="math">\(o_t\)</span>, which name is &#8216;state&#8217; a
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">参数:</th><td class="field-body"><ul class="first simple">
 <li><strong>name</strong> (<em>basestring</em>) &#8211; Layer&#8217;s name.</li>
-<li><strong>size</strong> (<em>int</em>) &#8211; Layer&#8217;s size. NOTE: lstm layer&#8217;s size, should be equal as
-<code class="code docutils literal"><span class="pre">input.size/4</span></code>, and should be equal as
+<li><strong>size</strong> (<em>int</em>) &#8211; Layer&#8217;s size. NOTE: lstm layer&#8217;s size, should be equal to
+<code class="code docutils literal"><span class="pre">input.size/4</span></code>, and should be equal to
 <code class="code docutils literal"><span class="pre">state.size</span></code>.</li>
 <li><strong>input</strong> (<em>paddle.v2.config_base.Layer</em>) &#8211; input layer. <span class="math">\(Wx_t + Wh_{t-1}\)</span></li>
 <li><strong>state</strong> (<em>paddle.v2.config_base.Layer</em>) &#8211; State Layer. <span class="math">\(c_{t-1}\)</span></li>

--- a/develop/doc_cn/api/v2/config/networks.html
+++ b/develop/doc_cn/api/v2/config/networks.html
@@ -457,16 +457,16 @@ False if no bias.</li>
 <dl class="function">
 <dt>
 <code class="descclassname">paddle.v2.networks.</code><code class="descname">lstmemory_unit</code><span class="sig-paren">(</span><em>*args</em>, <em>**kwargs</em><span class="sig-paren">)</span></dt>
-<dd><p>Define calculations that a LSTM unit performs in a single time step.
-This function itself is not a recurrent layer, so that it can not be
-directly applied to sequence input. This function is always used in
+<dd><p>Define calculations that a LSTM unit performs during a single time step.
+This function itself is not a recurrent layer, so it can not be
+directly used to process sequence inputs. This function is always used in
 recurrent_group (see layers.py for more details) to implement attention
 mechanism.</p>
 <p>Please refer to  <strong>Generating Sequences With Recurrent Neural Networks</strong>
 for more details about LSTM. The link goes as follows:
 .. _Link: <a class="reference external" href="https://arxiv.org/abs/1308.0850">https://arxiv.org/abs/1308.0850</a></p>
 <div class="math">
-\[ \begin{align}\begin{aligned}i_t &amp; = \sigma(W_{xi}x_{t} + W_{hi}h_{t-1} + W_{ci}c_{t-1} + b_i)\\f_t &amp; = \sigma(W_{xf}x_{t} + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)\\c_t &amp; = f_tc_{t-1} + i_t tanh (W_{xc}x_t+W_{hc}h_{t-1} + b_c)\\o_t &amp; = \sigma(W_{xo}x_{t} + W_{ho}h_{t-1} + W_{co}c_t + b_o)\\h_t &amp; = o_t tanh(c_t)\end{aligned}\end{align} \]</div>
+\[ \begin{align}\begin{aligned}i_t &amp; = \sigma(W_{x_i}x_{t} + W_{h_i}h_{t-1} + W_{c_i}c_{t-1} + b_i)\\f_t &amp; = \sigma(W_{x_f}x_{t} + W_{h_f}h_{t-1} + W_{c_f}c_{t-1} + b_f)\\c_t &amp; = f_tc_{t-1} + i_t tanh (W_{x_c}x_t+W_{h_c}h_{t-1} + b_c)\\o_t &amp; = \sigma(W_{x_o}x_{t} + W_{h_o}h_{t-1} + W_{c_o}c_t + b_o)\\h_t &amp; = o_t tanh(c_t)\end{aligned}\end{align} \]</div>
 <p>The example usage is:</p>
 <div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">lstm_step</span> <span class="o">=</span> <span class="n">lstmemory_unit</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="p">[</span><span class="n">layer1</span><span class="p">],</span>
                           <span class="n">size</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span>
@@ -481,6 +481,7 @@ for more details about LSTM. The link goes as follows:
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">参数:</th><td class="field-body"><ul class="first simple">
 <li><strong>input</strong> (<em>LayerOutput</em>) &#8211; input layer name.</li>
+<li><strong>memory_boot</strong> (<em>LayerOutput | None</em>) &#8211; the initialization state of the LSTM cell.</li>
 <li><strong>name</strong> (<em>basestring</em>) &#8211; lstmemory unit name.</li>
 <li><strong>size</strong> (<em>int</em>) &#8211; lstmemory unit size.</li>
 <li><strong>param_attr</strong> (<em>ParameterAttribute</em>) &#8211; Parameter config, None if use default.</li>
@@ -513,7 +514,7 @@ False means no bias, None means default bias.</li>
 <dl class="function">
 <dt>
 <code class="descclassname">paddle.v2.networks.</code><code class="descname">lstmemory_group</code><span class="sig-paren">(</span><em>*args</em>, <em>**kwargs</em><span class="sig-paren">)</span></dt>
-<dd><p>lstm_group is a recurrent layer group version of Long Short Term Memory. It
+<dd><p>lstm_group is a recurrent_group version of Long Short Term Memory. It
 does exactly the same calculation as the lstmemory layer (see lstmemory in
 layers.py for the maths) does. A promising benefit is that LSTM memory
 cell states, or hidden states in every time step are accessible to the
@@ -523,8 +524,8 @@ it is recommended to use the lstmemory, which is relatively faster than
 lstmemory_group.</p>
 <p>NOTE: In PaddlePaddle&#8217;s implementation, the following input-to-hidden
 multiplications:
-<span class="math">\(W_{xi}x_{t}\)</span> , <span class="math">\(W_{xf}x_{t}\)</span>,
-<span class="math">\(W_{xc}x_t\)</span>, <span class="math">\(W_{xo}x_{t}\)</span> are not done in lstmemory_unit to
+<span class="math">\(W_{x_i}x_{t}\)</span> , <span class="math">\(W_{x_f}x_{t}\)</span>,
+<span class="math">\(W_{x_c}x_t\)</span>, <span class="math">\(W_{x_o}x_{t}\)</span> are not done in lstmemory_unit to
 speed up the calculations. Consequently, an additional mixed_layer with
 full_matrix_projection must be included before lstmemory_unit is called.</p>
 <p>The example usage is:</p>
@@ -541,8 +542,9 @@ full_matrix_projection must be included before lstmemory_unit is called.</p>
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">参数:</th><td class="field-body"><ul class="first simple">
 <li><strong>input</strong> (<em>LayerOutput</em>) &#8211; input layer name.</li>
-<li><strong>name</strong> (<em>basestring</em>) &#8211; lstmemory group name.</li>
 <li><strong>size</strong> (<em>int</em>) &#8211; lstmemory group size.</li>
+<li><strong>name</strong> (<em>basestring</em>) &#8211; name of the lstmemory group.</li>
+<li><strong>memory_boot</strong> (<em>LayerOutput | None</em>) &#8211; the initialization state of LSTM cell.</li>
 <li><strong>reverse</strong> (<em>bool</em>) &#8211; is lstm reversed</li>
 <li><strong>param_attr</strong> (<em>ParameterAttribute</em>) &#8211; Parameter config, None if use default.</li>
 <li><strong>act</strong> (<em>BaseActivation</em>) &#8211; lstm final activiation type</li>
@@ -667,8 +669,8 @@ concatenated and returned.</li>
 <dt>
 <code class="descclassname">paddle.v2.networks.</code><code class="descname">gru_unit</code><span class="sig-paren">(</span><em>*args</em>, <em>**kwargs</em><span class="sig-paren">)</span></dt>
 <dd><p>Define calculations that a gated recurrent unit performs in a single time
-step. This function itself is not a recurrent layer, so that it can not be
-directly applied to sequence input. This function is almost always used in
+step. This function itself is not a recurrent layer, so it can not be
+directly used to process sequence inputs. This function is always used in
 the recurrent_group (see layers.py for more details) to implement attention
 mechanism.</p>
 <p>Please see grumemory in layers.py for the details about the maths.</p>
@@ -678,6 +680,7 @@ mechanism.</p>
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">参数:</th><td class="field-body"><ul class="first simple">
 <li><strong>input</strong> (<em>LayerOutput</em>) &#8211; input layer name.</li>
+<li><strong>memory_boot</strong> (<em>LayerOutput | None</em>) &#8211; the initial hidden state of the GRU.</li>
 <li><strong>name</strong> (<em>basestring</em>) &#8211; name of the gru group.</li>
 <li><strong>size</strong> (<em>int</em>) &#8211; hidden size of the gru.</li>
 <li><strong>act</strong> (<em>BaseActivation</em>) &#8211; type of the activation</li>
@@ -702,7 +705,7 @@ mechanism.</p>
 <dl class="function">
 <dt>
 <code class="descclassname">paddle.v2.networks.</code><code class="descname">gru_group</code><span class="sig-paren">(</span><em>*args</em>, <em>**kwargs</em><span class="sig-paren">)</span></dt>
-<dd><p>gru_group is a recurrent layer group version of Gated Recurrent Unit. It
+<dd><p>gru_group is a recurrent_group version of Gated Recurrent Unit. It
 does exactly the same calculation as the grumemory layer does. A promising
 benefit is that gru hidden states are accessible to the user. This is
 especially useful in attention model. If you do not need to access
@@ -722,6 +725,7 @@ to use the grumemory, which is relatively faster.</p>
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">参数:</th><td class="field-body"><ul class="first simple">
 <li><strong>input</strong> (<em>LayerOutput</em>) &#8211; input layer name.</li>
+<li><strong>memory_boot</strong> (<em>LayerOutput | None</em>) &#8211; the initial hidden state of the GRU.</li>
 <li><strong>name</strong> (<em>basestring</em>) &#8211; name of the gru group.</li>
 <li><strong>size</strong> (<em>int</em>) &#8211; hidden size of the gru.</li>
 <li><strong>reverse</strong> (<em>bool</em>) &#8211; whether to process the input data in a reverse order</li>

--- a/develop/doc_cn/searchindex.js
+++ b/develop/doc_cn/searchindex.js