Commit 8bf37994, authored by dangqingqing

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into lstm_fix

# Design: Sequence Decoder Generating LoDTensors
In tasks such as machine translation and image to text,
a [sequence decoder](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) is necessary to generate sequences.
This documentation describes how to implement the sequence decoder as an operator.
## Beam Search based Decoder
The [beam search algorithm](https://en.wikipedia.org/wiki/Beam_search) is necessary when generating sequences. It is a heuristic search algorithm that explores the paths by expanding the most promising nodes in a limited set.

In the old version of PaddlePaddle, a C++ class `RecurrentGradientMachine` implements the general sequence decoder based on beam search. Due to its complexity, the implementation relies on a lot of special data structures that are quite ad hoc and hard for users to customize.

There are a lot of heuristic tricks in sequence generation tasks, so the flexibility of the sequence decoder is very important to users.
During PaddlePaddle's refactoring work, some new concepts were proposed, such as [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md) and [TensorArray](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/tensor_array.md), that can better support sequences. They help make the implementation of the beam search based sequence decoder **more transparent and modular**.

For example, the RNN states, the candidate IDs and the probabilities of beam search can be represented as `LoDTensors`; the selected candidates' IDs in each time step can be stored in a `TensorArray` and `Pack`ed into the translated sentences.
## Changing LoD's absolute offset to relative offsets
The current `LoDTensor` is designed to store levels of variable-length sequences. It stores several arrays of integers, each of which represents a level.

The integers in each level represent the begin and end (not inclusive) offsets of a sequence **in the underlying tensor**; let's call this format the **absolute-offset LoD** for clarity.

The absolute-offset LoD can retrieve any sequence quickly, but it fails to represent empty sequences. For example, a two-level LoD is as follows
```python
[[0, 3, 9]
[0, 2, 3, 3, 3, 9]]
```
The first level tells that there are two sequences:
- the first's offset is `[0, 3)`
- the second's offset is `[3, 9)`
while on the second level, there are several empty sequences that both begin and end at `3`.
It is impossible to tell how many empty second-level sequences belong to each of the first-level sequences.

There are many scenarios that rely on the representation of empty sequences, such as machine translation or image-to-text, where an instance may have no translation or the candidate set of some prefix may be empty.

So let's introduce another format of LoD. It stores **the offsets of the lower-level sequences** and is called the **relative-offset** LoD.
For example, to represent the same sequences as the above data

```python
[[0, 2, 5]
 [0, 2, 3, 3, 3, 9]]
```

the first level represents that there are two sequences, whose offsets in the second-level LoD are `[0, 2)` and `[2, 5)`.

The second level is the same as in the absolute-offset example, because the lowest level refers directly to the underlying tensor. With this format it is easy to see that the second sequence in the first-level LoD contains two empty sequences.
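To make the relationship between the two formats concrete, here is a minimal Python sketch (the helper below is illustrative only, not part of any PaddlePaddle API) that resolves a relative-offset LoD back into absolute offsets and counts the empty sub-sequences owned by each top-level sequence:

```python
def to_absolute(lod):
    """Convert a relative-offset LoD to absolute offsets.

    Every level except the last indexes into the level below it; the
    last level already holds offsets into the underlying tensor.
    """
    levels = [lod[-1]]
    for level in reversed(lod[:-1]):
        lower = levels[0]
        # a relative offset i points at the i-th boundary of the lower level
        levels.insert(0, [lower[i] for i in level])
    return levels

relative = [[0, 2, 5],
            [0, 2, 3, 3, 3, 9]]
print(to_absolute(relative))  # [[0, 3, 9], [0, 2, 3, 3, 3, 9]]

# count the empty second-level sequences inside each first-level sequence
top, low = relative
for s in range(len(top) - 1):
    subs = [(low[i], low[i + 1]) for i in range(top[s], top[s + 1])]
    print(s, subs, sum(1 for b, e in subs if b == e), 'empty')
# 0 [(0, 2), (2, 3)] 0 empty
# 1 [(3, 3), (3, 3), (3, 9)] 2 empty
```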
The following demos are based on relative-offset LoD.
## Usage in a simple machine translation model
Let's start from a simple machine translation model, simplified from the [machine translation chapter](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation), to draw a blueprint of what a sequence decoder can do and how to use it.

The model has an encoder that learns the semantic vector from a sequence, and a decoder that uses the sequence decoder to generate new sentences.
**Encoder**
```python
import paddle as pd

dict_size = 8000
source_dict_size = dict_size
target_dict_size = dict_size
word_vector_dim = 128
encoder_dim = 128
decoder_dim = 128
beam_size = 5
max_length = 120

# encoder
src_word_id = pd.data(
    name='source_language_word',
    type=pd.data.integer_value_sequence(source_dict_size))
src_embedding = pd.embedding(size=source_dict_size, dim=word_vector_dim)
src_word_vec = pd.lookup(src_embedding, src_word_id)

encoder_out_seq = pd.gru(input=src_word_vec, size=encoder_dim)

encoder_ctx = pd.last_seq(encoder_out_seq)
# encoder_ctx_proj is the learned semantic vector
encoder_ctx_proj = pd.fc(
    encoder_ctx, size=decoder_dim, act=pd.activation.Tanh(), bias=None)
```
**Decoder**
```python
def generate():
    decoder = pd.while_loop()
    with decoder.step():
        decoder_mem = decoder.memory(init=encoder_ctx)  # mark the memory
        generated_ids = decoder.memory()  # TODO init to batch_size <s>s
        generated_scores = decoder.memory()  # TODO init to batch_size 1s or 0s

        # trg_embedding is the target-side embedding table,
        # defined in the same way as src_embedding above
        target_word = pd.lookup(trg_embedding, generated_ids)
        # expand encoder_ctx's batch to fit target_word's LoD;
        # decoder_mem, which keeps one state row per prefix, is expanded
        # in the same way. For example, if decoder_mem.lod is
        #   [[0, 2, 5]]
        # and its tensor content is [a1 a2 a3 a4 a5],
        # there are 2 sentences to translate:
        # - the first sentence has 2 translation prefixes, a1 and a2
        # - the second sentence has 3 translation prefixes, a3, a4 and a5
        # and target_word.lod is
        #   [[0, 2, 5],
        #    [0, 2, 4, 7, 9, 12]]
        # which means the 5 prefixes have 2, 2, 3, 2 and 3 candidates
        # respectively, so the expanded content will be
        #   [a1 a1 a2 a2 a3 a3 a3 a4 a4 a5 a5 a5]
        encoder_ctx_expanded = pd.lod_expand(encoder_ctx, target_word)
        decoder_input = pd.fc(
            act=pd.activation.Linear(),
            input=[target_word, encoder_ctx_expanded],
            size=3 * decoder_dim)
        gru_out, cur_mem = pd.gru_step(
            decoder_input, mem=decoder_mem, size=decoder_dim)
        scores = pd.fc(
            gru_out,
            size=target_dict_size,
            bias=None,
            act=pd.activation.Softmax())
        # K is a config that sets how many candidates to keep for each prefix
        topk_scores, topk_ids = pd.top_k(scores, K)
        topk_generated_scores = pd.add_scalar(topk_scores, generated_scores)

        selected_ids, selected_generation_scores = decoder.beam_search(
            topk_ids, topk_generated_scores)

        # update the states
        decoder_mem.update(cur_mem)  # tells how to update state
        generated_ids.update(selected_ids)
        generated_scores.update(selected_generation_scores)

        decoder.output(selected_ids)
        decoder.output(selected_generation_scores)

    translation_ids, translation_scores = decoder()
```
`decoder.beam_search` is an operator that, given the candidates and the scores of the translations including those candidates, returns the result of the beam search algorithm.

In this way, users can customize anything on the inputs or outputs of beam search. For example, there are several ways to prune some translation prefixes (a sketch of the first one follows this list):

1. make the corresponding elements in `topk_generated_scores` zero or some small values, so that beam search will discard these candidates;
2. remove some specific candidates from `selected_ids`;
3. get the final `translation_ids` and remove the matching translation sequences from it.
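As a sketch of the first option, the snippet below uses NumPy stand-ins for the LoDTensors (the `banned_ids` argument and all names are hypothetical, not part of the proposed API). It zeroes the accumulated scores of unwanted candidates, so a beam search that keeps the highest-scoring candidates will drop them:

```python
import numpy as np

def prune_candidates(topk_ids, topk_generated_scores, banned_ids):
    """Zero the scores of banned candidate ids; beam search keeps the
    highest-scoring candidates, so the banned ones are discarded."""
    mask = np.isin(topk_ids, list(banned_ids))
    return np.where(mask, 0.0, topk_generated_scores)

topk_ids = np.array([[7, 2, 9], [4, 7, 1]])              # K = 3 per prefix
scores   = np.array([[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]])  # accumulated scores
print(prune_candidates(topk_ids, scores, banned_ids={7}))
# [[0.  0.3 0.2]
#  [0.6 0.  0.1]]
```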
The implementation of the sequence decoder can reuse the C++ class [RNNAlgorithm](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/paddle/operators/dynamic_recurrent_op.h#L30), so the Python syntax is quite similar to that of an [RNN](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/doc/design/block.md#blocks-with-for-and-rnnop).

Both `translation_ids` and `translation_scores` are two-level `LoDTensors`:

- the first level represents the `batch_size` of (source) sentences;
- the second level represents the candidate ID sets for the translation prefixes.

For example, there may be 3 source sentences to translate, which have 2, 3 and 1 candidates respectively.
Unlike an RNN, in the sequence decoder the previous state and the current state have different LoDs and shapes, so a `lod_expand` operator is used to expand the LoD of the previous state to fit the current state.

For example, the previous state:

* LoD is `[[0, 1, 3], [0, 2, 5, 6]]`
* the content of the tensor is `a1 a2 b1 b2 b3 c1`

and the current state, stored in `encoder_ctx_expanded`:

* LoD is `[[0, 2, 6], [0, 3, 5, 8, 9, 11, 11]]`
* the content is
  - `a1 a1 a1` (a1 has 3 candidates, so the state is copied 3 times, once for each candidate)
  - `a2 a2`
  - `b1 b1 b1`
  - `b2`
  - `b3 b3`
  - none (c1 has 0 candidates, so c1 is dropped)

Benefiting from the relative-offset LoD, the empty candidate set can be represented naturally.
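A minimal NumPy sketch of what `lod_expand` does under these assumptions (a one-level relative LoD and plain arrays in place of real LoDTensors; the function mirrors the pseudo-API's name but is illustrative only), reproducing the example above:

```python
import numpy as np

def lod_expand(x, target_lod):
    """Repeat the i-th row of `x` once per element of the i-th sequence
    of `target_lod`, so the previous state lines up with the new one."""
    repeats = [target_lod[i + 1] - target_lod[i]
               for i in range(len(target_lod) - 1)]
    return np.repeat(x, repeats, axis=0)

# previous states, one row per prefix (1-D strings here for brevity)
prev_state = np.array(['a1', 'a2', 'b1', 'b2', 'b3', 'c1'])
# the prefixes have 3, 2, 3, 1, 2 and 0 candidates respectively
new_lod = [0, 3, 5, 8, 9, 11, 11]
print(lod_expand(prev_state, new_lod))
# ['a1' 'a1' 'a1' 'a2' 'a2' 'b1' 'b1' 'b1' 'b2' 'b3' 'b3']  -- c1 is dropped
```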
The states in each time step can be stored in a `TensorArray` and `Pack`ed into a final LoDTensor. The corresponding syntax is
```python
decoder.output(selected_ids)
decoder.output(selected_generation_scores)
```
The `selected_ids` are the candidate ids for the prefixes; `TensorArray` will `Pack` them into a two-level `LoDTensor`, where the first level represents the source sequences and the second level represents the generated sequences.

Packing the `selected_scores` will get a `LoDTensor` that stores the score of each translation candidate.

Packing the `selected_generation_scores` will get a `LoDTensor` whose sequence tails are the probabilities of the corresponding translations.
## LoD and shape changes during decoding
<p align="center">
<img src="./images/LOD-and-shape-changes-during-decoding.jpg"/>
</p>
According to the image above, the only phase that changes the LoD is beam search.
## Beam search design
The beam search algorithm will be implemented as one method of the sequence decoder. It has three inputs:

1. `topk_ids`, the top K candidate ids for each prefix.
2. `topk_scores`, the corresponding scores for `topk_ids`.
3. `generated_scores`, the scores of the prefixes.

All of these are LoDTensors, so that the sequence affiliation is clear. Beam search will keep a beam for each prefix and select a smaller candidate set for each prefix.

It will return three variables (a rough sketch of the selection step follows this list):

1. `selected_ids`, the final candidates that the beam search function selects for the next step.
2. `selected_scores`, the scores of the selected candidates.
3. `generated_scores`, the updated scores of each prefix (with the new candidates appended).
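A rough NumPy sketch of the selection step, assuming the inputs are flattened to one row per prefix (the real operator works on LoDTensors and also has to maintain the LoD and handle end-of-sentence tokens, which this ignores):

```python
import numpy as np

def beam_search_step(topk_ids, topk_generated_scores, beam_size):
    """For each prefix, keep the `beam_size` highest-scoring candidates."""
    order = np.argsort(-topk_generated_scores, axis=1)[:, :beam_size]
    selected_ids = np.take_along_axis(topk_ids, order, axis=1)
    selected_scores = np.take_along_axis(topk_generated_scores, order, axis=1)
    return selected_ids, selected_scores

ids    = np.array([[7, 2, 9, 4], [5, 1, 8, 3]])  # K = 4 candidates per prefix
scores = np.array([[0.10, 0.40, 0.30, 0.20],
                   [0.60, 0.05, 0.20, 0.15]])
sel_ids, sel_scores = beam_search_step(ids, scores, beam_size=2)
print(sel_ids)     # [[2 9] [5 8]]
print(sel_scores)  # [[0.4 0.3] [0.6 0.2]]
```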
## Introducing the LoD-based `Pack` and `Unpack` methods in `TensorArray`
The `selected_ids`, `selected_scores` and `generated_scores` are LoDTensors that exist in each time step, so it is natural to store them in arrays.

Currently, PaddlePaddle has a module called `TensorArray` which can store an array of tensors. The results of beam search are best stored in a `TensorArray`.

The `Pack` and `UnPack` in `TensorArray` are used to pack the tensors in the array into one `LoDTensor` and to split a `LoDTensor` into an array of tensors, respectively. They need some extensions to support packing or unpacking an array of `LoDTensors`.
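A toy sketch of the LoD-aware direction of `Pack`, assuming (purely for illustration) that each step holds one selected id per source sequence and `None` marks a finished sequence; it gathers the per-step outputs into a one-level relative-offset LoD plus flat data (the real operator would also handle beams and a second LoD level):

```python
def pack(steps):
    """Pack per-step outputs into (lod, data), one sequence per source."""
    seqs = [[] for _ in range(len(steps[0]))]
    for step in steps:
        for i, token in enumerate(step):
            if token is not None:
                seqs[i].append(token)
    lod, data = [0], []
    for seq in seqs:
        data.extend(seq)
        lod.append(len(data))
    return [lod], data

# three decoding steps over two source sentences; the second finishes early
steps = [[3, 7], [5, 1], [9, None]]
print(pack(steps))  # ([[0, 3, 5]], [3, 5, 9, 7, 1])
```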
@@ -49,7 +49,7 @@ void ScaleSubRegionLayer::forward(PassType passType) {
   shape_ = TensorShape({batchSize, channelsNum_, imgH_, imgW_});
   resetOutput(batchSize, imgV->getWidth());
-  auto out = getOutput();
+  auto& out = getOutput();
   out.setFrameHeight(imgH_);
   out.setFrameWidth(imgW_);
......
@@ -53,7 +53,7 @@ TEST(Operator, dot_mul) {
 TEST(Projection, context) {
   for (auto contextStart : {-5, -3, -1, 0, 3}) {
     for (auto contextLength : {1, 2, 5, 7}) {
-      for (auto batchSize : {1, 2, 5, 20, 50}) {
+      for (auto batchSize : {1, 2, 5, 20}) {
         for (auto trainablePadding : {false, true}) {
           LOG(INFO) << " contextStart=" << contextStart
                     << " contextLength=" << contextLength
@@ -585,14 +585,14 @@ TEST(Layer, maxoutLayer) {
 }

 void testFcLayer(string format, size_t nnz) {
   TestConfig config;
-  config.biasSize = 4096;
+  config.biasSize = 1024;
   config.layerConfig.set_type("fc");
-  config.layerConfig.set_size(4096);
+  config.layerConfig.set_size(1024);
   config.layerConfig.set_active_type("sigmoid");
   config.layerConfig.set_drop_rate(0.1);
   config.inputDefs.push_back(
-      {INPUT_DATA, "layer_0", 8192, nnz, ParaSparse(format)});
+      {INPUT_DATA, "layer_0", 2048, nnz, ParaSparse(format)});
   config.layerConfig.add_inputs();
   LOG(INFO) << config.inputDefs[0].sparse.sparse << " "
@@ -609,9 +609,9 @@ void testFcLayer(string format, size_t nnz) {
 }

 TEST(Layer, fcLayer) {
-  testFcLayer("", 4096 * 4096 * 2);
-  testFcLayer("csc", 4096 * 40);
-  testFcLayer("csr", 4096 * 40);
+  testFcLayer("", 1024 * 1024 * 2);
+  testFcLayer("csc", 1024 * 10);
+  testFcLayer("csr", 1024 * 10);
 }

 TEST(Layer, SelectiveFullyConnectedLayer) {
@@ -1995,7 +1995,7 @@ TEST(Layer, multibox_loss) {
 TEST(Layer, TransLayer) {
   TestConfig config;
   const int height = 128;
-  const int width = 1028;
+  const int width = 256;
   config.layerConfig.set_type("trans");
   config.layerConfig.set_size(width);
......
@@ -789,10 +789,9 @@ class MixedLayerType(LayerOutput):
     :type size: int
     :param act: Activation type.
     :type act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer Attribute.
     :type layer_attr: ExtraLayerAttribute or None
@@ -889,10 +888,9 @@ def mixed_layer(size=0,
                  then this function will just return layer's name.
     :param act: Activation Type. LinearActivation is the default.
     :type act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: The extra layer config. Default is None.
     :type layer_attr: ExtraLayerAttribute
@@ -1034,10 +1032,9 @@ def fc_layer(input,
     :type act: BaseActivation
     :param param_attr: The Parameter Attribute|list.
     :type param_attr: ParameterAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer config.
     :type layer_attr: ExtraLayerAttribute | None
@@ -1390,10 +1387,9 @@ def pooling_layer(input,
     :type pooling_type: BasePoolingType | None
     :param stride: The step size between successive pooling regions.
     :type stride: Int
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: The Extra Attributes for layer, such as dropout.
     :type layer_attr: ExtraLayerAttribute | None
@@ -1491,10 +1487,9 @@ def lstmemory(input,
     :type gate_act: BaseActivation
     :param state_act: state activation type, TanhActivation by default.
     :type state_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: Parameter Attribute.
     :type param_attr: ParameterAttribute | None | False
@@ -1617,10 +1612,9 @@ def grumemory(input,
                      This activation affects the :math:`z_t` and :math:`r_t`. It is the
                      :math:`\\sigma` in the above formula.
     :type gate_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: Parameter Attribute.
     :type param_attr: ParameterAttribute | None | False
@@ -1817,10 +1811,9 @@ def expand_layer(input,
     :type expand_as: LayerOutput
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param expand_level: whether input layer is timestep(default) or sequence.
     :type expand_level: ExpandLevel
@@ -1939,10 +1932,9 @@ def seq_reshape_layer(input,
     :type act: BaseActivation
     :param layer_attr: extra layer attributes.
     :type layer_attr: ExtraLayerAttribute.
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -2326,10 +2318,9 @@ def hsigmoid(input,
     :type num_classes: int | None
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: Parameter Attribute. None means default parameter.
     :type param_attr: ParameterAttribute | None
@@ -2469,10 +2460,9 @@ def img_conv_layer(input,
     :type dilation: int | tuple | list
     :param dilation_y: The y dimension of the dilation.
     :type dilation_y: int
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param num_channels: number of input channels. If None will be set
                          automatically from previous output.
@@ -3219,10 +3209,9 @@ def addto_layer(input, act=None, name=None, bias_attr=None, layer_attr=None):
     :type input: LayerOutput | list | tuple
     :param act: Activation Type. LinearActivation is the default.
     :type act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer attribute.
     :type layer_attr: ExtraLayerAttribute
@@ -3375,10 +3364,9 @@ def seq_concat_layer(a, b, act=None, name=None, layer_attr=None,
     :type act: BaseActivation
     :param layer_attr: Extra Layer Attribute.
     :type layer_attr: ExtraLayerAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -3558,10 +3546,9 @@ def lstm_step_layer(input,
     :type gate_act: BaseActivation
     :param state_act: State Activation Type. TanhActivation is the default.
     :type state_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: layer's extra attribute.
     :type layer_attr: ExtraLayerAttribute
@@ -3617,10 +3604,9 @@ def gru_step_layer(input,
     :param name: The name of this layer. It is optional.
     :param gate_act: Activation type of this layer's two gates. Default is Sigmoid.
     :type gate_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: the parameter_attribute for transforming the output_mem
                        from previous step.
@@ -3680,10 +3666,9 @@ def gru_step_naive_layer(input,
     :type act: BaseActivation
     :param gate_act: Activation type of this layer's two gates. Default is Sigmoid.
     :type gate_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr:
     :param layer_attr:
@@ -3813,10 +3798,9 @@ def recurrent_layer(input,
     :type input: LayerOutput
     :param act: Activation type. TanhActivation is the default.
     :type act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: parameter attribute.
     :type param_attr: ParameterAttribute
@@ -4806,10 +4790,9 @@ def tensor_layer(a,
     :type act: BaseActivation
     :param param_attr: The Parameter Attribute.
     :type param_attr: ParameterAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer config.
     :type layer_attr: ExtraLayerAttribute | None
@@ -4871,10 +4854,9 @@ def selective_fc_layer(input,
     :type act: BaseActivation
     :param param_attr: The Parameter Attribute.
     :type param_attr: ParameterAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer config.
     :type layer_attr: ExtraLayerAttribute | None
@@ -5497,7 +5479,11 @@ def crf_decoding_layer(input,
     return LayerOutput(name, LayerType.CRF_DECODING_LAYER, parents, size=1)

-@wrap_act_default(act=SigmoidActivation())
+"""
+Following are cost Layers.
+"""
+
 @wrap_bias_attr_default(has_bias=True)
 @wrap_param_attr_default()
 @wrap_name_default()
@@ -5505,7 +5491,6 @@ def crf_decoding_layer(input,
 def nce_layer(input,
               label,
               num_classes=None,
-              act=None,
               param_attr=None,
               weight=None,
               num_neg_samples=10,
@@ -5514,9 +5499,12 @@ def nce_layer(input,
               bias_attr=None,
               layer_attr=None):
     """
-    Noise-contrastive estimation.
-    Implements the method in the following paper:
-    A fast and simple algorithm for training neural probabilistic language models.
+    Noise-contrastive estimation. This layer implements the method in the
+    following paper:
+
+    Reference:
+        A fast and simple algorithm for training neural probabilistic language
+        models. https://www.cs.toronto.edu/~amnih/papers/ncelm.pdf

     The example usage is:
@@ -5528,32 +5516,37 @@ def nce_layer(input,
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param input: The input layers. It could be a LayerOutput of list/tuple of LayerOutput.
+    :param input: The input layers. It should be a LayerOutput or a list/tuple
+                  of LayerOutput.
     :type input: LayerOutput | list | tuple | collections.Sequence
-    :param label: label layer
+    :param label: The ground truth.
     :type label: LayerOutput
-    :param weight: weight layer, can be None(default)
+    :param weight: The weight layer defines a weight for each sample in the
+                   mini-batch. The default value is None.
     :type weight: LayerOutput
-    :param num_classes: number of classes.
+    :param num_classes: The class number.
     :type num_classes: int
-    :param act: Activation type. SigmoidActivation is the default.
-    :type act: BaseActivation
-    :param param_attr: The Parameter Attribute|list.
-    :type param_attr: ParameterAttribute
-    :param num_neg_samples: number of negative samples. Default is 10.
+    :param param_attr: The parameter attributes.
+    :type param_attr: ParameterAttribute|list
+    :param num_neg_samples: The number of sampled negative labels. The default
+                            value is 10.
     :type num_neg_samples: int
-    :param neg_distribution: The distribution for generating the random negative labels.
-                             A uniform distribution will be used if not provided.
-                             If not None, its length must be equal to num_classes.
+    :param neg_distribution: The discrete noisy distribution over the output
+                             space from which num_neg_samples negative labels
+                             are sampled. If this parameter is not set, a
+                             uniform distribution will be used. A user defined
+                             distribution is a list whose length must be equal
+                             to the num_classes. Each member of the list defines
+                             the probability of a class given input x.
     :type neg_distribution: list | tuple | collections.Sequence | None
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The attribute for bias. If this parameter is set False or
+                      any object whose type is not ParameterAttribute, no bias
+                      is added. If this parameter is set True, the bias is
+                      initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer Attribute.
     :type layer_attr: ExtraLayerAttribute
-    :return: layer name.
+    :return: The LayerOutput object.
     :rtype: LayerOutput
     """
     if isinstance(input, LayerOutput):
@@ -5576,8 +5569,6 @@ def nce_layer(input,
         assert isinstance(neg_distribution, collections.Sequence)
         assert len(neg_distribution) == num_classes
         assert abs(sum(neg_distribution) - 1.0) < 1e-5
-    if not isinstance(act, BaseActivation):
-        raise TypeError()

     ipts_for_layer = []
     parents = []
@@ -5599,7 +5590,7 @@ def nce_layer(input,
         type=LayerType.NCE_LAYER,
         num_classes=num_classes,
         neg_sampling_dist=neg_distribution,
-        active_type=act.name,
+        active_type=SigmoidActivation().name,
         num_neg_samples=num_neg_samples,
         inputs=ipts_for_layer,
         bias=ParamAttr.to_bias(bias_attr),
@@ -5609,12 +5600,7 @@ def nce_layer(input,
         LayerType.NCE_LAYER,
         parents=parents,
         size=l.config.size,
-        activation=act)
-
-
-"""
-following are cost Layers.
-"""
+        activation=SigmoidActivation())


 @wrap_name_default()
@@ -5773,20 +5759,21 @@ def cross_entropy(input,
     :param input: The first input layer.
     :type input: LayerOutput.
     :param label: The input label.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
-    :param coeff: The cost is multiplied with coeff.
-                  The coefficient affects the gradient in the backward.
-    :type coeff: float.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
+    :type coeff: float
     :param weight: The cost of each sample is multiplied with each weight.
                    The weight should be a layer with size=1. Note that gradient
                    will not be calculated for weight.
     :type weight: LayerOutout
-    :param layer_attr: Extra Layer Attribute.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
-    :rtype: LayerOutput.
+    :rtype: LayerOutput
     """

     ipts, parents = __cost_input__(input, label, weight)
@@ -5819,19 +5806,21 @@ def cross_entropy_with_selfnorm(input,
                                        label=label_layer)

     :param input: The first input layer.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param label: The input label.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
-    :param coeff: The coefficient affects the gradient in the backward.
-    :type coeff: float.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
+    :type coeff: float
     :param softmax_selfnorm_alpha: The scale factor affects the cost.
-    :type softmax_selfnorm_alpha: float.
-    :param layer_attr: Extra Layer Attribute.
+    :type softmax_selfnorm_alpha: float
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
-    :rtype: LayerOutput.
+    :rtype: LayerOutput
     """
     Layer(
         name=name,
@@ -5852,7 +5841,7 @@ def cross_entropy_with_selfnorm(input,
 @layer_support()
 def sum_cost(input, name=None, layer_attr=None):
     """
-    A loss layer which calculate the sum of the input as loss
+    A loss layer which calculates the sum of the input as loss.

     The example usage is:
@@ -5861,10 +5850,11 @@ def sum_cost(input, name=None, layer_attr=None):
        cost = sum_cost(input=input_layer)

     :param input: The input of this layer.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
-    :param layer_attr: Extra Layer Attribute.
+    :type name: basestring
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
     :rtype: LayerOutput.
@@ -5904,16 +5894,18 @@ def huber_regression_cost(input,
        cost = huber_regression_cost(input=input_layer, label=label_layer)

     :param input: The first input layer.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param label: The input label.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
+    :type name: basestring
     :param delta: The difference between the observed and predicted values.
-    :type delta: float.
-    :param coeff: The coefficient affects the gradient in the backward.
-    :type coeff: float.
-    :param layer_attr: Extra Layer Attribute.
+    :type delta: float
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
+    :type coeff: float
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
     :rtype: LayerOutput.
@@ -5954,17 +5946,19 @@ def huber_classification_cost(input,
        cost = huber_classification_cost(input=input_layer, label=label_layer)

     :param input: The first input layer.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param label: The input label.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
-    :param coeff: The coefficient affects the gradient in the backward.
-    :type coeff: float.
-    :param layer_attr: Extra Layer Attribute.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
+    :type coeff: float
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
-    :rtype: LayerOutput.
+    :rtype: LayerOutput
     """
     assert isinstance(input, LayerOutput)
     if input.size is not None:
@@ -6001,10 +5995,12 @@ def multi_binary_label_cross_entropy(input,
     :param label: The input label.
     :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring
-    :param coeff: The coefficient affects the gradient in the backward.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
     :type coeff: float
-    :param layer_attr: Extra Layer Attribute.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6107,7 +6103,7 @@ def cross_entropy_over_beam(input, name=None):
     :param input: Input beams for this layer.
     :type input: BeamInput
-    :param name: The name of this layer.
+    :param name: The name of this layer. It is optional.
     :type name: basestring
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6142,7 +6138,7 @@ def cross_entropy_over_beam(input, name=None):
 def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
     """
     This is a L1 loss but more smooth. It requires that the
-    size of input and label are equal. The formula is as follows,
+    sizes of input and label are equal. The formula is as follows,

     .. math::
@@ -6154,8 +6150,9 @@ def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
         smooth_{L1}(x) = \\begin{cases} 0.5x^2& \\text{if}  \\ |x| < 1 \\\\ |x|-0.5& \\text{otherwise} \end{cases}

-    More details can be found by referring to `Fast R-CNN
-    <https://arxiv.org/pdf/1504.08083v2.pdf>`_
+    Reference:
+        Fast R-CNN
+        https://arxiv.org/pdf/1504.08083v2.pdf

     The example usage is:
@@ -6169,10 +6166,12 @@ def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
     :param label: The input label.
     :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring
-    :param coeff: The coefficient affects the gradient in the backward.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
     :type coeff: float
-    :param layer_attr: Extra Layer Attribute.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6194,12 +6193,12 @@ def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
 @wrap_name_default()
 def multiplex_layer(input, name=None, layer_attr=None):
     """
-    This layer multiplex multiple layers according to the index,
-    which is provided by the first input layer.
-    inputs[0]: the index of the layer to output of size batchSize.
+    This layer multiplex multiple layers according to the indexes,
+    which are provided by the first input layer.
+    inputs[0]: the indexes of the layers to form the output of size batchSize.
     inputs[1:N]; the candidate output data.
-    For each index i from 0 to batchSize -1, the output is the i-th row of the
-    (index[i] + 1)-th layer.
+    For each index i from 0 to batchSize - 1, the i-th row of the output is the
+    the same to the i-th row of the (index[i] + 1)-th layer.

     For each i-th row of output:

     .. math::
@@ -6218,7 +6217,8 @@ def multiplex_layer(input, name=None, layer_attr=None):
     :type input: list of LayerOutput
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param layer_attr: extra layer attributes.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute.
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6322,14 +6322,14 @@ def row_conv_layer(input,
     :type context_len: int
     :param act: Activation Type. LinearActivation is the default.
     :type act: BaseActivation
-    :param param_attr: The Parameter Attribute. If None, the parameter will be
-                       initialized smartly. It's better to set it by yourself.
+    :param param_attr: The parameter attribute. See ParameterAttribute for
+                       details.
     :type param_attr: ParameterAttribute
-    :param layer_attr: Extra Layer config.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute | None
     :return: LayerOutput object.
     :rtype: LayerOutput
     """
     assert isinstance(input, LayerOutput)
     assert context_len > 0, "the context_len must be greatet than 0."
@@ -6354,7 +6354,7 @@ def prelu_layer(input,
                 param_attr=None,
                 layer_attr=None):
     """
-    The Parameter Relu activation that actives outputs with a learnable weight.
+    The Parametric Relu activation that actives outputs with a learnable weight.

     Reference:
         Delving Deep into Rectifiers: Surpassing Human-Level Performance on
@@ -6374,16 +6374,17 @@ def prelu_layer(input,
     :type name: basestring
     :param input: The input of this layer.
     :type input: LayerOutput
-    :param partial_sum: this parameter makes a group of inputs share a same weight.
+    :param partial_sum: this parameter makes a group of inputs share the same weight.

         - partial_sum = 1, indicates the element-wise activation: each element has a weight.
-        - partial_sum = number of elements in one channel, indicates the channel-wise activation, elements in a channel share a same weight.
-        - partial_sum = number of outputs, indicates all elements share a same weight.
+        - partial_sum = number of elements in one channel, indicates the channel-wise activation, elements in a channel share the same weight.
+        - partial_sum = number of outputs, indicates all elements share the same weight.

     :type partial_sum: int
     :param param_attr: The parameter attribute. See ParameterAttribute for details.
-    :type param_attr: ParameterAttribute | None
-    :param layer_attr: Extra layer configurations. Default is None.
+    :type param_attr: ParameterAttribute
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute | None
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6439,34 +6440,34 @@ def gated_unit_layer(input,
     :param input: The input of this layer.
     :type input: LayerOutput
-    :param size: output size of the gated unit.
+    :param size: The dimension of this layer's output.
     :type size: int
-    :param act: Activation type of the projected input. LinearActivation is the default.
+    :param act: Activation type of the projection. LinearActivation is the default.
     :type act: BaseActivation
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param gate_attr: Attributes to tune the gate output, for example, error
-                      clipping threshold, dropout and so on. See ExtraLayerAttribute for
-                      more details.
+    :param gate_attr: The extra layer attribute of the gate. See ExtraLayerAttribute for
+                      details.
     :type gate_attr: ExtraLayerAttribute | None
-    :param gate_param_attr: Attributes to tune the learnable projected matrix
-                            parameter of the gate.
-    :type gate_param_attr: ParameterAttribute | None
-    :param gate_bias_attr: Attributes to tune the learnable bias of the gate.
-    :type gate_bias_attr: ParameterAttribute | None
-    :param inproj_attr: Attributes to the tune the projected input, for
-                        example, error clipping threshold, dropout and so on. See
-                        ExtraLayerAttribute for more details.
+    :param gate_param_attr: The parameter attribute of the gate. See ParameterAttribute
+                            for details.
+    :type gate_param_attr: ParameterAttribute
+    :param gate_bias_attr: The bias attribute of the gate. If the parameter is set to False or
+                           an object whose type is not ParameterAttribute, no bias is defined.
+                           If the parameter is set to True, the bias is initialized to zero.
+    :type gate_bias_attr: ParameterAttribute | bool | None | Any
+    :param inproj_attr: Extra layer attributes of the projection. See ExtraLayerAttribute for
+                        details.
     :type inproj_attr: ExtraLayerAttribute | None
-    :param inproj_param_attr: Attributes to tune the learnable parameter of
-                              the projection of input.
-    :type inproj_param_attr: ParameterAttribute | None
-    :param inproj_bias_attr: Attributes to tune the learnable bias of
-                             projection of the input.
-    :type inproj_bias_attr: ParameterAttribute | None
-    :param layer_attr: Attributes to tune the final output of the gated unit,
-                       for example, error clipping threshold, dropout and so on. See
-                       ExtraLayerAttribute for more details.
+    :param inproj_param_attr: The parameter attribute of the projection. See ParameterAttribute
+                              for details.
+    :type inproj_param_attr: ParameterAttribute
+    :param inproj_bias_attr: The bias attribute of the projection. If the parameter is set to False
+                             or an object whose type is not ParameterAttribute, no bias is defined.
+                             If the parameter is set to True, the bias is initialized to zero.
+    :type inproj_bias_attr: ParameterAttribute | bool | None | Any
+    :param layer_attr: Extra layer attribute of the product. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute | None
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6662,9 +6663,9 @@ def clip_layer(input, min, max, name=None):
     :param input: The input of this layer.
     :type input: LayerOutput.
     :param min: The lower threshold for clipping.
-    :type min: double
+    :type min: float
     :param max: The upper threshold for clipping.
-    :type max: double
+    :type max: float
     :return: LayerOutput object.
     :rtype: LayerOutput
     """
@@ -6712,7 +6713,6 @@ def seq_slice_layer(input, starts, ends, name=None):
     :type ends: LayerOutput | None
     :return: LayerOutput object.
     :rtype: LayerOutput
-
     """
     assert isinstance(input, LayerOutput), (
@@ -6833,20 +6833,21 @@ def img_conv3d_layer(input,
     :param padding: The numbers of padding along three axises. If the parameter is set to
                     one integer, they will be same.
     :type padding: int | tuple | list
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param num_channels: The number of input channels. If the parameter is not set or
                          set to None, its actual value will be automatically set to
                          the channels number of the input.
     :type num_channels: int
-    :param param_attr: The parameter attribute of the convolution.
+    :param param_attr: The parameter attribute of the convolution. See ParameterAttribute for
+                       details.
     :type param_attr: ParameterAttribute
     :param shared_biases: Whether biases will be shared between filters or not.
     :type shared_biases: bool
-    :param layer_attr: Extra layer attributes.
+    :param layer_attr: The extra layer attributes. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :param trans: True if it is a convTransLayer, False if it is a convLayer
     :type trans: bool
@@ -6953,12 +6954,12 @@ def scale_shift_layer(input, name=None, param_attr=None, bias_attr=None):
     :type name: basestring
     :param input: The input of this layer.
     :type input: LayerOutput
-    :param param_attr: The parameter attribute of scaling.
+    :param param_attr: The parameter attribute of scaling. See ParameterAttribute for
+                       details.
     :type param_attr: ParameterAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -7016,10 +7017,9 @@ def sub_seq_layer(input, offsets, sizes, act=None, bias_attr=None, name=None):
     :type sizes: LayerOutput
     :param act: Activation type, LinearActivation is the default.
     :type act: BaseActivation.
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :return: LayerOutput object.
     :rtype: LayerOutput
......