diff --git a/doc/design/ops/images/LOD-and-shape-changes-during-decoding.jpg b/doc/design/ops/images/LOD-and-shape-changes-during-decoding.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..8b0d90f7b9d8184b314b0ee4e521f53eb5f1b455
Binary files /dev/null and b/doc/design/ops/images/LOD-and-shape-changes-during-decoding.jpg differ
diff --git a/doc/design/ops/sequence_decoder.md b/doc/design/ops/sequence_decoder.md
new file mode 100644
index 0000000000000000000000000000000000000000..9007aae7a8355ed06c6720a921351f81b859c1fe
--- /dev/null
+++ b/doc/design/ops/sequence_decoder.md
@@ -0,0 +1,245 @@
+# Design: Sequence Decoder Generating LoDTensors
+In tasks such as machine translation and image to text,
+a [sequence decoder](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) is necessary to generate sequences.
+
+This documentation describes how to implement the sequence decoder as an operator.
+
+## Beam Search based Decoder
+The [beam search algorithm](https://en.wikipedia.org/wiki/Beam_search) is necessary when generating sequences.
+It is a heuristic search algorithm that explores the paths by expanding the most promising node in a limited set.
+
+In the old version of PaddlePaddle, a C++ class `RecurrentGradientMachine` implements the general sequence decoder based on beam search.
+Due to its complexity, the implementation relies on a lot of special data structures that are
+quite ad-hoc and hard for users to customize.
+
+There are a lot of heuristic tricks in sequence generation tasks,
+so the flexibility of the sequence decoder is very important to users.
+
+During PaddlePaddle's refactoring work,
+some new concepts that can better support sequence usage, such as [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md) and [TensorArray](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/tensor_array.md), have been proposed,
+and they can help to make the implementation of the beam search based sequence decoder **more transparent and modular**.
+
+For example, the RNN states, candidate IDs and probabilities of beam search can be represented as `LoDTensors`;
+the selected candidates' IDs in each time step can be stored in a `TensorArray` and `Pack`ed into the translated sentences.
+
+## Changing LoD's absolute offsets to relative offsets
+The current `LoDTensor` is designed to store levels of variable-length sequences;
+it stores several arrays of integers, each of which represents a level.
+
+The integers in each level represent the begin and end (not inclusive) offsets of a sequence **in the underlying tensor**;
+let's call this format the **absolute-offset LoD** for clarity.
+
+The absolute-offset LoD can quickly retrieve any sequence but fails to represent empty sequences. For example, a two-level LoD is as follows
+```python
+[[0, 3, 9]
+ [0, 2, 3, 3, 3, 9]]
+```
+The first level tells that there are two sequences:
+- the first's offset is `[0, 3)`
+- the second's offset is `[3, 9)`
+
+while on the second level, there are several empty sequences that both begin and end at `3`.
+It is impossible to tell how many empty second-level sequences exist in the first-level sequences.
+
+There are many scenarios that rely on the representation of empty sequences,
+such as machine translation or image to text, where an instance may have no translation or a prefix may have an empty candidate set.
+
+So let's introduce another format of LoD;
+it stores **the offsets of the lower-level sequences** and is called the **relative-offset** LoD.
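+
+Before the concrete example below, a short framework-free sketch may help to contrast the two formats;
+`relative_to_absolute` and `count_empty` are hypothetical helper names, not part of the PaddlePaddle API:
+
+```python
+def relative_to_absolute(lod):
+    """Convert the top level of a two-level relative-offset LoD
+    into absolute offsets in the underlying tensor."""
+    top, low = lod
+    # the i-th top-level sequence covers the lower-level sequences
+    # low[top[i]] .. low[top[i + 1]], so its absolute span is:
+    return [(low[top[i]], low[top[i + 1]]) for i in range(len(top) - 1)]
+
+
+def count_empty(lod):
+    """Count the empty lower-level sequences inside each top-level
+    sequence; only relative offsets make this well defined."""
+    top, low = lod
+    return [
+        sum(1 for j in range(top[i], top[i + 1]) if low[j] == low[j + 1])
+        for i in range(len(top) - 1)
+    ]
+```
+
+With the relative-offset LoD shown below, `relative_to_absolute` recovers the absolute spans `[(0, 3), (3, 9)]`,
+and `count_empty` returns `[1, 1]`, so every empty sequence is attributed to a definite parent sequence.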
+
+For example, to represent the same sequences of the above data
+
+```python
+[[0, 3, 5]
+ [0, 2, 3, 3, 3, 9]]
+```
+
+the first level represents that there are two sequences;
+their offsets in the second-level LoD are `[0, 3)` and `[3, 5)`.
+
+The second level is the same as in the absolute-offset example because the lowest level directly indexes the underlying tensor.
+With relative offsets it is easy to tell that there are two empty second-level sequences, one inside each first-level sequence.
+
+The following demos are based on the relative-offset LoD.
+
+## Usage in a simple machine translation model
+Let's start from a simple machine translation model that is simplified from the [machine translation chapter](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation) to draw a blueprint of what a sequence decoder can do and how to use it.
+
+The model has an encoder that learns the semantic vector from a sequence,
+and a decoder which uses the sequence decoder to generate new sentences.
+
+**Encoder**
+```python
+import paddle as pd
+
+dict_size = 8000
+source_dict_size = dict_size
+target_dict_size = dict_size
+word_vector_dim = 128
+encoder_dim = 128
+decoder_dim = 128
+beam_size = 5
+max_length = 120
+
+# encoder
+src_word_id = pd.data(
+    name='source_language_word',
+    type=pd.data.integer_value_sequence(source_dict_size))
+src_embedding = pd.embedding(vocab_size=source_dict_size, size=word_vector_dim)
+
+src_word_vec = pd.lookup(src_embedding, src_word_id)
+
+encoder_out_seq = pd.gru(input=src_word_vec, size=encoder_dim)
+
+encoder_ctx = pd.last_seq(encoder_out_seq)
+# encoder_ctx_proj is the learned semantic vector
+encoder_ctx_proj = pd.fc(
+    encoder_ctx, size=decoder_dim, act=pd.activation.Tanh(), bias=None)
+```
+
+**Decoder**
+
+```python
+def generate():
+    decoder = pd.while_loop()
+    with decoder.step():
+        decoder_mem = decoder.memory(init=encoder_ctx)  # mark the memory
+        generated_ids = decoder.memory()  # TODO init to batch_size <s>s
+        generated_scores = decoder.memory()  # TODO init to batch_size 1s or 0s
+
+        # trg_embedding is the target-side word embedding table
+        target_word = pd.lookup(trg_embedding, generated_ids)
+        # expand encoder_ctx's batch to fit target_word's lod
+        # for example, suppose
+        # decoder_mem.lod is
+        # [[0 1 3],
+        #  [0 1 3 6]]
+        # its tensor content is [a1 a2 a3 a4 a5]
+        # which means there are 2 sentences to translate
+        #   - the first sentence has 1 translation prefix, its offset is [0, 1)
+        #   - the second sentence has 2 translation prefixes, their offsets are [1, 3) and [3, 6)
+        # the target_word.lod is
+        # [[0, 1, 6]
+        #  [0, 2, 4, 7, 9, 12]]
+        # which means there are 2 sentences to translate, with 1 and 5 prefixes respectively;
+        # the first prefix has 2 candidates,
+        # and the following have 2, 3, 2, 3 candidates
+        # the encoder_ctx_expanded's content will be
+        # [a1 a1 a2 a2 a3 a3 a3 a4 a4 a5 a5 a5]
+        encoder_ctx_expanded = pd.lod_expand(encoder_ctx, target_word)
+        decoder_input = pd.fc(
+            act=pd.activation.Linear(),
+            input=[target_word, encoder_ctx_expanded],
+            size=3 * decoder_dim)
+        gru_out, cur_mem = pd.gru_step(
+            decoder_input, mem=decoder_mem, size=decoder_dim)
+        scores = pd.fc(
+            gru_out,
+            size=target_dict_size,
+            bias=None,
+            act=pd.activation.Softmax())
+        # K is a config: the number of top candidates to keep for each prefix
+        topk_scores, topk_ids = pd.top_k(scores, K)
+        topk_generated_scores = pd.add_scalar(topk_scores, generated_scores)
+
+        selected_ids, selected_generation_scores = decoder.beam_search(
+            topk_ids, topk_generated_scores)
+
+        # update the states
+        decoder_mem.update(cur_mem)  # tells how to update state
+        generated_ids.update(selected_ids)
+        generated_scores.update(selected_generation_scores)
+
+        decoder.output(selected_ids)
+        decoder.output(selected_generation_scores)
+
+translation_ids, translation_scores = decoder()
+```
+`decoder.beam_search` is an operator that, given the candidates and the scores of the translations including those candidates,
+returns the result of the beam search algorithm.
+
+In this way, users can customize anything on the inputs or outputs of beam search; for example, there are several ways to prune translation prefixes:
+
+1. make the corresponding elements in `topk_generated_scores` zero or some small values, so that beam_search will discard these candidates.
+2. remove some specific candidates from `selected_ids`.
+3. get the final `translation_ids` and remove unwanted translation sequences from it.
+
+The implementation of the sequence decoder can reuse the C++ class [RNNAlgorithm](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/paddle/operators/dynamic_recurrent_op.h#L30),
+so the python syntax is quite similar to an [RNN](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/doc/design/block.md#blocks-with-for-and-rnnop).
+
+Both `translation_ids` and `translation_scores` are two-level `LoDTensors`:
+
+- the first level represents the `batch_size` of (source) sentences;
+- the second level represents the candidate ID sets for the translation prefixes.
+
+For example, there may be 3 source sentences to translate, with 2, 3 and 1 candidates respectively.
+
+Unlike in an RNN, the previous state and the current state of the sequence decoder have different LoDs and shapes,
+so a `lod_expand` operator is used to expand the LoD of the previous state to fit that of the current state.
+
+For example, the previous state
+
+* LoD is `[0, 1, 3][0, 2, 5, 6]`
+* content of tensor is `a1 a2 b1 b2 b3 c1`
+
+the current state stored in `encoder_ctx_expanded`
+
+* LoD is `[0, 2, 6][0, 3, 5, 8, 9, 11, 11]`
+* the content is
+  - a1 a1 a1 (a1 has 3 candidates, so its state is copied 3 times, once for each candidate)
+  - a2 a2
+  - b1 b1 b1
+  - b2
+  - b3 b3
+  - None (c1 has 0 candidates, so c1 is dropped)
+
+Thanks to the relative-offset LoD, an empty candidate set can be represented naturally.
+
+The states in each time step can be stored in a `TensorArray` and `Pack`ed into a final `LoDTensor`; the corresponding syntax is
+
+```python
+decoder.output(selected_ids)
+decoder.output(selected_generation_scores)
+```
+
+`selected_ids` contains the candidate IDs for the prefixes;
+it will be `Pack`ed by the `TensorArray` into a two-level `LoDTensor`
+whose first level represents the source sequences
+and whose second level represents the generated sequences.
+
+Packing `selected_scores` will get a `LoDTensor` that stores the score of each translation candidate.
+
+Packing `selected_generation_scores` will get a `LoDTensor` whose tail elements are the probabilities of the complete translations.
+
+## LoD and shape changes during decoding
+
+<p align="center">
+  <img src="./images/LOD-and-shape-changes-during-decoding.jpg"/>
+</p>
+
+According to the image above, beam search is the only phase that changes the LoD.
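+
+To make this concrete, one decoding step can be simulated in a few lines of plain Python;
+`lod_expand` and `beam_search` below are toy stand-ins for the real operators
+(per-sentence beam bookkeeping is omitted), with made-up data:
+
+```python
+def lod_expand(states, counts):
+    """Copy states[i] counts[i] times; the result's LoD follows counts."""
+    expanded = [s for s, c in zip(states, counts) for _ in range(c)]
+    lod = [0]
+    for c in counts:
+        lod.append(lod[-1] + c)
+    return expanded, lod
+
+
+def beam_search(candidates, scores, beam_size):
+    """Keep only the beam_size best candidates; this is the step
+    that shrinks the LoD."""
+    ranked = sorted(zip(candidates, scores), key=lambda t: -t[1])
+    return ranked[:beam_size]
+
+
+states = ["a1", "a2", "b1"]  # one state per translation prefix
+counts = [3, 2, 1]           # top-k candidates found for each prefix
+expanded, lod = lod_expand(states, counts)
+print(expanded)  # ['a1', 'a1', 'a1', 'a2', 'a2', 'b1'] -- the shape grows
+print(lod)       # [0, 3, 5, 6]
+print(beam_search(expanded, [0.5, 0.2, 0.1, 0.4, 0.3, 0.6], beam_size=2))
+# [('b1', 0.6), ('a1', 0.5)] -- pruning shrinks the LoD again
+```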
+
+## Beam search design
+The beam search algorithm will be implemented as one method of the sequence decoder. It has 3 inputs:
+
+1. `topk_ids`, the top K candidate IDs for each prefix.
+2. `topk_scores`, the corresponding scores for `topk_ids`.
+3. `generated_scores`, the scores of the prefixes.
+
+All of them are `LoDTensors`, so that the sequence affiliation is clear.
+Beam search will keep a beam for each prefix and select a smaller candidate set for each prefix.
+
+It will return three variables:
+
+1. `selected_ids`, the final candidates that the beam search function selects for the next step.
+2. `selected_scores`, the scores for the candidates.
+3. `generated_scores`, the updated scores for each prefix (with the new candidates appended).
+
+## Introducing the LoD-based `Pack` and `Unpack` methods in `TensorArray`
+The `selected_ids`, `selected_scores` and `generated_scores` are `LoDTensors`
+that exist in each time step,
+so it is natural to store them in arrays.
+
+Currently, PaddlePaddle has a module called `TensorArray` which can store an array of tensors.
+The results of beam search are best stored in a `TensorArray`.
+
+The `Pack` and `UnPack` in `TensorArray` are used to pack the tensors in the array into a `LoDTensor` and to split a `LoDTensor` into an array of tensors, respectively.
+They need some extensions to support packing and unpacking an array of `LoDTensors`.
diff --git a/paddle/gserver/layers/ScaleSubRegionLayer.cpp b/paddle/gserver/layers/ScaleSubRegionLayer.cpp
index b18bc0c1b9065e09324d8ab4ed165679f6196d00..aa6778aef4e893208fd064ca22e217c6c4d960f9 100644
--- a/paddle/gserver/layers/ScaleSubRegionLayer.cpp
+++ b/paddle/gserver/layers/ScaleSubRegionLayer.cpp
@@ -49,7 +49,7 @@ void ScaleSubRegionLayer::forward(PassType passType) {
   shape_ = TensorShape({batchSize, channelsNum_, imgH_, imgW_});
 
   resetOutput(batchSize, imgV->getWidth());
-  auto out = getOutput();
+  auto& out = getOutput();
   out.setFrameHeight(imgH_);
   out.setFrameWidth(imgW_);
diff --git a/paddle/gserver/tests/test_LayerGrad.cpp b/paddle/gserver/tests/test_LayerGrad.cpp
index 3f7d8810513ed4b765219be722cf8ae9adc7909f..df73e6781533def5641635e9dfa9c9e4e8a0b57f 100644
--- a/paddle/gserver/tests/test_LayerGrad.cpp
+++ b/paddle/gserver/tests/test_LayerGrad.cpp
@@ -53,7 +53,7 @@ TEST(Operator, dot_mul) {
 TEST(Projection, context) {
   for (auto contextStart : {-5, -3, -1, 0, 3}) {
     for (auto contextLength : {1, 2, 5, 7}) {
-      for (auto batchSize : {1, 2, 5, 20, 50}) {
+      for (auto batchSize : {1, 2, 5, 20}) {
         for (auto trainablePadding : {false, true}) {
           LOG(INFO) << " contextStart=" << contextStart
                     << " contextLength=" << contextLength
@@ -585,14 +585,14 @@ TEST(Layer, maxoutLayer) {
 }
 
 void testFcLayer(string format, size_t nnz) {
   TestConfig config;
-  config.biasSize = 4096;
+  config.biasSize = 1024;
   config.layerConfig.set_type("fc");
-  config.layerConfig.set_size(4096);
+  config.layerConfig.set_size(1024);
   config.layerConfig.set_active_type("sigmoid");
   config.layerConfig.set_drop_rate(0.1);
 
   config.inputDefs.push_back(
-      {INPUT_DATA, "layer_0", 8192, nnz, ParaSparse(format)});
+      {INPUT_DATA, "layer_0", 2048, nnz, ParaSparse(format)});
   config.layerConfig.add_inputs();
 
   LOG(INFO) << config.inputDefs[0].sparse.sparse << " "
@@ -609,9 +609,9 @@ void testFcLayer(string format, size_t nnz) {
 }
 
 TEST(Layer, fcLayer) {
-  testFcLayer("", 4096 * 4096 * 2);
-  testFcLayer("csc", 4096 * 40);
-  testFcLayer("csr", 4096 * 40);
+  testFcLayer("", 1024 * 1024 * 2);
+  testFcLayer("csc", 1024 * 10);
+  
testFcLayer("csr", 1024 * 10); } TEST(Layer, SelectiveFullyConnectedLayer) { @@ -1995,7 +1995,7 @@ TEST(Layer, multibox_loss) { TEST(Layer, TransLayer) { TestConfig config; const int height = 128; - const int width = 1028; + const int width = 256; config.layerConfig.set_type("trans"); config.layerConfig.set_size(width); diff --git a/python/paddle/trainer_config_helpers/layers.py b/python/paddle/trainer_config_helpers/layers.py index 9dbfb93217262b0599fd4af75449838041da7a23..36406ef86b812bcb4c671a0e1b1f29e391d79b99 100644 --- a/python/paddle/trainer_config_helpers/layers.py +++ b/python/paddle/trainer_config_helpers/layers.py @@ -789,10 +789,9 @@ class MixedLayerType(LayerOutput): :type size: int :param act: Activation type. :type act: BaseActivation - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param layer_attr: Extra Layer Attribute. :type layer_attr: ExtraLayerAttribute or None @@ -889,10 +888,9 @@ def mixed_layer(size=0, then this function will just return layer's name. :param act: Activation Type. LinearActivation is the default. :type act: BaseActivation - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param layer_attr: The extra layer config. Default is None. :type layer_attr: ExtraLayerAttribute @@ -1034,10 +1032,9 @@ def fc_layer(input, :type act: BaseActivation :param param_attr: The Parameter Attribute|list. :type param_attr: ParameterAttribute - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param layer_attr: Extra Layer config. :type layer_attr: ExtraLayerAttribute | None @@ -1390,10 +1387,9 @@ def pooling_layer(input, :type pooling_type: BasePoolingType | None :param stride: The step size between successive pooling regions. :type stride: Int - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param layer_attr: The Extra Attributes for layer, such as dropout. 
:type layer_attr: ExtraLayerAttribute | None @@ -1491,10 +1487,9 @@ def lstmemory(input, :type gate_act: BaseActivation :param state_act: state activation type, TanhActivation by default. :type state_act: BaseActivation - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param param_attr: Parameter Attribute. :type param_attr: ParameterAttribute | None | False @@ -1617,10 +1612,9 @@ def grumemory(input, This activation affects the :math:`z_t` and :math:`r_t`. It is the :math:`\\sigma` in the above formula. :type gate_act: BaseActivation - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param param_attr: Parameter Attribute. :type param_attr: ParameterAttribute | None | False @@ -1817,10 +1811,9 @@ def expand_layer(input, :type expand_as: LayerOutput :param name: The name of this layer. It is optional. :type name: basestring - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param expand_level: whether input layer is timestep(default) or sequence. :type expand_level: ExpandLevel @@ -1939,10 +1932,9 @@ def seq_reshape_layer(input, :type act: BaseActivation :param layer_attr: extra layer attributes. :type layer_attr: ExtraLayerAttribute. - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :return: LayerOutput object. :rtype: LayerOutput @@ -2326,10 +2318,9 @@ def hsigmoid(input, :type num_classes: int | None :param name: The name of this layer. It is optional. :type name: basestring - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. 
:type bias_attr: ParameterAttribute | None | bool | Any :param param_attr: Parameter Attribute. None means default parameter. :type param_attr: ParameterAttribute | None @@ -2469,10 +2460,9 @@ def img_conv_layer(input, :type dilation: int | tuple | list :param dilation_y: The y dimension of the dilation. :type dilation_y: int - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param num_channels: number of input channels. If None will be set automatically from previous output. @@ -3219,10 +3209,9 @@ def addto_layer(input, act=None, name=None, bias_attr=None, layer_attr=None): :type input: LayerOutput | list | tuple :param act: Activation Type. LinearActivation is the default. :type act: BaseActivation - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param layer_attr: Extra Layer attribute. :type layer_attr: ExtraLayerAttribute @@ -3375,10 +3364,9 @@ def seq_concat_layer(a, b, act=None, name=None, layer_attr=None, :type act: BaseActivation :param layer_attr: Extra Layer Attribute. :type layer_attr: ExtraLayerAttribute - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :return: LayerOutput object. :rtype: LayerOutput @@ -3558,10 +3546,9 @@ def lstm_step_layer(input, :type gate_act: BaseActivation :param state_act: State Activation Type. TanhActivation is the default. :type state_act: BaseActivation - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param layer_attr: layer's extra attribute. :type layer_attr: ExtraLayerAttribute @@ -3617,10 +3604,9 @@ def gru_step_layer(input, :param name: The name of this layer. It is optional. :param gate_act: Activation type of this layer's two gates. Default is Sigmoid. :type gate_act: BaseActivation - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. 
If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param param_attr: the parameter_attribute for transforming the output_mem from previous step. @@ -3680,10 +3666,9 @@ def gru_step_naive_layer(input, :type act: BaseActivation :param gate_act: Activation type of this layer's two gates. Default is Sigmoid. :type gate_act: BaseActivation - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param param_attr: :param layer_attr: @@ -3813,10 +3798,9 @@ def recurrent_layer(input, :type input: LayerOutput :param act: Activation type. TanhActivation is the default. :type act: BaseActivation - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param param_attr: parameter attribute. :type param_attr: ParameterAttribute @@ -4806,10 +4790,9 @@ def tensor_layer(a, :type act: BaseActivation :param param_attr: The Parameter Attribute. :type param_attr: ParameterAttribute - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param layer_attr: Extra Layer config. :type layer_attr: ExtraLayerAttribute | None @@ -4871,10 +4854,9 @@ def selective_fc_layer(input, :type act: BaseActivation :param param_attr: The Parameter Attribute. :type param_attr: ParameterAttribute - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param layer_attr: Extra Layer config. :type layer_attr: ExtraLayerAttribute | None @@ -5497,7 +5479,11 @@ def crf_decoding_layer(input, return LayerOutput(name, LayerType.CRF_DECODING_LAYER, parents, size=1) -@wrap_act_default(act=SigmoidActivation()) +""" +Following are cost Layers. 
+""" + + @wrap_bias_attr_default(has_bias=True) @wrap_param_attr_default() @wrap_name_default() @@ -5505,7 +5491,6 @@ def crf_decoding_layer(input, def nce_layer(input, label, num_classes=None, - act=None, param_attr=None, weight=None, num_neg_samples=10, @@ -5514,9 +5499,12 @@ def nce_layer(input, bias_attr=None, layer_attr=None): """ - Noise-contrastive estimation. - Implements the method in the following paper: - A fast and simple algorithm for training neural probabilistic language models. + Noise-contrastive estimation. This layer implements the method in the + following paper: + + Reference: + A fast and simple algorithm for training neural probabilistic language + models. https://www.cs.toronto.edu/~amnih/papers/ncelm.pdf The example usage is: @@ -5528,32 +5516,37 @@ def nce_layer(input, :param name: The name of this layer. It is optional. :type name: basestring - :param input: The input layers. It could be a LayerOutput of list/tuple of LayerOutput. + :param input: The input layers. It should be a LayerOutput or a list/tuple + of LayerOutput. :type input: LayerOutput | list | tuple | collections.Sequence - :param label: label layer + :param label: The ground truth. :type label: LayerOutput - :param weight: weight layer, can be None(default) + :param weight: The weight layer defines a weight for each sample in the + mini-batch. The default value is None. :type weight: LayerOutput - :param num_classes: number of classes. + :param num_classes: The class number. :type num_classes: int - :param act: Activation type. SigmoidActivation is the default. - :type act: BaseActivation - :param param_attr: The Parameter Attribute|list. - :type param_attr: ParameterAttribute - :param num_neg_samples: number of negative samples. Default is 10. + :param param_attr: The parameter attributes. + :type param_attr: ParameterAttribute|list + :param num_neg_samples: The number of sampled negative labels. The default + value is 10. :type num_neg_samples: int - :param neg_distribution: The distribution for generating the random negative labels. - A uniform distribution will be used if not provided. - If not None, its length must be equal to num_classes. + :param neg_distribution: The discrete noisy distribution over the output + space from which num_neg_samples negative labels + are sampled. If this parameter is not set, a + uniform distribution will be used. A user defined + distribution is a list whose length must be equal + to the num_classes. Each member of the list defines + the probability of a class given input x. :type neg_distribution: list | tuple | collections.Sequence | None - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The attribute for bias. If this parameter is set False or + any object whose type is not ParameterAttribute, no bias + is added. If this parameter is set True, the bias is + initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param layer_attr: Extra Layer Attribute. :type layer_attr: ExtraLayerAttribute - :return: layer name. + :return: The LayerOutput object. 
:rtype: LayerOutput """ if isinstance(input, LayerOutput): @@ -5576,8 +5569,6 @@ def nce_layer(input, assert isinstance(neg_distribution, collections.Sequence) assert len(neg_distribution) == num_classes assert abs(sum(neg_distribution) - 1.0) < 1e-5 - if not isinstance(act, BaseActivation): - raise TypeError() ipts_for_layer = [] parents = [] @@ -5599,7 +5590,7 @@ def nce_layer(input, type=LayerType.NCE_LAYER, num_classes=num_classes, neg_sampling_dist=neg_distribution, - active_type=act.name, + active_type=SigmoidActivation().name, num_neg_samples=num_neg_samples, inputs=ipts_for_layer, bias=ParamAttr.to_bias(bias_attr), @@ -5609,12 +5600,7 @@ def nce_layer(input, LayerType.NCE_LAYER, parents=parents, size=l.config.size, - activation=act) - - -""" -following are cost Layers. -""" + activation=SigmoidActivation()) @wrap_name_default() @@ -5773,20 +5759,21 @@ def cross_entropy(input, :param input: The first input layer. :type input: LayerOutput. :param label: The input label. - :type input: LayerOutput. + :type input: LayerOutput :param name: The name of this layer. It is optional. - :type name: None | basestring. - :param coeff: The cost is multiplied with coeff. - The coefficient affects the gradient in the backward. - :type coeff: float. + :type name: basestring + :param coeff: The weight of the gradient in the back propagation. + 1.0 is the default. + :type coeff: float :param weight: The cost of each sample is multiplied with each weight. The weight should be a layer with size=1. Note that gradient will not be calculated for weight. :type weight: LayerOutout - :param layer_attr: Extra Layer Attribute. + :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for + details. :type layer_attr: ExtraLayerAttribute :return: LayerOutput object. - :rtype: LayerOutput. + :rtype: LayerOutput """ ipts, parents = __cost_input__(input, label, weight) @@ -5819,19 +5806,21 @@ def cross_entropy_with_selfnorm(input, label=label_layer) :param input: The first input layer. - :type input: LayerOutput. + :type input: LayerOutput :param label: The input label. - :type input: LayerOutput. + :type input: LayerOutput :param name: The name of this layer. It is optional. - :type name: None | basestring. - :param coeff: The coefficient affects the gradient in the backward. - :type coeff: float. + :type name: basestring + :param coeff: The weight of the gradient in the back propagation. + 1.0 is the default. + :type coeff: float :param softmax_selfnorm_alpha: The scale factor affects the cost. - :type softmax_selfnorm_alpha: float. - :param layer_attr: Extra Layer Attribute. + :type softmax_selfnorm_alpha: float + :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for + details. :type layer_attr: ExtraLayerAttribute :return: LayerOutput object. - :rtype: LayerOutput. + :rtype: LayerOutput """ Layer( name=name, @@ -5852,7 +5841,7 @@ def cross_entropy_with_selfnorm(input, @layer_support() def sum_cost(input, name=None, layer_attr=None): """ - A loss layer which calculate the sum of the input as loss + A loss layer which calculates the sum of the input as loss. The example usage is: @@ -5861,10 +5850,11 @@ def sum_cost(input, name=None, layer_attr=None): cost = sum_cost(input=input_layer) :param input: The input of this layer. - :type input: LayerOutput. + :type input: LayerOutput :param name: The name of this layer. It is optional. - :type name: None | basestring. - :param layer_attr: Extra Layer Attribute. + :type name: basestring + :param layer_attr: The extra layer attribute. 
See ExtraLayerAttribute for + details. :type layer_attr: ExtraLayerAttribute :return: LayerOutput object. :rtype: LayerOutput. @@ -5904,16 +5894,18 @@ def huber_regression_cost(input, cost = huber_regression_cost(input=input_layer, label=label_layer) :param input: The first input layer. - :type input: LayerOutput. + :type input: LayerOutput :param label: The input label. - :type input: LayerOutput. + :type input: LayerOutput :param name: The name of this layer. It is optional. - :type name: None | basestring. + :type name: basestring :param delta: The difference between the observed and predicted values. - :type delta: float. - :param coeff: The coefficient affects the gradient in the backward. - :type coeff: float. - :param layer_attr: Extra Layer Attribute. + :type delta: float + :param coeff: The weight of the gradient in the back propagation. + 1.0 is the default. + :type coeff: float + :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for + details. :type layer_attr: ExtraLayerAttribute :return: LayerOutput object. :rtype: LayerOutput. @@ -5954,17 +5946,19 @@ def huber_classification_cost(input, cost = huber_classification_cost(input=input_layer, label=label_layer) :param input: The first input layer. - :type input: LayerOutput. + :type input: LayerOutput :param label: The input label. - :type input: LayerOutput. + :type input: LayerOutput :param name: The name of this layer. It is optional. - :type name: None | basestring. - :param coeff: The coefficient affects the gradient in the backward. - :type coeff: float. - :param layer_attr: Extra Layer Attribute. + :type name: basestring + :param coeff: The weight of the gradient in the back propagation. + 1.0 is the default. + :type coeff: float + :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for + details. :type layer_attr: ExtraLayerAttribute :return: LayerOutput object. - :rtype: LayerOutput. + :rtype: LayerOutput """ assert isinstance(input, LayerOutput) if input.size is not None: @@ -6001,10 +5995,12 @@ def multi_binary_label_cross_entropy(input, :param label: The input label. :type input: LayerOutput :param name: The name of this layer. It is optional. - :type name: None | basestring - :param coeff: The coefficient affects the gradient in the backward. + :type name: basestring + :param coeff: The weight of the gradient in the back propagation. + 1.0 is the default. :type coeff: float - :param layer_attr: Extra Layer Attribute. + :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for + details. :type layer_attr: ExtraLayerAttribute :return: LayerOutput object. :rtype: LayerOutput @@ -6107,7 +6103,7 @@ def cross_entropy_over_beam(input, name=None): :param input: Input beams for this layer. :type input: BeamInput - :param name: The name of this layer. + :param name: The name of this layer. It is optional. :type name: basestring :return: LayerOutput object. :rtype: LayerOutput @@ -6142,7 +6138,7 @@ def cross_entropy_over_beam(input, name=None): def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None): """ This is a L1 loss but more smooth. It requires that the - size of input and label are equal. The formula is as follows, + sizes of input and label are equal. The formula is as follows, .. 
math::
 
@@ -6154,8 +6150,9 @@ def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
            smooth_{L1}(x) = \\begin{cases} 0.5x^2& \\text{if}  \\ |x| < 1 \\\\ |x|-0.5& \\text{otherwise} \end{cases}
 
-    More details can be found by referring to `Fast R-CNN
-    <https://arxiv.org/pdf/1504.08083v2.pdf>`_
+    Reference:
+        Fast R-CNN
+        https://arxiv.org/pdf/1504.08083v2.pdf
 
     The example usage is:
 
     .. code-block:: python
 
        cost = smooth_l1_cost(input=input_layer, label=label_layer)
 
@@ -6169,10 +6166,12 @@ def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
     :param label: The input label.
     :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring
-    :param coeff: The coefficient affects the gradient in the backward.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
     :type coeff: float
-    :param layer_attr: Extra Layer Attribute.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6194,12 +6193,12 @@ def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
 @wrap_name_default()
 def multiplex_layer(input, name=None, layer_attr=None):
     """
-    This layer multiplex multiple layers according to the index,
-    which is provided by the first input layer.
-    inputs[0]: the index of the layer to output of size batchSize.
+    This layer multiplexes multiple layers according to the indexes,
+    which are provided by the first input layer.
+    inputs[0]: the indexes of the layers to form the output of size batchSize.
     inputs[1:N]; the candidate output data.
-    For each index i from 0 to batchSize -1, the output is the i-th row of the
-    (index[i] + 1)-th layer.
+    For each index i from 0 to batchSize - 1, the i-th row of the output is
+    the same as the i-th row of the (index[i] + 1)-th layer.
 
     For each i-th row of output:
     .. math::
 
@@ -6218,7 +6217,8 @@ def multiplex_layer(input, name=None, layer_attr=None):
     :type input: list of LayerOutput
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param layer_attr: extra layer attributes.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute.
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6322,14 +6322,14 @@ def row_conv_layer(input,
     :type context_len: int
     :param act: Activation Type. LinearActivation is the default.
     :type act: BaseActivation
-    :param param_attr: The Parameter Attribute. If None, the parameter will be
-        initialized smartly. It's better to set it by yourself.
+    :param param_attr: The parameter attribute. See ParameterAttribute for
+                       details.
     :type param_attr: ParameterAttribute
-    :param layer_attr: Extra Layer config.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute | None
     :return: LayerOutput object.
     :rtype: LayerOutput
-
     """
     assert isinstance(input, LayerOutput)
     assert context_len > 0, "the context_len must be greatet than 0."
@@ -6354,7 +6354,7 @@ def prelu_layer(input,
                 param_attr=None,
                 layer_attr=None):
     """
-    The Parameter Relu activation that actives outputs with a learnable weight.
+    The Parametric Relu activation that activates outputs with a learnable weight.
 
     Reference:
         Delving Deep into Rectifiers: Surpassing Human-Level Performance on
@@ -6374,16 +6374,17 @@ def prelu_layer(input,
     :type name: basestring
     :param input: The input of this layer.
:type input: LayerOutput - :param partial_sum: this parameter makes a group of inputs share a same weight. + :param partial_sum: this parameter makes a group of inputs share the same weight. - partial_sum = 1, indicates the element-wise activation: each element has a weight. - - partial_sum = number of elements in one channel, indicates the channel-wise activation, elements in a channel share a same weight. - - partial_sum = number of outputs, indicates all elements share a same weight. + - partial_sum = number of elements in one channel, indicates the channel-wise activation, elements in a channel share the same weight. + - partial_sum = number of outputs, indicates all elements share the same weight. :type partial_sum: int :param param_attr: The parameter attribute. See ParameterAttribute for details. - :type param_attr: ParameterAttribute | None - :param layer_attr: Extra layer configurations. Default is None. + :type param_attr: ParameterAttribute + :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for + details. :type layer_attr: ExtraLayerAttribute | None :return: LayerOutput object. :rtype: LayerOutput @@ -6439,34 +6440,34 @@ def gated_unit_layer(input, :param input: The input of this layer. :type input: LayerOutput - :param size: output size of the gated unit. + :param size: The dimension of this layer's output. :type size: int - :param act: Activation type of the projected input. LinearActivation is the default. + :param act: Activation type of the projection. LinearActivation is the default. :type act: BaseActivation :param name: The name of this layer. It is optional. :type name: basestring - :param gate_attr: Attributes to tune the gate output, for example, error - clipping threshold, dropout and so on. See ExtraLayerAttribute for - more details. + :param gate_attr: The extra layer attribute of the gate. See ExtraLayerAttribute for + details. :type gate_attr: ExtraLayerAttribute | None - :param gate_param_attr: Attributes to tune the learnable projected matrix - parameter of the gate. - :type gate_param_attr: ParameterAttribute | None - :param gate_bias_attr: Attributes to tune the learnable bias of the gate. - :type gate_bias_attr: ParameterAttribute | None - :param inproj_attr: Attributes to the tune the projected input, for - example, error clipping threshold, dropout and so on. See - ExtraLayerAttribute for more details. + :param gate_param_attr: The parameter attribute of the gate. See ParameterAttribute + for details. + :type gate_param_attr: ParameterAttribute + :param gate_bias_attr: The bias attribute of the gate. If the parameter is set to False or + an object whose type is not ParameterAttribute, no bias is defined. + If the parameter is set to True, the bias is initialized to zero. + :type gate_bias_attr: ParameterAttribute | bool | None | Any + :param inproj_attr: Extra layer attributes of the projection. See ExtraLayerAttribute for + details. :type inproj_attr: ExtraLayerAttribute | None - :param inproj_param_attr: Attributes to tune the learnable parameter of - the projection of input. - :type inproj_param_attr: ParameterAttribute | None - :param inproj_bias_attr: Attributes to tune the learnable bias of - projection of the input. - :type inproj_bias_attr: ParameterAttribute | None - :param layer_attr: Attributes to tune the final output of the gated unit, - for example, error clipping threshold, dropout and so on. See - ExtraLayerAttribute for more details. + :param inproj_param_attr: The parameter attribute of the projection. 
See ParameterAttribute + for details. + :type inproj_param_attr: ParameterAttribute + :param inproj_bias_attr: The bias attribute of the projection. If the parameter is set to False + or an object whose type is not ParameterAttribute, no bias is defined. + If the parameter is set to True, the bias is initialized to zero. + :type inproj_bias_attr: ParameterAttribute | bool | None | Any + :param layer_attr: Extra layer attribute of the product. See ExtraLayerAttribute for + details. :type layer_attr: ExtraLayerAttribute | None :return: LayerOutput object. :rtype: LayerOutput @@ -6662,9 +6663,9 @@ def clip_layer(input, min, max, name=None): :param input: The input of this layer. :type input: LayerOutput. :param min: The lower threshold for clipping. - :type min: double + :type min: float :param max: The upper threshold for clipping. - :type max: double + :type max: float :return: LayerOutput object. :rtype: LayerOutput """ @@ -6712,7 +6713,6 @@ def seq_slice_layer(input, starts, ends, name=None): :type ends: LayerOutput | None :return: LayerOutput object. :rtype: LayerOutput - """ assert isinstance(input, LayerOutput), ( @@ -6833,20 +6833,21 @@ def img_conv3d_layer(input, :param padding: The numbers of padding along three axises. If the parameter is set to one integer, they will be same. :type padding: int | tuple | list - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :param num_channels: The number of input channels. If the parameter is not set or set to None, its actual value will be automatically set to the channels number of the input . :type num_channels: int - :param param_attr: The parameter attribute of the convolution. + :param param_attr: The parameter attribute of the convolution. See ParameterAttribute for + details. :type param_attr: ParameterAttribute :param shared_biases: Whether biases will be shared between filters or not. :type shared_biases: bool - :param layer_attr: Extra layer attributes. + :param layer_attr: The extra layer attributes. See ExtraLayerAttribute for + details. :type layer_attr: ExtraLayerAttribute :param trans: True if it is a convTransLayer, False if it is a convLayer :type trans: bool @@ -6953,12 +6954,12 @@ def scale_shift_layer(input, name=None, param_attr=None, bias_attr=None): :type name: basestring :param input: The input of this layer. :type input: LayerOutput - :param param_attr: The parameter attribute of scaling. + :param param_attr: The parameter attribute of scaling. See ParameterAttribute for + details. :type param_attr: ParameterAttribute - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :return: LayerOutput object. 
:rtype: LayerOutput @@ -7016,10 +7017,9 @@ def sub_seq_layer(input, offsets, sizes, act=None, bias_attr=None, name=None): :type sizes: LayerOutput :param act: Activation type, LinearActivation is the default. :type act: BaseActivation. - :param bias_attr: The Bias Attribute. If the parameter is set to - False or something not type of ParameterAttribute, - no bias is defined. If the parameter is set to - True, the bias is initialized to zero. + :param bias_attr: The bias attribute. If the parameter is set to False or an object + whose type is not ParameterAttribute, no bias is defined. If the + parameter is set to True, the bias is initialized to zero. :type bias_attr: ParameterAttribute | None | bool | Any :return: LayerOutput object. :rtype: LayerOutput