Commit 8bf37994, authored by dangqingqing

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into lstm_fix

# Design: Sequence Decoder Generating LoDTensors
In tasks such as machine translation and image to text,
a [sequence decoder](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) is necessary to generate sequences.
This documentation describes how to implement the sequence decoder as an operator.
## Beam Search based Decoder
The [beam search algorithm](https://en.wikipedia.org/wiki/Beam_search) is necessary when generating sequences. It is a heuristic search algorithm that explores the paths by expanding the most promising nodes in a limited set.

In the old version of PaddlePaddle, a C++ class `RecurrentGradientMachine` implements the general sequence decoder based on beam search. Due to its complexity, the implementation relies on a lot of special data structures that are quite ad hoc and hard for users to customize.

There are a lot of heuristic tricks in sequence generation tasks, so the flexibility of the sequence decoder is very important to users.
During PaddlePaddle's refactoring work, some new concepts were proposed, such as [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md) and [TensorArray](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/tensor_array.md), that can better support sequences. They help make the implementation of the beam search based sequence decoder **more transparent and modular**.

For example, the RNN states, the candidate IDs and the probabilities of beam search can be represented as `LoDTensors`; the selected candidates' IDs in each time step can be stored in a `TensorArray` and `Pack`ed into the translated sentences.
## Changing LoD's absolute offset to relative offsets
The current `LoDTensor` is designed to store levels of variable-length sequences. It stores several arrays of integers, each of which represents a level.

The integers in each level represent the begin and end (not inclusive) offsets of a sequence **in the underlying tensor**; let's call this format the **absolute-offset LoD** for clarity.

The absolute-offset LoD can retrieve any sequence quickly, but it fails to represent empty sequences. For example, a two-level LoD is as follows
```python
[[0, 3, 9]
[0, 2, 3, 3, 3, 9]]
```
The first level tells that there are two sequences:
- the first's offset is `[0, 3)`
- the second's offset is `[3, 9)`
while on the second level, there are several empty sequences that both begin and end at `3`.
It is impossible to tell how many empty second-level sequences belong to each of the first-level sequences.

There are many scenarios that rely on the representation of empty sequences, such as machine translation or image-to-text, where an instance may have no translation or the candidate set of some prefix may be empty.

So let's introduce another format of LoD. It stores **the offsets of the lower-level sequences** and is called the **relative-offset** LoD.
For example, to represent the same sequences as the above data

```python
[[0, 2, 5]
 [0, 2, 3, 3, 3, 9]]
```

the first level represents that there are two sequences, whose offsets in the second-level LoD are `[0, 2)` and `[2, 5)`.

The second level is the same as in the absolute-offset example, because the lowest level refers directly to the underlying tensor. With this format it is easy to see that the second sequence in the first-level LoD contains two empty sequences.
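To make the relationship between the two formats concrete, here is a minimal Python sketch (the helper below is illustrative only, not part of any PaddlePaddle API) that resolves a relative-offset LoD back into absolute offsets and counts the empty sub-sequences owned by each top-level sequence:

```python
def to_absolute(lod):
    """Convert a relative-offset LoD to absolute offsets.

    Every level except the last indexes into the level below it; the
    last level already holds offsets into the underlying tensor.
    """
    levels = [lod[-1]]
    for level in reversed(lod[:-1]):
        lower = levels[0]
        # a relative offset i points at the i-th boundary of the lower level
        levels.insert(0, [lower[i] for i in level])
    return levels

relative = [[0, 2, 5],
            [0, 2, 3, 3, 3, 9]]
print(to_absolute(relative))  # [[0, 3, 9], [0, 2, 3, 3, 3, 9]]

# count the empty second-level sequences inside each first-level sequence
top, low = relative
for s in range(len(top) - 1):
    subs = [(low[i], low[i + 1]) for i in range(top[s], top[s + 1])]
    print(s, subs, sum(1 for b, e in subs if b == e), 'empty')
# 0 [(0, 2), (2, 3)] 0 empty
# 1 [(3, 3), (3, 3), (3, 9)] 2 empty
```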
The following demos are based on relative-offset LoD.
## Usage in a simple machine translation model
Let's start from a simple machine translation model, simplified from the [machine translation chapter](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation), to draw a blueprint of what a sequence decoder can do and how to use it.

The model has an encoder that learns the semantic vector from a sequence, and a decoder that uses the sequence decoder to generate new sentences.
**Encoder**
```python
import paddle as pd

dict_size = 8000
source_dict_size = dict_size
target_dict_size = dict_size
word_vector_dim = 128
encoder_dim = 128
decoder_dim = 128
beam_size = 5
max_length = 120

# encoder
src_word_id = pd.data(
    name='source_language_word',
    type=pd.data.integer_value_sequence(source_dict_size))
src_embedding = pd.embedding(size=source_dict_size, dim=word_vector_dim)
src_word_vec = pd.lookup(src_embedding, src_word_id)

encoder_out_seq = pd.gru(input=src_word_vec, size=encoder_dim)

encoder_ctx = pd.last_seq(encoder_out_seq)
# encoder_ctx_proj is the learned semantic vector
encoder_ctx_proj = pd.fc(
    encoder_ctx, size=decoder_dim, act=pd.activation.Tanh(), bias=None)
```
**Decoder**
```python
def generate():
    decoder = pd.while_loop()
    with decoder.step():
        decoder_mem = decoder.memory(init=encoder_ctx)  # mark the memory
        generated_ids = decoder.memory()  # TODO init to batch_size <s>s
        generated_scores = decoder.memory()  # TODO init to batch_size 1s or 0s

        # trg_embedding is the target-side embedding table,
        # defined in the same way as src_embedding above
        target_word = pd.lookup(trg_embedding, generated_ids)
        # expand encoder_ctx's batch to fit target_word's LoD;
        # decoder_mem, which keeps one state row per prefix, is expanded
        # in the same way. For example, if decoder_mem.lod is
        #   [[0, 2, 5]]
        # and its tensor content is [a1 a2 a3 a4 a5],
        # there are 2 sentences to translate:
        # - the first sentence has 2 translation prefixes, a1 and a2
        # - the second sentence has 3 translation prefixes, a3, a4 and a5
        # and target_word.lod is
        #   [[0, 2, 5],
        #    [0, 2, 4, 7, 9, 12]]
        # which means the 5 prefixes have 2, 2, 3, 2 and 3 candidates
        # respectively, so the expanded content will be
        #   [a1 a1 a2 a2 a3 a3 a3 a4 a4 a5 a5 a5]
        encoder_ctx_expanded = pd.lod_expand(encoder_ctx, target_word)
        decoder_input = pd.fc(
            act=pd.activation.Linear(),
            input=[target_word, encoder_ctx_expanded],
            size=3 * decoder_dim)
        gru_out, cur_mem = pd.gru_step(
            decoder_input, mem=decoder_mem, size=decoder_dim)
        scores = pd.fc(
            gru_out,
            size=target_dict_size,
            bias=None,
            act=pd.activation.Softmax())
        # K is a config that sets how many candidates to keep for each prefix
        topk_scores, topk_ids = pd.top_k(scores, K)
        topk_generated_scores = pd.add_scalar(topk_scores, generated_scores)

        selected_ids, selected_generation_scores = decoder.beam_search(
            topk_ids, topk_generated_scores)

        # update the states
        decoder_mem.update(cur_mem)  # tells how to update state
        generated_ids.update(selected_ids)
        generated_scores.update(selected_generation_scores)

        decoder.output(selected_ids)
        decoder.output(selected_generation_scores)

    translation_ids, translation_scores = decoder()
```
`decoder.beam_search` is an operator that, given the candidates and the scores of the translations including those candidates, returns the result of the beam search algorithm.

In this way, users can customize anything on the inputs or outputs of beam search. For example, there are several ways to prune some translation prefixes (a sketch of the first one follows this list):

1. make the corresponding elements in `topk_generated_scores` zero or some small values, so that beam search will discard these candidates;
2. remove some specific candidates from `selected_ids`;
3. get the final `translation_ids` and remove the matching translation sequences from it.
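As a sketch of the first option, the snippet below uses NumPy stand-ins for the LoDTensors (the `banned_ids` argument and all names are hypothetical, not part of the proposed API). It zeroes the accumulated scores of unwanted candidates, so a beam search that keeps the highest-scoring candidates will drop them:

```python
import numpy as np

def prune_candidates(topk_ids, topk_generated_scores, banned_ids):
    """Zero the scores of banned candidate ids; beam search keeps the
    highest-scoring candidates, so the banned ones are discarded."""
    mask = np.isin(topk_ids, list(banned_ids))
    return np.where(mask, 0.0, topk_generated_scores)

topk_ids = np.array([[7, 2, 9], [4, 7, 1]])              # K = 3 per prefix
scores   = np.array([[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]])  # accumulated scores
print(prune_candidates(topk_ids, scores, banned_ids={7}))
# [[0.  0.3 0.2]
#  [0.6 0.  0.1]]
```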
The implementation of the sequence decoder can reuse the C++ class [RNNAlgorithm](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/paddle/operators/dynamic_recurrent_op.h#L30), so the Python syntax is quite similar to that of an [RNN](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/doc/design/block.md#blocks-with-for-and-rnnop).

Both `translation_ids` and `translation_scores` are two-level `LoDTensors`:

- the first level represents the `batch_size` of (source) sentences;
- the second level represents the candidate ID sets for the translation prefixes.

For example, there may be 3 source sentences to translate, which have 2, 3 and 1 candidates respectively.
Unlike an RNN, in the sequence decoder the previous state and the current state have different LoDs and shapes, so a `lod_expand` operator is used to expand the LoD of the previous state to fit the current state.

For example, the previous state:

* LoD is `[[0, 1, 3], [0, 2, 5, 6]]`
* the content of the tensor is `a1 a2 b1 b2 b3 c1`

and the current state, stored in `encoder_ctx_expanded`:

* LoD is `[[0, 2, 6], [0, 3, 5, 8, 9, 11, 11]]`
* the content is
  - `a1 a1 a1` (a1 has 3 candidates, so the state is copied 3 times, once for each candidate)
  - `a2 a2`
  - `b1 b1 b1`
  - `b2`
  - `b3 b3`
  - none (c1 has 0 candidates, so c1 is dropped)

Benefiting from the relative-offset LoD, the empty candidate set can be represented naturally.
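A minimal NumPy sketch of what `lod_expand` does under these assumptions (a one-level relative LoD and plain arrays in place of real LoDTensors; the function mirrors the pseudo-API's name but is illustrative only), reproducing the example above:

```python
import numpy as np

def lod_expand(x, target_lod):
    """Repeat the i-th row of `x` once per element of the i-th sequence
    of `target_lod`, so the previous state lines up with the new one."""
    repeats = [target_lod[i + 1] - target_lod[i]
               for i in range(len(target_lod) - 1)]
    return np.repeat(x, repeats, axis=0)

# previous states, one row per prefix (1-D strings here for brevity)
prev_state = np.array(['a1', 'a2', 'b1', 'b2', 'b3', 'c1'])
# the prefixes have 3, 2, 3, 1, 2 and 0 candidates respectively
new_lod = [0, 3, 5, 8, 9, 11, 11]
print(lod_expand(prev_state, new_lod))
# ['a1' 'a1' 'a1' 'a2' 'a2' 'b1' 'b1' 'b1' 'b2' 'b3' 'b3']  -- c1 is dropped
```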
The states in each time step can be stored in a `TensorArray` and `Pack`ed into a final LoDTensor. The corresponding syntax is
```python
decoder.output(selected_ids)
decoder.output(selected_generation_scores)
```
The `selected_ids` are the candidate ids for the prefixes; `TensorArray` will `Pack` them into a two-level `LoDTensor`, where the first level represents the source sequences and the second level represents the generated sequences.

Packing the `selected_scores` will get a `LoDTensor` that stores the score of each translation candidate.

Packing the `selected_generation_scores` will get a `LoDTensor` whose sequence tails are the probabilities of the corresponding translations.
## LoD and shape changes during decoding
<p align="center">
<img src="./images/LOD-and-shape-changes-during-decoding.jpg"/>
</p>
According to the image above, the only phase that changes the LoD is beam search.
## Beam search design
The beam search algorithm will be implemented as one method of the sequence decoder. It has three inputs:

1. `topk_ids`, the top K candidate ids for each prefix.
2. `topk_scores`, the corresponding scores for `topk_ids`.
3. `generated_scores`, the scores of the prefixes.

All of these are LoDTensors, so that the sequence affiliation is clear. Beam search will keep a beam for each prefix and select a smaller candidate set for each prefix.

It will return three variables (a rough sketch of the selection step follows this list):

1. `selected_ids`, the final candidates that the beam search function selects for the next step.
2. `selected_scores`, the scores of the selected candidates.
3. `generated_scores`, the updated scores of each prefix (with the new candidates appended).
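A rough NumPy sketch of the selection step, assuming the inputs are flattened to one row per prefix (the real operator works on LoDTensors and also has to maintain the LoD and handle end-of-sentence tokens, which this ignores):

```python
import numpy as np

def beam_search_step(topk_ids, topk_generated_scores, beam_size):
    """For each prefix, keep the `beam_size` highest-scoring candidates."""
    order = np.argsort(-topk_generated_scores, axis=1)[:, :beam_size]
    selected_ids = np.take_along_axis(topk_ids, order, axis=1)
    selected_scores = np.take_along_axis(topk_generated_scores, order, axis=1)
    return selected_ids, selected_scores

ids    = np.array([[7, 2, 9, 4], [5, 1, 8, 3]])  # K = 4 candidates per prefix
scores = np.array([[0.10, 0.40, 0.30, 0.20],
                   [0.60, 0.05, 0.20, 0.15]])
sel_ids, sel_scores = beam_search_step(ids, scores, beam_size=2)
print(sel_ids)     # [[2 9] [5 8]]
print(sel_scores)  # [[0.4 0.3] [0.6 0.2]]
```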
## Introducing the LoD-based `Pack` and `Unpack` methods in `TensorArray`
The `selected_ids`, `selected_scores` and `generated_scores` are LoDTensors that exist in each time step, so it is natural to store them in arrays.

Currently, PaddlePaddle has a module called `TensorArray` which can store an array of tensors. The results of beam search are best stored in a `TensorArray`.

The `Pack` and `UnPack` in `TensorArray` are used to pack the tensors in the array into one `LoDTensor` and to split a `LoDTensor` into an array of tensors, respectively. They need some extensions to support packing or unpacking an array of `LoDTensors`.
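A toy sketch of the LoD-aware direction of `Pack`, assuming (purely for illustration) that each step holds one selected id per source sequence and `None` marks a finished sequence; it gathers the per-step outputs into a one-level relative-offset LoD plus flat data (the real operator would also handle beams and a second LoD level):

```python
def pack(steps):
    """Pack per-step outputs into (lod, data), one sequence per source."""
    seqs = [[] for _ in range(len(steps[0]))]
    for step in steps:
        for i, token in enumerate(step):
            if token is not None:
                seqs[i].append(token)
    lod, data = [0], []
    for seq in seqs:
        data.extend(seq)
        lod.append(len(data))
    return [lod], data

# three decoding steps over two source sentences; the second finishes early
steps = [[3, 7], [5, 1], [9, None]]
print(pack(steps))  # ([[0, 3, 5]], [3, 5, 9, 7, 1])
```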
@@ -49,7 +49,7 @@ void ScaleSubRegionLayer::forward(PassType passType) {
   shape_ = TensorShape({batchSize, channelsNum_, imgH_, imgW_});
   resetOutput(batchSize, imgV->getWidth());
-  auto out = getOutput();
+  auto& out = getOutput();
   out.setFrameHeight(imgH_);
   out.setFrameWidth(imgW_);
......
@@ -53,7 +53,7 @@ TEST(Operator, dot_mul) {
 TEST(Projection, context) {
   for (auto contextStart : {-5, -3, -1, 0, 3}) {
     for (auto contextLength : {1, 2, 5, 7}) {
-      for (auto batchSize : {1, 2, 5, 20, 50}) {
+      for (auto batchSize : {1, 2, 5, 20}) {
         for (auto trainablePadding : {false, true}) {
           LOG(INFO) << " contextStart=" << contextStart
                     << " contextLength=" << contextLength
@@ -585,14 +585,14 @@ TEST(Layer, maxoutLayer) {
 }

 void testFcLayer(string format, size_t nnz) {
   TestConfig config;
-  config.biasSize = 4096;
+  config.biasSize = 1024;
   config.layerConfig.set_type("fc");
-  config.layerConfig.set_size(4096);
+  config.layerConfig.set_size(1024);
   config.layerConfig.set_active_type("sigmoid");
   config.layerConfig.set_drop_rate(0.1);
   config.inputDefs.push_back(
-      {INPUT_DATA, "layer_0", 8192, nnz, ParaSparse(format)});
+      {INPUT_DATA, "layer_0", 2048, nnz, ParaSparse(format)});
   config.layerConfig.add_inputs();
   LOG(INFO) << config.inputDefs[0].sparse.sparse << " "
@@ -609,9 +609,9 @@ void testFcLayer(string format, size_t nnz) {
 }

 TEST(Layer, fcLayer) {
-  testFcLayer("", 4096 * 4096 * 2);
-  testFcLayer("csc", 4096 * 40);
-  testFcLayer("csr", 4096 * 40);
+  testFcLayer("", 1024 * 1024 * 2);
+  testFcLayer("csc", 1024 * 10);
+  testFcLayer("csr", 1024 * 10);
 }

 TEST(Layer, SelectiveFullyConnectedLayer) {
@@ -1995,7 +1995,7 @@ TEST(Layer, multibox_loss) {
 TEST(Layer, TransLayer) {
   TestConfig config;
   const int height = 128;
-  const int width = 1028;
+  const int width = 256;
   config.layerConfig.set_type("trans");
   config.layerConfig.set_size(width);
......
@@ -789,10 +789,9 @@ class MixedLayerType(LayerOutput):
     :type size: int
     :param act: Activation type.
     :type act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer Attribute.
     :type layer_attr: ExtraLayerAttribute or None
@@ -889,10 +888,9 @@ def mixed_layer(size=0,
                  then this function will just return layer's name.
     :param act: Activation Type. LinearActivation is the default.
     :type act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: The extra layer config. Default is None.
     :type layer_attr: ExtraLayerAttribute
@@ -1034,10 +1032,9 @@ def fc_layer(input,
     :type act: BaseActivation
     :param param_attr: The Parameter Attribute|list.
     :type param_attr: ParameterAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer config.
     :type layer_attr: ExtraLayerAttribute | None
@@ -1390,10 +1387,9 @@ def pooling_layer(input,
     :type pooling_type: BasePoolingType | None
     :param stride: The step size between successive pooling regions.
     :type stride: Int
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: The Extra Attributes for layer, such as dropout.
     :type layer_attr: ExtraLayerAttribute | None
@@ -1491,10 +1487,9 @@ def lstmemory(input,
     :type gate_act: BaseActivation
     :param state_act: state activation type, TanhActivation by default.
     :type state_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: Parameter Attribute.
     :type param_attr: ParameterAttribute | None | False
@@ -1617,10 +1612,9 @@ def grumemory(input,
                      This activation affects the :math:`z_t` and :math:`r_t`. It is the
                      :math:`\\sigma` in the above formula.
     :type gate_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: Parameter Attribute.
     :type param_attr: ParameterAttribute | None | False
@@ -1817,10 +1811,9 @@ def expand_layer(input,
     :type expand_as: LayerOutput
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param expand_level: whether input layer is timestep(default) or sequence.
     :type expand_level: ExpandLevel
@@ -1939,10 +1932,9 @@ def seq_reshape_layer(input,
     :type act: BaseActivation
     :param layer_attr: extra layer attributes.
     :type layer_attr: ExtraLayerAttribute.
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -2326,10 +2318,9 @@ def hsigmoid(input,
     :type num_classes: int | None
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: Parameter Attribute. None means default parameter.
     :type param_attr: ParameterAttribute | None
@@ -2469,10 +2460,9 @@ def img_conv_layer(input,
     :type dilation: int | tuple | list
     :param dilation_y: The y dimension of the dilation.
     :type dilation_y: int
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param num_channels: number of input channels. If None will be set
                          automatically from previous output.
@@ -3219,10 +3209,9 @@ def addto_layer(input, act=None, name=None, bias_attr=None, layer_attr=None):
     :type input: LayerOutput | list | tuple
     :param act: Activation Type. LinearActivation is the default.
     :type act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer attribute.
     :type layer_attr: ExtraLayerAttribute
@@ -3375,10 +3364,9 @@ def seq_concat_layer(a, b, act=None, name=None, layer_attr=None,
     :type act: BaseActivation
     :param layer_attr: Extra Layer Attribute.
     :type layer_attr: ExtraLayerAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -3558,10 +3546,9 @@ def lstm_step_layer(input,
     :type gate_act: BaseActivation
     :param state_act: State Activation Type. TanhActivation is the default.
     :type state_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: layer's extra attribute.
     :type layer_attr: ExtraLayerAttribute
@@ -3617,10 +3604,9 @@ def gru_step_layer(input,
     :param name: The name of this layer. It is optional.
     :param gate_act: Activation type of this layer's two gates. Default is Sigmoid.
     :type gate_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: the parameter_attribute for transforming the output_mem
                        from previous step.
@@ -3680,10 +3666,9 @@ def gru_step_naive_layer(input,
     :type act: BaseActivation
     :param gate_act: Activation type of this layer's two gates. Default is Sigmoid.
     :type gate_act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr:
     :param layer_attr:
@@ -3813,10 +3798,9 @@ def recurrent_layer(input,
     :type input: LayerOutput
     :param act: Activation type. TanhActivation is the default.
     :type act: BaseActivation
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param param_attr: parameter attribute.
     :type param_attr: ParameterAttribute
@@ -4806,10 +4790,9 @@ def tensor_layer(a,
     :type act: BaseActivation
     :param param_attr: The Parameter Attribute.
     :type param_attr: ParameterAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer config.
     :type layer_attr: ExtraLayerAttribute | None
@@ -4871,10 +4854,9 @@ def selective_fc_layer(input,
     :type act: BaseActivation
     :param param_attr: The Parameter Attribute.
     :type param_attr: ParameterAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer config.
     :type layer_attr: ExtraLayerAttribute | None
@@ -5497,7 +5479,11 @@ def crf_decoding_layer(input,
     return LayerOutput(name, LayerType.CRF_DECODING_LAYER, parents, size=1)

-@wrap_act_default(act=SigmoidActivation())
+"""
+Following are cost Layers.
+"""
+
 @wrap_bias_attr_default(has_bias=True)
 @wrap_param_attr_default()
 @wrap_name_default()
@@ -5505,7 +5491,6 @@ def crf_decoding_layer(input,
 def nce_layer(input,
               label,
               num_classes=None,
-              act=None,
               param_attr=None,
               weight=None,
               num_neg_samples=10,
@@ -5514,9 +5499,12 @@ def nce_layer(input,
               bias_attr=None,
               layer_attr=None):
     """
-    Noise-contrastive estimation.
-    Implements the method in the following paper:
-    A fast and simple algorithm for training neural probabilistic language models.
+    Noise-contrastive estimation. This layer implements the method in the
+    following paper:
+
+    Reference:
+        A fast and simple algorithm for training neural probabilistic language
+        models. https://www.cs.toronto.edu/~amnih/papers/ncelm.pdf

     The example usage is:
@@ -5528,32 +5516,37 @@ def nce_layer(input,
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param input: The input layers. It could be a LayerOutput of list/tuple of LayerOutput.
+    :param input: The input layers. It should be a LayerOutput or a list/tuple
+                  of LayerOutput.
     :type input: LayerOutput | list | tuple | collections.Sequence
-    :param label: label layer
+    :param label: The ground truth.
     :type label: LayerOutput
-    :param weight: weight layer, can be None(default)
+    :param weight: The weight layer defines a weight for each sample in the
+                   mini-batch. The default value is None.
     :type weight: LayerOutput
-    :param num_classes: number of classes.
+    :param num_classes: The class number.
     :type num_classes: int
-    :param act: Activation type. SigmoidActivation is the default.
-    :type act: BaseActivation
-    :param param_attr: The Parameter Attribute|list.
-    :type param_attr: ParameterAttribute
-    :param num_neg_samples: number of negative samples. Default is 10.
+    :param param_attr: The parameter attributes.
+    :type param_attr: ParameterAttribute|list
+    :param num_neg_samples: The number of sampled negative labels. The default
+                            value is 10.
     :type num_neg_samples: int
-    :param neg_distribution: The distribution for generating the random negative labels.
-                             A uniform distribution will be used if not provided.
-                             If not None, its length must be equal to num_classes.
+    :param neg_distribution: The discrete noisy distribution over the output
+                             space from which num_neg_samples negative labels
+                             are sampled. If this parameter is not set, a
+                             uniform distribution will be used. A user defined
+                             distribution is a list whose length must be equal
+                             to the num_classes. Each member of the list defines
+                             the probability of a class given input x.
     :type neg_distribution: list | tuple | collections.Sequence | None
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The attribute for bias. If this parameter is set False or
+                      any object whose type is not ParameterAttribute, no bias
+                      is added. If this parameter is set True, the bias is
+                      initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param layer_attr: Extra Layer Attribute.
     :type layer_attr: ExtraLayerAttribute
-    :return: layer name.
+    :return: The LayerOutput object.
     :rtype: LayerOutput
     """
     if isinstance(input, LayerOutput):
@@ -5576,8 +5569,6 @@ def nce_layer(input,
         assert isinstance(neg_distribution, collections.Sequence)
         assert len(neg_distribution) == num_classes
         assert abs(sum(neg_distribution) - 1.0) < 1e-5
-    if not isinstance(act, BaseActivation):
-        raise TypeError()

     ipts_for_layer = []
     parents = []
@@ -5599,7 +5590,7 @@ def nce_layer(input,
         type=LayerType.NCE_LAYER,
         num_classes=num_classes,
         neg_sampling_dist=neg_distribution,
-        active_type=act.name,
+        active_type=SigmoidActivation().name,
         num_neg_samples=num_neg_samples,
         inputs=ipts_for_layer,
         bias=ParamAttr.to_bias(bias_attr),
@@ -5609,12 +5600,7 @@ def nce_layer(input,
         LayerType.NCE_LAYER,
         parents=parents,
         size=l.config.size,
-        activation=act)
-
-
-"""
-following are cost Layers.
-"""
+        activation=SigmoidActivation())


 @wrap_name_default()
@@ -5773,20 +5759,21 @@ def cross_entropy(input,
     :param input: The first input layer.
     :type input: LayerOutput.
     :param label: The input label.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
-    :param coeff: The cost is multiplied with coeff.
-                  The coefficient affects the gradient in the backward.
-    :type coeff: float.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
+    :type coeff: float
     :param weight: The cost of each sample is multiplied with each weight.
                    The weight should be a layer with size=1. Note that gradient
                    will not be calculated for weight.
     :type weight: LayerOutout
-    :param layer_attr: Extra Layer Attribute.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
-    :rtype: LayerOutput.
+    :rtype: LayerOutput
     """

     ipts, parents = __cost_input__(input, label, weight)
@@ -5819,19 +5806,21 @@ def cross_entropy_with_selfnorm(input,
                                        label=label_layer)

     :param input: The first input layer.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param label: The input label.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
-    :param coeff: The coefficient affects the gradient in the backward.
-    :type coeff: float.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
+    :type coeff: float
     :param softmax_selfnorm_alpha: The scale factor affects the cost.
-    :type softmax_selfnorm_alpha: float.
-    :param layer_attr: Extra Layer Attribute.
+    :type softmax_selfnorm_alpha: float
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
-    :rtype: LayerOutput.
+    :rtype: LayerOutput
     """
     Layer(
         name=name,
@@ -5852,7 +5841,7 @@ def cross_entropy_with_selfnorm(input,
 @layer_support()
 def sum_cost(input, name=None, layer_attr=None):
     """
-    A loss layer which calculate the sum of the input as loss
+    A loss layer which calculates the sum of the input as loss.

     The example usage is:
@@ -5861,10 +5850,11 @@ def sum_cost(input, name=None, layer_attr=None):
        cost = sum_cost(input=input_layer)

     :param input: The input of this layer.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
-    :param layer_attr: Extra Layer Attribute.
+    :type name: basestring
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
     :rtype: LayerOutput.
@@ -5904,16 +5894,18 @@ def huber_regression_cost(input,
        cost = huber_regression_cost(input=input_layer, label=label_layer)

     :param input: The first input layer.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param label: The input label.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
+    :type name: basestring
     :param delta: The difference between the observed and predicted values.
-    :type delta: float.
-    :param coeff: The coefficient affects the gradient in the backward.
-    :type coeff: float.
-    :param layer_attr: Extra Layer Attribute.
+    :type delta: float
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
+    :type coeff: float
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
     :rtype: LayerOutput.
@@ -5954,17 +5946,19 @@ def huber_classification_cost(input,
        cost = huber_classification_cost(input=input_layer, label=label_layer)

     :param input: The first input layer.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param label: The input label.
-    :type input: LayerOutput.
+    :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring.
-    :param coeff: The coefficient affects the gradient in the backward.
-    :type coeff: float.
-    :param layer_attr: Extra Layer Attribute.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
+    :type coeff: float
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
-    :rtype: LayerOutput.
+    :rtype: LayerOutput
     """
     assert isinstance(input, LayerOutput)
     if input.size is not None:
@@ -6001,10 +5995,12 @@ def multi_binary_label_cross_entropy(input,
     :param label: The input label.
     :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring
-    :param coeff: The coefficient affects the gradient in the backward.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
     :type coeff: float
-    :param layer_attr: Extra Layer Attribute.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6107,7 +6103,7 @@ def cross_entropy_over_beam(input, name=None):
     :param input: Input beams for this layer.
     :type input: BeamInput
-    :param name: The name of this layer.
+    :param name: The name of this layer. It is optional.
     :type name: basestring
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6142,7 +6138,7 @@ def cross_entropy_over_beam(input, name=None):
 def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
     """
     This is a L1 loss but more smooth. It requires that the
-    size of input and label are equal. The formula is as follows,
+    sizes of input and label are equal. The formula is as follows,

     .. math::
@@ -6154,8 +6150,9 @@ def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
         smooth_{L1}(x) = \\begin{cases} 0.5x^2& \\text{if}  \\ |x| < 1 \\\\ |x|-0.5& \\text{otherwise} \end{cases}

-    More details can be found by referring to `Fast R-CNN
-    <https://arxiv.org/pdf/1504.08083v2.pdf>`_
+    Reference:
+        Fast R-CNN
+        https://arxiv.org/pdf/1504.08083v2.pdf

     The example usage is:
@@ -6169,10 +6166,12 @@ def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
     :param label: The input label.
     :type input: LayerOutput
     :param name: The name of this layer. It is optional.
-    :type name: None | basestring
-    :param coeff: The coefficient affects the gradient in the backward.
+    :type name: basestring
+    :param coeff: The weight of the gradient in the back propagation.
+                  1.0 is the default.
     :type coeff: float
-    :param layer_attr: Extra Layer Attribute.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6194,12 +6193,12 @@ def smooth_l1_cost(input, label, name=None, coeff=1.0, layer_attr=None):
 @wrap_name_default()
 def multiplex_layer(input, name=None, layer_attr=None):
     """
-    This layer multiplex multiple layers according to the index,
-    which is provided by the first input layer.
-    inputs[0]: the index of the layer to output of size batchSize.
+    This layer multiplex multiple layers according to the indexes,
+    which are provided by the first input layer.
+    inputs[0]: the indexes of the layers to form the output of size batchSize.
     inputs[1:N]; the candidate output data.
-    For each index i from 0 to batchSize -1, the output is the i-th row of the
-    (index[i] + 1)-th layer.
+    For each index i from 0 to batchSize - 1, the i-th row of the output is the
+    the same to the i-th row of the (index[i] + 1)-th layer.

     For each i-th row of output:

     .. math::
@@ -6218,7 +6217,8 @@ def multiplex_layer(input, name=None, layer_attr=None):
     :type input: list of LayerOutput
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param layer_attr: extra layer attributes.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute.
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6322,14 +6322,14 @@ def row_conv_layer(input,
     :type context_len: int
     :param act: Activation Type. LinearActivation is the default.
     :type act: BaseActivation
-    :param param_attr: The Parameter Attribute. If None, the parameter will be
-                       initialized smartly. It's better to set it by yourself.
+    :param param_attr: The parameter attribute. See ParameterAttribute for
+                       details.
     :type param_attr: ParameterAttribute
-    :param layer_attr: Extra Layer config.
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute | None
     :return: LayerOutput object.
     :rtype: LayerOutput
     """
     assert isinstance(input, LayerOutput)
     assert context_len > 0, "the context_len must be greatet than 0."
@@ -6354,7 +6354,7 @@ def prelu_layer(input,
                 param_attr=None,
                 layer_attr=None):
     """
-    The Parameter Relu activation that actives outputs with a learnable weight.
+    The Parametric Relu activation that actives outputs with a learnable weight.

     Reference:
         Delving Deep into Rectifiers: Surpassing Human-Level Performance on
@@ -6374,16 +6374,17 @@ def prelu_layer(input,
     :type name: basestring
     :param input: The input of this layer.
     :type input: LayerOutput
-    :param partial_sum: this parameter makes a group of inputs share a same weight.
+    :param partial_sum: this parameter makes a group of inputs share the same weight.

         - partial_sum = 1, indicates the element-wise activation: each element has a weight.
-        - partial_sum = number of elements in one channel, indicates the channel-wise activation, elements in a channel share a same weight.
-        - partial_sum = number of outputs, indicates all elements share a same weight.
+        - partial_sum = number of elements in one channel, indicates the channel-wise activation, elements in a channel share the same weight.
+        - partial_sum = number of outputs, indicates all elements share the same weight.

     :type partial_sum: int
     :param param_attr: The parameter attribute. See ParameterAttribute for details.
-    :type param_attr: ParameterAttribute | None
-    :param layer_attr: Extra layer configurations. Default is None.
+    :type param_attr: ParameterAttribute
+    :param layer_attr: The extra layer attribute. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute | None
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6439,34 +6440,34 @@ def gated_unit_layer(input,
     :param input: The input of this layer.
     :type input: LayerOutput
-    :param size: output size of the gated unit.
+    :param size: The dimension of this layer's output.
     :type size: int
-    :param act: Activation type of the projected input. LinearActivation is the default.
+    :param act: Activation type of the projection. LinearActivation is the default.
     :type act: BaseActivation
     :param name: The name of this layer. It is optional.
     :type name: basestring
-    :param gate_attr: Attributes to tune the gate output, for example, error
-                      clipping threshold, dropout and so on. See ExtraLayerAttribute for
-                      more details.
+    :param gate_attr: The extra layer attribute of the gate. See ExtraLayerAttribute for
+                      details.
     :type gate_attr: ExtraLayerAttribute | None
-    :param gate_param_attr: Attributes to tune the learnable projected matrix
-                            parameter of the gate.
-    :type gate_param_attr: ParameterAttribute | None
-    :param gate_bias_attr: Attributes to tune the learnable bias of the gate.
-    :type gate_bias_attr: ParameterAttribute | None
-    :param inproj_attr: Attributes to the tune the projected input, for
-                        example, error clipping threshold, dropout and so on. See
-                        ExtraLayerAttribute for more details.
+    :param gate_param_attr: The parameter attribute of the gate. See ParameterAttribute
+                            for details.
+    :type gate_param_attr: ParameterAttribute
+    :param gate_bias_attr: The bias attribute of the gate. If the parameter is set to False or
+                           an object whose type is not ParameterAttribute, no bias is defined.
+                           If the parameter is set to True, the bias is initialized to zero.
+    :type gate_bias_attr: ParameterAttribute | bool | None | Any
+    :param inproj_attr: Extra layer attributes of the projection. See ExtraLayerAttribute for
+                        details.
     :type inproj_attr: ExtraLayerAttribute | None
-    :param inproj_param_attr: Attributes to tune the learnable parameter of
-                              the projection of input.
-    :type inproj_param_attr: ParameterAttribute | None
-    :param inproj_bias_attr: Attributes to tune the learnable bias of
-                             projection of the input.
-    :type inproj_bias_attr: ParameterAttribute | None
-    :param layer_attr: Attributes to tune the final output of the gated unit,
-                       for example, error clipping threshold, dropout and so on. See
-                       ExtraLayerAttribute for more details.
+    :param inproj_param_attr: The parameter attribute of the projection. See ParameterAttribute
+                              for details.
+    :type inproj_param_attr: ParameterAttribute
+    :param inproj_bias_attr: The bias attribute of the projection. If the parameter is set to False
+                             or an object whose type is not ParameterAttribute, no bias is defined.
+                             If the parameter is set to True, the bias is initialized to zero.
+    :type inproj_bias_attr: ParameterAttribute | bool | None | Any
+    :param layer_attr: Extra layer attribute of the product. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute | None
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -6662,9 +6663,9 @@ def clip_layer(input, min, max, name=None):
     :param input: The input of this layer.
     :type input: LayerOutput.
     :param min: The lower threshold for clipping.
-    :type min: double
+    :type min: float
     :param max: The upper threshold for clipping.
-    :type max: double
+    :type max: float
     :return: LayerOutput object.
     :rtype: LayerOutput
     """
@@ -6712,7 +6713,6 @@ def seq_slice_layer(input, starts, ends, name=None):
     :type ends: LayerOutput | None
     :return: LayerOutput object.
     :rtype: LayerOutput
-
     """
     assert isinstance(input, LayerOutput), (
@@ -6833,20 +6833,21 @@ def img_conv3d_layer(input,
     :param padding: The numbers of padding along three axises. If the parameter is set to
                     one integer, they will be same.
     :type padding: int | tuple | list
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :param num_channels: The number of input channels. If the parameter is not set or
                          set to None, its actual value will be automatically set to
                          the channels number of the input.
     :type num_channels: int
-    :param param_attr: The parameter attribute of the convolution.
+    :param param_attr: The parameter attribute of the convolution. See ParameterAttribute for
+                       details.
     :type param_attr: ParameterAttribute
     :param shared_biases: Whether biases will be shared between filters or not.
     :type shared_biases: bool
-    :param layer_attr: Extra layer attributes.
+    :param layer_attr: The extra layer attributes. See ExtraLayerAttribute for
+                       details.
     :type layer_attr: ExtraLayerAttribute
     :param trans: True if it is a convTransLayer, False if it is a convLayer
     :type trans: bool
@@ -6953,12 +6954,12 @@ def scale_shift_layer(input, name=None, param_attr=None, bias_attr=None):
     :type name: basestring
     :param input: The input of this layer.
     :type input: LayerOutput
-    :param param_attr: The parameter attribute of scaling.
+    :param param_attr: The parameter attribute of scaling. See ParameterAttribute for
+                       details.
     :type param_attr: ParameterAttribute
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :return: LayerOutput object.
     :rtype: LayerOutput
@@ -7016,10 +7017,9 @@ def sub_seq_layer(input, offsets, sizes, act=None, bias_attr=None, name=None):
     :type sizes: LayerOutput
     :param act: Activation type, LinearActivation is the default.
     :type act: BaseActivation.
-    :param bias_attr: The Bias Attribute. If the parameter is set to
-                      False or something not type of ParameterAttribute,
-                      no bias is defined. If the parameter is set to
-                      True, the bias is initialized to zero.
+    :param bias_attr: The bias attribute. If the parameter is set to False or an object
+                      whose type is not ParameterAttribute, no bias is defined. If the
+                      parameter is set to True, the bias is initialized to zero.
     :type bias_attr: ParameterAttribute | None | bool | Any
     :return: LayerOutput object.
     :rtype: LayerOutput
......