Unverified commit 32a5dfd3, authored by Cao Ying, committed by GitHub

Merge pull request #7791 from lcy-seso/multihead_attention

Add the wrapper for multi-head scaled dot product attention.
...@@ -26,8 +26,8 @@ glu ...@@ -26,8 +26,8 @@ glu
:noindex: :noindex:
dot_product_attention scaled_dot_product_attention
--------------------- ----------------------------
.. autofunction:: paddle.v2.fluid.nets.dot_product_attention .. autofunction:: paddle.v2.fluid.nets.scaled_dot_product_attention
:noindex: :noindex:
...@@ -90,14 +90,10 @@ Reshape Operator. ...@@ -90,14 +90,10 @@ Reshape Operator.
Reshape Input(X) into the shape specified by Attr(shape). Reshape Input(X) into the shape specified by Attr(shape).
An example: An example:
Given a 2-D tensor X with 2 rows and 2 columns Given a 2-D tensor X with 2 rows and 2 columns : [[1, 2], [3, 4]]
[[1, 2], [3, 4]]
and target shape = [1, 4], the reshape operator will transform and target shape = [1, 4], the reshape operator will transform
the tensor X into a 2-D tensor: the tensor X into a 2-D tensor: [[1, 2, 3, 4]]
[[1, 2, 3, 4]]
One dimension in the target shape can be set -1, representing that its One dimension in the target shape can be set -1, representing that its
size is unknown. In this case, the real dimension will be inferred from size is unknown. In this case, the real dimension will be inferred from
......
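The `-1` rule in the reshape docstring above can be illustrated with a small pure-Python sketch. The helper name `infer_target_shape` is hypothetical and is not part of the Paddle API; it only shows the arithmetic of deriving the unknown dimension from the total element count.

```python
def infer_target_shape(num_elements, target_shape):
    """Resolve a single -1 entry in target_shape from the element count.

    Hypothetical helper for illustration; not a Paddle function.
    """
    target = list(target_shape)
    if target.count(-1) > 1:
        raise ValueError("At most one dimension may be -1.")
    if -1 in target:
        known = 1
        for dim in target:
            if dim != -1:
                known *= dim
        if known == 0 or num_elements % known != 0:
            raise ValueError("Cannot infer the -1 dimension.")
        target[target.index(-1)] = num_elements // known
    return target


# X = [[1, 2], [3, 4]] holds 4 elements.
print(infer_target_shape(4, [1, -1]))  # [1, 4]
print(infer_target_shape(4, [-1, 2]))  # [2, 2]
```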
...@@ -110,16 +110,17 @@ def fc(input, ...@@ -110,16 +110,17 @@ def fc(input,
into a 2-dimensional matrix. The parameter into a 2-dimensional matrix. The parameter
`num_flatten_dims` determines how the input tensor `num_flatten_dims` determines how the input tensor
is flattened: the first `num_flatten_dims` is flattened: the first `num_flatten_dims`
dimensions will be flatten to form the first (inclusive, index starts from 1) dimensions will
dimension of the final matrix (height of the be flatten to form the first dimension of the
matrix), and the rest `rank(X) - num_flatten_dims` final matrix (height of the matrix), and the rest
dimensions are flattened to form the second `rank(X) - num_flatten_dims` dimensions are
dimension of the final matrix (width of the matrix). flattened to form the second dimension of the
For example, suppose `X` is a 5-dimensional tensor final matrix (width of the matrix). For example,
with a shape [2, 3, 4, 5, 6], and suppose `X` is a 5-dimensional tensor with a shape
`num_flatten_dims` = 3. Then, the flattened matrix [2, 3, 4, 5, 6], and `num_flatten_dims` = 3. Then,
will have a shape [2 x 3 x 4, 5 x 6] = [24, 30]. the flattened matrix will have a shape
By default, `num_flatten_dims` is set to 1. [2 x 3 x 4, 5 x 6] = [24, 30]. By default,
`num_flatten_dims` is set to 1.
param_attr(ParamAttr|list): The parameter attribute for learnable param_attr(ParamAttr|list): The parameter attribute for learnable
parameters/weights of the fully connected parameters/weights of the fully connected
layer. layer.
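The flattening rule for `num_flatten_dims` can be checked with plain shape arithmetic. This is a minimal sketch on Python lists, not the layer itself; `flattened_shape` and `fc_weight_shape` are hypothetical names, and the weight shape follows the `param_shape` expression shown in the next hunk.

```python
from functools import reduce
from operator import mul


def flattened_shape(input_shape, num_flatten_dims=1):
    """Return [height, width] of the 2-D matrix that fc multiplies."""
    if not 0 < num_flatten_dims < len(input_shape):
        raise ValueError("num_flatten_dims must be in [1, rank - 1].")
    height = reduce(mul, input_shape[:num_flatten_dims], 1)
    width = reduce(mul, input_shape[num_flatten_dims:], 1)
    return [height, width]


def fc_weight_shape(input_shape, size, num_flatten_dims=1):
    """Weight shape: [product of the trailing dims, size]."""
    return [flattened_shape(input_shape, num_flatten_dims)[1], size]


print(flattened_shape([2, 3, 4, 5, 6], num_flatten_dims=3))  # [24, 30]
print(fc_weight_shape([2, 3, 4, 5, 6], size=10, num_flatten_dims=3))  # [30, 10]
```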
...@@ -160,6 +161,7 @@ def fc(input, ...@@ -160,6 +161,7 @@ def fc(input,
param_shape = [ param_shape = [
reduce(lambda a, b: a * b, input_shape[num_flatten_dims:], 1) reduce(lambda a, b: a * b, input_shape[num_flatten_dims:], 1)
] + [size] ] + [size]
w = helper.create_parameter( w = helper.create_parameter(
attr=param_attr, shape=param_shape, dtype=dtype, is_bias=False) attr=param_attr, shape=param_shape, dtype=dtype, is_bias=False)
tmp = helper.create_tmp_variable(dtype) tmp = helper.create_tmp_variable(dtype)
...@@ -530,8 +532,10 @@ def gru_unit(input, ...@@ -530,8 +532,10 @@ def gru_unit(input,
size (integer): The input dimension value. size (integer): The input dimension value.
weight (ParamAttr): The weight parameters for gru unit. Default: None weight (ParamAttr): The weight parameters for gru unit. Default: None
bias (ParamAttr): The bias parameters for gru unit. Default: None bias (ParamAttr): The bias parameters for gru unit. Default: None
activation (string): The activation type for cell (actNode). Default: 'tanh' activation (string): The activation type for cell (actNode).
gate_activation (string): The activation type for gates (actGate). Default: 'sigmoid' Default: 'tanh'
gate_activation (string): The activation type for gates (actGate).
Default: 'sigmoid'
Returns: Returns:
tuple: The hidden value, reset-hidden value and gate values. tuple: The hidden value, reset-hidden value and gate values.
...@@ -670,8 +674,9 @@ def cross_entropy(input, label, **kwargs): ...@@ -670,8 +674,9 @@ def cross_entropy(input, label, **kwargs):
""" """
**Cross Entropy Layer** **Cross Entropy Layer**
This layer computes the cross entropy between `input` and `label`. It supports This layer computes the cross entropy between `input` and `label`. It
both standard cross-entropy and soft-label cross-entropy loss computation. supports both standard cross-entropy and soft-label cross-entropy loss
computation.
1) One-hot cross-entropy: 1) One-hot cross-entropy:
`soft_label = False`, `Label[i, 0]` indicates the class index for sample i: `soft_label = False`, `Label[i, 0]` indicates the class index for sample i:
...@@ -698,23 +703,28 @@ def cross_entropy(input, label, **kwargs): ...@@ -698,23 +703,28 @@ def cross_entropy(input, label, **kwargs):
Args: Args:
input (Variable|list): a 2-D tensor with shape [N x D], where N is the input (Variable|list): a 2-D tensor with shape [N x D], where N is the
batch size and D is the number of classes. This input is a probability batch size and D is the number of classes. This
computed by the previous operator, which is almost always the result input is a probability computed by the previous
of a softmax operator. operator, which is almost always the result of
a softmax operator.
label (Variable|list): the ground truth which is a 2-D tensor. When label (Variable|list): the ground truth which is a 2-D tensor. When
`soft_label` is set to `False`, `label` is a tensor<int64> with shape `soft_label` is set to `False`, `label` is a
[N x 1]. When `soft_label` is set to `True`, `label` is a tensor<int64> with shape [N x 1]. When
`soft_label` is set to `True`, `label` is a
tensor<float/double> with shape [N x D]. tensor<float/double> with shape [N x D].
soft_label (bool, via `**kwargs`): a flag indicating whether to interpret soft_label (bool, via `**kwargs`): a flag indicating whether to
the given labels as soft labels, default `False`. interpret the given labels as soft
labels, default `False`.
Returns: Returns:
A 2-D tensor with shape [N x 1], the cross entropy loss. A 2-D tensor with shape [N x 1], the cross entropy loss.
Raises: Raises:
`ValueError`: 1) the 1st dimension of `input` and `label` are not equal; 2) when \ `ValueError`: 1) the 1st dimension of `input` and `label` are not equal.
`soft_label == True`, and the 2nd dimension of `input` and `label` are not \ 2) when `soft_label == True`, and the 2nd dimension of
equal; 3) when `soft_label == False`, and the 2nd dimension of `label` is not 1. `input` and `label` are not equal.
3) when `soft_label == False`, and the 2nd dimension of
`label` is not 1.
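The two label modes and the three `ValueError` conditions can be sketched in pure Python. The formulas are elided in this diff, so the sketch assumes the standard definitions: `-log(input[i][label[i][0]])` for hard labels and `-sum_j label[i][j] * log(input[i][j])` for soft labels. `cross_entropy_rows` is a hypothetical name.

```python
import math


def cross_entropy_rows(probs, label, soft_label=False):
    """Per-sample cross entropy on nested lists; returns an [N x 1] list."""
    if len(probs) != len(label):
        raise ValueError("1st dimension of input and label differ.")
    losses = []
    for p_row, l_row in zip(probs, label):
        if soft_label:
            if len(l_row) != len(p_row):
                raise ValueError("2nd dimension of input and label differ.")
            # Skip zero targets so that 0 * log(0) never occurs.
            loss = -sum(t * math.log(p) for t, p in zip(l_row, p_row) if t > 0)
        else:
            if len(l_row) != 1:
                raise ValueError("2nd dimension of label must be 1.")
            loss = -math.log(p_row[l_row[0]])
        losses.append([loss])
    return losses


probs = [[0.5, 0.5], [0.25, 0.75]]
hard = cross_entropy_rows(probs, [[0], [1]])
soft = cross_entropy_rows(probs, [[1.0, 0.0], [0.0, 1.0]], soft_label=True)
```

With one-hot soft labels, as in the usage lines above, both modes give the same losses.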
Examples: Examples:
.. code-block:: python .. code-block:: python
...@@ -737,7 +747,9 @@ def square_error_cost(input, label, **kwargs): ...@@ -737,7 +747,9 @@ def square_error_cost(input, label, **kwargs):
""" """
**Square error cost layer** **Square error cost layer**
This layer accepts input predictions and target label and returns the squared error cost. This layer accepts input predictions and target label and returns the
squared error cost.
For predictions, :math:`X`, and target labels, :math:`Y`, the equation is: For predictions, :math:`X`, and target labels, :math:`Y`, the equation is:
.. math:: .. math::
...@@ -755,8 +767,8 @@ def square_error_cost(input, label, **kwargs): ...@@ -755,8 +767,8 @@ def square_error_cost(input, label, **kwargs):
label(Variable): Label tensor, has target labels. label(Variable): Label tensor, has target labels.
Returns: Returns:
Variable: The tensor variable storing the element-wise squared error difference \ Variable: The tensor variable storing the element-wise squared error
of input and label. difference of input and label.
Examples: Examples:
.. code-block:: python .. code-block:: python
...@@ -852,7 +864,8 @@ def chunk_eval(input, ...@@ -852,7 +864,8 @@ def chunk_eval(input,
"chunk_scheme": chunk_scheme, "chunk_scheme": chunk_scheme,
"excluded_chunk_types": excluded_chunk_types or [] "excluded_chunk_types": excluded_chunk_types or []
}) })
return precision, recall, f1_score, num_infer_chunks, num_label_chunks, num_correct_chunks return (precision, recall, f1_score, num_infer_chunks, num_label_chunks,
num_correct_chunks)
def sequence_conv(input, def sequence_conv(input,
...@@ -910,13 +923,14 @@ def conv2d(input, ...@@ -910,13 +923,14 @@ def conv2d(input,
**Convolution2D Layer** **Convolution2D Layer**
The convolution2D layer calculates the output based on the input, filter The convolution2D layer calculates the output based on the input, filter
and strides, paddings, dilations, groups parameters. Input(Input) and Output(Output) and strides, paddings, dilations, groups parameters. Input(Input) and
are in NCHW format. Where N is batch size, C is the number of channels, H is the height Output(Output) are in NCHW format. Where N is batch size, C is the number of
of the feature, and W is the width of the feature. channels, H is the height of the feature, and W is the width of the feature.
The details of convolution layer, please refer UFLDL's `convolution, The details of convolution layer, please refer UFLDL's `convolution,
<http://ufldl.stanford.edu/tutorial/supervised/FeatureExtractionUsingConvolution/>`_ . <http://ufldl.stanford.edu/tutorial/supervised/FeatureExtractionUsingConvolution/>`_ .
If bias attribution and activation type are provided, bias is added to the output of the convolution, If bias attribution and activation type are provided, bias is added to the
and the corresponding activation function is applied to the final result. output of the convolution, and the corresponding activation function is
applied to the final result.
For each input :math:`X`, the equation is: For each input :math:`X`, the equation is:
...@@ -931,7 +945,8 @@ def conv2d(input, ...@@ -931,7 +945,8 @@ def conv2d(input,
* :math:`\\ast`: Convolution operation. * :math:`\\ast`: Convolution operation.
* :math:`b`: Bias value, a 2-D tensor with shape [M, 1]. * :math:`b`: Bias value, a 2-D tensor with shape [M, 1].
* :math:`\\sigma`: Activation function. * :math:`\\sigma`: Activation function.
* :math:`Out`: Output value, the shape of :math:`Out` and :math:`X` may be different. * :math:`Out`: Output value, the shape of :math:`Out` and :math:`X` may be
different.
Example: Example:
...@@ -976,17 +991,20 @@ def conv2d(input, ...@@ -976,17 +991,20 @@ def conv2d(input,
act(str): Activation type. Default: None act(str): Activation type. Default: None
Returns: Returns:
Variable: The tensor variable storing the convolution and \ Variable: The tensor variable storing the convolution and
non-linearity activation result. non-linearity activation result.
Raises: Raises:
ValueError: If the shapes of input, filter_size, stride, padding and groups mismatch. ValueError: If the shapes of input, filter_size, stride, padding and
groups mismatch.
Examples: Examples:
.. code-block:: python .. code-block:: python
data = fluid.layers.data(name='data', shape=[3, 32, 32], dtype='float32') data = fluid.layers.data(
conv2d = fluid.layers.conv2d(input=data, num_filters=2, filter_size=3, act="relu") name='data', shape=[3, 32, 32], dtype='float32')
conv2d = fluid.layers.conv2d(
input=data, num_filters=2, filter_size=3, act="relu")
""" """
if stride is None: if stride is None:
stride = [1, 1] stride = [1, 1]
...@@ -1349,7 +1367,8 @@ def conv2d_transpose(input, ...@@ -1349,7 +1367,8 @@ def conv2d_transpose(input,
H is the height of the feature, and W is the width of the feature. H is the height of the feature, and W is the width of the feature.
Parameters(dilations, strides, paddings) are two elements. These two elements Parameters(dilations, strides, paddings) are two elements. These two elements
represent height and width, respectively. The details of convolution transpose represent height and width, respectively. The details of convolution transpose
layer, please refer to the following explanation and references `therein <http://www.matthewzeiler.com/wp-content/uploads/2017/07/cvpr2010.pdf>`_. layer, please refer to the following explanation and references
`therein <http://www.matthewzeiler.com/wp-content/uploads/2017/07/cvpr2010.pdf>`_.
For each input :math:`X`, the equation is: For each input :math:`X`, the equation is:
...@@ -1362,7 +1381,8 @@ def conv2d_transpose(input, ...@@ -1362,7 +1381,8 @@ def conv2d_transpose(input,
* :math:`X`: Input value, a tensor with NCHW format. * :math:`X`: Input value, a tensor with NCHW format.
* :math:`W`: Filter value, a tensor with MCHW format. * :math:`W`: Filter value, a tensor with MCHW format.
* :math:`\\ast` : Convolution transpose operation. * :math:`\\ast` : Convolution transpose operation.
* :math:`Out`: Output value, the shape of :math:`Out` and :math:`X` may be different. * :math:`Out`: Output value, the shape of :math:`Out` and :math:`X` may be
different.
Example: Example:
...@@ -1403,7 +1423,8 @@ def conv2d_transpose(input, ...@@ -1403,7 +1423,8 @@ def conv2d_transpose(input,
dilation(int|tuple): The dilation size. If dilation is a tuple, it must dilation(int|tuple): The dilation size. If dilation is a tuple, it must
contain two integers, (dilation_H, dilation_W). Otherwise, the contain two integers, (dilation_H, dilation_W). Otherwise, the
dilation_H = dilation_W = dilation. Default: dilation = 1. dilation_H = dilation_W = dilation. Default: dilation = 1.
param_attr(ParamAttr): The parameters to the Conv2d_transpose Layer. Default: None param_attr(ParamAttr): The parameters to the Conv2d_transpose Layer.
Default: None
use_cudnn(bool): Use cudnn kernel or not, it is valid only when the cudnn use_cudnn(bool): Use cudnn kernel or not, it is valid only when the cudnn
library is installed. Default: True library is installed. Default: True
name(str|None): A name for this layer(optional). If set None, the layer name(str|None): A name for this layer(optional). If set None, the layer
...@@ -1413,13 +1434,16 @@ def conv2d_transpose(input, ...@@ -1413,13 +1434,16 @@ def conv2d_transpose(input,
Variable: The tensor variable storing the convolution transpose result. Variable: The tensor variable storing the convolution transpose result.
Raises: Raises:
ValueError: If the shapes of input, filter_size, stride, padding and groups mismatch. ValueError: If the shapes of input, filter_size, stride, padding and
groups mismatch.
Examples: Examples:
.. code-block:: python .. code-block:: python
data = fluid.layers.data(name='data', shape=[3, 32, 32], dtype='float32') data = fluid.layers.data(
conv2d_transpose = fluid.layers.conv2d_transpose(input=data, num_filters=2, filter_size=3) name='data', shape=[3, 32, 32], dtype='float32')
conv2d_transpose = fluid.layers.conv2d_transpose(
input=data, num_filters=2, filter_size=3)
""" """
helper = LayerHelper("conv2d_transpose", **locals()) helper = LayerHelper("conv2d_transpose", **locals())
if not isinstance(input, Variable): if not isinstance(input, Variable):
...@@ -1643,9 +1667,9 @@ def lstm_unit(x_t, ...@@ -1643,9 +1667,9 @@ def lstm_unit(x_t,
tuple: The hidden value and cell value of lstm unit. tuple: The hidden value and cell value of lstm unit.
Raises: Raises:
ValueError: The ranks of **x_t**, **hidden_t_prev** and **cell_t_prev**\ ValueError: The ranks of **x_t**, **hidden_t_prev** and **cell_t_prev**
not be 2 or the 1st dimensions of **x_t**, **hidden_t_prev** \ not be 2 or the 1st dimensions of **x_t**, **hidden_t_prev**
and **cell_t_prev** not be the same or the 2nd dimensions of \ and **cell_t_prev** not be the same or the 2nd dimensions of
**hidden_t_prev** and **cell_t_prev** not be the same. **hidden_t_prev** and **cell_t_prev** not be the same.
Examples: Examples:
...@@ -1978,7 +2002,7 @@ def l2_normalize(x, axis, epsilon=1e-12, name=None): ...@@ -1978,7 +2002,7 @@ def l2_normalize(x, axis, epsilon=1e-12, name=None):
data = fluid.layers.data(name="data", data = fluid.layers.data(name="data",
shape=(3, 17, 13), shape=(3, 17, 13),
dtype="float32") dtype="float32")
fc = fluid.layers.l2_normalize(x=data, axis=1) normed = fluid.layers.l2_normalize(x=data, axis=1)
""" """
if len(x.shape) == 1: axis = 0 if len(x.shape) == 1: axis = 0
...@@ -2030,9 +2054,10 @@ def l2_normalize(x, axis, epsilon=1e-12, name=None): ...@@ -2030,9 +2054,10 @@ def l2_normalize(x, axis, epsilon=1e-12, name=None):
def matmul(x, y, transpose_x=False, transpose_y=False, name=None): def matmul(x, y, transpose_x=False, transpose_y=False, name=None):
""" """
Applies matrix multiplication to two tensors. Currently, the input Applies matrix multiplication to two tensors.
tensors' rank can be any, but when the rank of anyone inputs is
bigger than 3, this two inputs' rank should be equal. Currently, the input tensors' rank can be any, but when the rank of any
inputs is bigger than 3, this two inputs' rank should be equal.
The actual behavior depends on the shapes of :math:`x`, :math:`y` and the The actual behavior depends on the shapes of :math:`x`, :math:`y` and the
flag values of :attr:`transpose_x`, :attr:`transpose_y`. Specifically: flag values of :attr:`transpose_x`, :attr:`transpose_y`. Specifically:
...@@ -2073,25 +2098,56 @@ def matmul(x, y, transpose_x=False, transpose_y=False, name=None): ...@@ -2073,25 +2098,56 @@ def matmul(x, y, transpose_x=False, transpose_y=False, name=None):
# Examples to clarify shapes of the inputs and output # Examples to clarify shapes of the inputs and output
# x: [B, ..., M, K], y: [B, ..., K, N] # x: [B, ..., M, K], y: [B, ..., K, N]
fluid.layers.matmul(x, y) # out: [B, ..., M, N] fluid.layers.matmul(x, y) # out: [B, ..., M, N]
# x: [B, M, K], y: [B, K, N] # x: [B, M, K], y: [B, K, N]
fluid.layers.matmul(x, y) # out: [B, M, N] fluid.layers.matmul(x, y) # out: [B, M, N]
# x: [B, M, K], y: [K, N] # x: [B, M, K], y: [K, N]
fluid.layers.matmul(x, y) # out: [B, M, N] fluid.layers.matmul(x, y) # out: [B, M, N]
# x: [B, M, K], y: [K]
fluid.layers.matmul(x, y) # out: [B, M]
# x: [M, K], y: [K, N] # x: [M, K], y: [K, N]
fluid.layers.matmul(x, y) # out: [M, N] fluid.layers.matmul(x, y) # out: [M, N]
# x: [B, M, K], y: [K]
fluid.layers.matmul(x, y) # out: [B, M]
# x: [K], y: [K] # x: [K], y: [K]
fluid.layers.matmul(x, y) # out: [1] fluid.layers.matmul(x, y) # out: [1]
# x: [M], y: [N]
# x: [M], y: [N]
fluid.layers.matmul(x, y, True, True) # out: [M, N] fluid.layers.matmul(x, y, True, True) # out: [M, N]
""" """
def __check_input(x, y):
if len(y.shape) > len(x.shape):
raise ValueError(
"Invalid inputs for matmul. "
"x's rank should be always greater than or equal to y's rank.")
x_shape = list(x.shape)
y_shape = list(y.shape)
if len(x_shape) == 1:
x_shape = [1] + x_shape
if len(y_shape) == 1:
y_shape = y_shape + [1]
# check the inner 2 dimensions
if transpose_x:
x_shape[-2], x_shape[-1] = x_shape[-1], x_shape[-2]
if transpose_y:
y_shape[-2], y_shape[-1] = y_shape[-1], y_shape[-2]
if x_shape[-1] != y_shape[-2]:
raise ValueError("Invalid inputs for matmul.")
if len(y_shape) > 2:
for i, dim_x in enumerate(x_shape[:-2]):
if dim_x != y_shape[i]:
raise ValueError("Invalid inputs for matmul.")
__check_input(x, y)
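The validation in `__check_input` can be exercised on plain shape lists, independent of Paddle. The sketch below mirrors the logic in the diff; the function names are hypothetical, and `matmul_shapes_ok` is only a convenience wrapper that turns the exception into a boolean.

```python
def check_matmul_shapes(x_shape, y_shape, transpose_x=False, transpose_y=False):
    """Raise ValueError if the shapes cannot be multiplied (mirrors the diff)."""
    if len(y_shape) > len(x_shape):
        raise ValueError("x's rank should be greater than or equal to y's rank.")
    x_shape, y_shape = list(x_shape), list(y_shape)
    # 1-D operands are promoted: x becomes a row, y becomes a column.
    if len(x_shape) == 1:
        x_shape = [1] + x_shape
    if len(y_shape) == 1:
        y_shape = y_shape + [1]
    # Apply the transposes to the inner two dimensions.
    if transpose_x:
        x_shape[-2], x_shape[-1] = x_shape[-1], x_shape[-2]
    if transpose_y:
        y_shape[-2], y_shape[-1] = y_shape[-1], y_shape[-2]
    if x_shape[-1] != y_shape[-2]:
        raise ValueError("Inner dimensions do not match.")
    # Batch dimensions must agree when y carries them.
    if len(y_shape) > 2:
        for i, dim_x in enumerate(x_shape[:-2]):
            if dim_x != y_shape[i]:
                raise ValueError("Batch dimensions do not match.")


def matmul_shapes_ok(*args, **kwargs):
    """Return True when check_matmul_shapes accepts the inputs."""
    try:
        check_matmul_shapes(*args, **kwargs)
    except ValueError:
        return False
    return True
```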
helper = LayerHelper('matmul', **locals()) helper = LayerHelper('matmul', **locals())
assert max(len(x.shape), len(y.shape)) <= 3 or len(x.shape) == len( out = helper.create_tmp_variable(dtype=x.dtype)
y.
shape), 'Inputs\' rank should be equal or their rank should be less 4.'
out = helper.create_tmp_variable(dtype=helper.input_dtype())
helper.append_op( helper.append_op(
type='matmul', type='matmul',
inputs={'X': x, inputs={'X': x,
...@@ -2108,13 +2164,26 @@ def edit_distance(input, ...@@ -2108,13 +2164,26 @@ def edit_distance(input,
ignored_tokens=None, ignored_tokens=None,
name=None): name=None):
""" """
EditDistance operator computes the edit distances between a batch of hypothesis strings and their references. Edit distance, also called Levenshtein distance, measures how dissimilar two strings are by counting the minimum number of operations to transform one string into another. Here the operations include insertion, deletion, and substitution. For example, given hypothesis string A = "kitten" and reference B = "sitting", the edit distance is 3 for A will be transformed into B at least after two substitutions and one insertion: EditDistance operator computes the edit distances between a batch of
hypothesis strings and their references. Edit distance, also called
Levenshtein distance, measures how dissimilar two strings are by counting
the minimum number of operations to transform one string into another.
Here the operations include insertion, deletion, and substitution.
For example, given hypothesis string A = "kitten" and reference
B = "sitting", the edit distance is 3 for A will be transformed into B
at least after two substitutions and one insertion:
"kitten" -> "sitten" -> "sittin" -> "sitting" "kitten" -> "sitten" -> "sittin" -> "sitting"
Input(Hyps) is a LoDTensor consisting of all the hypothesis strings with the total number denoted by `batch_size`, and the separation is specified by the LoD information. And the `batch_size` reference strings are arranged in order in the same way in the LoDTensor Input(Refs). Input(Hyps) is a LoDTensor consisting of all the hypothesis strings with
the total number denoted by `batch_size`, and the separation is specified
by the LoD information. And the `batch_size` reference strings are arranged
in order in the same way in the LoDTensor Input(Refs).
Output(Out) contains the `batch_size` results and each stands for the edit stance for a pair of strings respectively. If Attr(normalized) is true, the edit distance will be divided by the length of reference string. Output(Out) contains the `batch_size` results and each stands for the edit
distance for a pair of strings respectively. If Attr(normalized) is true,
the edit distance will be divided by the length of reference string.
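The distance described above can be computed with the classic dynamic-programming recurrence. This is a minimal single-pair sketch in pure Python, not the operator's kernel; `levenshtein` is a hypothetical name, and the batch/LoD handling is left out.

```python
def levenshtein(hyp, ref, normalized=False, ignored_tokens=None):
    """Edit distance between two sequences (strings or lists of token ids)."""
    ignored = set(ignored_tokens or [])
    hyp = [t for t in hyp if t not in ignored]
    ref = [t for t in ref if t not in ignored]
    # prev[j] is the distance between the processed prefix of hyp and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (h != r)))  # substitution or match
        prev = cur
    dist = prev[-1]
    if normalized:
        if not ref:
            raise ValueError("Cannot normalize by an empty reference.")
        return float(dist) / len(ref)
    return dist


print(levenshtein("kitten", "sitting"))  # 3
```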
Args: Args:
...@@ -2122,9 +2191,11 @@ def edit_distance(input, ...@@ -2122,9 +2191,11 @@ def edit_distance(input,
label(Variable): The indices for reference strings. label(Variable): The indices for reference strings.
normalized(bool): Indicated whether to normalize the edit distance by the length of reference string. normalized(bool): Indicated whether to normalize the edit distance by
the length of reference string.
ignored_tokens(list of int): Tokens that should be removed before calculating edit distance. ignored_tokens(list of int): Tokens that should be removed before
calculating edit distance.
Returns: Returns:
Variable: sequence-to-sequence edit distance in shape [batch_size, 1]. Variable: sequence-to-sequence edit distance in shape [batch_size, 1].
...@@ -2175,8 +2246,10 @@ def edit_distance(input, ...@@ -2175,8 +2246,10 @@ def edit_distance(input,
def ctc_greedy_decoder(input, blank, name=None): def ctc_greedy_decoder(input, blank, name=None):
""" """
This op is used to decode sequences by greedy policy by below steps: This op is used to decode sequences by greedy policy by below steps:
1. Get the indexes of max value for each row in input. a.k.a. numpy.argmax(input, axis=1). 1. Get the indexes of max value for each row in input. a.k.a.
2. For each sequence in result of step1, merge repeated tokens between two blanks and delete all blanks. numpy.argmax(input, axis=1).
2. For each sequence in result of step1, merge repeated tokens between two
blanks and delete all blanks.
A simple example as below: A simple example as below:
...@@ -2206,9 +2279,16 @@ def ctc_greedy_decoder(input, blank, name=None): ...@@ -2206,9 +2279,16 @@ def ctc_greedy_decoder(input, blank, name=None):
Args: Args:
input(Variable): (LoDTensor<float>), the probabilities of variable-length sequences, which is a 2-D Tensor with LoD information. It's shape is [Lp, num_classes + 1], where Lp is the sum of all input sequences' length and num_classes is the true number of classes. (not including the blank label). input(Variable): (LoDTensor<float>), the probabilities of
variable-length sequences, which is a 2-D Tensor with
LoD information. It's shape is [Lp, num_classes + 1],
where Lp is the sum of all input sequences' length and
num_classes is the true number of classes. (not
including the blank label).
blank(int): the blank label index of Connectionist Temporal Classification (CTC) loss, which is in the half-open interval [0, num_classes + 1). blank(int): the blank label index of Connectionist Temporal
Classification (CTC) loss, which is in the half-open
interval [0, num_classes + 1).
Returns: Returns:
Variable: CTC greedy decode result. Variable: CTC greedy decode result.
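The two decoding steps in the docstring can be sketched for a single sequence in pure Python. It assumes each row of the input holds the class probabilities of one time step; `ctc_greedy_decode` is a hypothetical name, and splitting a batch by LoD information is not modelled.

```python
def ctc_greedy_decode(probs, blank):
    """Greedy CTC decoding of one sequence given as a list of probability rows."""
    # Step 1: index of the max value in each row (one row per time step).
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    # Step 2: merge repeats that are not separated by a blank, then drop blanks.
    decoded = []
    prev = None
    for token in best:
        if token != prev and token != blank:
            decoded.append(token)
        prev = token
    return decoded


probs = [
    [0.1, 0.6, 0.3],  # argmax 1
    [0.1, 0.7, 0.2],  # argmax 1 (repeat, merged)
    [0.8, 0.1, 0.1],  # argmax 0 (blank)
    [0.2, 0.5, 0.3],  # argmax 1 (kept: a blank separates it from the last 1)
    [0.1, 0.2, 0.7],  # argmax 2
    [0.6, 0.2, 0.2],  # argmax 0 (blank)
]
print(ctc_greedy_decode(probs, blank=0))  # [1, 1, 2]
```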
...@@ -2276,8 +2356,10 @@ def warpctc(input, label, blank=0, norm_by_times=False, **kwargs): ...@@ -2276,8 +2356,10 @@ def warpctc(input, label, blank=0, norm_by_times=False, **kwargs):
Examples: Examples:
.. code-block:: python .. code-block:: python
y = layers.data(name='y', shape=[11, 8], dtype='float32', lod_level=1) y = layers.data(
y_predict = layers.data(name='y_predict', shape=[11, 1], dtype='float32') name='y', shape=[11, 8], dtype='float32', lod_level=1)
y_predict = layers.data(
name='y_predict', shape=[11, 1], dtype='float32')
cost = layers.warpctc(input=y_predict, label=y) cost = layers.warpctc(input=y_predict, label=y)
""" """
...@@ -2431,6 +2513,12 @@ def transpose(x, perm, name=None): ...@@ -2431,6 +2513,12 @@ def transpose(x, perm, name=None):
raise ValueError( raise ValueError(
"Input(perm) is the permutation of dimensions of Input(input). " "Input(perm) is the permutation of dimensions of Input(input). "
"Its length should be equal to Input(input)'s rank.") "Its length should be equal to Input(input)'s rank.")
for idx, dim in enumerate(perm):
if dim >= len(x.shape):
raise ValueError(
"Each element in perm should be less than x's rank. "
"%d-th element in perm is %d which accesses x's rank %d." %
(idx, perm[idx], len(x.shape)))
helper = LayerHelper('transpose', **locals()) helper = LayerHelper('transpose', **locals())
out = helper.create_tmp_variable(x.dtype) out = helper.create_tmp_variable(x.dtype)
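The two `perm` checks in this hunk (length equals rank, every entry below rank) can be tried on plain lists. The sketch mirrors only those two checks; `transposed_shape` and `perm_ok` are hypothetical names.

```python
def transposed_shape(x_shape, perm):
    """Validate perm as in the diff and return the permuted shape."""
    rank = len(x_shape)
    if len(perm) != rank:
        raise ValueError("perm's length should be equal to x's rank.")
    for idx, dim in enumerate(perm):
        if dim >= rank:
            raise ValueError(
                "Each element in perm should be less than x's rank. "
                "%d-th element in perm is %d, but x's rank is %d." %
                (idx, dim, rank))
    return [x_shape[dim] for dim in perm]


def perm_ok(x_shape, perm):
    """Return True when transposed_shape accepts the permutation."""
    try:
        transposed_shape(x_shape, perm)
    except ValueError:
        return False
    return True


print(transposed_shape([2, 3, 4, 5], [0, 2, 1, 3]))  # [2, 4, 3, 5]
```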
...@@ -2539,7 +2627,8 @@ def im2sequence(input, filter_size=1, stride=1, padding=0, name=None): ...@@ -2539,7 +2627,8 @@ def im2sequence(input, filter_size=1, stride=1, padding=0, name=None):
.. code-block:: python .. code-block:: python
output = fluid.layers.im2sequence(input=layer, stride=[1, 1], filter_size=[2, 2]) output = fluid.layers.im2sequence(
input=layer, stride=[1, 1], filter_size=[2, 2])
""" """
......
...@@ -11,14 +11,13 @@ ...@@ -11,14 +11,13 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import layers import layers
__all__ = [ __all__ = [
"simple_img_conv_pool", "simple_img_conv_pool",
"sequence_conv_pool", "sequence_conv_pool",
"glu", "glu",
"dot_product_attention", "scaled_dot_product_attention",
] ]
...@@ -160,7 +159,11 @@ def glu(input, dim=-1): ...@@ -160,7 +159,11 @@ def glu(input, dim=-1):
return out return out
def dot_product_attention(querys, keys, values): def scaled_dot_product_attention(queries,
keys,
values,
num_heads=1,
dropout_rate=0.):
""" """
The dot-product attention. The dot-product attention.
...@@ -174,39 +177,162 @@ def dot_product_attention(querys, keys, values): ...@@ -174,39 +177,162 @@ def dot_product_attention(querys, keys, values):
.. math:: .. math::
Attention(Q, K, V)= softmax(QK^\mathrm{T})V Attention(Q, K, V)= softmax(QK^\mathrm{T})V
Refer to `Attention Is All You Need Refer to `Attention Is All You Need
<https://arxiv.org/pdf/1706.03762.pdf>`_. <https://arxiv.org/pdf/1706.03762.pdf>`_.
Note that batch data containing sequences with different lengths is not
supported by this because of the (batch) matrix multiplication.
Args: Args:
query (Variable): The input variable which is a Tensor or LoDTensor.
key (Variable): The input variable which is a Tensor or LoDTensor. queries (Variable): The input variable which should be a 3-D Tensor.
value (Variable): The input variable which is a Tensor or LoDTensor. keys (Variable): The input variable which should be a 3-D Tensor.
values (Variable): The input variable which should be a 3-D Tensor.
num_heads (int): Head number to compute the scaled dot product
attention. Default value is 1.
dropout_rate (float): The dropout rate to drop the attention weight.
Default value is 0.
Returns: Returns:
tuple: The Tensor variables representing the output and attention scores.
Variable: A 3-D Tensor computed by multi-head scaled dot product
attention.
Raises:
ValueError: If input queries, keys, values are not 3-D Tensors.
NOTE:
1. When num_heads > 1, three linear projections are learned respectively
to map input queries, keys and values into queries', keys' and values'.
queries', keys' and values' have the same shapes with queries, keys
and values.
2. When num_heads == 1, scaled_dot_product_attention has no learnable
parameters.
Examples: Examples:
.. code-block:: python .. code-block:: python
# Suppose q, k, v are tensor variables with the following shape: # Suppose q, k, v are Tensors with the following shape:
# q: [3, 5, 9], k: [3, 6, 9], v: [3, 6, 10] # q: [3, 5, 9], k: [3, 6, 9], v: [3, 6, 10]
out, attn_scores = fluid.nets.dot_product_attention(q, k, v)
out.shape # [3, 5, 10] contexts = fluid.nets.scaled_dot_product_attention(q, k, v)
attn_scores.shape # [3, 5, 6] contexts.shape # [3, 5, 10]
"""
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs queries, keys and values should all be 3-D tensors.")
if queries.shape[-1] != keys.shape[-1]:
raise ValueError(
"The hidden size of queries and keys should be the same.")
if keys.shape[-2] != values.shape[-2]:
raise ValueError(
"The max sequence length in key batch and in value batch "
"should be the same.")
if keys.shape[-1] % num_heads != 0:
raise ValueError("The hidden size of keys (%d) must be divisible "
"by the number of attention heads (%d)." %
(keys.shape[-1], num_heads))
if values.shape[-1] % num_heads != 0:
raise ValueError("The hidden size of values (%d) must be divisible "
"by the number of attention heads (%d)." %
(values.shape[-1], num_heads))
def __compute_qkv(queries, keys, values, num_heads):
"""
Add linear projection to queries, keys, and values.
Args:
queries(Tensor): a 3-D input Tensor.
keys(Tensor): a 3-D input Tensor.
values(Tensor): a 3-D input Tensor.
num_heads(int): The number of heads. Linearly project the inputs
ONLY when num_heads > 1.
Returns:
Tensor: linearly projected output Tensors: queries', keys' and
values'. They have the same shapes with queries, keys and
values.
"""
if num_heads == 1:
return queries, keys, values
q = layers.fc(input=queries, size=queries.shape[-1], num_flatten_dims=2)
k = layers.fc(input=keys, size=keys.shape[-1], num_flatten_dims=2)
v = layers.fc(input=values, size=values.shape[-1], num_flatten_dims=2)
return q, k, v
def __split_heads(x, num_heads):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions.
Args:
x(Tensor): a 3-D input Tensor.
num_heads(int): The number of heads.
Returns:
Tensor: a Tensor with shape [..., n, m/num_heads], where m is size
of the last dimension of x.
"""
if num_heads == 1:
return x
hidden_size = x.shape[-1]
# reshape the 3-D input: [batch_size, max_sequence_length, hidden_dim]
# into a 4-D output:
# [batch_size, max_sequence_length, num_heads, hidden_size_per_head].
reshaped = layers.reshape(
x=x,
shape=list(x.shape[:-1]) + [num_heads, hidden_size // num_heads])
# permute the dimensions into:
# [batch_size, num_heads, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])

    def __combine_heads(x):
        """
        Reshape the last two dimensions of the input tensor x so that they
        become one dimension.

        Args:
            x(Tensor): a 4-D input Tensor with shape
                [bs, num_heads, max_sequence_length, hidden_dim].

        Returns:
            Tensor: a Tensor with shape
                [bs, max_sequence_length, num_heads * hidden_dim].
        """
        # With a single head the input was never split, so it is still 3-D.
        if len(x.shape) == 3: return x
        if len(x.shape) != 4:
            raise ValueError("Input(x) should be a 4-D Tensor.")

        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
        # list(...) keeps the shape a real list even where map() is lazy.
        return layers.reshape(
            x=trans_x,
            shape=list(
                map(int, [
                    trans_x.shape[0], trans_x.shape[1],
                    trans_x.shape[2] * trans_x.shape[3]
                ])))

    q, k, v = __compute_qkv(queries, keys, values, num_heads)

    q = __split_heads(q, num_heads)
    k = __split_heads(k, num_heads)
    v = __split_heads(v, num_heads)

    key_dim_per_head = keys.shape[-1] // num_heads
    scaled_q = layers.scale(x=q, scale=key_dim_per_head**-0.5)
    # Scores are Q * K^T: rows index query positions, columns key positions.
    product = layers.matmul(x=scaled_q, y=k, transpose_y=True)

    weights = layers.reshape(
        x=layers.reshape(
            x=product, shape=[-1, product.shape[-1]], act="softmax"),
        shape=product.shape)
    if dropout_rate:
        # Dropout is applied to the attention weights.
        weights = layers.dropout(
            weights, dropout_prob=dropout_rate, is_test=False)
    ctx_multiheads = layers.matmul(weights, v)
    return __combine_heads(ctx_multiheads)
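
The head split and merge performed by `__split_heads` and `__combine_heads` can be illustrated outside Paddle. The following is a minimal NumPy sketch, not part of this PR; the names `split_heads` and `combine_heads` are hypothetical stand-ins, and the shapes are taken from the unit test in this change.

```python
import numpy as np


def split_heads(x, num_heads):
    """[batch, seq_len, hidden] -> [batch, num_heads, seq_len, hidden // num_heads]."""
    batch, seq_len, hidden = x.shape
    if hidden % num_heads != 0:
        raise ValueError("hidden size must be divisible by num_heads")
    # Split the hidden dimension, then move the head axis ahead of the sequence axis.
    reshaped = x.reshape(batch, seq_len, num_heads, hidden // num_heads)
    return reshaped.transpose(0, 2, 1, 3)


def combine_heads(x):
    """[batch, num_heads, seq_len, head_dim] -> [batch, seq_len, num_heads * head_dim]."""
    batch, num_heads, seq_len, head_dim = x.shape
    # Undo the transpose first, then merge the last two axes.
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * head_dim)


x = np.arange(3 * 13 * 16, dtype="float32").reshape(3, 13, 16)
heads = split_heads(x, num_heads=8)
restored = combine_heads(heads)
```

Under these assumptions `combine_heads` is the exact inverse of `split_heads`, which is why the wrapper can merge the per-head context vectors back into the original hidden size.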
File mode changed from 100755 to 100644
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import paddle.v2.fluid as fluid
import paddle.v2.fluid.core as core
import numpy as np


class TestMultiheadAttention(unittest.TestCase):
    def gen_random_input(self):
        """Generate random input data.
        """
        # batch_size, max_sequence_length, hidden dimension
        self.input_shape = (3, 13, 16)
        self.queries = np.random.random(size=self.input_shape).astype("float32")
        self.keys = np.random.random(size=self.input_shape).astype("float32")

    def set_program(self):
        """Build the test program.
        """
        queries = fluid.layers.data(
            name="queries",
            shape=self.input_shape,
            dtype="float32",
            append_batch_size=False)
        queries.stop_gradient = False
        keys = fluid.layers.data(
            name="keys",
            shape=self.input_shape,
            dtype="float32",
            append_batch_size=False)
        keys.stop_gradient = False

        contexts = fluid.nets.scaled_dot_product_attention(
            queries=queries,
            keys=keys,
            values=keys,
            num_heads=8,
            dropout_rate=0.)
        out = fluid.layers.reduce_sum(contexts, dim=None)
        fluid.backward.append_backward(loss=out)

        self.fetch_list = [contexts]

    def run_program(self):
        """Run the test program.
        """
        places = [core.CPUPlace()]
        if core.is_compile_gpu():
            places.append(core.CUDAPlace(0))

        for place in places:
            self.set_inputs(place)
            exe = fluid.Executor(place)

            exe.run(fluid.default_startup_program())
            output = exe.run(fluid.default_main_program(),
                             feed=self.inputs,
                             fetch_list=self.fetch_list,
                             return_numpy=True)
            self.op_output = output

    def set_inputs(self, place):
        """Set the randomly generated data to the test program.
        """
        self.inputs = {}
        queries = fluid.Tensor()
        queries.set(self.queries, place)

        keys = fluid.Tensor()
        keys.set(self.keys, place)

        self.inputs["keys"] = keys
        self.inputs["queries"] = queries

    def test_multihead_attention(self):
        self.gen_random_input()

        self.set_program()
        self.run_program()

        # FIXME(caoying): add more meaningful unit tests.


if __name__ == '__main__':
    unittest.main()
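
One way to address the FIXME in `test_multihead_attention` would be to compare the fetched output against a NumPy reference. The sketch below is an assumption and not part of this PR: `reference_attention` and `softmax` are hypothetical helpers, and the reference omits the linear projections and dropout. It could therefore only match the Paddle output directly when `num_heads=1` and `dropout_rate=0.`, since for more heads the wrapper applies `fc` layers with randomly initialized weights.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def reference_attention(queries, keys, values, num_heads):
    """Multi-head scaled dot-product attention, without projections or dropout."""

    def split(x):
        # [batch, seq_len, hidden] -> [batch, num_heads, seq_len, head_dim]
        b, t, h = x.shape
        return x.reshape(b, t, num_heads, h // num_heads).transpose(0, 2, 1, 3)

    q, k, v = split(queries), split(keys), split(values)
    head_dim = keys.shape[-1] // num_heads
    # Scores have shape [batch, num_heads, query_len, key_len].
    scores = np.matmul(q * head_dim ** -0.5, k.transpose(0, 1, 3, 2))
    weights = softmax(scores)
    ctx = np.matmul(weights, v)
    b, n, t, d = ctx.shape
    # Merge heads back: [batch, query_len, num_heads * head_dim].
    return ctx.transpose(0, 2, 1, 3).reshape(b, t, n * d)


rng = np.random.RandomState(0)
queries = rng.random_sample((3, 13, 16)).astype("float32")
values = rng.random_sample((3, 13, 16)).astype("float32")
# All-zero keys give uniform attention weights, so every output position
# is the mean of the values over the sequence axis.
keys = np.zeros((3, 13, 16), dtype="float32")
out = reference_attention(queries, keys, values, num_heads=8)
```

The all-zero keys case is a convenient sanity check because the expected result can be computed by hand, independently of the attention code.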