diff --git a/doc/design/ops/images/2_level_rnn.dot b/doc/design/ops/images/2_level_rnn.dot index a498e882a3d85a33d44dbad7474fa2a340e33976..5d77865061ca7bbbfcf254dd938f09aef5553505 100644 --- a/doc/design/ops/images/2_level_rnn.dot +++ b/doc/design/ops/images/2_level_rnn.dot @@ -1,6 +1,6 @@ digraph G { - rnn [label="1-th level RNN" shape=box] + rnn [label="1st level RNN" shape=box] subgraph cluster0 { label = "time step 0" @@ -8,7 +8,7 @@ digraph G { sent0 [label="sentence"] sent1 [label="sentence"] - rnn1 [label="2-th level RNN" shape=box] + rnn1 [label="2nd level RNN" shape=box] sent0 -> rnn1 sent1 -> rnn1 @@ -20,7 +20,7 @@ digraph G { sent2 [label="sentence"] sent3 [label="sentence"] - rnn2 [label="2-th level RNN" shape=box] + rnn2 [label="2nd level RNN" shape=box] sent2 -> rnn2 sent3 -> rnn2 @@ -32,7 +32,7 @@ digraph G { sent4 [label="sentence"] sent5 [label="sentence"] - rnn3 [label="2-th level RNN" shape=box] + rnn3 [label="2nd level RNN" shape=box] sent4 -> rnn3 sent5 -> rnn3 diff --git a/doc/design/ops/rnn.md b/doc/design/ops/rnn.md index a78eea7d45e9e9553d153170aa31da55ec6e8289..2f4854793fa1f0b02e4dc17b51a48a972be61c06 100644 --- a/doc/design/ops/rnn.md +++ b/doc/design/ops/rnn.md @@ -1,62 +1,62 @@ # RNNOp design -This document is about an RNN operator which requires that instances in a mini-batch have the same length. We will have a more flexible RNN operator. +This document describes the RNN (Recurrent Neural Network) operator and how it is implemented in PaddlePaddle. The RNN op requires that all instances in a mini-batch have the same length. We will have a more flexible dynamic RNN operator in the future. ## RNN Algorithm Implementation -
+
The above diagram shows an RNN unrolled into a full network. -There are several important concepts: +There are several important concepts here: -- *step-net*: the sub-graph to run at each step, -- *memory*, $h_t$, the state of the current step, -- *ex-memory*, $h_{t-1}$, the state of the previous step, -- *initial memory value*, the ex-memory of the first step. +- *step-net*: the sub-graph that runs at each step. +- *memory*, $h_t$, the state of the current step. +- *ex-memory*, $h_{t-1}$, the state of the previous step. +- *initial memory value*, the memory of the first (initial) step. ### Step-scope -There could be local variables defined in step-nets. PaddlePaddle runtime realizes these variables in *step-scopes* -- scopes created for each step. +There could be local variables defined in each step-net. PaddlePaddle runtime realizes these variables in *step-scopes* which are created for each step. -
+
-Figure 2 the RNN's data flow
+Figure 2 illustrates the RNN's data flow
+
@@ -110,7 +110,7 @@ a = some_op() # chapter_data is a set of 128-dim word vectors # the first level of LoD is sentence -# the second level of LoD is chapter +# the second level of LoD is a chapter chapter_data = pd.Variable(shape=[None, 128], type=pd.lod_tensor, level=2) def lower_level_rnn(paragraph): @@ -138,14 +138,14 @@ with top_level_rnn.stepnet(): pd.matmul(W0, paragraph_data) + pd.matmul(U0, h.pre_state())) top_level_rnn.add_outputs(h) -# just output the last step +# output the last step chapter_out = top_level_rnn(output_all_steps=False) ``` -in above example, the construction of the `top_level_rnn` calls `lower_level_rnn`. The input is a LoD Tensor. The top level RNN segments input text data into paragraphs, and the lower level RNN segments each paragraph into sentences. +In the above example, the construction of the `top_level_rnn` calls `lower_level_rnn`. The input is an LoD Tensor. The top level RNN segments input text data into paragraphs, and the lower level RNN segments each paragraph into sentences. -By default, the `RNNOp` will concatenate the outputs from all the time steps, -if the `output_all_steps` set to False, it will only output the final time step. +By default, the `RNNOp` will concatenate the outputs from all the time steps. +If the `output_all_steps` is set to False, it will only output the final time step.
diff --git a/doc/design/ops/sequence_decoder.md b/doc/design/ops/sequence_decoder.md index 9007aae7a8355ed06c6720a921351f81b859c1fe..9db5fb8e9a9f89b004bf71ddc064cd976c0d0bee 100644 --- a/doc/design/ops/sequence_decoder.md +++ b/doc/design/ops/sequence_decoder.md @@ -1,35 +1,28 @@ # Design: Sequence Decoder Generating LoDTensors -In tasks such as machine translation and image to text, -a [sequence decoder](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) is necessary to generate sequences. +In tasks such as machine translation and visual captioning, +a [sequence decoder](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) is necessary to generate sequences, one word at a time. This documentation describes how to implement the sequence decoder as an operator. ## Beam Search based Decoder -The [beam search algorithm](https://en.wikipedia.org/wiki/Beam_search) is necessary when generating sequences, -it is a heuristic search algorithm that explores the paths by expanding the most promising node in a limited set. +The [beam search algorithm](https://en.wikipedia.org/wiki/Beam_search) is necessary when generating sequences. It is a heuristic search algorithm that explores the paths by expanding the most promising node in a limited set. -In the old version of PaddlePaddle, a C++ class `RecurrentGradientMachine` implements the general sequence decoder based on beam search, -due to the complexity, the implementation relays on a lot of special data structures, -quite trivial and hard to be customized by users. +In the old version of PaddlePaddle, the C++ class `RecurrentGradientMachine` implements the general sequence decoder based on beam search, due to the complexity involved, the implementation relies on a lot of special data structures that are quite trivial and hard to be customized by users. -There are a lot of heuristic tricks in the sequence generation tasks, -so the flexibility of sequence decoder is very important to users. +There are a lot of heuristic tricks in the sequence generation tasks, so the flexibility of sequence decoder is very important to users. -During PaddlePaddle's refactoring work, -some new concept is proposed such as [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md) and [TensorArray](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/tensor_array.md) that can better support sequence usage, -and they can help to make the implementation of beam search based sequence decoder **more transparent and modular** . +During the refactoring of PaddlePaddle, some new concepts are proposed such as: [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md) and [TensorArray](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/tensor_array.md) that can better support the sequence usage, and they can also help make the implementation of beam search based sequence decoder **more transparent and modular** . -For example, the RNN sates, candidates IDs and probabilities of beam search can be represented as `LoDTensors`; +For example, the RNN states, candidates IDs and probabilities of beam search can be represented all as `LoDTensors`; the selected candidate's IDs in each time step can be stored in a `TensorArray`, and `Packed` to the sentences translated. ## Changing LoD's absolute offset to relative offsets -The current `LoDTensor` is designed to store levels of variable-length sequences, -it stores several arrays of integers each represents a level. +The current `LoDTensor` is designed to store levels of variable-length sequences. It stores several arrays of integers where each represents a level. -The integers in each level represents the begin and end (not inclusive) offset of a sequence **in the underlying tensor**, -let's call this format the **absolute-offset LoD** for clear. +The integers in each level represent the begin and end (not inclusive) offset of a sequence **in the underlying tensor**, +let's call this format the **absolute-offset LoD** for clarity. -The relative-offset LoD can fast retrieve any sequence but fails to represent empty sequences, for example, a two-level LoD is as follows +The relative-offset LoD can retrieve any sequence very quickly but fails to represent empty sequences, for example, a two-level LoD is as follows ```python [[0, 3, 9] [0, 2, 3, 3, 3, 9]] @@ -41,10 +34,9 @@ The first level tells that there are two sequences: while on the second level, there are several empty sequences that both begin and end at `3`. It is impossible to tell how many empty second-level sequences exist in the first-level sequences. -There are many scenarios that relay on empty sequence representation, -such as machine translation or image to text, one instance has no translations or the empty candidate set for a prefix. +There are many scenarios that rely on empty sequence representation, for example in machine translation or visual captioning, one instance has no translation or the empty candidate set for a prefix. -So let's introduce another format of LoD, +So let's introduce another format of LoD, it stores **the offsets of the lower level sequences** and is called **relative-offset** LoD. For example, to represent the same sequences of the above data @@ -54,19 +46,18 @@ For example, to represent the same sequences of the above data [0, 2, 3, 3, 3, 9]] ``` -the first level represents that there are two sequences, +the first level represents that there are two sequences, their offsets in the second-level LoD is `[0, 3)` and `[3, 5)`. The second level is the same with the relative offset example because the lower level is a tensor. It is easy to find out the second sequence in the first-level LoD has two empty sequences. -The following demos are based on relative-offset LoD. +The following examples are based on relative-offset LoD. ## Usage in a simple machine translation model -Let's start from a simple machine translation model that is simplified from [machine translation chapter](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation) to draw a simple blueprint of what a sequence decoder can do and how to use it. +Let's start from a simple machine translation model that is simplified from the [machine translation chapter](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation) to draw a blueprint of what a sequence decoder can do and how to use it. -The model has an encoder that learns the semantic vector from a sequence, -and a decoder which uses the sequence decoder to generate new sentences. +The model has an encoder that learns the semantic vector from a sequence, and a decoder which uses the sequence encoder to generate new sentences. **Encoder** ```python @@ -117,7 +108,7 @@ def generate(): # which means there are 2 sentences to translate # - the first sentence has 1 translation prefixes, the offsets are [0, 1) # - the second sentence has 2 translation prefixes, the offsets are [1, 3) and [3, 6) - # the target_word.lod is + # the target_word.lod is # [[0, 1, 6] # [0, 2, 4, 7, 9 12]] # which means 2 sentences to translate, each has 1 and 5 prefixes @@ -154,37 +145,36 @@ def generate(): translation_ids, translation_scores = decoder() ``` -The `decoder.beam_search` is a operator that given the candidates and the scores of translations including the candidates, -return the result of the beam search algorithm. +The `decoder.beam_search` is an operator that, given the candidates and the scores of translations including the candidates, +returns the result of the beam search algorithm. -In this way, users can customize anything on the inputs or outputs of beam search, for example, two ways to prune some translation prefixes +In this way, users can customize anything on the input or output of beam search, for example: -1. meke the correspondind elements in `topk_generated_scores` zero or some small values, beam_search will discard this candidate. -2. remove some specific candidate in `selected_ids` -3. get the final `translation_ids`, remove the translation sequence in it. +1. Make the corresponding elements in `topk_generated_scores` zero or some small values, beam_search will discard this candidate. +2. Remove some specific candidate in `selected_ids`. +3. Get the final `translation_ids`, remove the translation sequence in it. -The implementation of sequence decoder can reuse the C++ class [RNNAlgorithm](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/paddle/operators/dynamic_recurrent_op.h#L30), -so the python syntax is quite similar to a [RNN](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/doc/design/block.md#blocks-with-for-and-rnnop). +The implementation of sequence decoder can reuse the C++ class: [RNNAlgorithm](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/paddle/operators/dynamic_recurrent_op.h#L30), +so the python syntax is quite similar to that of an [RNN](https://github.com/Superjom/Paddle/blob/68cac3c0f8451fe62a4cdf156747d6dc0ee000b3/doc/design/block.md#blocks-with-for-and-rnnop). -Both of them are two-level `LoDTensors` +Both of them are two-level `LoDTensors`: -- the first level represents `batch_size` of (source) sentences; -- the second level represents the candidate ID sets for translation prefix. +- The first level represents `batch_size` of (source) sentences. +- The second level represents the candidate ID sets for translation prefix. -for example, 3 source sentences to translate, and has 2, 3, 1 candidates. +For example, 3 source sentences to translate, and has 2, 3, 1 candidates. -Unlike an RNN, in sequence decoder, the previous state and the current state have different LoD and shape, -a `lod_expand` operator is used to expand the LoD of the previous state to fit the current state. +Unlike an RNN, in sequence decoder, the previous state and the current state have different LoD and shape, and an `lod_expand` operator is used to expand the LoD of the previous state to fit the current state. -For example, the previous state +For example, the previous state: * LoD is `[0, 1, 3][0, 2, 5, 6]` * content of tensor is `a1 a2 b1 b2 b3 c1` -the current state stored in `encoder_ctx_expanded` +the current state is stored in `encoder_ctx_expanded`: * LoD is `[0, 2, 7][0 3 5 8 9 11 11]` -* the content is +* the content is - a1 a1 a1 (a1 has 3 candidates, so the state should be copied 3 times for each candidates) - a2 a2 - b1 b1 b1 @@ -192,54 +182,48 @@ the current state stored in `encoder_ctx_expanded` - b3 b3 - None (c1 has 0 candidates, so c1 is dropped) -Benefit from the relative offset LoD, empty candidate set can be represented naturally. +The benefit from the relative offset LoD is that the empty candidate set can be represented naturally. -the status in each time step can be stored in `TensorArray`, and `Pack`ed to a final LoDTensor, the corresponding syntax is +The status in each time step can be stored in `TensorArray`, and `Pack`ed to a final LoDTensor. The corresponding syntax is: ```python decoder.output(selected_ids) decoder.output(selected_generation_scores) ``` -the `selected_ids` is the candidate ids for the prefixes, -it will be `Packed` by `TensorArray` to a two-level `LoDTensor`, -the first level represents the source sequences, -the second level represents generated sequences. +The `selected_ids` are the candidate ids for the prefixes, and will be `Packed` by `TensorArray` to a two-level `LoDTensor`, where the first level represents the source sequences and the second level represents generated sequences. -Pack the `selected_scores` will get a `LoDTensor` that stores scores of each candidate of translations. +Packing the `selected_scores` will get a `LoDTensor` that stores scores of each translation candidate. -Pack the `selected_generation_scores` will get a `LoDTensor`, and each tail is the probability of the translation. +Packing the `selected_generation_scores` will get a `LoDTensor`, and each tail is the probability of the translation. ## LoD and shape changes during decoding
-According the image above, the only phrase to change LoD is beam search. +According to the image above, the only phase that changes the LoD is beam search. ## Beam search design -The beam search algorthm will be implemented as one method of the sequence decoder, it has 3 inputs +The beam search algorithm will be implemented as one method of the sequence decoder and has 3 inputs: -1. `topk_ids`, top K candidate ids for each prefix. +1. `topk_ids`, the top K candidate ids for each prefix. 2. `topk_scores`, the corresponding scores for `topk_ids` 3. `generated_scores`, the score of the prefixes. -All of the are LoDTensors, so that the sequence affilication is clear. -Beam search will keep a beam for each prefix and select a smaller candidate set for each prefix. +All of these are LoDTensors, so that the sequence affiliation is clear. Beam search will keep a beam for each prefix and select a smaller candidate set for each prefix. -It will return three variables +It will return three variables: 1. `selected_ids`, the final candidate beam search function selected for the next step. 2. `selected_scores`, the scores for the candidates. -3. `generated_scores`, the updated scores for each prefixes (with the new candidates appended). +3. `generated_scores`, the updated scores for each prefix (with the new candidates appended). ## Introducing the LoD-based `Pack` and `Unpack` methods in `TensorArray` -The `selected_ids`, `selected_scores` and `generated_scores` are LoDTensors, -and they exist in each time step, +The `selected_ids`, `selected_scores` and `generated_scores` are LoDTensors that exist at each time step, so it is natural to store them in arrays. -Currently, PaddlePaddle has a module called `TensorArray` which can store an array of tensors, -the results of beam search are better to store in a `TensorArray`. +Currently, PaddlePaddle has a module called `TensorArray` which can store an array of tensors. It is better to store the results of beam search in a `TensorArray`. -The `Pack` and `UnPack` in `TensorArray` are used to package tensors in the array to a `LoDTensor` or split the `LoDTensor` to an array of tensors. -It needs some extensions to support pack or unpack an array of `LoDTensors`. +The `Pack` and `UnPack` in `TensorArray` are used to pack tensors in the array to an `LoDTensor` or split the `LoDTensor` to an array of tensors. +It needs some extensions to support the packing or unpacking an array of `LoDTensors`.