3. The 8 stacked LSTM units are trained in alternating "forward / backward" order, i.e., successive LSTM layers read the sequence in opposite directions (a sketch of the stacking loop follows the code below).
```python
hidden_0 = paddle.layer.mixed(
...
...
input_tmp = [mix_hidden, lstm]
```
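For reference, the elided stacking loop typically looks like the sketch below in the PaddlePaddle v2 API. The names `hidden_dim`, `depth`, `hidden_0`, `lstm_0`, `hidden_para_attr` and `lstm_para_attr` are assumed to be defined in the omitted part of the configuration.
```python
import paddle.v2 as paddle

# Sketch only: stack `depth` LSTM layers on top of the first projection/LSTM pair,
# alternating the direction of every other layer.
input_tmp = [hidden_0, lstm_0]

for i in range(1, depth):
    # Mix the previous projection layer and the previous LSTM output.
    mix_hidden = paddle.layer.mixed(
        size=hidden_dim,
        input=[
            paddle.layer.full_matrix_projection(
                input=input_tmp[0], param_attr=hidden_para_attr),
            paddle.layer.full_matrix_projection(
                input=input_tmp[1], param_attr=lstm_para_attr)
        ])

    # Every other LSTM reads the sequence in reverse, which gives the
    # alternating "forward / backward" order described above.
    lstm = paddle.layer.lstmemory(
        input=mix_hidden,
        act=paddle.activation.Relu(),
        gate_act=paddle.activation.Sigmoid(),
        state_act=paddle.activation.Sigmoid(),
        reverse=((i % 2) == 1),
        param_attr=lstm_para_attr)

    input_tmp = [mix_hidden, lstm]
```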
4. We concatenate the output of the top LSTM unit with its input and project them into a hidden layer. A fully connected layer is then put on top to get the final vector representation (a sketch follows the code below).
```python
feature_out = paddle.layer.mixed(
...
...
],)
```
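The elided part of this block is essentially two projections over the pair kept in `input_tmp` (the top projection layer and the top LSTM). A sketch, assuming `label_dict_len` and the parameter attributes from the previous step are in scope:
```python
# Sketch only: project the top mixed layer and the top LSTM output into the
# final feature layer (one unit per semantic-role label).
feature_out = paddle.layer.mixed(
    size=label_dict_len,
    input=[
        paddle.layer.full_matrix_projection(
            input=input_tmp[0], param_attr=hidden_para_attr),
        paddle.layer.full_matrix_projection(
            input=input_tmp[1], param_attr=lstm_para_attr)
    ])
```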
5. We use a CRF layer as the cost function; the parameter of the CRF cost layer is named `crfw` (a sketch follows the code below).
```python
crf_cost = paddle.layer.crf(
...
...
learning_rate=mix_hidden_lr))
```
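For reference, the elided arguments typically look like the sketch below; `label_dict_len`, `feature_out`, `target` and `default_std` are assumed to come from the earlier, omitted part of the configuration.
```python
# Sketch only: a linear-chain CRF cost over the per-label scores in feature_out.
crf_cost = paddle.layer.crf(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(
        name='crfw',  # this name is what the decoding layer shares below
        initial_std=default_std,
        learning_rate=mix_hidden_lr))
```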
6. A CRF decoding layer is used for evaluation and inference. It shares its parameters with the CRF cost layer; parameter sharing among multiple layers is specified by giving these layers the same parameter name, as sketched below.
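A sketch of such a decoding layer; reusing the parameter name `crfw` in `param_attr` is what ties it to the transition parameters learned by the CRF cost layer above.
```python
# Sketch only: Viterbi decoding with the shared 'crfw' transition parameters.
# When `label` is also given, the layer reports decoding errors for evaluation.
crf_dec = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(name='crfw'))
```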
This tutorial uses the default SGD and Adam learning algorithm, with a learning rate of 5e-4. Note that `batch_size = 50` means 50 sequences are generated at a time.
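As a rough sketch, the corresponding optimizer settings in the classic trainer-config API would look something like the following; treat the exact call as an assumption rather than the tutorial's verbatim configuration.
```python
from paddle.trainer_config_helpers import *

# Sketch only: Adam with the learning rate and batch size quoted above.
settings(
    learning_method=AdamOptimizer(),
    learning_rate=5e-4,
    batch_size=50)
```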
### Model Structure
1. Define some global variables
```python
...
...
with mixed_layer(size=decoder_size) as encoded_proj:
```
3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
```python
...
...
```
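The elided block for this step is small; a sketch, assuming `from paddle.trainer_config_helpers import *` plus `src_backward` (the backward GRU over the source sentence) and `decoder_size` from the encoder part above:
```python
# Sketch only: the backward GRU's state at the first token is its final state
# in time, fetched with first_seq and passed through a tanh projection to
# initialize (boot) the decoder.
backward_first = first_seq(input=src_backward)
with mixed_layer(size=decoder_size, act=TanhActivation()) as decoder_boot:
    decoder_boot += full_matrix_projection(input=backward_first)
```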
3.3 Define the computation in each time step of the decoder RNN, i.e., use the current context vector $c_i$, the decoder hidden state $z_i$ and the $i$-th word $u_i$ of the target language to predict the probability $p_{i+1}$ of the $(i+1)$-th word.
- decoder_mem records the hidden state $z_i$ from the previous time step, with its initial state set to decoder_boot.
```python
...
...
out += full_matrix_projection(input=gru_step)
return out
```
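Putting the pieces together, the step function sketched below is close to the classic seqToseq configuration; `simple_attention`, `decoder_size`, `target_dict_dim` and `decoder_boot` are assumed to be available from the surrounding (omitted) code.
```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
    # decoder_mem holds z_{i-1}; it is booted with decoder_boot at step 0.
    decoder_mem = memory(
        name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)

    # Attention: compute the context vector c_i from the encoder outputs
    # and the previous decoder state.
    context = simple_attention(
        encoded_sequence=enc_vec,
        encoded_proj=enc_proj,
        decoder_state=decoder_mem)

    # Combine the context vector and the current target-language word.
    with mixed_layer(size=decoder_size * 3) as decoder_inputs:
        decoder_inputs += full_matrix_projection(input=context)
        decoder_inputs += full_matrix_projection(input=current_word)

    # One GRU step produces the new decoder state z_i.
    gru_step = gru_step_layer(
        name='gru_decoder',
        input=decoder_inputs,
        output_mem=decoder_mem,
        size=decoder_size)

    # Softmax over the target vocabulary gives p_{i+1}.
    with mixed_layer(
            size=target_dict_dim, bias_attr=True,
            act=SoftmaxActivation()) as out:
        out += full_matrix_projection(input=gru_step)
    return out
```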
4. Decoder differences between training and generation
4.1 Define the name of the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for these two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
```python
...
...
```
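A sketch of what these definitions usually look like (`encoded_vector` and `encoded_proj` are the encoder outputs built in the omitted part of step 3):
```python
# Sketch only: the encoder outputs do not change across decoder steps,
# so they are wrapped in StaticInput.
decoder_group_name = "decoder_group"
group_input1 = StaticInput(input=encoded_vector, is_seq=True)
group_input2 = StaticInput(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2]
```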
4.2 In training mode:
- the word embedding of the target language, trg_embedding, is passed to `gru_decoder_with_attention` as current_word (a sketch of this wiring follows the code below).
```python
...
...
cost = classification_cost(input=decoder, label=lbl)
outputs(cost)
```
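The elided wiring referred to above roughly follows the classic seqToseq configuration: append the target-language embedding as the third group input and unroll the step function with `recurrent_group`. `word_vector_dim` and the layer names are assumptions here.
```python
# Sketch only: training-mode decoder.
trg_embedding = embedding_layer(
    input=data_layer(name='target_language_word', size=target_dict_dim),
    size=word_vector_dim,
    param_attr=ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding)

# recurrent_group runs gru_decoder_with_attention once per target word;
# `decoder` then feeds the classification_cost shown above.
decoder = recurrent_group(
    name=decoder_group_name,
    step=gru_decoder_with_attention,
    input=group_inputs)
```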
4.3 In generation mode:
- during generation, as the decoder RNN will take the word vector generated from the previous time step as input, `GeneratedInput` is used to implement this automatically. Please refer to [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
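A sketch of the generation-mode counterpart: `GeneratedInput` feeds the word chosen at the previous step back in as current_word, and `beam_search` unrolls the same step function. The beam size, maximum length and the begin/end token ids below are illustrative assumptions.
```python
# Sketch only: generation-mode decoder with beam search.
trg_embedding = GeneratedInput(
    size=target_dict_dim,
    embedding_name='_target_language_embedding',
    embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)

beam_gen = beam_search(
    name=decoder_group_name,
    step=gru_decoder_with_attention,
    input=group_inputs,
    bos_id=0,      # assumed id of the start-of-sentence token
    eos_id=1,      # assumed id of the end-of-sentence token
    beam_size=3,
    max_length=250)

outputs(beam_gen)
```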