This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
3. Eight LSTM units are trained in alternating "forward / backward" order.
```python
hidden_0 = paddle.layer.mixed(
...
...
for i in range(1, depth):
    input_tmp = [mix_hidden, lstm]
```
4. We concatenate the output of the top LSTM unit with its input and project the result into a hidden layer. A fully connected layer is then stacked on top to obtain the final vector representation.
```python
feature_out = paddle.layer.mixed(
...
...
], )
```
5. We use CRF as the cost function; the parameter of the CRF cost layer is named `crfw`.
```python
crf_cost = paddle.layer.crf(
...
...
        learning_rate=mix_hidden_lr))
```
6. The CRF decoding layer is used for evaluation and inference. It shares its parameters with the CRF layer; parameter sharing among multiple layers is specified by giving those layers the same parameter name.
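A minimal sketch of how this sharing might look with the `paddle.layer` API used above (assuming `feature_out` and a label dictionary size `label_dict_len` from the earlier steps; the exact arguments are illustrative):
```python
# Sketch: the decoding layer reuses the CRF transition weights by naming its
# parameter 'crfw', the same name given to the parameter of crf_cost above.
predict = paddle.layer.crf_decoding(
    size=label_dict_len,                        # assumed: number of SRL labels
    input=feature_out,
    param_attr=paddle.attr.Param(name='crfw'))
```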
This tutorial uses the default SGD-based Adam learning algorithm with a learning rate of 5e-4. Note that `batch_size = 50` means each batch contains 50 sequences.
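A minimal sketch of the corresponding optimizer configuration in the old-style trainer config helpers used by the code below (values taken from the text; the exact call is illustrative):
```python
from paddle.trainer_config_helpers import *

# Illustrative settings matching the text: Adam, learning rate 5e-4,
# and batches of 50 sequences.
settings(
    batch_size=50,
    learning_rate=5e-4,
    learning_method=AdamOptimizer())
```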
### Model Structure
1. Define some global variables
```python
...
...
with mixed_layer(size=decoder_size) as encoded_proj:
    ...
```
3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN, i.e., $c_0 = h_T$ (a sketch follows the code block below).
```python
...
...
```
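A sketch of how this initialization can be expressed with the same config helpers (assuming `src_backward` is the backward GRU sequence over the source sentence; the names are illustrative):
```python
# Sketch (assumed names): the first element of the backward GRU sequence
# corresponds to the last source word; a tanh projection of it serves as the
# decoder's initial state c_0.
backward_first = first_seq(input=src_backward)
with mixed_layer(size=decoder_size, act=TanhActivation()) as decoder_boot:
    decoder_boot += full_matrix_projection(input=backward_first)
```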
3.3 Define the computation in each time step for the decoder RNN, i.e., given the current context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th word $u_i$ of the target language, predict the probability $p_{i+1}$ of the $(i+1)$-th word (a condensed sketch of this step function follows the code block below).
- `decoder_mem` records the hidden state $z_i$ from the previous time step, initialized with `decoder_boot`.
```python
...
...
        out += full_matrix_projection(input=gru_step)
    return out
```
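A condensed sketch of such a step function, following the bullets above (the names `decoder_boot`, `decoder_size` and `target_dict_dim` are assumed from earlier steps; the tutorial's full version may differ in detail):
```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
    # Hidden state z_i from the previous step, bootstrapped with decoder_boot.
    decoder_mem = memory(
        name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)

    # Attention-weighted context vector c_i over the encoded source sequence.
    context = simple_attention(
        encoded_sequence=enc_vec,
        encoded_proj=enc_proj,
        decoder_state=decoder_mem)

    # Combine the context and the current target word, then advance the GRU
    # one step (a GRU step consumes 3 * size inputs for its gates).
    with mixed_layer(size=decoder_size * 3) as decoder_inputs:
        decoder_inputs += full_matrix_projection(input=context)
        decoder_inputs += full_matrix_projection(input=current_word)

    gru_step = gru_step_layer(
        name='gru_decoder',
        input=decoder_inputs,
        output_mem=decoder_mem,
        size=decoder_size)

    # Softmax over the target vocabulary gives p_{i+1}.
    with mixed_layer(
            size=target_dict_dim, act=SoftmaxActivation(),
            bias_attr=True) as out:
        out += full_matrix_projection(input=gru_step)
    return out
```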
4. Decoder differences between training and generation
4.1 Define the name of the decoder and the first two inputs to `gru_decoder_with_attention`. Note that `StaticInput` is used for both inputs; please refer to the [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details (a sketch follows the code block below).
```python
...
...
```
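A minimal sketch of what these two read-only inputs typically look like (assuming `encoded_vector` and `encoded_proj` are the encoder outputs defined earlier; the names are illustrative):
```python
# Sketch (assumed names): both encoder outputs are exposed to every decoder
# time step as read-only sequences via StaticInput.
decoder_group_name = "decoder_group"
group_input1 = StaticInput(input=encoded_vector, is_seq=True)
group_input2 = StaticInput(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2]
```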
4.2 In training mode:
- The word embedding of the target language, `trg_embedding`, is passed to `gru_decoder_with_attention` as `current_word` (see the sketch after the code block below).
```python
...
...
cost = classification_cost(input=decoder, label=lbl)
outputs(cost)
```
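The elided part of the block above unrolls the decoder over the target sequence; a minimal sketch of that wiring under the assumed names from the previous steps (not the tutorial's exact code):
```python
# Sketch (assumed names): append the target-word embedding and unroll the
# step function over the target sequence with recurrent_group.
group_inputs.append(trg_embedding)
decoder = recurrent_group(
    name=decoder_group_name,
    step=gru_decoder_with_attention,
    input=group_inputs)
```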
4.3 In generation mode:
- During generation, the decoder RNN takes the word vector generated at the previous time step as its input; `GeneratedInput` implements this automatically. Please refer to the [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
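A condensed sketch of how generation is typically wired with `GeneratedInput` and beam search (the hyper-parameters `beam_size` and `max_length`, and the start/end token ids, are assumptions; see the tutorial's full code for the exact version):
```python
# Sketch (assumed names): during generation, GeneratedInput feeds back the
# word predicted at the previous step as the current word embedding.
trg_embedding = GeneratedInput(
    size=target_dict_dim,
    embedding_name='_target_language_embedding',
    embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)

beam_gen = beam_search(
    name=decoder_group_name,
    step=gru_decoder_with_attention,
    input=group_inputs,
    bos_id=0,                 # assumed id of the start-of-sentence token
    eos_id=1,                 # assumed id of the end-of-sentence token
    beam_size=beam_size,
    max_length=max_length)
outputs(beam_gen)
```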
where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
For an $N$-class classification problem with $N$ output nodes, an $N$-dimensional vector is normalized into $N$ real values in the range $[0, 1]$, each representing the probability that the sample belongs to the corresponding class. Here $y_i$ is the prediction probability that an image is the digit $i$.
In such a classification problem, we usually use the cross-entropy loss function: $ L_{CE}(label, y) = -\sum_i label_i \log(y_i) $
Fig. 2 shows a softmax regression network, with weights in blue and biases in red. +1 indicates the bias is 1.
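As a quick numerical illustration (not part of the original tutorial), the softmax and cross-entropy definitions above can be checked with a few lines of NumPy:
```python
import numpy as np

# Raw scores for a 3-class example, and a one-hot label for class 1.
x = np.array([2.0, 1.0, 0.1])
label = np.array([0.0, 1.0, 0.0])

# softmax(x_i) = exp(x_i) / sum_j exp(x_j)
y = np.exp(x) / np.sum(np.exp(x))      # approx. [0.659, 0.242, 0.099]

# Cross-entropy loss: -sum_i label_i * log(y_i)
loss = -np.sum(label * np.log(y))
print(y, loss)
```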
...
...
The softmax regression model described above uses the simplest two-layer neural network, containing only an input layer and an output layer. To achieve better recognition results, we can add hidden layers between them:
1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU.
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, after the output layer, we get $Y=\text{softmax}(W_3H_2 + b_3)$, the final classification result vector.
Fig. 3 shows the Multilayer Perceptron network, with weights in blue and biases in red. +1 indicates the bias is 1.
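A minimal sketch of this multilayer perceptron in the `paddle.layer` API used earlier in this document (the hidden sizes 128 and 64 are illustrative; `img` is assumed to be the flattened 784-dimensional image input):
```python
import paddle.v2 as paddle

def multilayer_perceptron(img):
    # H1 and H2 with ReLU activations, followed by a 10-way softmax output Y.
    hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu())
    hidden2 = paddle.layer.fc(input=hidden1, size=64, act=paddle.activation.Relu())
    predict = paddle.layer.fc(
        input=hidden2, size=10, act=paddle.activation.Softmax())
    return predict
```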