# Sentiment Analysis

The source code for this section is located at [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment). First-time users may refer to the PaddlePaddle [installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).

## Background

In natural language processing, sentiment analysis refers to determining the emotion expressed in a piece of text. The text can be a sentence, a paragraph, or a document. Emotion categorization can be binary -- positive/negative or happy/sad -- or in three classes -- positive/neutral/negative. Sentiment analysis is applicable in a wide range of services, such as e-commerce sites like Amazon and Taobao, hospitality services like Airbnb and hotels.com, and movie rating sites like Rotten Tomatoes and IMDB. It can be used to gauge from the reviews how the customers feel about the product. Table 1 illustrates an example of sentiment analysis in movie reviews:

| Movie Review | Category |
| -------- | ----- |
| Best movie of Xiaogang Feng in recent years! | Positive |
| Pretty bad. Feels like a tv-series from a local TV-channel | Negative |
| Politically correct version of Taken ... and boring as Heck | Negative |
| delightful, mesmerizing, and completely unexpected. The plot is nicely designed. | Positive |

<p align="center">Table 1 Sentiment Analysis in Movie Reviews</p>

In natural language processing, sentiment analysis can be categorized as a **Text Classification problem**, i.e., assigning a piece of text to a specific class. It involves two related tasks: text representation and classification. Before the emergence of deep learning techniques, the mainstream methods for text representation included BOW (*bag of words*) and topic modeling, while the mainstream methods for classification included SVM (*support vector machine*) and LR (*logistic regression*).

The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad” and “boring, dull, and empty work” have very similar meanings, yet their BOW representations have very little similarity. Furthermore, “the movie is bad” and “the movie is not bad” have high similarity with BOW features, but they express completely opposite semantics.
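
The two failure cases above can be reproduced with a minimal sketch in plain Python. The helper names `bow` and `cosine` are illustrative and not part of PaddlePaddle, and the tokenizer is a deliberately simple assumption (lowercase letters and apostrophes only).

```python
import math
import re
from collections import Counter


def bow(text):
    """Return a bag-of-words vector (word -> count), ignoring word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Opposite sentiment, yet the two vectors differ by a single word.
opposite = cosine(bow("the movie is bad"), bow("the movie is not bad"))
# Similar sentiment, yet the two vectors share no words at all.
similar = cosine(bow("this movie is extremely bad"),
                 bow("boring, dull, and empty work"))

print("opposite meaning: %.3f" % opposite)
print("similar meaning:  %.3f" % similar)
```

The first pair scores about 0.89 and the second pair scores 0, which is the opposite of what a sentiment model needs.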

This chapter introduces a deep learning model that addresses these shortcomings of BOW. The model embeds text into a low-dimensional space and takes word order into account. It is an end-to-end framework and shows a large performance improvement over traditional methods \[[1](#references)\].

## Model Overview

The models used in this chapter are **Convolutional Neural Networks** (**CNNs**) and **Recurrent Neural Networks** (**RNNs**), with some specific extensions.


### Revisiting the Convolutional Neural Network for Text (CNN)

The convolutional neural network for text was introduced in the chapter [recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system); here is a brief overview.

A CNN mainly consists of convolution and pooling operations, which can be combined in various ways for different applications. We first apply the convolution operation: the kernel is applied to each window of words to extract a feature, and convolving the kernel over every window produces a feature map. Next, we apply *max pooling* over time, which takes the maximum element of the feature map, to represent the whole sentence. In real applications, we apply multiple CNN kernels to each sentence; this can be implemented efficiently by concatenating the kernels into a matrix. We can also use CNN kernels with different window sizes. Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax layer to form the model for the sentiment analysis problem.
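
The convolution and max-pooling-over-time steps can be sketched in plain Python as follows. This is a toy illustration rather than PaddlePaddle's implementation: the function names, the ReLU non-linearity, and the hand-picked kernels are assumptions, and the sentence is assumed to be at least as long as the largest window.

```python
def conv_feature_map(embeddings, kernel, bias=0.0):
    """Slide `kernel` (window_size x emb_dim) over the sentence.

    Returns one activation per window position, i.e. the feature map.
    """
    window = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - window + 1):
        total = bias
        for offset in range(window):
            word_vec = embeddings[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], word_vec))
        feature_map.append(max(0.0, total))  # ReLU non-linearity (assumed)
    return feature_map


def text_cnn_features(embeddings, kernels):
    """Max-pool each kernel's feature map over time and concatenate."""
    return [max(conv_feature_map(embeddings, k)) for k in kernels]


# A 4-word sentence with 2-dimensional toy embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel_2 = [[1.0, 0.0], [0.0, 1.0]]              # window size 2
kernel_3 = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # window size 3

features = text_cnn_features(sentence, [kernel_2, kernel_3])
print(features)
```

Whatever the sentence length, the result has one entry per kernel, which is what makes the representation fixed-length.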

For short texts, the aforementioned CNN model can achieve very high accuracy \[[1](#references)\]. If we want to extract more abstract representations, we may apply a deeper CNN model \[[2](#references),[3](#references)\].

### Recurrent Neural Network (RNN)

RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete \[[4](#references)\]. Since NLP is a classical problem of sequential data, the RNN, especially its variant LSTM \[[5](#references)\], has achieved state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth.

<p align="center">
<img src="image/rnn.png" width = "60%" align="center"/><br/>
Figure 1. An illustration of an unfolded RNN in time.
</p>

As shown in Figure 1, we unfold an RNN: at the $t$-th time step, the network takes two inputs: the $t$-th input vector $\vec{x_t}$ and the latent state from the previous time step $\vec{h_{t-1}}$. From those, it computes the latent state of the current step $\vec{h_t}$. This process is repeated until all inputs are consumed. Denoting the RNN as function $f$, it can be formulated as follows:

$$\vec{h_t}=f(\vec{x_t},\vec{h_{t-1}})=\sigma(W_{xh}\vec{x_t}+W_{hh}\vec{h_{t-1}}+\vec{b_h})$$

where $W_{xh}$ is the input-to-latent weight matrix, $W_{hh}$ is the latent-to-latent weight matrix, $\vec{b_h}$ is the latent bias, and $\sigma$ refers to the sigmoid function.
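
The recurrence above can be written out as a minimal pure-Python sketch. The helper names, the toy weights, and the all-zero initial state $\vec{h_0}$ are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [sigmoid(a + b + c) for a, b, c in zip(from_input, from_state, b_h)]


def run_rnn(inputs, W_xh, W_hh, b_h):
    """Unfold the RNN over the whole sequence; return every latent state."""
    h = [0.0] * len(b_h)  # initial latent state h_0 (assumed to be zeros)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


# Toy example: 2-dimensional inputs, 2-dimensional latent state.
W_xh = [[0.5, -0.5], [0.3, 0.8]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
states = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W_xh, W_hh, b_h)
print(states[-1])  # the last latent state summarizes the whole sequence
```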

In NLP, words are often represented as one-hot vectors and then mapped to an embedding. The embedded feature goes through an RNN as input $x_t$ at every time step. Moreover, we can stack further layers on top of the RNN to form a deep (stacked) RNN. Finally, the last latent state may be used as a feature for sentence classification.
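
As a small illustration of the one-hot-to-embedding mapping, the sketch below shows that multiplying a one-hot vector by an embedding matrix is the same as looking up one row of that matrix. The vocabulary, the matrix values, and the function names are made up for the example.

```python
def one_hot(index, vocab_size):
    """Vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def embed_by_matmul(one_hot_vec, embedding_matrix):
    """Compute v^T E, where E has shape (vocab_size, emb_dim)."""
    emb_dim = len(embedding_matrix[0])
    return [
        sum(one_hot_vec[i] * embedding_matrix[i][j]
            for i in range(len(one_hot_vec)))
        for j in range(emb_dim)
    ]


def embed_by_lookup(index, embedding_matrix):
    """Equivalent shortcut: pick the row that belongs to the word id."""
    return list(embedding_matrix[index])


vocab = {"the": 0, "movie": 1, "is": 2, "bad": 3}           # toy dictionary
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]        # toy embeddings

word_id = vocab["bad"]
via_matmul = embed_by_matmul(one_hot(word_id, len(vocab)), E)
via_lookup = embed_by_lookup(word_id, E)
print(via_matmul, via_lookup)
```

This equivalence is why a model can take integer word ids as input and never needs to materialize the one-hot vectors.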

### Long Short-Term Memory (LSTM)

Training an RNN on long sequential data sometimes leads to vanishing or exploding gradients \[[6](#references)\]. To solve this problem, Hochreiter and Schmidhuber (1997) proposed **Long Short-Term Memory** (LSTM) \[[5](#references)\].
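
A rough intuition for the gradient problem comes from a deliberately simplified scalar case: for a linear recurrence $h_t = w \cdot h_{t-1}$, a gradient flowing back through $n$ steps is scaled by $w^n$. The sketch below assumes this simplification (it ignores the non-linearity and the matrix structure of a real RNN), and `gradient_scale` is an illustrative name.

```python
def gradient_scale(recurrent_weight, num_steps):
    """Factor picked up by a gradient flowing back `num_steps` steps
    through the scalar linear recurrence h_t = w * h_{t-1}."""
    return recurrent_weight ** num_steps


for w in (0.9, 1.0, 1.1):
    print("w = %.1f: factor after 100 steps = %.3e" % (w, gradient_scale(w, 100)))
```

A weight slightly below 1 shrinks the gradient towards zero, while a weight slightly above 1 blows it up by several orders of magnitude.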

Compared to the structure of a simple RNN, an LSTM includes a memory cell $c$, an input gate $i$, a forget gate $f$ and an output gate $o$. These gates and memory cells dramatically improve the ability of the network to handle long sequences. We can formulate the **LSTM-RNN**, denoted as a function $F$, as follows:

$$ h_t=F(x_t,h_{t-1})$$

$F$ contains the following formulations \[[7](#references)\]:
\begin{align}
i_t & = \sigma(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)\\\\
f_t & = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f)\\\\
c_t & = f_t\odot c_{t-1}+i_t\odot \tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c)\\\\
o_t & = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o)\\\\
h_t & = o_t\odot \tanh(c_t)\\\\
\end{align}

In the equations, $i_t, f_t, c_t, o_t$ stand for the input gate, forget gate, memory cell and output gate, respectively. $W$ and $b$ are model parameters, $\tanh$ is the hyperbolic tangent, and $\odot$ denotes the element-wise product. The input gate controls the magnitude of the new input into the memory cell $c$; the forget gate controls how much memory is propagated from the previous time step; the output gate controls the magnitude of the output. The three gates are computed similarly with different parameters, and they influence the memory cell $c$ separately, as shown in Figure 2:

<p align="center">
<img src="image/lstm_en.png" width = "65%" align="center"/><br/>
Figure 2. LSTM at time step $t$ [7].
</p>
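
A single LSTM time step, following the gate equations in this section, can be sketched in plain Python as below. The parameter dictionary layout, the helper names, and the treatment of the cell-to-gate weights ($W_{ci}$, $W_{cf}$, $W_{co}$) as full matrices are assumptions made for illustration; this is not how PaddlePaddle implements `lstmemory`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def affine(p, gate, x_t, h_prev, c=None):
    """W_x{gate} x_t + W_h{gate} h_{t-1} (+ W_c{gate} c) + b_{gate}."""
    total = [a + b + bias for a, b, bias in zip(
        matvec(p["W_x" + gate], x_t),
        matvec(p["W_h" + gate], h_prev),
        p["b_" + gate])]
    if c is not None:
        total = [t + pc for t, pc in zip(total, matvec(p["W_c" + gate], c))]
    return total


def lstm_step(x_t, h_prev, c_prev, p):
    i_t = [sigmoid(z) for z in affine(p, "i", x_t, h_prev, c_prev)]  # input gate
    f_t = [sigmoid(z) for z in affine(p, "f", x_t, h_prev, c_prev)]  # forget gate
    candidate = [math.tanh(z) for z in affine(p, "c", x_t, h_prev)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, candidate)]
    o_t = [sigmoid(z) for z in affine(p, "o", x_t, h_prev, c_t)]     # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(input_dim, hidden_dim):
    """All-zero parameters, handy for checking the gate arithmetic by hand."""
    p = {}
    for gate in "ifco":
        p["W_x" + gate] = [[0.0] * input_dim for _ in range(hidden_dim)]
        p["W_h" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        p["b_" + gate] = [0.0] * hidden_dim
        if gate != "c":
            p["W_c" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
    return p


params = zero_params(input_dim=2, hidden_dim=1)
h_t, c_t = lstm_step([1.0, -1.0], [0.0], [2.0], params)
# With zero weights every gate equals sigmoid(0) = 0.5, so
# c_t = 0.5 * 2.0 + 0.5 * tanh(0) = 1.0 and h_t = 0.5 * tanh(1.0).
print(h_t, c_t)
```

The zero-parameter case is only a sanity check: it makes the role of each gate visible, since the forget gate halves the old memory and the output gate halves what is exposed as $h_t$.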

LSTM improves the network's ability to capture long-term dependencies with the help of memory cells and gates. A similar structure with a simpler design is the Gated Recurrent Unit (GRU) \[[8](#references)\]. **These structures are still similar to the simple RNN, though with some modifications (as shown in Figure 2): the latent state depends on the current input as well as the latent state of the previous time step, and the process continues recurrently until all inputs are consumed:**

$$ h_t=Recurrent(x_t,h_{t-1})$$
where $Recurrent$ is a simple RNN, GRU or LSTM.

### Stacked Bidirectional LSTM

For a vanilla LSTM, $h_t$ contains information from the current input and the preceding context $1..t-1$. We can also apply an RNN in the reverse direction to take the succeeding context $t+1…n$ into consideration. Combining this with deep RNNs (deeper RNNs can capture more abstract and higher-level semantics), we can design deep stacked bidirectional LSTM structures to model sequential data \[[9](#references)\].

As shown in Figure 3 (3-layer RNN), odd layers are forward LSTMs and even layers are reverse LSTMs. Each higher LSTM layer takes the layer below it as input, and the top-layer LSTM produces a fixed-length vector by max pooling over time (this representation considers context from both preceding and succeeding words for higher-level abstraction). Finally, we feed this vector to a softmax layer for classification.

<p align="center">
<img src="image/stacked_lstm_en.png" width=450><br/>
Figure 3. Stacked Bidirectional LSTM for NLP modeling.
</p>
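
The alternating directions and the final max pooling can be illustrated with a toy sketch. To keep it short, a scalar "leaky accumulator" stands in for the LSTM cell and each layer feeds only its own output upward; both are simplifying assumptions, and all names are illustrative.

```python
def run_layer(inputs, step, reverse=False):
    """Run a recurrent `step(x_t, h_prev) -> h_t` over `inputs`.

    With reverse=True the sequence is consumed from the last element to the
    first, and the outputs are put back into the original time order.
    """
    order = list(reversed(inputs)) if reverse else list(inputs)
    h, outputs = 0.0, []
    for x_t in order:
        h = step(x_t, h)
        outputs.append(h)
    return list(reversed(outputs)) if reverse else outputs


def stacked_bidirectional(inputs, step, num_layers=3):
    """Odd layers (1, 3, ...) run forward, even layers run in reverse."""
    outputs = inputs
    for layer in range(1, num_layers + 1):
        outputs = run_layer(outputs, step, reverse=(layer % 2 == 0))
    return max(outputs)  # max pooling over time gives a fixed-length feature


def toy_step(x_t, h_prev):
    # Stand-in for an LSTM cell: a scalar leaky accumulator.
    return 0.5 * h_prev + x_t


sequence = [1.0, 0.0, 0.0]
print(run_layer(sequence, toy_step))                # forward pass
print(run_layer(sequence, toy_step, reverse=True))  # reverse pass
print(stacked_bidirectional(sequence, toy_step))    # pooled feature
```

In the forward pass the first input leaks into later positions, while in the reverse pass it cannot, which is why stacking both directions lets every position see context from both sides.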

## Dataset

We use the [IMDB](http://ai.stanford.edu/%7Eamaas/data/sentiment/) dataset for sentiment analysis in this tutorial. It consists of 50,000 movie reviews split evenly into a 25k train set and a 25k test set. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10.
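
The labeling rule can be expressed as a short function. `label_from_score` is a hypothetical helper written for this illustration; it is not part of the dataset or of PaddlePaddle, and it returns string labels to avoid assuming any particular integer encoding.

```python
def label_from_score(score):
    """Map a 1-10 rating to a sentiment label.

    Returns "negative" for scores <= 4, "positive" for scores >= 7, and
    None for the scores in between, which fall outside both labeled classes.
    """
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None


for score in (1, 4, 5, 6, 7, 10):
    print(score, label_from_score(score))
```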

The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, and `wmt14`. There's no need for us to manually download and preprocess IMDB.

After issuing the command `python train.py`, training starts immediately. The following sections unpack the details to show how it works.


## Model Structure

### Initialize PaddlePaddle

We must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).

```python
import sys
import paddle.v2 as paddle

# PaddlePaddle init: run on CPU with a single trainer
paddle.init(use_gpu=False, trainer_count=1)
```

As alluded to in section [Model Overview](#model-overview), here we provide the implementations of both Text CNN and Stacked-bidirectional LSTM models.

### Text Convolution Neural Network (Text CNN)

We create a neural network `convolution_net` as shown in the following snippet.

Note: `paddle.networks.sequence_conv_pool` includes both convolution and pooling layer operations.

```python
def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
    # Input: a sequence of word ids drawn from a dictionary of size input_dim.
    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    emb = paddle.layer.embedding(input=data, size=emb_dim)
    # Two convolution + pooling branches with window sizes 3 and 4.
    conv_3 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=3, hidden_size=hid_dim)
    conv_4 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=4, hidden_size=hid_dim)
    # Softmax classifier on top of both branches.
    output = paddle.layer.fc(input=[conv_3, conv_4],
                             size=class_dim,
                             act=paddle.activation.Softmax())
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
    cost = paddle.layer.classification_cost(input=output, label=lbl)
    return cost, output
```

1. Define input data and its dimension

    Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories. In `convolution_net`, the input to the network is defined in `paddle.layer.data`.

1. Define Classifier

    The above Text CNN network extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` function is then used to calculate the probability of the sentence belonging to each category.

1. Define Loss Function

    In the context of supervised learning, the labels of the training set are defined in `paddle.layer.data`, too. During training, cross-entropy is used as the loss function in `paddle.layer.classification_cost` and serves as the output of the network; during testing, the outputs are the probabilities calculated by the classifier.
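
The classifier and loss steps listed above boil down to a softmax followed by a cross-entropy. The plain-Python sketch below illustrates the arithmetic only; the function names are illustrative and it does not reflect how `paddle.layer.classification_cost` is implemented internally.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probabilities[label])


probs = softmax([2.0, 0.0])  # the model favors class 0
print(probs)
print(cross_entropy(probs, 0))  # small loss: the true class is the favored one
print(cross_entropy(probs, 1))  # large loss: the true class is the other one
```

The loss is small when the classifier puts high probability on the true class and grows as that probability shrinks, which is what training minimizes.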

### Stacked Bidirectional LSTM

We create a neural network `stacked_lstm_net` as below.

```python
def stacked_lstm_net(input_dim,
                     class_dim=2,
                     emb_dim=128,
                     hid_dim=512,
                     stacked_num=3):
    """
    A Wrapper for sentiment classification task.
    This network uses a bi-directional recurrent network,
    consisting of three LSTM layers. This configuration is
    motivated by the following paper, but uses fewer layers.
        http://www.aclweb.org/anthology/P15-1109
    input_dim: here is word dictionary dimension.
    class_dim: number of categories.
    emb_dim: dimension of word embedding.
    hid_dim: dimension of hidden layer.
    stacked_num: number of stacked lstm-hidden layer.
    """
    # An odd number of layers, so the top layer runs in the forward direction.
    assert stacked_num % 2 == 1

    fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
    lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
    para_attr = [fc_para_attr, lstm_para_attr]
    bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
    relu = paddle.activation.Relu()
    linear = paddle.activation.Linear()

    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    emb = paddle.layer.embedding(input=data, size=emb_dim)

    # First (forward) layer.
    fc1 = paddle.layer.fc(input=emb,
                          size=hid_dim,
                          act=linear,
                          bias_attr=bias_attr)
    lstm1 = paddle.layer.lstmemory(
        input=fc1, act=relu, bias_attr=bias_attr)

    # Remaining layers: even layers run in reverse, odd layers run forward.
    inputs = [fc1, lstm1]
    for i in range(2, stacked_num + 1):
        fc = paddle.layer.fc(input=inputs,
                             size=hid_dim,
                             act=linear,
                             param_attr=para_attr,
                             bias_attr=bias_attr)
        lstm = paddle.layer.lstmemory(
            input=fc,
            reverse=(i % 2) == 0,
            act=relu,
            bias_attr=bias_attr)
        inputs = [fc, lstm]

    # Max pooling over time yields fixed-length vectors.
    fc_last = paddle.layer.pooling(
        input=inputs[0], pooling_type=paddle.pooling.Max())
    lstm_last = paddle.layer.pooling(
        input=inputs[1], pooling_type=paddle.pooling.Max())
    output = paddle.layer.fc(input=[fc_last, lstm_last],
                             size=class_dim,
                             act=paddle.activation.Softmax(),
                             bias_attr=bias_attr,
                             param_attr=para_attr)

    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
    cost = paddle.layer.classification_cost(input=output, label=lbl)
    return cost, output
```

1. Define input data and its dimension

    Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories. In `stacked_lstm_net`, the input to the network is defined in `paddle.layer.data`.

1. Define Classifier

    The above stacked bidirectional LSTM network extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` function is then used to calculate the probability of the sentence belonging to each category.

1. Define Loss Function

    In the context of supervised learning, the labels of the training set are defined in `paddle.layer.data`, too. During training, cross-entropy is used as the loss function in `paddle.layer.classification_cost` and serves as the output of the network; during testing, the outputs are the probabilities calculated by the classifier.


To reiterate, we can invoke either `convolution_net` or `stacked_lstm_net`.

```python
word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict)
class_dim = 2

# option 1
[cost, output] = convolution_net(dict_dim, class_dim=class_dim)
# option 2
# [cost, output] = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3)
```

## Model Training

### Define Parameters

First, we create the model parameters according to the previous model configuration `cost`.

```python
# create parameters
parameters = paddle.parameters.create(cost)
```

### Create Trainer

Before creating the trainer, we need to configure the optimization algorithm.
Here we specify the `Adam` optimization algorithm via `paddle.optimizer`.

```python
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
    learning_rate=2e-3,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4),
    model_average=paddle.optimizer.ModelAverage(average_window=0.5))

# create trainer
trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=adam_optimizer)
```
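
For readers unfamiliar with Adam, the sketch below shows the standard update rule for a single scalar parameter, applied to a toy quadratic objective. It is a generic textbook illustration, not PaddlePaddle's implementation: `beta1`, `beta2` and `epsilon` are the defaults suggested in the Adam paper and are assumptions here, and the L2 regularization and model averaging configured above are left out.

```python
import math


def adam_step(param, grad, state, learning_rate=2e-3,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter; `state` holds t, m and v."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # 1st moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # 2nd moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])               # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)


# Toy objective (w - 3)^2, starting from w = 0.
state = {"t": 0, "m": 0.0, "v": 0.0}
w = 0.0
first = None
for step in range(100):
    grad = 2.0 * (w - 3.0)
    w = adam_step(w, grad, state)
    if step == 0:
        first = w

print(first)  # the first step moves by roughly the learning rate
print(w)      # after 100 steps w has moved further towards 3
```

Because the update is normalized by the gradient's running magnitude, the step size is governed mostly by `learning_rate` rather than by the raw gradient scale.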

### Training

`paddle.dataset.imdb.train()` yields records during each pass; after shuffling, the records are grouped into batches for training.

```python
train_reader = paddle.batch(
    paddle.reader.shuffle(
        lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
    batch_size=100)

test_reader = paddle.batch(
    lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
```
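
The idea behind composing a shuffling reader with a batching reader can be sketched with plain Python generators. The functions `buffered_shuffle` and `make_batches` are hypothetical stand-ins written for this illustration; they are not PaddlePaddle's `paddle.reader.shuffle` or `paddle.batch`, and details such as keeping the final partial batch are assumptions.

```python
import random


def buffered_shuffle(reader, buf_size, seed=None):
    """Wrap `reader` so that records are shuffled within a bounded buffer."""
    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        rng.shuffle(buf)  # flush whatever is left in the buffer
        for item in buf:
            yield item
    return shuffled_reader


def make_batches(reader, batch_size):
    """Wrap `reader` so that it yields lists of `batch_size` records."""
    def batch_reader():
        current = []
        for record in reader():
            current.append(record)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # keep the final partial batch (assumed behavior)
            yield current
    return batch_reader


def toy_reader():
    for i in range(10):
        yield ([i, i + 1], i % 2)  # (word ids, label)


batches = list(make_batches(buffered_shuffle(toy_reader, buf_size=4, seed=0),
                            batch_size=3)())
print([len(b) for b in batches])
```

A bounded buffer keeps memory use constant regardless of the dataset size, at the cost of shuffling only locally.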

`feeding` specifies the correspondence between each column of a yielded record and a `paddle.layer.data` layer. For instance, the first column of the data generated by `paddle.dataset.imdb.train()` corresponds to the `word` feature.

```python
feeding = {'word': 0, 'label': 1}
```
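
Conceptually, the mapping pairs each data layer name with a column index of the record. The sketch below shows that idea with a hypothetical helper `apply_feeding` and a made-up record; it is not how the trainer consumes `feeding` internally.

```python
def apply_feeding(record, feeding):
    """Return {layer name: value} for one record, using column indices."""
    return {name: record[column] for name, column in feeding.items()}


record = ([12, 7, 301], 1)  # hypothetical (word ids, label) record
feeding = {'word': 0, 'label': 1}
named = apply_feeding(record, feeding)
print(named)
```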

The callback function `event_handler` is invoked to track training progress whenever a pre-defined event occurs.

```python
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        # Report the cost every 100 batches; print a dot otherwise.
        if event.batch_id % 100 == 0:
            print("\nPass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics))
        else:
            sys.stdout.write('.')
            sys.stdout.flush()
    if isinstance(event, paddle.event.EndPass):
        # Save the parameters at the end of every pass.
        with open('./params_pass_%d.tar' % event.pass_id, 'wb') as f:
            parameters.to_tar(f)

        # Evaluate on the test set.
        result = trainer.test(reader=test_reader, feeding=feeding)
        print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
```

Finally, we can invoke `trainer.train` to start training:

```python
trainer.train(
    reader=train_reader,
    event_handler=event_handler,
    feeding=feeding,
    num_passes=10)
```


## Conclusion

In this chapter, we used sentiment analysis as an example to introduce the application of deep learning models to end-to-end short text classification, as well as how to implement the models with PaddlePaddle. Along the way, we briefly introduced two models for text processing: CNN and RNN. In the following chapters, we will see how these models can be applied to other tasks.

## References

1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modeling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
3. Dauphin Y N, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J]. arXiv preprint arXiv:1612.08083, 2016.
4. Siegelmann H T, Sontag E D. [On the computational power of neural nets](http://research.cs.queensu.ca/home/akl/cisc879/papers/SELECTED_PAPERS_FROM_VARIOUS_SOURCES/05070215382317071.pdf)[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 440-449.
5. Hochreiter S, Schmidhuber J. [Long short-term memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)[J]. Neural computation, 1997, 9(8): 1735-1780.
6. Bengio Y, Simard P, Frasconi P. [Learning long-term dependencies with gradient descent is difficult](http://www-dsi.ing.unifi.it/~paolo/ps/tnn-94-gradient.pdf)[J]. IEEE transactions on neural networks, 1994, 5(2): 157-166.
7. Graves A. [Generating sequences with recurrent neural networks](http://arxiv.org/pdf/1308.0850)[J]. arXiv preprint arXiv:1308.0850, 2013.
8. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://arxiv.org/pdf/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.

<br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.