<html>
<head>
  <script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
      inlineMath: [ ['$','$'] ],
      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
  });
  </script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js" async></script>
  <script type="text/javascript" src="../.tools/theme/marked.js">
  </script>
  <link href="http://cdn.bootcss.com/highlight.js/9.9.0/styles/darcula.min.css" rel="stylesheet">
  <script src="http://cdn.bootcss.com/highlight.js/9.9.0/highlight.min.js"></script>
  <link href="http://cdn.bootcss.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" rel="stylesheet">
  <link href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" rel="stylesheet">
  <link href="../.tools/theme/github-markdown.css" rel='stylesheet'>
</head>
<style type="text/css" >
.markdown-body {
    box-sizing: border-box;
    min-width: 200px;
    max-width: 980px;
    margin: 0 auto;
    padding: 45px;
}
</style>


<body>

<div id="context" class="container-fluid markdown-body">
</div>

<!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'>
# Sentiment Analysis

The source code for this section is located at [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment). For instructions on getting started with this book, see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).

## Background

In natural language processing, sentiment analysis refers to determining the emotion expressed in a piece of text. The text can be a sentence, a paragraph, or a document. Emotion categorization can be binary -- positive/negative or happy/sad -- or in three classes -- positive/neutral/negative. Sentiment analysis is applicable in a wide range of services, such as e-commerce sites like Amazon and Taobao, hospitality services like Airbnb and hotels.com, and movie rating sites like Rotten Tomatoes and IMDB. It can be used to gauge from the reviews how the customers feel about the product. Table 1 illustrates an example of sentiment analysis in movie reviews:

| Movie Review       | Category  |
| --------     | -----  |
| Best movie of Xiaogang Feng in recent years!| Positive |
| Pretty bad. Feels like a tv-series from a local TV-channel     | Negative |
| Politically correct version of Taken ... and boring as Heck| Negative|
|delightful, mesmerizing, and completely unexpected. The plot is nicely designed.|Positive|

<p align="center">Table 1 Sentiment Analysis in Movie Reviews</p>

In natural language processing, sentiment analysis can be categorized as a **text classification problem**, i.e., assigning a piece of text to a specific class. It involves two related tasks: text representation and classification. Before the emergence of deep learning techniques, the mainstream methods for text representation included BOW (*bag of words*) and topic modeling, while the mainstream methods for classification included SVM (*support vector machine*) and LR (*logistic regression*).

The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and treats the text simply as a set of words. For example, “this movie is extremely bad” and “boring, dull, and empty work” express very similar meanings, yet their BOW representations have very little similarity. Furthermore, “the movie is bad” and “the movie is not bad” have highly similar BOW features, but they express completely opposite meanings.
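
Both failure modes can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names `bow` and `cosine` are ours, and whitespace tokenization with raw counts is an assumption (practical BOW pipelines often add weighting schemes).

```python
import math
from collections import Counter


def bow(text):
    # Bag of words: token counts only; word order is discarded.
    return Counter(text.lower().replace(",", " ").split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


similar_meaning = cosine(bow("this movie is extremely bad"),
                         bow("boring, dull, and empty work"))
opposite_meaning = cosine(bow("the movie is bad"),
                          bow("the movie is not bad"))
print(similar_meaning, opposite_meaning)
```

The first pair shares no words, so its similarity is 0, while the second pair scores about 0.89 even though the two sentences mean the opposite.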

This chapter introduces a deep learning model that addresses these shortcomings of BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and offers a large performance improvement over traditional methods \[[1](#references)\].

## Model Overview

The models used in this chapter are based on **Convolutional Neural Networks** (**CNNs**) and **Recurrent Neural Networks** (**RNNs**), with some specific extensions.


### Revisiting Convolutional Neural Networks for Text (CNN)

The convolutional neural network for text was introduced in the [recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system) chapter; here is a brief overview.

A CNN mainly consists of convolution and pooling operations, which can be combined in various ways for different applications. We first apply the convolution operation: the kernel is applied to each window of words to extract a feature, and convolving the kernel over every window produces a feature map. Next, we apply *max pooling* over time, i.e., we take the maximum element of the feature map, to represent the whole sentence. In real applications, we apply multiple CNN kernels to each sentence; this can be implemented efficiently by concatenating the kernels into a matrix. We can also use kernels of different sizes. Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax layer to form the model for the sentiment analysis problem.
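
The convolution and max-pooling-over-time steps can be sketched in plain Python. This is an illustrative toy, not the PaddlePaddle implementation: the function names, the made-up embeddings and kernels, and the omission of an activation function are all assumptions made for brevity.

```python
def conv1d_over_time(embeddings, kernel, bias=0.0):
    # embeddings: list of word vectors, one per time step.
    # kernel: list of weight vectors, one per position in the window.
    window = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - window + 1):
        total = bias
        for offset in range(window):
            word_vec = embeddings[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], word_vec))
        feature_map.append(total)
    return feature_map


def max_over_time(feature_map):
    # Keeps one number per kernel, whatever the sentence length.
    return max(feature_map)


# A toy 4-word sentence with 2-dimensional embeddings (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],              # window size 2
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # window size 3
]
features = [max_over_time(conv1d_over_time(sentence, k)) for k in kernels]
print(features)
```

Each kernel contributes exactly one entry to `features`, so sentences of any length map to a vector whose size depends only on the number of kernels.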

For short texts, the aforementioned CNN model can achieve very high accuracy \[[1](#references)\]. If we want to extract more abstract representations, we may apply a deeper CNN model \[[2](#references),[3](#references)\].

### Recurrent Neural Network (RNN)

An RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete \[[4](#references)\]. Since NLP is a classical problem of sequential data, the RNN, especially its variant LSTM \[[5](#references)\], achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS tagging, image captioning, dialog, machine translation, and so forth.

<p align="center">
<img src="image/rnn.png" width = "60%" align="center"/><br/>
Figure 1. An illustration of an unfolded RNN in time.
</p>

As shown in Figure 1, we unfold an RNN: at the $t$-th time step, the network takes two inputs: the $t$-th input vector $\vec{x_t}$ and the latent state from the last time-step $\vec{h_{t-1}}$. From those, it computes the latent state of the current step $\vec{h_t}$. This process is repeated until all inputs are consumed. Denoting the RNN as function $f$, it can be formulated as follows:

$$\vec{h_t}=f(\vec{x_t},\vec{h_{t-1}})=\sigma(W_{xh}\vec{x_t}+W_{hh}\vec{h_{t-1}}+\vec{b_h})$$

where $W_{xh}$ is the input-to-latent weight matrix; $W_{hh}$ is the latent-to-latent weight matrix; $\vec{b_h}$ is the latent bias; and $\sigma$ refers to the sigmoid function.
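
The recurrence can be written out directly in plain Python. The following is a minimal sketch with made-up weights and toy dimensions; the helper names `matvec` and `rnn_step` are ours and do not come from PaddlePaddle.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    # Dense matrix-vector product on plain lists.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # One step of the recurrence: sigmoid(W_xh x_t + W_hh h_prev + b_h).
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [sigmoid(a + b + c)
            for a, b, c in zip(from_input, from_state, b_h)]


# Toy sizes: 2-dimensional inputs and a 2-dimensional latent state.
W_xh = [[0.5, -0.5], [0.25, 0.75]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]

h = [0.0, 0.0]  # initial latent state
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)  # latent state after the last input
```

The same weights are reused at every time step; only the latent state `h` carries information from one step to the next.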

In NLP, words are often represented as one-hot vectors and then mapped to an embedding. The embedded feature is fed to the RNN as the input $\vec{x_t}$ at every time step. Moreover, we can stack further layers on top of the RNN to form a deep or stacked RNN. Finally, the last latent state may be used as a feature for sentence classification.

### Long Short-Term Memory (LSTM)

Training an RNN on long sequential data sometimes leads to vanishing or exploding gradients \[[6](#references)\]. To solve this problem, Hochreiter and Schmidhuber (1997) proposed **Long Short-Term Memory** (LSTM) \[[5](#references)\].

Compared to a simple RNN, an LSTM adds a memory cell $c$, an input gate $i$, a forget gate $f$, and an output gate $o$. These gates and the memory cell dramatically improve the network's ability to handle long sequences. We can formulate the **LSTM-RNN**, denoted as a function $F$, as follows:

$$ h_t=F(x_t,h_{t-1})$$

$F$ consists of the following equations \[[7](#references)\]:
$$ i_t = \sigma(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i) $$
$$ f_t = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f) $$
$$ c_t = f_t\odot c_{t-1}+i_t\odot \tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) $$
$$ o_t = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o) $$
$$ h_t = o_t\odot \tanh(c_t) $$

In these equations, $i_t, f_t, c_t, o_t$ stand for the input gate, forget gate, memory cell, and output gate, respectively. $W$ and $b$ are model parameters, $\tanh$ is the hyperbolic tangent, and $\odot$ denotes the element-wise product. The input gate controls the magnitude of the new input into the memory cell $c$; the forget gate controls how much memory is propagated from the last time step; the output gate controls the magnitude of the output. The three gates are computed similarly with different parameters, and they influence the memory cell $c$ separately, as shown in Figure 2:

<p align="center">
<img src="image/lstm_en.png" width = "65%" align="center"/><br/>
Figure 2. LSTM at time step $t$ [7].
</p>
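
The five equations translate almost line by line into code. The sketch below is a plain-Python illustration rather than PaddlePaddle's implementation. It assumes the peephole weights $W_{ci}$, $W_{cf}$, $W_{co}$ are diagonal, so they are stored as vectors and applied element-wise, and `constant_params` is a hypothetical helper that fills every weight with the same made-up value.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, x, W_h, h, b):
    # Computes W_x x + W_h h + b for matrices stored as lists of rows.
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    # p maps names such as "W_xi" to weights; W_ci, W_cf, W_co are vectors.
    pre_i = affine(p["W_xi"], x_t, p["W_hi"], h_prev, p["b_i"])
    pre_f = affine(p["W_xf"], x_t, p["W_hf"], h_prev, p["b_f"])
    pre_c = affine(p["W_xc"], x_t, p["W_hc"], h_prev, p["b_c"])
    pre_o = affine(p["W_xo"], x_t, p["W_ho"], h_prev, p["b_o"])

    i_t = [sigmoid(a + w * c) for a, w, c in zip(pre_i, p["W_ci"], c_prev)]
    f_t = [sigmoid(a + w * c) for a, w, c in zip(pre_f, p["W_cf"], c_prev)]
    c_t = [f * c + i * math.tanh(a)
           for f, c, i, a in zip(f_t, c_prev, i_t, pre_c)]
    # The output gate looks at the updated cell state c_t.
    o_t = [sigmoid(a + w * c) for a, w, c in zip(pre_o, p["W_co"], c_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def constant_params(value, hidden=2, inputs=2):
    # Hypothetical helper: every weight gets the same value, for demonstration.
    p = {}
    for gate in "ifco":
        p["W_x" + gate] = [[value] * inputs for _ in range(hidden)]
        p["W_h" + gate] = [[value] * hidden for _ in range(hidden)]
        p["b_" + gate] = [0.0] * hidden
    for gate in "ifo":
        p["W_c" + gate] = [value] * hidden
    return p


params = constant_params(0.1)
h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0]]:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

With all weights set to zero, every gate evaluates to 0.5, so the cell state is simply halved at each step; this makes a convenient sanity check for the gating logic.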

LSTM improves the network's ability to capture long-term dependencies with the help of its memory cell and gates. A similar but simpler structure is the Gated Recurrent Unit (GRU) \[[8](#references)\]. **These structures are still similar to the simple RNN, though with some modifications (as shown in Figure 2): the latent state depends on the current input as well as on the latent state of the last time step, and the process continues recurrently until all inputs are consumed:**

$$ h_t=Recurrent(x_t,h_{t-1})$$
where $Recurrent$ is a simple RNN, GRU or LSTM.

### Stacked Bidirectional LSTM

For a vanilla LSTM, $h_t$ contains the input information from the preceding context at time steps $1..t-1$. We can also apply an RNN in the reverse direction to take the succeeding context $t+1…n$ into consideration. Combining this with deep RNNs (a deeper RNN can capture more abstract, higher-level semantics), we can design a deep stacked bidirectional LSTM to model sequential data \[[9](#references)\].

As shown in Figure 3 (a 3-layer RNN), the odd and even layers are forward and reverse LSTMs, respectively. Each higher LSTM layer takes the output of the layer below as input, and the top LSTM layer produces a fixed-length vector by max pooling (this representation considers context from both preceding and succeeding words for higher-level abstraction). Finally, we feed this vector to a softmax layer for classification.

<p align="center">
<img src="image/stacked_lstm_en.png" width=450><br/>
Figure 3. Stacked Bidirectional LSTM for NLP modeling.
</p>
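
The alternation of directions and the final pooling can be illustrated with a deliberately simplified sketch. A toy scalar recurrence stands in for a real LSTM here, and each layer feeds only its own output to the next one; both are simplifying assumptions, and the function names are ours.

```python
import math


def scan(inputs, reverse=False):
    # Toy recurrent layer on scalars: h = tanh(x + 0.5 * h_previous).
    # With reverse=True the sequence is read from the last step to the first.
    steps = range(len(inputs))
    order = reversed(steps) if reverse else steps
    outputs = [0.0] * len(inputs)
    h = 0.0
    for t in order:
        h = math.tanh(inputs[t] + 0.5 * h)
        outputs[t] = h
    return outputs


def stacked_bidirectional(inputs, num_layers=3):
    layer_input = inputs
    for layer in range(1, num_layers + 1):
        # Odd layers read forward, even layers read in reverse.
        layer_input = scan(layer_input, reverse=(layer % 2 == 0))
    # Max pooling over time gives a fixed-length summary (one scalar here).
    return max(layer_input)


sentence = [0.2, -0.4, 0.9, 0.1]
print(stacked_bidirectional(sentence))
```

Because the second layer reads the first layer's output backwards, every position in the top layer has seen information from both earlier and later time steps.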

## Dataset

We use the [IMDB](http://ai.stanford.edu/%7Eamaas/data/sentiment/) dataset for sentiment analysis in this tutorial. It consists of 50,000 movie reviews split evenly into a 25k train set and a 25k test set. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10.
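
The labeling rule can be expressed as a small function. This is only a sketch of the thresholds stated above; the function name is hypothetical, and returning `None` for scores of 5 and 6 is our way of marking that such scores fall in neither class.

```python
def label_from_score(score):
    # Thresholds as described above; 5 and 6 belong to neither class.
    if score >= 7:
        return "positive"
    if 4 >= score:
        return "negative"
    return None


print([label_from_score(s) for s in (2, 5, 9)])
```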

The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, and `wmt14`. There's no need for us to manually download and preprocess IMDB.

After issuing the command `python train.py`, training will start immediately. The following sections unpack the details to show how it works.


## Model Configuration

Our program starts by importing the necessary packages and initializing some global variables:

```python
from __future__ import print_function
import sys  # needed for sys.stderr in the fallback branch below
import paddle
import paddle.fluid as fluid
from functools import partial
import numpy as np
try:
    from paddle.fluid.contrib.trainer import *
    from paddle.fluid.contrib.inferencer import *
except ImportError:
    print(
        "In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
        file=sys.stderr)
    from paddle.fluid.trainer import *
    from paddle.fluid.inferencer import *

CLASS_DIM = 2      # number of sentiment categories
EMB_DIM = 128      # word embedding dimension
HID_DIM = 512      # hidden layer dimension
STACKED_NUM = 3    # number of stacked LSTM layers
BATCH_SIZE = 128
USE_GPU = False
```

As alluded to in section [Model Overview](#model-overview), here we provide the implementations of both Text CNN and Stacked-bidirectional LSTM models.

### Text Convolutional Neural Network (Text CNN)

We create a neural network `convolution_net` as shown in the following snippet.

Note: `fluid.nets.sequence_conv_pool` includes both convolution and pooling layer operations.

```python
def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim):
    emb = fluid.layers.embedding(
        input=data, size=[input_dim, emb_dim], is_sparse=True)
    conv_3 = fluid.nets.sequence_conv_pool(
        input=emb,
        num_filters=hid_dim,
        filter_size=3,
        act="tanh",
        pool_type="sqrt")
    conv_4 = fluid.nets.sequence_conv_pool(
        input=emb,
        num_filters=hid_dim,
        filter_size=4,
        act="tanh",
        pool_type="sqrt")
    prediction = fluid.layers.fc(
        input=[conv_3, conv_4], size=class_dim, act="softmax")
    return prediction
```
Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories.

The above Text CNN network extracts high-level features and maps them to a vector whose length equals the number of categories. The softmax activation of the final fully connected layer (`act="softmax"`) then computes the probability of the sentence belonging to each category.
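
For reference, the softmax function itself is easy to write down. The snippet below is a plain-Python sketch with made-up scores for the two classes; it is independent of PaddlePaddle.

```python
import math


def softmax(scores):
    # Subtracting the maximum improves numerical stability
    # and does not change the result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 0.0])  # made-up scores for the two categories
print(probs)
```

The outputs are positive and sum to 1, so they can be read as class probabilities; the larger score always receives the larger probability.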


### Stacked bidirectional LSTM

We create a neural network `stacked_lstm_net` as below.

```python
def stacked_lstm_net(data, input_dim, class_dim, emb_dim, hid_dim, stacked_num):

    emb = fluid.layers.embedding(
        input=data, size=[input_dim, emb_dim], is_sparse=True)

    fc1 = fluid.layers.fc(input=emb, size=hid_dim)
    lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim)

    inputs = [fc1, lstm1]

    for i in range(2, stacked_num + 1):
        fc = fluid.layers.fc(input=inputs, size=hid_dim)
        lstm, cell = fluid.layers.dynamic_lstm(
            input=fc, size=hid_dim, is_reverse=(i % 2) == 0)
        inputs = [fc, lstm]

    fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max')
    lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max')

    prediction = fluid.layers.fc(input=[fc_last, lstm_last],
                                 size=class_dim,
                                 act='softmax')
    return prediction
```
The above stacked bidirectional LSTM network extracts high-level features and maps them to a vector whose length equals the number of categories. The softmax activation of the final fully connected layer (`act='softmax'`) then computes the probability of the sentence belonging to each category.

To reiterate, we can invoke either `convolution_net` or `stacked_lstm_net`. In the steps below, we will go with `convolution_net`.

Next we define an `inference_program` that simply uses `convolution_net` to predict the output from the input defined by `fluid.layers.data`.

```python
def inference_program(word_dict):
    data = fluid.layers.data(
        name="words", shape=[1], dtype="int64", lod_level=1)

    dict_dim = len(word_dict)
    net = convolution_net(data, dict_dim, CLASS_DIM, EMB_DIM, HID_DIM)
    # net = stacked_lstm_net(data, dict_dim, CLASS_DIM, EMB_DIM, HID_DIM, STACKED_NUM)
    return net
```

Then we define a `train_program` that uses the result from `inference_program` to compute the cost with the label data.
We also define `optimizer_func` to specify the optimizer.

In the context of supervised learning, labels of the training set are defined with `fluid.layers.data` too. During training, cross-entropy (`fluid.layers.cross_entropy`) is used as the loss function and as the output of the network; during testing, the outputs are the probabilities calculated by the classifier.
The first value in the returned list must be the cost.

```python
def train_program(word_dict):
    prediction = inference_program(word_dict)
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")
    cost = fluid.layers.cross_entropy(input=prediction, label=label)
    avg_cost = fluid.layers.mean(cost)
    accuracy = fluid.layers.accuracy(input=prediction, label=label)
    return [avg_cost, accuracy]


def optimizer_func():
    return fluid.optimizer.Adagrad(learning_rate=0.002)
```
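
To make the two returned metrics concrete, here is a plain-Python sketch of what an averaged cross-entropy cost and an accuracy mean for a small batch. The predictions and labels are made up, and the helper names are ours; this illustrates the formulas only and is not how PaddlePaddle computes them internally.

```python
import math


def cross_entropy(probs, label):
    # Negative log-probability assigned to the true class.
    return -math.log(probs[label])


def batch_metrics(predictions, labels):
    costs = [cross_entropy(p, y) for p, y in zip(predictions, labels)]
    avg_cost = sum(costs) / len(costs)
    correct = sum(1 for p, y in zip(predictions, labels)
                  if p.index(max(p)) == y)
    return avg_cost, correct / len(labels)


predictions = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # made-up softmax outputs
labels = [0, 1, 1]
avg_cost, accuracy = batch_metrics(predictions, labels)
print(avg_cost, accuracy)
```

The third example is misclassified, so the accuracy is 2/3, and its low probability for the true class (0.4) contributes the largest term to the average cost.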

## Model Training

### Specify training environment

Specify your training environment, i.e., whether training runs on the CPU or the GPU.

```python
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
```

### Datafeeder Configuration

Next we define the data feeder for training. The feeder reads `buf_size` records at a time and feeds them to the training process.
`paddle.dataset.imdb.train` yields records during each pass; after shuffling, batches of `BATCH_SIZE` records are generated for training.

Note that loading and reading the IMDB data can take up to 1 minute. Please be patient.

```python
print("Loading IMDB word dict....")
word_dict = paddle.dataset.imdb.word_dict()

print("Reading training data....")
train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.imdb.train(word_dict), buf_size=25000),
    batch_size=BATCH_SIZE)
```
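
The idea behind composing a shuffling reader with a batching reader can be sketched in plain Python. The functions below are illustrative stand-ins written for this example, not the actual `paddle.reader.shuffle` and `paddle.batch` implementations; `toy_reader` and its records are made up.

```python
import random


def simple_shuffle(reader, buf_size, seed=None):
    # Returns a new reader that shuffles records within a bounded buffer.
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        rng.shuffle(buf)
        for item in buf:
            yield item

    return shuffled_reader


def simple_batch(reader, batch_size):
    # Returns a new reader that groups records into lists of batch_size.
    def batch_reader():
        current = []
        for record in reader():
            current.append(record)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            yield current  # last, possibly smaller, batch

    return batch_reader


def toy_reader():
    # Hypothetical records shaped like (word ids, label).
    for i in range(10):
        yield ([i, i + 1], i % 2)


toy_train_reader = simple_batch(
    simple_shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)
batches = list(toy_train_reader())
print([len(b) for b in batches])
```

A reader here is just a function that returns an iterator over records, which is why the two wrappers compose so easily. A larger `buf_size` shuffles more thoroughly at the cost of more memory.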


### Create Trainer

Create a trainer that takes `train_program` as input and specifies the optimizer function.

```python
trainer = Trainer(
    train_func=partial(train_program, word_dict),
    place=place,
    optimizer_func=optimizer_func)
```
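
The call to `partial` binds `word_dict` ahead of time, so the resulting function can be called later without arguments. The sketch below shows the mechanism with a made-up stand-in function; that the trainer calls `train_func` with no arguments is our reading of the snippet above, not something stated in this chapter.

```python
from functools import partial


def toy_train_program(word_dict):
    # Stand-in for the real function: just report the dictionary size.
    return len(word_dict)


toy_dict = {"good": 0, "bad": 1}
train_func = partial(toy_train_program, toy_dict)
print(train_func())  # no arguments needed any more
```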

### Feeding Data

`feed_order` specifies the correspondence between each column of a yielded record and a `fluid.layers.data` layer. For instance, the first column of the data generated by `imdb.train` corresponds to `words`.

```python
feed_order = ['words', 'label']
```
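
Conceptually, `feed_order` pairs each column of a record with the name of a data layer. A plain-Python sketch of that pairing, using a made-up record, looks like this:

```python
feed_order = ['words', 'label']

# A made-up record shaped like (word ids, label).
record = ([12, 7, 305], 0)

# Pair each column with the data layer name at the same position.
feed = dict(zip(feed_order, record))
print(feed)
```

Swapping the two names would send the word ids to the `label` layer, so the order must match the order of the columns in each record.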

### Event Handler

The callback function `event_handler` is called during training whenever a pre-defined event happens.
For example, when an `EndStepEvent` occurs, the handler below prints the current metrics; at step 10 it saves the parameters and stops training.

```python
# Specify the directory path to save the parameters
params_dirname = "understand_sentiment_conv.inference.model"

def event_handler(event):
    if isinstance(event, EndStepEvent):
        print("Step {0}, Epoch {1} Metrics {2}".format(
                event.step, event.epoch, list(map(np.array, event.metrics))))

        if event.step == 10:
            trainer.save_params(params_dirname)
            trainer.stop()
```

### Training

Finally, we invoke `trainer.train` to start training with `num_epochs` and other parameters.

```python
trainer.train(
    num_epochs=1,
    event_handler=event_handler,
    reader=train_reader,
    feed_order=feed_order)
```

## Inference

### Create Inferencer

Initialize the Inferencer with `inference_program` and `params_dirname`, the directory where we saved the parameters during training.

```python
inferencer = Inferencer(
    infer_func=partial(inference_program, word_dict),
    param_path=params_dirname,
    place=place)
```

### Create LoD Tensor with test data

To do inference, we make up 3 reviews as testing data. Feel free to modify any of them.
We map each word in the reviews to its id in `word_dict`, falling back to the id of the unknown-word token if the word is not in `word_dict`.
Then we build the LoD data from the id lists and use `create_lod_tensor` to create the LoD tensor.

```python
reviews_str = [
    'read the book forget the movie', 'this is a great movie', 'this is very bad'
]
reviews = [c.split() for c in reviews_str]

UNK = word_dict['<unk>']
lod = []
for c in reviews:
    lod.append([word_dict.get(words, UNK) for words in c])

base_shape = [[len(c) for c in lod]]

tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
```
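
The length information passed as `base_shape` is what lets sequences of different lengths share one flat tensor. The plain-Python sketch below shows the idea with made-up word ids; the function name is ours, and the offset form is shown only as one common way of locating each sequence in the flat data.

```python
from itertools import accumulate


def flatten_with_lengths(sequences):
    # Concatenate all sequences and remember where each one starts and ends.
    lengths = [len(seq) for seq in sequences]
    flat = [token for seq in sequences for token in seq]
    offsets = [0] + list(accumulate(lengths))
    return flat, lengths, offsets


# Made-up ids for three reviews of 6, 5 and 4 words.
ids = [[5, 2, 9, 7, 2, 11], [13, 4, 1, 8, 11], [13, 4, 20, 3]]
flat, lengths, offsets = flatten_with_lengths(ids)
print(lengths)  # plays the role of base_shape[0] above
print(offsets)

# Each review can be recovered from the flat data and the offsets.
second_review = flat[offsets[1]:offsets[2]]
print(second_review)
```

No padding is needed: the flat list stores exactly 15 ids, and the lengths alone are enough to split it back into the three reviews.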

### Infer

Now we can run inference and predict the probability of each review above being positive or negative.

```python
results = inferencer.infer({'words': tensor_words})

for i, r in enumerate(results[0]):
    print("Predict probability of ", r[0], " to be positive and ", r[1], " to be negative for review \'", reviews_str[i], "\'")
```
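
To turn a pair of probabilities into a single decision, we can pick the class with the larger probability. The helper below is a hypothetical sketch; it assumes, as the print statement above does, that index 0 is the positive class and index 1 the negative class, and the probabilities are made up.

```python
def describe(probabilities, class_names=("positive", "negative")):
    # Pick the index with the highest probability.
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return class_names[best], probabilities[best]


print(describe([0.3, 0.7]))
print(describe([0.9, 0.1]))
```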

## Conclusion

In this chapter, we used sentiment analysis as an example to introduce the application of deep learning models to end-to-end short text classification, as well as how to use PaddlePaddle to implement the models. Along the way, we briefly introduced two models for text processing: CNN and RNN. In the following chapters, we will see how these models can be applied to other tasks.

## References

1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modeling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
3. Yann N. Dauphin, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J] arXiv preprint arXiv:1612.08083, 2016.
4. Siegelmann H T, Sontag E D. [On the computational power of neural nets](http://research.cs.queensu.ca/home/akl/cisc879/papers/SELECTED_PAPERS_FROM_VARIOUS_SOURCES/05070215382317071.pdf)[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 440-449.
5. Hochreiter S, Schmidhuber J. [Long short-term memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)[J]. Neural computation, 1997, 9(8): 1735-1780.
6. Bengio Y, Simard P, Frasconi P. [Learning long-term dependencies with gradient descent is difficult](http://www-dsi.ing.unifi.it/~paolo/ps/tnn-94-gradient.pdf)[J]. IEEE transactions on neural networks, 1994, 5(2): 157-166.
7. Graves A. [Generating sequences with recurrent neural networks](http://arxiv.org/pdf/1308.0850)[J]. arXiv preprint arXiv:1308.0850, 2013.
8. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://arxiv.org/pdf/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.

<br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

</div>
<!-- You can change the lines below now. -->

<script type="text/javascript">
marked.setOptions({
  renderer: new marked.Renderer(),
  gfm: true,
  breaks: false,
  smartypants: true,
  highlight: function(code, lang) {
    code = code.replace(/&amp;/g, "&")
    code = code.replace(/&gt;/g, ">")
    code = code.replace(/&lt;/g, "<")
    code = code.replace(/&nbsp;/g, " ")
    return hljs.highlightAuto(code, [lang]).value;
  }
});
document.getElementById("context").innerHTML = marked(
        document.getElementById("markdown").innerHTML)
</script>
</body>