diff --git a/02.recognize_digits/README.en.html b/02.recognize_digits/README.en.html deleted file mode 100644 index 893e24d3ede460735bb04f12b7a26175a08f4afd..0000000000000000000000000000000000000000 --- a/02.recognize_digits/README.en.html +++ /dev/null @@ -1,1290 +0,0 @@ -README.en

Recognize Digits

-

The source code for this tutorial is live at book/recognize_digits. For instructions on getting started with Paddle, please refer to installation instructions.

-

Introduction

-

When one learns to program, the first task is usually to write a program that prints “Hello World!”. In Machine Learning or Deep Learning, the equivalent task is to train a model to recognize hand-written digits on the dataset MNIST. Handwriting recognition is a classic image classification problem. The problem is relatively easy and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1). The input image is a $28\times28$ matrix, and the label is one of the digits from $0$ to $9$. All images are normalized, meaning that they are both rescaled and centered.

-

-
-Fig. 1. Examples of MNIST images -

- -

The MNIST dataset is created from the NIST Special Database 3 (SD-3) and the Special Database 1 (SD-1). SD-3 was labeled by staff of the U.S. Census Bureau, while SD-1 was labeled by high school students in the U.S. Therefore SD-3 is cleaner and easier to recognize than SD-1. Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set (60,000 samples) and test set (10,000 samples). The training set was labeled by 250 different annotators, and it was guaranteed that the annotators of the training set and the test set did not completely overlap.

-

Yann LeCun, one of the founders of Deep Learning, has made tremendous contributions to handwritten character recognition and proposed the Convolutional Neural Network (CNN), which drastically improved recognition capability for handwritten characters. CNNs are now a critical concept in Deep Learning. From the LeNet proposed by Yann LeCun to the winning models in ImageNet competitions, such as VGGNet, GoogLeNet, and ResNet (see the Image Classification tutorial), CNNs have achieved a series of impressive results in Image Classification tasks.

-

Many algorithms have been tested on MNIST. In 1998, LeCun experimented with a single-layer linear classifier, a Multilayer Perceptron (MLP), and the multilayer CNN LeNet. These algorithms quickly reduced the test error from 12% to 0.7% [1]. Since then, researchers have worked on many algorithms such as K-Nearest Neighbors (k-NN) [2], Support Vector Machine (SVM) [3], Neural Networks [4-7] and Boosting [8]. Various preprocessing methods, such as distortion removal, noise removal, and blurring, have also been applied to increase recognition accuracy.

-

In this tutorial, we tackle the task of handwritten character recognition. We start with a simple softmax regression model and guide our readers step-by-step to improve this model’s performance on the task of recognition.

-

Model Overview

-

Before introducing classification algorithms and training procedure, we provide some definitions:
-- $X$ is the input: Input is a $28\times 28$ MNIST image. It is flattened to a $784$ dimensional vector. $X=\left (x_0, x_1, \dots, x_{783} \right )$.
-- $Y$ is the output: Output of the classifier is 1 of the 10 classes (digits from 0 to 9). $Y=\left (y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents the probability that the input image belongs to class $i$.
-- $L$ is the ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$. It is also 10 dimensional, but only one dimension is 1 and all others are 0.

-

Softmax Regression

-

In a simple softmax regression model, the input is fed to fully connected layers and a softmax function is applied to get probabilities of multiple output classes[9].

-

The input $X$ is multiplied by weights $W$, and bias $b$ is added to generate activations.

-

$$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$

-

where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $

-

For an $N$-class classification problem with $N$ output nodes, an $N$ dimensional vector is normalized to $N$ real values in the range $[0,1]$, each representing the probability that the sample belongs to a certain class. Here $y_i$ is the prediction probability that an image is digit $i$.

-

In such a classification problem, we usually use the cross entropy loss function:

-

$$ \text{crossentropy}(label, y) = -\sum_i label_i \log(y_i) $$
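
The softmax and cross-entropy formulas above can be sketched in plain Python. This is a minimal illustration independent of PaddlePaddle; the function names and the three example scores are assumptions chosen for the sketch, not part of the tutorial's code.

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(label, y):
    """Cross entropy between a one-hot label and predicted probabilities y."""
    # Terms with label_i == 0 contribute nothing, so they are skipped.
    return -sum(l * math.log(p) for l, p in zip(label, y) if l > 0)

scores = [2.0, 1.0, 0.1]   # illustrative activations W x + b for 3 classes
probs = softmax(scores)    # predicted probabilities y_i
label = [1, 0, 0]          # one-hot ground truth: the true class is 0
loss = cross_entropy(label, probs)
```

With a one-hot label, the loss reduces to the negative log-probability assigned to the true class, so it shrinks toward 0 as that probability approaches 1.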

-

Fig. 2 shows a softmax regression network, with weights in blue, and bias in red. +1 indicates bias is 1.

-

-
-Fig. 2. Softmax regression network architecture
-

- -

Multilayer Perceptron

-

The softmax regression model described above uses the simplest two-layer neural network. That is, it only contains an input layer and an output layer. So its regression ability is limited. To achieve better recognition results, consider adding several hidden layers [10] between the input layer and the output layer.

-
    -
  1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU.
  2. -
  3. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
  4. -
  5. Finally, the output layer outputs $Y=\text{softmax}(W_3H_2 + b_3)$, the final classification result vector.
  6. -
-
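
The three steps above can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: the layer sizes and weight values are made up, ReLU is picked as $\phi$, and none of the helper names come from PaddlePaddle.

```python
import math

def relu(v):
    """Element-wise ReLU activation, used here as phi."""
    return [max(0.0, x) for x in v]

def softmax(v):
    """Normalize scores into probabilities."""
    shift = max(v)
    exps = [math.exp(x - shift) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(W, x, b):
    """Compute W x + b, where W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp_forward(x, params):
    """H1 = phi(W1 X + b1), H2 = phi(W2 H1 + b2), Y = softmax(W3 H2 + b3)."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(dense(W1, x, b1))
    h2 = relu(dense(W2, h1, b2))
    return softmax(dense(W3, h2, b3))

# Toy network: 3 inputs -> 2 hidden units -> 2 hidden units -> 2 classes.
params = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
y = mlp_forward([1.0, 2.0, 3.0], params)
```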

Fig. 3. shows a Multilayer Perceptron network, with the weights in blue, and the bias in red. +1 indicates that the bias is $1$.

-

-
-Fig. 3. Multilayer Perceptron network architecture
- -

- -

Convolutional Neural Network

-

Convolutional Layer

-

-
-Fig. 4. Convolutional layer
-

- -

The convolutional layer is the core of a Convolutional Neural Network. The parameters in this layer are composed of a set of filters, also called kernels. In the forward step, each kernel slides horizontally and vertically across the input, and at each position we compute the dot product of the kernel and the input values it covers. Then, we add the bias and apply an activation function. The result is a two-dimensional activation map. Different kernels respond strongly to different features: for example, some kernels may recognize corners, and some may recognize circles.

-

Fig. 4 is a dynamic graph of a convolutional layer, where depths are not shown for simplicity. The input has $W_1=5, H_1=5, D_1=3$. In fact, this is a common representation for colored images: $W_1$ and $H_1$ correspond to the width and height respectively, and $D_1$ corresponds to the 3 color channels for RGB. The parameters of the convolutional layer are $K=2, F=3, S=2, P=1$. $K$ is the number of kernels; here, $Filter W_0$ and $Filter W_1$ are the two kernels. $F$ is the kernel size: $W_0$ and $W_1$ are both $3\times3$ matrices at every depth. $S$ is the stride: kernels move rightwards or downwards by 2 units each time. $P$ is the padding, an extension of the input. The gray area in the figure shows zero padding with size 1.
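
As a worked check of these parameters, the output width of a convolution is commonly computed as $(W_1 - F + 2P)/S + 1$. The tutorial does not state this formula, so the sketch below is an assumption based on the standard convention, and the function name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Number of kernel positions along one dimension: (W - F + 2P) / S + 1."""
    span = input_size - kernel_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("kernel, stride and padding do not tile the input evenly")
    return span // stride + 1

# Parameters from Fig. 4: W1 = 5, F = 3, S = 2, P = 1.
out_width = conv_output_size(5, 3, stride=2, padding=1)

# A 5x5 kernel with stride 1 and no padding on a 28x28 MNIST image.
mnist_width = conv_output_size(28, 5, stride=1, padding=0)
```

For Fig. 4 this gives a $3\times3$ activation map per kernel, and with $K=2$ kernels the output depth is 2.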

-

Pooling Layer

-

-
-Fig. 5 Pooling layer
-

- -

A pooling layer performs downsampling. Its main function is to reduce computation by shrinking the feature maps, and with them the number of parameters in the layers that follow. It also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layers come in various types, such as max pooling and average pooling. Max pooling uses rectangles to segment the input layer into several parts and computes the maximum value in each part as the output (Fig. 5).
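
Max pooling as described above can be sketched in a few lines of plain Python. The function name, the $2\times2$ window and the example feature map are assumptions made for illustration.

```python
def max_pool_2d(matrix, size=2, stride=2):
    """Split the input into size x size windows and keep the maximum of each."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [matrix[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool_2d(feature_map)  # [[6, 8], [3, 4]]
```

The $4\times4$ input shrinks to $2\times2$, so every later layer sees a quarter as many values.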

-

LeNet-5 Network

-

-
-Fig. 6. LeNet-5 Convolutional Neural Network architecture
-

- -

LeNet-5 is one of the simplest Convolutional Neural Networks. Fig. 6 shows its architecture: a 2-dimensional input image is fed into two sets of convolutional and pooling layers, and the output is then fed to a fully connected layer and a softmax classifier. The following three properties of convolution enable LeNet-5 to better recognize images than Multilayer fully connected perceptrons:

- -

For more details on Convolutional Neural Networks, please refer to this Stanford open course and this Image Classification tutorial.

-

List of Common Activation Functions

- -

In fact, the tanh function is just a rescaled version of the sigmoid function. It is obtained by doubling the value of the sigmoid function, evaluated at twice the input, and moving it downwards by 1: $\tanh(x) = 2\,\text{sigmoid}(2x) - 1$.
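
The relationship between tanh and sigmoid can be checked numerically. The sketch below assumes the standard identity $\tanh(x) = 2\,\text{sigmoid}(2x) - 1$; the helper names are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """Rescale the sigmoid: double its value at 2x, then shift down by 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Largest difference from the library tanh over a few sample points.
samples = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in samples)
```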

- -

For more information, please refer to Activation functions on Wikipedia.

-

Data Preparation

-

PaddlePaddle provides a Python module, paddle.dataset.mnist, which downloads and caches the MNIST dataset. The cache is under /home/username/.cache/paddle/dataset/mnist:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
File name | Description | Size (samples)
train-images-idx3-ubyte | Training images | 60,000
train-labels-idx1-ubyte | Training labels | 60,000
t10k-images-idx3-ubyte | Evaluation images | 10,000
t10k-labels-idx1-ubyte | Evaluation labels | 10,000
-

Model Configuration

-

A PaddlePaddle program starts from importing the API package:

-
import paddle.v2 as paddle
-
- -

We want to use this program to demonstrate three different classifiers, each defined as a Python function:

- -
def softmax_regression(img):
-    predict = paddle.layer.fc(input=img,
-                              size=10,
-                              act=paddle.activation.Softmax())
-    return predict
-
- - -
def multilayer_perceptron(img):
-    hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu())
-    hidden2 = paddle.layer.fc(input=hidden1,
-                              size=64,
-                              act=paddle.activation.Relu())
-    predict = paddle.layer.fc(input=hidden2,
-                              size=10,
-                              act=paddle.activation.Softmax())
-    return predict
-
- - -
def convolutional_neural_network(img):
-
-    conv_pool_1 = paddle.networks.simple_img_conv_pool(
-        input=img,
-        filter_size=5,
-        num_filters=20,
-        num_channel=1,
-        pool_size=2,
-        pool_stride=2,
-        act=paddle.activation.Tanh())
-
-    conv_pool_2 = paddle.networks.simple_img_conv_pool(
-        input=conv_pool_1,
-        filter_size=5,
-        num_filters=50,
-        num_channel=20,
-        pool_size=2,
-        pool_stride=2,
-        act=paddle.activation.Tanh())
-
-    fc1 = paddle.layer.fc(input=conv_pool_2,
-                          size=128,
-                          act=paddle.activation.Tanh())
-
-    predict = paddle.layer.fc(input=fc1,
-                              size=10,
-                              act=paddle.activation.Softmax())
-    return predict
-
- -

PaddlePaddle provides a special layer, layer.data, for reading data. Let us create a data layer for reading images and connect it to a classification network created using one of the three functions above. We also need a cost layer for training the model.

-
paddle.init(use_gpu=False, trainer_count=1)
-
-images = paddle.layer.data(
-    name='pixel', type=paddle.data_type.dense_vector(784))
-label = paddle.layer.data(
-    name='label', type=paddle.data_type.integer_value(10))
-
-predict = softmax_regression(images)
-#predict = multilayer_perceptron(images) # uncomment for MLP
-#predict = convolutional_neural_network(images) # uncomment for LeNet5
-
-cost = paddle.layer.classification_cost(input=predict, label=label)
-
- -

Now, it is time to specify training parameters. In the following Momentum optimizer, momentum=0.9 means that 90% of the current momentum comes from that of the previous iteration. The learning rate relates to the speed at which the network training converges. Regularization is meant to prevent over-fitting; here we use the L2 regularization.

-
parameters = paddle.parameters.create(cost)
-
-optimizer = paddle.optimizer.Momentum(
-    learning_rate=0.1 / 128.0,
-    momentum=0.9,
-    regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))
-
-trainer = paddle.trainer.SGD(cost=cost,
-                             parameters=parameters,
-                             update_equation=optimizer)
-
- -
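
To make the role of momentum and L2 regularization concrete, here is a sketch of one common formulation of the momentum update, applied to a toy one-dimensional problem. It is an assumption for illustration only: PaddlePaddle's internal update rule may differ in detail, and the function name and toy objective are hypothetical.

```python
def momentum_step(w, v, grad, learning_rate, momentum, l2_rate=0.0):
    """One momentum update for a single scalar weight."""
    # L2 regularization adds l2_rate * w to the gradient (weight decay).
    g = grad + l2_rate * w
    # The new velocity keeps a `momentum` fraction of the previous velocity.
    v_new = momentum * v - learning_rate * g
    return w + v_new, v_new

# Minimize f(w) = w^2, whose gradient is 2w, starting from w = 1.
w, v = 1.0, 0.0
for _ in range(300):
    grad = 2.0 * w
    w, v = momentum_step(w, v, grad, learning_rate=0.1, momentum=0.9)
```

With momentum=0.9, each step reuses 90% of the previous velocity, which smooths the trajectory; w oscillates around the minimum at 0 and settles there.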

Then we specify the training data paddle.dataset.mnist.train() and testing data paddle.dataset.mnist.test(). These two methods are reader creators. Once called, a reader creator returns a reader. A reader is a Python function which, once called, returns a Python generator that yields instances of data.

-

shuffle is a reader decorator. It takes in a reader A as input and returns a new reader B. Under the hood, B calls A to read data in the following fashion: it copies buf_size instances at a time into a buffer, shuffles the data, and yields the shuffled instances one at a time. A larger buffer yields more thoroughly shuffled data.

-

batch is a special decorator, which takes in a reader and outputs a batch reader that yields a minibatch at a time instead of a single instance.
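
The reader, shuffle and batch concepts can be sketched in plain Python. This is a simplified illustration, not PaddlePaddle's implementation: toy_reader_creator is a hypothetical stand-in for a dataset, and the seed argument is added here only to make the example reproducible.

```python
import random

def toy_reader_creator(n):
    """Hypothetical reader creator: returns a reader yielding n instances."""
    def reader():
        for i in range(n):
            yield ([float(i)], i % 2)  # (feature vector, label)
    return reader

def shuffle(reader, buf_size, seed=None):
    """Reader decorator: shuffle instances within buffers of buf_size."""
    rng = random.Random(seed)
    def shuffled_reader():
        buf = []
        for instance in reader():
            buf.append(instance)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the final, partially filled buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item
    return shuffled_reader

def batch(reader, batch_size):
    """Reader decorator: group instances into lists of batch_size."""
    def batch_reader():
        current = []
        for instance in reader():
            current.append(instance)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            yield current  # last, possibly smaller, minibatch
    return batch_reader

train_reader = batch(
    shuffle(toy_reader_creator(10), buf_size=4, seed=0), batch_size=3)
batches = list(train_reader())
```

Because shuffling happens only within one buffer at a time, a buffer smaller than the dataset mixes instances locally rather than globally.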

-
lists = []
-
-def event_handler(event):
-    if isinstance(event, paddle.event.EndIteration):
-        if event.batch_id % 100 == 0:
-            print "Pass %d, Batch %d, Cost %f, %s" % (
-                event.pass_id, event.batch_id, event.cost, event.metrics)
-    if isinstance(event, paddle.event.EndPass):
-        result = trainer.test(reader=paddle.batch(
-            paddle.dataset.mnist.test(), batch_size=128))
-        print "Test with Pass %d, Cost %f, %s\n" % (
-            event.pass_id, result.cost, result.metrics)
-        lists.append((event.pass_id, result.cost,
-                      result.metrics['classification_error_evaluator']))
-
-trainer.train(
-    reader=paddle.batch(
-        paddle.reader.shuffle(
-            paddle.dataset.mnist.train(), buf_size=8192),
-        batch_size=128),
-    event_handler=event_handler,
-    num_passes=100)
-
- -

During training, trainer.train invokes event_handler for certain events. This gives us a chance to print the training progress.

-
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
-# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
-# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
-# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
-# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
-# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
-
- -

After the training, we can check the model’s prediction accuracy.

-
# find the best pass
-best = sorted(lists, key=lambda item: float(item[1]))[0]
-print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
-print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
-
- -

Usually, with MNIST data, the softmax regression model achieves an accuracy around 92.34%, the MLP 97.66%, and the convolution network around 99.20%. Convolution layers have been widely considered a great invention for image processing.

-

Conclusion

-

This tutorial describes a few basic Deep Learning models: Softmax regression, Multilayer Perceptron Network, and Convolutional Neural Network. The subsequent tutorials will derive more sophisticated models from these, so it is crucial to understand them for future learning. When our model evolves from a simple softmax regression to a slightly more complex Convolutional Neural Network, the recognition accuracy on the MNIST dataset improves substantially. This is due to the convolutional layers’ local connections and parameter sharing. While learning new models in the future, we encourage the readers to understand the key ideas that lead a new model to improve on the results of an old one. Moreover, this tutorial introduces the basic flow of PaddlePaddle model design, which starts with a data provider, continues with the construction of the model layers, and ends with training and prediction. Readers can leverage the flow used in this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice.

-

References

-
    -
  1. LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
  2. -
  3. Wejéus, Samuel. “A Neural Network Approach to Arbitrary Symbol Recognition on Modern Smartphones.” (2014).
  4. -
  5. Decoste, Dennis, and Bernhard Schölkopf. “Training invariant support vector machines.” Machine learning 46, no. 1-3 (2002): 161-190.
  6. -
  7. Simard, Patrice Y., David Steinkraus, and John C. Platt. “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis.” In ICDAR, vol. 3, pp. 958-962. 2003.
  8. -
  9. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. “Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure.” In AISTATS, vol. 11. 2007.
  10. -
  11. Cireşan, Dan Claudiu, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. “Deep, big, simple neural nets for handwritten digit recognition.” Neural computation 22, no. 12 (2010): 3207-3220.
  12. -
  13. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. “Binary coding of speech spectrograms using a deep auto-encoder.” In Interspeech, pp. 1692-1695. 2010.
  14. -
  15. Kégl, Balázs, and Róbert Busa-Fekete. “Boosting products of base classifiers.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009.
  16. -
  17. Rosenblatt, Frank. “The perceptron: A probabilistic model for information storage and organization in the brain.” Psychological review 65, no. 6 (1958): 386.
  18. -
  19. Bishop, Christopher M. “Pattern recognition.” Machine Learning 128 (2006): 1-58.
  20. -
-



-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

\ No newline at end of file diff --git a/05.understand_sentiment/README.en.html b/05.understand_sentiment/README.en.html deleted file mode 100644 index 34cb1689a6f549873d0dc2afb17d99f493032692..0000000000000000000000000000000000000000 --- a/05.understand_sentiment/README.en.html +++ /dev/null @@ -1,1328 +0,0 @@ -README.en

Sentiment Analysis

-

The source code for this section is located at book/understand_sentiment. First-time users may refer to the PaddlePaddle installation guide.

-

Background

-

In natural language processing, sentiment analysis refers to determining the emotion expressed in a piece of text. The text can be a sentence, a paragraph, or a document. Emotion categorization can be binary – positive/negative or happy/sad – or in three classes – positive/neutral/negative. Sentiment analysis is applicable in a wide range of services, such as e-commerce sites like Amazon and Taobao, hospitality services like Airbnb and hotels.com, and movie rating sites like Rotten Tomatoes and IMDB. It can be used to gauge from the reviews how the customers feel about the product. Table 1 illustrates an example of sentiment analysis in movie reviews:

- - - - - - - - - - - - - - - - - - - - - - - - - -
Movie Review | Category
Best movie of Xiaogang Feng in recent years! | Positive
Pretty bad. Feels like a tv-series from a local TV-channel | Negative
Politically correct version of Taken … and boring as Heck | Negative
delightful, mesmerizing, and completely unexpected. The plot is nicely designed. | Positive
-

Table 1 Sentiment Analysis in Movie Reviews

- -

In natural language processing, sentiment analysis can be categorized as a Text Classification problem, i.e., assigning a piece of text to a specific class. It involves two related tasks: text representation and classification. Before the emergence of deep learning techniques, the mainstream methods for text representation included BOW (bag of words) and topic modeling, while the mainstream methods for classification included SVM (support vector machine) and LR (logistic regression).

-

The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad” and “boring, dull, and empty work” have very similar meanings, yet their BOW representations have little similarity. Furthermore, “the movie is bad” and “the movie is not bad” have high similarity with BOW features, but they express completely opposite semantics.
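
The two examples can be checked with a small bag-of-words sketch. It assumes whitespace tokenization and cosine similarity over word counts, which is one simple way to compare BOW representations; the helper names are hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Bag of words: count each whitespace-separated token, ignoring order."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Opposite meanings, but almost identical bags of words.
sim_opposite = cosine(bow("the movie is bad"), bow("the movie is not bad"))

# Similar meanings, but no words in common.
sim_similar = cosine(bow("this movie is extremely bad"),
                     bow("boring, dull, and empty work"))
```

Here sim_opposite is about 0.89 while sim_similar is 0, the reverse of what the meanings suggest.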

-

This chapter introduces a deep learning model that handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and it has large performance improvement over traditional methods [1].

-

Model Overview

-

The models in this chapter use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with some specific extensions.

-

Convolutional Neural Networks for Texts (CNN)

-

Convolutional Neural Networks are frequently applied to data with grid-like topology such as two-dimensional images and one-dimensional texts. A CNN can extract multiple local features, combine them, and produce high-level abstractions, which correspond to semantic understanding. Empirically, CNN is shown to be efficient for image and text modeling.

-

A CNN mainly consists of convolution and pooling operations, which are combined in different ways for different applications. Here, we briefly describe a CNN used to classify texts[1], as shown in Figure 1.

-

-
-Figure 1. CNN for text modeling. -

- -

Let $n$ be the length of the sentence to process, and let the $i$-th word have the embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality.

-

First, we concatenate the words by piecing together every $h$ words, each as a window of length $h$. This window is denoted as $x_{i:i+h-1}$, consisting of $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $x_i$ is the first word in the window and $i$ takes value ranging from $1$ to $n-h+1$: $x_{i:i+h-1}\in\mathbb{R}^{hk}$.

-

Next, we apply the convolution operation: we apply the kernel $w\in\mathbb{R}^{hk}$ in each window, extracting features $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a non-linear activation function such as $sigmoid$. Convolving by the kernel at every window $\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$ produces a feature map in the following form:

-

$$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$

-

Next, we apply max pooling over time to represent the whole sentence $\hat c$, which is the maximum element across the feature map:

-

$$\hat c=\max(c)$$
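
The windowing, convolution and max-pooling-over-time steps can be sketched in plain Python for a single kernel. The embeddings, kernel weights and function name below are made-up values for illustration, with $n=4$, $k=2$ and $h=2$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def text_conv_max_pool(embeddings, w, b, h):
    """Apply one kernel w over every window of h words, then max-pool."""
    n = len(embeddings)
    feature_map = []
    for i in range(n - h + 1):
        # x_{i:i+h-1}: concatenate h consecutive embeddings (length h*k).
        window = [value for x in embeddings[i:i + h] for value in x]
        # c_i = f(w . x_{i:i+h-1} + b), with f = sigmoid.
        c_i = sigmoid(sum(wi * vi for wi, vi in zip(w, window)) + b)
        feature_map.append(c_i)
    return feature_map, max(feature_map)

embeddings = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4], [-0.2, 0.1]]
w = [0.3, -0.1, 0.2, 0.4]  # one kernel in R^{hk} with h = 2, k = 2
c, c_hat = text_conv_max_pool(embeddings, w, b=0.0, h=2)
```

The feature map has $n-h+1=3$ entries, and c_hat is the single value that represents the sentence for this kernel regardless of sentence length.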

-

In real applications, we will apply multiple CNN kernels on the sentences. It can be implemented efficiently by concatenating the kernels together as a matrix. Also, we can use CNN kernels with different kernel size (as shown in Figure 1 in different colors).

-

Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax to form the model for the sentiment analysis problem.

-

For short texts, the aforementioned CNN model can achieve very high accuracy [1]. If we want to extract more abstract representations, we may apply a deeper CNN model [2,3].

-

Recurrent Neural Network (RNN)

-

The RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete [4]. Since NLP is a classical problem on sequential data, the RNN, especially its variant LSTM[5], achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth.

-

-
-Figure 2. An illustration of an unfolded RNN in time. -

- -

As shown in Figure 2, we unfold an RNN: at the $t$-th time step, the network takes two inputs: the $t$-th input vector $\vec{x_t}$ and the latent state from the last time-step $\vec{h_{t-1}}$. From those, it computes the latent state of the current step $\vec{h_t}$. This process is repeated until all inputs are consumed. Denoting the RNN as function $f$, it can be formulated as follows:

-

$$\vec{h_t}=f(\vec{x_t},\vec{h_{t-1}})=\sigma(W_{xh}\vec{x_t}+W_{hh}\vec{h_{t-1}}+\vec{b_h})$$

-

where $W_{xh}$ is the weight matrix to feed into the latent layer; $W_{hh}$ is the latent-to-latent matrix; $b_h$ is the latent bias and $\sigma$ refers to the $sigmoid$ function.
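
The recurrence can be sketched in plain Python by applying the same step function to each input in turn. The weights, inputs and helper names are illustrative assumptions, with a 2-dimensional input and a 2-dimensional latent state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [sigmoid(a + r + b)
            for a, r, b in zip(from_input, from_state, b_h)]

W_xh = [[0.5, -0.3], [0.1, 0.8]]   # input-to-latent weights
W_hh = [[0.2, 0.0], [-0.4, 0.6]]   # latent-to-latent weights
b_h = [0.0, 0.1]                   # latent bias

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0]                     # initial latent state
for x_t in inputs:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```

After the loop, h is the latent state of the last time step, which may serve as a feature for the whole sequence.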

-

In NLP, words are often represented as a one-hot vectors and then mapped to an embedding. The embedded feature goes through an RNN as input $x_t$ at every time step. Moreover, we can add other layers on top of RNN, such as a deep or stacked RNN. Finally, the last latent state may be used as a feature for sentence classification.

-

Long-Short Term Memory (LSTM)

-

Training an RNN on long sequential data sometimes leads to vanishing or exploding gradients[6]. To solve this problem, Hochreiter and Schmidhuber (1997) proposed Long Short Term Memory (LSTM)[5].

-

Compared to the structure of a simple RNN, an LSTM includes memory cell $c$, input gate $i$, forget gate $f$ and output gate $o$. These gates and memory cells dramatically improve the ability for the network to handle long sequences. We can formulate the LSTM-RNN, denoted as a function $F$, as follows:

-

$$ h_t=F(x_t,h_{t-1})$$

-

$F$ contains following formulations[7]:
-\begin{align}
-i_t & = \sigma(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)\\
-f_t & = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f)\\
-c_t & = f_t\odot c_{t-1}+i_t\odot \tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c)\\
-o_t & = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o)\\
-h_t & = o_t\odot \tanh(c_t)\\
-\end{align}

-

In the equations, $i_t, f_t, c_t, o_t$ stand for the input gate, forget gate, memory cell and output gate, respectively; $W$ and $b$ are model parameters. The $tanh$ is the hyperbolic tangent, and $\odot$ denotes the element-wise product. The input gate controls the magnitude of new input into the memory cell $c$; the forget gate controls the memory propagated from the last time step; the output gate controls the output magnitude. The three gates are computed similarly with different parameters, and they influence the memory cell $c$ separately, as shown in Figure 3:

-

-
-Figure 3. LSTM at time step $t$ [7]. -

- -
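
The LSTM formulation $F$ can be sketched as a single step in plain Python. To keep the example short, it assumes a scalar input and scalar states (hidden size 1), so every $W$ is a number and $\odot$ is ordinary multiplication; the parameter values and the dictionary layout are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with scalar states, following the equations for F."""
    i_t = sigmoid(p["W_xi"] * x_t + p["W_hi"] * h_prev
                  + p["W_ci"] * c_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_xf"] * x_t + p["W_hf"] * h_prev
                  + p["W_cf"] * c_prev + p["b_f"])      # forget gate
    c_t = f_t * c_prev + i_t * math.tanh(
        p["W_xc"] * x_t + p["W_hc"] * h_prev + p["b_c"])  # memory cell
    o_t = sigmoid(p["W_xo"] * x_t + p["W_ho"] * h_prev
                  + p["W_co"] * c_t + p["b_o"])         # output gate
    h_t = o_t * math.tanh(c_t)                          # latent state
    return h_t, c_t

names = ["W_xi", "W_hi", "W_ci", "b_i", "W_xf", "W_hf", "W_cf", "b_f",
         "W_xc", "W_hc", "b_c", "W_xo", "W_ho", "W_co", "b_o"]
params = dict((name, 0.1) for name in names)  # made-up parameter values

h, c = 0.0, 0.0
for x_t in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x_t, h, c, params)
```

Note that the output gate uses the updated cell $c_t$, while the input and forget gates use $c_{t-1}$, exactly as in the equations.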

LSTM enhances the ability to capture long-term dependencies with the help of memory cells and gates. A similar structure with a simpler design is the Gated Recurrent Unit (GRU)[8]. These structures are still similar to the simple RNN, though with some modifications (as shown in Figure 2): the latent state depends on the input as well as the latent state of the last time step, and the process goes on recurrently until all inputs are consumed:

-

$$ h_t=Recurrent(x_t,h_{t-1})$$
-where $Recurrent$ is a simple RNN, GRU or LSTM.

-

Stacked Bidirectional LSTM

-

For a vanilla LSTM, $h_t$ contains input information from the preceding context, time steps $1 \ldots t-1$. We can also apply an RNN in the reverse direction to take the succeeding context $t+1 \ldots n$ into consideration. Combining this with deep RNNs (a deeper RNN can capture more abstract, higher-level semantics), we can design deep stacked bidirectional LSTM structures to model sequential data[9].

-

As shown in Figure 4 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification.

-

-
-Figure 4. Stacked Bidirectional LSTM for NLP modeling. -

- -

Dataset

-

We use IMDB dataset for sentiment analysis in this tutorial, which consists of 50,000 movie reviews split evenly into 25k train and 25k test sets. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10.

-

The paddle.dataset package encapsulates multiple public datasets, including cifar, imdb, mnist, movielens, and wmt14. There’s no need for us to manually download and preprocess IMDB.

-

After issuing the command python train.py, training will start immediately. The following sections unpack the details to show how it works.

-

Model Structure

-

Initialize PaddlePaddle

-

We must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).

-
import sys
-import paddle.v2 as paddle
-
-# PaddlePaddle init
-paddle.init(use_gpu=False, trainer_count=1)
-
- -

As alluded to in section Model Overview, here we provide the implementations of both Text CNN and Stacked-bidirectional LSTM models.

-

Text Convolution Neural Network (Text CNN)

-

We create a neural network convolution_net as in the following code snippet.

-

Note: paddle.networks.sequence_conv_pool includes both convolution and pooling layer operations.

-
def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
-    data = paddle.layer.data("word",
-                             paddle.data_type.integer_value_sequence(input_dim))
-    emb = paddle.layer.embedding(input=data, size=emb_dim)
-    conv_3 = paddle.networks.sequence_conv_pool(
-        input=emb, context_len=3, hidden_size=hid_dim)
-    conv_4 = paddle.networks.sequence_conv_pool(
-        input=emb, context_len=4, hidden_size=hid_dim)
-    output = paddle.layer.fc(input=[conv_3, conv_4],
-                             size=class_dim,
-                             act=paddle.activation.Softmax())
-    lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
-    cost = paddle.layer.classification_cost(input=output, label=lbl)
-    return cost
-
- -
    -
  1. -

    Define input data and its dimension

    -

    Parameter input_dim denotes the dictionary size, and class_dim is the number of categories. In convolution_net, the input to the network is defined in paddle.layer.data.

    -
  2. -
  3. -

    Define Classifier

    -

    The above Text CNN network extracts high-level features and maps them to a vector of the same size as the number of categories. The paddle.activation.Softmax function, or classifier, is then used to calculate the probability of the sentence belonging to each category.

    -
  4. -
  5. -

    Define Loss Function

    -

    In the context of supervised learning, labels of the training set are defined in paddle.layer.data, too. During training, cross-entropy is used as the loss function in paddle.layer.classification_cost and as the output of the network; during testing, the outputs are the probabilities calculated in the classifier.

    -
  6. -
-

Stacked bidirectional LSTM

-

We create a neural network stacked_lstm_net as below.

-
def stacked_lstm_net(input_dim,
-                     class_dim=2,
-                     emb_dim=128,
-                     hid_dim=512,
-                     stacked_num=3):
-    """
-    A wrapper for the sentiment classification task.
-    This network uses a bidirectional recurrent network
-    consisting of three LSTM layers. The configuration follows
-    the paper at the URL below, but uses fewer layers.
-        http://www.aclweb.org/anthology/P15-1109
-    input_dim: here is word dictionary dimension.
-    class_dim: number of categories.
-    emb_dim: dimension of word embedding.
-    hid_dim: dimension of hidden layer.
-    stacked_num: number of stacked lstm-hidden layer.
-    """
-    assert stacked_num % 2 == 1
-
-    layer_attr = paddle.attr.Extra(drop_rate=0.5)
-    fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
-    lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
-    para_attr = [fc_para_attr, lstm_para_attr]
-    bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
-    relu = paddle.activation.Relu()
-    linear = paddle.activation.Linear()
-
-    data = paddle.layer.data("word",
-                             paddle.data_type.integer_value_sequence(input_dim))
-    emb = paddle.layer.embedding(input=data, size=emb_dim)
-
-    fc1 = paddle.layer.fc(input=emb,
-                          size=hid_dim,
-                          act=linear,
-                          bias_attr=bias_attr)
-    lstm1 = paddle.layer.lstmemory(
-        input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr)
-
-    inputs = [fc1, lstm1]
-    for i in range(2, stacked_num + 1):
-        fc = paddle.layer.fc(input=inputs,
-                             size=hid_dim,
-                             act=linear,
-                             param_attr=para_attr,
-                             bias_attr=bias_attr)
-        lstm = paddle.layer.lstmemory(
-            input=fc,
-            reverse=(i % 2) == 0,
-            act=relu,
-            bias_attr=bias_attr,
-            layer_attr=layer_attr)
-        inputs = [fc, lstm]
-
-    fc_last = paddle.layer.pooling(
-        input=inputs[0], pooling_type=paddle.pooling.Max())
-    lstm_last = paddle.layer.pooling(
-        input=inputs[1], pooling_type=paddle.pooling.Max())
-    output = paddle.layer.fc(input=[fc_last, lstm_last],
-                             size=class_dim,
-                             act=paddle.activation.Softmax(),
-                             bias_attr=bias_attr,
-                             param_attr=para_attr)
-
-    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
-    cost = paddle.layer.classification_cost(input=output, label=lbl)
-    return cost
-
- -
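Two details of stacked_lstm_net are easy to miss: the layers alternate reading direction through reverse=(i % 2) == 0, and the final sequences are reduced to single vectors by max pooling. The sketch below mirrors both in plain Python. The helper names and the sample hidden states are hypothetical, and the first layer is assumed to read forward because it is built without a reverse argument.

```python
def layer_directions(stacked_num):
    """List the reading direction of each stacked LSTM layer.

    Layer 1 is built without `reverse`, so it is treated as forward here;
    layers 2..stacked_num use reverse=(i % 2 == 0), as in the loop above.
    """
    directions = ["forward"]
    for i in range(2, stacked_num + 1):
        directions.append("reverse" if i % 2 == 0 else "forward")
    return directions


def max_pool_over_time(sequence):
    """Element-wise maximum over all time steps of a sequence of vectors."""
    return [max(column) for column in zip(*sequence)]


directions = layer_directions(3)

# Hypothetical hidden states: 3 time steps, 2 features each.
hidden_states = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]
pooled = max_pool_over_time(hidden_states)
```

Because stacked_num must be odd, the last layer always reads in the same direction as the first.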
    -
  1. -

    Define input data and its dimension

    -

    Parameter input_dim denotes the dictionary size, and class_dim is the number of categories. In stacked_lstm_net, the input to the network is defined in paddle.layer.data.

    -
  2. -
  3. -

    Define Classifier

    -

    The stacked bidirectional LSTM network above extracts high-level features and maps them to a vector whose size equals the number of categories. The paddle.activation.Softmax function then converts this vector into the probability of the sentence belonging to each category.

    -
  4. -
  5. -

    Define Loss Function

    -

    In supervised learning, the labels of the training set are also defined in paddle.layer.data. During training, paddle.layer.classification_cost computes the cross-entropy loss, which serves as the output of the network; during testing, the outputs are the probabilities calculated by the classifier.

    -
  6. -
-
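The cross-entropy loss used during training can be written out directly. This is a minimal sketch in plain Python under the assumption that the batch cost is the mean of the per-sample losses; the probabilities and labels are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])


def mean_cost(batch_probs, batch_labels):
    """Average the per-sample cross-entropy over a batch."""
    costs = [cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels)]
    return sum(costs) / len(costs)


# Hypothetical classifier outputs for two sentences and their labels.
batch_probs = [[0.9, 0.1], [0.4, 0.6]]
batch_labels = [0, 1]
cost = mean_cost(batch_probs, batch_labels)
```

The loss is small when the classifier assigns high probability to the correct label and grows without bound as that probability approaches zero.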

To reiterate, we can invoke either convolution_net or stacked_lstm_net.

-
word_dict = paddle.dataset.imdb.word_dict()
-dict_dim = len(word_dict)
-class_dim = 2
-
-# option 1
-cost = convolution_net(dict_dim, class_dim=class_dim)
-# option 2
-# cost = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3)
-
- -

Model Training

-

Define Parameters

-

First, we create the model parameters from cost, the model configuration defined above.

-
# create parameters
-parameters = paddle.parameters.create(cost)
-
- -

Create Trainer

-

Before creating the trainer, we need to choose an optimization algorithm.
-Here we specify the Adam optimization algorithm via paddle.optimizer.

-
# create optimizer
-adam_optimizer = paddle.optimizer.Adam(
-    learning_rate=2e-3,
-    regularization=paddle.optimizer.L2Regularization(rate=8e-4),
-    model_average=paddle.optimizer.ModelAverage(average_window=0.5))
-
-# create trainer
-trainer = paddle.trainer.SGD(cost=cost,
-                             parameters=parameters,
-                             update_equation=adam_optimizer)
-
- -
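To give a feel for what the optimizer does with learning_rate=2e-3 and an L2 rate of 8e-4, here is a textbook Adam update for a single scalar parameter. It is a sketch, not PaddlePaddle's implementation: the beta1, beta2, and epsilon defaults and the way the L2 term is added to the gradient are assumptions, and model averaging is left out.

```python
import math


def adam_step(param, grad, state, learning_rate=2e-3, l2_rate=8e-4,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update to a single scalar parameter.

    `state` holds the running moment estimates and the step counter.
    beta1, beta2 and epsilon are the textbook defaults, assumed here.
    """
    # L2 regularization adds a pull toward zero to the gradient.
    grad = grad + l2_rate * param

    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad

    # Correct the bias from initializing both moments at zero.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])

    return param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)


# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(5000):
    w = adam_step(w, 2 * (w - 3), state)
```

On the very first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by roughly the learning rate regardless of the gradient's scale.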

Training

-

paddle.dataset.imdb.train() yields records during each pass. After shuffling, the records are grouped into batches for training.

-
train_reader = paddle.batch(
-    paddle.reader.shuffle(
-        lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
-    batch_size=100)
-
-test_reader = paddle.batch(
-    lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
-
- -
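The reader pipeline is a chain of functions that each wrap another reader. The sketch below reimplements the idea in plain Python to show the data flow. The shuffle and batch functions here are simplified stand-ins, not the paddle.reader.shuffle and paddle.batch implementations, and toy_reader is a hypothetical dataset.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Wrap a reader so records come out shuffled within a buffer."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the last, partially filled buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item

    return shuffled_reader


def batch(reader, batch_size):
    """Wrap a reader so it yields lists of up to batch_size records."""

    def batch_reader():
        current = []
        for record in reader():
            current.append(record)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # keep the final, smaller batch
            yield current

    return batch_reader


def toy_reader():
    """Hypothetical dataset: (word_ids, label) pairs."""
    for i in range(10):
        yield [i, i + 1], i % 2


train_reader = batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)
batches = list(train_reader())
```

Each wrapper returns a new function rather than data, so calling train_reader() again starts a fresh pass over the dataset.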

feeding specifies the correspondence between each yielded record and paddle.layer.data. For instance, the first column of data generated by paddle.dataset.imdb.train() corresponds to the word feature.

-
feeding = {'word': 0, 'label': 1}
-
- -
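The effect of the feeding dictionary can be pictured as a lookup from data layer names to column positions. This is an illustrative sketch only; apply_feeding is a hypothetical helper and the record is made up.

```python
def apply_feeding(record, feeding):
    """Pick columns out of one record and label them with layer names."""
    return {name: record[index] for name, index in feeding.items()}


feeding = {'word': 0, 'label': 1}

# Hypothetical record: a list of word ids followed by a 0/1 label.
record = ([12, 7, 301], 1)
named = apply_feeding(record, feeding)
```

Here named maps 'word' to the list of word ids and 'label' to the class index, matching the two paddle.layer.data layers declared in the network.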

The callback function event_handler is invoked to track training progress whenever a pre-defined event occurs.

-
import sys
-
-
-def event_handler(event):
-    if isinstance(event, paddle.event.EndIteration):
-        if event.batch_id % 100 == 0:
-            # report cost and metrics every 100 batches
-            print("\nPass %d, Batch %d, Cost %f, %s" % (
-                event.pass_id, event.batch_id, event.cost, event.metrics))
-        else:
-            # otherwise print a progress dot
-            sys.stdout.write('.')
-            sys.stdout.flush()
-    if isinstance(event, paddle.event.EndPass):
-        # evaluate on the test set at the end of every pass
-        result = trainer.test(reader=test_reader, feeding=feeding)
-        print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
-
- -

Finally, we can invoke trainer.train to start training:

-
trainer.train(
-    reader=train_reader,
-    event_handler=event_handler,
-    feeding=feeding,
-    num_passes=10)
-
- -
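The interaction between the training loop and the callback can be sketched as follows. This is a simplified model of the control flow, not the trainer's actual code: the EndIteration and EndPass classes, the train function, and the step argument are hypothetical stand-ins.

```python
class EndIteration(object):
    """Hypothetical stand-in for the event fired after each batch."""

    def __init__(self, pass_id, batch_id, cost):
        self.pass_id = pass_id
        self.batch_id = batch_id
        self.cost = cost


class EndPass(object):
    """Hypothetical stand-in for the event fired after each pass."""

    def __init__(self, pass_id):
        self.pass_id = pass_id


def train(reader, event_handler, num_passes, step):
    """Run `step` on every batch and report progress through events."""
    for pass_id in range(num_passes):
        for batch_id, data_batch in enumerate(reader()):
            cost = step(data_batch)
            event_handler(EndIteration(pass_id, batch_id, cost))
        event_handler(EndPass(pass_id))


log = []


def event_handler(event):
    if isinstance(event, EndIteration):
        log.append(("batch", event.pass_id, event.batch_id, event.cost))
    elif isinstance(event, EndPass):
        log.append(("pass", event.pass_id))


def toy_reader():
    """Hypothetical reader yielding two batches of numbers per pass."""
    yield [1.0, 2.0]
    yield [3.0]


# The "training step" here only averages the batch; a real one would
# run a forward and backward pass and update the parameters.
train(toy_reader, event_handler, num_passes=2, step=lambda b: sum(b) / len(b))
```

Keeping reporting in a callback leaves the loop itself unchanged when you want different logging, plotting, or evaluation behavior.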

Conclusion

-

In this chapter, we used sentiment analysis as an example to show how deep learning models can be applied to end-to-end short text classification and how to implement them with PaddlePaddle. Along the way, we briefly introduced two models for text processing: CNN and RNN. In the following chapters, we will see how these models can be applied to other tasks.

-

Reference

-
    -
  1. Kim Y. Convolutional neural networks for sentence classification[J]. arXiv preprint arXiv:1408.5882, 2014.
  2. Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for modelling sentences[J]. arXiv preprint arXiv:1404.2188, 2014.
  3. Dauphin Y N, et al. Language modeling with gated convolutional networks[J]. arXiv preprint arXiv:1612.08083, 2016.
  4. Siegelmann H T, Sontag E D. On the computational power of neural nets[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 440-449.
  5. Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural computation, 1997, 9(8): 1735-1780.
  6. Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult[J]. IEEE transactions on neural networks, 1994, 5(2): 157-166.
  7. Graves A. Generating sequences with recurrent neural networks[J]. arXiv preprint arXiv:1308.0850, 2013.
  8. Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint arXiv:1406.1078, 2014.
  9. Zhou J, Xu W. End-to-end learning of semantic role labeling using recurrent neural networks[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
-



-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

\ No newline at end of file