-# Linear Regression
-Let us begin the tutorial with a classical problem called Linear Regression \[[1](#References)\]. In this chapter, we will train a model from a realistic dataset to predict home prices. Some important concepts in Machine Learning will be covered through this example.
-
-The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.en.md#running-the-book).
-
-## Problem Setup
-Suppose we have a dataset of $n$ real estate properties. These real estate properties will be referred to as *homes* in this chapter for clarity.
-
-Each home is associated with $d$ attributes. The attributes describe characteristics such as the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic conditions nearby.
-
-In our problem setup, the attribute $x_{i,j}$ denotes the $j$th characteristic of the $i$th home. In addition, $y_i$ denotes the price of the $i$th home. Our task is to predict $y_i$ given a set of attributes $\{x_{i,1}, ..., x_{i,d}\}$. We assume that the price of a home is a linear combination of all of its attributes, namely,
-
-$$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, i=1,\ldots,n$$
-
-where $\vec{\omega}$ and $b$ are the model parameters we want to estimate. Once they are learned, we will be able to predict the price of a home, given the attributes associated with it. We call this model **Linear Regression**. In other words, we want to regress a value against several values linearly. In practice, a linear model is often too simplistic to capture the real relationships between the variables. Yet, because Linear Regression is easy to train and analyze, it has been applied to a large number of real problems. As a result, it is an important topic in many classic Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\].
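-
-In vector form, this is just a dot product plus a bias, $\hat{y}_i = \vec{\omega}^\top \vec{x}_i + b$. The following minimal NumPy sketch, using made-up attribute values and parameters rather than learned ones, shows how a single prediction would be computed:
-
-```python
-import numpy as np
-
-x = np.array([3.0, 2.0, 1.0, 0.5])   # attributes of one home (d = 4 here, made up)
-w = np.array([0.8, -0.3, 0.5, 1.2])  # one weight per attribute (made up)
-b = 0.1                              # bias (made up)
-
-y_hat = np.dot(w, x) + b             # predicted price
-print y_hat
-```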
-
-## Results Demonstration
-We first show the result of our model. The dataset [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) is used to train a linear model to predict the home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of similar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. The dotted line represents points where $X=Y$. The more accurate the model's predictions, the closer its points lie to the dotted line.
-
-
- Figure 1. Predicted Value V.S. Actual Value
-
-
-## Model Overview
-
-### Model Definition
-
-In the UCI Housing Data Set, there are 13 home attributes $\{x_{i,j}\}$ that are related to the median home price $y_i$, which we aim to predict. Thus, our model can be written as:
-
-$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$
-
-where $\hat{Y}$ is the predicted value used to differentiate from actual value $Y$. The model learns parameters $\omega_1, \ldots, \omega_{13}, b$, where the entries of $\vec{\omega}$ are **weights** and $b$ is **bias**.
-
-Now we need an objective to optimize, so that the learned parameters can make $\hat{Y}$ as close to $Y$ as possible. Let's refer to the concept of [Loss Function (Cost Function)](https://en.wikipedia.org/wiki/Loss_function). A loss function must output a non-negative value, given any pair of the actual value $y_i$ and the predicted value $\hat{y_i}$. This value reflects the magnitude of the model error.
-
-For Linear Regression, the most common loss function is [Mean Square Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) which has the following form:
-
-$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
-
-That is, for a dataset of size $n$, MSE is the average of the squared prediction errors.
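-
-As a quick illustration of the formula, the loss can be computed directly from a vector of predictions and a vector of actual values (the numbers below are made up):
-
-```python
-import numpy as np
-
-y_hat = np.array([24.0, 21.5, 33.0])  # predicted prices (made up)
-y = np.array([22.0, 20.0, 35.0])      # actual prices (made up)
-
-mse = np.mean((y_hat - y) ** 2)       # average of the squared errors
-print mse                             # about 3.42
-```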
-
-### Training
-
-After setting up our model, there are several major steps to go through to train it:
-1. Initialize the parameters, including the weights $\vec{\omega}$ and the bias $b$. For example, we can initialize them randomly by drawing each value from a distribution with mean $0$ and standard deviation $1$.
-2. Feedforward. Evaluate the network output and compute the corresponding loss.
-3. [Backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors.
-4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of iterations is reached.
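-
-To make steps 2 and 3 concrete, here is a minimal NumPy sketch of batch gradient descent on the MSE loss. It only illustrates the update rule and is not the optimizer PaddlePaddle uses internally; the synthetic data and learning rate are made up:
-
-```python
-import numpy as np
-
-np.random.seed(0)
-X = np.random.rand(100, 13)          # 100 homes, 13 attributes (synthetic data)
-y = X.dot(np.random.rand(13)) + 0.5  # synthetic "true" prices
-
-w = np.zeros(13)                     # step 1: initialize parameters
-b = 0.0
-lr = 0.1                             # learning rate
-
-for _ in range(200):
-    y_hat = X.dot(w) + b             # step 2: feedforward
-    err = y_hat - y
-    grad_w = 2.0 * X.T.dot(err) / len(y)   # step 3: gradient of MSE w.r.t. w
-    grad_b = 2.0 * err.mean()              #         gradient of MSE w.r.t. b
-    w -= lr * grad_w                 # update parameters
-    b -= lr * grad_b                 # step 4: repeat until the loss is small enough
-```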
-
-## Dataset
-
-### Python Dataset Modules
-
-Our program starts with importing necessary packages:
-
-```python
-import paddle.v2 as paddle
-import paddle.v2.dataset.uci_housing as uci_housing
-```
-
-We encapsulated the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) in our Python module `uci_housing`. This module can
-
-1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if it is not already cached, and
-2. [preprocess](#preprocessing) the dataset.
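-
-Assuming the `paddle.v2` reader interface, the reader returned by `uci_housing.train()` can be iterated like an ordinary Python generator, which is a convenient way to peek at a few samples:
-
-```python
-import paddle.v2.dataset.uci_housing as uci_housing
-
-# Each sample is a (features, price) pair, where features is the
-# 13-dimensional, already-preprocessed attribute vector.
-for i, (features, price) in enumerate(uci_housing.train()()):
-    print len(features), price
-    if i == 2:   # only peek at the first three samples
-        break
-```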
-
-### An Introduction of the Dataset
-
-The UCI housing dataset has 506 instances. Each instance describes the attributes of a house in suburban Boston. The attributes are explained below:
-
-| Attribute Name | Characteristic | Data Type |
-| ------| ------ | ------ |
-| CRIM | per capita crime rate by town | Continuous|
-| ZN | proportion of residential land zoned for lots over 25,000 sq.ft. | Continuous |
-| INDUS | proportion of non-retail business acres per town | Continuous |
-| CHAS | Charles River dummy variable | Discrete, 1 if tract bounds river; 0 otherwise|
-| NOX | nitric oxides concentration (parts per 10 million) | Continuous |
-| RM | average number of rooms per dwelling | Continuous |
-| AGE | proportion of owner-occupied units built prior to 1940 | Continuous |
-| DIS | weighted distances to five Boston employment centres | Continuous |
-| RAD | index of accessibility to radial highways | Continuous |
-| TAX | full-value property-tax rate per $10,000 | Continuous |
-| PTRATIO | pupil-teacher ratio by town | Continuous |
-| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | Continuous |
-| LSTAT | % lower status of the population | Continuous |
-| MEDV | Median value of owner-occupied homes in $1000's | Continuous |
-
-The last entry is the median home price.
-
-### Preprocessing
-#### Continuous and Discrete Data
-We define a feature vector of length 13 for each home, where each entry corresponds to an attribute. Our first observation is that, among the 13 dimensions, there are 12 continuous dimensions and 1 discrete dimension.
-
-Note that although a discrete value is also written as numeric values such as 0, 1, or 2, its meaning differs from a continuous value drastically: the arithmetic difference between two discrete values has no meaning. For example, suppose $0$, $1$, and $2$ are used to represent the colors *Red*, *Green*, and *Blue* respectively. Judging only from the numeric representation, *Red* differs more from *Blue* than it does from *Green*; yet in reality there is no sense in which *Blue* is further from *Red* than *Green* is. Therefore, when handling a discrete feature that has $d$ possible values, we usually convert it to $d$ new binary features, each taking the value $0$ or $1$ to indicate whether the original value is absent or present. Alternatively, the discrete feature can be mapped onto a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary value, we do not need to do any preprocessing.
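-
-As an illustration of the one-hot conversion described above (not something CHAS needs), a hypothetical discrete color feature with three possible values could be expanded like this:
-
-```python
-import numpy as np
-
-colors = ['Red', 'Green', 'Blue']                     # hypothetical discrete feature
-value_to_index = {v: i for i, v in enumerate(colors)}
-
-def one_hot(value):
-    vec = np.zeros(len(colors), dtype=np.float32)
-    vec[value_to_index[value]] = 1.0
-    return vec
-
-print one_hot('Green')   # 1 in the Green slot, 0 elsewhere
-```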
-
-#### Feature Normalization
-We also observe a huge difference among the value ranges of the 13 features (Figure 2). For instance, the values of feature *B* fall in $[0.32, 396.90]$, whereas those of feature *NOX* fall in $[0.3850, 0.8170]$. Effective optimization requires data normalization, whose goal is to scale the values of each feature into roughly the same range, for example $[-0.5, 0.5]$. Here, we adopt a popular normalization technique: subtract the mean value of each feature from its values and divide the result by the width of the feature's original range.
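-
-A minimal NumPy version of this normalization (subtract the column mean, divide by the column range) could look like the sketch below; the `uci_housing` module already preprocesses the data for us, so this is only illustrative:
-
-```python
-import numpy as np
-
-def normalize(features):
-    # features: an (n_samples, n_features) array
-    mean = features.mean(axis=0)
-    value_range = features.max(axis=0) - features.min(axis=0)
-    return (features - mean) / value_range
-
-data = np.array([[0.32, 0.385], [396.90, 0.817], [100.0, 0.5]])
-print normalize(data)
-```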
-
-There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling):
-- A value range that is too large or too small might cause floating-point overflow or underflow during computation.
-- Different value ranges implicitly give different features different *importance* in the model (at least at the beginning of the training process). This implicit weighting is usually unjustified, makes the optimization harder, and in turn increases training time.
-- Many machine learning techniques and models (e.g., *L1/L2 regularization* and *Vector Space Model*) assume that all the features have roughly zero mean and similar value ranges.
-
-
-
- Figure 2. The value ranges of the features
-
-
-#### Prepare Training and Test Sets
-We split the dataset in two, one for adjusting the model parameters, namely, for model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$.
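-
-For an in-memory dataset, such an $8:2$ split can be done with a random permutation of the indices; `uci_housing` already exposes separate `train()` and `test()` readers, so the sketch below is only for illustration:
-
-```python
-import numpy as np
-
-def train_test_split(data, train_ratio=0.8, seed=0):
-    rng = np.random.RandomState(seed)
-    indices = rng.permutation(len(data))
-    cut = int(len(data) * train_ratio)
-    return data[indices[:cut]], data[indices[cut:]]
-
-samples = np.arange(506)                # stand-in for the 506 UCI instances
-train_part, test_part = train_test_split(samples)
-print len(train_part), len(test_part)   # 404 102
-```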
-
-
-When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process, such as the number of layers in the network. Because hyperparameters are not part of the model parameters, they cannot be trained using the same loss function. Thus we will try several sets of hyperparameters to train several models and cross-validate them on the validation set to pick the best one; finally, the selected trained model is tested on the test set. Because our model is relatively simple, we will omit this validation process.
-
-
-## Training
-
-`fit_a_line/trainer.py` demonstrates the training using [PaddlePaddle](http://paddlepaddle.org).
-
-### Initialize PaddlePaddle
-
-```python
-paddle.init(use_gpu=False, trainer_count=1)
-```
-
-### Model Configuration
-
-Linear regression is essentially a fully-connected layer with linear activation:
-
-```python
-x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
-y_predict = paddle.layer.fc(input=x,
- size=1,
- act=paddle.activation.Linear())
-y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
-cost = paddle.layer.mse_cost(input=y_predict, label=y)
-```
-### Create Parameters
-
-```python
-parameters = paddle.parameters.create(cost)
-```
-
-### Create Trainer
-
-```python
-optimizer = paddle.optimizer.Momentum(momentum=0)
-
-trainer = paddle.trainer.SGD(cost=cost,
- parameters=parameters,
- update_equation=optimizer)
-```
-
-### Feeding Data
-
-PaddlePaddle provides the
-[reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
-for loading training data. A reader may return multiple columns, and we need a Python dictionary to specify the mapping from column index to data layers.
-
-```python
-feeding={'x': 0, 'y': 1}
-```
-
-Moreover, an event handler is provided to print the training progress:
-
-```python
-# event_handler to print training and testing info
-def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
- if event.batch_id % 100 == 0:
- print "Pass %d, Batch %d, Cost %f" % (
- event.pass_id, event.batch_id, event.cost)
-
- if isinstance(event, paddle.event.EndPass):
- result = trainer.test(
- reader=paddle.batch(
- uci_housing.test(), batch_size=2),
- feeding=feeding)
- print "Test %d, Cost %f" % (event.pass_id, result.cost)
-```
-
-```python
-# event_handler to print training and testing info
-from paddle.v2.plot import Ploter
-
-train_title = "Train cost"
-test_title = "Test cost"
-plot_cost = Ploter(train_title, test_title)
-
-step = 0
-
-def event_handler_plot(event):
- global step
- if isinstance(event, paddle.event.EndIteration):
- if step % 10 == 0: # every 10 batches, record a train cost
- plot_cost.append(train_title, step, event.cost)
-
- if step % 100 == 0: # every 100 batches, record a test cost
- result = trainer.test(
- reader=paddle.batch(
- uci_housing.test(), batch_size=2),
- feeding=feeding)
- plot_cost.append(test_title, step, result.cost)
-
- if step % 100 == 0: # every 100 batches, update cost plot
- plot_cost.plot()
-
- step += 1
-```
-
-### Start Training
-
-```python
-trainer.train(
- reader=paddle.batch(
- paddle.reader.shuffle(
- uci_housing.train(), buf_size=500),
- batch_size=2),
- feeding=feeding,
- event_handler=event_handler_plot,
- num_passes=30)
-```
-
-![png](./image/train_and_test.png)
-
-## Summary
-This chapter introduces *Linear Regression* and how to train and test this model with PaddlePaddle, using the UCI Housing Data Set. Because a large number of more complex models and techniques are derived from linear regression, it is important to understand its underlying theory and limitations.
-
-
-## References
-1. https://en.wikipedia.org/wiki/Linear_regression
-2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. Springer, Berlin: Springer series in statistics, 2001.
-3. Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012.
-4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.
-
-
-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
-
-
-# Recognize Digits
-
-The source code for this tutorial is live at [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). For instructions on getting started with Paddle, please refer to [installation instructions](https://github.com/PaddlePaddle/book/blob/develop/README.en.md#running-the-book).
-
-## Introduction
-When one learns to program, the first task is usually to write a program that prints "Hello World!". In Machine Learning or Deep Learning, the equivalent task is to train a model to recognize hand-written digits on the dataset [MNIST](http://yann.lecun.com/exdb/mnist/). Handwriting recognition is a classic image classification problem. The problem is relatively easy and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1). The input image is a $28\times28$ matrix, and the label is one of the digits from $0$ to $9$. All images are normalized, meaning that they are both rescaled and centered.
-
-
-
-Fig. 1. Examples of MNIST images
-
-
-The MNIST dataset is created from the [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and the Special Database 1 (SD-1). SD-3 is labeled by the staff of the U.S. Census Bureau, while SD-1 is labeled by high school students in the U.S. Therefore SD-3 is cleaner and easier to recognize than SD-1. Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set (60,000 samples) and test set (10,000 samples), where the training set was labeled by 250 different annotators, and it was guaranteed that the annotators of the training set and the test set do not completely overlap.
-
-Yann LeCun, one of the founders of Deep Learning, has made tremendous contributions to handwritten character recognition and proposed the **Convolutional Neural Network** (CNN), which drastically improved recognition capability for handwritten characters. CNNs are now a critical concept in Deep Learning. From the LeNet proposal by Yann LeCun, to the models that won ImageNet competitions, such as VGGNet, GoogLeNet, and ResNet (see the [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification) tutorial), CNNs have achieved a series of impressive results in image classification tasks.
-
-Many algorithms are tested on MNIST. In 1998, LeCun experimented with a single-layer linear classifier, a Multilayer Perceptron (MLP) and the multilayer CNN LeNet. These experiments quickly reduced the test error from 12% to 0.7% \[[1](#references)\]. Since then, researchers have worked on many algorithms such as **K-Nearest Neighbors** (k-NN) \[[2](#references)\], **Support Vector Machine** (SVM) \[[3](#references)\], **Neural Networks** \[[4-7](#references)\] and **Boosting** \[[8](#references)\]. Various preprocessing methods like distortion removal, noise removal, and blurring, have also been applied to increase recognition accuracy.
-
-In this tutorial, we tackle the task of handwritten character recognition. We start with a simple **softmax** regression model and guide our readers step-by-step to improve this model's performance on the task of recognition.
-
-
-## Model Overview
-
-Before introducing classification algorithms and training procedure, we define the following symbols:
-- $X$ is the input: Input is a $28\times 28$ MNIST image. It is flattened to a $784$ dimensional vector. $X=\left (x_0, x_1, \dots, x_{783} \right )$.
-- $Y$ is the output: Output of the classifier is 1 of the 10 classes (digits from 0 to 9). $Y=\left (y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents the probability that the input image belongs to class $i$.
-- $L$ is the ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$. It is also 10 dimensional, but only one entry is $1$ and all others are $0$s.
-
-### Softmax Regression
-
-In a simple softmax regression model, the input is first fed to fully connected layers. Then, a softmax function is applied to output probabilities of multiple output classes\[[9](#references)\].
-
-The input $X$ is multiplied by weights $W$ and then added to the bias $b$ to generate activations.
-
-$$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
-
-where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
-
-For an $N$-class classification problem with $N$ output nodes, Softmax normalizes the resulting $N$-dimensional vector so that each of its entries falls in the range $[0,1]$ and all entries sum to $1$, each representing the probability that the sample belongs to the corresponding class. Here $y_i$ denotes the predicted probability that an image is of digit $i$.
-
-In such a classification problem, we usually use the cross entropy loss function:
-
-$$ \text{crossentropy}(label, y) = -\sum_i label_i \log(y_i) $$
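-
-Both formulas are easy to verify numerically. A small NumPy sketch with made-up logits and a one-hot label:
-
-```python
-import numpy as np
-
-def softmax(z):
-    e = np.exp(z - z.max())       # subtract the max for numerical stability
-    return e / e.sum()
-
-logits = np.array([2.0, 1.0, 0.1])  # made-up pre-activation outputs for 3 classes
-label = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
-
-y = softmax(logits)                 # probabilities, summing to 1
-loss = -np.sum(label * np.log(y))   # cross-entropy loss
-print y, loss
-```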
-
-Fig. 2 illustrates a softmax regression network, with the weights in blue and the bias in red. `+1` denotes a constant input of $1$ used for the bias.
-
-
-
-### Multilayer Perceptron
-
-The softmax regression model described above uses the simplest two-layer neural network. That is, it only contains an input layer and an output layer, with limited regression capability. To achieve better recognition results, consider adding several hidden layers\[[10](#references)\] between the input layer and the output layer.
-
-1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ denotes the activation function. Some [common ones](#list-of-common-activation-functions) are sigmoid, tanh and ReLU.
-2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
-3. Finally, the output layer outputs $Y=\text{softmax}(W_3H_2 + b_3)$, the vector denoting our classification result.
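-
-The shapes become concrete in a small NumPy sketch of this forward pass; the weights are random and untrained, and the hidden sizes (128 and 64) simply match the PaddlePaddle configuration used later:
-
-```python
-import numpy as np
-
-def relu(z):
-    return np.maximum(0.0, z)
-
-def softmax(z):
-    e = np.exp(z - z.max())
-    return e / e.sum()
-
-x = np.random.rand(784)                                   # a flattened 28x28 image
-W1, b1 = 0.01 * np.random.randn(128, 784), np.zeros(128)
-W2, b2 = 0.01 * np.random.randn(64, 128), np.zeros(64)
-W3, b3 = 0.01 * np.random.randn(10, 64), np.zeros(10)
-
-h1 = relu(W1.dot(x) + b1)        # first hidden layer
-h2 = relu(W2.dot(h1) + b2)       # second hidden layer
-y = softmax(W3.dot(h2) + b3)     # class probabilities
-print y.shape, y.sum()           # (10,) and 1.0
-```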
-
-Fig. 3 shows a Multilayer Perceptron network, with the weights in blue and the bias in red. `+1` denotes a constant input of $1$ used for the bias.
-
-
-
-### Convolutional Neural Network
-
-#### Convolutional Layer
-
-The **convolutional layer** is the core of a Convolutional Neural Network. The parameters in this layer are composed of a set of filters, also called kernels. We could visualize the convolution step in the following fashion: each kernel slides horizontally and vertically till it covers the whole image. At every window, we compute the dot product of the kernel and the input, then add the bias and apply an activation function. The result is a two-dimensional activation map. For example, some kernels may recognize corners, and some may recognize circles; each convolution kernel responds strongly to its corresponding feature.
-
-Fig. 4 illustrates the computation performed by a convolutional layer, with the depth dimension drawn flattened for simplicity. The input is $W_1=5$, $H_1=5$, $D_1=3$. This is in fact a common representation of a colored image: $W_1$ and $H_1$ correspond to the width and height, and $D_1$ corresponds to the 3 RGB color channels. The parameters of the convolutional layer are $K=2$, $F=3$, $S=2$, $P=1$. $K$ denotes the number of kernels; specifically, $Filter$ $W_0$ and $Filter$ $W_1$ are the kernels. $F$ is the kernel size: $W_0$ and $W_1$ are both $F \times F = 3 \times 3$ matrices in all depths. $S$ is the stride, i.e. the step size of the sliding window; here, kernels move rightwards or downwards by 2 units at a time. $P$ is the width of the padding, which denotes an extension of the input; here, the gray area shows zero padding with size 1.
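-
-With these parameters, the spatial size of each output activation map follows the usual convolution arithmetic:
-
-$$W_2 = \frac{W_1 - F + 2P}{S} + 1 = \frac{5 - 3 + 2 \times 1}{2} + 1 = 3, \qquad H_2 = \frac{H_1 - F + 2P}{S} + 1 = 3$$
-
-so each of the $K=2$ kernels produces a $3 \times 3$ activation map.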
-
-#### Pooling Layer
-
-
-
-Fig. 5 Pooling layer using max-pooling
-
-
-A **pooling layer** performs downsampling. Its main functionality is to shrink the feature maps, which reduces the computation and the number of parameters in subsequent layers, and it also prevents over-fitting to some extent. Usually, a pooling layer is added after a convolutional layer. A pooling layer can use various techniques, such as max pooling and average pooling. As shown in Fig. 5, max pooling uses rectangles to segment the input layer into several parts and computes the maximum value in each part as the output.
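-
-Max-pooling with a 2x2 window and stride 2 can be written out directly in NumPy; this sketch is only meant to make the operation in Fig. 5 concrete:
-
-```python
-import numpy as np
-
-def max_pool_2x2(feature_map):
-    h, w = feature_map.shape
-    trimmed = feature_map[:h // 2 * 2, :w // 2 * 2]     # drop odd borders, if any
-    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
-    return blocks.max(axis=(1, 3))                      # max over each 2x2 window
-
-fm = np.arange(16, dtype=np.float32).reshape(4, 4)
-print max_pool_2x2(fm)    # [[ 5.  7.] [13. 15.]]
-```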
-
-#### LeNet-5 Network
-
-
-
-[**LeNet-5**](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6. shows its architecture: A 2-dimensional input image is fed into two sets of convolutional layers and pooling layers. This output is then fed to a fully connected layer and a softmax classifier. Compared to multilayer, fully connected perceptrons, the LeNet-5 can recognize images better. This is due to the following three properties of the convolution:
-
-- The 3D nature of the neurons: a convolutional layer is organized by width, height and depth. Neurons in each layer are connected to only a small region in the previous layer. This region is called the receptive field.
-- Local connectivity: A CNN utilizes the local space correlation by connecting local neurons. This design guarantees that the learned filter has a strong response to local input features. Stacking many such layers generates a non-linear filter that is more global. This enables the network to first obtain good representation for small parts of input and then combine them to represent a larger region.
-- Weight sharing: In a CNN, computation is iterated on shared parameters (weights and bias) to form a feature map. This means that all the neurons in the same depth of the output respond to the same feature. This allows the network to detect a feature regardless of its position in the input. In other words, it is shift invariant.
-
-For more details on Convolutional Neural Networks, please refer to the tutorial on [Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) and the [relevant lecture](http://cs231n.github.io/convolutional-networks/) from a Stanford open course.
-
-### List of Common Activation Functions
-- Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $
-
-- Tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $
-
-  In fact, the tanh function is just a rescaled and shifted sigmoid: $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$, which stretches the output range from $(0, 1)$ to $(-1, 1)$.
-
-- ReLU activation function: $ f(x) = max(0, x) $
-
-For more information, please refer to [Activation functions on Wikipedia](https://en.wikipedia.org/wiki/Activation_function).
-
-## Data Preparation
-
-PaddlePaddle provides a Python module, `paddle.dataset.mnist`, which downloads and caches the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). The cache is under `/home/username/.cache/paddle/dataset/mnist`:
-
-
-| File name | Description | Number of items |
-|----------------------|--------------|-----------|
-|train-images-idx3-ubyte| Training images | 60,000 |
-|train-labels-idx1-ubyte| Training labels | 60,000 |
-|t10k-images-idx3-ubyte | Evaluation images | 10,000 |
-|t10k-labels-idx1-ubyte | Evaluation labels | 10,000 |
-
-
-## Model Configuration
-
-A PaddlePaddle program starts from importing the API package:
-
-```python
-import gzip
-import paddle.v2 as paddle
-```
-
-We want to use this program to demonstrate three different classifiers, each defined as a Python function:
-
-- Softmax regression: the network has a fully-connected layer with softmax activation:
-
-```python
-def softmax_regression(img):
- predict = paddle.layer.fc(input=img,
- size=10,
- act=paddle.activation.Softmax())
- return predict
-```
-
-- Multi-Layer Perceptron: this network has two hidden fully-connected layers with ReLU activation, followed by a fully-connected output layer with softmax activation:
-
-```python
-def multilayer_perceptron(img):
- hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu())
- hidden2 = paddle.layer.fc(input=hidden1,
- size=64,
- act=paddle.activation.Relu())
- predict = paddle.layer.fc(input=hidden2,
- size=10,
- act=paddle.activation.Softmax())
- return predict
-```
-
-- Convolution network LeNet-5: the input image is fed through two convolution-pooling layers and then a fully-connected output layer with softmax activation:
-
-```python
-def convolutional_neural_network(img):
-
- conv_pool_1 = paddle.networks.simple_img_conv_pool(
- input=img,
- filter_size=5,
- num_filters=20,
- num_channel=1,
- pool_size=2,
- pool_stride=2,
- act=paddle.activation.Relu())
-
- conv_pool_2 = paddle.networks.simple_img_conv_pool(
- input=conv_pool_1,
- filter_size=5,
- num_filters=50,
- num_channel=20,
- pool_size=2,
- pool_stride=2,
- act=paddle.activation.Relu())
-
- predict = paddle.layer.fc(input=conv_pool_2,
- size=10,
- act=paddle.activation.Softmax())
- return predict
-```
-
-PaddlePaddle provides a special layer `layer.data` for reading data. Let us create a data layer for reading images and connect it to a classification network created using one of the above three functions. We also need a cost layer for training the model.
-
-```python
-paddle.init(use_gpu=False, trainer_count=1)
-
-images = paddle.layer.data(
- name='pixel', type=paddle.data_type.dense_vector(784))
-label = paddle.layer.data(
- name='label', type=paddle.data_type.integer_value(10))
-
-# predict = softmax_regression(images)
-# predict = multilayer_perceptron(images) # uncomment for MLP
-predict = convolutional_neural_network(images) # uncomment for LeNet5
-
-cost = paddle.layer.classification_cost(input=predict, label=label)
-```
-
-Now, it is time to specify training parameters. In the following `Momentum` optimizer, `momentum=0.9` means that 90% of the current momentum comes from that of the previous iteration. The learning rate relates to the speed at which the network training converges. Regularization is meant to prevent over-fitting; here we use the L2 regularization.
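-
-Concretely, classic momentum SGD keeps a running "velocity" and blends it with the current gradient. With `momentum=0.9`, 90% of the new velocity is carried over from the previous step (this is a sketch of the standard update rule, where $\eta$ is the learning rate; the L2 regularization configured below effectively adds a weight-decay term to the gradient):
-
-$$v_{t+1} = 0.9\, v_t - \eta \nabla L(w_t), \qquad w_{t+1} = w_t + v_{t+1}$$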
-
-```python
-parameters = paddle.parameters.create(cost)
-
-optimizer = paddle.optimizer.Momentum(
- learning_rate=0.1 / 128.0,
- momentum=0.9,
- regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))
-
-trainer = paddle.trainer.SGD(cost=cost,
- parameters=parameters,
- update_equation=optimizer)
-```
-
-Then we specify the training data `paddle.dataset.mnist.train()` and testing data `paddle.dataset.mnist.test()`. These two methods are *reader creators*. Once called, a reader creator returns a *reader*. A reader is a Python function, which, once called, returns a Python generator that yields instances of data.
-
-`shuffle` is a reader decorator. It takes in a reader A as input and returns a new reader B. Under the hood, B calls A to read data in the following fashion: it copies `buf_size` instances at a time into a buffer, shuffles them, and then yields the shuffled instances one at a time. A larger buffer size yields more thoroughly shuffled data.
-
-`batch` is a special decorator which takes in a reader and outputs a *batch reader*, which yields a minibatch at a time instead of a single instance.
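-
-Assuming readers behave like ordinary Python generators, the two decorators can be tried out on a toy reader (made-up data, not MNIST):
-
-```python
-import paddle.v2 as paddle
-
-def toy_reader():
-    def reader():
-        for i in range(10):
-            yield [float(i)], i      # one instance: (features, label)
-    return reader
-
-shuffled = paddle.reader.shuffle(toy_reader(), buf_size=5)
-batched = paddle.batch(shuffled, batch_size=4)
-
-for minibatch in batched():
-    print minibatch                  # a list of up to 4 instances
-```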
-
-`event_handler_plot` is used to plot a figure like below:
-
-![png](./image/train_and_test.png)
-
-```python
-from paddle.v2.plot import Ploter
-
-train_title = "Train cost"
-test_title = "Test cost"
-cost_ploter = Ploter(train_title, test_title)
-
-step = 0
-
-# event_handler to plot a figure
-def event_handler_plot(event):
- global step
- if isinstance(event, paddle.event.EndIteration):
- if step % 100 == 0:
- cost_ploter.append(train_title, step, event.cost)
- cost_ploter.plot()
- step += 1
- if isinstance(event, paddle.event.EndPass):
- # save parameters
- with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
- parameters.to_tar(f)
-
- result = trainer.test(reader=paddle.batch(
- paddle.dataset.mnist.test(), batch_size=128))
- cost_ploter.append(test_title, step, result.cost)
-```
-
-`event_handler` is an alternative handler that prints the training progress as plain text.
-
-```python
-lists = []
-
-def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
- if event.batch_id % 100 == 0:
- print "Pass %d, Batch %d, Cost %f, %s" % (
- event.pass_id, event.batch_id, event.cost, event.metrics)
- if isinstance(event, paddle.event.EndPass):
- # save parameters
- with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
- parameters.to_tar(f)
-
- result = trainer.test(reader=paddle.batch(
- paddle.dataset.mnist.test(), batch_size=128))
- print "Test with Pass %d, Cost %f, %s\n" % (
- event.pass_id, result.cost, result.metrics)
- lists.append((event.pass_id, result.cost,
- result.metrics['classification_error_evaluator']))
-```
-
-```python
-trainer.train(
- reader=paddle.batch(
- paddle.reader.shuffle(
- paddle.dataset.mnist.train(), buf_size=8192),
- batch_size=128),
- event_handler=event_handler_plot,
- num_passes=5)
-```
-
-During training, `trainer.train` invokes the registered event handler (`event_handler_plot` above, or `event_handler` for plain-text logging) for certain events. This gives us a chance to report the training progress.
-
-```
-# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
-# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
-# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
-# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
-# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
-# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
-```
-
-After the training, we can check the model's prediction accuracy.
-
-```python
-# find the best pass
-best = sorted(lists, key=lambda list: float(list[1]))[0]
-print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
-print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
-```
-
-Usually, with MNIST data, the softmax regression model achieves an accuracy around 92.34%, the MLP 97.66%, and the convolution network around 99.20%. Convolution layers have been widely considered a great invention for image processing.
-
-## Application
-
-After training is done, users can apply the trained model to classify images. The following code shows how to run inference on MNIST images through the `paddle.infer` interface.
-
-```python
-from PIL import Image
-import numpy as np
-import os
-def load_image(file):
- im = Image.open(file).convert('L')
- im = im.resize((28, 28), Image.ANTIALIAS)
- im = np.array(im).astype(np.float32).flatten()
- im = im / 255.0
- return im
-
-test_data = []
-cur_dir = os.path.dirname(os.path.realpath(__file__))
-test_data.append((load_image(cur_dir + '/image/infer_3.png'),))
-
-probs = paddle.infer(
- output_layer=predict, parameters=parameters, input=test_data)
-lab = np.argsort(-probs) # probs and lab are the results of one batch data
-print "Label of image/infer_3.png is: %d" % lab[0][0]
-```
-
-
-## Conclusion
-
-This tutorial describes a few common deep learning models: **Softmax regression**, the **Multilayer Perceptron**, and the **Convolutional Neural Network**. Understanding these models is crucial for future learning; the subsequent tutorials derive more sophisticated networks by building on top of them.
-
-When our model evolves from a simple softmax regression to a slightly more complex Convolutional Neural Network, the recognition accuracy on the MNIST dataset improves significantly. This is due to the convolutional layers' local connections and parameter sharing. While learning new models in the future, we encourage the readers to understand the key ideas that lead a new model to improve the results of an old one.
-
-Moreover, this tutorial introduces the basic flow of PaddlePaddle model design, which starts with a *dataprovider*, a model layer construction, and finally training and prediction. Motivated readers can leverage the flow used in this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice.
-
-
-## References
-
-1. LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. ["Gradient-based learning applied to document recognition."](http://ieeexplore.ieee.org/abstract/document/726791/) Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
-2. Wejéus, Samuel. ["A Neural Network Approach to Arbitrary SymbolRecognition on Modern Smartphones."](http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A753279&dswid=-434) (2014).
-3. Decoste, Dennis, and Bernhard Schölkopf. ["Training invariant support vector machines."](http://link.springer.com/article/10.1023/A:1012454411458) Machine learning 46, no. 1-3 (2002): 161-190.
-4. Simard, Patrice Y., David Steinkraus, and John C. Platt. ["Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.8494&rep=rep1&type=pdf) In ICDAR, vol. 3, pp. 958-962. 2003.
-5. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. ["Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure."](http://www.jmlr.org/proceedings/papers/v2/salakhutdinov07a/salakhutdinov07a.pdf) In AISTATS, vol. 11. 2007.
-6. Cireşan, Dan Claudiu, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. ["Deep, big, simple neural nets for handwritten digit recognition."](http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00052) Neural computation 22, no. 12 (2010): 3207-3220.
-7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010.
-8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009.
-9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386.
-10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58.
-
-
-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
-
-
+
+## Apply the Model
+
+We can use the trained model to classify images. The program below shows how to run inference through the `paddle.infer` interface; you can uncomment the commented-out lines to load a different set of saved model parameters.
+
+```python
+from PIL import Image
+import numpy as np
+import os
+def load_image(file):
+ im = Image.open(file)
+ im = im.resize((32, 32), Image.ANTIALIAS)
+ im = np.array(im).astype(np.float32)
+    # PIL loads the image in H (height), W (width), C (channel) order.
+    # PaddlePaddle requires CHW order, so transpose the dimensions.
+ im = im.transpose((2, 0, 1)) # CHW
+    # CIFAR training images use B (blue), G (green), R (red) channel order,
+    # while PIL loads images as RGB by default, so swap the channels here.
+ im = im[(2, 1, 0),:,:] # BGR
+ im = im.flatten()
+ im = im / 255.0
+ return im
+
+test_data = []
+cur_dir = os.path.dirname(os.path.realpath(__file__))
+test_data.append((load_image(cur_dir + '/image/dog.png'),))
+
+# with gzip.open('params_pass_50.tar.gz', 'r') as f:
+# parameters = paddle.parameters.Parameters.from_tar(f)
+
+probs = paddle.infer(
+ output_layer=out, parameters=parameters, input=test_data)
+lab = np.argsort(-probs) # probs and lab are the results of one batch data
+print "Label of image/dog.png is: %d" % lab[0][0]
+```
+
+
+## Summary
+
+Traditional image classification methods consist of multiple stages and rather complex pipelines, while an end-to-end CNN model produces the result in a single step and greatly improves the classification accuracy. In this chapter, we first introduced three classic models: VGG, GoogleNet and ResNet. Then, based on the CIFAR-10 dataset, we showed how to configure and train CNN models with PaddlePaddle, in particular the VGG and ResNet models. Finally, we showed how to use PaddlePaddle's API to predict the category of an image and to extract features. The configuration and training procedure is the same for other datasets such as ImageNet, so readers are encouraged to experiment with them on their own.
+
+
+## References
+
+[1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004.
+
+[2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
+
+[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
+
+[4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
+
+[5] B. Olshausen, D. Field, [Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?](http://redwood.psych.cornell.edu/papers/olshausen_field_1997.pdf), Vision Research, vol. 37, pp. 3311-3325, 1997.
+
+[6] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. (2010). [Locality-constrained Linear Coding for image classification](http://ieeexplore.ieee.org/abstract/document/5540018/). In CVPR.
+
+[7] Perronnin, F., Sánchez, J., & Mensink, T. (2010). [Improving the fisher kernel for large-scale image classification](http://dl.acm.org/citation.cfm?id=1888101). In ECCV (4).
+
+[8] Lin, Y., Lv, F., Cao, L., Zhu, S., Yang, M., Cour, T., Yu, K., and Huang, T. (2011). [Large-scale image clas- sification: Fast feature extraction and SVM training](http://ieeexplore.ieee.org/document/5995477/). In CVPR.
+
+[9] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). [ImageNet classification with deep convolutional neu- ral networks](http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf). In NIPS.
+
+[10] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). arXiv preprint arXiv:1207.0580, 2012.
+
+[11] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. [Return of the Devil in the Details: Delving Deep into Convolutional Nets](https://arxiv.org/abs/1405.3531). BMVC, 2014.
+
+[12] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., [Going deeper with convolutions](https://arxiv.org/abs/1409.4842). In: CVPR. (2015)
+
+[13] Lin, M., Chen, Q., and Yan, S. [Network in network](https://arxiv.org/abs/1312.4400). In Proc. ICLR, 2014.
+
+[14] S. Ioffe and C. Szegedy. [Batch normalization: Accelerating deep network training by reducing internal covariate shift](https://arxiv.org/abs/1502.03167). In ICML, 2015.
+
+[15] K. He, X. Zhang, S. Ren, J. Sun. [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385). CVPR 2016.
+
+[16] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. [Rethinking the incep-tion architecture for computer vision](https://arxiv.org/abs/1512.00567). In: CVPR. (2016).
+
+[17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016).
+
+[18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective](http://link.springer.com/article/10.1007/s11263-014-0733-5). International Journal of Computer Vision, 111(1), 98-136, 2015.
+
+[19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015.
+
+[20] http://deeplearning.net/tutorial/lenet.html
+
+[21] https://www.cs.toronto.edu/~kriz/cifar.html
+
+[22] http://cs231n.github.io/classification/
+
+
+This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+
-Image Classification
-=======================
-
-The source code for this chapter is at [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification). First-time users, please refer to PaddlePaddle [Installation Tutorial](https://github.com/PaddlePaddle/book/blob/develop/README.en.md#running-the-book) for installation instructions.
-
-## Background
-
-Compared to words, images convey information in a much more vivid and easily understandable way, often with an artistic touch. They are an important source for people to express and exchange ideas. In this chapter, we focus on one of the essential problems in image recognition -- image classification.
-
-Image classification is the task of distinguishing images in different categories based on their semantic meaning. It is a core problem in computer vision and is also the foundation of other higher level computer vision tasks such as object detection, image segmentation, object tracking, action recognition, etc. Image classification has applications in many areas such as face recognition, intelligent video analysis in security systems, traffic scene recognition in transportation systems, content-based image retrieval and automatic photo indexing in web services, image classification in medicine, etc.
-
-To classify an image we first encode the entire image using handcrafted or learned features and then determine the category using a classifier. Thus, feature extraction plays an important role in image classification. Prior to deep learning the BoW(Bag of Words) model was the most widely used method for classifying an image as well as an object. The BoW technique was introduced in Natural Language Processing where a training sentence is represented as a bag of words. In the context of image classification, the BoW model requires constructing a dictionary. The simplest BoW framework can be designed with three steps: **feature extraction**, **feature encoding** and **classifier design**.
-
-Using deep learning, image classification can be framed as a supervised or unsupervised learning problem that uses hierarchical features automatically, without any need for manually crafted features. In recent years, Convolutional Neural Networks (CNNs) have made significant progress in image classification. CNNs use raw image pixels as input, extract low-level and high-level abstract features through convolution operations, and directly output the classification results from the model. This style of end-to-end learning has led not only to increased performance but also to wider adoption in various applications.
-
-In this chapter, we introduce deep-learning-based image classification methods and explain how to train a CNN model using PaddlePaddle.
-
-## Demonstration
-
-An image can be classified by a general as well as fine-grained image classifier.
-
-
-Figure 1 shows the results of a general image classifier -- the trained model can correctly recognize the main objects in the images.
-
-
-
-Figure 1. General image classification
-
-
-
-Figure 2 shows the results of a fine-grained image classifier. This flower recognition task requires correctly recognizing the flower's category.
-
-
-
-Figure 2. Fine-grained image classification
-
-
-
-A good model should recognize objects of different categories correctly. The results of such a model should not vary due to viewpoint variation, illumination conditions, object distortion or occlusion.
-Figure 3 shows some images with various disturbances. A good model should classify these images correctly like humans.
-
-
-
-Figure 3. Disturbed images [22]
-
-
-## Model Overview
-
-A large amount of research in image classification is built upon public datasets such as [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/), [ImageNet](http://image-net.org/) etc. Many image classification algorithms are usually evaluated and compared on these datasets. PASCAL VOC is a computer vision competition started in 2005, and ImageNet is a dataset for Large Scale Visual Recognition Challenge (ILSVRC) started in 2010. In this chapter, we introduce some image classification models from the submissions to these competitions.
-
-Before 2012, traditional image classification was accomplished with the three steps described in the background section. A complete model construction usually involves the following stages: low-level feature extraction, feature encoding, spatial constraint or feature clustering, classifier design, model ensemble.
-
- 1). **Low-level feature extraction**: This step extracts large amounts of local features according to fixed strides and scales. Popular local features include Scale-Invariant Feature Transform (SIFT)[1], Histogram of Oriented Gradient(HOG)[2], Local Binary Pattern(LBP)[3], etc. A common practice is to employ multiple feature descriptors in order to avoid missing a lot of information.
-
- 2). **Feature encoding**: Low-level features contain a large amount of redundancy and noise. In order to improve the robustness of features, it is necessary to employ a feature transformation to encode low-level features. This is called feature encoding. Common feature encoding methods include vector quantization [4], sparse coding [5], locality-constrained linear coding [6], Fisher vector encoding [7], etc.
-
- 3). **Spatial constraint**: Spatial constraint or feature clustering is usually adopted after feature encoding for extracting the maximum or average of each dimension in the spatial domain. Pyramid feature matching--a popular feature clustering method--divides an image uniformly into patches and performs feature clustering in each patch.
-
- 4). **Classification**: In the above steps an image can be described by a vector of fixed dimension. Then a classifier can be used to classify the image into categories. Common classifiers include Support Vector Machine(SVM), random forest etc. Kernel SVM is the most popular classifier and has achieved very good performance in traditional image classification tasks.
-
-This method has been widely used as an image classification algorithm in PASCAL VOC [18]. [NEC Labs](http://www.nec-labs.com/) won the championship in ILSVRC 2010 by employing SIFT and LBP features, two non-linear encoders and an SVM [8].
-
-The CNN model AlexNet, proposed by Alex Krizhevsky et al. [9], made a breakthrough in ILSVRC 2012. It dramatically outperformed traditional methods and won the ILSVRC championship in 2012. This was also the first time a deep learning method was used for large-scale image classification. Since AlexNet, a series of CNN models have been proposed that have advanced the state of the art steadily on ImageNet, as shown in Figure 4. With deeper and more sophisticated architectures, the Top-5 error rate has dropped lower and lower (to around 3.5%). The error rate of human raters on the same ImageNet dataset is 5.1%, which means that the image classification capability of a deep learning model has surpassed human raters.
-
-
-
-### CNN
-
-Traditional CNNs consist of convolutional and fully-connected layers and use the softmax multi-category classifier with the cross-entropy loss function. Figure 5 shows a typical CNN. We first introduce the common components of a CNN.
-
-
-
-Figure 5. A CNN example [20]
-
-
-- convolutional layer: this layer uses the convolution operation to extract (low-level and high-level) features and to discover local correlation and spatial invariance.
-
-- pooling layer: this layer downsamples feature maps by extracting the local max (max-pooling) or average (avg-pooling) value of each patch in the feature map. Downsampling is a common operation in image processing and is used to filter out high-frequency information.
-
-- fully-connected layer: this layer fully connects neurons between two adjacent layers.
-
-- non-linear activation: Convolutional and fully-connected layers are usually followed by some non-linear activation layers. Non-linearities enhance the expression capability of the network. Some examples of non-linear activation functions are Sigmoid, Tanh and ReLU. ReLU is the most commonly used activation function in CNN.
-
-- Dropout [10]: At each training stage, individual nodes are dropped out of the network with a certain probability. This improves the network's ability to generalize and avoids overfitting.
-
-Parameter updates at each layer during training cause the input distributions of subsequent layers to change, which in turn requires hyper-parameters to be carefully tuned. In 2015, Sergey Ioffe and Christian Szegedy proposed the Batch Normalization (BN) algorithm [14], which normalizes the features of each batch in a layer and enables a relatively stable distribution at each layer. Not only does the BN algorithm act as a regularizer, but it also reduces the need for careful hyper-parameter design. Experiments demonstrate that BN accelerates training convergence, and it has been widely used in later, deeper models.
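-
-The core of BN is compact enough to state in a few lines: normalize each feature over the batch, then apply a learned scale and shift. A NumPy sketch of the training-time forward computation (the data, gamma and beta are made up, and the running statistics used at inference time are omitted):
-
-```python
-import numpy as np
-
-def batch_norm_forward(x, gamma, beta, eps=1e-5):
-    # x: (batch_size, num_features)
-    mean = x.mean(axis=0)
-    var = x.var(axis=0)
-    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize each feature over the batch
-    return gamma * x_hat + beta               # learned scale and shift
-
-x = 3.0 * np.random.randn(8, 4) + 2.0         # a batch of 8 samples, 4 features
-out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
-print out.mean(axis=0), out.std(axis=0)       # roughly 0 and 1 per feature
-```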
-
-In the following sections, we will introduce the following network architectures: VGG, GoogleNet and ResNet.
-
-### VGG
-
-The Oxford Visual Geometry Group (VGG) proposed the VGG network in ILSVRC 2014 [11]. This model is deeper and wider than previous neural architectures. It consists of five main groups of convolution operations. Adjacent convolution groups are connected via max-pooling layers. Each group contains a series of 3x3 convolutional layers (i.e. kernels). The number of convolution kernels stays the same within the group and increases from 64 in the first group to 512 in the last one. The total number of learnable layers could be 11, 13, 16, or 19 depending on the number of convolutional layers in each group. Figure 6 illustrates a 16-layer VGG. The neural architecture of VGG is relatively simple and has been adopted by many papers such as the first one that surpassed human-level performance on ImageNet [19].
-
-
-
-Figure 6. VGG16 model for ImageNet
-
-
-### GoogleNet
-
-GoogleNet [12] won the ILSVRC championship in 2014. GoogleNet borrowed some ideas from the Network in Network (NIN) model [13] and is built on Inception blocks. Let us first familiarize ourselves with the NIN model.
-
-The two main characteristics of the NIN model are:
-
-1) A single-layer convolutional network is replaced with a Multi-Layer Perceptron Convolution (MLPconv). MLPconv is a tiny multi-layer convolutional network. It enhances non-linearity by adding several 1x1 convolutional layers after linear ones.
-
-2) In traditional CNNs, the last few layers are usually fully-connected with a large number of parameters. In contrast, NIN replaces all fully-connected layers with convolutional layers whose feature maps have the same size as the category dimension, followed by a global average pooling. This replacement of fully-connected layers significantly reduces the number of parameters.
-
-Figure 7 depicts two Inception blocks. Figure 7(a) is the simplest design; the output is a concatenation of features from three convolutional layers and one pooling layer. The disadvantage of this design is that the pooling layer does not change the number of channels, which leads to an increase in the number of outputs. After several such blocks, the number of outputs and parameters becomes larger and larger, leading to higher computational complexity. To overcome this drawback, the Inception block in Figure 7(b) employs three 1x1 convolutional layers, which reduce the number of channels while improving the non-linearity of the network.
-
-
-
-Figure 7. Inception block
-
-
-GoogleNet consists of multiple stacked Inception blocks followed by an avg-pooling layer, as in NIN, instead of traditional fully-connected layers. The difference between GoogleNet and NIN is that GoogleNet adds a fully-connected layer after the avg-pooling layer to output a vector of category size. Besides these two characteristics, the features from the middle layers of a GoogleNet are also very discriminative. Therefore, GoogleNet inserts two auxiliary classifiers into the model to enhance the gradient signal and add regularization during backpropagation. The loss function of the whole network is the weighted sum of these three classifiers.
-
-Figure 8 illustrates the neural architecture of a GoogleNet which consists of 22 layers: it starts with three regular convolutional layers followed by three groups of sub-networks -- the first group contains two Inception blocks, the second one five, and the third one two. It ends up with an average pooling and a fully-connected layer.
-
-
-
-Figure 8. GoogleNet[12]
-
-
-The above model is the first version of GoogleNet, or GoogleNet-v1. GoogleNet-v2 [14] introduced the BN layer; GoogleNet-v3 [16] further factorized some convolutional layers, which increases non-linearity and network depth; GoogleNet-v4 [17] led to the design ideas of ResNet, which will be introduced in the next section. The evolution from v1 to v4 improved the accuracy consistently. We will not go into the details of the architectures of v2 to v4.
-
-### ResNet
-
-The Residual Network (ResNet) [15] won the championships of three ImageNet 2015 competitions: image classification, object localization, and object detection. The main challenge in training deeper networks is that accuracy degrades with network depth. The authors of ResNet proposed a residual learning approach to ease the difficulty of training deeper networks. Building on the design ideas of BN, small convolutional kernels and fully convolutional networks, ResNet reformulates layers as residual blocks. Each block contains two branches: one directly connects the input to the output, and the other performs two to three convolutions, computing the residual function with respect to the layer inputs. The outputs of the two branches are then added up.
-
-Figure 9 illustrates the ResNet building blocks. On the left is the basic building block, which consists of two 3x3 convolutional layers with the same number of channels. On the right is a bottleneck block. The first 1x1 convolutional layer, the bottleneck, reduces the number of channels from 256 to 64; the other 1x1 convolutional layer increases it from 64 back to 256. Thus, the number of input and output channels of the middle 3x3 convolutional layer is 64, which is relatively small.
-
-
-
-Figure 9. Residual block
-
-
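-A rough parameter count (ignoring biases) shows why the bottleneck design pays off: a pair of plain 3x3 convolutions operating directly on 256 channels would use about 2 x (3 x 3 x 256 x 256) ≈ 1.18 million weights, while the bottleneck's 1x1 (256 to 64), 3x3 (64 to 64) and 1x1 (64 to 256) layers use only about 16K + 37K + 16K ≈ 70 thousand weights.
-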
-Figure 10 illustrates ResNets with 50, 101, 152 layers, respectively. All three networks use bottleneck blocks of different numbers of repetitions. ResNet converges very fast and can be trained with hundreds or thousands of layers.
-
-
-
-Figure 10. ResNet model for ImageNet
-
-
-
-## Dataset
-
-Commonly used public datasets for image classification are [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html), [ImageNet](http://image-net.org/), [COCO](http://mscoco.org/), etc. Those used for fine-grained image classification are [CUB-200-2011](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), [Stanford Dog](http://vision.stanford.edu/aditya86/ImageNetDogs/), [Oxford-flowers](http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among these, the ImageNet dataset is the largest, and most research results are reported on ImageNet, as mentioned in the Model Overview section. Since 2010, the ImageNet dataset has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories. There are 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images with 50 images per category on average.
-
-Since ImageNet is too large to be downloaded and trained efficiently, we use [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Figure 11 shows all the classes in CIFAR-10 as well as 10 images randomly sampled from each category.
-
-
-
-Figure 11. CIFAR10 dataset[21]
-
-
- The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, `wmt14`, etc. There's no need to manually download and preprocess CIFAR-10.
-
-After issuing a command `python train.py`, training will start immediately. The following sections describe the details:
-
-## Model Structure
-
-### Initialize PaddlePaddle
-
-We must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).
-
-```python
-import sys
-import gzip
-import paddle.v2 as paddle
-from vgg import vgg_bn_drop
-from resnet import resnet_cifar10
-
-# PaddlePaddle init
-paddle.init(use_gpu=False, trainer_count=1)
-```
-
-As mentioned in section [Model Overview](#model-overview), here we provide the implementations of the VGG and ResNet models.
-
-### VGG
-
-First, we use a VGG network. Since the image size and the amount of data in CIFAR10 are relatively small compared to ImageNet, we use a smaller version of the VGG network for CIFAR10. Its convolution groups incorporate BN and dropout operations.
-
-1. Define input data and its dimension
-
- The input to the network is defined as `paddle.layer.data`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10.
-
- ```python
- datadim = 3 * 32 * 32
- classdim = 10
- image = paddle.layer.data(
- name="image", type=paddle.data_type.dense_vector(datadim))
- ```
-
-2. Define VGG main module
-
- ```python
- net = vgg_bn_drop(image)
- ```
- The input to VGG main module is from the data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail:
-
- ```python
- def vgg_bn_drop(input):
- def conv_block(ipt, num_filter, groups, dropouts, num_channels=None):
- return paddle.networks.img_conv_group(
- input=ipt,
- num_channels=num_channels,
- pool_size=2,
- pool_stride=2,
- conv_num_filter=[num_filter] * groups,
- conv_filter_size=3,
- conv_act=paddle.activation.Relu(),
- conv_with_batchnorm=True,
- conv_batchnorm_drop_rate=dropouts,
- pool_type=paddle.pooling.Max())
-
- conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
- conv2 = conv_block(conv1, 128, 2, [0.4, 0])
- conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
- conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
- conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
-
- drop = paddle.layer.dropout(input=conv5, dropout_rate=0.5)
- fc1 = paddle.layer.fc(input=drop, size=512, act=paddle.activation.Linear())
- bn = paddle.layer.batch_norm(
- input=fc1,
- act=paddle.activation.Relu(),
- layer_attr=paddle.attr.Extra(drop_rate=0.5))
- fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear())
- return fc2
- ```
-
- 2.1. First, define a convolution block, `conv_block`. The default convolution kernel is 3x3 and the default pooling size is 2x2 with stride 2. `dropouts` specifies the dropout probability of each convolution in the group. The function `img_conv_group`, defined in `paddle.networks`, consists of a series of `Conv->BN->ReLU->Dropout` operations followed by a `Pooling`.
-
- 2.2. Five groups of convolutions. The first two groups perform two convolutions, while the last three groups perform three convolutions. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for this layer.
-
- 2.3. The last two layers are fully-connected layers of dimension 512.
-
-3. Define Classifier
-
- The above VGG network extracts high-level features and maps them to a vector of the same size as the categories. Softmax function or classifier is then used for calculating the probability of the image belonging to each category.
-
- ```python
- out = paddle.layer.fc(input=net,
- size=classdim,
- act=paddle.activation.Softmax())
- ```
-
-4. Define Loss Function and Outputs
-
- In the context of supervised learning, labels of training images are defined in `paddle.layer.data` as well. During training, the cross-entropy loss function is used and the loss is the output of the network. During testing, the outputs are the probabilities calculated in the classifier.
-
- ```python
- lbl = paddle.layer.data(
- name="label", type=paddle.data_type.integer_value(classdim))
- cost = paddle.layer.classification_cost(input=out, label=lbl)
- ```
-
-### ResNet
-
-The first, third and fourth steps of ResNet are the same as those of VGG. The second step, the main module, is different.
-
-```python
-net = resnet_cifar10(image, depth=56)
-```
-
-Here are some basic functions used in `resnet_cifar10`:
-
- - `conv_bn_layer` : convolutional layer followed by BN.
- - `shortcut` : the shortcut branch in a residual block. There are two kinds of shortcuts: 1x1 convolution used when the number of channels between input and output is different; direct connection used otherwise.
-
- - `basicblock` : a basic residual module as shown in the left of Figure 9, it consists of two sequential 3x3 convolutions and one "shortcut" branch.
- `bottleneck` : a bottleneck module as shown on the right of Figure 9. It consists of two 1x1 convolutions with one 3x3 convolution in between, plus a "shortcut" branch.
- `layer_warp` : a group of residual modules made up of several stacked blocks. In each group, the stride of the first residual block may differ from that of the remaining blocks, in order to reduce the size of the feature maps along the horizontal and vertical directions.
-
-```python
-def conv_bn_layer(input,
- ch_out,
- filter_size,
- stride,
- padding,
- active_type=paddle.activation.Relu(),
- ch_in=None):
- tmp = paddle.layer.img_conv(
- input=input,
- filter_size=filter_size,
- num_channels=ch_in,
- num_filters=ch_out,
- stride=stride,
- padding=padding,
- act=paddle.activation.Linear(),
- bias_attr=False)
- return paddle.layer.batch_norm(input=tmp, act=active_type)
-
-def shortcut(ipt, n_in, n_out, stride):
- if n_in != n_out:
- return conv_bn_layer(ipt, n_out, 1, stride, 0,
- paddle.activation.Linear())
- else:
- return ipt
-
-def basicblock(ipt, ch_out, stride):
- ch_in = ch_out * 2
- tmp = conv_bn_layer(ipt, ch_out, 3, stride, 1)
- tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, paddle.activation.Linear())
- short = shortcut(ipt, ch_in, ch_out, stride)
- return paddle.layer.addto(input=[tmp, short], act=paddle.activation.Relu())
-
-def layer_warp(block_func, ipt, features, count, stride):
- tmp = block_func(ipt, features, stride)
- for i in range(1, count):
- tmp = block_func(tmp, features, 1)
- return tmp
-```
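-
-The `bottleneck` module described above is not needed by `resnet_cifar10`, so it is not included in the snippet. For completeness, a minimal sketch of it, reusing the `conv_bn_layer` and `shortcut` helpers above, might look like the following; the channel arithmetic follows the right side of Figure 9, and a 1x1 projection shortcut is always used so that the two branches have matching channels.
-
-```python
-def bottleneck(ipt, ch_out, stride):
-    # 1x1 conv reduces the channels, 3x3 conv processes them,
-    # and the final 1x1 conv expands them back to 4 * ch_out.
-    tmp = conv_bn_layer(ipt, ch_out, 1, stride, 0)
-    tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1)
-    tmp = conv_bn_layer(tmp, ch_out * 4, 1, 1, 0, paddle.activation.Linear())
-    # passing different n_in/n_out forces shortcut() to insert a 1x1 projection
-    short = shortcut(ipt, ch_out * 2, ch_out * 4, stride)
-    return paddle.layer.addto(input=[tmp, short], act=paddle.activation.Relu())
-```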
-
-The following are the components of `resnet_cifar10`:
-
-1. The lowest level is `conv_bn_layer`.
-2. The middle level consists of three `layer_warp`, each of which uses the left residual block in Figure 9.
-3. The last level is average pooling layer.
-
-Note: besides the first convolutional layer and the last fully-connected layer, the total number of layers in the three `layer_warp` groups must be divisible by 6; that is, the depth of `resnet_cifar10` must satisfy $(depth - 2) \% 6 == 0$. For example, depth=56 gives $n = (56 - 2) / 6 = 9$ basic blocks per group.
-
-```python
-def resnet_cifar10(ipt, depth=32):
- # depth should be one of 20, 32, 44, 56, 110, 1202
- assert (depth - 2) % 6 == 0
- n = (depth - 2) / 6
- nStages = {16, 64, 128}
- conv1 = conv_bn_layer(
- ipt, ch_in=3, ch_out=16, filter_size=3, stride=1, padding=1)
- res1 = layer_warp(basicblock, conv1, 16, n, 1)
- res2 = layer_warp(basicblock, res1, 32, n, 2)
- res3 = layer_warp(basicblock, res2, 64, n, 2)
- pool = paddle.layer.img_pool(
- input=res3, pool_size=8, stride=1, pool_type=paddle.pooling.Avg())
- return pool
-```
-
-## Model Training
-
-### Define Parameters
-
-First, we create the model parameters according to the previous model configuration `cost`.
-
-```python
-# Create parameters
-parameters = paddle.parameters.create(cost)
-```
-
-### Create Trainer
-
-Before creating a training module, it is necessary to set the algorithm.
-Here we specify `Momentum` optimization algorithm via `paddle.optimizer`.
-
-```python
-# Create optimizer
-momentum_optimizer = paddle.optimizer.Momentum(
- momentum=0.9,
- regularization=paddle.optimizer.L2Regularization(rate=0.0002 * 128),
- learning_rate=0.1 / 128.0,
- learning_rate_decay_a=0.1,
- learning_rate_decay_b=50000 * 100,
- learning_rate_schedule='discexp')
-
-# Create trainer
-trainer = paddle.trainer.SGD(cost=cost,
- parameters=parameters,
- update_equation=momentum_optimizer)
-```
-
-The learning rate adjustment policy is defined by the variables `learning_rate_decay_a` ($a$), `learning_rate_decay_b` ($b$) and `learning_rate_schedule`. In this example, the discrete exponential method is used to adjust the learning rate. The formula is as follows:
-$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
-where $n$ is the number of processed samples and $lr_{0}$ is the initial learning rate.
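-
-For intuition, this schedule can be reproduced in a few lines of plain Python; the default values below simply restate the optimizer settings from the snippet above:
-
-```python
-import math
-
-def discexp_lr(num_samples, lr0=0.1 / 128.0, a=0.1, b=50000 * 100):
-    """Discrete exponential decay: lr = lr0 * a ** floor(n / b)."""
-    return lr0 * a ** int(math.floor(num_samples / float(b)))
-
-# the learning rate drops by a factor of 10 after every 5,000,000 samples
-print discexp_lr(0), discexp_lr(5000000)
-```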
-
-### Training
-
-`cifar.train10()` yields records during each pass; after shuffling, batches are generated for training.
-
-```python
-reader=paddle.batch(
- paddle.reader.shuffle(
- paddle.dataset.cifar.train10(), buf_size=50000),
- batch_size=128)
-```
-
-`feeding` specifies the correspondence between each yielded record and `paddle.layer.data`. For instance, the first column of the data generated by `cifar.train10()` corresponds to the `image` layer's feature.
-
-```python
-feeding={'image': 0,
- 'label': 1}
-```
-
-The callback functions `event_handler_plot` and `event_handler` defined below are called during training whenever a pre-defined event happens.
-
-`event_handler_plot` is used to plot a figure like the one below:
-
-![png](./image/train_and_test.png)
-
-```python
-from paddle.v2.plot import Ploter
-
-train_title = "Train cost"
-test_title = "Test cost"
-cost_ploter = Ploter(train_title, test_title)
-
-step = 0
-def event_handler_plot(event):
- global step
- if isinstance(event, paddle.event.EndIteration):
- if step % 1 == 0:
- cost_ploter.append(train_title, step, event.cost)
- cost_ploter.plot()
- step += 1
- if isinstance(event, paddle.event.EndPass):
- result = trainer.test(
- reader=paddle.batch(
- paddle.dataset.cifar.test10(), batch_size=128),
- feeding=feeding)
- cost_ploter.append(test_title, step, result.cost)
-```
-
-`event_handler` is used to print some text data during training.
-
-```python
-# event handler to track training and testing process
-def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
- if event.batch_id % 100 == 0:
- print "\nPass %d, Batch %d, Cost %f, %s" % (
- event.pass_id, event.batch_id, event.cost, event.metrics)
- else:
- sys.stdout.write('.')
- sys.stdout.flush()
- if isinstance(event, paddle.event.EndPass):
- # save parameters
- with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
- parameters.to_tar(f)
-
- result = trainer.test(
- reader=paddle.batch(
- paddle.dataset.cifar.test10(), batch_size=128),
- feeding=feeding)
- print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
-```
-
-Finally, we can invoke `trainer.train` to start training:
-
-```python
-trainer.train(
- reader=reader,
- num_passes=200,
- event_handler=event_handler_plot,
- feeding=feeding)
-```
-
-Here is an example log after training for one pass. The classification error rate is 0.6875 on the last logged training batch and 0.8852 on the validation set.
-
-```text
-Pass 0, Batch 0, Cost 2.473182, {'classification_error_evaluator': 0.9140625}
-...................................................................................................
-Pass 0, Batch 100, Cost 1.913076, {'classification_error_evaluator': 0.78125}
-...................................................................................................
-Pass 0, Batch 200, Cost 1.783041, {'classification_error_evaluator': 0.7421875}
-...................................................................................................
-Pass 0, Batch 300, Cost 1.668833, {'classification_error_evaluator': 0.6875}
-..........................................................................................
-Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
-```
-
-Figure 12 shows the curve of the training error rate, which indicates that the model converges around Pass 200 with an error rate of 8.54%.
-
-
-Figure 12. The error rate of VGG model on CIFAR10
-
-
-
-
-## Application
-
-After training is done, users can use the trained model to classify images. The following code shows how to run inference through the `paddle.infer` interface. You can uncomment the indicated lines and change the file name to load the parameters saved at a different pass.
-
-```python
-from PIL import Image
-import numpy as np
-import os
-def load_image(file):
- im = Image.open(file)
- im = im.resize((32, 32), Image.ANTIALIAS)
- im = np.array(im).astype(np.float32)
- # The storage order of the loaded image is W(width),
- # H(height), C(channel). PaddlePaddle requires
- # the CHW order, so transpose them.
- im = im.transpose((2, 0, 1)) # CHW
- # In the training phase, the channel order of CIFAR
- # image is B(Blue), G(green), R(Red). But PIL open
- # image in RGB mode. It must swap the channel order.
- im = im[(2, 1, 0),:,:] # BGR
- im = im.flatten()
- im = im / 255.0
- return im
-test_data = []
-cur_dir = os.path.dirname(os.path.realpath(__file__))
-test_data.append((load_image(cur_dir + '/image/dog.png'), ))
-
-# users can remove the comments and change the model name
-# with gzip.open('params_pass_50.tar.gz', 'r') as f:
-# parameters = paddle.parameters.Parameters.from_tar(f)
-
-probs = paddle.infer(
- output_layer=out, parameters=parameters, input=test_data)
-lab = np.argsort(-probs) # probs and lab are the results of one batch data
-print "Label of image/dog.png is: %d" % lab[0][0]
-```
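-
-The conclusion below also mentions feature extraction. One way to do this, sketched here under the assumption that the `net` layer (the output of `vgg_bn_drop`) is still in scope, is to ask `paddle.infer` for the activations of an intermediate layer instead of the softmax output:
-
-```python
-# Illustration only: extract the 512-dimensional feature produced by the
-# last fully-connected VGG layer for the same test image.
-features = paddle.infer(
-    output_layer=net, parameters=parameters, input=test_data)
-print "Feature shape:", features.shape
-```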
-
-
-## Conclusion
-
-Traditional image classification methods involve multiple stages of processing, and their frameworks are complicated. In contrast, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduced three models -- VGG, GoogleNet and ResNet -- and provided PaddlePaddle config files for training VGG and ResNet on CIFAR10. We also explained how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the configuration and training procedure are the same, and you are welcome to give it a try.
-
-
-## Reference
-
-[1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004.
-
-[2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
-
-[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
-
-[4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
-
-[5] B. Olshausen, D. Field, [Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?](http://redwood.psych.cornell.edu/papers/olshausen_field_1997.pdf), Vision Research, vol. 37, pp. 3311-3325, 1997.
-
-[6] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. (2010). [Locality-constrained Linear Coding for image classification](http://ieeexplore.ieee.org/abstract/document/5540018/). In CVPR.
-
-[7] Perronnin, F., Sánchez, J., & Mensink, T. (2010). [Improving the fisher kernel for large-scale image classification](http://dl.acm.org/citation.cfm?id=1888101). In ECCV (4).
-
-[8] Lin, Y., Lv, F., Cao, L., Zhu, S., Yang, M., Cour, T., Yu, K., and Huang, T. (2011). [Large-scale image classification: Fast feature extraction and SVM training](http://ieeexplore.ieee.org/document/5995477/). In CVPR.
-
-[9] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). [ImageNet classification with deep convolutional neural networks](http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf). In NIPS.
-
-[10] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). arXiv preprint arXiv:1207.0580, 2012.
-
-[11] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. [Return of the Devil in the Details: Delving Deep into Convolutional Nets](https://arxiv.org/abs/1405.3531). BMVC, 2014.
-
-[12] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., [Going deeper with convolutions](https://arxiv.org/abs/1409.4842). In: CVPR. (2015)
-
-[13] Lin, M., Chen, Q., and Yan, S. [Network in network](https://arxiv.org/abs/1312.4400). In Proc. ICLR, 2014.
-
-[14] S. Ioffe and C. Szegedy. [Batch normalization: Accelerating deep network training by reducing internal covariate shift](https://arxiv.org/abs/1502.03167). In ICML, 2015.
-
-[15] K. He, X. Zhang, S. Ren, J. Sun. [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385). CVPR 2016.
-
-[16] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. [Rethinking the inception architecture for computer vision](https://arxiv.org/abs/1512.00567). In: CVPR. (2016).
-
-[17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016).
-
-[18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective](http://link.springer.com/article/10.1007/s11263-014-0733-5). International Journal of Computer Vision, 111(1), 98-136, 2015.
-
-[19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015.
-
-[20] http://deeplearning.net/tutorial/lenet.html
-
-[21] https://www.cs.toronto.edu/~kriz/cifar.html
-
-[22] http://cs231n.github.io/classification/
-
-
-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
-
-
-# Word2Vec
-
-This is intended as a reference tutorial. The source code of this tutorial lives on [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec).
-
-For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.en.md#running-the-book).
-
-## Background Introduction
-
-This section introduces the concept of **word embedding**, which is a vector representation of words. It is a popular technique used in natural language processing. Word embeddings support many Internet services, including search engines, advertising systems, and recommendation systems.
-
-### One-Hot Vectors
-
-Building these services requires us to quantify the similarity between two words or paragraphs. This calls for a new representation of all the words to make them more suitable for computation. An obvious way to achieve this is through the vector space model, where every word is represented as a **one-hot vector**.
-
-For each word, its vector representation has the corresponding entry in the vector as 1, and all other entries as 0. The lengths of one-hot vectors match the size of the dictionary. Each entry of a vector corresponds to the presence (or absence) of a word in the dictionary.
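-
-As a concrete (toy) illustration of this representation, here is a small numpy sketch with a made-up five-word dictionary:
-
-```python
-import numpy as np
-
-vocab = ['mother', 'day', 'carnation', 'flower', 'gift']  # toy dictionary
-
-def one_hot(word):
-    vec = np.zeros(len(vocab))
-    vec[vocab.index(word)] = 1.0   # 1 at the word's position, 0 elsewhere
-    return vec
-
-print one_hot('carnation')                              # a single 1 at index 2
-print np.dot(one_hot('mother'), one_hot('carnation'))   # 0.0 -- always orthogonal
-```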
-
-One-hot vectors are intuitive, yet they have limited usefulness. Take the example of an Internet advertising system: suppose a customer enters the query "Mother's Day", while an ad bids for the keyword "carnations". Because the one-hot vectors of these two words are perpendicular, the distance between them (whether measured by Euclidean distance or cosine similarity) would indicate little relevance. However, *we* know that these two queries are connected semantically, since people often gift their mothers bundles of carnation flowers on Mother's Day. This discrepancy is due to the low information capacity of each vector. That is, comparing the vector representations of two words does not assess their relevance sufficiently. To calculate their similarity accurately, we need more information, which could be learned from large amounts of data through machine learning methods.
-
-Like many machine learning models, word embeddings can represent knowledge in various ways. Another model may project a one-hot vector to an embedding vector of lower dimension, e.g. $embedding(mother's day) = [0.3, 4.2, -1.5, ...], embedding(carnations) = [0.2, 5.6, -2.3, ...]$. Mapping one-hot vectors onto an embedded vector space has the potential to bring the embedding vectors of similar words (either semantically or usage-wise) closer to each other, so that the cosine similarity between the corresponding vectors for words like "Mother's Day" and "carnations" is no longer zero.
-
-A word embedding model could be a probabilistic model, a co-occurrence matrix model, or a neural network. Before people started using neural networks to generate word embeddings, the traditional method was to compute a co-occurrence matrix $X$ of words. Here, $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ represents the number of co-occurrences of the $i$-th and $j$-th words of the vocabulary `V` across the whole corpus, and $|V|$ is the size of the vocabulary. By performing matrix decomposition on $X$, e.g. Singular Value Decomposition \[[5](#References)\]
-
-$$X = USV^T$$
-
-the resulting $U$ can be seen as the word embedding of all the words.
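-
-As a quick illustration of this decomposition, with a random stand-in for a real co-occurrence matrix:
-
-```python
-import numpy as np
-
-V = 6                                        # toy vocabulary size
-X = np.random.randint(0, 10, size=(V, V))    # stand-in co-occurrence counts
-X = X + X.T                                  # co-occurrence matrices are symmetric
-
-U, S, Vt = np.linalg.svd(X)
-embedding_dim = 3
-word_vectors = U[:, :embedding_dim]          # row i: a 3-d embedding of word i
-print word_vectors.shape                     # (6, 3)
-```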
-
-However, this method suffers from many drawbacks:
-1) Since many pairs of words don't co-occur, the co-occurrence matrix is sparse. To achieve good performance of matrix factorization, further treatment on word frequency is needed;
-2) The matrix is large, frequently on the order of $10^6*10^6$;
-3) We need to manually filter out stop words (like "although", "a", ...), otherwise these frequent words will affect the performance of matrix factorization.
-
-The neural network based model does not require storing huge hash tables of statistics over the whole corpus. It obtains the word embedding by learning from semantic information, and hence avoids the aforementioned problems of the traditional method. In this chapter, we will introduce the details of neural network word embedding models and how to train such a model in PaddlePaddle.
-
-## Results Demonstration
-
-In this section, after training the word embedding model, we use the data visualization algorithm $t$-SNE\[[4](#reference)\] to draw the word embedding vectors after projecting them onto a two-dimensional space (see figure below). From the figure we can see that semantically relevant words -- *a*, *the*, and *these*, or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business*, or *decision* and *japan* -- are far from each other.
-
-
-
- Figure 1. Two dimension projection of word embeddings
-
-
-### Cosine Similarity
-
-On the other hand, we know that the cosine similarity between two vectors falls in $[-1,1]$. Specifically, the cosine similarity is 1 when the vectors are identical, 0 when the vectors are perpendicular, and -1 when they point in opposite directions. That is, the cosine similarity between two vectors scales with their relevance. So we can calculate the cosine similarity of two word embedding vectors to represent their relevance:
-
-```
-please input two words: big huge
-similarity: 0.899180685161
-
-please input two words: from company
-similarity: -0.0997506977351
-```
-
-The above results could be obtained by running `calculate_dis.py`, which loads the words in the dictionary and their corresponding trained word embeddings. For detailed instruction, see section [Model Application](#Model Application).
-
-
-## Model Overview
-
-In this section, we will introduce three word embedding models: the N-gram model, CBOW, and Skip-gram, which all output the probability of a word given its immediate context.
-
-For N-gram model, we will first introduce the concept of language model, and implement it using PaddlePaddle in section [Model Training](#Model Training).
-
-The latter two models, which became popular recently, are neural word embedding models developed by Tomas Mikolov at Google \[[3](#reference)\]. Despite their apparent simplicity, these models train very well.
-
-### Language Model
-
-Before diving into word embedding models, we will first introduce the concept of **language model**. Language models build the joint probability function $P(w_1, ..., w_T)$ of a sentence, where $w_i$ is the i-th word in the sentence. The goal is to give higher probabilities to meaningful sentences, and lower probabilities to meaningless constructions.
-
-In general, models that generate the probability of a sequence can be applied to many fields, like machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition. Take information retrieval, for example: if you were to search for "how long is a football bame" (where bame is a medical noun), the search engine would ask whether you meant "how long is a football game" instead. This is because the probability of "how long is a football bame" is very low according to the language model; in addition, among all of the words easily confused with "bame", "game" would build the most probable sentence.
-
-#### Target Probability
-Consider the language model's target probability $P(w_1, ..., w_T)$. If the words in the sentence were independent, the joint probability of the whole sentence would be the product of each word's probability:
-
-$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$
-
-However, the probability of a word in a sentence typically depends on the words before it, so canonical language models construct the target probability using conditional probabilities:
-
-$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
-
-
-### N-gram neural model
-
-In computational linguistics, the n-gram is an important method of representing text. An n-gram is a contiguous sequence of n items from a given text. Based on the desired application scenario, each item could be a letter, a syllable or a word. The N-gram model is also an important method in statistical language modeling. When training language models with n-grams, the first (n-1) words of an n-gram are used to predict the *n*-th word.
-
-Yoshua Bengio and other scientists describe how to train a word embedding model using a neural network in the famous paper *A Neural Probabilistic Language Model* \[[1](#reference)\] published in 2003. The Neural Network Language Model (NNLM) described in the paper learns the language model and the word embedding simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on a large corpus, the model learns the word embedding and then uses it to compute the probability of whole sentences. This type of language model can alleviate the **curse of dimensionality**, i.e. the problem that the number of possible word sequences grows exponentially with the sequence length, so that most sequences seen at test time never appear in the training data. Note that the term *neural network language model* is ill-defined, so we will not use the name NNLM but only refer to it as the *N-gram neural model* in this section.
-
-We have previously described the language model using conditional probability, where the probability of the *t*-th word in a sentence depends on all $t-1$ words before it. Furthermore, since words further back have less influence on a word, and every word within an n-gram is only affected by its previous n-1 words, we have:
-
-$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$
-
-Given some real corpus in which all sentences are meaningful, the n-gram model should maximize the following objective function:
-
-$$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
-
-where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ represents the conditional probability of the current word $w_t$ given its previous $n-1$ words, and $R(\theta)$ represents parameter regularization term.
-
-
-
- Figure 2. N-gram neural network model
-
-
-
-Figure 2 shows the N-gram neural network model. From the bottom up, the model has the following components:
-
- For each sample, the model gets input $w_{t-n+1},...w_{t-1}$, and outputs the probability distribution over the $|V|$ words of the dictionary for the $t$-th word.
-
- Every input word $w_{t-n+1},...w_{t-1}$ first gets transformed into word embedding $C(w_{t-n+1}),...C(w_{t-1})$ through a transformation matrix.
-
- - All the word embeddings concatenate into a single vector, which is mapped (nonlinearly) into the $t$-th word hidden representation:
-
- $$g=U\tanh(\theta^Tx + b_1) + Wx + b_2$$
-
- where $x$ is the large vector concatenated from all the word embeddings representing the context; $\theta$, $U$, $b_1$, $b_2$ and $W$ are parameters connecting word embedding layers to the hidden layers. $g$ represents the unnormalized probability of the output word, $g_i$ represents the unnormalized probability of the output word being the i-th word in the dictionary.
-
- - Based on the definition of softmax, using normalized $g_i$, the probability that the output word is $w_t$ is represented as:
-
- $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_{i=1}^{|V|} e^{g_i}}$$
-
- - The cost of the entire network is a multi-class cross-entropy and can be described by the following loss function
-
- $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log\big(softmax(g_k^{i})\big)$$
-
- where $y_k^i$ represents the true label for the $k$-th class in the $i$-th sample ($0$ or $1$), $softmax(g_k^i)$ represents the softmax probability for the $k$-th class in the $i$-th sample.
-
-### Continuous Bag-of-Words model (CBOW)
-
-The CBOW model predicts the current word based on the $N$ words both before and after it. When $N=2$, the model is as shown in the figure below:
-
-
-
- Figure 3. CBOW model
-
-
-Specifically, by ignoring the order of words in the sequence, CBOW uses the average value of the word embedding of the context to predict the current word:
-
-$$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
-
-where $x_t$ is the word embedding of the $t$-th word. The classification score vector is $z=U*\text{context}$, the final classification $y$ uses softmax, and the loss function is the multi-class cross-entropy.
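-
-In toy numpy terms, scoring a single window with CBOW looks roughly like the following; `E` and `U` are random stand-ins for the learned embedding table and output projection:
-
-```python
-import numpy as np
-
-emb_dim, vocab_size = 8, 100
-E = np.random.randn(vocab_size, emb_dim)   # toy word embedding table
-U = np.random.randn(vocab_size, emb_dim)   # toy output projection
-
-context_ids = [3, 17, 42, 7]               # w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}
-context = E[context_ids].mean(axis=0)      # average of the context embeddings
-z = U.dot(context)                         # classification scores
-p = np.exp(z - z.max()); p /= p.sum()      # softmax over the vocabulary
-print p.argmax()                           # id of the predicted center word
-```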
-
-### Skip-gram model
-
-The advantage of CBOW is that it smooths over the word embeddings of the context and reduces noise, so it is very effective on small datasets. Skip-gram uses a word to predict its context, obtaining multiple context words for each given word, so it can be used on larger datasets.
-
-
-
- Figure 4. Skip-gram model
-
-
-As illustrated in the figure above, the skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including the $n$ words before and the $n$ words after the given word), and then combines the classification losses of all those $2n$ words by softmax.
-
-## Dataset
-
-We will use the Penn Treebank (PTB) dataset (Tomas Mikolov's pre-processed version). PTB is a small dataset, used in the Recurrent Neural Network Language Modeling Toolkit \[[2](#reference)\]. Its statistics are as follows:
-
-| training set | validation set | test set |
-| :---: | :---: | :---: |
-| ptb.train.txt | ptb.valid.txt | ptb.test.txt |
-| 42068 lines | 3370 lines | 3761 lines |
-
-### Python Dataset Module
-
-We encapsulated the PTB dataset in our Python module `paddle.dataset.imikolov`. This module can
-
-1. download the dataset to `~/.cache/paddle/dataset/imikolov`, if it is not already there, and
-2. [preprocess](#preprocessing) the dataset.
-
-### Preprocessing
-
-We will be training a 5-gram model: within each window of five words, the first four words are used to predict the fifth.
-
-The beginning and the end of a sentence have a special meaning, so we prepend a begin-of-sentence token `<s>` and append an end-of-sentence token `<e>` to every sentence. Data instances are then generated by sliding the five-word window over the sentence from beginning to end.
-
-For example, the sentence "I have a dream that one day" generates five data instances:
-
-```text
-<s> I have a dream
-I have a dream that
-have a dream that one
-a dream that one day
-dream that one day <e>
-```
-
-Finally, each data instance is converted into a sequence of integers according to the indices of its words in the dictionary.
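-
-The windowing step itself is tiny. Here is a sketch of what `paddle.dataset.imikolov` effectively does for a single sentence; the word-to-index dictionary below is built only from the example sentence, whereas the real module builds it from the whole training set:
-
-```python
-sentence = '<s> I have a dream that one day <e>'.split()
-word_idx = {w: i for i, w in enumerate(sorted(set(sentence)))}  # toy dictionary
-
-N = 5   # 5-gram: four context words predict the fifth word
-for i in range(len(sentence) - N + 1):
-    window = sentence[i:i + N]
-    print [word_idx[w] for w in window[:-1]], '->', word_idx[window[-1]]
-```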
-
-## Training
-
-The neural network that we will be using is illustrated in the graph below:
-
-
-
- Figure 5. N-gram neural network model in model configuration
-
-
-`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:
-
-- Import packages.
-
-```python
-import math
-import paddle.v2 as paddle
-```
-
-- Configure parameter.
-
-```python
-embsize = 32 # word vector dimension
-hiddensize = 256 # hidden layer dimension
-N = 5 # train 5-gram
-```
-
- Map the $n-1$ words $w_{t-n+1},...w_{t-1}$ before $w_t$ to $D$-dimensional vectors through a matrix of dimension $|V|\times D$ ($D=32$ in this example).
-
-```python
-def wordemb(inlayer):
- wordemb = paddle.layer.table_projection(
- input=inlayer,
- size=embsize,
- param_attr=paddle.attr.Param(
- name="_proj",
- initial_std=0.001,
- learning_rate=1,
- l2_rate=0,
- sparse_update=True))
- return wordemb
-```
-
-- Define name and type for input to data layer.
-
-```python
-paddle.init(use_gpu=False, trainer_count=3)
-word_dict = paddle.dataset.imikolov.build_dict()
-dict_size = len(word_dict)
-# Every layer takes integer value of range [0, dict_size)
-firstword = paddle.layer.data(
- name="firstw", type=paddle.data_type.integer_value(dict_size))
-secondword = paddle.layer.data(
- name="secondw", type=paddle.data_type.integer_value(dict_size))
-thirdword = paddle.layer.data(
- name="thirdw", type=paddle.data_type.integer_value(dict_size))
-fourthword = paddle.layer.data(
- name="fourthw", type=paddle.data_type.integer_value(dict_size))
-nextword = paddle.layer.data(
- name="fifthw", type=paddle.data_type.integer_value(dict_size))
-
-Efirst = wordemb(firstword)
-Esecond = wordemb(secondword)
-Ethird = wordemb(thirdword)
-Efourth = wordemb(fourthword)
-```
-
-- Concatenate n-1 word embedding vectors into a single feature vector.
-
-```python
-contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
-```
-
-- Feature vector will go through a fully connected layer which outputs a hidden feature vector.
-
-```python
-hidden1 = paddle.layer.fc(input=contextemb,
- size=hiddensize,
- act=paddle.activation.Sigmoid(),
- layer_attr=paddle.attr.Extra(drop_rate=0.5),
- bias_attr=paddle.attr.Param(learning_rate=2),
- param_attr=paddle.attr.Param(
- initial_std=1. / math.sqrt(embsize * 8),
- learning_rate=1))
-```
-
- The hidden feature vector then goes through another fully connected layer, turning it into a $|V|$-dimensional vector. At the same time, softmax is applied to obtain the probability of each word being generated.
-
-```python
-predictword = paddle.layer.fc(input=hidden1,
- size=dict_size,
- bias_attr=paddle.attr.Param(learning_rate=2),
- act=paddle.activation.Softmax())
-```
-
-- We will use cross-entropy cost function.
-
-```python
-cost = paddle.layer.classification_cost(input=predictword, label=nextword)
-```
-
-- Create parameter, optimizer and trainer.
-
-```python
-parameters = paddle.parameters.create(cost)
-adagrad = paddle.optimizer.AdaGrad(
- learning_rate=3e-3,
- regularization=paddle.optimizer.L2Regularization(8e-4))
-trainer = paddle.trainer.SGD(cost, parameters, adagrad)
-```
-
-Next, we will begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` are our training and test sets. Each of these functions returns a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator which yields a single data instance at a time.
-
-`paddle.batch` takes a reader as input and outputs a **batched reader**: whereas a reader yields a single data instance at a time, a batched reader yields a minibatch of data instances.
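-
-To make the reader concept concrete, here is a toy reader, unrelated to the PTB data, and how `paddle.batch` wraps it; the real training code below uses the `imikolov` readers instead:
-
-```python
-def toy_reader():
-    # a reader is just a function that yields one sample at a time
-    for i in range(8):
-        yield [i, i + 1, i + 2, i + 3], i + 4   # (4-word context, next word)
-
-batched_reader = paddle.batch(toy_reader, batch_size=2)
-for batch in batched_reader():
-    print batch   # a list of 2 samples per iteration
-```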
-
-```python
-import gzip
-
-def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
- if event.batch_id % 100 == 0:
- print "Pass %d, Batch %d, Cost %f, %s" % (
- event.pass_id, event.batch_id, event.cost, event.metrics)
-
- if isinstance(event, paddle.event.EndPass):
- result = trainer.test(
- paddle.batch(
- paddle.dataset.imikolov.test(word_dict, N), 32))
- print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
- with gzip.open("model_%d.tar.gz"%event.pass_id, 'w') as f:
- parameters.to_tar(f)
-
-trainer.train(
- paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
- num_passes=100,
- event_handler=event_handler)
-```
-
-`trainer.train` will start training; the output of `event_handler` will be similar to the following:
-```text
-Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
-Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
-Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
-...
-```
-
-After 30 passes, we get an average error rate of around 0.735611.
-
-
-## Model Application
-
-After the model is trained, we can load the saved model parameters and use them in other models, or use the parameters directly in applications.
-
-### Viewing Word Vector
-
-Parameters trained by PaddlePaddle can be retrieved with `parameters.get()`. For example, we can check the word vector for the word `apple`.
-
-```python
-embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
-
-print embeddings[word_dict['apple']]
-```
-
-```text
-[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
--0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
-0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
--0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
-0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
-0.19072419 -0.24286366]
-```
-
-### Modifying Word Vector
-
-The word vectors (`embeddings`) that we get form a numpy array. We can modify this array and set it back into `parameters`.
-
-
-```python
-def modify_embedding(emb):
- # Add your modification here.
- pass
-
-modify_embedding(embeddings)
-parameters.set("_proj", embeddings)
-```
-
-### Calculating Cosine Similarity
-
-Cosine similarity is one way of quantifying the similarity between two vectors. Its range is $[-1, 1]$; the larger the value, the more similar the two vectors are. Note that `scipy.spatial.distance.cosine`, used below, returns the cosine *distance*, i.e. one minus the cosine similarity:
-
-
-```python
-from scipy import spatial
-
-emb_1 = embeddings[word_dict['world']]
-emb_2 = embeddings[word_dict['would']]
-
-print spatial.distance.cosine(emb_1, emb_2)  # cosine distance = 1 - cosine similarity
-```
-
-```text
-0.99375076448
-```
-
-## Conclusion
-
-This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.
-
-In information retrieval, the relevance between the query and document keyword can be computed through the cosine similarity of their word embeddings. In grammar analysis and semantic analysis, a previously trained word embedding can initialize models for better performance. In document classification, clustering the word embedding can group synonyms in the documents. We hope that readers can use word embedding models in their work after reading this chapter.
-
-
-## References
-1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. journal of machine learning research, 2003, 3(Feb): 1137-1155.
-2. Mikolov T, Kombrink S, Deoras A, et al. [Rnnlm-recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201.
-3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013.
-4. Maaten L, Hinton G. [Visualizing data using t-SNE](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf)[J]. Journal of Machine Learning Research, 2008, 9(Nov): 2579-2605.
-5. https://en.wikipedia.org/wiki/Singular_value_decomposition
-
-
-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
-
-
-# Personalized Recommendation
-
-The source code of this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system).
-
-For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.en.md#running-the-book).
-
-
-## Background
-
-With the fast growth of e-commerce, online video, and online reading businesses, users have to rely on recommender systems to avoid manually browsing a tremendous volume of choices. Recommender systems understand users' interests by mining user behavior and other properties of users and products.
-
-Some well-known approaches include:
-
-- User behavior-based approach. A well-known method is collaborative filtering. The underlying assumption is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person.
-
-- Content-based recommendation[[1](#reference)]. This approach infers feature vectors that represent products from their descriptions. It also infers feature vectors that represent users' interests. Then it measures the relevance of users and products by some distances between these feature vectors.
-
-- Hybrid approach[[2](#reference)]: This approach uses the content-based information to help address the cold start problem[[6](#reference)] in behavior-based approach.
-
-Among these options, collaborative filtering might be the most studied one. Some of its variants include user-based[[3](#reference)], item-based [[4](#reference)], social network based[[5](#reference)], and model-based.
-
-This tutorial explains a deep learning based approach and how to implement it using PaddlePaddle. We will train a model using a dataset that includes user information, movie information, and ratings. Once we train the model, we will be able to get a predicted rating given a pair of user and movie IDs.
-
-
-## Model Overview
-
-To learn more about deep learning based recommendation, let us start by going over the YouTube recommender system[[7](#reference)] before introducing our hybrid model.
-
-
-### YouTube's Deep Learning Recommendation Model
-
-YouTube is a video-sharing website with one of the largest user bases in the world. Its recommender system serves more than a billion users. The system is composed of two major parts: candidate generation and ranking. The former selects a few hundred candidates from millions of videos, and the latter ranks them and outputs the top 10.
-
-
-
-Figure 1. YouTube recommender system overview.
-
-
-#### Candidate Generation Network
-
-Youtube models candidate generation as a multiclass classification problem with a huge number of classes equal to the number of videos. The architecture of the model is as follows:
-
-
-
-Figure 2. Deep candidate generation model.
-
-
-The first stage of this model maps watching history and search queries into fixed-length representative features. Then, an MLP (multi-layer perceptron, as described in the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/recognize_digits/README.md) tutorial) takes the concatenation of all representative vectors. The output of the MLP represents the user's *intrinsic interests*. At training time, it is used together with a softmax output layer for minimizing the classification error. At serving time, it is used to compute the relevance of the user to all movies.
-
-For a user $U$, the predicted watching probability of video $i$ is
-
-$$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
-
-where $u$ is the representative vector of user $U$, $V$ is the corpus of all videos, $v_i$ is the representative vector of the $i$-th video. $u$ and $v_i$ are vectors of the same length, so we can compute their dot product using a fully connected layer.
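-
-A toy numpy version of this scoring step, with random stand-ins for the user vector and the per-video vectors:
-
-```python
-import numpy as np
-
-dim, num_videos = 16, 1000
-u = np.random.randn(dim)                # user representation from the MLP
-V = np.random.randn(num_videos, dim)    # one representative vector per video
-
-scores = V.dot(u)                       # dot product u . v_i for every video
-p = np.exp(scores - scores.max())
-p /= p.sum()                            # softmax over all videos
-print p.argsort()[-10:][::-1]           # ids of the 10 most probable videos
-```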
-
-This model could have a performance issue as the softmax output covers millions of classification labels. To optimize performance, at the training time, the authors down-sample negative samples, so the actual number of classes is reduced to thousands. At serving time, the authors ignore the normalization of the softmax outputs, because the results are just for ranking.
-
-#### Ranking Network
-
-The architecture of the ranking network is similar to that of the candidate generation network. Similar to ranking models widely used in online advertising, it uses rich features like video ID, last watching time, etc. The output layer of the ranking network is a weighted logistic regression, which rates all candidate videos.
-
-### Hybrid Model
-
-In this section, let us introduce our movie recommendation system. In particular, we feed movie titles into a text convolution network to get a fixed-length representative feature vector. Accordingly, we will introduce the convolutional neural network for texts and then the hybrid recommendation model.
-
-#### Convolutional Neural Networks for Texts (CNN)
-
-**Convolutional Neural Networks** are frequently applied to data with grid-like topology such as two-dimensional images and one-dimensional texts. A CNN can extract multiple local features, combine them, and produce high-level abstractions, which correspond to semantic understanding. Empirically, CNN is shown to be efficient for image and text modeling.
-
-A CNN mainly consists of convolution and pooling operations, combined in versatile ways across applications. Here, we briefly describe the CNN shown in Figure 3.
-
-
-
-
-Figure 3. CNN for text modeling.
-
-
-Let $n$ be the length of the sentence to process, and let the $i$-th word have embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality.
-
-First, we concatenate the words by piecing together every $h$ words, each as a window of length $h$. This window is denoted as $x_{i:i+h-1}$, consisting of $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $x_i$ is the first word in the window and $i$ takes value ranging from $1$ to $n-h+1$: $x_{i:i+h-1}\in\mathbb{R}^{hk}$.
-
-Next, we apply the convolution operation: we apply the kernel $w\in\mathbb{R}^{hk}$ in each window, extracting features $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a non-linear activation function such as $sigmoid$. Convolving by the kernel at every window ${x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}}$ produces a feature map in the following form:
-
-$$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$
-
-Next, we apply *max pooling* over time to represent the whole sentence $\hat c$, which is the maximum element across the feature map:
-
-$$\hat c=max(c)$$
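-
-The convolution-plus-max-pooling computation above fits in a few lines of numpy; the numbers below (one kernel, a seven-word sentence) are arbitrary:
-
-```python
-import numpy as np
-
-n, k, h = 7, 4, 3                        # sentence length, embedding size, window
-x = np.random.randn(n, k)                # toy word embeddings x_1 ... x_n
-w = np.random.randn(h * k)               # a single convolution kernel
-b = 0.1
-
-def f(v):                                # non-linear activation (sigmoid)
-    return 1.0 / (1.0 + np.exp(-v))
-
-c = np.array([f(w.dot(x[i:i + h].reshape(-1)) + b) for i in range(n - h + 1)])
-c_hat = c.max()                          # max pooling over time
-print c.shape, c_hat                     # feature map of length n-h+1, one scalar
-```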
-
-#### Model Structure Of The Hybrid Model
-
-In our network, the input includes features of users and movies. The user feature includes four properties: user ID, gender, occupation, and age. Movie features include their IDs, genres, and titles.
-
-We use fully-connected layers to map user features into representative feature vectors and concatenate them. Movie features are processed similarly, except that movie titles are fed into a text convolution network, as described in the previous section, to get a fixed-length representative feature vector.
-
-Given the feature vectors of users and movies, we compute the relevance using cosine similarity. We minimize the squared error at training time.
-
-
-
-Figure 4. A hybrid recommendation model.
-
-
-## Dataset
-
-We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) dataset to train our model. It contains about one million ratings of roughly 4,000 movies from about 6,000 users. Each rating is an integer between 1 and 5. Thanks to GroupLens Research for collecting, processing and publishing the dataset.
-
-The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, `wmt14`, etc. There's no need for us to manually download and preprocess the `MovieLens` dataset.
-
-The raw `MovieLens` dataset contains movie ratings and relevant features of both movies and users.
-For instance, one movie's features could be:
-
-```python
-import paddle.v2 as paddle
-movie_info = paddle.dataset.movielens.movie_info()
-print movie_info.values()[0]
-```
-
-```text
-
-```
-
-One user's feature could be:
-
-```python
-user_info = paddle.dataset.movielens.user_info()
-print user_info.values()[0]
-```
-
-```text
-
-```
-
-In this dataset, ages are grouped into the following bins:
-
-```text
-1: "Under 18"
-18: "18-24"
-25: "25-34"
-35: "35-44"
-45: "45-49"
-50: "50-55"
-56: "56+"
-```
-
-User's occupation is selected from the following options:
-
-```text
-0: "other" or not specified
-1: "academic/educator"
-2: "artist"
-3: "clerical/admin"
-4: "college/grad student"
-5: "customer service"
-6: "doctor/health care"
-7: "executive/managerial"
-8: "farmer"
-9: "homemaker"
-10: "K-12 student"
-11: "lawyer"
-12: "programmer"
-13: "retired"
-14: "sales/marketing"
-15: "scientist"
-16: "self-employed"
-17: "technician/engineer"
-18: "tradesman/craftsman"
-19: "unemployed"
-20: "writer"
-```
-
-Each record consists of three main components: user features, movie features and a movie rating.
-As a simple example, consider the following:
-
-```python
-train_set_creator = paddle.dataset.movielens.train()
-train_sample = next(train_set_creator())
-uid = train_sample[0]
-mov_id = train_sample[len(user_info[uid].value())]
-print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1])
-```
-
-```text
-User rates Movie with Score [5.0]
-```
-
-The output shows that user 1 gave movie `1193` a rating of 5.
-
-After issuing the command `python train.py`, training will start immediately. The following sections unpack the details of how it works.
-
-## Model Architecture
-
-### Initialize PaddlePaddle
-
-First, we must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).
-
-```python
-import paddle.v2 as paddle
-paddle.init(use_gpu=False)
-```
-
-### Model Configuration
-
-```python
-uid = paddle.layer.data(
- name='user_id',
- type=paddle.data_type.integer_value(
- paddle.dataset.movielens.max_user_id() + 1))
-usr_emb = paddle.layer.embedding(input=uid, size=32)
-usr_fc = paddle.layer.fc(input=usr_emb, size=32)
-
-usr_gender_id = paddle.layer.data(
- name='gender_id', type=paddle.data_type.integer_value(2))
-usr_gender_emb = paddle.layer.embedding(input=usr_gender_id, size=16)
-usr_gender_fc = paddle.layer.fc(input=usr_gender_emb, size=16)
-
-usr_age_id = paddle.layer.data(
- name='age_id',
- type=paddle.data_type.integer_value(
- len(paddle.dataset.movielens.age_table)))
-usr_age_emb = paddle.layer.embedding(input=usr_age_id, size=16)
-usr_age_fc = paddle.layer.fc(input=usr_age_emb, size=16)
-
-usr_job_id = paddle.layer.data(
- name='job_id',
- type=paddle.data_type.integer_value(
- paddle.dataset.movielens.max_job_id() + 1))
-usr_job_emb = paddle.layer.embedding(input=usr_job_id, size=16)
-usr_job_fc = paddle.layer.fc(input=usr_job_emb, size=16)
-```
-
-As shown in the code above, the input for each user consists of four integers: `user_id`, `gender_id`, `age_id` and `job_id`. To handle these features conveniently, we use the embedding technique from NLP language models to transform these discrete values into embedding values `usr_emb`, `usr_gender_emb`, `usr_age_emb` and `usr_job_emb`.
-
-```python
-usr_combined_features = paddle.layer.fc(
- input=[usr_fc, usr_gender_fc, usr_age_fc, usr_job_fc],
- size=200,
- act=paddle.activation.Tanh())
-```
-
-Then, taking the user features as input, we connect them directly to a fully-connected layer, which fuses them into a single 200-dimensional vector.
-
-Furthermore, we do a similar transformation for each movie feature. The model configuration is:
-
-```python
-mov_id = paddle.layer.data(
- name='movie_id',
- type=paddle.data_type.integer_value(
- paddle.dataset.movielens.max_movie_id() + 1))
-mov_emb = paddle.layer.embedding(input=mov_id, size=32)
-mov_fc = paddle.layer.fc(input=mov_emb, size=32)
-
-mov_categories = paddle.layer.data(
- name='category_id',
- type=paddle.data_type.sparse_binary_vector(
- len(paddle.dataset.movielens.movie_categories())))
-mov_categories_hidden = paddle.layer.fc(input=mov_categories, size=32)
-
-movie_title_dict = paddle.dataset.movielens.get_movie_title_dict()
-mov_title_id = paddle.layer.data(
- name='movie_title',
- type=paddle.data_type.integer_value_sequence(len(movie_title_dict)))
-mov_title_emb = paddle.layer.embedding(input=mov_title_id, size=32)
-mov_title_conv = paddle.networks.sequence_conv_pool(
- input=mov_title_emb, hidden_size=32, context_len=3)
-
-mov_combined_features = paddle.layer.fc(
- input=[mov_fc, mov_categories_hidden, mov_title_conv],
- size=200,
- act=paddle.activation.Tanh())
-```
-
-The movie title, a sequence of words represented by an integer word-index sequence, is fed into a `sequence_conv_pool` layer, which applies convolution and pooling along the time dimension. Because pooling is done over the time dimension, the output is a fixed-length vector regardless of the length of the input sequence.
-
-Finally, we can use cosine similarity to calculate the similarity between user characteristics and movie features.
-
-```python
-inference = paddle.layer.cos_sim(a=usr_combined_features, b=mov_combined_features, size=1, scale=5)
-cost = paddle.layer.mse_cost(
- input=inference,
- label=paddle.layer.data(
- name='score', type=paddle.data_type.dense_vector(1)))
-```
-
-## Model Training
-
-### Define Parameters
-
-First, we define the model parameters according to the previous model configuration `cost`.
-
-```python
-# Create parameters
-parameters = paddle.parameters.create(cost)
-```
-
-### Create Trainer
-
-Before creating the training module, we need to set the optimization algorithm. Here we specify the Adam optimization algorithm via `paddle.optimizer`.
-
-```python
-trainer = paddle.trainer.SGD(cost=cost, parameters=parameters,
- update_equation=paddle.optimizer.Adam(learning_rate=1e-4))
-```
-
-```text
-[INFO 2017-03-06 17:12:13,378 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]
-[INFO 2017-03-06 17:12:13,379 networks.py:1478] The output order is [__mse_cost_0__]
-```
-
-### Training
-
-`paddle.dataset.movielens.train` yields records during each pass; after shuffling, batches are generated for training.
-
-```python
-reader=paddle.batch(
- paddle.reader.shuffle(
- paddle.dataset.movielens.train(), buf_size=8192),
- batch_size=256)
-```
-
-`feeding` specifies the correspondence between each yielded record and `paddle.layer.data`. For instance, the first column of data generated by `movielens.train` corresponds to the `user_id` feature.
-
-```python
-feeding = {
- 'user_id': 0,
- 'gender_id': 1,
- 'age_id': 2,
- 'job_id': 3,
- 'movie_id': 4,
- 'category_id': 5,
- 'movie_title': 6,
- 'score': 7
-}
-```
-
-The callback functions `event_handler` and `event_handler_plot` will be called during training when a pre-defined event happens.
-
-```python
-def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
- if event.batch_id % 100 == 0:
- print "Pass %d Batch %d Cost %.2f" % (
- event.pass_id, event.batch_id, event.cost)
-```
-
-```python
-from paddle.v2.plot import Ploter
-
-train_title = "Train cost"
-test_title = "Test cost"
-cost_ploter = Ploter(train_title, test_title)
-
-step = 0
-
-def event_handler_plot(event):
- global step
- if isinstance(event, paddle.event.EndIteration):
- if step % 10 == 0: # every 10 batches, record a train cost
- cost_ploter.append(train_title, step, event.cost)
-
- if step % 1000 == 0: # every 1000 batches, record a test cost
- result = trainer.test(
- reader=paddle.batch(
- paddle.dataset.movielens.test(), batch_size=256),
- feeding=feeding)
- cost_ploter.append(test_title, step, result.cost)
-
- if step % 100 == 0: # every 100 batches, update cost plot
- cost_ploter.plot()
-
- step += 1
-```
-
-Finally, we can invoke `trainer.train` to start training:
-
-```python
-trainer.train(
- reader=reader,
- event_handler=event_handler_plot,
- feeding=feeding,
- num_passes=2)
-```
-
-## Conclusion
-
-This tutorial goes over traditional approaches to recommender systems as well as a deep learning based approach. We also show how to train and use the model with PaddlePaddle. Deep learning has been widely used in computer vision and NLP, and we look forward to its new successes in recommender systems.
-
-## Reference
-
-1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
-2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
-3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.
-4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001.
-5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA
-6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
-7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.
-
-
-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
-
-
-# Sentiment Analysis
-
-The source codes of this section can be located at [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment). First-time users may refer to PaddlePaddle for [Installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.en.md#running-the-book).
-
-## Background
-
-In natural language processing, sentiment analysis refers to determining the emotion expressed in a piece of text. The text can be a sentence, a paragraph, or a document. Emotion categorization can be binary -- positive/negative or happy/sad -- or in three classes -- positive/neutral/negative. Sentiment analysis is applicable in a wide range of services, such as e-commerce sites like Amazon and Taobao, hospitality services like Airbnb and hotels.com, and movie rating sites like Rotten Tomatoes and IMDB. It can be used to gauge from the reviews how the customers feel about the product. Table 1 illustrates an example of sentiment analysis in movie reviews:
-
-| Movie Review | Category |
-| -------- | ----- |
-| Best movie of Xiaogang Feng in recent years!| Positive |
-| Pretty bad. Feels like a tv-series from a local TV-channel | Negative |
-| Politically correct version of Taken ... and boring as Heck| Negative|
-|delightful, mesmerizing, and completely unexpected. The plot is nicely designed.|Positive|
-
-
-Table 1. Sentiment Analysis in Movie Reviews
-
-In natural language processing, sentiment analysis can be categorized as a **Text Classification problem**, i.e., categorizing a piece of text into a specific class. It involves two related tasks: text representation and classification. Before the emergence of deep learning techniques, the mainstream methods for text representation included BOW (*bag of words*) and topic modeling, while the mainstream classification methods included SVM (*support vector machine*) and LR (*logistic regression*).
-
-The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad“ and “boring, dull, and empty work” describe very similar semantic meaning, yet their BOW representations have little similarity. Furthermore, “the movie is bad“ and “the movie is not bad“ have high similarity with BOW features, but they express completely opposite semantics.
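-
-This weakness is easy to check numerically. The following small, self-contained Python sketch (a toy illustration, not part of the tutorial's model) compares the two pairs of reviews above using bag-of-words cosine similarity:
-
-```python
-from collections import Counter
-import math
-
-def bow_cosine(a, b):
-    """Cosine similarity between two bag-of-words count vectors."""
-    va, vb = Counter(a.split()), Counter(b.split())
-    dot = sum(va[w] * vb[w] for w in va)
-    norm_a = math.sqrt(sum(c * c for c in va.values()))
-    norm_b = math.sqrt(sum(c * c for c in vb.values()))
-    return dot / (norm_a * norm_b)
-
-# Opposite sentiments, yet nearly identical BOW vectors (about 0.89):
-print bow_cosine("the movie is bad", "the movie is not bad")
-# Similar sentiments, yet no overlapping words at all (0.0):
-print bow_cosine("this movie is extremely bad", "boring dull and empty work")
-```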
-
-This chapter introduces a deep learning model that addresses these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework, and it achieves large performance improvements over traditional methods \[[1](#Reference)\].
-
-## Model Overview
-
-The models used in this chapter are **Convolutional Neural Networks** (**CNNs**) and **Recurrent Neural Networks** (**RNNs**) with some specific extensions.
-
-
-### Revisiting Convolutional Neural Networks for Texts (CNN)
-
-The convolutional neural network for texts was introduced in the [recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system) chapter; here we give a brief overview.
-
-CNN mainly involves convolution and pooling operations, which can be combined in versatile ways in various applications. We first apply the convolution operation: the kernel is applied to each window of words to extract features, and convolving the kernel over every window produces a feature map. Next, we apply *max pooling* over time, taking the maximum element of the feature map to represent the whole sentence. In real applications, we apply multiple CNN kernels to the sentence; this can be implemented efficiently by concatenating the kernels together into a matrix. We can also use CNN kernels with different kernel sizes. Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax layer to form the model for the sentiment analysis problem.
-
-For short texts, the aforementioned CNN model can achieve very high accuracy \[[1](#Reference)\]. If we want to extract more abstract representations, we may apply a deeper CNN model \[[2](#Reference),[3](#Reference)\].
-
-### Recurrent Neural Network (RNN)
-
-RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete \[[4](#Reference)\]. Since NLP is a classical problem on sequential data, the RNN, especially its variant LSTM\[[5](#Reference)\], achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth.
-
-
-
-Figure 1. An illustration of an unfolded RNN in time.
-
-
-As shown in Figure 1, we unfold an RNN: at the $t$-th time step, the network takes two inputs: the $t$-th input vector $\vec{x_t}$ and the latent state from the last time-step $\vec{h_{t-1}}$. From those, it computes the latent state of the current step $\vec{h_t}$. This process is repeated until all inputs are consumed. Denoting the RNN as function $f$, it can be formulated as follows:
-
-$$\vec{h_t}=f(\vec{x_t},\vec{h_{t-1}})=\sigma(W_{xh}\vec{x_t}+W_{hh}\vec{h_{t-1}}+\vec{b_h})$$
-
-where $W_{xh}$ is the input-to-latent weight matrix, $W_{hh}$ is the latent-to-latent matrix, $\vec{b_h}$ is the latent bias, and $\sigma$ refers to the sigmoid function.
-
-In NLP, words are often represented as one-hot vectors and then mapped to embeddings. The embedded feature goes through an RNN as input $x_t$ at every time step. Moreover, we can add other layers on top of the RNN, such as a deep or stacked RNN. Finally, the last latent state may be used as a feature for sentence classification.
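-
-To make the recurrence concrete, here is a minimal NumPy sketch of the simple RNN above. The sizes and randomly initialized parameters are illustrative assumptions, not part of the tutorial's trained model.
-
-```python
-import numpy as np
-
-def sigmoid(x):
-    return 1.0 / (1.0 + np.exp(-x))
-
-emb_dim, hid_dim = 8, 16
-W_xh = np.random.randn(emb_dim, hid_dim) * 0.1   # input-to-latent weights
-W_hh = np.random.randn(hid_dim, hid_dim) * 0.1   # latent-to-latent weights
-b_h = np.zeros(hid_dim)                          # latent bias
-
-x_seq = np.random.randn(5, emb_dim)              # an embedded 5-word sentence
-h = np.zeros(hid_dim)                            # initial latent state
-for x_t in x_seq:                                # h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h)
-    h = sigmoid(x_t.dot(W_xh) + h.dot(W_hh) + b_h)
-
-print h.shape    # (16,): the last latent state can serve as the sentence feature
-```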
-
-### Long-Short Term Memory (LSTM)
-
-Training an RNN on long sequential data sometimes leads to vanishing or exploding gradients\[[6](#Reference)\]. To solve this problem, Hochreiter and Schmidhuber (1997) proposed **Long Short Term Memory** (LSTM)\[[5](#Reference)\].
-
-Compared to the structure of a simple RNN, an LSTM includes a memory cell $c$, an input gate $i$, a forget gate $f$ and an output gate $o$. These gates and the memory cell dramatically improve the ability of the network to handle long sequences. We can formulate the **LSTM-RNN**, denoted as a function $F$, as follows:
-
-$$ h_t=F(x_t,h_{t-1})$$
-
-$F$ contains following formulations\[[7](#Reference)\]:
-\begin{align}
-i_t & = \sigma(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)\\\\
-f_t & = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f)\\\\
-c_t & = f_t\odot c_{t-1}+i_t\odot \tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c)\\\\
-o_t & = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o)\\\\
-h_t & = o_t\odot \tanh(c_t)\\\\
-\end{align}
-
-In these equations, $i_t, f_t, c_t, o_t$ stand for the input gate, forget gate, memory cell and output gate, respectively. $W$ and $b$ are model parameters, $\tanh$ is the hyperbolic tangent, and $\odot$ denotes the element-wise product. The input gate controls the magnitude of the new input into the memory cell $c$; the forget gate controls how much memory is propagated from the last time step; the output gate controls the magnitude of the output. The three gates are computed similarly with different parameters, and they influence the memory cell $c$ separately, as shown in Figure 2:
-
-
-
-Figure 2. LSTM at time step $t$ [7].
-
-
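-As a sanity check of the equations above, the following NumPy sketch runs a toy LSTM cell over a short sequence, with the peephole terms $W_{ci}$, $W_{cf}$, $W_{co}$ implemented as diagonal (element-wise) weights, which is a common convention. All sizes and randomly initialized parameters are illustrative assumptions, not PaddlePaddle's `lstmemory` implementation.
-
-```python
-import numpy as np
-
-def sigmoid(x):
-    return 1.0 / (1.0 + np.exp(-x))
-
-dim = 4                                    # toy size: x_t, h_t and c_t are all 4-dimensional
-rand = lambda *shape: np.random.randn(*shape) * 0.1
-W_xi, W_hi, W_ci, b_i = rand(dim, dim), rand(dim, dim), rand(dim), np.zeros(dim)
-W_xf, W_hf, W_cf, b_f = rand(dim, dim), rand(dim, dim), rand(dim), np.zeros(dim)
-W_xc, W_hc, b_c       = rand(dim, dim), rand(dim, dim), np.zeros(dim)
-W_xo, W_ho, W_co, b_o = rand(dim, dim), rand(dim, dim), rand(dim), np.zeros(dim)
-
-def lstm_step(x_t, h_prev, c_prev):
-    i_t = sigmoid(x_t.dot(W_xi) + h_prev.dot(W_hi) + W_ci * c_prev + b_i)  # input gate
-    f_t = sigmoid(x_t.dot(W_xf) + h_prev.dot(W_hf) + W_cf * c_prev + b_f)  # forget gate
-    c_t = f_t * c_prev + i_t * np.tanh(x_t.dot(W_xc) + h_prev.dot(W_hc) + b_c)
-    o_t = sigmoid(x_t.dot(W_xo) + h_prev.dot(W_ho) + W_co * c_t + b_o)     # output gate
-    h_t = o_t * np.tanh(c_t)
-    return h_t, c_t
-
-h, c = np.zeros(dim), np.zeros(dim)
-for x_t in np.random.randn(6, dim):        # run a length-6 toy sequence through the cell
-    h, c = lstm_step(x_t, h, c)
-print h
-```
-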
-LSTM enhances the ability to consider long-term dependencies with the help of its memory cell and gates. A similar structure with a simpler design was proposed as the Gated Recurrent Unit (GRU)\[[8](#Reference)\]. **These structures are still similar to a simple RNN, though with some modifications (as shown in Figure 2): the latent state depends on the input as well as the latent state of the last time step, and the process goes on recurrently until all inputs are consumed:**
-
-$$ h_t=Recurrent(x_t,h_{t-1})$$
-where $Recurrent$ is a simple RNN, GRU or LSTM.
-
-### Stacked Bidirectional LSTM
-
-For a vanilla LSTM, $h_t$ contains input information from the previous context at time steps $1, \ldots, t-1$. We can also apply an RNN in the reverse direction so that the subsequent context $t+1, \ldots, n$ is taken into consideration. Combined with stacking (a deeper RNN can capture more abstract and higher-level semantics), we can design a deep stacked bidirectional LSTM to model sequential data\[[9](#Reference)\].
-
-As shown in Figure 3 (a 3-layer RNN), the odd/even layers are forward/reverse LSTMs. Higher LSTM layers take the output of lower LSTM layers as input, and the top-layer LSTM produces a fixed-length vector by max-pooling (this representation considers contexts from previous and subsequent words for higher-level abstractions). Finally, we feed the concatenated output into a softmax layer for classification.
-
-
-
-Figure 3. Stacked Bidirectional LSTM for NLP modeling.
-
-
-## Dataset
-
-We use [IMDB](http://ai.stanford.edu/%7Eamaas/data/sentiment/) dataset for sentiment analysis in this tutorial, which consists of 50,000 movie reviews split evenly into 25k train and 25k test sets. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10.
-
-The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, `wmt14`, etc. There's no need for us to manually download and preprocess IMDB.
-
-After issuing the command `python train.py`, training will start immediately. The following sections unpack the details and show how it works.
-
-
-## Model Structure
-
-### Initialize PaddlePaddle
-
-We must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).
-
-```python
-import sys
-import paddle.v2 as paddle
-
-# PaddlePaddle init
-paddle.init(use_gpu=False, trainer_count=1)
-```
-
-As alluded to in section [Model Overview](#model-overview), here we provide the implementations of both Text CNN and Stacked-bidirectional LSTM models.
-
-### Text Convolution Neural Network (Text CNN)
-
-We create a neural network `convolution_net` as shown in the following code snippet.
-
-Note: `paddle.networks.sequence_conv_pool` includes both convolution and pooling layer operations.
-
-```python
-def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
- data = paddle.layer.data("word",
- paddle.data_type.integer_value_sequence(input_dim))
- emb = paddle.layer.embedding(input=data, size=emb_dim)
- conv_3 = paddle.networks.sequence_conv_pool(
- input=emb, context_len=3, hidden_size=hid_dim)
- conv_4 = paddle.networks.sequence_conv_pool(
- input=emb, context_len=4, hidden_size=hid_dim)
- output = paddle.layer.fc(input=[conv_3, conv_4],
- size=class_dim,
- act=paddle.activation.Softmax())
- lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
- cost = paddle.layer.classification_cost(input=output, label=lbl)
- return cost
-```
-
-1. Define input data and its dimension
-
- Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories. In `convolution_net`, the input to the network is defined in `paddle.layer.data`.
-
-1. Define Classifier
-
- The above Text CNN network extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` function then acts as the classifier, calculating the probability that the sentence belongs to each category.
-
-1. Define Loss Function
-
- In the context of supervised learning, labels of the training set are defined in `paddle.layer.data`, too. During training, cross-entropy is used as the loss function in `paddle.layer.classification_cost`, which is also the output of the network; during testing, the outputs are the probabilities calculated by the classifier.
-
-### Stacked Bidirectional LSTM
-
-We create a neural network `stacked_lstm_net` as below.
-
-```python
-def stacked_lstm_net(input_dim,
- class_dim=2,
- emb_dim=128,
- hid_dim=512,
- stacked_num=3):
- """
- A Wrapper for sentiment classification task.
- This network uses a bi-directional recurrent network,
- consisting of three LSTM layers. This configuration follows
- the paper at the following url, but uses fewer layers.
- http://www.aclweb.org/anthology/P15-1109
- input_dim: here is word dictionary dimension.
- class_dim: number of categories.
- emb_dim: dimension of word embedding.
- hid_dim: dimension of hidden layer.
- stacked_num: number of stacked lstm-hidden layer.
- """
- assert stacked_num % 2 == 1
-
- layer_attr = paddle.attr.Extra(drop_rate=0.5)
- fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
- lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
- para_attr = [fc_para_attr, lstm_para_attr]
- bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
- relu = paddle.activation.Relu()
- linear = paddle.activation.Linear()
-
- data = paddle.layer.data("word",
- paddle.data_type.integer_value_sequence(input_dim))
- emb = paddle.layer.embedding(input=data, size=emb_dim)
-
- fc1 = paddle.layer.fc(input=emb,
- size=hid_dim,
- act=linear,
- bias_attr=bias_attr)
- lstm1 = paddle.layer.lstmemory(
- input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr)
-
- inputs = [fc1, lstm1]
- for i in range(2, stacked_num + 1):
- fc = paddle.layer.fc(input=inputs,
- size=hid_dim,
- act=linear,
- param_attr=para_attr,
- bias_attr=bias_attr)
- lstm = paddle.layer.lstmemory(
- input=fc,
- reverse=(i % 2) == 0,
- act=relu,
- bias_attr=bias_attr,
- layer_attr=layer_attr)
- inputs = [fc, lstm]
-
- fc_last = paddle.layer.pooling(
- input=inputs[0], pooling_type=paddle.pooling.Max())
- lstm_last = paddle.layer.pooling(
- input=inputs[1], pooling_type=paddle.pooling.Max())
- output = paddle.layer.fc(input=[fc_last, lstm_last],
- size=class_dim,
- act=paddle.activation.Softmax(),
- bias_attr=bias_attr,
- param_attr=para_attr)
-
- lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
- cost = paddle.layer.classification_cost(input=output, label=lbl)
- return cost
-```
-
-1. Define input data and its dimension
-
- Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories. In `stacked_lstm_net`, the input to the network is defined in `paddle.layer.data`.
-
-1. Define Classifier
-
- The above stacked bidirectional LSTM network extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` function then acts as the classifier, calculating the probability that the sentence belongs to each category.
-
-1. Define Loss Function
-
- In the context of supervised learning, labels of the training set are defined in `paddle.layer.data`, too. During training, cross-entropy is used as the loss function in `paddle.layer.classification_cost`, which is also the output of the network; during testing, the outputs are the probabilities calculated by the classifier.
-
-
-To reiterate, we can either invoke `convolution_net` or `stacked_lstm_net`.
-
-```python
-word_dict = paddle.dataset.imdb.word_dict()
-dict_dim = len(word_dict)
-class_dim = 2
-
-# option 1
-cost = convolution_net(dict_dim, class_dim=class_dim)
-# option 2
-# cost = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3)
-```
-
-## Model Training
-
-### Define Parameters
-
-First, we create the model parameters according to the previous model configuration `cost`.
-
-```python
-# create parameters
-parameters = paddle.parameters.create(cost)
-```
-
-### Create Trainer
-
-Before creating a trainer, we need to configure the optimization algorithm.
-Here we specify the `Adam` optimization algorithm via `paddle.optimizer`.
-
-```python
-# create optimizer
-adam_optimizer = paddle.optimizer.Adam(
- learning_rate=2e-3,
- regularization=paddle.optimizer.L2Regularization(rate=8e-4),
- model_average=paddle.optimizer.ModelAverage(average_window=0.5))
-
-# create trainer
-trainer = paddle.trainer.SGD(cost=cost,
- parameters=parameters,
- update_equation=adam_optimizer)
-```
-
-### Training
-
-`paddle.dataset.imdb.train()` yields records during each pass. After shuffling, batches are generated for training.
-
-```python
-train_reader = paddle.batch(
- paddle.reader.shuffle(
- lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
- batch_size=100)
-
-test_reader = paddle.batch(
- lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
-```
-
-`feeding` specifies the correspondence between each yielded record and `paddle.layer.data`. For instance, the first column of the data generated by `paddle.dataset.imdb.train()` corresponds to the `word` feature.
-
-```python
-feeding = {'word': 0, 'label': 1}
-```
-
-Callback function `event_handler` will be invoked to track training progress when a pre-defined event happens.
-
-```python
-def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
- if event.batch_id % 100 == 0:
- print "\nPass %d, Batch %d, Cost %f, %s" % (
- event.pass_id, event.batch_id, event.cost, event.metrics)
- else:
- sys.stdout.write('.')
- sys.stdout.flush()
- if isinstance(event, paddle.event.EndPass):
- result = trainer.test(reader=test_reader, feeding=feeding)
- print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
-```
-
-Finally, we can invoke `trainer.train` to start training:
-
-```python
-trainer.train(
- reader=train_reader,
- event_handler=event_handler,
- feeding=feeding,
- num_passes=10)
-```
-
-
-## Conclusion
-
-In this chapter, we use sentiment analysis as an example to introduce end-to-end short text classification with deep learning models, and show how to implement the models with PaddlePaddle. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In the following chapters, we will see how these models can be applied to other tasks.
-
-## Reference
-
-1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
-2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
-3. Yann N. Dauphin, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J] arXiv preprint arXiv:1612.08083, 2016.
-4. Siegelmann H T, Sontag E D. [On the computational power of neural nets](http://research.cs.queensu.ca/home/akl/cisc879/papers/SELECTED_PAPERS_FROM_VARIOUS_SOURCES/05070215382317071.pdf)[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 440-449.
-5. Hochreiter S, Schmidhuber J. [Long short-term memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)[J]. Neural computation, 1997, 9(8): 1735-1780.
-6. Bengio Y, Simard P, Frasconi P. [Learning long-term dependencies with gradient descent is difficult](http://www-dsi.ing.unifi.it/~paolo/ps/tnn-94-gradient.pdf)[J]. IEEE transactions on neural networks, 1994, 5(2): 157-166.
-7. Graves A. [Generating sequences with recurrent neural networks](http://arxiv.org/pdf/1308.0850)[J]. arXiv preprint arXiv:1308.0850, 2013.
-8. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://arxiv.org/pdf/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
-9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
-
-
-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
-
-
-# Semantic Role Labeling
-
-The source code of this chapter lives on [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles).
-
-For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.en.md#running-the-book).
-
-## Background
-
-Natural language analysis techniques consist of lexical, syntactic, and semantic analysis. **Semantic Role Labeling (SRL)** is an instance of **Shallow Semantic Analysis**.
-
-In a sentence, a **predicate** states a property or a characterization of a *subject*, such as what it does and what it is like. The predicate represents the core of an event, whereas the words accompanying the predicate are **arguments**. A **semantic role** refers to the abstract role an argument of a predicate takes on in the event, including *agent*, *patient*, *theme*, *experiencer*, *beneficiary*, *instrument*, *location*, *goal*, and *source*.
-
-In the following example of a Chinese sentence, "to encounter" is the predicate (*pred*); "Ming" is the *agent*; "Hong" is the *patient*; "yesterday" and "evening" are the *time*; finally, "the park" is the *location*.
-
-$$\mbox{[小明 Ming]}_{\mbox{Agent}}\mbox{[昨天 yesterday]}_{\mbox{Time}}\mbox{[晚上 evening]}_\mbox{Time}\mbox{在[公园 a park]}_{\mbox{Location}}\mbox{[遇到 to encounter]}_{\mbox{Predicate}}\mbox{了[小红 Hong]}_{\mbox{Patient}}\mbox{。}$$
-
-Rather than analyzing the full semantics of a sentence, **Semantic Role Labeling** (**SRL**) identifies the relations between the predicate and the other constituents surrounding it, and labels the predicate-argument structures with specific semantic roles. SRL is an important intermediate step in a wide range of natural language understanding tasks, including *information extraction*, *discourse analysis*, and *deepQA*. Research usually assumes that the predicate of a sentence is given; the only task is to identify its arguments and their semantic roles.
-
-Conventional SRL systems mostly build on top of syntactic analysis, usually consisting of five steps:
-
-1. Construct a syntax tree, as shown in Fig. 1
-2. Identify the candidate arguments of the given predicate on the tree.
-3. Prune the most unlikely candidate arguments.
-4. Identify the real arguments, often by a binary classifier.
-5. Perform multi-class classification on the results of step 4 to label the semantic roles. Steps 2 and 3 usually rely on hand-designed features based on the syntactic analysis of step 1.
-
-
-
-
-Fig 1. Syntax tree
-
-
-
-However, a complete syntactic analysis requires identifying the relations among all constituents. Thus, the accuracy of SRL is sensitive to the preciseness of the syntactic analysis, which makes SRL challenging. To reduce its complexity and still obtain some information about the syntactic structure, we often use *shallow syntactic analysis*, a.k.a. partial parsing or chunking. Unlike complete syntactic analysis, which requires the construction of the complete parse tree, *shallow syntactic analysis* only requires identifying some independent constituents with relatively simple structures, such as verb phrases (chunks). To avoid the difficulty of constructing a highly accurate syntax tree, some work\[[1](#Reference)\] proposed semantic-chunking-based SRL methods, which reduce SRL to a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using the **BIO representation**: for the syntactic chunks forming role A, the first chunk receives the tag B-A (Begin) and the remaining ones receive the tag I-A (Inside); chunks that belong to no role receive the tag O.
-
-The BIO representation of the above example is shown in Fig. 2.
-
-
-
-Fig 2. BIO representation
-
-
-This example illustrates the simplicity of sequence tagging, since
-
-1. It only relies on shallow syntactic analysis, reducing the precision requirement on the syntactic analysis;
-2. Pruning the candidate arguments is no longer necessary;
-3. Arguments are identified and tagged at the same time. Simplifying the workflow reduces the risk of accumulating errors; oftentimes, methods that unify multiple steps boost performance.
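-
-For example, the BIO tags for the English sample sentence that appears later in this chapter ("A record date has n't been set .") can be produced with a few lines of plain Python. This is a toy illustration, not the dataset's actual preprocessing code:
-
-```python
-def to_bio(words, spans):
-    """spans: list of (start, end, role) chunks with `end` exclusive; all other words get 'O'."""
-    tags = ['O'] * len(words)
-    for start, end, role in spans:
-        tags[start] = 'B-' + role
-        for i in range(start + 1, end):
-            tags[i] = 'I-' + role
-    return tags
-
-words = ['A', 'record', 'date', 'has', "n't", 'been', 'set', '.']
-spans = [(0, 3, 'A1'), (4, 5, 'AM-NEG'), (6, 7, 'V')]
-print zip(words, to_bio(words, spans))
-# [('A', 'B-A1'), ('record', 'I-A1'), ('date', 'I-A1'), ('has', 'O'),
-#  ("n't", 'B-AM-NEG'), ('been', 'O'), ('set', 'B-V'), ('.', 'O')]
-```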
-
-In this tutorial, our SRL system is built as an end-to-end system via a neural network. The system takes only text sequences as input, without using any syntactic parsing results or complex hand-designed features. The public dataset [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) is used for the following task: given a sentence with predicates marked, identify the corresponding arguments and their semantic roles through sequence tagging.
-
-## Model
-
-**Recurrent Neural Networks** (*RNNs*) are important tools for sequence modeling and have been successfully used in some natural language processing tasks. Unlike feed-forward neural networks, RNNs can model dependencies between the elements of a sequence. As a variant of RNNs, LSTMs aim to model long-term dependencies in long sequences. We have introduced this in [understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/05.understand_sentiment). In this chapter, we continue to use LSTMs to solve SRL problems.
-
-### Stacked Recurrent Neural Network
-
-*Deep Neural Networks* can extract hierarchical representations. The higher layers can form relatively abstract/complex representations, based on primitive features discovered through the lower layers. Unfolding LSTMs through time results in a deep feed-forward neural network. This is because any computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. On the other hand, due to parameter sharing over time, LSTMs are also *shallow*; that is, the computation carried out at each time-step is just a linear transformation. Deep LSTM networks are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical neural networks can be efficient at representing some functions and modeling varying-length dependencies\[[2](#Reference)\].
-
-
-However, in a deep LSTM network, any gradient propagated back in depth needs to traverse a large number of nonlinear steps. As a result, while a 4-layer LSTM network can be trained properly, networks with 4-8 layers have much worse performance. Conventional LSTMs prevent back-propagated errors from vanishing or exploding by introducing shortcut connections that skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well.
-
-
-A single LSTM cell has three operations:
-
-1. input-to-hidden: map input $x$ to the input of the forget gates, input gates, memory cells and output gates by linear transformation (i.e., matrix mapping);
-2. hidden-to-hidden: calculate the forget gate, input gate and output gate, and update the memory cell; this is the main part of LSTMs;
-3. hidden-to-output: this part typically involves an activation operation on hidden states.
-
-Based on the stacked LSTMs, we add shortcut connections: take the input-to-hidden from the previous layer as a new input and learn another linear transformation.
-
-Fig.3 illustrates the final stacked recurrent neural networks.
-
-
-
-Fig 3. Stacked Recurrent Neural Networks
-
-
-### Bidirectional Recurrent Neural Network
-
-While LSTMs can summarize the history -- all the previous input seen up until now -- they cannot see the future. Because most NLP (natural language processing) tasks provide the entire sentence, sequential learning can benefit from having the future encoded as well as the history.
-
-To address this, we can design a bidirectional recurrent neural network by making a minor modification. A higher LSTM layer processes the sequence in the reverse direction with respect to its immediately lower LSTM layer, i.e., deep LSTM layers take turns training on the input sequence from left-to-right and right-to-left. Therefore, starting from the second layer, the LSTM layers at time step $t$ can see both the history and the future. Fig. 4 illustrates the bidirectional recurrent neural network.
-
-
-
-
-Fig 4. Bidirectional LSTMs
-
-
-Note that this bidirectional RNN is different from the one proposed by Bengio et al. for machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce that bidirectional RNN in the later chapter on [machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.en.md).
-
-### Conditional Random Field (CRF)
-
-Typically, a neural network's lower layers learn representations while its very top layer learns the final task. These principles can guide our problem-solving approaches. In SRL tasks, a **Conditional Random Field** (*CRF*) is built on top of the network in order to perform the final prediction to tag sequences. It takes as input the representations provided by the last LSTM layer.
-
-
-The CRF is an undirected probabilistic graphical model with nodes denoting random variables and edges denoting dependencies between those variables. In essence, CRFs learn the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ is the input sequence and $Y = (y_1, y_2, ... , y_n)$ is the label sequence; to decode, we simply search for the sequence $Y$ that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$.
-
-Sequence tagging tasks do not assume a lot of conditional independence, because they are only concerned with the input and the output being linear sequences. Thus, the graph model of sequence tagging tasks is usually a simple chain or line, which results in a **Linear-Chain Conditional Random Field**, shown in Fig.5.
-
-
-
-Fig 5. Linear Chain Conditional Random Field used in SRL tasks
-
-
-By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
-
-$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
-
-
-where $Z(X)$ is the normalization constant. ${t_j}$ denotes the feature functions defined on edges, called *transition features*, which represent the transition probability from $y_{i-1}$ to $y_i$ given the input sequence $X$. ${s_k}$ denotes the feature functions defined on nodes, called *state features*, which represent the probability of $y_i$ given the input sequence $X$. $\lambda_j$ and $\mu_k$ are the weights corresponding to $t_j$ and $s_k$. Alternatively, $t$ and $s$ can be written in the same form that depends on $y_{i - 1}$, $y_i$, $X$, and $i$. Summing over all nodes $i$, we have $f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$, which defines the *feature function* $f$. Thus, $P(Y|X)$ can be written as:
-
-$$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$
-
-where $\omega$ are the feature-function weights that the CRF learns. During training, given input sequences and label sequences $D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$, we construct the following objective function by maximum likelihood estimation (**MLE**):
-
-
-$$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
-
-
-This objective function can be optimized via back-propagation in an end-to-end manner. During decoding, given an input sequence $X$, we search for the sequence $\bar{Y}$ that maximizes the conditional probability $\bar{P}(Y|X)$ via decoding methods such as *Viterbi* or the [Beam Search Algorithm](https://github.com/PaddlePaddle/book/blob/develop/07.machine_translation/README.en.md#Beam%20Search%20Algorithm).
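-
-To make the score concrete, the following NumPy sketch computes the unnormalized score $\sum_{k}\omega_{k}f_{k}(Y, X)$ of one tag sequence, with the weighted state features collapsed into per-position emission scores and the weighted transition features collapsed into a tag-to-tag transition matrix, and then normalizes by brute force over all tag sequences. The sizes and random scores are toy assumptions, not PaddlePaddle's CRF implementation.
-
-```python
-import numpy as np
-from itertools import product
-
-num_tags, seq_len = 3, 4
-emission = np.random.randn(seq_len, num_tags)     # weighted state-feature scores, one row per position
-transition = np.random.randn(num_tags, num_tags)  # weighted transition-feature scores between adjacent tags
-
-def sequence_score(tags):
-    """Unnormalized log-score of one tag sequence under a linear-chain CRF."""
-    score = emission[0, tags[0]]
-    for i in range(1, seq_len):
-        score += transition[tags[i - 1], tags[i]] + emission[i, tags[i]]
-    return score
-
-# p(Y|X) = exp(score(Y)) / Z(X), where Z(X) sums exp(score) over every possible tag sequence.
-all_scores = [sequence_score(y) for y in product(range(num_tags), repeat=seq_len)]
-print np.exp(sequence_score((0, 1, 2, 1))) / np.exp(all_scores).sum()
-```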
-
-### Deep Bidirectional LSTM (DB-LSTM) SRL model
-
-Given predicates and a sentence, SRL tasks aim to identify arguments of the given predicate and their semantic roles. If a sequence has $n$ predicates, we will process this sequence $n$ times. Here is the breakdown of a straight-forward model:
-
-1. Construct inputs;
- - input 1: predicate, input 2: sentence
- expand input 1 into a sequence of the same length as input 2's sentence, using a one-hot representation;
-2. Convert the one-hot sequences from step 1 to vector sequences via a word embedding's lookup table;
-3. Learn the representation of input sequences by taking vector sequences from step 2 as inputs;
-4. Take the representation from step 3 as input, label sequence as supervisory signal, and realize sequence tagging tasks.
-
-Here, we propose some improvements by introducing two simple but effective features:
-
-- predicate context (**ctx-p**): A single predicate word may not carry all the predicate information, especially when the same word appears multiple times in a sentence. With the expanded context, the ambiguity can be largely eliminated. Thus, we extract the $n$ words before and after the predicate to construct a context window chunk.
-
-- region mark ($m_r$): The binary marker on a word, $m_r$, takes the value $1$ when the word is in the predicate context region, and $0$ otherwise. A minimal sketch of constructing both features follows this list.
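-
-The sketch below builds both features for one sentence, using a window of $n=2$ words on each side and a placeholder token at sentence boundaries. The helper names are made up for illustration; the real preprocessing lives in `paddle.v2.dataset.conll05`.
-
-```python
-def predicate_context(words, pred_idx, n=2, pad='x'):
-    """The 2n+1 words centered on the predicate, padded at sentence boundaries."""
-    ctx = []
-    for i in range(pred_idx - n, pred_idx + n + 1):
-        ctx.append(words[i] if 0 <= i < len(words) else pad)
-    return ctx
-
-def region_mark(words, pred_idx, n=2):
-    """1 if the word falls inside the predicate context window, else 0."""
-    return [1 if pred_idx - n <= i <= pred_idx + n else 0 for i in range(len(words))]
-
-words = ['A', 'record', 'date', 'has', "n't", 'been', 'set', '.']
-pred_idx = words.index('set')
-print predicate_context(words, pred_idx)   # ["n't", 'been', 'set', '.', 'x']
-print region_mark(words, pred_idx)         # [0, 0, 0, 0, 1, 1, 1, 1]
-```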
-
-After these modifications, the model is as follows, as illustrated in Figure 6:
-
-1. Construct inputs
- Input 1: word sequence. Input 2: predicate. Input 3: predicate context, the $n$ words before and after the predicate. Input 4: region mark sequence, where an entry is 1 if the word is located in the predicate context region, and 0 otherwise.
- expand inputs 2~3 into sequences of the same length as input 1
-2. Convert inputs 1~4 to vector sequences via word embedding lookup tables; inputs 1 and 3 share the same lookup table, while inputs 2 and 4 have their own lookup tables.
-3. Take the four vector sequences from step 2 as inputs to bidirectional LSTMs; Train the LSTMs to update representations.
-4. Take the representation from step 3 as input to CRF, label sequence as supervisory signal, and complete sequence tagging tasks.
-
-
-
-
-Fig 6. DB-LSTM for SRL tasks
-
-
-## Data Preparation
-
-In the tutorial, we use [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task open dataset as an example. Note that the training set and development set of the CoNLL 2005 SRL task are not free to download after the competition. Currently, only the test set can be obtained, including 23 sections of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ corpus as the training dataset to explain the model. However, since the training set is small, for a usable neural network SRL system, please consider paying for the full corpus.
-
-The original data includes a variety of information such as POS tagging, naming entity recognition, syntax tree, etc. In this tutorial, we only use the data under `test.wsj/words/` (text sequence) and `test.wsj/props/` (label results). The data directory used in this tutorial is as follows:
-
-```text
-conll05st-release/
-└── test.wsj
- ├── props # label results
- └── words # input text sequences
-```
-
-The annotation information is derived from the results of Penn TreeBank\[[7](#references)\] and PropBank\[[8](#references)\]. The labeling of PropBank is different from the labeling methods mentioned before, but shares the same underlying principle. For descriptions of the labeling, please refer to the paper \[[9](#references)\].
-
-The raw data needs to be preprocessed into formats that PaddlePaddle can handle. The preprocessing consists of the following steps:
-
-1. Merge the text sequence and the tag sequence into the same record;
-2. If a sentence contains $n$ predicates, the sentence will be processed $n$ times into $n$ separate training samples, each sample with a different predicate;
-3. Extract the predicate context and construct the predicate context region marker;
-4. Construct the markings in BIO format;
-5. Obtain the integer index corresponding to the word according to the dictionary.
-
-```python
-# import paddle.v2.dataset.conll05 as conll05
-# conll05.corpus_reader does step 1 and 2 as mentioned above.
-# conll05.reader_creator does step 3 to 5.
-# conll05.test gets preprocessed training instances.
-```
-
-After preprocessing, a training sample contains nine features, namely: word sequence, predicate, predicate context (5 columns), region mark sequence, label sequence. The following table is an example of a training sample.
-
-| word sequence | predicate | predicate context(5 columns) | region mark sequence | label sequence|
-|---|---|---|---|---|
-| A | set | n't been set . × | 0 | B-A1 |
-| record | set | n't been set . × | 0 | I-A1 |
-| date | set | n't been set . × | 0 | I-A1 |
-| has | set | n't been set . × | 0 | O |
-| n't | set | n't been set . × | 1 | B-AM-NEG |
-| been | set | n't been set . × | 1 | O |
-| set | set | n't been set . × | 1 | B-V |
-| . | set | n't been set . × | 1 | O |
-
-In addition to the data, we provide following resources:
-
-| filename | explanation |
-|---|---|
-| word_dict | dictionary of input sentences, total 44068 words |
-| label_dict | dictionary of labels, total 106 labels |
-| predicate_dict | predicate dictionary, total 3162 predicates |
-| emb | a pre-trained word vector lookup table, 32-dimensional |
-
-We trained a language model on English Wikipedia to obtain a word vector lookup table, which is used to initialize the SRL model. While training the SRL model, the word vector lookup table is no longer updated. To learn more about the language model and the word vector lookup table, please refer to the [word vector](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/README.md) tutorial. There are 995,000,000 tokens in the training corpus, and the dictionary size is 4,900,000 words. In the CoNLL 2005 training corpus, 5% of the words are not among these 4,900,000 words; we treat them all as unknown words, represented by `<unk>`.
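-
-As a toy illustration of the unknown-word handling (the dictionary and ids below are made up; the real dictionary is fetched from `conll05.get_dict()` in the next snippet), a word missing from the dictionary is simply mapped to the id of the unknown-word token:
-
-```python
-# Toy dictionary, not the real CoNLL-05 dictionary.
-toy_dict = {'<unk>': 0, 'record': 1, 'date': 2, 'set': 3}
-sentence = ['A', 'record', 'date', 'has', 'been', 'set']
-# Out-of-vocabulary words fall back to the id of '<unk>'.
-print [toy_dict.get(w, toy_dict['<unk>']) for w in sentence]   # [0, 1, 2, 0, 0, 3]
-```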
-
-Here we fetch the dictionary, and print its size:
-
-```python
-import math
-import numpy as np
-import gzip
-import paddle.v2 as paddle
-import paddle.v2.dataset.conll05 as conll05
-import paddle.v2.evaluator as evaluator
-
-paddle.init(use_gpu=False, trainer_count=1)
-
-word_dict, verb_dict, label_dict = conll05.get_dict()
-word_dict_len = len(word_dict)
-label_dict_len = len(label_dict)
-pred_len = len(verb_dict)
-
-print word_dict_len
-print label_dict_len
-print pred_len
-```
-
-## Model Configuration
-
-- Define input data dimensions and model hyperparameters.
-
-```python
-mark_dict_len = 2 # value range of region mark. Region mark is either 0 or 1, so range is 2
-word_dim = 32 # word vector dimension
-mark_dim = 5 # dimension of the region mark embedding
-hidden_dim = 512 # the dimension of LSTM hidden layer vector is 128 (512/4)
-depth = 8 # depth of stacked LSTM
-
-# There are 9 features per sample, so we will define 9 data layers.
-# The type of each layer is integer_value_sequence.
-def d_type(value_range):
- return paddle.data_type.integer_value_sequence(value_range)
-
-# word sequence
-word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
-# predicate
-predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
-
-# 5 features for predicate context
-ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
-ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
-ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
-ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
-ctx_p2 = paddle.layer.data(name='ctx_p2_data', type=d_type(word_dict_len))
-
-# region marker sequence
-mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
-
-# label sequence
-target = paddle.layer.data(name='target', type=d_type(label_dict_len))
-```
-
-Note that `hidden_dim = 512` actually means an LSTM hidden vector of 128 dimensions (512/4). Please refer to the official PaddlePaddle documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory).
-
-- Transform the word sequence itself, the predicate, the predicate context, and the region mark sequence into embedded vector sequences.
-
-```python
-
-# Since the word vector lookup table is pre-trained, we won't update it this time.
-# is_static being True prevents updating the lookup table during training.
-emb_para = paddle.attr.Param(name='emb', initial_std=0., is_static=True)
-# hyperparameter configurations
-default_std = 1 / math.sqrt(hidden_dim) / 3.0
-std_default = paddle.attr.Param(initial_std=default_std)
-std_0 = paddle.attr.Param(initial_std=0.)
-
-predicate_embedding = paddle.layer.embedding(
- size=word_dim,
- input=predicate,
- param_attr=paddle.attr.Param(
- name='vemb', initial_std=default_std))
-mark_embedding = paddle.layer.embedding(
- size=mark_dim, input=mark, param_attr=std_0)
-
-word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
-emb_layers = [
- paddle.layer.embedding(
- size=word_dim, input=x, param_attr=emb_para) for x in word_input
-]
-emb_layers.append(predicate_embedding)
-emb_layers.append(mark_embedding)
-```
-
-- The 8 LSTM layers are trained in alternating left-to-right / right-to-left order, as indicated by the variable `reverse`.
-
-```python
-hidden_0 = paddle.layer.mixed(
- size=hidden_dim,
- bias_attr=std_default,
- input=[
- paddle.layer.full_matrix_projection(
- input=emb, param_attr=std_default) for emb in emb_layers
- ])
-
-mix_hidden_lr = 1e-3
-lstm_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=1.0)
-hidden_para_attr = paddle.attr.Param(
- initial_std=default_std, learning_rate=mix_hidden_lr)
-
-lstm_0 = paddle.layer.lstmemory(
- input=hidden_0,
- act=paddle.activation.Relu(),
- gate_act=paddle.activation.Sigmoid(),
- state_act=paddle.activation.Sigmoid(),
- bias_attr=std_0,
- param_attr=lstm_para_attr)
-
-# stack L-LSTM and R-LSTM with direct edges
-input_tmp = [hidden_0, lstm_0]
-
-for i in range(1, depth):
- mix_hidden = paddle.layer.mixed(
- size=hidden_dim,
- bias_attr=std_default,
- input=[
- paddle.layer.full_matrix_projection(
- input=input_tmp[0], param_attr=hidden_para_attr),
- paddle.layer.full_matrix_projection(
- input=input_tmp[1], param_attr=lstm_para_attr)
- ])
-
- lstm = paddle.layer.lstmemory(
- input=mix_hidden,
- act=paddle.activation.Relu(),
- gate_act=paddle.activation.Sigmoid(),
- state_act=paddle.activation.Sigmoid(),
- reverse=((i % 2) == 1),
- bias_attr=std_0,
- param_attr=lstm_para_attr)
-
- input_tmp = [mix_hidden, lstm]
-```
-
-- In PaddlePaddle, the state features and transition features of a CRF are implemented by a fully connected layer and a CRF layer, respectively. The fully connected layer with linear activation learns the state features; here we use `paddle.layer.mixed` (`paddle.layer.fc` can be used as well). The CRF layer, `paddle.layer.crf`, only learns the transition features; it is a cost layer and is the last layer of the network. Given the input sequence, `paddle.layer.crf` outputs the log probability of the true tag sequence as the cost, and it requires the true tag sequence as the target during learning.
-
-```python
-
-# The output of the top LSTM unit and its input are fed into a fully connected layer,
-# whose size equals the number of tag labels.
-# The fully connected layer learns the state features.
-
-feature_out = paddle.layer.mixed(
- size=label_dict_len,
- bias_attr=std_default,
- input=[
- paddle.layer.full_matrix_projection(
- input=input_tmp[0], param_attr=hidden_para_attr),
- paddle.layer.full_matrix_projection(
- input=input_tmp[1], param_attr=lstm_para_attr)], )
-
-crf_cost = paddle.layer.crf(
- size=label_dict_len,
- input=feature_out,
- label=target,
- param_attr=paddle.attr.Param(
- name='crfw',
- initial_std=default_std,
- learning_rate=mix_hidden_lr))
-```
-
-- The CRF decoding layer is used for evaluation and inference. It shares weights with the CRF layer; sharing parameters among multiple layers is specified by using the same parameter name in those layers. If the true tag sequence is provided during training, `paddle.layer.crf_decoding` calculates the labelling error for each input token, and `evaluator.sum` sums the errors over the entire sequence. Otherwise, `paddle.layer.crf_decoding` generates the labelling tags.
-
-```python
-crf_dec = paddle.layer.crf_decoding(
- size=label_dict_len,
- input=feature_out,
- label=target,
- param_attr=paddle.attr.Param(name='crfw'))
-evaluator.sum(input=crf_dec)
-```
-
-## Train model
-
-### Create Parameters
-
-All necessary parameters will be created automatically, given the output layers that we need to use.
-
-```python
-parameters = paddle.parameters.create(crf_cost)
-```
-
-We can print out the parameter names. A name is generated automatically if not specified.
-
-```python
-print parameters.keys()
-```
-
-Now we load the pre-trained word lookup tables from word embeddings trained on the English language Wikipedia.
-
-```python
-def load_parameter(file_name, h, w):
- with open(file_name, 'rb') as f:
- f.read(16)
- return np.fromfile(f, dtype=np.float32).reshape(h, w)
-parameters.set('emb', load_parameter(conll05.get_embedding(), 44068, 32))
-```
-
-### Create Trainer
-
-We create the trainer given the model topology, parameters, and optimization method. We use the most basic **SGD** method: a momentum optimizer with momentum set to 0. Meanwhile, we set the learning rate and regularization.
-
-```python
-optimizer = paddle.optimizer.Momentum(
- momentum=0,
- learning_rate=1e-3,
- regularization=paddle.optimizer.L2Regularization(rate=8e-4),
- model_average=paddle.optimizer.ModelAverage(
- average_window=0.5, max_average_window=10000), )
-
-trainer = paddle.trainer.SGD(cost=crf_cost,
- parameters=parameters,
- update_equation=optimizer,
- extra_layers=crf_dec)
-```
-
-### Training
-
-As mentioned in the data preparation section, we use the CoNLL 2005 test corpus as the training dataset. `conll05.test()` outputs one training instance at a time. The instances are shuffled and batched into mini-batches, which serve as the training input.
-
-```python
-reader = paddle.batch(
- paddle.reader.shuffle(
- conll05.test(), buf_size=8192), batch_size=2)
-```
-
-`feeding` is used to specify the correspondence between data instances and data layers. For example, according to the following `feeding`, the 0th column of a data instance produced by `conll05.test()` is fed to the data layer named `word_data`.
-
-```python
-feeding = {
- 'word_data': 0,
- 'ctx_n2_data': 1,
- 'ctx_n1_data': 2,
- 'ctx_0_data': 3,
- 'ctx_p1_data': 4,
- 'ctx_p2_data': 5,
- 'verb_data': 6,
- 'mark_data': 7,
- 'target': 8
-}
-```
-
-`event_handler` can be used as a callback for training events, and it is passed as an argument to the `train` method. The following `event_handler` prints the cost during training.
-
-```python
-def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
- if event.batch_id and event.batch_id % 10 == 0:
- print "Pass %d, Batch %d, Cost %f, %s" % (
- event.pass_id, event.batch_id, event.cost, event.metrics)
- if event.batch_id % 400 == 0:
- result = trainer.test(reader=reader, feeding=feeding)
- print "\nTest with Pass %d, Batch %d, %s" % (event.pass_id, event.batch_id, result.metrics)
-
- if isinstance(event, paddle.event.EndPass):
- # save parameters
- with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
- parameters.to_tar(f)
-
- result = trainer.test(reader=reader, feeding=feeding)
- print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
-```
-
-`trainer.train` will train the model.
-
-```python
-trainer.train(
- reader=reader,
- event_handler=event_handler,
- num_passes=10000,
- feeding=feeding)
-```
-
-### Application
-
-After training is done, we need to select an optimal model based on a performance index we care about and use it for inference. In this task, one can simply select the model with the fewest labelling errors on the test set. The `paddle.layer.crf_decoding` layer is used for inference, but its inputs do not include the ground truth label.
-
-```python
-predict = paddle.layer.crf_decoding(
- size=label_dict_len,
- input=feature_out,
- param_attr=paddle.attr.Param(name='crfw'))
-```
-
-Here, we use one test sample as an example.
-
-```python
-test_creator = paddle.dataset.conll05.test()
-test_data = []
-for item in test_creator():
- test_data.append(item[0:8])
- if len(test_data) == 1:
- break
-```
-
-The inference interface `paddle.infer` returns the indices of the predicted labels. We then print the tagging results using the reverse dictionary `labels_reverse`.
-
-
-```python
-labs = paddle.infer(
- output_layer=predict, parameters=parameters, input=test_data, field='id')
-assert len(labs) == len(test_data[0][0])
-labels_reverse={}
-for (k,v) in label_dict.items():
- labels_reverse[v]=k
-pre_lab = [labels_reverse[i] for i in labs]
-print pre_lab
-```
-
-## Conclusion
-
-Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we use SRL as an example to illustrate how to perform sequence tagging with PaddlePaddle. The model is taken from our published paper\[[10](#Reference)\]. We only use test data for illustration, since the training data of the CoNLL 2005 dataset is not completely public. The goal is an end-to-end neural network model with fewer dependencies on natural language processing tools that is comparable to, or even better than, traditional models in terms of performance. Please check out our paper for more information and discussions.
-
-## Reference
-1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
-2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
-3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
-4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014.
-5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](http://www.jmlr.org/papers/volume15/doppa14a/source/biblio.bib.old)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
-6. 李航. 统计学习方法[J]. 清华大学出版社, 北京, 2012.
-7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330.
-8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106.
-9. Carreras X, Màrquez L. [Introduction to the CoNLL-2005 shared task: Semantic role labeling](http://www.cs.upc.edu/~srlconll/st05/papers/intro.pdf)[C]//Proceedings of the Ninth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2005: 152-164.
-10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
-
-
-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
-
-
-# Machine Translation
-
-The source code is located at [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation). Please refer to the PaddlePaddle [installation tutorial](https://github.com/PaddlePaddle/book/blob/develop/README.en.md#running-the-book) if you are a first-time user.
-
-## Background
-
-Machine translation (MT) leverages computers to translate from one language to another. The language to be translated is referred to as the source language, while the language to be translated into is referred to as the target language. Thus, Machine translation is the process of translating from the source language to the target language. It is one of the most important research topics in the field of natural language processing.
-
-Early machine translation systems were mainly rule-based, i.e., they relied on language experts to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one language, let alone to specify all possible rules across two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty in obtaining a complete rule set \[[1](#References)\].
-
-
-To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example:
-
-1. human designed features cannot cover all possible linguistic variations;
-
-2. it is difficult to use global features;
-
-3. the techniques heavily rely on pre-processing techniques like word alignment, word segmentation and tokenization, rule-extraction and syntactic parsing etc. The error introduced in any of these steps could accumulate and impact translation quality.
-
-
-
-The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
-
-1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
-
-2. techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT).
-
-
-
-Figure 1. Neural Network based Machine Translation
-
-
-
-This tutorial will mainly introduce an NMT model and how to use PaddlePaddle to train it.
-
-## Illustrative Results
-
-Let's consider an example of Chinese-to-English translation. The model is given the following segmented sentence in Chinese
-```text
-这些 是 希望 的 曙光 和 解脱 的 迹象 .
-```
-After training and with a beam-search size of 3, the generated translations are as follows:
-```text
-0 -5.36816 These are signs of hope and relief .
-1 -6.23177 These are the light of hope and relief .
-2 -7.7914 These are the light of hope and the relief of hope .
-```
-- The first column is the ID of the generated sentence; the second column is the score of the generated sentence (the sum of the log probabilities of its words, listed in descending order, where a larger value indicates better quality); the last column is the generated sentence itself.
-- There are two special tokens: `<e>` denotes the end of a sentence, while `<unk>` denotes an unknown word, i.e., a word that is not in the training dictionary.
-
-## Overview of the Model
-
-This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent Neural Network, the Encoder-Decoder framework used in NMT, attention mechanism, as well as the beam search algorithm.
-
-### Gated Recurrent Unit (GRU)
-
-We introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter.
-Compared to a simple RNN, an LSTM adds a memory cell, an input gate, a forget gate, and an output gate. Together, these gates and the memory cell greatly improve the network's ability to handle long-term dependencies.
-
-The GRU\[[2](#References)\], proposed by Cho et al., is a simplified LSTM and an extension of the simple RNN, as shown in the figure below.
-A GRU unit has only two gates:
-- reset gate: when this gate is closed, historical information is discarded, i.e., irrelevant history has no effect on future output.
-- update gate: it combines the input gate and the forget gate, and controls the impact of historical information on the hidden output. Historical information is carried forward when the update gate is close to 1.
-
-
-
-Figure 2. A GRU Gate
-
-
-Generally speaking, sequences with short-range dependencies have an active reset gate, while sequences with long-range dependencies have an active update gate.
-In addition, Chung et al. \[[3](#References)\] have shown empirically that although the GRU has fewer parameters, it performs similarly to the LSTM on several different tasks.
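-
-For reference, one common way to write a single GRU step (a sketch following Cho et al. \[[2](#References)\]; the symbols below are a generic convention rather than the exact notation of Figure 2, and bias terms are omitted) is:
-
-\begin{align}
-r_t&=\sigma \left ( W_rx_t+U_rh_{t-1} \right )\\\\
-u_t&=\sigma \left ( W_ux_t+U_uh_{t-1} \right )\\\\
-\tilde{h}_t&=\tanh \left ( Wx_t+U\left ( r_t\odot h_{t-1} \right ) \right )\\\\
-h_t&=u_t\odot h_{t-1}+\left ( 1-u_t \right )\odot \tilde{h}_t
-\end{align}
-
-With this convention, the reset gate $r_t$ controls how much history enters the candidate state $\tilde{h}_t$, and an update gate $u_t$ close to 1 carries the historical information $h_{t-1}$ forward, matching the description above.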
-
-### Bi-directional Recurrent Neural Network
-
-We already introduced an instance of bi-directional RNN in the [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/label_semantic_roles/README.md) chapter. Here we present another bi-directional RNN model with a different architecture proposed by Bengio et al. in \[[2](#References),[4](#References)\]. This model takes a sequence as input and outputs a fixed dimensional feature vector at each step, encoding the context information at the corresponding time step.
-
-Specifically, this bi-directional RNN processes the input sequence in the original order and in reverse order respectively, and then concatenates the output feature vectors at each time step as the final output. Thus the output node at each time step contains information from both the past and the future as context. The figure below shows a temporally unrolled bi-directional RNN. The network contains a forward RNN and a backward RNN with six weight matrices: the weight matrices from the input to the forward and backward hidden layers ($W_1, W_3$), the weight matrices from each hidden layer to itself ($W_2, W_5$), and the weight matrices from the forward and backward hidden layers to the output layer ($W_4, W_6$). Note that there are no connections between the forward and backward hidden layers.
-
-Figure 3. Temporally unrolled bi-directional RNN
-
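-In terms of these six matrices (assuming $W_2$ is the forward recurrence and $W_5$ the backward one, with generic activation functions $f$ and $g$ and biases omitted), the unrolled computation at time step $t$ can be sketched as:
-
-\begin{align}
-\overrightarrow{h_t}&=f\left ( W_1x_t+W_2\overrightarrow{h_{t-1}} \right )\\\\
-\overleftarrow{h_t}&=f\left ( W_3x_t+W_5\overleftarrow{h_{t+1}} \right )\\\\
-o_t&=g\left ( W_4\overrightarrow{h_t}+W_6\overleftarrow{h_t} \right )
-\end{align}
-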
-### Encoder-Decoder Framework
-
-The Encoder-Decoder\[[2](#References)\] framework aims to solve the mapping of a sequence to another sequence, for sequences with arbitrary lengths. The source sequence is encoded into a vector via an encoder, which is then decoded to a target sequence via a decoder by maximizing the predictive probability. Both the encoder and the decoder are typically implemented via RNN.
-
-
-
-Figure 4. Encoder-Decoder Framework
-
-
-#### Encoder
-
-There are three steps for encoding a sentence:
-
-1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$, where $\left | V \right |$ is the size of the dictionary; $w_i$ has a one at the position corresponding to the word's location in the dictionary and zeros elsewhere.
-
-2. Word embedding as a representation in a low-dimensional semantic space: There are two problems with the one-hot vector representation:
-
- * the dimensionality of the vector is typically large, leading to the curse of dimensionality;
-
-  * it is hard to capture the relationships between words, i.e., semantic similarities. Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector of fixed dimension, i.e., $s_i=Cw_i$ for the $i$-th word, where $C\in R^{K\times \left | V \right |}$ is the projection matrix and $K$ is the dimensionality of the word embedding vector.
-
-3. Encoding of the source sequence via RNN: This can be described mathematically as:
-
-   $$h_i=\phi_\theta \left ( h_{i-1}, s_i \right )$$
-
- where
- $h_0$ is a zero vector,
-   $\phi_\theta$ is a non-linear activation function, and
- $\mathbf{h}=\left \{ h_1,..., h_T \right \}$
- is the sequential encoding of the first $T$ words from the source sequence. The vector representation of the whole sentence can be represented as the encoding vector at the last time step $T$ from $\mathbf{h}$, or by temporal pooling over $\mathbf{h}$.
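-
-To make these three steps concrete, here is a minimal NumPy sketch (toy code with made-up dimensions, independent of the PaddlePaddle implementation later in this chapter) that turns word indices into one-hot vectors, projects them through an embedding matrix $C$, and runs a simple recurrent update:
-
-```python
-import numpy as np
-
-np.random.seed(0)
-V, K, H = 10, 4, 6         # vocabulary size, embedding size, hidden size (toy values)
-C = np.random.randn(K, V)  # embedding (projection) matrix, s_i = C w_i
-W_x = np.random.randn(H, K)
-W_h = np.random.randn(H, H)
-
-def encode(word_ids):
-    """Return the hidden states h_1..h_T for a list of word indices."""
-    h = np.zeros(H)                           # h_0 is a zero vector
-    states = []
-    for idx in word_ids:
-        w = np.zeros(V)
-        w[idx] = 1.0                          # one-hot representation of the word
-        s = C.dot(w)                          # dense embedding s_i = C w_i
-        h = np.tanh(W_x.dot(s) + W_h.dot(h))  # simple RNN step with tanh non-linearity
-        states.append(h)
-    return states
-
-h_seq = encode([2, 5, 7])    # encode a three-word "sentence"
-sentence_vec = h_seq[-1]     # last state as the whole-sentence encoding
-```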
-
-
-A bi-directional RNN can also be used in step (3) for a more sophisticated sentence encoding, implemented here with a bi-directional GRU. The forward GRU encodes the source sequence in its original order $(x_1,x_2,...,x_T)$ and generates a sequence of hidden states $(\overrightarrow{h_1},\overrightarrow{h_2},...,\overrightarrow{h_T})$. The backward GRU encodes the source sequence in reverse order, i.e., $(x_T,x_{T-1},...,x_1)$, and generates $(\overleftarrow{h_1},\overleftarrow{h_2},...,\overleftarrow{h_T})$. Then for each word $x_i$, its complete hidden state is the concatenation of the corresponding hidden states of the two GRUs, i.e., $h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T}$.
-
-
-
-Figure 5. Encoder using bi-directional GRU
-
-
-#### Decoder
-
-The goal of the decoder is to maximize the probability of the next correct word in the target language. The main idea is as follows:
-
-1. At each time step $i$, given the encoding vector (or context vector) $c$ of the source sentence, the $i$-th word $u_i$ from the ground-truth target language and the RNN hidden state $z_i$, the next hidden state $z_{i+1}$ is computed as:
-
- $$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$
-   where $\phi _{\theta '}$ is a non-linear activation function and $c=q\mathbf{h}$ is the context vector of the source sentence. Without [attention](#Attention Mechanism), if the output of the [encoder](#Encoder) is simply the encoding vector at the last time step of the source sentence, then $c$ can be defined as $c=h_T$. $u_i$ denotes the $i$-th word of the target language sentence, and $u_0$ denotes the start-of-sentence token (i.e., `<s>`), indicating the beginning of decoding. $z_i$ is the RNN hidden state at time step $i$, and $z_0$ is an all-zero vector.
-
-2. Calculate the probability $p_{i+1}$ for the $i+1$-th word in the target language sequence by normalizing $z_{i+1}$ using `softmax` as follows
-
- $$p\left ( u_{i+1}|u_{<i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
-
-   where $W_sz_{i+1}+b_z$ scores each possible word and is then normalized via softmax to produce the probability $p_{i+1}$ of the $i+1$-th word.
-
-3. Compute the cost according to $p_{i+1}$ and $u_{i+1}$.
-4. Repeat Steps 1-3, until all the words in the target language sentence have been processed.
-
-The generation process of machine translation is to translate the source sentence into a sentence in the target language according to a pre-trained model. There are some differences between the decoding step in generation and training. Please refer to [Beam Search Algorithm](#Beam Search Algorithm) for details.
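-
-As an illustration of steps 1 and 2 above, a toy NumPy version of one decoder step (our own simplification, where $\phi_{\theta '}$ is taken to be a single $\tanh$ layer with made-up dimensions; the actual model below uses a GRU step) could look like:
-
-```python
-import numpy as np
-
-np.random.seed(1)
-H, K, V_t = 6, 4, 10                 # hidden size, embedding size, target vocab size (toy values)
-W_z = np.random.randn(H, H + K + H)  # parameters of the toy phi_theta'
-W_s = np.random.randn(V_t, H)        # output projection
-b_z = np.zeros(V_t)
-
-def decoder_step(c, u_emb, z):
-    """Compute z_{i+1} from (c, u_i, z_i), then p_{i+1} = softmax(W_s z_{i+1} + b_z)."""
-    z_next = np.tanh(W_z.dot(np.concatenate([c, u_emb, z])))
-    scores = W_s.dot(z_next) + b_z
-    p_next = np.exp(scores - scores.max())
-    p_next /= p_next.sum()           # softmax over the target vocabulary
-    return z_next, p_next
-
-c = np.random.randn(H)               # context vector of the source sentence
-z0 = np.zeros(H)                     # initial decoder state z_0
-u0 = np.random.randn(K)              # embedding of the start-of-sentence token
-z1, p1 = decoder_step(c, u0, z0)     # p1 is the distribution over the first target word
-```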
-
-### Attention Mechanism
-
-There are a few problems with the fixed dimensional vector representation from the encoding stage:
- * It is very challenging to encode both the semantic and syntactic information of a sentence into a fixed-dimensional vector, regardless of the length of the sentence.
- * Intuitively, when translating a sentence, we typically pay more attention to the parts of the source sentence that are most relevant to what is currently being translated, and the focus shifts as the translation proceeds. With a fixed-dimensional vector, all the information from the source sentence is treated equally, which is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced the attention mechanism, which lets the decoder attend to different fragments of the source sequence and addresses the difficulty of feature learning for long sentences. A decoder with attention is explained below.
-
-Different from the simple decoder, the hidden state $z_{i+1}$ is computed as:
-
-$$z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$$
-
-Note that for each word $u_i$ in the target language sentence, there is a corresponding context vector $c_i$ encoding the source sentence, computed as:
-
-$$c_i=\sum _{j=1}^{T}a_{ij}h_j, a_i=\left[ a_{i1},a_{i2},...,a_{iT}\right ]$$
-
-The attention mechanism is realized as a weighted average over the RNN hidden states $h_j$. The weight $a_{ij}$ denotes the strength of attention of the $i$-th word in the target language sentence to the $j$-th word in the source sentence, and is calculated as
-
-\begin{align}
-a_{ij}&=\frac{exp(e_{ij})}{\sum_{k=1}^{T}exp(e_{ik})}\\\\
-e_{ij}&=align(z_i,h_j)\\\\
-\end{align}
-
-where $align$ is an alignment model that measures the fitness between the $i$-th word in the target language sentence and the $j$-th word in the source sentence. More concretely, the fitness is computed from the $i$-th hidden state $z_i$ of the decoder RNN and the $j$-th hidden state $h_j$ of the source sentence. Conventional alignment models use hard alignment, in which each word in the target language sentence corresponds explicitly to one or more words in the source sentence. An attention model instead uses soft alignment, where any word in the source sentence may be related to any word in the target language sentence, and the strength of the relation is a real number computed by the model; it can therefore be incorporated into the NMT framework and trained via back-propagation.
-
-
-
-Figure 6. Decoder with Attention Mechanism
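-
-The following NumPy sketch (a toy example with an additive alignment score; the dimensions and the form of $align$ are our own assumptions, not the exact formula used by PaddlePaddle's `simple_attention`) shows how the weights $a_{ij}$ and the context vector $c_i$ are computed from the decoder state $z_i$ and the encoder states $h_j$:
-
-```python
-import numpy as np
-
-np.random.seed(2)
-T, H_enc, H_dec, A = 5, 8, 6, 7        # source length, encoder/decoder/alignment sizes (toy values)
-h = np.random.randn(T, H_enc)          # encoder hidden states h_1..h_T
-W_a = np.random.randn(A, H_dec)
-U_a = np.random.randn(A, H_enc)
-v_a = np.random.randn(A)
-
-def attention_context(z_i):
-    """Compute a_i = softmax(e_i) with e_ij = align(z_i, h_j), then c_i = sum_j a_ij * h_j."""
-    e = np.array([v_a.dot(np.tanh(W_a.dot(z_i) + U_a.dot(h_j))) for h_j in h])
-    a = np.exp(e - e.max())
-    a /= a.sum()                        # attention weights a_i1..a_iT
-    c_i = (a[:, None] * h).sum(axis=0)  # weighted average of the encoder states
-    return a, c_i
-
-z_i = np.random.randn(H_dec)
-a_i, c_i = attention_context(z_i)
-```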
-
-
-### Beam Search Algorithm
-
-[Beam Search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising nodes within a limited set. It is typically used when the solution space is huge (e.g., in machine translation or speech recognition) and there is not enough memory to hold all possible solutions. For example, if we want to translate “`你好`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, since the word `hello` can appear a different number of times. Beam search can be used to find a good translation among them.
-
-Beam search builds a search tree using breadth-first search and sorts the nodes at each level of the tree according to a heuristic cost (here, the sum of the log probabilities of the generated words). Only a fixed number of nodes, given by the pre-specified beam size (or beam width), are kept, so only the nodes with the highest scores are expanded at the next level. This reduces the space and time requirements significantly, but a globally optimal solution is not guaranteed.
-
-When using beam search in decoding, the goal is to maximize the probability of the generated sequence. The procedure is as follows:
-
-1. At each time step $i$, compute the hidden state $z_{i+1}$ of the next time step according to the context vector $c$ of the source sentence, the $i$-th word $u_i$ generated for the target language sentence and the RNN hidden state $z_i$.
-2. Normalize $z_{i+1}$ using `softmax` to get the probability $p_{i+1}$ for the $i+1$-th word for the target language sentence.
-3. Sample the word $u_{i+1}$ according to $p_{i+1}$.
-4. Repeat Steps 1-3, until the end-of-sentence token `<e>` is generated or the maximum sentence length is reached.
-
-Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder). Since each step in generation is greedy, there is no guarantee of a global optimum.
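-
-As a minimal illustration of this procedure (toy code: `step_prob` stands in for the decoder's softmax output, and the token IDs follow the `bos_id=0`, `eos_id=1` convention used in the configuration below), a beam search over log probabilities could be sketched as:
-
-```python
-import heapq
-import math
-
-def beam_search(step_prob, beam_size, max_length, bos_id=0, eos_id=1):
-    """step_prob(prefix) -> {word_id: probability} of the next word given a prefix."""
-    beams = [(0.0, [bos_id])]            # (cumulative log probability, sequence)
-    finished = []
-    for _ in range(max_length):
-        candidates = []
-        for logp, seq in beams:
-            for word, p in step_prob(seq).items():
-                candidates.append((logp + math.log(p), seq + [word]))
-        # keep only the beam_size highest-scoring partial sequences
-        beams = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
-        finished += [b for b in beams if b[1][-1] == eos_id]
-        beams = [b for b in beams if b[1][-1] != eos_id]
-        if not beams:
-            break
-    return sorted(finished + beams, key=lambda x: -x[0])
-
-# toy next-word distribution: 70% "hello" (id 2), 30% end-of-sentence (id 1)
-toy_step = lambda seq: {2: 0.7, 1: 0.3}
-print beam_search(toy_step, beam_size=3, max_length=5)
-```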
-
-## Data Preparation
-
-This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as the test and generation set.
-
-
-### Data Preprocessing
-
-There are two steps for pre-processing:
-- Merge the source and target parallel corpus files into one file
- - Merge `XXX.src` and `XXX.trg` file pair as `XXX`
- - The $i$-th row in `XXX` is the concatenation of the $i$-th row from `XXX.src` with the $i$-th row from `XXX.trg`, separated with '\t'.
-
-- Create a source dictionary and a target dictionary, each containing **DICTSIZE** words, namely the (DICTSIZE - 3) most frequent words from the corpus plus 3 special tokens: `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary). A minimal sketch of both preprocessing steps follows this list.
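-
-The sketch below is plain Python (the file names `XXX.src`/`XXX.trg` and the token spellings `<s>`/`<e>`/`<unk>` follow the description above; the released `paddle.dataset.wmt14` data is already preprocessed this way, so this is only for illustration):
-
-```python
-import collections
-
-DICTSIZE = 30000
-
-# 1. Merge the parallel corpus: the i-th line of XXX is "src_line<TAB>trg_line".
-with open('XXX.src') as src, open('XXX.trg') as trg, open('XXX', 'w') as out:
-    for src_line, trg_line in zip(src, trg):
-        out.write(src_line.rstrip('\n') + '\t' + trg_line.rstrip('\n') + '\n')
-
-# 2. Build the source dictionary: 3 special tokens plus the (DICTSIZE - 3) most
-#    frequent words; the target dictionary is built the same way from XXX.trg.
-counter = collections.Counter()
-with open('XXX.src') as f:
-    for line in f:
-        counter.update(line.split())
-words = ['<s>', '<e>', '<unk>'] + [w for w, _ in counter.most_common(DICTSIZE - 3)]
-word_dict = dict((w, i) for i, w in enumerate(words))
-```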
-
-### A Subset of Dataset
-
-Because the full dataset is very large, the PaddlePaddle package `paddle.dataset.wmt14` provides a preprocessed [subset of the dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) to reduce download time.
-
-This subset has 193319 training instances and 6003 test instances, with a dictionary size of 30000. Because of the limited size of this subset, the quality of a model trained on it is not guaranteed.
-
-## Training Instructions
-
-### Initialize PaddlePaddle
-
-```python
-import sys
-import paddle.v2 as paddle
-
-# train with a single CPU
-paddle.init(use_gpu=False, trainer_count=1)
-# False: training, True: generating
-is_generating = False
-```
-
-### Model Configuration
-
-1. Define some global variables
-
- ```python
- dict_size = 30000 # dict dim
- source_dict_dim = dict_size # source language dictionary size
- target_dict_dim = dict_size # destination language dictionary size
- word_vector_dim = 512 # word embedding dimension
- encoder_size = 512 # hidden layer size of GRU in encoder
- decoder_size = 512 # hidden layer size of GRU in decoder
- beam_size = 3 # expand width in beam search
- max_length = 250 # a stop condition of sequence generation
- ```
-
-2. Implement Encoder as follows:
-   - The input is a sequence of words represented by integer word indices, so we define a data layer with type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`.
-
- ```python
- src_word_id = paddle.layer.data(
- name='source_language_word',
- type=paddle.data_type.integer_value_sequence(source_dict_dim))
- ```
-
- - Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
-
- ```python
- src_embedding = paddle.layer.embedding(
- input=src_word_id,
- size=word_vector_dim,
- param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
- ```
-
-   - Use a bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs of the two GRUs to get $\mathbf{h}$
-
- ```python
- src_forward = paddle.networks.simple_gru(
- input=src_embedding, size=encoder_size)
- src_backward = paddle.networks.simple_gru(
- input=src_embedding, size=encoder_size, reverse=True)
- encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
- ```
-
-3. Implement Attention-based Decoder as follows:
-
- - Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
-
- ```python
- with paddle.layer.mixed(size=decoder_size) as encoded_proj:
- encoded_proj += paddle.layer.full_matrix_projection(
- input=encoded_vector)
- ```
-
- - Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
-
- ```python
- backward_first = paddle.layer.first_seq(input=src_backward)
- with paddle.layer.mixed(
- size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
- decoder_boot += paddle.layer.full_matrix_projection(
- input=backward_first)
- ```
-
-   - Define the computation at each time step of the decoder RNN, i.e., given the current context vector $c_i$, the decoder hidden state $z_i$ and the $i$-th word $u_i$ of the target language, predict the probability $p_{i+1}$ of the $i+1$-th word.
-
- - decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot.
-     - context is computed via `simple_attention` as $c_i=\sum _{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`.
- - decoder_inputs fuse $c_i$ with the representation of the current_word (i.e., $u_i$).
- - gru_step uses `gru_step_layer` function to compute $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$.
-     - Softmax normalization is used at the end to compute the probability of each word, i.e., $p\left ( u_i|u_{<i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned.
-
- ```python
- def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
-
- decoder_mem = paddle.layer.memory(
- name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
-
- context = paddle.networks.simple_attention(
- encoded_sequence=enc_vec,
- encoded_proj=enc_proj,
- decoder_state=decoder_mem)
-
- with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
- decoder_inputs += paddle.layer.full_matrix_projection(input=context)
- decoder_inputs += paddle.layer.full_matrix_projection(
- input=current_word)
-
- gru_step = paddle.layer.gru_step(
- name='gru_decoder',
- input=decoder_inputs,
- output_mem=decoder_mem,
- size=decoder_size)
-
- with paddle.layer.mixed(
- size=target_dict_dim,
- bias_attr=True,
- act=paddle.activation.Softmax()) as out:
- out += paddle.layer.full_matrix_projection(input=gru_step)
- return out
- ```
-
-4. Define the name of the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for both inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
-
- ```python
- decoder_group_name = "decoder_group"
- group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
- group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
- group_inputs = [group_input1, group_input2]
- ```
-
-5. Training mode:
-
- - word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
- - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
- - the sequence of next words from the target language is used as label (lbl)
- - multi-class cross-entropy (`classification_cost`) is used to calculate the cost
-
- ```python
- if not is_generating:
- trg_embedding = paddle.layer.embedding(
- input=paddle.layer.data(
- name='target_language_word',
- type=paddle.data_type.integer_value_sequence(target_dict_dim)),
- size=word_vector_dim,
- param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
- group_inputs.append(trg_embedding)
-
- # For decoder equipped with attention mechanism, in training,
-    # the target embedding (the ground truth) is the data input,
-    # while the encoded source sequence is accessed as an unbounded memory.
- # Here, the StaticInput defines a read-only memory
- # for the recurrent_group.
- decoder = paddle.layer.recurrent_group(
- name=decoder_group_name,
- step=gru_decoder_with_attention,
- input=group_inputs)
-
- lbl = paddle.layer.data(
- name='target_language_next_word',
- type=paddle.data_type.integer_value_sequence(target_dict_dim))
- cost = paddle.layer.classification_cost(input=decoder, label=lbl)
- ```
-
-6. Generating mode:
-
-   - the decoder predicts the next target word based on the encoded source sequence and the last generated target word; the embedding of the last generated word is obtained automatically via GeneratedInput.
-   - `beam_search` calls `gru_decoder_with_attention` in a recurrent way to predict sequences of word IDs.
-
- ```python
- if is_generating:
- # In generation, the decoder predicts a next target word based on
- # the encoded source sequence and the last generated target word.
-
- # The encoded source sequence (encoder's output) must be specified by
- # StaticInput, which is a read-only memory.
- # Embedding of the last generated word is automatically gotten by
-    # GeneratedInputs, which is initialized by a start mark, such as <s>,
- # and must be included in generation.
-
- trg_embedding = paddle.layer.GeneratedInputV2(
- size=target_dict_dim,
- embedding_name='_target_language_embedding',
- embedding_size=word_vector_dim)
- group_inputs.append(trg_embedding)
-
- beam_gen = paddle.layer.beam_search(
- name=decoder_group_name,
- step=gru_decoder_with_attention,
- input=group_inputs,
- bos_id=0,
- eos_id=1,
- beam_size=beam_size,
- max_length=max_length)
- ```
-
-Note: Our configuration is based on Bahdanau et al. \[[4](#References)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
-
-## Model Training
-
-1. Create Parameters
-
-   Create all the parameters that the `cost` layer needs, and retrieve their names. If a parameter name is not specified during model configuration, it will be generated automatically.
-
- ```python
- if not is_generating:
- parameters = paddle.parameters.create(cost)
- for param in parameters.keys():
- print param
- ```
-
-2. Define DataSet
-
- Create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
-
- ```python
- if not is_generating:
- wmt14_reader = paddle.batch(
- paddle.reader.shuffle(
- paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
- batch_size=5)
- ```
-3. Create trainer
-
-   We need to tell the trainer what to optimize and how to optimize it. Here the trainer minimizes the `cost` layer using the Adam update rule (with L2 regularization), wrapped in PaddlePaddle's `SGD` trainer.
-
- ```python
- if not is_generating:
- optimizer = paddle.optimizer.Adam(
- learning_rate=5e-5,
- regularization=paddle.optimizer.L2Regularization(rate=8e-4))
- trainer = paddle.trainer.SGD(cost=cost,
- parameters=parameters,
- update_equation=optimizer)
- ```
-
-4. Define event handler
-
-   The event handler is a callback function invoked by the trainer when an event happens. Here we print the training log in the event handler.
-
- ```python
- if not is_generating:
- def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
- if event.batch_id % 2 == 0:
- print "\nPass %d, Batch %d, Cost %f, %s" % (
- event.pass_id, event.batch_id, event.cost, event.metrics)
- ```
-
-5. Start training
-
- ```python
- if not is_generating:
- trainer.train(
- reader=wmt14_reader, event_handler=event_handler, num_passes=2)
- ```
-
- The training log is as follows:
- ```text
- Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
- Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
- ...
- ```
-
-## Model Usage
-
-1. Download Pre-trained Model
-
-   As training an NMT model is very time consuming, we provide a pre-trained model. The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs) for 5 days. It achieves a [BLEU Score](#BLEU Score) of 26.92, and its size is 205MB.
-
- ```python
- if is_generating:
- parameters = paddle.dataset.wmt14.model()
- ```
-2. Define DataSet
-
-   Get the first 3 samples of the WMT-14 generation set as the source language sequences.
-
- ```python
- if is_generating:
- gen_creator = paddle.dataset.wmt14.gen(dict_size)
- gen_data = []
- gen_num = 3
- for item in gen_creator():
- gen_data.append((item[0], ))
- if len(gen_data) == gen_num:
- break
- ```
-
-3. Create infer
-
-   Use the inference interface `paddle.infer` to return the prediction probabilities (see field `prob`) and labels (see field `id`) of each generated sequence.
-
- ```python
- if is_generating:
- beam_result = paddle.infer(
- output_layer=beam_gen,
- parameters=parameters,
- input=gen_data,
- field=['prob', 'id'])
- ```
-4. Print generated translation
-
-   For each source sequence, print its `beam_size` generated translations, using the dictionaries to map word IDs back to words.
-
- ```python
- if is_generating:
- # get the dictionary
- src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
-
-    # the delimiter element between generated sequences is -1,
- # the first element of each generated sequence is the sequence length
- seq_list = []
- seq = []
- for w in beam_result[1]:
- if w != -1:
- seq.append(w)
- else:
- seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
- seq = []
-
- prob = beam_result[0]
- for i in xrange(gen_num):
- print "\n*******************************************************\n"
- print "src:", ' '.join(
- [src_dict.get(w) for w in gen_data[i][0]]), "\n"
- for j in xrange(beam_size):
- print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
- ```
-
- The generating log is as follows:
- ```text
- src: Les se au sujet de la largeur des sièges alors que de grosses commandes sont en jeu
-
- prob = -19.019573: The will be rotated about the width of the seats , while large orders are at stake .
- prob = -19.113066: The will be rotated about the width of the seats , while large commands are at stake .
- prob = -19.512890: The will be rotated about the width of the seats , while large commands are at play .
- ```
-
-## Summary
-
-End-to-end neural machine translation is a recently developed approach to machine translation. In this chapter, we introduced the typical Encoder-Decoder framework and the attention mechanism. Since NMT is a typical sequence-to-sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstract generation (summarization), and single-turn dialogue can all be solved with the model presented in this chapter.
-
-## References
-
-1. Koehn P. [Statistical machine translation](https://books.google.com.hk/books?id=4v_Cx1wIMLkC&printsec=frontcover&hl=zh-CN&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)[M]. Cambridge University Press, 2009.
-2. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://www.aclweb.org/anthology/D/D14/D14-1179.pdf)[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1724-1734.
-3. Chung J, Gulcehre C, Cho K H, et al. [Empirical evaluation of gated recurrent neural networks on sequence modeling](https://arxiv.org/abs/1412.3555)[J]. arXiv preprint arXiv:1412.3555, 2014.
-4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[C]//Proceedings of ICLR 2015, 2015.
-5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318.
-
-
-This tutorial is contributed by PaddlePaddle, and licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
-
-