diff --git a/.gitignore b/.gitignore
index fcf712243f4bed65633e531c86529c9a08c68ce8..0a0dd02414c32ede8d58d2556709827f9a98bf5c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
-pandoc.template
\ No newline at end of file
+pandoc.template
+.DS_Store
\ No newline at end of file
diff --git a/.tmpl/build.sh b/.tmpl/build.sh
deleted file mode 100644
index 571b685b3018c1eb04a11a6f788bc1e7b2858869..0000000000000000000000000000000000000000
--- a/.tmpl/build.sh
+++ /dev/null
@@ -1,20 +0,0 @@
-#!/bin/bash
-
-function find_line() {
- local fn=$1
- local x=0
- cat $fn | while read line; do
- local x=$(( x+1 ))
- if echo $line | grep '${MARKDOWN}' -q; then
- echo $x
- break
- fi
- done
-}
-
-MD_FILE=$1
-TMPL_FILE=$2
-TPL_LINE=`find_line $TMPL_FILE`
-cat $TMPL_FILE | head -n $((TPL_LINE-1))
-cat $MD_FILE
-cat $TMPL_FILE | tail -n +$((TPL_LINE+1))
diff --git a/.tmpl/template.html b/.tmpl/convert-markdown-into-html.sh
old mode 100644
new mode 100755
similarity index 85%
rename from .tmpl/template.html
rename to .tmpl/convert-markdown-into-html.sh
index 84edf8fa33a7faffbfa44ec78b86cb42d3235dd3..149c686bc502b7fed97453e0769a7ef6ee841b76
--- a/.tmpl/template.html
+++ b/.tmpl/convert-markdown-into-html.sh
@@ -1,3 +1,8 @@
+markdown_file=$1
+
+# Notice: the single quotes around EOF below make the output
+# verbatim. cf. http://stackoverflow.com/a/9870274/724872
+cat <<'EOF'
-
+
+
-
+
-
+
+
+
+
+
+
+
+
+
+
+# Linear Regression
+Let us begin the tutorial with a classical problem called Linear Regression \[[1](#References)\]. In this chapter, we will train a model from a realistic dataset to predict house prices. Some important concepts in Machine Learning will be covered through this example.
+
+The source code for this tutorial is at [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/fit_a_line). If this is your first time using PaddlePaddle, please refer to the [Install Guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
+
+## Problem
+Suppose we have a dataset of $n$ houses. Each house $i$ has $d$ properties and a price $y_i$. A property $x_{i,j}$ ($j=1,\ldots,d$) describes one aspect of the house, for example, the number of rooms in the house, the number of schools or hospitals in the neighborhood, the nearby traffic conditions, etc. Our task is to predict $y_i$ given the set of properties $\{x_{i,1}, \ldots, x_{i,d}\}$. We assume that the price is a linear combination of all the properties, i.e.,
+
+$$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, i=1,\ldots,n$$
+
+where $\omega_1, \ldots, \omega_d$ and $b$ are the model parameters we want to estimate. Once they are learned, given the properties of a house, we will be able to predict a price for that house. This model is called Linear Regression: we regress a value as a linear combination of several other values. In practice, this linear assumption rarely holds exactly, because the real relationship between the house properties and the price is much more complicated. However, because its simple formulation makes model training and analysis easy, Linear Regression has been applied to many real problems. It is an important topic in many classical Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\].
+
+## Results Demonstration
+We first show the training result of our model. We use the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) to train a linear model and predict house prices in Boston. The figure below shows the predictions the model makes for some house prices. The $X$ coordinate of each point represents the median price of a certain type of house, while the $Y$ coordinate represents the value predicted by our linear model. When $X=Y$, the point lies exactly on the dotted line. In other words, the more accurate the model's predictions are, the closer the points are to the dotted line.
+
+
+ Figure 1. Predicted Value vs. Actual Value
+
+
+## Model Overview
+
+### Model Definition
+
+In the UCI Housing Data Set, there are 13 house properties $x_{i,1}, \ldots, x_{i,13}$ that are related to the median house price $y_i$. Thus our model is:
+
+$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$
+
+where $\hat{Y}$ denotes the predicted value, to distinguish it from the actual value $Y$. The model parameters to be learned are $\omega_1, \ldots, \omega_{13}, b$, where the $\omega$ are called the weights and $b$ is called the bias.
+
+Now we need an optimization goal, so that with the learned parameters, $\hat{Y}$ is as close to $Y$ as possible. Here we introduce the concept of a [Loss Function (Cost Function)](https://en.wikipedia.org/wiki/Loss_function). A Loss Function has the following property: given any pair of an actual value $y_i$ and a predicted value $\hat{y}_i$, its output is always non-negative. This non-negative value reflects the model error.
+
+For Linear Regression, the most common Loss Function is the [Mean Squared Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error), which has the following form:
+
+$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
+
+For a dataset of size $n$, MSE is the average of the $n$ squared prediction errors.
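+
+As a quick illustration with made-up numbers (not taken from the dataset): suppose $n=3$, the predictions are $\hat{Y} = (22, 30, 18)$ and the actual values are $Y = (24, 29, 18)$. Then
+
+$$MSE=\frac{1}{3}\left[(22-24)^2+(30-29)^2+(18-18)^2\right]=\frac{4+1+0}{3}\approx 1.67$$
+
+Squaring keeps every term non-negative, so positive and negative errors cannot cancel each other out.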
+
+### Training
+
+After defining our model, training proceeds in several major steps:
+1. Initialize the parameters, including the weights $\omega$ and the bias $b$. For example, we can draw them from a distribution with mean 0 and standard deviation 1.
+2. Run a feedforward pass to compute the network output and the Loss Function.
+3. Run a backward pass to [backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors are propagated from the output layer back to the input layer, and the model parameters are updated according to the corresponding errors.
+4. Repeat steps 2 and 3 until the loss falls below a predefined threshold or the maximum number of iterations is reached.
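+
+To make step 3 concrete, here is a sketch of one update under the assumption of plain gradient descent with a learning rate $\eta$. Both $\eta$ and the sample index in $X_{i,j}$ are notation introduced here for illustration; the optimizer PaddlePaddle actually uses depends on the configuration. Differentiating the MSE above gives
+
+$$\frac{\partial MSE}{\partial \omega_j}=\frac{2}{n}\sum_{i=1}^{n}(\hat{Y_i}-Y_i)X_{i,j}, \qquad \frac{\partial MSE}{\partial b}=\frac{2}{n}\sum_{i=1}^{n}(\hat{Y_i}-Y_i)$$
+
+and each parameter then moves a small step against its gradient:
+
+$$\omega_j \leftarrow \omega_j-\eta\frac{\partial MSE}{\partial \omega_j}, \qquad b \leftarrow b-\eta\frac{\partial MSE}{\partial b}$$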
+
+## Data Preparation
+Run the command below to prepare the data:
+```bash
+cd data && python prepare_data.py
+```
+This command downloads the dataset from the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) and performs some [preprocessing](#Preprocessing). The dataset is split into a training set and a test set.
+
+The dataset contains 506 lines in total, each line describing the properties and the median price of a certain type of house in Boston. The meaning of each property is given below:
+
+
+| Property Name | Explanation | Data Type |
+| ------| ------ | ------ |
+| CRIM | per capita crime rate by town | Continuous|
+| ZN | proportion of residential land zoned for lots over 25,000 sq.ft. | Continuous |
+| INDUS | proportion of non-retail business acres per town | Continuous |
+| CHAS | Charles River dummy variable | Discrete, 1 if tract bounds river; 0 otherwise|
+| NOX | nitric oxides concentration (parts per 10 million) | Continuous |
+| RM | average number of rooms per dwelling | Continuous |
+| AGE | proportion of owner-occupied units built prior to 1940 | Continuous |
+| DIS | weighted distances to five Boston employment centres | Continuous |
+| RAD | index of accessibility to radial highways | Continuous |
+| TAX | full-value property-tax rate per $10,000 | Continuous |
+| PTRATIO | pupil-teacher ratio by town | Continuous |
+| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | Continuous |
+| LSTAT | % lower status of the population | Continuous |
+| MEDV | Median value of owner-occupied homes in $1000's | Continuous |
+
+The last entry is the median house price.
+
+### Preprocessing
+#### Continuous and Discrete Data
+We define a feature vector of length 13 for each house, where each entry of the feature vector corresponds to a property of that house. Our first observation is that among the 13 dimensions, there are 12 continuous dimensions and 1 discrete dimension. Note that although a discrete value is also written as a number such as 0, 1, or 2, its meaning is quite different from that of a continuous value, because the difference between two discrete values has no practical meaning. For example, if we use 0, 1, and 2 to represent `red`, `green`, and `blue` respectively, the numerical difference between `red` and `green` is smaller than that between `red` and `blue`, but we cannot say that `blue` differs from `red` more than `green` does. Therefore, when handling a discrete feature that has $d$ possible values, we usually convert it to $d$ new features that can each only take the value 0 or 1, where the $k$th new feature indicates whether the original feature takes its $k$th value. Alternatively, we can map the discrete feature to a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary discrete value, we do not need to do any preprocessing.
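+
+For instance, the three-valued color feature above would be converted as follows (an illustrative encoding; the order of the new features is arbitrary):
+
+$$\text{red} \mapsto (1,0,0), \qquad \text{green} \mapsto (0,1,0), \qquad \text{blue} \mapsto (0,0,1)$$
+
+Every pair of distinct values is now the same distance apart, so no artificial ordering is imposed.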
+
+#### Feature Normalization
+Another observation is that the value ranges of the 13 features differ greatly (Figure 2). For example, feature B has a value range of [0.32, 396.90] while feature NOX has a range of [0.3850, 0.8170]. For effective optimization, we need data normalization. The goal of data normalization is to scale each feature into roughly the same value range, for example [-0.5, 0.5]. In this example, we adopt a standard way of normalization: subtracting the mean value from the feature and dividing the result by the original value range.
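+
+Written as a formula, with $\bar{x}_j$, $\max_j$ and $\min_j$ denoting the mean, maximum and minimum of feature $j$ over the dataset (symbols introduced here for illustration), the normalized value is
+
+$$x'_{i,j}=\frac{x_{i,j}-\bar{x}_j}{\max_j-\min_j}$$
+
+For NOX the range is $0.8170-0.3850=0.4320$. Assuming a mean of $0.55$ purely for the sake of the example, a raw value of $0.60$ becomes $(0.60-0.55)/0.4320\approx 0.116$.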
+
+There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling):
+- A value range that is too large or too small might cause floating-point overflow or underflow during computation.
+- Different value ranges can make features appear to have different importance to the model (at least at the beginning of training), even though there is no reason to assume this. Such an implicit assumption makes the optimization more difficult and increases the training time considerably.
+- Many Machine Learning techniques or models (e.g., L1/L2 regularization and Vector Space Model) are based on the assumption that all the features have roughly zero means and their value ranges are similar.
+
+
+
+ Figure 2. The value ranges of the features
+
+
+#### Prepare Training and Test Sets
+We split the dataset into two subsets, one for estimating the model parameters, namely, model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal of training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. You can try different split ratios to observe how the two variances change.
+
+Execute the following command to split the dataset and write the training and test sets into the `train.list` and `test.list` files, so that PaddlePaddle can read from them later.
+```bash
+python prepare_data.py -r 0.8  # 8:2 is the default split ratio
+```
+
+When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process begins. These hyperparameters (e.g., the number of layers in the network) are not part of the model parameters and cannot be trained using the same Loss Function. Thus we try several sets of hyperparameters to get several models, compare these trained models on the validation set to pick the best one, and finally evaluate it on the test set. Because our model is relatively simple in this problem, we ignore this validation process for now.
+
+### Provide Data to PaddlePaddle
+After the data is prepared, we use a Python Data Provider to provide data for PaddlePaddle. A Data Provider is a Python function which will be called by PaddlePaddle during training. In this example, the Data Provider only needs to read the data and return it to the training process of PaddlePaddle line by line.
+
+```python
+from paddle.trainer.PyDataProvider2 import *
+import numpy as np
+
+# Define the data types and dimensionality: 13 input features and 1 target value.
+@provider(input_types=[dense_vector(13), dense_vector(1)])
+def process(settings, input_file):
+    # Each row of the loaded array holds the 13 features followed by the price.
+    data = np.load(input_file.strip())
+    for row in data:
+        yield row[:-1].tolist(), row[-1:].tolist()
+
+```
+
+## Model Configuration
+
+### Data Definition
+We first call the function `define_py_data_sources2` to let PaddlePaddle read training and test data through the `dataprovider.py` shown above. PaddlePaddle can accept configuration info from the command line; for example, here we pass a variable named `is_predict` so that the model has a different structure during training and testing.
+```python
+from paddle.trainer_config_helpers import *
+
+is_predict = get_config_arg('is_predict', bool, False)
+
+define_py_data_sources2(
+    train_list='data/train.list',
+    test_list='data/test.list',
+    module='dataprovider',
+    obj='process')
+
+```
+
+### Algorithm Settings
+Next we need to set the details of the optimization algorithm. Due to the simplicity of the Linear Regression model, we only need to set the `batch_size` which defines how many samples are used every time for updating the parameters.
+```python
+settings(batch_size=2)
+```
+
+### Network
+Finally, we use `fc_layer` and `LinearActivation` to represent the Linear Regression model.
+```python
+# input data of 13 dimensional house information
+x = data_layer(name='x', size=13)
+
+y_predict = fc_layer(
+    input=x,
+    param_attr=ParamAttr(name='w'),
+    size=1,
+    act=LinearActivation(),
+    bias_attr=ParamAttr(name='b'))
+
+if not is_predict:  # when training, we use MSE (i.e., regression_cost) as the Loss Function
+    y = data_layer(name='y', size=1)
+    cost = regression_cost(input=y_predict, label=y)
+    outputs(cost)  # output MSE to view the loss change
+else:  # during test, output the prediction value
+    outputs(y_predict)
+```
+
+## Training Model
+We can run the PaddlePaddle command line trainer in the root directory of the code. Here we name the configuration file as `trainer_config.py`. We train 30 passes and save the result in the directory `output`:
+```bash
+./train.sh
+```
+
+## Use Model
+Now we can use the trained model to make predictions.
+```bash
+python predict.py
+```
+Here by default we use the model in `output/pass-00029` for prediction, and compare the actual house price with the predicted one. The result is shown in `predictions.png`.
+If you want to use another model or test on other data, you can pass in a new model path or data path:
+```bash
+python predict.py -m output/pass-00020 -t data/housing.test.npy
+```
+
+## Summary
+In this chapter, we have introduced the Linear Regression model using the UCI Housing Data Set as an example. We have shown how to train and test this model with PaddlePaddle. Many more complex models and techniques are derived from this simple linear model, so it is important to understand how it works.
+
+
+## References
+1. https://en.wikipedia.org/wiki/Linear_regression
+2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. Springer, Berlin: Springer series in statistics, 2001.
+3. Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012.
+4. Bishop C M. Pattern recognition and machine learning[M]. Springer, 2006.
+
+
+ This tutorial was created by PaddlePaddle and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+Image Classification
+=======================
+
+The source code of this chapter is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). First-time users should refer to the PaddlePaddle [Installation Tutorial](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html) for installation instructions.
+
+## Background
+
+Compared to words, images convey information in a more vivid and more easily understood way, often with more artistic sense. They are an important medium for people to convey and exchange ideas. In this chapter, we focus on one of the essential problems in image recognition -- image classification.
+
+Image classification distinguishes images of different categories based on their semantic meaning. It is a core problem in computer vision, and is also the foundation of other higher level computer vision tasks such as object detection, image segmentation, object tracking, action recognition, etc. Image classification has applications in many areas such as face recognition and intelligent video analysis in security systems, traffic scene recognition in transportation systems, content-based image retrieval and automatic photo indexing in web services, image classification in medicine, etc.
+
+In image classification, we first encode the whole image using handcrafted or learned features, and then determine the object category with a classifier. Therefore, feature extraction plays an important role in image classification. Prior to deep learning, the Bag of Words (BoW) model was the most popular method for object classification. BoW was introduced in NLP, where a sentence is represented as a bag of words (words, phrases, or characters) extracted from training sentences. In the context of image classification, the BoW model requires constructing a dictionary. The simplest BoW framework can be designed in three steps: **feature extraction**, **feature encoding**, and **classifier design**.
+
+Deep learning approaches to image classification learn hierarchical features automatically, through supervised or unsupervised learning, instead of crafting or selecting image features manually. Convolutional Neural Networks (CNNs) have made significant progress in image classification. They keep all image information by taking raw image pixels as input, extract low-level and high-level abstract features through convolution operations, and directly output the classification results from the model. This end-to-end learning style leads to good performance and wide application.
+
+In this chapter, we focus on introducing deep learning-based image classification methods, and on explaining how to train a CNN model using PaddlePaddle.
+
+## Demonstration
+
+Image classification covers both general and fine-grained classification. Figure 1 demonstrates the results of general image classification -- the trained model can correctly recognize the main objects in the images.
+
+
+
+Figure 1. General image classification
+
+
+
+Figure 2 demonstrates the results of fine-grained image classification -- flower recognition, which requires correct recognition of flower categories.
+
+
+
+Figure 2. Fine-grained image classification
+
+
+
+A good model should be able to recognize objects of different categories correctly, and should also correctly classify images taken from different points of view, under different illumination, or with object distortion or partial occlusion (we call these image disturbances). Figure 3 shows some images with various disturbances. A good model should be able to classify these images correctly, as humans do.
+
+
+
+Figure 3. Disturbed images [22]
+
+
+## Model Overview
+
+A large amount of research in image classification is built upon public datasets such as [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/) and [ImageNet](http://image-net.org/). Image classification algorithms are usually evaluated and compared on these datasets. PASCAL VOC is a computer vision competition started in 2005, and ImageNet is the dataset for the Large Scale Visual Recognition Challenge (ILSVRC), which started in 2010. In this chapter, we introduce some image classification models from the submissions to these competitions.
+
+Before 2012, traditional image classification was carried out with the three steps described in the Background section. A complete model usually involves the following stages: low-level feature extraction, feature encoding, spatial constraint or feature clustering, classifier design, and model ensemble.
+
+ 1). **Low-level feature extraction**: This step extracts a large number of local features according to fixed strides and scales. Popular local features include Scale-Invariant Feature Transform (SIFT) [1], Histogram of Oriented Gradient (HOG) [2], Local Binary Pattern (LBP) [3], etc. A common practice is to employ multiple feature descriptors in order to avoid missing too much information.
+ 2). **Feature encoding**: Low-level features contain a large amount of redundancy and noise. To improve the robustness of the features, a feature transformation is applied to encode the low-level features; this is called feature encoding. Common feature encoding methods include vector quantization [4], sparse coding [5], locality-constrained linear coding [6], Fisher vector encoding [7], etc.
+ 3). **Spatial constraint**: Spatial constraint or feature clustering is usually adopted after feature encoding for extracting the maximum or average of each dimension in the spatial domain. Pyramid feature matching--a popular feature clustering method--divides an image uniformly into patches, and performs feature clustering in each patch.
+ 4). **Classification**: Upon the above steps, an image can be described by a vector of fixed dimension. Then a classifier can be used to classify the image into categories. Common classifiers include Support Vector Machine(SVM), random forest, etc. Kernel SVM is the most popular classifier, and has achieved very good performance in traditional image classification tasks.
+
+This approach has been widely used for image classification in PASCAL VOC [18]. In ILSVRC 2010, [NEC Labs](http://www.nec-labs.com/) won the championship by employing SIFT and LBP features, two non-linear encoders and an SVM [8].
+
+AlexNet, the CNN model proposed by Alex Krizhevsky et al. [9], made a breakthrough in ILSVRC 2012. It outperformed traditional methods dramatically and won the championship. This was also the first time that a deep learning method was used for large-scale image classification. Since AlexNet, a series of CNN models have been proposed and have steadily advanced the state of the art on ImageNet, as shown in Figure 4. With deeper and more sophisticated architectures, the Top-5 error rate has kept falling, down to around 3.5%. The error rate of human raters on the same ImageNet dataset is 5.1%, which means that on this benchmark the image classification capability of deep learning models has surpassed that of human raters.
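+
+For reference, the Top-5 error counts an image as correct if the true label is among the model's five highest-scoring categories. With $N$ test images, true labels $y_i$, and $T_5(x_i)$ denoting the five top-ranked categories for image $x_i$ (notation introduced here for illustration), it can be written as
+
+$$\text{Top-5 error}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left[y_i\notin T_5(x_i)\right]$$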
+
+
+
+### CNN
+
+Traditional CNNs consist of convolutional and fully-connected layers, and employ a softmax multi-category classifier with cross-entropy as the loss function. Figure 5 shows a typical CNN. We first introduce the common parts of a CNN.
+
+
+
+Figure 5. A CNN example [20]
+
+
+- convolutional layer: It uses convolution operation to extract low-level and high-level features, and to discover local correlation and spatial invariance.
+
+- pooling layer: It down-samples feature maps by extracting the local maximum (max-pooling) or average (avg-pooling) of each patch in the feature map. Down-sampling, a common operator in image processing, can be used to filter out high-frequency information.
+
+- fully-connected layer: It fully connects neurons between two adjacent layers.
+
+- non-linear activation: Convolutional and fully-connected layers are usually followed by non-linear activation layers, such as Sigmoid, Tanh, and ReLU, to enhance the expression capability. ReLU is the most commonly used activation function in CNNs.
+
+- Dropout [10]: At each training stage, individual nodes are dropped out of the net with a certain probability in order to improve generalization and to avoid overfitting.
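+
+The softmax classifier and cross-entropy loss mentioned at the start of this section can be written out as follows. Given the scores $z_1,\ldots,z_K$ that the last layer produces for $K$ categories (symbols introduced here for illustration), softmax turns them into probabilities, and the cross-entropy loss for an image whose true category is $c$ is the negative log-probability assigned to $c$:
+
+$$p_k=\frac{e^{z_k}}{\sum_{j=1}^{K}e^{z_j}}, \qquad L=-\log p_c$$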
+
+Because the parameters of every layer are updated during training, the distribution of each layer's inputs keeps changing, which requires careful tuning of hyper-parameters. In 2015, Sergey Ioffe and Christian Szegedy proposed the Batch Normalization (BN) algorithm [14], which normalizes the features of each batch in a layer and keeps the distribution in each layer relatively stable. BN not only acts as a regularizer, but also reduces the need for careful hyper-parameter design. Experiments demonstrate that BN accelerates training convergence, and it has been widely used in later, deeper models.
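+
+As a sketch of the core BN transform, following the BN paper [14]: for a mini-batch with mean $\mu_B$ and variance $\sigma_B^2$, each activation $x$ is normalized and then rescaled by two learned parameters $\gamma$ and $\beta$ ($\epsilon$ is a small constant added for numerical stability):
+
+$$\hat{x}=\frac{x-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}, \qquad y=\gamma\hat{x}+\beta$$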
+
+We will introduce the network architectures of VGG, GoogleNet and ResNets in the following sections.
+
+### VGG
+
+The Oxford Visual Geometry Group (VGG) proposed the VGG network in ILSVRC 2014 [11]. The model is deeper and wider than previous neural architectures. It comprises five main groups of convolution operations, with max-pooling layers between adjacent convolution groups. Each group contains a series of 3x3 convolutional layers, whose number of convolution kernels stays the same within the group and increases from 64 in the first group to 512 in the last one. The total number of learnable layers can be 11, 13, 16, or 19, depending on the number of convolutional layers in each group. Figure 6 illustrates a 16-layer VGG. The neural architecture of VGG is relatively simple, and has been adopted by many papers, such as the first one that surpassed human-level performance on ImageNet [19].
+
+
+
+Figure 6. VGG16 model for ImageNet
+
+
+### GoogleNet
+
+GoogleNet [12] won the championship in ILSVRC 2014. Before introducing this model, let's get familiar with the Network in Network (NIN) model [13], from which GoogleNet borrowed some ideas, and with the Inception blocks upon which GoogleNet is built.
+
+The NIN model has two main characteristics: 1) It replaces the single-layer convolutional network with Multi-Layer Perceptron Convolution, or MLPconv. MLPconv, a tiny multi-layer convolutional network, enhances non-linearity by adding several 1x1 convolutional layers after linear ones. 2) In traditional CNNs, the last few layers are usually fully-connected, with a large number of parameters. In contrast, NIN replaces all fully-connected layers with convolutional layers whose number of feature maps equals the number of categories, followed by global average pooling. This replacement of fully-connected layers significantly reduces the number of parameters.
+
+Figure 7 depicts two Inception blocks. Figure 7(a) is the simplest design, the output of which is a concatenation of features from three convolutional layers and one pooling layer. The disadvantage of this design is that the pooling layer does not change the number of filters, which leads to an increase in the number of outputs. After several such blocks, the number of outputs and parameters becomes larger and larger, leading to higher computational complexity. To overcome this drawback, the Inception block in Figure 7(b) employs three 1x1 convolutional layers to reduce the dimension, i.e., the number of channels, while also improving the non-linearity of the network.
+
+
+
+Figure 7. Inception block
+
+
+GoogleNet consists of multiple stacked Inception blocks followed by an avg-pooling layer, as in NIN, in place of traditional fully-connected layers. The difference between GoogleNet and NIN is that GoogleNet adds a fully-connected layer after the avg-pooling layer to output a vector of category size. Besides these two characteristics, the features from the middle layers of GoogleNet are also very discriminative. Therefore, GoogleNet inserts two auxiliary classifiers into the model to strengthen the gradient and add regularization during backpropagation. The loss function of the whole network is the weighted sum of the losses of these three classifiers.
+
+Figure 8 illustrates the neural architecture of GoogleNet, which consists of 22 layers: it starts with three regular convolutional layers followed by three groups of sub-networks -- the first group contains two Inception blocks, the second one five, and the third one two. It ends with an average pooling layer and a fully-connected layer.
+
+
+
+Figure 8. GoogleNet[12]
+
+
+The above model is the first version of GoogleNet, or GoogleNet-v1. GoogleNet-v2 [14] introduces the BN layer; GoogleNet-v3 [16] further splits some convolutional layers, which increases non-linearity and network depth; GoogleNet-v4 [17] draws on the design idea of ResNet, which will be introduced in the next section. Accuracy improves consistently from v1 to v4. We will not go into the details of the neural architectures of v2 to v4.
+
+### ResNet
+
+Residual Network (ResNet) [15] won the 2015 championships in three ImageNet competitions -- image classification, object localization and object detection. The authors of ResNet proposed a residual learning approach to ease the difficulty of training deeper networks, where accuracy degrades as the network depth increases. Building on the design ideas of BN, small convolutional kernels, and fully convolutional networks, ResNet reformulates the layers as residual blocks. Each block contains two branches: one directly connects the input to the output, and the other performs two to three convolutions and computes the residual function with reference to the layer inputs. The outputs of these two branches are then added up.
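+
+In formula form, writing $x$ for the block input and $F(x)$ for the output of the convolutional branch (following the notation of the ResNet paper [15]), a residual block computes
+
+$$y=F(x)+x$$
+
+so the convolutional branch only has to learn the residual $F(x)=y-x$; when the identity mapping is already close to optimal, it is enough to push $F(x)$ toward zero.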
+
+Figure 9 illustrates the architecture of ResNet. The left is the basic building block, consisting of two 3x3 convolutional layers with the same number of channels. The right one is a bottleneck block. The bottleneck is a 1x1 convolutional layer used to reduce the dimension from 256 to 64. The other 1x1 convolutional layer is used to increase the dimension from 64 back to 256. Therefore, the number of input and output channels of the middle 3x3 convolutional layer, which is 64, is relatively small.
+
+
+
+Figure 9. Residual block
+
+
+Figure 10 illustrates ResNets with 50, 101, 152 layers, respectively. All three networks use bottleneck blocks of different numbers of repetitions. ResNet converges very fast and can be trained with hundreds or thousands of layers.
+
+
+
+Figure 10. ResNet model for ImageNet
+
+
+
+## Data Preparation
+
+### Data description and downloading
+
+Commonly used public datasets for image classification are [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html), [ImageNet](http://image-net.org/), [COCO](http://mscoco.org/), etc. Those used for fine-grained image classification are [CUB-200-2011](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), [Stanford Dog](http://vision.stanford.edu/aditya86/ImageNetDogs/), [Oxford-flowers](http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among them, ImageNet is the largest, and most research results are reported on ImageNet, as mentioned in the Model Overview section. Since 2010, the ImageNet data has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories. There are 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images, with 50 images per category on average.
+
+Since ImageNet is too large to be downloaded and trained on efficiently, we use [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Figure 11 shows all the classes in CIFAR10 as well as 10 images randomly sampled from each category.
+
+
+
+Figure 11. CIFAR10 dataset[21]
+
+
+The following command is used for downloading data and calculating the mean image used for data preprocessing.
+
+```bash
+./data/get_data.sh
+```
+
+### Data provider for PaddlePaddle
+
+We use the Python interface to provide data to PaddlePaddle. The following file, `dataprovider.py`, is a complete example for CIFAR10.
+
+- The `initializer` function initializes the data provider: it loads the mean image and defines two input types -- image and label.
+
+- The `process` function sends preprocessed data to PaddlePaddle. The preprocessing performed in this function consists of data perturbation by random horizontal flipping and subtraction of the mean image from the raw image.
+
+```python
+import numpy as np
+import cPickle
+from paddle.trainer.PyDataProvider2 import *
+
+
+def initializer(settings, mean_path, is_train, **kwargs):
+    settings.is_train = is_train
+    settings.input_size = 3 * 32 * 32
+    # Load the mean image used for preprocessing.
+    settings.mean = np.load(mean_path)['mean']
+    settings.input_types = {
+        'image': dense_vector(settings.input_size),
+        'label': integer_value(10)
+    }
+
+
+@provider(init_hook=initializer, pool_size=50000)
+def process(settings, file_list):
+    with open(file_list, 'r') as fdata:
+        for fname in fdata:
+            # Each listed file is one pickled CIFAR-10 batch.
+            with open(fname.strip(), 'rb') as fo:
+                batch = cPickle.load(fo)
+            images = batch['data']
+            labels = batch['labels']
+            for im, lab in zip(images, labels):
+                # Random horizontal flip, applied to training data only.
+                if settings.is_train and np.random.randint(2):
+                    im = im.reshape(3, 32, 32)
+                    im = im[:, :, ::-1]
+                    im = im.flatten()
+                im = im - settings.mean
+                yield {
+                    'image': im.astype('float32'),
+                    'label': int(lab)
+                }
+```
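+
+The flip in `process` can be checked in isolation. The sketch below is a minimal, standalone illustration and not part of `dataprovider.py`: it builds a small fake image (the 1x2x3 size and the helper name `hflip_flat` are assumptions made for readability) and applies the same reshape, reverse, flatten sequence.
+
+```python
+import numpy as np
+
+
+def hflip_flat(im, channels, height, width):
+    """Horizontally flip a flattened (channels * height * width) image."""
+    im = im.reshape(channels, height, width)
+    im = im[:, :, ::-1]  # reverse the width axis of every channel
+    return im.flatten()
+
+
+# Hypothetical 1-channel, 2x3 image; CIFAR-10 would use (3, 32, 32).
+im = np.arange(6)
+flipped = hflip_flat(im, 1, 2, 3)
+print(flipped)
+```
+
+Applying the flip twice returns the original image. In `process`, the mean image is subtracted after the optional flip, so flipped and unflipped images are centered in the same way.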
+
+## Model Config
+
+### Data Definition
+
+In the model config file, the function `define_py_data_sources2` sets the argument `module` to the data provider file used for loading data, and `args` to the mean image file. If the config file is used for prediction, there is no need to set the argument `train_list`.
+
+```python
+from paddle.trainer_config_helpers import *
+
+is_predict = get_config_arg("is_predict", bool, False)
+if not is_predict:
+    define_py_data_sources2(
+        train_list='data/train.list',
+        test_list='data/test.list',
+        module='dataprovider',
+        obj='process',
+        args={'mean_path': 'data/mean.meta'})
+```
+
+### Algorithm Settings
+
+In the model config file, the function `settings` specifies the optimization algorithm, batch size, learning rate, momentum, and L2 regularization.
+
+```python
+settings(
+    batch_size=128,
+    learning_rate=0.1 / 128.0,
+    learning_rate_decay_a=0.1,
+    learning_rate_decay_b=50000 * 100,
+    learning_rate_schedule='discexp',
+    learning_method=MomentumOptimizer(0.9),
+    regularization=L2Regularization(0.0005 * 128))
+```
+
+The learning rate schedule is defined by the variables `learning_rate_decay_a` ($a$), `learning_rate_decay_b` ($b$), and `learning_rate_schedule`. This example uses the discrete exponential method to adjust the learning rate:
+$$ lr = lr_{0} \times a^{\lfloor \frac{n}{b} \rfloor} $$
+where $n$ is the number of samples processed so far and $lr_{0}$ is the `learning_rate` set in `settings`.
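+
+To make the schedule concrete, the following sketch evaluates the formula with the values from the `settings` call. It is a plain Python illustration of the formula, not PaddlePaddle's implementation, and the function name `discexp_lr` is chosen here for illustration only.
+
+```python
+def discexp_lr(lr0, a, b, n):
+    """Discrete exponential decay: lr = lr0 * a ** floor(n / b)."""
+    return lr0 * a ** (n // b)
+
+
+# Values copied from the `settings` call.
+lr0 = 0.1 / 128.0
+a = 0.1
+b = 50000 * 100  # 50,000 training images per pass, so one decay step every 100 passes
+
+for n in (0, b - 1, b, 2 * b):
+    print(n, discexp_lr(lr0, a, b, n))
+```
+
+The learning rate stays constant between decay steps and is multiplied by $a$ each time another $b$ samples have been processed.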
+
+### Model Architecture
+
+Here we provide the config files for the VGG and ResNet models.
+
+#### VGG
+
+First we define the VGG network. Since CIFAR-10 images are smaller and fewer than those in ImageNet, we use a small version of the VGG network for CIFAR-10. Its convolution groups incorporate BN and dropout operations.
+
+1. Define input data and its dimension
+
+    The input to the network is defined with `data_layer`; in image classification it holds the image pixels. The images in CIFAR-10 are 32x32 color images with three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10.
+
+    ```python
+    datadim = 3 * 32 * 32
+    classdim = 10
+    data = data_layer(name='image', size=datadim)
+    ```
+
+2. Define VGG main module
+
+    ```python
+    net = vgg_bn_drop(data)
+    ```
+    The input to the VGG main module comes from the data layer. `vgg_bn_drop` defines a 16-layer VGG network in which each convolutional layer is followed by BN and dropout layers. Here is the definition in detail:
+
+    ```python
+    def vgg_bn_drop(input):
+        def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
+            return img_conv_group(
+                input=ipt,
+                num_channels=num_channels_,
+                pool_size=2,
+                pool_stride=2,
+                conv_num_filter=[num_filter] * groups,
+                conv_filter_size=3,
+                conv_act=ReluActivation(),
+                conv_with_batchnorm=True,
+                conv_batchnorm_drop_rate=dropouts,
+                pool_type=MaxPooling())
+
+        # Only the first group is given the input channel count (3 for color images).
+        conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
+        conv2 = conv_block(conv1, 128, 2, [0.4, 0])
+        conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
+        conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
+        conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
+
+        drop = dropout_layer(input=conv5, dropout_rate=0.5)
+        fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
+        bn = batch_norm_layer(
+            input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
+        fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
+        return fc2
+    ```
+
+    2.1. First, define a convolution block, `conv_block`. The convolution kernel is 3x3, and the pooling size is 2x2 with stride 2. `dropouts` specifies the dropout probability of each convolution in the block. The function `img_conv_group`, defined in `paddle.trainer_config_helpers`, consists of a series of `Conv->BN->ReLU->Dropout` operations followed by a `Pooling` operation.
+
+
+    2.2. Next come five groups of convolutions. The first two groups perform two convolutions each, while the last three groups perform three convolutions each. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for that layer.
+
+
+    2.3. The last two layers are fully-connected layers of dimension 512.
+
+3. Define Classifier
+
+    The VGG network above extracts high-level features. A fully-connected layer then maps them to a vector with one entry per category, and a softmax classifier computes the probability of the image belonging to each category.
+
+    ```python
+    out = fc_layer(input=net, size=classdim, act=SoftmaxActivation())
+    ```
+
+4. Define Loss Function and Outputs
+
+    In supervised learning, the labels of the training images are also defined with `data_layer`. During training, cross-entropy is used as the loss function and is the output of the network; during testing, the outputs are the probabilities computed by the classifier.
+
+    ```python
+    if not is_predict:
+        lbl = data_layer(name="label", size=classdim)
+        cost = classification_cost(input=out, label=lbl)
+        outputs(cost)
+    else:
+        outputs(out)
+    ```
+
+#### ResNet
+
+The first, third, and fourth steps for ResNet are the same as for VGG. Only the second step, the main module, differs.
+
+```python
+net = resnet_cifar10(data, depth=56)
+```
+
+Here are some basic functions used in `resnet_cifar10`:
+
+ - `conv_bn_layer` : convolutional layer followed by BN.
+ - `shortcut` : the shortcut branch in a residual block. There are two kinds of shortcuts: a 1x1 convolution, used when the input and output have different numbers of channels, and a direct connection, used otherwise.
+ - `basicblock` : a basic residual module, shown on the left of Figure 9, consisting of two sequential 3x3 convolutions and one "shortcut" branch.
+ - `bottleneck` : a bottleneck module, shown on the right of Figure 9, consisting of a branch of two 1x1 convolutions with one 3x3 convolution in between, plus a "shortcut" branch.
+ - `layer_warp` : a group of residual modules built by stacking several blocks. In each group, the stride of the first residual block may differ from that of the remaining blocks, in order to reduce the size of the feature maps along the horizontal and vertical directions.
+
+```python
+def conv_bn_layer(input,
+                  ch_out,
+                  filter_size,
+                  stride,
+                  padding,
+                  active_type=ReluActivation(),
+                  ch_in=None):
+    tmp = img_conv_layer(
+        input=input,
+        filter_size=filter_size,
+        num_channels=ch_in,
+        num_filters=ch_out,
+        stride=stride,
+        padding=padding,
+        act=LinearActivation(),
+        bias_attr=False)
+    return batch_norm_layer(input=tmp, act=active_type)
+
+
+def shortcut(ipt, n_in, n_out, stride):
+    # Project with a 1x1 convolution only when the channel counts differ.
+    if n_in != n_out:
+        return conv_bn_layer(ipt, n_out, 1, stride, 0, LinearActivation())
+    else:
+        return ipt
+
+
+def basicblock(ipt, ch_out, stride):
+    ch_in = ipt.num_filters
+    tmp = conv_bn_layer(ipt, ch_out, 3, stride, 1)
+    tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, LinearActivation())
+    short = shortcut(ipt, ch_in, ch_out, stride)
+    # Add the residual branch (tmp) to the shortcut branch.
+    return addto_layer(input=[tmp, short], act=ReluActivation())
+
+
+def bottleneck(ipt, ch_out, stride):
+    ch_in = ipt.num_filters
+    tmp = conv_bn_layer(ipt, ch_out, 1, stride, 0)
+    tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1)
+    tmp = conv_bn_layer(tmp, ch_out * 4, 1, 1, 0, LinearActivation())
+    # The residual branch outputs ch_out * 4 channels, so the shortcut must match.
+    short = shortcut(ipt, ch_in, ch_out * 4, stride)
+    return addto_layer(input=[tmp, short], act=ReluActivation())
+
+
+def layer_warp(block_func, ipt, features, count, stride):
+    # Only the first block of a group uses the given stride; the rest use stride 1.
+    tmp = block_func(ipt, features, stride)
+    for i in range(1, count):
+        tmp = block_func(tmp, features, 1)
+    return tmp
+```
+
+The following are the components of `resnet_cifar10`:
+
+1. The lowest level is `conv_bn_layer`.
+2. The middle level consists of three `layer_warp` groups, each of which uses the left residual block in Figure 9.
+3. The last level is an average pooling layer.
+
+Note: apart from the first convolutional layer and the last fully-connected layer, the total number of layers in the three `layer_warp` groups must be divisible by 6; that is, the depth of `resnet_cifar10` must satisfy $(depth - 2) \bmod 6 = 0$.
+
+```python
+def resnet_cifar10(ipt, depth=56):
+    # depth should be one of 20, 32, 44, 56, 110, 1202
+    assert (depth - 2) % 6 == 0
+    n = (depth - 2) // 6  # residual blocks per group (integer division)
+    conv1 = conv_bn_layer(ipt,
+                          ch_in=3,
+                          ch_out=16,
+                          filter_size=3,
+                          stride=1,
+                          padding=1)
+    res1 = layer_warp(basicblock, conv1, 16, n, 1)
+    res2 = layer_warp(basicblock, res1, 32, n, 2)
+    res3 = layer_warp(basicblock, res2, 64, n, 2)
+    pool = img_pool_layer(input=res3,
+                          pool_size=8,
+                          stride=1,
+                          pool_type=AvgPooling())
+    return pool
+```
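+
+The depth constraint in the note above can be checked with a few lines of plain Python. This is a minimal sketch independent of PaddlePaddle; the helper name `blocks_per_group` is an assumption made for illustration. It counts two convolutions per `basicblock`, three groups, plus the first convolutional layer and the last fully-connected layer.
+
+```python
+def blocks_per_group(depth):
+    """Number of residual blocks in each of the three groups of resnet_cifar10."""
+    if (depth - 2) % 6 != 0:
+        raise ValueError("depth must satisfy (depth - 2) % 6 == 0, got %d" % depth)
+    return (depth - 2) // 6
+
+
+for depth in (20, 32, 44, 56, 110, 1202):
+    n = blocks_per_group(depth)
+    # 3 groups x n blocks x 2 convolutions, plus the first conv and the final fc layer
+    assert 3 * n * 2 + 2 == depth
+    print(depth, n)
+```
+
+For the default `depth=56`, each group stacks 9 basic blocks.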
+
+## Model Training
+
+We can train the model by running the script `train.sh`, which specifies the config file, device type, number of threads, number of passes, path for saving trained models, etc.
+
+```bash
+sh train.sh
+```
+
+Here is an example script `train.sh`:
+
+```bash
+#cfg=models/resnet.py
+cfg=models/vgg.py
+output=output
+log=train.log
+
+paddle train \
+    --config=$cfg \
+    --use_gpu=true \
+    --trainer_count=1 \
+    --log_period=100 \
+    --num_passes=300 \
+    --save_dir=$output \
+    2>&1 | tee $log
+```
+
+- `--config=$cfg` : specifies the config file. The default is `models/vgg.py`.
+- `--use_gpu=true` : uses the GPU for training. To train on the CPU, set it to `false`.
+- `--trainer_count=1` : specifies the number of threads or GPUs.
+- `--log_period=100` : specifies the number of batches between two logs.
+- `--num_passes=300` : specifies the number of training passes.
+- `--save_dir=$output` : specifies the path for saving trained models.
+
+Here is an example log after training for one pass. The average error rate is 0.79958 on the training set and 0.7858 on the test set.
+
+```text
+TrainerInternal.cpp:165] Batch=300 samples=38400 AvgCost=2.07708 CurrentCost=1.96158 Eval: classification_error_evaluator=0.81151 CurrentEval: classification_error_evaluator=0.789297
+TrainerInternal.cpp:181] Pass=0 Batch=391 samples=50000 AvgCost=2.03348 Eval: classification_error_evaluator=0.79958
+Tester.cpp:115] Test samples=10000 cost=1.99246 Eval: classification_error_evaluator=0.7858
+```
+
+Figure 12 shows the error rate curve during training. The model converges at pass 200 with an error rate of 8.54%.
+
+
+
+Figure 12. The error rate of VGG model on CIFAR10
+
+
+## Model Application
+
+After training is done, the model from each pass is saved in `output/pass-%05d`. For example, the model of Pass 300 is saved in `output/pass-00299`. The script `classify.py` can be used to extract features and to classify an image. The default config file of this script is `models/vgg.py`.
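+
+The off-by-one between the pass count and the directory name follows from zero-based numbering in `pass-%05d`. The sketch below only illustrates that naming rule; the helper `pass_dir` is hypothetical and is not part of `classify.py` or PaddlePaddle.
+
+```python
+def pass_dir(save_dir, pass_number):
+    """Directory of the model saved after the given 1-based pass number."""
+    # Directories are numbered from zero, so pass 1 is stored in pass-00000.
+    return "%s/pass-%05d" % (save_dir, pass_number - 1)
+
+
+print(pass_dir("output", 300))
+```
+
+With `--num_passes=300` and `--save_dir=output`, the last model is therefore `output/pass-00299`, which is the path used in the commands that follow.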
+
+
+### Prediction
+
+We can run the following script to predict the category of an image. The default device is the GPU. To use the CPU instead, add the `-c` flag.
+
+```bash
+python classify.py --job=predict --model=output/pass-00299 --data=image/dog.png # -c
+```
+
+Here is the result:
+
+```text
+Label of image/dog.png is: 5
+```
+
+### Feature Extraction
+
+We can run the following command to extract features from an image. Here `job` should be `extract` and the default layer is the first convolutional layer. Figure 13 shows the 64 feature maps output from the first convolutional layer of the VGG model.
+
+```bash
+python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png # -c
+```
+
+
+
+Figure 13. Visualization of convolution layer feature maps
+
+
+## Conclusion
+
+Traditional image classification methods involve multiple stages of processing, and the resulting framework is very complicated. In contrast, CNN models can be trained end-to-end and achieve significantly higher classification accuracy. In this chapter, we introduced three models (VGG, GoogLeNet, and ResNet), provided PaddlePaddle config files for training VGG and ResNet on CIFAR-10, and explained how to perform prediction and feature extraction with the PaddlePaddle API. For other datasets such as ImageNet, the configuration and training procedures are the same, and you are welcome to give them a try.
+
+
+## Reference
+
+[1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004.
+
+[2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
+
+[3] Ahonen, T., Hadid, A., and Pietikäinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
+
+[4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
+
+[5] B. Olshausen, D. Field, [Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?](http://redwood.psych.cornell.edu/papers/olshausen_field_1997.pdf), Vision Research, vol. 37, pp. 3311-3325, 1997.
+
+[6] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. (2010). [Locality-constrained Linear Coding for image classification](http://ieeexplore.ieee.org/abstract/document/5540018/). In CVPR.
+
+[7] Perronnin, F., Sánchez, J., & Mensink, T. (2010). [Improving the fisher kernel for large-scale image classification](http://dl.acm.org/citation.cfm?id=1888101). In ECCV (4).
+
+[8] Lin, Y., Lv, F., Cao, L., Zhu, S., Yang, M., Cour, T., Yu, K., and Huang, T. (2011). [Large-scale image classification: Fast feature extraction and SVM training](http://ieeexplore.ieee.org/document/5995477/). In CVPR.
+
+[9] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). [ImageNet classification with deep convolutional neural networks](http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf). In NIPS.
+
+[10] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). arXiv preprint arXiv:1207.0580, 2012.
+
+[11] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. [Return of the Devil in the Details: Delving Deep into Convolutional Nets](https://arxiv.org/abs/1405.3531). BMVC, 2014.
+
+[12] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., [Going deeper with convolutions](https://arxiv.org/abs/1409.4842). In: CVPR. (2015)
+
+[13] Lin, M., Chen, Q., and Yan, S. [Network in network](https://arxiv.org/abs/1312.4400). In Proc. ICLR, 2014.
+
+[14] S. Ioffe and C. Szegedy. [Batch normalization: Accelerating deep network training by reducing internal covariate shift](https://arxiv.org/abs/1502.03167). In ICML, 2015.
+
+[15] K. He, X. Zhang, S. Ren, J. Sun. [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385). CVPR 2016.
+
+[16] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. [Rethinking the inception architecture for computer vision](https://arxiv.org/abs/1512.00567). In: CVPR. (2016).
+
+[17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016).
+
+[18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective](http://link.springer.com/article/10.1007/s11263-014-0733-5). International Journal of Computer Vision, 111(1), 98-136, 2015.
+
+[19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015.
+
+[20] http://deeplearning.net/tutorial/lenet.html
+
+[21] https://www.cs.toronto.edu/~kriz/cifar.html
+
+[22] http://cs231n.github.io/classification/
+
+
+This tutorial was created by PaddlePaddle and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+# Semantic Role Labeling
+
+The source code of this chapter is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/label_semantic_roles).
+
+## Background
+
+Natural language analysis consists of three components: lexical analysis, syntactic analysis, and semantic analysis. Semantic Role Labeling (SRL) is one approach to shallow semantic analysis. The predicate of a sentence states a property of the subject, such as what it does, what it is, or how it is, and mostly corresponds to the core of an event. A noun associated with a predicate is called an argument. Semantic roles express the abstract roles that the arguments of a predicate can take in the event, such as Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal, and Source.
+
+In the following example, "遇到" (met) is the predicate ("Pred"), "小明" (Xiao Ming) is the Agent, "小红" (Xiao Hong) is the Patient, "昨天" (yesterday) indicates when the event occurred (Time), and "公园" (park) indicates where it occurred (Location). The sentence reads "Xiao Ming met Xiao Hong in the park last night".
+
+$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_{\mbox{Time}}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$
+
+Rather than analyzing the semantic information of a sentence in depth, Semantic Role Labeling aims to identify the relations between a predicate and the other constituents, i.e., the predicate-argument structure, and to label them with specific semantic roles. It is an important intermediate step in a wide range of natural language understanding tasks (information extraction, discourse analysis, DeepQA, etc.). The predicates are usually assumed to be given; the task is then to identify the arguments and their semantic roles.
+
+Standard SRL systems are mostly built on top of syntactic analysis and consist of five steps:
+
+1. Construct a syntactic parse tree, as shown in Fig. 1.
+2. Identify candidate arguments of the given predicate from the constructed syntactic parse tree.
+3. Prune the most unlikely candidate arguments.
+4. Identify arguments, which is usually solved as a binary classification problem.
+5. Assign semantic role labels by multi-class classification.
+
+Steps 2-3 usually rely on hand-designed features derived from the syntactic analysis in step 1.
+
+
+
+
+Fig 1. Syntactic parse tree
+
+
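+核心关系 (head) -> HED
+定中关系 (attribute) -> ATT
+主谓关系 (subject-verb) -> SBV
+状中结构 (adverbial) -> ADV
+介宾关系 (preposition-object) -> POB
+右附加关系 (right adjunct) -> RAD
+动宾关系 (verb-object) -> VOB
+标点 (punctuation) -> WP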
+
+
+However, complete syntactic analysis must identify the relations among all constituents, and the performance of SRL is sensitive to the precision of the syntactic analysis, which makes SRL a very challenging task. To reduce the complexity while still obtaining some syntactic structure information, shallow syntactic analysis was proposed. Shallow syntactic analysis is also called partial parsing or chunking. Unlike complete syntactic analysis, which requires constructing a complete parse tree, shallow syntactic analysis only needs to identify some independent components with relatively simple structure, such as verb phrases; these components are called chunks. To avoid the difficulty of constructing a highly accurate syntactic tree, some work\[[1](#Reference)\] proposed SRL methods based on semantic chunking, which treat SRL as a sequence tagging problem. Sequence tagging tasks label chunks using the BIO representation. For the tokens forming a chunk of type A, the first token receives the tag B-A (Begin), the remaining ones receive the tag I-A (Inside), and all tokens outside any chunk receive the tag O.
+
+The BIO representation of the above example is shown in Fig. 2.
+
+
+
+Fig 2. BIO representation
+
+
+输入序列-> input sequence
+语块-> chunk
+标注序列-> label sequence
+角色-> role
+
+This example illustrates the simplicity of sequence tagging: (1) it relies only on shallow syntactic analysis, which lowers the precision required of the syntactic analysis; (2) the step of pruning candidate arguments is removed; (3) argument identification and labeling are done at the same time. Such a unified method simplifies the procedure, reduces the risk of accumulating errors, and can further improve performance.
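+
+The BIO scheme can be sketched in a few lines of Python. This is an illustrative helper, not code from this chapter: the function name `chunks_to_bio`, the `(tokens, role)` input format, and the English example sentence are assumptions chosen for readability.
+
+```python
+def chunks_to_bio(chunks):
+    """Convert (tokens, role) chunks into one BIO tag per token.
+
+    A role of None marks tokens outside any argument.
+    """
+    tags = []
+    for tokens, role in chunks:
+        if role is None:
+            tags.extend(["O"] * len(tokens))
+        else:
+            tags.append("B-" + role)  # first token of the chunk
+            tags.extend(["I-" + role] * (len(tokens) - 1))  # remaining tokens
+    return tags
+
+
+chunks = [
+    (["A", "record", "date"], "A1"),
+    (["has"], None),
+    (["n't"], "AM-NEG"),
+    (["been"], None),
+    (["set"], "V"),
+    (["."], None),
+]
+print(chunks_to_bio(chunks))
+```
+
+Every token receives exactly one tag, so argument identification and role labeling reduce to a single per-token classification problem.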
+
+In this tutorial, we build an end-to-end SRL system with a neural network. It takes only text sequences as input, without any syntactic parsing results or complex hand-designed features. We use the public dataset of the [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) as an example of the following task: given a sentence and its predicates, identify the corresponding arguments and their semantic roles by sequence tagging.
+
+## Model
+
+Recurrent Neural Networks (RNNs) are important tools for sequence modeling and have been used successfully in several natural language processing tasks. Unlike feed-forward neural networks, RNNs can model the dependencies between elements of a sequence. LSTMs, a variant of RNNs, aim to model long-term dependencies in long sequences. We introduced them in [understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment). In this chapter, we continue to use LSTMs to solve the SRL problem.
+
+### Stacked Recurrent Neural Network
+
+Deep neural networks can extract hierarchical representations: higher layers form more abstract and complex representations on top of lower layers. An LSTM unfolded in time is deep, because a computational path from the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. However, the computation carried out at each time step is only a linear transformation, which makes the LSTM a shallow model in that respect. Deep LSTMs are typically constructed by stacking multiple LSTM layers on top of each other, taking the output of the lower LSTM layer at time $t$ as the input of the upper LSTM layer at time $t$. Deep, hierarchical neural networks can be much more efficient at representing some functions and at modeling dependencies of varying lengths\[[2](#Reference)\].
+
+
+However, deep LSTMs increase the number of nonlinear steps the gradient has to traverse when propagated back through depth. For example, LSTMs with 4 layers can be trained properly, but performance degrades as the number of layers grows to 4-8. Conventional LSTMs prevent back-propagated errors from vanishing or exploding by introducing shortcut connections that skip the intermediate nonlinear layers. Deep LSTMs can therefore use shortcut connections in depth as well.
+
+
+The operation of a single LSTM cell consists of three parts: (1) input-to-hidden: map the input $x$ to the inputs of the forget gates, input gates, memory cells, and output gates by a linear transformation (i.e., a matrix mapping); (2) hidden-to-hidden: compute the forget gates, input gates, and output gates and update the memory cell; this is the main part of the LSTM; (3) hidden-to-output: this part typically applies an activation to the hidden states. On top of the stacked LSTMs described above, we add a shortcut connection: the input-to-hidden mapping of the previous layer is taken as a new input, and another linear transformation is learned for it.
+
+Fig. 3 illustrates the final stacked recurrent neural network.
+
+
+
+Fig 3. Stacked Recurrent Neural Networks
+
+
+线性变换-> linear transformation
+输入层到隐层-> input-to-hidden
+
+### Bidirectional Recurrent Neural Network
+
+LSTMs can summarize the history of the inputs seen so far, but cannot see the future. In most natural language processing tasks, the entire sentence is available at once. Sequence learning can therefore be much more effective if future inputs are encoded in the same way as the history.
+
+To address this drawback, we can build a bidirectional recurrent neural network with a minor modification: each higher LSTM layer processes the sequence in the direction opposite to that of the layer below it, i.e., the layers of the deep LSTM operate left-to-right, right-to-left, left-to-right, and so on, through the depth of the network. From the second layer on, the LSTM at time step $t$ can therefore see both the history and the future. Fig. 4 illustrates the bidirectional recurrent neural network.
+
+
+
+
+Fig 4. Bidirectional LSTMs
+
+
+线性变换-> linear transformation
+输入层到隐层-> input-to-hidden
+正向处理输出序列->process sequence in forward direction
+反向处理上一层序列-> process sequence from previous layer in backward direction
+
+Note that this bidirectional RNN differs from the one proposed by Bengio et al. for machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce that other bidirectional RNN in the later chapter on [machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md).
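+
+The claim that every time step sees both history and future from the second layer on can be checked with a toy model. The sketch below is not an LSTM: it replaces each recurrent layer with a hypothetical bookkeeping step that only records which input positions a state depends on, and the function name `visible_positions` is an assumption made for illustration.
+
+```python
+def visible_positions(n_steps, n_layers):
+    """For each time step, return the set of input positions the top layer depends on."""
+    seen = [{t} for t in range(n_steps)]  # the input at step t only contains x_t
+    for layer in range(n_layers):
+        # Even layers run left-to-right, odd layers right-to-left.
+        if layer % 2 == 0:
+            order = range(n_steps)
+        else:
+            order = range(n_steps - 1, -1, -1)
+        state = set()
+        out = [None] * n_steps
+        for t in order:
+            state = state | seen[t]  # a recurrent state accumulates everything read so far
+            out[t] = state
+        seen = out
+    return seen
+
+
+print(visible_positions(4, 1))
+print(visible_positions(4, 2))
+```
+
+With one layer, step $t$ depends only on positions $0, ..., t$; with two layers in alternating directions, every step depends on the whole sequence.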
+
+### Conditional Random Field
+
+The basic pipeline by which neural networks solve problems is: (1) the lower layers learn representations; (2) the top layer is designed for the final task. In SRL, a CRF layer is built on top of the network to predict the final tag sequence. It takes the representations provided by the last LSTM layer as input.
+
+
+A CRF is an undirected probabilistic graphical model whose nodes denote random variables and whose edges denote dependencies between them. In short, a CRF learns the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ is the input sequence and $Y = (y_1, y_2, ... , y_n)$ is the label sequence. Decoding searches for the sequence $Y$ that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$.
+
+Sequence tagging tasks treat the input and output as linear sequences, without making additional dependency assumptions in the graphical model. The graph for sequence tagging is therefore a simple chain, which gives a linear-chain conditional random field, shown in Fig. 5.
+
+
+
+Fig 5. Linear Chain Conditional Random Field used in SRL tasks
+
+
+By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
+
+$$P(Y | X) = \frac{1}{Z(X)} \exp\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
+
+
+where $Z(X)$ is the normalization constant. $t_j$ is a feature function defined on edges, called a transition feature; it depends on $y_i$ and $y_{i-1}$ and represents the transition from $y_{i-1}$ to $y_i$ given the input sequence $X$. $s_k$ is a feature function defined on nodes, called a state feature; it depends on $y_i$ and represents the probability of $y_i$ given the input sequence $X$. $\lambda_j$ and $\mu_k$ are the weights corresponding to $t_j$ and $s_k$. In fact, $t$ and $s$ can be written in the same form and summed over all positions $i$: $f_{k}(Y, X) = \sum_{i=1}^{n}f_k(y_{i - 1}, y_i, X, i)$, where $f$ is called a feature function. Thus, $P(Y|X)$ can be written as:
+
+$$P(Y|X, W) = \frac{1}{Z(X)}\exp\sum_{k}\omega_{k}f_{k}(Y, X)$$
+
+$W = (\omega_k)$ are the weights of the feature functions, which are learned by the CRF model. At the training stage, given the input sequences and label sequences $D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$, we minimize the following objective function, i.e., regularized maximum likelihood estimation:
+
+
+$$L(W, D) = - \log\left(\prod_{m=1}^{N}P(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
+
+
+This objective function can be optimized via back-propagation in an end-to-end manner. At the decoding stage, given an input sequence $X$, we search for the sequence $\bar{Y}$ that maximizes the conditional probability $P(Y|X)$ using a decoding method such as Viterbi or beam search.
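+
+As an illustration of the decoding step, the sketch below implements Viterbi search for a linear-chain model with NumPy. It is a minimal sketch and not PaddlePaddle's CRF implementation. It assumes the weighted features have already been collapsed into two score tables, `emission[i, y]` for the state features at position $i$ and `transition[p, y]` for the transition features (taken here to be independent of the position); both names are chosen for illustration.
+
+```python
+import numpy as np
+
+
+def viterbi_decode(emission, transition):
+    """Return the highest-scoring tag sequence and its score.
+
+    emission:   (n, k) array, emission[i, y] scores tag y at position i.
+    transition: (k, k) array, transition[p, y] scores moving from tag p to tag y.
+    """
+    n, k = emission.shape
+    score = emission[0].astype(float)  # best score of a path ending in each tag
+    backptr = np.zeros((n, k), dtype=int)
+    for i in range(1, n):
+        # cand[p, y]: best path ending in tag p at i - 1, extended with tag y at i
+        cand = score[:, None] + transition + emission[i][None, :]
+        backptr[i] = cand.argmax(axis=0)
+        score = cand.max(axis=0)
+    # Follow the back-pointers from the best final tag.
+    path = [int(score.argmax())]
+    for i in range(n - 1, 0, -1):
+        path.append(int(backptr[i, path[-1]]))
+    return path[::-1], float(score.max())
+
+
+emission = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
+transition = np.array([[0.0, -1.0], [-1.0, 0.0]])
+path, score = viterbi_decode(emission, transition)
+print(path, score)
+```
+
+The quantity maximized here is the sum inside the exponent of $P(Y|X)$; the normalization constant $Z(X)$ does not depend on $Y$ and can be ignored during decoding.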
+
+### DB-LSTM SRL model
+
+Given a sentence and its predicates, the SRL task is to identify the arguments of each given predicate and their semantic roles. If a sequence has $n$ predicates, we process the sequence $n$ times. A straightforward model is as follows:
+
+1. Construct inputs:
+    - input 1: predicate, input 2: sentence
+    - expand input 1 into a sequence of the same length as input 2, using one-hot representation;
+2. Convert the one-hot sequences from step 1 to real-valued vector sequences via a lookup table;
+3. Learn the representation of the input sequences, taking the real-valued vector sequences from step 2 as inputs;
+4. Take the representations from step 3 as inputs and the label sequence as the supervision signal, and perform sequence tagging.
+
+We could use the method above as is. Here, we instead modify it by introducing two simple but effective features:
+
+- predicate context (ctx-p): A single predicate word cannot describe the predicate information exactly, especially when the same word appears more than once in a sentence. Expanding the predicate with its context largely eliminates this ambiguity. We therefore extract the $n$ words before and after the predicate to form a window chunk.
+
+- region mark ($m_r$): $m_r = 1$ marks a word that lies inside the predicate context region, and $m_r = 0$ marks a word that does not.
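+
+The two features can be sketched as follows. This is an illustrative helper rather than the preprocessing code of this chapter: the function name `predicate_features` and the padding token `<pad>` used when the window runs past the sentence boundary are assumptions.
+
+```python
+def predicate_features(words, pred_idx, n=2, pad="<pad>"):
+    """Return the predicate context window and the region mark of every word."""
+    ctx = []
+    marks = [0] * len(words)
+    for i in range(pred_idx - n, pred_idx + n + 1):
+        if 0 <= i < len(words):
+            ctx.append(words[i])
+            marks[i] = 1  # the word lies inside the predicate context region
+        else:
+            ctx.append(pad)  # the window extends beyond the sentence
+    return ctx, marks
+
+
+words = "A record date has n't been set .".split()
+ctx, marks = predicate_features(words, words.index("set"))
+print(ctx)
+print(marks)
+```
+
+With $n = 2$ the window holds five words: the predicate itself and two words on each side.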
+
+After this modification, the model is as follows:
+
+1. Construct inputs
+    - input 1: sentence; input 2: predicate sequence; input 3: predicate context, i.e., the $n$ words before and after the predicate in one-hot representation; input 4: region mark, which annotates whether each word lies in the predicate context region
+    - expand inputs 2~3 into sequences of the same length as input 1
+2. Convert inputs 1~4 to real-valued vector sequences via lookup tables; inputs 1 and 3 share the same lookup table, while inputs 2 and 4 have separate lookup tables
+3. Take the four real-valued vector sequences from step 2 as inputs to the bidirectional LSTMs, and train the LSTMs to update the representations
+4. Take the representations from step 3 as input to the CRF and the label sequence as the supervision signal, and perform sequence tagging
+
+
+
+
+Fig 6. DB-LSTM for SRL tasks
+
+
+论元-> argu
+谓词-> pred
+谓词上下文-> ctx-p
+谓词上下文区域标记-> $m_r$
+输入-> input
+原句-> sentence
+反向LSTM-> LSTM Reverse
+
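+## Data Preparation
+### Data Description and Download
+
+In this tutorial, we use the dataset released for the [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task as an example. Running `sh ./get_data.sh` automatically downloads the raw data from the official website. Note that the training and development sets of the CoNLL 2005 SRL task were not made freely available after the competition; currently only the test set can be obtained, which includes section 23 of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ data from the test set as the training set to explain the model. However, since the test set contains far too few samples, please consider paying for the full dataset if you want to train a usable neural SRL system.
+
+The raw data also include several other kinds of information, such as part-of-speech tags, named entities, and syntactic parse trees. In this tutorial, we use the data in the `test.wsj` folder for both training and testing, and only the data under the `words` folder (text sequences) and the `props` folder (annotation results). The data directory used in this tutorial is as follows: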
+
+```text
+conll05st-release/
+└── test.wsj
+    ├── props  # annotation results
+    └── words  # input text sequences
+
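+The annotations come from Penn TreeBank\[[7](#Reference)\] and PropBank\[[8](#Reference)\]. The labels used in PropBank differ from those in the example at the beginning of this chapter, but the principle is the same. For the meaning of the labels, please refer to the paper\[[9](#Reference)\].
+
+Besides the data, `get_data.sh` also downloads the following resources:
+
+| File name | Description |
+|---|---|
+| word_dict | dictionary of input sentences, 44068 words in total |
+| label_dict | dictionary of labels, 106 labels in total |
+| predicate_dict | dictionary of predicates, 3162 words in total |
+| emb | a pre-trained word embedding table, 32 dimensions |
+
+We trained a language model on English Wikipedia to obtain word embeddings, which are used to initialize the SRL model. The word embeddings are not updated while the SRL model is trained. For more on language models and word embeddings, see the [word embedding](https://github.com/PaddlePaddle/book/blob/develop/word2vec/README.md) tutorial. The corpus used to train the language model contains 995,000,000 tokens, and the dictionary size is limited to 4,900,000 words. In the CoNLL 2005 training corpus, 5% of the words are not among these 4,900,000 words; we treat all of them as unknown words, represented by `<unk>`.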
+
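+### Data Preprocessing
+After downloading the data, the script calls two sub-scripts, `extract_pair.py` and `extract_dict_feature.py`, to preprocess it. The former performs step 1 below, and the latter performs steps 2~4:
+
+1. Merge the text sequence and the label sequence into one record;
+2. If a sentence contains $n$ predicates, process it $n$ times, producing $n$ separate training samples, each with a different predicate;
+3. Extract the predicate context and construct the predicate context region marks;
+4. Construct the labels in BIO representation.
+
+`data/feature` is the preprocessed model input. Each line is one training sample with 9 tab-separated ("\t") columns: the sentence sequence, the predicate, the predicate context (5 columns), the predicate context region marks, and the label sequence. The table below shows one training sample.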
+
+| Sentence sequence | Predicate | Predicate context (window = 5) | Predicate context region mark | Label sequence |
+|---|---|---|---|---|
+| A | set | n't been set . × | 0 | B-A1 |
+| record | set | n't been set . × | 0 | I-A1 |
+| date | set | n't been set . × | 0 | I-A1 |
+| has | set | n't been set . × | 0 | O |
+| n't | set | n't been set . × | 1 | B-AM-NEG |
+| been | set | n't been set . × | 1 | O |
+| set | set | n't been set . × | 1 | B-V |
+| . | set | n't been set . × | 1 | O |
+
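The sample above can be reproduced with a short sketch of step 3. The function name `predicate_context` and the padding token `PAD` are hypothetical. The actual preprocessing scripts may pad the window differently. The window covers the predicate and the two words on each side, and the region mark is 1 for every word that falls inside the window.

```python
PAD = 'PAD'  # hypothetical padding token for window positions outside the sentence


def predicate_context(words, pred_idx, half_window=2):
    """Return the context window around the predicate and the 0/1 region marks."""
    ctx = []
    marks = [0] * len(words)
    for i in range(pred_idx - half_window, pred_idx + half_window + 1):
        if 0 <= i < len(words):
            ctx.append(words[i])
            marks[i] = 1  # this word lies inside the predicate context
        else:
            ctx.append(PAD)  # the window runs past the sentence boundary
    return ctx, marks


words = "A record date has n't been set .".split()
labels = ['B-A1', 'I-A1', 'I-A1', 'O', 'B-AM-NEG', 'O', 'B-V', 'O']
ctx, marks = predicate_context(words, words.index('set'))

# Assemble the 9 tab-separated columns of one training sample.
row = '\t'.join([' '.join(words), 'set'] + ctx +
                [' '.join(str(m) for m in marks), ' '.join(labels)])
```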
+### Providing Data to PaddlePaddle
+1. Use the `hook` function to define the format of the PaddlePaddle input fields.
+
+    ```python
+    def hook(settings, word_dict, label_dict, predicate_dict, **kwargs):
+        settings.word_dict = word_dict  # dictionary of the sentence sequences
+        settings.label_dict = label_dict  # dictionary of the label sequences
+        settings.predicate_dict = predicate_dict  # dictionary of the predicates
+
+        # All input features are sequences of one-hot vectors, which correspond to
+        # the integer_value_sequence type in PaddlePaddle.
+        # input_types is a dictionary: each element corresponds to one data_layer
+        # in the configuration, and its key is the name of that data_layer.
+
+        settings.input_types = {
+            'word_data': integer_value_sequence(len(word_dict)),  # sentence sequence
+            'ctx_n2_data': integer_value_sequence(len(word_dict)),  # 1st word of the predicate context
+            'ctx_n1_data': integer_value_sequence(len(word_dict)),  # 2nd word of the predicate context
+            'ctx_0_data': integer_value_sequence(len(word_dict)),  # 3rd word of the predicate context
+            'ctx_p1_data': integer_value_sequence(len(word_dict)),  # 4th word of the predicate context
+            'ctx_p2_data': integer_value_sequence(len(word_dict)),  # 5th word of the predicate context
+            'verb_data': integer_value_sequence(len(predicate_dict)),  # predicate
+            'mark_data': integer_value_sequence(2),  # predicate context region marks
+            'target': integer_value_sequence(len(label_dict))  # label sequence
+        }
+    ```
+
+2. Use `process` to feed the data to PaddlePaddle one sample at a time. We only need to consider how to return one training sample from the raw data file.
+
+    ```python
+    def process(settings, file_name):
+        # UNK_IDX, the index of out-of-vocabulary words, is defined elsewhere in the data provider.
+        with open(file_name, 'r') as fdata:
+            for line in fdata:
+                sentence, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark, label = \
+                    line.strip().split('\t')
+
+                # the sentence text
+                words = sentence.split()
+                sen_len = len(words)
+                word_slot = [settings.word_dict.get(w, UNK_IDX) for w in words]
+
+                # the predicate, expanded into a sequence as long as the sentence
+                predicate_slot = [settings.predicate_dict.get(predicate)] * sen_len
+
+                # This tutorial uses a predicate context window of size 5:
+                # the predicate plus the two words before and after it.
+                # Each word in the window is expanded into a sequence as long as the sentence.
+                ctx_n2_slot = [settings.word_dict.get(ctx_n2, UNK_IDX)] * sen_len
+                ctx_n1_slot = [settings.word_dict.get(ctx_n1, UNK_IDX)] * sen_len
+                ctx_0_slot = [settings.word_dict.get(ctx_0, UNK_IDX)] * sen_len
+                ctx_p1_slot = [settings.word_dict.get(ctx_p1, UNK_IDX)] * sen_len
+                ctx_p2_slot = [settings.word_dict.get(ctx_p2, UNK_IDX)] * sen_len
+
+                # predicate context region marks, a binary feature
+                marks = mark.split()
+                mark_slot = [int(w) for w in marks]
+
+                label_list = label.split()
+                label_slot = [settings.label_dict.get(w) for w in label_list]
+                yield {
+                    'word_data': word_slot,
+                    'ctx_n2_data': ctx_n2_slot,
+                    'ctx_n1_data': ctx_n1_slot,
+                    'ctx_0_data': ctx_0_slot,
+                    'ctx_p1_data': ctx_p1_slot,
+                    'ctx_p2_data': ctx_p2_slot,
+                    'verb_data': predicate_slot,
+                    'mark_data': mark_slot,
+                    'target': label_slot
+                }
+    ```
+
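+## Model Configuration
+
+### Data Definition
+
+First, `define_py_data_sources2` reads the data through the data provider. The configuration file reads three dictionaries (the dictionary of the input text sequences, the dictionary of the labels, and the dictionary of the predicates) and passes them to the data provider, which uses them to convert the corresponding text inputs into one-hot sequences.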
+
+```python
+define_py_data_sources2(
+    train_list=train_list_file,
+    test_list=test_list_file,
+    module='dataprovider',
+    obj='process',
+    args={
+        'word_dict': word_dict,  # dictionary of the input text sequences
+        'label_dict': label_dict,  # dictionary of the labels
+        'predicate_dict': predicate_dict  # dictionary of the predicates
+    }
+)
+```
+### Algorithm Configuration
+
+Here we specify the training parameters of the model: $L_2$ regularization, the learning rate, and the batch size. We use stochastic gradient descent with momentum as the optimization algorithm.
+
+```python
+settings(
+    batch_size=150,
+    learning_method=MomentumOptimizer(momentum=0),
+    learning_rate=2e-2,
+    regularization=L2Regularization(8e-4),
+    model_average=ModelAverage(average_window=0.5, max_average_window=10000)
+)
+```
+
+### Model Structure
+
+1. Define the input data dimensions and the model hyperparameters.
+
+    ```python
+    mark_dict_len = 2  # dimension of the predicate context region mark, a binary (0/1) feature, hence 2
+    word_dim = 32  # word embedding dimension
+    mark_dim = 5  # the region mark is mapped to a real vector through an embedding table; this is its dimension
+    hidden_dim = 512  # dimension of the LSTM hidden vector: 512 / 4
+    depth = 8  # depth of the stacked LSTM
+
+    word = data_layer(name='word_data', size=word_dict_len)
+    predicate = data_layer(name='verb_data', size=pred_len)
+
+    ctx_n2 = data_layer(name='ctx_n2_data', size=word_dict_len)
+    ctx_n1 = data_layer(name='ctx_n1_data', size=word_dict_len)
+    ctx_0 = data_layer(name='ctx_0_data', size=word_dict_len)
+    ctx_p1 = data_layer(name='ctx_p1_data', size=word_dict_len)
+    ctx_p2 = data_layer(name='ctx_p2_data', size=word_dict_len)
+    mark = data_layer(name='mark_data', size=mark_dict_len)
+
+    if not is_predict:
+        target = data_layer(name='target', size=label_dict_len)  # the label sequence is defined only for training and testing
+    ```
+Note in particular that hidden_dim = 512 specifies an LSTM hidden vector of 128 dimensions. For details, please refer to the description of [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) in the official PaddlePaddle documentation.
+
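+2. Convert the sentence sequence, the predicate, the predicate context, and the predicate context region marks into sequences of real-valued embedding vectors through embedding tables.
+
+    ```python
+
+    # In this tutorial we load pre-trained word embeddings, so is_static=True is set here.
+    # When is_static is True, the embedding table is not updated while the SRL model is trained.
+    emb_para = ParameterAttribute(name='emb', initial_std=0., is_static=True)
+
+    word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
+    emb_layers = [
+        embedding_layer(
+            size=word_dim, input=x, param_attr=emb_para) for x in word_input
+    ]
+    # predicate_embedding and std_0 are defined elsewhere in the configuration.
+    emb_layers.append(predicate_embedding)
+    mark_embedding = embedding_layer(
+        name='word_ctx-in_embedding', size=mark_dim, input=mark, param_attr=std_0)
+    emb_layers.append(mark_embedding)
+    ```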
+
+3. Eight LSTM units learn all the input sequences in alternating forward and backward order.
+
+    ```python
+    # Parameters configured with std_0 are initialized from a zero-mean Gaussian
+    # (here with initial_std=0.); it is used to initialize the LSTM biases.
+    # std_default, hidden_para_attr and lstm_para_attr are defined elsewhere in the configuration.
+    std_0 = ParameterAttribute(initial_std=0.)
+
+    hidden_0 = mixed_layer(
+        name='hidden0',
+        size=hidden_dim,
+        bias_attr=std_default,
+        input=[
+            full_matrix_projection(
+                input=emb, param_attr=std_default) for emb in emb_layers
+        ])
+    lstm_0 = lstmemory(
+        name='lstm0',
+        input=hidden_0,
+        act=ReluActivation(),
+        gate_act=SigmoidActivation(),
+        state_act=SigmoidActivation(),
+        bias_attr=std_0,
+        param_attr=lstm_para_attr)
+    input_tmp = [hidden_0, lstm_0]
+
+    for i in range(1, depth):
+        mix_hidden = mixed_layer(
+            name='hidden' + str(i),
+            size=hidden_dim,
+            bias_attr=std_default,
+            input=[
+                full_matrix_projection(
+                    input=input_tmp[0], param_attr=hidden_para_attr),
+                full_matrix_projection(
+                    input=input_tmp[1], param_attr=lstm_para_attr)
+            ])
+        lstm = lstmemory(
+            name='lstm' + str(i),
+            input=mix_hidden,
+            act=ReluActivation(),
+            gate_act=SigmoidActivation(),
+            state_act=SigmoidActivation(),
+            reverse=((i % 2) == 1),  # odd-numbered layers read the sequence in reverse
+            bias_attr=std_0,
+            param_attr=lstm_para_attr)
+
+        input_tmp = [mix_hidden, lstm]
+    ```
+
+4. Take the output of the last stacked LSTM together with the input-to-hidden mapping of that LSTM unit, and map them through a fully connected layer to the dimension of the label dictionary to obtain the final feature vector representation.
+
+    ```python
+    feature_out = mixed_layer(
+        name='output',
+        size=label_dict_len,
+        bias_attr=std_default,
+        input=[
+            full_matrix_projection(
+                input=input_tmp[0], param_attr=hidden_para_attr),
+            full_matrix_projection(
+                input=input_tmp[1], param_attr=lstm_para_attr)
+        ], )
+    ```
+
+5. The CRF layer at the end of the network performs the sequence tagging.
+
+    ```python
+    crf_l = crf_layer(
+        name='crf',
+        size=label_dict_len,
+        input=feature_out,
+        label=target,
+        param_attr=ParameterAttribute(
+            name='crfw', initial_std=default_std, learning_rate=mix_hidden_lr))
+    ```
+
+## Training the Model
+Run `sh train.sh` to train the model. The total number of training passes is set by `--num_passes`.
+
+```bash
+paddle train \
+ --config=./db_lstm.py \
+ --save_dir=./output \
+ --trainer_count=1 \
+ --dot_period=500 \
+ --log_period=10 \
+ --num_passes=200 \
+ --use_gpu=false \
+ --show_parameter_stats_period=10 \
+ --test_all_data_in_one_period=1 \
+2>&1 | tee 'train.log'
+```
+
+A sample of the training log is shown below.
+
+```text
+I1224 18:11:53.661479 1433 TrainerInternal.cpp:165] Batch=880 samples=145305 AvgCost=2.11541 CurrentCost=1.8645 Eval: __sum_evaluator_0__=0.607942 CurrentEval: __sum_evaluator_0__=0.59322
+I1224 18:11:55.254021 1433 TrainerInternal.cpp:165] Batch=885 samples=146134 AvgCost=2.11408 CurrentCost=1.88156 Eval: __sum_evaluator_0__=0.607299 CurrentEval: __sum_evaluator_0__=0.494572
+I1224 18:11:56.867604 1433 TrainerInternal.cpp:165] Batch=890 samples=146987 AvgCost=2.11277 CurrentCost=1.88839 Eval: __sum_evaluator_0__=0.607203 CurrentEval: __sum_evaluator_0__=0.590856
+I1224 18:11:58.424069 1433 TrainerInternal.cpp:165] Batch=895 samples=147793 AvgCost=2.11129 CurrentCost=1.84247 Eval: __sum_evaluator_0__=0.607099 CurrentEval: __sum_evaluator_0__=0.588089
+I1224 18:12:00.006893 1433 TrainerInternal.cpp:165] Batch=900 samples=148611 AvgCost=2.11148 CurrentCost=2.14526 Eval: __sum_evaluator_0__=0.607882 CurrentEval: __sum_evaluator_0__=0.749389
+I1224 18:12:00.164089 1433 TrainerInternal.cpp:181] Pass=0 Batch=901 samples=148647 AvgCost=2.11195 Eval: __sum_evaluator_0__=0.60793
+```
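+After 150 passes, the average error is about 0.0516055.
+
+## Applying the Model
+
+Training for $N$ passes produces $N$ models, and we need to choose the best one for prediction. The usual practice is to tune on a development set and to choose the best model according to the performance metric we care about. The `predict.sh` script in this tutorial simply chooses the pass with the fewest labeling errors on the test set (here pass-00100) for prediction.
+
+For prediction, we need to remove `crf_layer` from the configuration and replace it with `crf_decoding_layer`, as shown below: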
+
+```python
+crf_dec_l = crf_decoding_layer(
+    name='crf_dec_l',
+    size=label_dict_len,
+    input=feature_out,
+    param_attr=ParameterAttribute(name='crfw'))
+```
+
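+Run the `python predict.py` script to make predictions with the specified model.
+
+```bash
+# -c: configuration file
+# -w: path to the model used for prediction
+# -l: dictionary of the labels
+# -p: dictionary of the predicates
+# -d: dictionary of the input text sequences
+# -i: path to the input data
+# -o: output file for the labeling results
+python predict.py \
+    -c db_lstm.py \
+    -w output/pass-00100 \
+    -l data/targetDict.txt \
+    -p data/verbDict.txt \
+    -d data/wordDict.txt \
+    -i data/feature \
+    -o predict.res
+```
+
+After prediction finishes, the file specified by the `-o` option contains output in the following format: each line is one sample with 2 tab-separated ("\t") columns, where the first column is the input text and the second column is the labeling result. The semantic role labels of the arguments can be read directly from the BIO tags.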
+
+```text
+The interest-only securities were priced at 35 1\/2 to yield 10.72 % . B-A0 I-A0 I-A0 O O O O O O B-V B-A1 I-A1 O
+```
+
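Reading the arguments off the BIO tags can be sketched as follows. The function name `bio_to_spans` is hypothetical and is not part of `predict.py`. The sketch assumes that the words and the tags have been split into two lists of equal length.

```python
def bio_to_spans(words, tags):
    """Group words into (role, text) pairs according to their BIO tags."""
    spans = []
    role, chunk = None, []
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):
            if role is not None:  # close the span that was open
                spans.append((role, ' '.join(chunk)))
            role, chunk = tag[2:], [word]
        elif tag.startswith('I-') and role == tag[2:]:
            chunk.append(word)  # continue the current span
        else:
            # an O tag (or an I- tag that does not match) ends the current span
            if role is not None:
                spans.append((role, ' '.join(chunk)))
            role, chunk = None, []
    if role is not None:
        spans.append((role, ' '.join(chunk)))
    return spans


sentence = r"The interest-only securities were priced at 35 1\/2 to yield 10.72 % ."
labels = "B-A0 I-A0 I-A0 O O O O O O B-V B-A1 I-A1 O"
result = bio_to_spans(sentence.split(), labels.split())
print(result)
```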
+## Conclusion
+
+Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we use SRL as an example to show how to perform sequence tagging with PaddlePaddle. The model presented here comes from our published paper\[[10](#Reference)\]. Because the training data of the CoNLL 2005 dataset is not completely public, we use only the test data for illustration. Our aim is an end-to-end neural network model that depends less on natural language processing tools, yet is comparable to, or even better than, traditional models. Please check out our paper for more information and discussion.
+
+## Reference
+1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
+2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
+3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
+4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014.
+5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](http://www.jmlr.org/papers/volume15/doppa14a/source/biblio.bib.old)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
+6. Li H. Statistical Learning Methods (统计学习方法)[M]. Tsinghua University Press, Beijing, 2012.
+7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330.
+8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106.
+9. Carreras X, Màrquez L. [Introduction to the CoNLL-2005 shared task: Semantic role labeling](http://www.cs.upc.edu/~srlconll/st05/papers/intro.pdf)[C]//Proceedings of the Ninth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2005: 152-164.
+10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
+
+
+ This tutorial was created by PaddlePaddle and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+# Machine Translation
+
+The source code is located at [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/machine_translation). If you are a first-time user, please refer to the PaddlePaddle [installation tutorial](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
+
+## Background
+
+Machine translation (MT) aims to translate between languages using computers. The language to be translated is referred to as the source language, and the language to be translated into is referred to as the target language. Machine translation is the process of translating from the source language to the target language, and it is one of the important research fields of natural language processing.
+
+Early machine translation systems were mainly rule-based, relying on translation rules between two languages provided by language experts. This type of approach poses a great challenge to language experts, as it is hardly possible to cover all the rules used even in one language, let alone two or more. Therefore, one major challenge that conventional machine translation faced was the difficulty of obtaining a complete rule set \[[1](#References)\].
+
+
+To address the problems mentioned above, statistical machine translation techniques were developed, in which the translation rules are learned from a large-scale corpus instead of being designed by humans. While this overcomes the bottleneck of knowledge acquisition, it still faces many other challenges: 1) human-designed features can hardly cover all linguistic variations; 2) it is difficult to use global features; 3) it relies heavily on pre-processing, such as word alignment, word segmentation and tokenization, rule extraction, and syntactic parsing, where the errors introduced in each step can accumulate and increasingly affect the translation.
+
+The recent development of deep learning provides new solutions to these challenges. There are mainly two categories of deep learning based machine translation techniques: 1) techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., the language model or the reordering model (please refer to the left part of Figure 1); 2) techniques that map from the source language to the target language directly using a neural network, i.e., end-to-end neural machine translation (NMT).
+
+
+
+Figure 1. Neural Network based Machine Translation.
+
+
+
+This tutorial will mainly introduce the NMT model and how to use PaddlePaddle to train an NMT model.
+
+## Illustrative Results
+
+Taking Chinese-to-English translation as an example, after the model has been trained, given the following segmented sentence in Chinese
+```text
+这些 是 希望 的 曙光 和 解脱 的 迹象 .
+```
+with a beam-search size of 3, the generated translations are as follows:
+```text
+0 -5.36816 these are signs of hope and relief .
+1 -6.23177 these are the light of hope and relief .
+2 -7.7914 these are the light of hope and the relief of hope .
+```
+- The first column is the id of the generated sentence; the second column is its score (sorted in descending order), where a larger value indicates better quality; the last column is the generated sentence itself.
+- There are two special tokens: `<e>` denotes the end of a sentence, while `<unk>` denotes an unknown word, i.e., a word that is not contained in the training dictionary.
+
+## Overview of the Model
+
+This section introduces the Gated Recurrent Unit (GRU), the bi-directional recurrent neural network, the Encoder-Decoder framework used in NMT, the attention mechanism, and the beam search algorithm.
+
+### GRU
+
+We have already introduced RNN and LSTM in the chapter on [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md).
+Compared to a simple RNN, LSTM adds a memory cell, an input gate, a forget gate, and an output gate. These gates, combined with the memory cell, greatly improve the ability to handle long-term dependencies.
+
+GRU\[[2](#References)\], proposed by Cho et al., is a simplified LSTM and an extension of the simple RNN, as shown in the figure below. A GRU unit has only two gates:
+- reset gate: when it is closed, the history information is discarded, i.e., irrelevant historical information has no effect on the future output.
+- update gate: it combines the input gate and the forget gate, and controls the impact of historical information on the hidden output. The historical information is passed on when the update gate is close to 1.
+
+
+
+Figure 2. A GRU Gate.
+
+
+Generally speaking, sequences with short-distance dependencies will have an active reset gate, while sequences with long-distance dependencies will have an active update gate.
+In addition, Chung et al.\[[3](#References)\] have empirically shown that although GRU has fewer parameters, it performs similarly to LSTM on several different tasks.
+
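The effect of the two gates can be seen in a minimal scalar sketch of one GRU step. The weight names in the dictionary `p` are hypothetical, and the update rule follows the convention used in this section, where an update gate close to 1 passes the previous state on. Other texts swap the roles of $z$ and $1-z$. A real GRU uses vectors and weight matrices instead of scalars.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One step of a scalar GRU; `p` holds hypothetical scalar weights."""
    z = sigmoid(p['wz'] * x + p['uz'] * h_prev + p['bz'])  # update gate
    r = sigmoid(p['wr'] * x + p['ur'] * h_prev + p['br'])  # reset gate
    # The candidate state sees the history only through the reset gate.
    h_cand = math.tanh(p['wh'] * x + p['uh'] * (r * h_prev) + p['bh'])
    # An update gate close to 1 keeps the previous state.
    return z * h_prev + (1.0 - z) * h_cand


base = dict(wz=0.0, uz=0.0, bz=0.0, wr=0.0, ur=0.0, br=0.0,
            wh=1.0, uh=1.0, bh=0.0)
keep = dict(base, bz=20.0)               # update gate ~ 1: history is passed on
forget = dict(base, bz=-20.0, br=-20.0)  # both gates ~ 0: history is discarded

h_keep = gru_step(x=0.5, h_prev=0.9, p=keep)
h_forget = gru_step(x=0.5, h_prev=0.9, p=forget)
```

With these settings `h_keep` stays very close to the previous state 0.9, while `h_forget` depends almost entirely on the current input.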
+### Bi-directional Recurrent Neural Network
+
+We have already introduced one instance of a bi-directional RNN in the chapter on [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/label_semantic_roles/README.md). Here we present another bi-directional RNN model with a different architecture, proposed by Bengio et al. in \[[2](#References),[4](#References)\]. This model takes a sequence as input and outputs a fixed-dimensional feature vector at each step, encoding the context information at the corresponding time step.
+
+Specifically, this bi-directional RNN processes the input sequence in the original order and in the reverse order, and then concatenates the two output feature vectors at each time step as the final output. Thus the output node at each time step contains information from both the past and the future as context. The figure below shows an unrolled bi-directional RNN. This network contains a forward RNN and a backward RNN with six weight matrices: the weight matrices from the input to the forward and backward hidden layers ($W_1, W_3$), the weight matrices from each hidden layer to itself ($W_2, W_5$), and the weight matrices from the forward and backward hidden layers to the output layer ($W_4, W_6$). Note that there are no connections between the forward and backward hidden layers.
+
+
+
+### Encoder-Decoder Framework
+
+The Encoder-Decoder\[[2](#References)\] framework aims to map one sequence to another, where both sequences can have arbitrary lengths. The source sequence is encoded into a vector by the encoder, and this vector is then decoded into a target sequence by the decoder by maximizing the predictive probability. Both the encoder and the decoder are typically implemented with RNNs.
+
+
+
+Figure 4. Encoder-Decoder Framework.
+
+
+#### Encoder
+
+There are three steps for encoding a sentence:
+
+1. One-hot vector representation of words. Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in R^{\left | V \right |},i=1,2,...,T$, where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has a one at the position corresponding to the word's position in the dictionary and zeros elsewhere.
+
+2. Word embedding as a representation in a low-dimensional semantic space. There are two problems with the one-hot vector representation: 1) the dimensionality of the vector is typically large, leading to the curse of dimensionality; 2) it is hard to capture the relationships between words, i.e., the semantic similarities. It is therefore useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, where $C\in R^{K\times \left | V \right |}$ is the projection matrix and $K$ is the dimensionality of the word embedding vector.
+
+3. Encoding of the source sequence via RNN. This can be described mathematically as $h_i=\varphi _\theta \left ( h_{i-1}, s_i \right )$, where $h_0$ is a zero vector, $\varphi _\theta$ is a non-linear activation function, and $\mathbf{h}=\left \{ h_1,..., h_T \right \}$ is the sequential encoding of the first $T$ words of the source sequence. The vector representation of the whole sentence can be taken as the encoding vector at the last time step $T$ of $\mathbf{h}$, or obtained by temporal pooling over $\mathbf{h}$.
+
+A bi-directional RNN can also be used in step 3 for more complicated sentence encoding. This can be implemented using a bi-directional GRU. The forward GRU encodes the source sequence in its original order, i.e., $(x_1,x_2,...,x_T)$, generating a sequence of hidden states $(\overrightarrow{h_1},\overrightarrow{h_2},...,\overrightarrow{h_T})$. Similarly, the backward GRU encodes the source sequence in the reverse order, i.e., $(x_T,x_{T-1},...,x_1)$, generating $(\overleftarrow{h_1},\overleftarrow{h_2},...,\overleftarrow{h_T})$. Then for each word $x_i$, its complete hidden state is the concatenation of the corresponding hidden states from the two GRUs, i.e., $h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T}$.
+
+
+
+Figure 5. Encoder using bi-directional GRU.
+
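The bi-directional encoding described above can be sketched with a toy recurrent cell. To keep the example short it uses a scalar `tanh` RNN as a stand-in for the GRU, and the weights `w` and `u` as well as the function names are assumptions made for illustration. The point is the alignment: the backward states are reversed again so that position $i$ pairs the forward state with the backward state of the same word.

```python
import math


def rnn_states(inputs, w=0.5, u=0.5):
    """Hidden states of a scalar tanh RNN (a stand-in for the GRU)."""
    h, states = 0.0, []  # h_0 is zero
    for s in inputs:
        h = math.tanh(w * s + u * h)
        states.append(h)
    return states


def bidirectional_states(inputs):
    """Pair the forward and backward hidden states for every position."""
    forward = rnn_states(inputs)
    # Encode the reversed sequence, then flip the result back so that
    # backward[i] belongs to the same word as forward[i].
    backward = rnn_states(inputs[::-1])[::-1]
    return list(zip(forward, backward))


embedded = [0.2, -0.4, 0.6]  # toy one-dimensional "word embeddings"
states = bidirectional_states(embedded)
```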
+
+#### Decoder
+
+The goal of the decoder is to maximize the probability of the next correct word in the target language. The main idea is as follows:
+
+1. At each time step $i$, given the encoding vector (or context vector) $c$ of the source sentence, the $i$-th word $u_i$ from the ground-truth target language sentence, and the RNN hidden state $z_i$, the next hidden state $z_{i+1}$ is computed as:
+
+ $$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$
+ where $\phi _{\theta '}$ is a non-linear activation function and $c=q\mathbf{h}$ is the context vector of the source sentence. Without [attention](#Attention Mechanism), if the output of the [encoder](#Encoder) is the encoding vector at the last time step of the source sentence, then $c$ can be defined as $c=h_T$. $u_i$ denotes the $i$-th word of the target language sentence, and $u_0$ denotes the beginning of the target language sentence (i.e., `<s>`), indicating the start of decoding. $z_i$ is the RNN hidden state at time step $i$, and $z_0$ is an all-zero vector.
+
+2. Calculate the probability $p_{i+1}$ of the $(i+1)$-th word in the target language sequence by normalizing $z_{i+1}$ using `softmax` as follows
+
+ $$p\left ( u_{i+1}|u_{<i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
+
+ where $W_sz_{i+1}+b_z$ scores each possible word and is then normalized via softmax to produce the probability $p_{i+1}$ of the $(i+1)$-th word.
+
+3. Compute the cost according to $p_{i+1}$ and $u_{i+1}$.
+4. Repeat steps 1 to 3 until all the words in the target language sentence have been processed.
+
+The generation process of machine translation translates the source sentence into a sentence in the target language using a pre-trained model. The decoding step in generation differs in some respects from the one in training. Please refer to [Beam Search Algorithm](#Beam Search Algorithm) for details.
+
+### Attention Mechanism
+
+There are a few problems with the fixed-dimensional vector representation from the encoding stage: 1) it is very challenging to encode both the semantic and the syntactic information of a sentence into a fixed-dimensional vector regardless of the length of the sentence; 2) intuitively, when translating a sentence, we typically pay more attention to the parts of the source sentence that are more relevant to the current translation, and the focus changes as the translation proceeds. With a fixed-dimensional vector, all the information from the source sentence is treated equally in terms of attention, which is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced the attention mechanism, which decodes based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. The decoder with attention is explained below.
+
+Different from the simple decoder, $z_{i+1}$ is computed as:
+
+$$z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$$
+
+It is observed that for each word $u_i$ in the target language sentence, there is a corresponding context vector $c_i$ as the encoding of the source sentence, which is computed as:
+
+$$c_i=\sum _{j=1}^{T}a_{ij}h_j, a_i=\left[ a_{i1},a_{i2},...,a_{iT}\right ]$$
+
+It is noted that the attention mechanism is achieved by weighted average over the RNN hidden states $h_j$. The weight $a_{ij}$ denotes the strength of attention of the $i$-th word in the target language sentence to the $j$-th word in the source sentence, and is calculated as
+
+\begin{align}
+a_{ij}&=\frac{exp(e_{ij})}{\sum_{k=1}^{T}exp(e_{ik})}\\\\
+e_{ij}&=align(z_i,h_j)\\\\
+\end{align}
+
+where $align$ is an alignment model that measures the fitness between the $i$-th word in the target language sentence and the $j$-th word in the source sentence. More concretely, the fitness is computed from the $i$-th hidden state $z_i$ of the decoder RNN and the $j$-th hidden state $h_j$ of the source sentence. The conventional alignment model uses hard alignment, meaning each word in the target language explicitly corresponds to one or more words in the source language sentence. The attention model uses soft alignment, where any word in the source sentence is related to any word in the target language sentence, and the strength of the relation is a real number computed by the model. It can therefore be incorporated into the NMT framework and trained via back-propagation.
+
+
+
+Figure 6. Decoder with Attention Mechanism.
+
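The computation of $a_{ij}$ and $c_i$ can be sketched in plain Python. In the attention model the alignment function is learned together with the rest of the network; the dot product used here is only a stand-in chosen to keep the example self-contained, and the function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_context(z_i, hs, align=dot):
    """Return the attention weights a_i and the context vector c_i.

    z_i: decoder hidden state; hs: encoder hidden states h_1..h_T.
    `align` is an assumed stand-in for the learned alignment model.
    """
    e = [align(z_i, h_j) for h_j in hs]  # e_ij = align(z_i, h_j)
    a = softmax(e)                       # a_ij = exp(e_ij) / sum_k exp(e_ik)
    dim = len(hs[0])
    c = [sum(a_j * h_j[k] for a_j, h_j in zip(a, hs)) for k in range(dim)]
    return a, c


hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states
z_i = [2.0, 0.0]                           # toy decoder state
a, c = attention_context(z_i, hs)
```

The weights sum to 1, and the encoder states that match the decoder state best receive the largest weights.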
+
+### Beam Search Algorithm
+
+[Beam search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is typically used when the solution space is huge (e.g., in machine translation and speech recognition) and there is not enough memory for all the possible solutions. For example, if we want to translate “`你好`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, in which the word `hello` can appear any number of times. Beam search can be used to find a good translation among them.
+
+Beam search builds a search tree using breadth-first search and sorts the nodes at each level of the tree according to a heuristic cost (in this tutorial, the sum of the log probabilities of the generated words), keeping only a fixed number of nodes according to the pre-specified beam size (or beam width). Therefore, only the higher-quality nodes are expanded at the next level, which reduces the space and time requirements significantly. However, there is no guarantee of finding the globally optimal solution.
+
+The goal of beam search in decoding is to maximize the probability of the generated sequence. The procedure is as follows:
+
+1. At each time step $i$, compute the hidden state $z_{i+1}$ of the next time step according to the context vector $c$ of the source sentence, the $i$-th word $u_i$ generated for the target language sentence, and the RNN hidden state $z_i$.
+2. Normalize $z_{i+1}$ using `softmax` to get the probability $p_{i+1}$ of the $(i+1)$-th word of the target language sentence.
+3. Sample the word $u_{i+1}$ according to $p_{i+1}$.
+4. Repeat steps 1 to 3 until the end-of-sentence token `<e>` is generated or the maximum length of the sentence is reached.
+
+Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder). As each step is greedy in generation, there is no guarantee for global optimum.
+
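The search itself can be sketched independently of the network. In the sketch below, `toy_next_log_probs` is a hypothetical stand-in for steps 1 and 2 (it returns fixed log probabilities instead of running a decoder), and the token spellings `<s>` and `<e>` are assumptions. At every level only the `beam_size` best candidates, ranked by the sum of log probabilities, are kept.

```python
import math

START, END = '<s>', '<e>'  # assumed spellings of the special tokens


def toy_next_log_probs(prefix):
    """Hypothetical stand-in for the decoder: log p(next word | prefix)."""
    if prefix[-1] == START:
        probs = {'hello': 0.6, 'hi': 0.3, END: 0.1}
    else:
        probs = {'hello': 0.1, 'hi': 0.1, END: 0.8}
    return {w: math.log(p) for w, p in probs.items()}


def beam_search(next_log_probs, beam_size=3, max_len=5):
    beams = [(0.0, [START])]  # (sum of log probabilities, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            for word, logp in next_log_probs(words).items():
                candidates.append((score + logp, words + [word]))
        # Keep only the beam_size best candidates of this level.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, words in candidates[:beam_size]:
            if words[-1] == END:
                finished.append((score, words))  # complete sentence
            else:
                beams.append((score, words))     # expand at the next level
        if not beams:
            break
    # Sentences that are still unfinished after max_len steps are dropped.
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_size]


results = beam_search(toy_next_log_probs)
for score, words in results:
    print(round(score, 4), ' '.join(words))
```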
+## Data Preparation
+
+### Download and Uncompression
+
+This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where the dataset [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as testing and generating set.
+
+Run the following command in Linux to obtain the data:
+```bash
+cd data
+./wmt14_data.sh
+```
+There are three folders in the downloaded dataset `data/wmt14`:
+
+| Folder Name | French-English Parallel Corpus | Number of Files | Size of Files |
+|---|---|---|---|
+| train | ccb2_pc30.src, ccb2_pc30.trg, etc | 12 | 3.55G |
+| test | ntst1213.src, ntst1213.trg | 2 | 1636k |
+| gen | ntst14.src, ntst14.trg | 2 | 864k |
+
+- `XXX.src` is the source file in French and `XXX.trg` is the target file in English. Each row of a file contains one sentence.
+- `XXX.src` and `XXX.trg` have the same number of rows, and there is a one-to-one correspondence between the sentences in the same row of the two files.
+
+### User Defined Dataset (Optional)
+
+To use your own dataset, just put it under the `data` folder and organize it as follows
+```text
+user_dataset
+├── train
+│ ├── train_file1.src
+│ ├── train_file1.trg
+│ └── ...
+├── test
+│ ├── test_file1.src
+│ ├── test_file1.trg
+│ └── ...
+├── gen
+│ ├── gen_file1.src
+│ ├── gen_file1.trg
+│ └── ...
+```
+
+Explanation of the directories:
+- First level: `user_dataset`: the name of the user defined dataset.
+- Second level: `train`, `test` and `gen`: these names should not be changed.
+- Third level: the parallel corpus in the source language and the target language, with the suffixes `.src` and `.trg` respectively.
+
+### Data Pre-processing
+
+There are two steps for pre-processing:
+- Merge the source and target parallel corpus files into one file
+ - Merge each `XXX.src` and `XXX.trg` file pair into `XXX`
+ - The $i$-th row in `XXX` is the concatenation of the $i$-th row from `XXX.src` and the $i$-th row from `XXX.trg`, separated by '\t'.
+
+- Create the source dictionary and the target dictionary, each containing **DICTSIZE** words: the (DICTSIZE - 3) most frequent words from the corpus and 3 special tokens, `<s>` (beginning of sequence), `<e>` (end of sequence) and `<unk>` (unknown/out-of-vocabulary words).
+
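The two steps can be sketched as follows. This is an illustration only, not the code of `preprocess.py`: the function names are hypothetical, the spellings of the special tokens are assumptions, and the sentences are held in memory as lists instead of being read from files.

```python
from collections import Counter

SPECIAL_TOKENS = ['<s>', '<e>', '<unk>']  # assumed spellings of the 3 special tokens


def merge_parallel(src_lines, trg_lines):
    """Join the i-th source sentence and the i-th target sentence with a tab."""
    if len(src_lines) != len(trg_lines):
        raise ValueError('source and target must have the same number of rows')
    return [s.strip() + '\t' + t.strip() for s, t in zip(src_lines, trg_lines)]


def build_dict(sentences, dict_size):
    """Special tokens first, then the (dict_size - 3) most frequent words."""
    counts = Counter(w for line in sentences for w in line.split())
    frequent = [w for w, _ in counts.most_common(dict_size - len(SPECIAL_TOKENS))]
    return SPECIAL_TOKENS + frequent


src = ['le chat dort', 'le chien mange']
trg = ['the cat sleeps', 'the dog eats']
merged = merge_parallel(src, trg)
src_dict = build_dict(src, dict_size=5)
```

With this layout the unknown-word token sits at index 2 of the dictionary.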
+`preprocess.py` is used for pre-processing:
+```bash
+python preprocess.py -i INPUT [-d DICTSIZE] [-m]
+```
+- `-i INPUT`: path to the original dataset.
+- `-d DICTSIZE`: number of words in the dictionary. If unspecified, the dictionary will contain all the words that appear in the input dataset.
+- `-m --mergeDict`: merge the source dictionary with target dictionary, making the two dictionaries have the same content.
+
+The specific command to run the script is as follows:
+```bash
+python preprocess.py -i data/wmt14 -d 30000
+```
+You will see the following messages after a few minutes:
+```text
+concat parallel corpora for dataset
+build source dictionary for train data
+build target dictionary for train data
+dictionary size is 30000
+```
+The pre-processed data is located at `data/pre-wmt14`:
+```text
+pre-wmt14
+├── train
+│ └── train
+├── test
+│ └── test
+├── gen
+│ └── gen
+├── train.list
+├── test.list
+├── gen.list
+├── src.dict
+└── trg.dict
+```
+- `train`, `test` and `gen`: contain the French-English parallel corpus for training, testing and generation. Each row of each file is split into two columns by a “\t”, where the first column is the sentence in French and the second one is the sentence in English.
+- `train.list`, `test.list` and `gen.list`: record respectively the path to `train`, `test` and `gen` folders.
+- `src.dict` and `trg.dict`: source (French) and target (English) dictionary. Each dictionary contains 30000 words (29997 most frequent words and 3 special tokens).
+
+### Providing Data to PaddlePaddle
+
+We use `dataprovider.py` to provide data to PaddlePaddle as follows:
+
+1. Import PyDataProvider2 package from PaddlePaddle and define three special tokens:
+
+ ```python
+ from paddle.trainer.PyDataProvider2 import *
+ UNK_IDX = 2 #out of vocabulary word
+ START = "" #begin of sequence
+ END = "" #end of sequence
+ ```
+2. Use the initialization function `hook` to define the input data types (`input_types`) for training and generation:
+ - Training: there are three input sequences. The "source language sequence" and the "target language sequence" serve as input, and the "target language next word sequence" serves as the label.
+ - Generation: there are two input sequences. The "source language sequence" serves as input, and the "source language sequence id" gives the ids of the input data (optional).
+
+ `src_dict_path` in the `hook` function is the path to the source language dictionary, and `trg_dict_path` is the path to the target language dictionary. `is_generating` is passed from the model config file. For more details on the usage of the `hook` function, please refer to [Model Config](#Model Config).
+
+ ```python
+ def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list,
+          **kwargs):
+     # job_mode = 1: training 0: generation
+     settings.job_mode = not is_generating
+
+     def fun(dict_path):  # load dictionary according to the path
+         out_dict = dict()
+         with open(dict_path, "r") as fin:
+             out_dict = {
+                 line.strip(): line_count
+                 for line_count, line in enumerate(fin)
+             }
+         return out_dict
+
+     settings.src_dict = fun(src_dict_path)
+     settings.trg_dict = fun(trg_dict_path)
+
+     if settings.job_mode:  # training
+         settings.input_types = {
+             'source_language_word':  # source language sequence
+             integer_value_sequence(len(settings.src_dict)),
+             'target_language_word':  # target language sequence
+             integer_value_sequence(len(settings.trg_dict)),
+             'target_language_next_word':  # target language next word sequence
+             integer_value_sequence(len(settings.trg_dict))
+         }
+     else:  # generation
+         settings.input_types = {
+             'source_language_word':  # source language sequence
+             integer_value_sequence(len(settings.src_dict)),
+             'sent_id':  # source language sequence id
+             integer_value_sequence(len(open(file_list[0], "r").readlines()))
+         }
+ ```
+3. Use the `process` function to open the file `file_name`, read each row of the file, convert the data to be compatible with `input_types`, and then use `yield` to return it to the PaddlePaddle process. More specifically:
+
+ - add `<s>` to the beginning of each source language sequence and `<e>` to the end, producing "source_language_word".
+ - add `<s>` to the beginning of each target language sequence, producing "target_language_word".
+ - add `<e>` to the end of each target language sequence, producing "target_language_next_word".
+
+ ```python
+ def _get_ids(s, dictionary):  # map each word of the sequence to its index in the dictionary
+     words = s.strip().split()
+     return [dictionary[START]] + \
+            [dictionary.get(w, UNK_IDX) for w in words] + \
+            [dictionary[END]]
+
+ @provider(init_hook=hook, pool_size=50000)
+ def process(settings, file_name):
+     with open(file_name, 'r') as f:
+         for line_count, line in enumerate(f):
+             line_split = line.strip().split('\t')
+             if settings.job_mode and len(line_split) != 2:
+                 continue
+             src_seq = line_split[0]
+             src_ids = _get_ids(src_seq, settings.src_dict)
+
+             if settings.job_mode:
+                 trg_seq = line_split[1]
+                 trg_words = trg_seq.split()
+                 trg_ids = [settings.trg_dict.get(w, UNK_IDX) for w in trg_words]
+
+                 # sequences longer than 80 are removed during training to avoid an overly deep RNN.
+                 if len(src_ids) > 80 or len(trg_ids) > 80:
+                     continue
+                 trg_ids_next = trg_ids + [settings.trg_dict[END]]
+                 trg_ids = [settings.trg_dict[START]] + trg_ids
+                 yield {
+                     'source_language_word': src_ids,
+                     'target_language_word': trg_ids,
+                     'target_language_next_word': trg_ids_next
+                 }
+             else:
+                 yield {'source_language_word': src_ids, 'sent_id': [line_count]}
+ ```
+Note: The training data is 3.55G in size, so on machines with limited memory it is recommended to use `pool_size` to set the number of data samples kept in memory.
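
To make the relation between the three training sequences concrete, the following toy sketch applies the same start/end bookkeeping as `process` to a single sentence pair, without PaddlePaddle. The tiny dictionaries and the helper name `make_sample` are hypothetical, and the token spellings `<s>` and `<e>` are assumptions.

```python
UNK_IDX = 2
START, END = "<s>", "<e>"  # assumed token spellings

# Hypothetical toy dictionaries; the real ones are loaded from src.dict / trg.dict.
src_dict = {"<s>": 0, "<e>": 1, "<unk>": 2, "bonjour": 3, "monde": 4}
trg_dict = {"<s>": 0, "<e>": 1, "<unk>": 2, "hello": 3, "world": 4}


def make_sample(line):
    """Turn one tab-separated sentence pair into the three id sequences."""
    src_seq, trg_seq = line.strip().split("\t")
    src_ids = ([src_dict[START]] +
               [src_dict.get(w, UNK_IDX) for w in src_seq.split()] +
               [src_dict[END]])
    trg_ids = [trg_dict.get(w, UNK_IDX) for w in trg_seq.split()]
    return {
        "source_language_word": src_ids,
        "target_language_word": [trg_dict[START]] + trg_ids,
        "target_language_next_word": trg_ids + [trg_dict[END]],
    }


sample = make_sample("bonjour le monde\thello world\n")
print(sample)
```

At every position the label is the word that follows the decoder input, which is why the two target sequences are shifted against each other by one token. The word "le" is not in the toy source dictionary and is therefore mapped to `UNK_IDX`.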
+
+## Model Config
+
+### Data Definition
+
+1. Specify the path to the data and the source/target dictionaries. `is_generating` accepts the argument passed from the command line and denotes whether the current configuration is for training (default) or generation. See [Usage and Results](#Usage and Results).
+
+ ```python
+ import os
+ from paddle.trainer_config_helpers import *
+
+ data_dir = "./data/pre-wmt14" # data path
+ src_lang_dict = os.path.join(data_dir, 'src.dict') # path to the source language dictionary
+ trg_lang_dict = os.path.join(data_dir, 'trg.dict') # path to the target language dictionary
+ is_generating = get_config_arg("is_generating", bool, False) # config mode
+ ```
+2. Use `define_py_data_sources2` to get data from `dataprovider.py`, and use the `args` variable to pass in the source/target language dictionary paths and the config mode.
+
+ ```python
+ if not is_generating:
+     train_list = os.path.join(data_dir, 'train.list')
+     test_list = os.path.join(data_dir, 'test.list')
+ else:
+     train_list = None
+     test_list = os.path.join(data_dir, 'gen.list')
+
+ define_py_data_sources2(
+     train_list,
+     test_list,
+     module="dataprovider",
+     obj="process",
+     args={
+         "src_dict_path": src_lang_dict,  # source language dictionary path
+         "trg_dict_path": trg_lang_dict,  # target language dictionary path
+         "is_generating": is_generating  # config mode
+     })
+ ```
+
+### Algorithm Configuration
+
+```python
+settings(
+ learning_method = AdamOptimizer(),
+ batch_size = 50,
+ learning_rate = 5e-4)
+```
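+This tutorial uses the default stochastic gradient descent (SGD) procedure with the Adam learning method, at a learning rate of 5e-4. Note that `batch_size = 50` means that 50 sequences are processed in each batch (during generation, 50 sequences are generated at a time).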
+
+### Model Structure
+1. Define some global variables
+
+ ```python
+ source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary
+ target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary
+ word_vector_dim = 512 # dimensionality of word vector
+ encoder_size = 512 # dimensionality of the hidden state of encoder GRU
+ decoder_size = 512 # dimensionality of the hidden state of decoder GRU
+
+ if is_generating:
+     beam_size = 3  # beam size for the beam search algorithm
+     max_length = 250  # maximum length of the generated sentence
+     gen_trans_file = get_config_arg("gen_trans_file", str, None)  # file that stores the generated translations
+ ```
+
+2. Implement Encoder as follows:
+
+ 2.1 Input one-hot vector representations $\mathbf{w}$ converted with `dataprovider.py` from the source language sentence
+
+ ```python
+ src_word_id = data_layer(name='source_language_word', size=source_dict_dim)
+ ```
+ 2.2 Map the one-hot vector into a word vector $\mathbf{s}$ in a low-dimensional semantic space
+
+ ```python
+ src_embedding = embedding_layer(
+ input=src_word_id,
+ size=word_vector_dim,
+ param_attr=ParamAttr(name='_source_language_embedding'))
+ ```
+ 2.3 Use a bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
+
+ ```python
+ src_forward = simple_gru(input=src_embedding, size=encoder_size)
+ src_backward = simple_gru(
+ input=src_embedding, size=encoder_size, reverse=True)
+ encoded_vector = concat_layer(input=[src_forward, src_backward])
+ ```
+
+3. Implement Attention-based Decoder as follows:
+
+ 3.1 Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
+
+ ```python
+ with mixed_layer(size=decoder_size) as encoded_proj:
+     encoded_proj += full_matrix_projection(input=encoded_vector)
+ ```
+ 3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
+
+ ```python
+ backward_first = first_seq(input=src_backward)
+ with mixed_layer(
+         size=decoder_size,
+         act=TanhActivation(), ) as decoder_boot:
+     decoder_boot += full_matrix_projection(input=backward_first)
+ ```
+ 3.3 Define the computation at each time step of the decoder RNN, i.e., use the current context vector $c_i$, the decoder hidden state $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ of the $(i+1)$-th word.
+
+ - decoder_mem records the hidden state $z_i$ from the previous time step, with decoder_boot as its initial state.
+ - context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`.
+ - decoder_inputs fuses $c_i$ with the representation of current_word (i.e., $u_i$).
+ - gru_step uses the `gru_step_layer` function to compute $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$.
+ - Softmax normalization is used at the end to compute the probability of words, i.e., $p\left ( u_i|u_{<i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned.
+
+ ```python
+ def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
+     decoder_mem = memory(
+         name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
+
+     context = simple_attention(
+         encoded_sequence=enc_vec,
+         encoded_proj=enc_proj,
+         decoder_state=decoder_mem, )
+
+     with mixed_layer(size=decoder_size * 3) as decoder_inputs:
+         decoder_inputs += full_matrix_projection(input=context)
+         decoder_inputs += full_matrix_projection(input=current_word)
+
+     gru_step = gru_step_layer(
+         name='gru_decoder',
+         input=decoder_inputs,
+         output_mem=decoder_mem,
+         size=decoder_size)
+
+     with mixed_layer(
+             size=target_dict_dim, bias_attr=True,
+             act=SoftmaxActivation()) as out:
+         out += full_matrix_projection(input=gru_step)
+     return out
+ ```
+4. Decoder differences between the training and generation
+
+ 4.1 Define the name of the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to the [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
+
+ ```python
+ decoder_group_name = "decoder_group"
+ group_input1 = StaticInput(input=encoded_vector, is_seq=True)
+ group_input2 = StaticInput(input=encoded_proj, is_seq=True)
+ group_inputs = [group_input1, group_input2]
+ ```
+ 4.2 In training mode:
+
+ - the word embedding of the target language, trg_embedding, is passed to `gru_decoder_with_attention` as current_word.
+ - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
+ - the sequence of next words from the target language is used as label (lbl)
+ - multi-class cross-entropy (`classification_cost`) is used to calculate the cost
+
+ ```python
+ if not is_generating:
+     trg_embedding = embedding_layer(
+         input=data_layer(
+             name='target_language_word', size=target_dict_dim),
+         size=word_vector_dim,
+         param_attr=ParamAttr(name='_target_language_embedding'))
+     group_inputs.append(trg_embedding)
+
+     decoder = recurrent_group(
+         name=decoder_group_name,
+         step=gru_decoder_with_attention,
+         input=group_inputs)
+
+     lbl = data_layer(name='target_language_next_word', size=target_dict_dim)
+     cost = classification_cost(input=decoder, label=lbl)
+     outputs(cost)
+ ```
+ 4.3 In generation mode:
+
+ - during generation, as the decoder RNN will take the word vector generated from the previous time step as input, `GeneratedInput` is used to implement this automatically. Please refer to [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
+ - `beam_search` calls `gru_decoder_with_attention` to generate word ids
+ - `seqtext_printer_evaluator` writes the generated sentences into `gen_trans_file` according to `trg_lang_dict`
+
+ ```python
+ else:
+     trg_embedding = GeneratedInput(
+         size=target_dict_dim,
+         embedding_name='_target_language_embedding',
+         embedding_size=word_vector_dim)
+     group_inputs.append(trg_embedding)
+
+     beam_gen = beam_search(
+         name=decoder_group_name,
+         step=gru_decoder_with_attention,
+         input=group_inputs,
+         bos_id=0,
+         eos_id=1,
+         beam_size=beam_size,
+         max_length=max_length)
+
+     seqtext_printer_evaluator(
+         input=beam_gen,
+         id_input=data_layer(
+             name="sent_id", size=1),
+         dict_file=trg_lang_dict,
+         result_file=gen_trans_file)
+     outputs(beam_gen)
+ ```
+Note: Our configuration is based on Bahdanau et al. \[[4](#References)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
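
The context vector used in step 3.3 can be illustrated without PaddlePaddle. The sketch below computes $c_i=\sum_j a_{ij}h_j$ for a single decoder step in plain Python. It assumes that the alignment scores are already given; in the tutorial they are computed inside `simple_attention`, whose internals are not reproduced here. The names `softmax` and `attention_context` are illustrative only.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(scores, encoder_states):
    """Weighted sum of encoder states h_j with weights a_ij = softmax(scores)."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Three encoder states of dimension 2 and equal scores, i.e. uniform attention.
h = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights, context = attention_context([0.0, 0.0, 0.0], h)
print(weights, context)
```

With equal scores the weights are uniform and the context is simply the mean of the encoder states. Unequal scores shift the context towards the source positions the decoder attends to.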
+
+
+## Model Training
+
+Training can be done with the following command:
+
+```bash
+./train.sh
+```
+where `train.sh` contains
+
+```bash
+paddle train \
+--config='seqToseq_net.py' \
+--save_dir='model' \
+--use_gpu=false \
+--num_passes=16 \
+--show_parameter_stats_period=100 \
+--trainer_count=4 \
+--log_period=10 \
+--dot_period=5 \
+2>&1 | tee 'train.log'
+```
+- config: configuration file for the network
+- save_dir: path to save the trained model
+- use_gpu: whether to use GPU for training; CPU is used here
+- num_passes: number of passes for training. In PaddlePaddle, one pass means one complete pass over all the data in the training set
+- show_parameter_stats_period: here we show the statistics of parameters every 100 batches
+- trainer_count: the number of CPU processes or GPU devices
+- log_period: here we print log every 10 batches
+- dot_period: we print one "." every 5 batches
+
+The training loss will be printed every 10 batches, and you will see messages like the following:
+```text
+I0719 19:16:45.952062 15563 TrainerInternal.cpp:160] Batch=10 samples=500 AvgCost=198.475 CurrentCost=198.475 Eval: classification_error_evaluator=0.737155 CurrentEval: classification_error_evaluator=0.737155
+I0719 19:17:56.707319 15563 TrainerInternal.cpp:160] Batch=20 samples=1000 AvgCost=157.479 CurrentCost=116.483 Eval: classification_error_evaluator=0.698392 CurrentEval: classification_error_evaluator=0.659065
+.....
+```
+- AvgCost: average cost from batch-0 to the current batch.
+- CurrentCost: the cost for the current batch.
+- classification\_error\_evaluator (Eval): average per-word error rate from batch-0 to the current batch.
+- classification\_error\_evaluator (CurrentEval): per-word error rate for the current batch.
+
+The model training is successful when the classification\_error\_evaluator is lower than 0.35.
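
The relation between `AvgCost` and `CurrentCost` can be checked against the two log lines above. Assuming that `CurrentCost` is the mean cost over the most recent `log_period` batches and `AvgCost` is the mean over all batches so far (an interpretation of the field descriptions, not a statement about PaddlePaddle internals), the second `AvgCost` should equal the average of the two `CurrentCost` values:

```python
# CurrentCost values copied from the two log lines (batches 1-10 and 11-20).
current_costs = [198.475, 116.483]

# Both periods cover the same number of batches, so a plain mean suffices.
avg_cost = sum(current_costs) / len(current_costs)
print(round(avg_cost, 3))
```

The result agrees with the `AvgCost=157.479` reported at `Batch=20`.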
+
+## Model Usage
+
+### Download Pre-trained Model
+
+As the training of an NMT model is very time-consuming, we provide a pre-trained model (pass-00012, ~205M). The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs). We trained 16 passes (taking about 5 days), with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to download the model:
+```bash
+cd pretrained
+./wmt14_model.sh
+```
+
+### Usage and Results
+
+Run the following command to perform translation from French to English:
+
+```bash
+./gen.sh
+```
+where `gen.sh` contains:
+
+```bash
+paddle train \
+--job=test \
+--config='seqToseq_net.py' \
+--save_dir='pretrained/wmt14_model' \
+--use_gpu=true \
+--num_passes=13 \
+--test_pass=12 \
+--trainer_count=1 \
+--config_args=is_generating=1,gen_trans_file="gen_result" \
+2>&1 | tee 'translation/gen.log'
+```
+The parameters that differ from training are listed below:
+- job: set the mode to testing.
+- save_dir: path to the pre-trained model.
+- num_passes and test_pass: load the model parameters from pass $i\in \left [ test\_pass,num\_passes-1 \right ]$. Here we only load `pretrained/wmt14_model/pass-00012`.
+- config_args: pass the self-defined command line parameters to the model configuration. `is_generating=1` indicates generation mode and `gen_trans_file="gen_result"` specifies the file that stores the generated results.
+
+For translation results please refer to [Illustrative Results](#Illustrative Results).
+
+### BLEU Evaluation
+
+BLEU (Bilingual Evaluation Understudy) is a metric widely used for the automatic evaluation of machine translation, proposed by the IBM Watson Research Center in 2002 \[[5](#References)\]. The basic idea is that the closer a machine translation is to a translation produced by a human expert, the better the translation system performs.
+To measure the closeness between a machine translation and a human translation, sentence precision is used, which counts the matched n-grams between the two. More matches lead to higher BLEU scores.
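
As a minimal illustration of the n-gram matching that BLEU builds on, the sketch below computes the clipped (modified) n-gram precision of one candidate against one reference. It is not the full BLEU score: the brevity penalty and the geometric mean over several n-gram orders are omitted, and the function names are illustrative rather than taken from `multi-bleu.perl`.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


p1 = modified_precision("the the the cat", "the cat sat", 1)
p2 = modified_precision("the cat sat down", "the cat sat", 2)
print(p1, p2)
```

Clipping keeps a candidate from being rewarded for repeating a reference word: in the first example only one of the three occurrences of "the" counts as a match.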
+
+[Moses](http://www.statmt.org/moses/) is an open-source machine translation system. We use its [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) script for BLEU evaluation. Run the following command to download it:
+```bash
+./moses_bleu.sh
+```
+BLEU evaluation can be performed using the `eval_bleu` script as follows, where FILE is the name of the file to be evaluated, BEAMSIZE is the beam size value, and `data/wmt14/gen/ntst14.trg` is used as the reference translation by default.
+```bash
+./eval_bleu.sh FILE BEAMSIZE
+```
+Specifically, the script is run as follows:
+```bash
+./eval_bleu.sh gen_result 3
+```
+You will see the following message as output
+```text
+BLEU = 26.92
+```
+
+## Summary
+
+End-to-end neural machine translation is a recently developed approach to machine translation. In this chapter, we introduced the typical "Encoder-Decoder" framework and the "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstract generation and single-turn dialogue can all be tackled with the model presented in this chapter.
+
+## References
+
+1. Koehn P. [Statistical machine translation](https://books.google.com.hk/books?id=4v_Cx1wIMLkC&printsec=frontcover&hl=zh-CN&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)[M]. Cambridge University Press, 2009.
+2. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://www.aclweb.org/anthology/D/D14/D14-1179.pdf)[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1724-1734.
+3. Chung J, Gulcehre C, Cho K H, et al. [Empirical evaluation of gated recurrent neural networks on sequence modeling](https://arxiv.org/abs/1412.3555)[J]. arXiv preprint arXiv:1412.3555, 2014.
+4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[C]//Proceedings of ICLR 2015, 2015.
+5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318.
+
+
+ This tutorial was created by PaddlePaddle and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+# Recognize Digits
+
+The source code for this tutorial is under [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/recognize_digits). First-time readers, please refer to PaddlePaddle [installation instructions](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
+
+## Introduction
+When we learn a new programming language, the first task is usually to write a program that prints "Hello World." In Machine Learning or Deep Learning, the equivalent task is to train a model to perform handwritten digit recognition with [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. Handwriting recognition is a typical image classification problem. The problem is relatively easy, and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1). The input image is a 28x28 matrix, and the label is one of the digits from 0 to 9. Each image is normalized in size and centered.
+
+
+
+Fig. 1. Examples of MNIST images
+
+
+The MNIST dataset is created from the [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and the Special Database 1 (SD-1). SD-3 was labeled by the staff of the U.S. Census Bureau, while SD-1 was labeled by high school students in the U.S. Therefore SD-3 is cleaner and easier to recognize than SD-1. Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set (60,000 samples) and test set (10,000 samples). The training set was labeled by 250 different annotators, and it was guaranteed that the annotators of the training set and the test set did not completely overlap.
+
+Yann LeCun, one of the founders of Deep Learning, contributed greatly to handwritten character recognition in its early days and proposed the CNN (Convolutional Neural Network), which drastically improved the recognition of handwritten characters. CNNs are now a critical concept in Deep Learning. From Yann LeCun's first proposal of LeNet to the winning models in ImageNet such as VGGNet, GoogLeNet and ResNet (please refer to the [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification) tutorial), CNNs have achieved a series of impressive results in image classification tasks.
+
+Many algorithms have been tested on MNIST. In 1998, LeCun experimented with a single-layer linear classifier, an MLP (Multilayer Perceptron) and the multilayer CNN LeNet. These algorithms successively reduced the test error from 12% to 0.7% \[[1](#References)\]. Since then, researchers have worked on many algorithms such as k-NN (K-Nearest Neighbors) \[[2](#References)\], Support Vector Machine (SVM) \[[3](#References)\], Neural Networks \[[4-7](#References)\] and Boosting \[[8](#References)\]. Various preprocessing methods like distortion removal, noise removal and blurring have also been applied to increase recognition accuracy.
+
+In this tutorial, we tackle the task of handwritten character recognition. We start with a simple softmax regression model and guide our readers step-by-step to improve this model's performance on the task of recognition.
+
+
+## Model Overview
+
+Before introducing classification algorithms and training procedure, we provide some definitions:
+- $X$ is the input: Input is a $28\times28$ MNIST image. It is flattened to a $784$ dimensional vector. $X=\left ( x_0, x_1, \dots, x_{783} \right )$.
+- $Y$ is the output: Output of the classifier is 1 of the 10 classes (digits from 0 to 9). $Y=\left ( y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents the probability that the input image belongs to class $i$.
+- $L$ is the ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$. It is also 10-dimensional, but only one dimension is 1 and all others are 0.
+
+### Softmax Regression
+
+In a simple softmax regression model, the input is fed to a fully connected layer and a softmax function is applied to get the probabilities of the output classes\[[9](#References)\].
+
+Input $X$ is multiplied with weights $W$, and bias $b$ is added to generate activations.
+
+$$ y_i = softmax(\sum_j W_{i,j}x_j + b_i) $$
+
+where $ softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
+
+For an $N$-class classification problem with $N$ output nodes, softmax normalizes the $N$-dimensional vector into $N$ real values in the range [0, 1] that sum to 1, each representing the probability that the sample belongs to the corresponding class. Here $y_i$ is the predicted probability that an image is digit $i$.
+
+In such a classification problem, we usually use the cross entropy loss function:
+
+$$ crossentropy(label, y) = -\sum_i label_i \log(y_i) $$
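
The two formulas above can be tried out on a small example. The following plain-Python sketch evaluates the softmax and the cross entropy for a hypothetical 3-class problem; the logits are made-up numbers standing in for $\sum_j W_{i,j}x_j + b_i$, and the function names are illustrative.

```python
import math


def softmax(z):
    m = max(z)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(label, y):
    """-sum_i label_i * log(y_i); terms with label_i == 0 are skipped."""
    return -sum(l * math.log(p) for l, p in zip(label, y) if l > 0)


logits = [2.0, 1.0, 0.1]  # made-up activations for a 3-class problem
y = softmax(logits)
label = [1, 0, 0]         # one-hot ground truth: class 0
loss = cross_entropy(label, y)
print(y, loss)
```

Because the label is one-hot, the loss reduces to the negative log-probability assigned to the correct class, so it approaches 0 as that probability approaches 1.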
+
+Fig. 2 shows a softmax regression network, with weights in black, and bias in red. +1 indicates bias is 1.
+
+
+
+### Multilayer Perceptron
+
+The softmax regression model described above uses the simplest two-layer neural network, i.e., it contains only an input layer and an output layer, so its fitting capacity is limited. To achieve better recognition results, we consider adding several hidden layers \[[10](#References)\] between the input layer and the output layer.
+
+1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU.
+2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
+3. Finally, after output layer, we get $Y=softmax(W_3H_2 + b_3)$, the final classification result vector.
+
+Fig. 3 shows a Multilayer Perceptron network, with weights in black and biases in red. +1 indicates bias is 1.
+
+
+
+### Convolutional Neural Network
+
+#### Convolutional Layer
+
+The convolutional layer is the core of a Convolutional Neural Network. The parameters in this layer are composed of a set of filters or kernels. In the forward step, each kernel moves horizontally and vertically across the input. At each position, we compute the dot product of the kernel and the corresponding input region, add a bias to the result and apply an activation function. The result is a two-dimensional activation map. For example, some kernels may recognize corners and some may recognize circles; each kernel responds strongly to its corresponding feature.
+
+Fig. 4 is an animation of a convolutional layer, where depths are not shown for simplicity. The input is $W_1=5, H_1=5, D_1=3$. In fact, this is a common representation for colored images. $W_1$ and $H_1$ of a colored image correspond to the width and height respectively. $D_1$ corresponds to the 3 color channels for RGB. The parameters of the convolutional layer are $K=2, F=3, S=2, P=1$. $K$ is the number of kernels. Here, $Filter W_0$ and $Filter W_1$ are two kernels. $F$ is the kernel size. $W_0$ and $W_1$ are both $3\times3$ matrices at every depth. $S$ is the stride. Kernels move rightwards or downwards by 2 units each time. $P$ is the padding, an extension of the input. The gray area in the figure shows zero padding with size 1.
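
The parameters above determine the size of the output activation maps. The text does not state the formula; the sketch below uses the standard relation $(W - F + 2P)/S + 1$ for the output width (and likewise for the height), which is an addition made here for illustration. The function name `conv_output_size` is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Output width (or height) of a convolution: (W - F + 2P) / S + 1."""
    span = input_size - kernel_size + 2 * padding
    if span % stride != 0:
        raise ValueError("kernel does not tile the padded input evenly")
    return span // stride + 1


# Setting described for Fig. 4: W_1 = H_1 = 5, F = 3, S = 2, P = 1, K = 2 kernels.
out = conv_output_size(5, 3, 2, 1)
output_shape = (out, out, 2)  # width, height, depth (= number of kernels K)
print(output_shape)
```

Under this formula the $5\times5\times3$ input is mapped to two $3\times3$ activation maps, one per kernel.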
+
+#### Pooling Layer
+
+
+
+Fig. 5 Pooling layer
+
+
+A pooling layer performs downsampling. Its main function is to reduce computation by reducing the number of network parameters. It also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layers come in various types, such as max pooling and average pooling. Max pooling uses rectangles to segment the input layer into several parts and outputs the maximum value of each part (Fig. 5).
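
Max pooling as described above can be written in a few lines of plain Python. The sketch below is only an illustration on nested lists, assuming no padding and a square window; the function name `max_pool_2d` and the sample values are made up.

```python
def max_pool_2d(image, size=2, stride=2):
    """Max pooling over a 2-D list of numbers (no padding)."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    return [
        [
            max(image[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool_2d(feature_map)
print(pooled)
```

A $2\times2$ window with stride 2 halves the width and height, so the $4\times4$ map shrinks to $2\times2$ while the strongest response in each region is kept.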
+
+#### LeNet-5 Network
+
+
+
+[LeNet-5](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6 shows its architecture: a 2-dimensional input image is fed into two sets of convolutional and pooling layers, and the output is then fed to a fully connected layer and a softmax classifier. The following three properties of convolution enable LeNet-5 to recognize images better than multilayer fully connected perceptrons:
+
+- 3D properties of neurons: a convolutional layer is organized by width, height and depth. Neurons in each layer are connected to only a small region in the previous layer. This region is called the receptive field.
+- Local connection: A CNN utilizes the local space correlation by connecting local neurons. This design guarantees that the learned filter has a strong response to local input features. Stacking many such layers generates a non-linear filter that is more global. This enables the network to first obtain good representation for small parts of input and then combine them to represent a larger region.
+- Sharing weights: In a CNN, computation is iterated on shared parameters (weights and bias) to form a feature map. This means all neurons in the same depth of the output respond to the same feature. This allows detecting a feature regardless of its position in the input and enables translation equivariance.
+
+For more details on Convolutional Neural Networks, please refer to [this Stanford open course](http://cs231n.github.io/convolutional-networks/) and [this Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) tutorial.
+
+### List of Common Activation Functions
+- Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $
+
+- Tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $
+
+ In fact, the tanh function is just a rescaled version of the sigmoid function: $ tanh(x) = 2\,sigmoid(2x) - 1 $. It is obtained by magnifying the value of the sigmoid function by a factor of 2 (with its input scaled by 2 as well) and moving it downwards by 1.
+
+- ReLU activation function: $ f(x) = max(0, x) $
+
+For more information, please refer to [Activation functions on Wikipedia](https://en.wikipedia.org/wiki/Activation_function).
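
The three activation functions are easy to evaluate directly. The short sketch below defines sigmoid and ReLU with the standard library and numerically checks the identity $tanh(x) = 2\,sigmoid(2x) - 1$ on a few sample points; the sample points are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


# tanh is a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
max_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in xs)
print(max_gap)
```

The gap stays at the level of floating-point rounding, which confirms the relation on these points.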
+
+## Data Preparation
+
+### Data Download
+
+Execute the following command to download the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset and unzip. Add paths to the training set and the test set to train.list and test.list respectively for PaddlePaddle to read.
+
+```bash
+./data/get_mnist_data.sh
+```
+
+After the downloaded data is decompressed with `gzip`, the following files can be found in `data/raw_data`:
+
+| File name | Description |
+|----------------------|-------------------------|
+|train-images-idx3-ubyte| Training images, 60,000 |
+|train-labels-idx1-ubyte| Training labels, 60,000 |
+|t10k-images-idx3-ubyte | Evaluation images, 10,000 |
+|t10k-labels-idx1-ubyte | Evaluation labels, 10,000 |
+
+Users can plot 10 randomly chosen images with the following script (refer to Fig. 1):
+
+```bash
+./load_data.py
+```
+
+### Provide Data to PaddlePaddle
+
+We use the Python interface to provide data to the system. `mnist_provider.py` shows a complete example for training on MNIST data.
+
+```python
+import struct
+
+import numpy as np
+from paddle.trainer.PyDataProvider2 import *
+
+
+# Define a py data provider
+@provider(
+    input_types={'pixel': dense_vector(28 * 28),
+                 'label': integer_value(10)})
+def process(settings, filename):  # settings is not used currently.
+    # Open image file
+    with open(filename + "-images-idx3-ubyte", "rb") as f:
+        # Read the first 4 parameters: magic is the data format, n is the number of images,
+        # rows and cols are the number of rows and columns, respectively
+        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
+        # Read the remaining bytes as raw binary pixel values
+        images = np.fromfile(
+            f, 'ubyte',
+            count=n * rows * cols).reshape(n, rows, cols).astype('float32')
+        # Normalize data from [0, 255] to [-1, 1]
+        images = images / 255.0 * 2.0 - 1.0
+
+    # Open label file
+    with open(filename + "-labels-idx1-ubyte", "rb") as l:
+        # Read the first two parameters
+        magic, n = struct.unpack(">II", l.read(8))
+        # Read the remaining bytes as raw binary labels
+        labels = np.fromfile(l, 'ubyte', count=n).astype("int")
+
+    for i in xrange(n):
+        yield {"pixel": images[i, :], 'label': labels[i]}
+```
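
The header parsing and normalization in the provider can be tried without downloading MNIST. The sketch below, written for Python 3 and using only the standard library, builds a tiny in-memory byte string in the same layout as an image file (four big-endian unsigned 32-bit integers followed by raw pixels) and decodes it. The two 2x2 "images" are made-up data.

```python
import struct

# Header: magic number (2051 for image files), image count, rows, cols.
header = struct.pack(">IIII", 2051, 2, 2, 2)
pixels = bytes([0, 255, 128, 64, 255, 0, 0, 255])  # made-up pixel values
blob = header + pixels

magic, n, rows, cols = struct.unpack(">IIII", blob[:16])
raw = blob[16:16 + n * rows * cols]

# Same normalization as in the provider: map [0, 255] to [-1, 1].
images = [
    [b / 255.0 * 2.0 - 1.0 for b in raw[k * rows * cols:(k + 1) * rows * cols]]
    for k in range(n)
]
print(magic, n, rows, cols, images[0])
```

The `>` in the format string selects big-endian byte order, which is why the provider reads the header with `">IIII"` rather than the machine's native layout.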
+
+
+## Model Configurations
+
+### Data Definition
+
+In the model configuration, use `define_py_data_sources2` to define reading of data from `dataprovider`. If this configuration is used for prediction, data definition is not necessary.
+
+```python
+if not is_predict:
+    data_dir = './data/'
+    define_py_data_sources2(
+        train_list=data_dir + 'train.list',
+        test_list=data_dir + 'test.list',
+        module='mnist_provider',
+        obj='process')
+```
+
+### Algorithm Configuration
+
+Set training related parameters.
+
+- batch_size: use 128 samples in each training step.
+- learning_rate: determines the step taken in each iteration and therefore how fast the model converges.
+- learning_method: use the `MomentumOptimizer` optimizer for training. The parameter 0.9 indicates that the momentum keeps 0.9 of the previous speed.
+- regularization: A method to prevent overfitting. Here L2 regularization is used.
+
+```python
+settings(
+ batch_size=128,
+ learning_rate=0.1 / 128.0,
+ learning_method=MomentumOptimizer(0.9),
+ regularization=L2Regularization(0.0005 * 128))
+```
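
To illustrate what "keeping 0.9 of the previous speed" means, the sketch below applies the textbook momentum update to a one-dimensional toy problem. This is the classical formulation $v \leftarrow 0.9\,v - \eta g$, $w \leftarrow w + v$; whether `MomentumOptimizer` follows exactly this form is an assumption, and the function name, learning rate and objective are chosen here only for illustration.

```python
def momentum_step(w, velocity, grad, learning_rate, momentum=0.9):
    """One update of classical momentum: v <- m*v - lr*g, w <- w + v."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity


# Minimize f(w) = w^2 (gradient 2w) starting from w = 1.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2.0 * w, learning_rate=0.1)
print(w)
```

The velocity accumulates past gradients, so the iterate overshoots and oscillates around the minimum before settling close to 0.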
+
+### Model Architecture
+
+#### Overview
+
+First get reference labels from `data_layer`, and get classification results (predictions) from classifier. Here we provide three different classifiers. In training, we compute loss function, which is usually cross entropy for classification problem. In prediction, we can directly output the results (predictions).
+
+``` python
+data_size = 1 * 28 * 28
+label_size = 10
+img = data_layer(name='pixel', size=data_size)
+
+predict = softmax_regression(img) # Softmax Regression
+#predict = multilayer_perceptron(img) # Multilayer Perceptron
+#predict = convolutional_neural_network(img) #LeNet5 Convolutional Neural Network
+
+if not is_predict:
+    lbl = data_layer(name="label", size=label_size)
+    inputs(img, lbl)
+    outputs(classification_cost(input=predict, label=lbl))
+else:
+    outputs(predict)
+```
+
+#### Softmax Regression
+
+A single fully connected layer with a softmax activation function outputs the classification result.
+
+```python
+def softmax_regression(img):
+    predict = fc_layer(input=img, size=10, act=SoftmaxActivation())
+    return predict
+```
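+
+What this single layer computes can be sketched in plain NumPy. This is an illustration, not PaddlePaddle code: the weights `W`, the biases `b`, and the random input `x` are hypothetical stand-ins for trained parameters and a real image.
+
+```python
+import numpy as np
+
+def softmax(z):
+    """Numerically stable softmax over the last axis."""
+    z = z - np.max(z, axis=-1, keepdims=True)
+    e = np.exp(z)
+    return e / np.sum(e, axis=-1, keepdims=True)
+
+rng = np.random.RandomState(0)
+x = rng.rand(1, 784)             # one flattened 28x28 image (random stand-in)
+W = rng.randn(784, 10) * 0.01    # hypothetical weights, not trained
+b = np.zeros(10)                 # hypothetical biases
+probs = softmax(x.dot(W) + b)    # shape (1, 10); the row sums to 1
+predicted_digit = int(np.argmax(probs))
+```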
+
+#### Multilayer Perceptron
+
+The following code implements a Multilayer Perceptron with two fully connected hidden layers and a ReLU activation function. The output layer has a Softmax activation function.
+
+```python
+def multilayer_perceptron(img):
+    # First fully connected layer with ReLU
+    hidden1 = fc_layer(input=img, size=128, act=ReluActivation())
+    # Second fully connected layer with ReLU
+    hidden2 = fc_layer(input=hidden1, size=64, act=ReluActivation())
+    # Output layer as fully connected layer and softmax activation. The size must be 10.
+    predict = fc_layer(input=hidden2, size=10, act=SoftmaxActivation())
+    return predict
+```
+
+#### Convolutional Neural Network LeNet-5
+
+The following is the LeNet-5 network architecture. A 2D input image is first fed into two sets of convolutional and pooling layers. The result is then fed to a fully connected layer, followed by another fully connected layer with a softmax activation.
+
+```python
+def convolutional_neural_network(img):
+    # First convolutional layer - pooling layer
+    conv_pool_1 = simple_img_conv_pool(
+        input=img,
+        filter_size=5,
+        num_filters=20,
+        num_channel=1,
+        pool_size=2,
+        pool_stride=2,
+        act=TanhActivation())
+    # Second convolutional layer - pooling layer
+    conv_pool_2 = simple_img_conv_pool(
+        input=conv_pool_1,
+        filter_size=5,
+        num_filters=50,
+        num_channel=20,
+        pool_size=2,
+        pool_stride=2,
+        act=TanhActivation())
+    # Fully connected layer
+    fc1 = fc_layer(input=conv_pool_2, size=128, act=TanhActivation())
+    # Output layer as fully connected layer and softmax activation. The size must be 10.
+    predict = fc_layer(input=fc1, size=10, act=SoftmaxActivation())
+    return predict
+```
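+
+The feature-map sizes in this network can be worked out by hand. The sketch below assumes that the convolutions use stride 1 and no padding; these are assumptions for illustration, and the defaults of `simple_img_conv_pool` may differ. The helper `conv_out` is hypothetical.
+
+```python
+def conv_out(size, filter_size, stride=1, padding=0):
+    """Output width of a convolution or pooling window along one dimension."""
+    return (size + 2 * padding - filter_size) // stride + 1
+
+size = 28                                        # MNIST images are 28x28
+size = conv_out(size, filter_size=5)             # first 5x5 convolution
+size = conv_out(size, filter_size=2, stride=2)   # first 2x2 pooling, stride 2
+size = conv_out(size, filter_size=5)             # second 5x5 convolution
+size = conv_out(size, filter_size=2, stride=2)   # second 2x2 pooling, stride 2
+fc_inputs = 50 * size * size                     # 50 feature maps feed the fully connected layer
+```
+
+Under these assumptions the width goes 28, 24, 12, 8, 4, so the first fully connected layer receives 50 x 4 x 4 = 800 values.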
+
+## Training Model
+
+### Training Commands and Logs
+
+1. Configure `train.sh` to execute training:
+
+```bash
+config=mnist_model.py          # Select network in mnist_model.py
+output=./softmax_mnist_model
+log=softmax_train.log
+
+# --config: script for network configuration
+# --dot_period: print one `.` every `dot_period` batches
+# --log_period: print a log line every `log_period` batches
+# --test_all_data_in_one_period: whether to use all data in every test
+# --use_gpu: whether to use GPU
+# --trainer_count: number of CPUs or GPUs
+# --num_passes: number of training passes (one pass uses all data)
+# --save_dir: path to the saved model
+paddle train \
+--config=$config \
+--dot_period=10 \
+--log_period=100 \
+--test_all_data_in_one_period=1 \
+--use_gpu=0 \
+--trainer_count=1 \
+--num_passes=100 \
+--save_dir=$output \
+2>&1 | tee $log
+
+python -m paddle.utils.plotcurve -i $log > plot.png
+```
+
+After configuring the parameters, execute `./train.sh`. The training log looks as follows.
+
+```
+I0117 12:52:29.628617 4538 TrainerInternal.cpp:165] Batch=100 samples=12800 AvgCost=2.63996 CurrentCost=2.63996 Eval: classification_error_evaluator=0.241172 CurrentEval: classification_error_evaluator=0.241172
+.........
+I0117 12:52:29.768741 4538 TrainerInternal.cpp:165] Batch=200 samples=25600 AvgCost=1.74027 CurrentCost=0.840582 Eval: classification_error_evaluator=0.185234 CurrentEval: classification_error_evaluator=0.129297
+.........
+I0117 12:52:29.916970 4538 TrainerInternal.cpp:165] Batch=300 samples=38400 AvgCost=1.42119 CurrentCost=0.783026 Eval: classification_error_evaluator=0.167786 CurrentEval: classification_error_evaluator=0.132891
+.........
+I0117 12:52:30.061213 4538 TrainerInternal.cpp:165] Batch=400 samples=51200 AvgCost=1.23965 CurrentCost=0.695054 Eval: classification_error_evaluator=0.160039 CurrentEval: classification_error_evaluator=0.136797
+......I0117 12:52:30.223270 4538 TrainerInternal.cpp:181] Pass=0 Batch=469 samples=60000 AvgCost=1.1628 Eval: classification_error_evaluator=0.156233
+I0117 12:52:30.366894 4538 Tester.cpp:109] Test samples=10000 cost=0.50777 Eval: classification_error_evaluator=0.0978
+```
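+
+Log lines of this shape are easy to post-process. The following sketch extracts the test cost and classification error from the last line of the log above with a regular expression. It is only an illustration of the log format and is not taken from `plot_cost.py` or `evaluate.py`.
+
+```python
+import re
+
+line = ("I0117 12:52:30.366894 4538 Tester.cpp:109] Test samples=10000 "
+        "cost=0.50777 Eval: classification_error_evaluator=0.0978")
+
+# Pull the test cost and the classification error out of one log line.
+match = re.search(r"cost=([\d.]+) Eval: classification_error_evaluator=([\d.]+)", line)
+cost = float(match.group(1))
+error = float(match.group(2))
+accuracy = 1.0 - error  # fraction of correctly classified test samples
+```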
+
+2. Use `plot_cost.py` to plot the error curve during training.
+
+```bash
+python plot_cost.py softmax_train.log
+```
+
+3. Use `evaluate.py` to select the best trained model.
+
+```bash
+python evaluate.py softmax_train.log
+```
+
+### Training Results for Softmax Regression
+
+
+
+Fig. 7 Softmax regression error curve
+
+
+Evaluation results of the models:
+
+```text
+Best pass is 00013, testing Avgcost is 0.484447
+The classification accuracy is 90.01%
+```
+
+From the evaluation results, the best pass of the softmax regression model is pass-00013, with a classification accuracy of 90.01%, while the last pass, pass-00099, has an accuracy of 89.3%. Fig. 7 also shows that the best accuracy does not necessarily appear in the last pass. A likely reason is that the model has already reached a local optimum during training and then only oscillates around it in the following passes, or settles in another local optimum.
+
+### Results of Multilayer Perceptron
+
+
+
+Fig. 8. Multilayer Perceptron error curve
+
+
+Evaluation results of the models:
+
+```text
+Best pass is 00085, testing Avgcost is 0.164746
+The classification accuracy is 94.95%
+```
+
+From the evaluation results, the best classification accuracy is 94.95%, significantly better than that of the softmax regression model. The reason is that softmax regression is a simple model that cannot fit complex data, whereas the Multilayer Perceptron with hidden layers has more capacity to fit complex data.
+
+### Training results for Convolutional Neural Network
+
+
+
+Results of model evaluation:
+
+```text
+Best pass is 00076, testing Avgcost is 0.0244684
+The classification accuracy is 99.20%
+```
+
+From the evaluation results, the best accuracy of the Convolutional Neural Network is 99.20%. So for image classification, a Convolutional Neural Network achieves better recognition results than a fully connected network, which is related to the local connections and parameter sharing of convolutional layers. In Fig. 9, the Convolutional Neural Network already achieves good results in the early steps, which indicates that it converges faster.
+
+## Application Model
+
+### Prediction Commands and Results
+The script `predict.py` makes predictions with a trained model. For example, for softmax regression:
+
+```bash
+python predict.py -c mnist_model.py -d data/raw_data/ -m softmax_mnist_model/pass-00047
+```
+
+- -c sets model architecture
+- -d sets data for prediction
+- -m sets model parameters, here the best trained model is used for prediction
+
+Follow the prompt to input an image ID for prediction. The classifier outputs the probability of each digit, the prediction with the highest probability, and the ground truth label.
+
+```
+Input image_id [0~9999]: 3
+Predicted probability of each digit:
+[[ 1.00000000e+00 1.60381094e-28 1.60381094e-28 1.60381094e-28
+ 1.60381094e-28 1.60381094e-28 1.60381094e-28 1.60381094e-28
+ 1.60381094e-28 1.60381094e-28]]
+Predict Number: 0
+Actual Number: 0
+```
+
+From the result, this classifier recognizes the digit in image 3 as 0 with a probability close to 100%. The prediction is consistent with the ground truth label.
+
+## Conclusion
+This tutorial describes a few basic deep learning models, namely softmax regression, the Multilayer Perceptron, and the Convolutional Neural Network. Subsequent tutorials will derive more sophisticated models from these, so it is crucial to understand them for future learning. As our model evolved from a simple softmax regression to a slightly more complex Convolutional Neural Network, the recognition accuracy on the MNIST dataset improved substantially, which is due to the local connections and parameter sharing of convolutional layers. When learning new models in the future, we encourage readers to understand the key ideas that allow a new model to improve on the results of an old one. Moreover, this tutorial introduced the basic flow of PaddlePaddle model design, from the dataprovider and model layer construction to the final training and prediction. Readers can reuse the flow of this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice.
+
+## References
+
+1. LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. ["Gradient-based learning applied to document recognition."](http://ieeexplore.ieee.org/abstract/document/726791/) Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
+2. Wejéus, Samuel. ["A Neural Network Approach to Arbitrary Symbol Recognition on Modern Smartphones."](http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A753279&dswid=-434) (2014).
+3. Decoste, Dennis, and Bernhard Schölkopf. ["Training invariant support vector machines."](http://link.springer.com/article/10.1023/A:1012454411458) Machine learning 46, no. 1-3 (2002): 161-190.
+4. Simard, Patrice Y., David Steinkraus, and John C. Platt. ["Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.8494&rep=rep1&type=pdf) In ICDAR, vol. 3, pp. 958-962. 2003.
+5. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. ["Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure."](http://www.jmlr.org/proceedings/papers/v2/salakhutdinov07a/salakhutdinov07a.pdf) In AISTATS, vol. 11. 2007.
+6. Cireşan, Dan Claudiu, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. ["Deep, big, simple neural nets for handwritten digit recognition."](http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00052) Neural computation 22, no. 12 (2010): 3207-3220.
+7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010.
+8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009.
+9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386.
+10. Bishop, Christopher M. ["Pattern recognition."](http://s3.amazonaws.com/academia.edu.documents/30428242/bg0137.pdf?AWSAccessKeyId=AKIAJ56TQJRTWSMTNPEA&Expires=1484816640&Signature=85Ad6%2Fca8T82pmHzxaSXermovIA%3D&response-content-disposition=inline%3B%20filename%3DPattern_recognition_and_machine_learning.pdf) Machine Learning 128 (2006): 1-58.
+
+
+This tutorial was created by PaddlePaddle and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+# Personalized Recommendation
+
+The source code of this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/recommender_system).
+
+## Background
+
+With the fast growth of e-commerce, online video, and online reading services, users rely on recommender systems to avoid manually browsing a tremendous volume of choices. Recommender systems infer users' interests by mining user behavior and other properties of users and products.
+
+Some well-known approaches include:
+
+- User behavior-based approach. A well-known method is collaborative filtering. The underlying assumption is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person.
+
+- Content-based recommendation[[1](#reference)]. This approach infers feature vectors that represent products from their descriptions. It also infers feature vectors that represent users' interests. Then it measures the relevance of users and products by some distances between these feature vectors.
+
+- Hybrid approach[[2](#reference)]: This approach uses the content-based information to help address the cold start problem[[6](#reference)] in behavior-based approach.
+
+Among these options, collaborative filtering might be the most studied one. Some of its variants include user-based[[3](#reference)], item-based [[4](#reference)], social network based[[5](#reference)], and model-based.
+
+This tutorial explains a deep learning based approach and how to implement it using PaddlePaddle. We will train a model using a dataset that includes user information, movie information, and ratings. Once we train the model, we will be able to get a predicted rating given a pair of user and movie IDs.
+
+
+## Model Overview
+
+To learn more about deep learning based recommendation, let us first go over the YouTube recommender system[[7](#reference)] before introducing our hybrid model.
+
+
+### YouTube's Deep Learning Recommendation Model
+
+YouTube is a video-sharing website with one of the largest user bases in the world. Its recommender system serves more than a billion users. The system is composed of two major parts: candidate generation and ranking. The former selects a few hundred candidates from millions of videos, and the latter ranks them and outputs the top 10.
+
+
+
+Figure 1. YouTube recommender system overview.
+
+
+#### Candidate Generation Network
+
+YouTube models candidate generation as a multiclass classification problem whose number of classes equals the number of videos. The architecture of the model is as follows:
+
+
+
+Figure 2. Deep candidate generation model.
+
+
+The first stage of this model maps the watching history and search queries into fixed-length representative features. Then, an MLP (multi-layer perceptron, as described in the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/recognize_digits/README.md) tutorial) takes the concatenation of all representative vectors. The output of the MLP represents the user's *intrinsic interests*. At training time, it is used together with a softmax output layer to minimize the classification error. At serving time, it is used to compute the relevance of the user to all videos.
+
+For a user $U$, the predicted watching probability of video $i$ is
+
+$$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
+
+where $u$ is the representative vector of user $U$, $V$ is the corpus of all videos, $v_i$ is the representative vector of the $i$-th video. $u$ and $v_i$ are vectors of the same length, so we can compute their dot product using a fully connected layer.
+
+This model could have a performance issue as the softmax output covers millions of classification labels. To optimize performance, at the training time, the authors down-sample negative samples, so the actual number of classes is reduced to thousands. At serving time, the authors ignore the normalization of the softmax outputs, because the results are just for ranking.
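+
+The reason normalization can be skipped at serving time is that softmax preserves the order of the scores. The sketch below illustrates this with NumPy; the user vector `u` and the video matrix `V` are random, hypothetical stand-ins for learned representations.
+
+```python
+import numpy as np
+
+rng = np.random.RandomState(0)
+u = rng.randn(8)          # hypothetical user vector
+V = rng.randn(5, 8)       # hypothetical vectors of 5 videos, one per row
+
+scores = V.dot(u)                         # v_i . u for every video i
+probs = np.exp(scores - scores.max())     # subtract the max for numerical stability
+probs /= probs.sum()                      # softmax: P(w = i | u)
+
+# Softmax is monotonic in the scores, so ranking by raw scores gives the same order.
+rank_by_prob = np.argsort(-probs)
+rank_by_score = np.argsort(-scores)
+```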
+
+
+#### Ranking Network
+
+The architecture of the ranking network is similar to that of the candidate generation network. Similar to ranking models widely used in online advertising, it uses rich features like video ID, last watching time, etc. The output layer of the ranking network is a weighted logistic regression, which rates all candidate videos.
+
+
+### Hybrid Model
+
+In this section, we introduce our movie recommendation system.
+
+In our network, the input includes features of users and movies. The user feature includes four properties: user ID, gender, occupation, and age. Movie features include their IDs, genres, and titles.
+
+We use fully connected layers to map the user features into representative feature vectors and concatenate them. Movie features are processed in a similar way, except for movie titles: we feed each title into a text convolution network, as described in the [sentiment analysis tutorial](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md), to get a fixed-length representative feature vector.
+
+Given the feature vectors of users and movies, we compute the relevance using cosine similarity. We minimize the squared error at training time.
+
+
+
+
+Figure 3. A hybrid recommendation model.
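+
+The relevance and loss computation described in this section can be sketched as follows. All vectors and the target value are hypothetical, and how the ratings are mapped onto the scale of the cosine similarity is not specified here, so the target is simply assumed to be on the same scale.
+
+```python
+import numpy as np
+
+def cosine_similarity(a, b):
+    """Cosine of the angle between two vectors."""
+    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
+
+user_vec = np.array([1.0, 2.0, 0.0])     # hypothetical user feature vector
+movie_vec = np.array([2.0, 4.0, 0.0])    # hypothetical movie feature vector
+target = 1.0                             # hypothetical target on the similarity scale
+
+relevance = cosine_similarity(user_vec, movie_vec)   # parallel vectors give about 1.0
+loss = (relevance - target) ** 2                     # squared error for this pair
+```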
+
+
+## Dataset
+
+We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) dataset to train our model. This dataset includes about 1,000,000 ratings of about 4,000 movies from about 6,000 users. Each rating is in the range of 1 to 5. Thanks to GroupLens Research for collecting, processing and publishing the dataset.
+
+We don't have to download and preprocess the data. Instead, we can use PaddlePaddle's dataset module `paddle.v2.dataset.movielens`.
+
+
+## Model Specification
+
+
+
+## Training
+
+
+
+## Inference
+
+
+
+## Conclusion
+
+This tutorial goes over traditional approaches to recommender systems and a deep learning based approach. We also show how to train and use the model with PaddlePaddle. Deep learning has been widely used in computer vision and NLP, and we look forward to its new successes in recommender systems.
+
+## Reference
+
+1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
+2. Robin Burke, [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Vol. 4321, Springer-Verlag, Berlin, Germany, May 2007, 978-3-540-72078-2.
+3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.
+4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001.
+5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65.
+6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
+7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.
+
+
+This tutorial was created by the PaddlePaddle community and published under the Creative Commons 4.0 License.
+
+# Sentiment Analysis
+
+The source code of this section is at [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment). First-time users may refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
+
+## Background Introduction
+In natural language processing, sentiment analysis refers to identifying the emotional state expressed in a text. The text may be a sentence, a paragraph, or a document. The emotional state can be treated as a binary classification problem (positive/negative or happy/sad) or a three-class problem (positive/neutral/negative). Sentiment analysis is widely applied, for example in online shopping (Amazon, Taobao), travel, and movie websites, where it can be used to learn from reviews how customers feel about a product. Table 1 shows examples of sentiment analysis on movie reviews:
+
+| Movie Review | Category |
+| -------- | ----- |
+| Best movie of Xiaogang Feng in recent years!| Positive |
+| Pretty bad. Feels like a tv-series from a local TV-channel | Negative |
+| Politically correct version of Taken ... and boring as Heck| Negative|
+|delightful, mesmerizing, and completely unexpected. The plot is nicely designed.|Positive|
+
+
+Table 1. Sentiment analysis in movie reviews
+
+In natural language processing, sentiment analysis can be categorized as a **text classification problem**, i.e., assigning a piece of text to a specific class. It involves two related tasks: text representation and classification. Before deep learning became popular, the mainstream methods for the former included BOW (bag of words) and topic modeling, while those for the latter included SVM (support vector machine) and LR (logistic regression).
+
+For a piece of text, the BOW model ignores word order, grammar, and syntax and regards the text as a set of words, so BOW does not capture all the information in the text. For example, “this movie is extremely bad” and “boring, dull and empty work” express very similar meanings but have low similarity under BOW. Conversely, “the movie is bad” and “the movie is not bad” have high similarity under BOW features but express opposite meanings.
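+
+The two BOW failure cases above can be reproduced with a few lines of Python. This is a minimal sketch that uses raw word counts and cosine similarity as the BOW similarity; the helper `bow_cosine` and the whitespace tokenization are assumptions made for illustration.
+
+```python
+from collections import Counter
+import math
+
+def bow_cosine(text_a, text_b):
+    """Cosine similarity between the word-count vectors of two texts."""
+    a, b = Counter(text_a.split()), Counter(text_b.split())
+    dot = sum(a[w] * b[w] for w in a)
+    norm_a = math.sqrt(sum(c * c for c in a.values()))
+    norm_b = math.sqrt(sum(c * c for c in b.values()))
+    return dot / (norm_a * norm_b)
+
+# Opposite meanings, yet four of five words are shared.
+opposite = bow_cosine("the movie is bad", "the movie is not bad")
+# Similar meanings, yet no word is shared.
+similar = bow_cosine("this movie is extremely bad", "boring, dull and empty work")
+```
+
+Here `opposite` is about 0.89 while `similar` is 0, which is the reverse of what the meanings suggest.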
+
+
+In this chapter, we introduce a deep learning model that handles these issues of BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and achieves a large performance improvement over traditional methods \[[1](#Reference)\].
+
+## Model Overview
+The models we use in this chapter are the CNN (Convolutional Neural Network) and the RNN (Recurrent Neural Network), with some specific extensions.
+
+
+### Convolutional Neural Networks for Texts (CNN)
+Convolutional Neural Networks are typically applied to data with a grid-like topology, such as 2-D images and 1-D texts. A CNN can combine multiple extracted local features to produce higher-level abstract semantics. Experimentally, CNNs are very effective for image and text modeling.
+
+A CNN mainly consists of convolution and pooling operations, with various extensions. We briefly describe a CNN here with an example \[[1](#Reference)\], as shown in Figure 1:
+
+
+
+
+Figure 1. CNN for text modeling.
+
+
+Assume the length of the sentence is $n$ and the $i$-th word has the embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality.
+
+First, we concatenate the words: every $h$ consecutive words form a window of length $h$, denoted $x_{i:i+h-1}$. It refers to $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $i$ is the index of the first word in the window, ranging from $1$ to $n-h+1$, and $x_{i:i+h-1}\in\mathbb{R}^{hk}$.
+
+Next, we apply the convolution operation: we apply the kernel $w\in\mathbb{R}^{hk}$ to each window, extracting the feature $c_i=f(w\cdot x_{i:i+h-1}+b)$,
+where $b\in\mathbb{R}$ is the bias and $f$ is a non-linear activation function such as $sigmoid$. Applying the kernel to every window $\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$ produces a feature map:
+
+$$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$
+
+Next, we apply max pooling over time to represent the whole sentence $\hat c$, which is the maximum element across the feature map:
+
+$$\hat c=max(c)$$
+
+In real applications, we apply multiple CNN kernels to each sentence. This can be implemented efficiently by concatenating the kernels into a matrix. We can also use CNN kernels of different sizes (shown in different colors in Figure 1).
+
+Finally, the CNN features are concatenated to produce a fixed-length representation, which can be combined with a softmax for sentiment analysis.
+
+For short texts, the above CNN model can achieve high accuracy \[[1](#Reference)\]. To extract more abstract representations, we may apply a deeper CNN model \[[2](#Reference),[3](#Reference)\].
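+
+The window, convolution, and max-pooling steps of this section can be sketched in NumPy for a single kernel. The sizes and the random embeddings below are hypothetical; this only illustrates the formulas and is not the PaddlePaddle implementation.
+
+```python
+import numpy as np
+
+rng = np.random.RandomState(0)
+n, k, h = 7, 4, 3                 # sentence length, embedding size, window size
+x = rng.randn(n, k)               # hypothetical word embeddings x_1 .. x_n
+w = rng.randn(h * k)              # one convolution kernel
+b = 0.1                           # bias
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+# c_i = f(w . x_{i:i+h-1} + b) for every window of h consecutive words
+c = np.array([sigmoid(w.dot(x[i:i + h].reshape(-1)) + b)
+              for i in range(n - h + 1)])
+c_hat = c.max()                   # max pooling over time
+```
+
+With several kernels, each one contributes one such `c_hat`, and the values are concatenated into the fixed-length sentence representation.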
+
+### Recurrent Neural Network (RNN)
+The RNN is an effective model for sequential data. Theoretically, the computational ability of an RNN is Turing-complete \[[4](#Reference)\]. Natural language is a classical kind of sequential data, and the RNN (especially its variant LSTM\[[5](#Reference)\]) achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS tagging, image captioning, dialog, machine translation, and so forth.
+
+
+
+Figure 2. An illustration of an unrolled RNN across “time”.
+
+As shown in Figure 2, we unroll an RNN: at the $t$-th time step, the network takes the $t$-th input vector and the latent state of the last time step $h_{t-1}$ as inputs and computes the latent state of the current step. The whole process is repeated until all inputs are consumed. If we regard the RNN as a function $f$, it can be formulated as:
+
+$$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$
+
+where $W_{xh}$ is the weight matrix from input to latent; $W_{hh}$ is the latent-to-latent matrix; $b_h$ is the latent bias and $\sigma$ refers to the $sigmoid$ function.
+
+In NLP, words are first represented as one-hot vectors and then mapped to embeddings. The embedded feature goes through an RNN as input $x_t$ at every time step. Moreover, we can add other layers on top of the RNN, e.g., to build a deep or stacked RNN. Also, the last latent state can be used as a feature for sentence classification.
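+
+The recurrence can be sketched in NumPy as follows. The weights and inputs are random, hypothetical values, the initial state is assumed to be all zeros, and the previous latent state is read as $h_{t-1}$.
+
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+def rnn_forward(xs, W_xh, W_hh, b_h):
+    """Apply h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h) over a sequence."""
+    h = np.zeros(W_hh.shape[0])      # h_0 is assumed to be all zeros
+    for x_t in xs:
+        h = sigmoid(W_xh.dot(x_t) + W_hh.dot(h) + b_h)
+    return h                         # the last latent state
+
+rng = np.random.RandomState(0)
+xs = rng.randn(5, 3)                 # 5 time steps of 3-dimensional inputs (made up)
+h_last = rnn_forward(xs, rng.randn(4, 3), rng.randn(4, 4), np.zeros(4))
+```
+
+The returned `h_last` is the kind of fixed-length vector that can serve as a feature for sentence classification.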
+
+### Long-Short Term Memory
+When training an RNN on long sequences, gradients may vanish or explode \[[6](#Reference)\]. To solve this problem, Hochreiter S, Schmidhuber J. (1997) proposed the LSTM (long short-term memory\[[5](#Reference)\]).
+
+Compared with a simple RNN, the structure of the LSTM additionally includes a memory cell $c$, an input gate $i$, a forget gate $f$, and an output gate $o$. These gates and the memory cell largely improve the ability to handle long sequences. We can formulate the LSTM-RNN as a function $F$:
+
+$$ h_t=F(x_t,h_{t-1})$$
+
+$F$ contains following formulations\[[7](#Reference)\]:
+\begin{align}
+i_t & = \sigma(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)\\\\
+f_t & = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f)\\\\
+c_t & = f_t\odot c_{t-1}+i_t\odot tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c)\\\\
+o_t & = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o)\\\\
+h_t & = o_t\odot tanh(c_t)\\\\
+\end{align}
+
+In the equations, $i_t, f_t, c_t, o_t$ stand for the input gate, forget gate, memory cell, and output gate, respectively; $W$ and $b$ are model parameters. $tanh$ is the hyperbolic tangent, and $\odot$ denotes the element-wise product. The input gate controls the magnitude of new input into the memory cell $c$; the forget gate controls how much memory is propagated from the last time step; the output gate controls the output magnitude. The three gates are computed similarly with different parameters, and each influences the memory cell $c$ in its own way, as shown in Figure 3:
+
+
+
+Figure 3. LSTM at time step $t$ [7].
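+
+One LSTM step following the five equations of this section can be sketched in NumPy. The parameters are random and hypothetical, the previous latent state is read as $h_{t-1}$, and the weights applied to the memory cell ($W_{ci}$, $W_{cf}$, $W_{co}$) are assumed to be diagonal, so they are stored as vectors and applied element-wise.
+
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+def lstm_step(x, h_prev, c_prev, p):
+    """One LSTM step; p is a dict of parameters named after the equations."""
+    i = sigmoid(p["W_xi"].dot(x) + p["W_hi"].dot(h_prev) + p["w_ci"] * c_prev + p["b_i"])
+    f = sigmoid(p["W_xf"].dot(x) + p["W_hf"].dot(h_prev) + p["w_cf"] * c_prev + p["b_f"])
+    c = f * c_prev + i * np.tanh(p["W_xc"].dot(x) + p["W_hc"].dot(h_prev) + p["b_c"])
+    o = sigmoid(p["W_xo"].dot(x) + p["W_ho"].dot(h_prev) + p["w_co"] * c + p["b_o"])
+    h = o * np.tanh(c)
+    return h, c
+
+rng = np.random.RandomState(0)
+d_in, d_hid = 3, 4
+p = {}
+for gate in "ifco":
+    p["W_x" + gate] = rng.randn(d_hid, d_in)
+    p["W_h" + gate] = rng.randn(d_hid, d_hid)
+    p["b_" + gate] = np.zeros(d_hid)
+for gate in "ifo":
+    p["w_c" + gate] = rng.randn(d_hid)   # cell-to-gate weights, assumed diagonal
+h, c = lstm_step(rng.randn(d_in), np.zeros(d_hid), np.zeros(d_hid), p)
+```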
+
+
+The LSTM enhances the ability to model long-term dependencies with the help of the memory cell and the gates. Similar structures with a simpler design are also proposed in the Gated Recurrent Unit (GRU)\[[8](#Reference)\]. **These structures are still similar to the RNN, though with some modifications (as shown in Figure 2), i.e., the latent state depends on the input as well as the latent state of the last time step, and the process goes on recurrently until all inputs are consumed:**
+
+$$ h_t=Recurrent(x_t,h_{t-1})$$
+where $Recurrent$ is a simple RNN, GRU or LSTM.
+
+### Stacked Bidirectional LSTM
+For a vanilla LSTM, $h_t$ contains input information from the preceding context, i.e., time steps $1..t-1$. We can also apply an RNN in the reverse direction to take the succeeding context $t+1…n$ into consideration. Combining this with deep RNNs (a deeper RNN can capture more abstract and higher-level semantics), we can design deep stacked bidirectional LSTMs to model sequential data\[[9](#Reference)\].
+
+As shown in Figure 4 (a 3-layer RNN), odd and even layers are forward and reverse LSTMs, respectively. Higher LSTM layers take the lower LSTM layers as input, and the top LSTM layer produces a fixed-length vector by max pooling (this representation considers context from both preceding and succeeding words for higher-level abstraction). Finally, we feed the output to a softmax layer for classification.
+
+
+
+Figure 4. Stacked Bidirectional LSTM for NLP modeling.
+
+
+## Data Preparation
+### Data Introduction and Download
+We take the [IMDB sentiment analysis dataset](http://ai.stanford.edu/%7Eamaas/data/sentiment/) as an example. The IMDB dataset contains a training set and a testing set, each with 25,000 labeled movie reviews. Reviews are scored from 1 to 10; negative reviews are those with a score<=4, while positive reviews are those with a score>=7. You may use the following script to download the IMDB dataset and the [Moses](http://www.statmt.org/moses/) toolbox:
+
+
+```bash
+./data/get_imdb.sh
+```
+If successful, you should see the directory `data` with the following files:
+
+```
+aclImdb get_imdb.sh imdb mosesdecoder-master
+```
+
+* aclImdb: original data downloaded from the website;
+* imdb: containing only training and testing data
+* mosesdecoder-master: Moses tool
+
+### Data Preprocessing
+We use the script `preprocess.py` to preprocess the data. It calls `tokenizer.perl` in the Moses toolbox to split words and punctuation, randomly shuffles the training set, and constructs the dictionary. Notice: we only use the labeled training and testing sets. Executing the following commands will preprocess the data:
+
+```
+data_dir="./data/imdb"
+python preprocess.py -i $data_dir
+```
+
+If it runs successfully, `./data/pre-imdb` will contain:
+
+```
+dict.txt labels.list test.list test_part_000 train.list train_part_000
+```
+
+* test\_part\_000 and train\_part\_000: all labeled testing and training data; the training set is shuffled.
+* train.list and test.list: training and testing file-list (containing list of file names).
+* dict.txt: dictionary generated from training set.
+* labels.list: class label, 0 stands for negative while 1 for positive.
+
+### Data Provider for PaddlePaddle
+PaddlePaddle can read Python-style scripts for configuration. The following `dataprovider.py` provides a detailed example, consisting of two parts:
+
+* hook: defines the text information and the class ID. Texts are defined as `integer_value_sequence`, while class IDs are defined as `integer_value`.
+* process: reads the file line by line, splits the class ID and the text by `'\t\t'`, and yields the data as a generator.
+
+```python
+from paddle.trainer.PyDataProvider2 import *
+
+def hook(settings, dictionary, **kwargs):
+    settings.word_dict = dictionary
+    settings.input_types = {
+        'word': integer_value_sequence(len(settings.word_dict)),
+        'label': integer_value(2)
+    }
+    settings.logger.info('dict len : %d' % (len(settings.word_dict)))
+
+@provider(init_hook=hook)
+def process(settings, file_name):
+    with open(file_name, 'r') as fdata:
+        for line_count, line in enumerate(fdata):
+            label, comment = line.strip().split('\t\t')
+            label = int(label)
+            words = comment.split()
+            word_slot = [
+                settings.word_dict[w] for w in words if w in settings.word_dict
+            ]
+            yield {
+                'word': word_slot,
+                'label': label
+            }
+```
+
+## Model Setup
+`trainer_config.py` is an example of a setup file.
+### Data Definition
+```python
+from os.path import join as join_path
+from paddle.trainer_config_helpers import *
+# whether the configuration is used in test mode
+is_test = get_config_arg('is_test', bool, False)
+# whether the configuration is used in predict mode
+is_predict = get_config_arg('is_predict', bool, False)
+
+# Data path
+data_dir = "./data/pre-imdb"
+# File names
+train_list = "train.list"
+test_list = "test.list"
+dict_file = "dict.txt"
+
+# Dictionary size
+dict_dim = len(open(join_path(data_dir, "dict.txt")).readlines())
+# class number
+class_dim = len(open(join_path(data_dir, 'labels.list')).readlines())
+
+if not is_predict:
+    train_list = join_path(data_dir, train_list)
+    test_list = join_path(data_dir, test_list)
+    dict_file = join_path(data_dir, dict_file)
+    train_list = train_list if not is_test else None
+    # construct the dictionary: map each word to its line index
+    word_dict = dict()
+    with open(dict_file, 'r') as f:
+        for i, line in enumerate(f):
+            word_dict[line.split('\t')[0]] = i
+    # Use define_py_data_sources2 to read data through `process` in dataprovider.py
+    define_py_data_sources2(
+        train_list,
+        test_list,
+        module="dataprovider",
+        obj="process",  # function to generate data
+        args={'dictionary': word_dict})  # extra parameters, here the dictionary
+```
+
+### Algorithm Setup
+
+```python
+settings(
+    batch_size=128,
+    learning_rate=2e-3,
+    learning_method=AdamOptimizer(),
+    regularization=L2Regularization(8e-4),
+    gradient_clipping_threshold=25)
+```
+
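+* `batch_size=128`: each batch contains 128 samples;
+* `learning_rate=2e-3`: the global learning rate;
+* `learning_method=AdamOptimizer()`: use the Adam algorithm for optimization;
+* `regularization=L2Regularization(8e-4)`: apply L2 regularization;
+* `gradient_clipping_threshold=25`: clip gradients at a threshold of 25.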
+
+### Model Structure
+We use PaddlePaddle to implement two classification algorithms based on the models described above: [Text-CNN](#Text-CNN(CNN)) and [Stacked-bidirectional LSTM](#Stacked-bidirectional LSTM(Stacked Bidirectional LSTM)).
+#### Implementation of Text CNN
+```python
+def convolution_net(input_dim,
+                    class_dim=2,
+                    emb_dim=128,
+                    hid_dim=128,
+                    is_predict=False):
+    # Network input: a sequence of word ids; input_dim is the dictionary size
+    data = data_layer("word", input_dim)
+    # Map each one-hot word id to its embedding vector
+    emb = embedding_layer(input=data, size=emb_dim)
+    # Convolution and max-pooling with a context window (kernel size) of 3
+    conv_3 = sequence_conv_pool(input=emb, context_len=3, hidden_size=hid_dim)
+    # Convolution and max-pooling with a context window (kernel size) of 4
+    conv_4 = sequence_conv_pool(input=emb, context_len=4, hidden_size=hid_dim)
+    # Feed conv_3 and conv_4 together into a softmax classifier with class_dim classes
+    output = fc_layer(
+        input=[conv_3, conv_4], size=class_dim, act=SoftmaxActivation())
+
+    if not is_predict:
+        lbl = data_layer("label", 1)  # network input: class label
+        outputs(classification_cost(input=output, label=lbl))
+    else:
+        outputs(output)
+```
+
+In our implementation, a single call to [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) performs both the convolution and the pooling operation. The convolution kernel size is set by the `context_len` parameter, and the size of the convolution output by the `hidden_size` parameter.
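+
+The idea of convolving over a window of words and then max-pooling over time can be sketched in plain Python. This is a deliberately simplified illustration and not the PaddlePaddle implementation: it has no padding, bias, or activation, and the name `conv_max_pool` and the toy numbers are assumptions made for this example.
+
```python
def conv_max_pool(embeddings, filters, context_len):
    """Slide a window of context_len word vectors, apply each filter, max-pool over time.

    embeddings: list of word vectors (each a list of floats).
    filters: list of weight vectors, each of length context_len * emb_dim.
    """
    windows = [
        sum(embeddings[t:t + context_len], [])  # concatenate the vectors in the window
        for t in range(len(embeddings) - context_len + 1)
    ]
    pooled = []
    for w in filters:
        responses = [sum(wi * xi for wi, xi in zip(w, win)) for win in windows]
        pooled.append(max(responses))  # max over time: one value per filter
    return pooled


# Four words with 2-dimensional embeddings, two filters, context_len=3.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [[1.0] * 6, [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]
print(conv_max_pool(emb, filters, context_len=3))
```
+
+The output has one entry per filter regardless of the sentence length, which is why the result can be fed to a fully connected layer.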
+
+#### Implementation of Stacked bidirectional LSTM
+
+```python
+def stacked_lstm_net(input_dim,
+                     class_dim=2,
+                     emb_dim=128,
+                     hid_dim=512,
+                     stacked_num=3,
+                     is_predict=False):
+
+    # stacked_num must be odd so that the top-layer LSTM is a forward LSTM
+    assert stacked_num % 2 == 1
+    # layer attributes setup
+    layer_attr = ExtraLayerAttribute(drop_rate=0.5)
+    # parameter attributes setup
+    fc_para_attr = ParameterAttribute(learning_rate=1e-3)
+    lstm_para_attr = ParameterAttribute(initial_std=0., learning_rate=1.)
+    para_attr = [fc_para_attr, lstm_para_attr]
+    bias_attr = ParameterAttribute(initial_std=0., l2_rate=0.)
+    # Activation functions
+    relu = ReluActivation()
+    linear = LinearActivation()
+
+
+    # Network input: a sequence of word ids; input_dim is the dictionary size
+    data = data_layer("word", input_dim)
+    # Map each word id to its embedding vector
+    emb = embedding_layer(input=data, size=emb_dim)
+
+    fc1 = fc_layer(input=emb, size=hid_dim, act=linear, bias_attr=bias_attr)
+    # LSTM-based RNN
+    lstm1 = lstmemory(
+        input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr)
+
+    # Build the stacked bidirectional LSTM from fc_layer and lstmemory, with stacked_num layers in total
+    inputs = [fc1, lstm1]
+    for i in range(2, stacked_num + 1):
+        fc = fc_layer(
+            input=inputs,
+            size=hid_dim,
+            act=linear,
+            param_attr=para_attr,
+            bias_attr=bias_attr)
+        lstm = lstmemory(
+            input=fc,
+            # Odd-numbered layers run forward, even-numbered layers run in reverse
+            reverse=(i % 2) == 0,
+            act=relu,
+            bias_attr=bias_attr,
+            layer_attr=layer_attr)
+        inputs = [fc, lstm]
+
+    # Max-pool the last fc_layer along the temporal dimension to get a fixed-length vector
+    fc_last = pooling_layer(input=inputs[0], pooling_type=MaxPooling())
+    # Max-pool the last lstmemory along the temporal dimension to get a fixed-length vector
+    lstm_last = pooling_layer(input=inputs[1], pooling_type=MaxPooling())
+    # Feed fc_last and lstm_last together into a softmax classifier with class_dim classes
+    output = fc_layer(
+        input=[fc_last, lstm_last],
+        size=class_dim,
+        act=SoftmaxActivation(),
+        bias_attr=bias_attr,
+        param_attr=para_attr)
+
+    if is_predict:
+        outputs(output)
+    else:
+        outputs(classification_cost(input=output, label=data_layer('label', 1)))
+```
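+
+The alternation of directions follows from the `reverse=(i % 2) == 0` argument together with the `assert`. The plain-Python sketch below lists the direction of every layer for a given depth; the helper name `lstm_directions` is an assumption made for this illustration and does not exist in `trainer_config.py`.
+
```python
def lstm_directions(stacked_num):
    """Return the direction of each LSTM layer, from layer 1 (bottom) to the top."""
    assert stacked_num % 2 == 1  # same check as in stacked_lstm_net
    return ['reverse' if i % 2 == 0 else 'forward'
            for i in range(1, stacked_num + 1)]


print(lstm_directions(3))
```
+
+With an odd `stacked_num`, both the first and the last layer are forward layers, and every even-numbered layer in between reads the sequence in reverse.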
+
+The model defined in `trainer_config.py` uses the `stacked_lstm_net` structure by default. To use `convolution_net` instead, comment out the `stacked_lstm_net` call and uncomment the `convolution_net` line.
+
+```python
+stacked_lstm_net(
+    dict_dim, class_dim=class_dim, stacked_num=3, is_predict=is_predict)
+# convolution_net(dict_dim, class_dim=class_dim, is_predict=is_predict)
+```
+
+## Model Training
+Use `train.sh` script to run local training:
+
+```
+./train.sh
+```
+
+`train.sh` is as follows:
+
+```bash
+paddle train --config=trainer_config.py \
+             --save_dir=./model_output \
+             --job=train \
+             --use_gpu=false \
+             --trainer_count=4 \
+             --num_passes=10 \
+             --log_period=20 \
+             --dot_period=20 \
+             --show_parameter_stats_period=100 \
+             --test_all_data_in_one_period=1 \
+             2>&1 | tee 'train.log'
+```
+
+* \--config=trainer_config.py: set the model configuration file.
+* \--save\_dir=./model_output: set the output folder for saving model parameters.
+* \--job=train: set the job mode to training.
+* \--use\_gpu=false: use the CPU for training. If you have installed the GPU version of PaddlePaddle and want to train on a GPU, set this option to true.
+* \--trainer\_count=4: set the number of threads (or GPUs).
+* \--num\_passes=10: set the number of passes. In PaddlePaddle, a pass is one training epoch over all samples.
+* \--log\_period=20: print a log line every 20 batches.
+* \--dot\_period=20: print a `.` every 20 batches.
+* \--show\_parameter\_stats\_period=100: print parameter statistics every 100 batches.
+* \--test\_all_data\_in\_one\_period=1: test on all of the test data in each test period.
+
+If training runs successfully, the output log is saved to `train.log` and the model parameters are saved in the directory `model_output/`. The output log looks as follows:
+
+```
+Batch=20 samples=2560 AvgCost=0.681644 CurrentCost=0.681644 Eval: classification_error_evaluator=0.36875 CurrentEval: classification_error_evaluator=0.36875
+...
+Pass=0 Batch=196 samples=25000 AvgCost=0.418964 Eval: classification_error_evaluator=0.1922
+Test samples=24999 cost=0.39297 Eval: classification_error_evaluator=0.149406
+```
+
+* Batch=xx: xx batches have been trained so far.
+* samples=xx: xx samples have been processed during training.
+* AvgCost=xx: average loss from the 0-th batch to the current batch.
+* CurrentCost=xx: loss over the latest `log_period` batches.
+* Eval: classification\_error\_evaluator=xx: average classification error from the 0-th batch to the current batch.
+* CurrentEval: classification\_error\_evaluator=xx: classification error over the latest `log_period` batches.
+* Pass=0: one run over all data in the training set is called a pass. Pass "0" denotes the first pass.
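+
+The fields above are plain `key=value` pairs, so a log line can be inspected with a few lines of Python. The sketch below is an illustration only: the helper name `parse_log_fields` is an assumption, and on lines that contain both `Eval:` and `CurrentEval:` the later value overwrites the earlier one because both use the same key.
+
```python
import re


def parse_log_fields(line):
    """Return the key=value fields of one training-log line as a dict of floats."""
    return {key: float(value)
            for key, value in re.findall(r'(\w+)=([0-9.eE+-]+)', line)}


# End-of-pass line copied from the log excerpt above.
line = ('Pass=0 Batch=196 samples=25000 AvgCost=0.418964 '
        'Eval: classification_error_evaluator=0.1922')
fields = parse_log_fields(line)
# Accuracy is one minus the classification error.
print(fields['AvgCost'], 1.0 - fields['classification_error_evaluator'])
```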
+
+
+## Application models
+### Testing
+
+Testing refers to using a trained model to evaluate a labeled dataset.
+
+```
+./test.sh
+```
+
+The testing script `test.sh` is as follows, where the function `get_best_pass` ranks the passes by classification error to obtain the best model:
+
+```bash
+function get_best_pass() {
+    cat $1 | grep -Pzo 'Test .*\n.*pass-.*' | \
+    sed -r 'N;s/Test.* error=([0-9]+\.[0-9]+).*\n.*pass-([0-9]+)/\1 \2/g' | \
+    sort | head -n 1
+}
+
+log=train.log
+LOG=`get_best_pass $log`
+LOG=(${LOG})
+evaluate_pass="model_output/pass-${LOG[1]}"
+
+echo 'evaluating from pass '$evaluate_pass
+
+model_list=./model.list
+# write the selected pass directory into the model list file
+echo $evaluate_pass > $model_list
+net_conf=trainer_config.py
+paddle train --config=$net_conf \
+             --model_list=$model_list \
+             --job=test \
+             --use_gpu=false \
+             --trainer_count=4 \
+             --config_args=is_test=1 \
+             2>&1 | tee 'test.log'
+```
+
+Unlike training, testing requires setting `--job=test` and the model path `--model_list=$model_list`. If testing succeeds, the log is saved to `test.log`. In our test, the best model is `model_output/pass-00002`, with a classification error rate of 0.115645:
+
+```
+Pass=0 samples=24999 AvgCost=0.280471 Eval: classification_error_evaluator=0.115645
+```
+
+### Prediction
+The `predict.py` script provides a prediction API. To predict the label of unlabeled IMDB data, run:
+
+```
+./predict.sh
+```
+`predict.sh` is as follows (it uses the model path `model_output/pass-00002` by default; change the path if that model does not exist or if you want to use another one):
+
+```bash
+model=model_output/pass-00002/
+config=trainer_config.py
+label=data/pre-imdb/labels.list
+cat ./data/aclImdb/test/pos/10007_10.txt | python predict.py \
+     --tconf=$config \
+     --model=$model \
+     --label=$label \
+     --dict=./data/pre-imdb/dict.txt \
+     --batch_size=1
+```
+
+* `cat ./data/aclImdb/test/pos/10007_10.txt` : the input sample to predict.
+* `predict.py` : the prediction script.
+* `--tconf=$config` : the network configuration.
+* `--model=$model` : the model path.
+* `--label=$label` : the label dictionary, mapping integer IDs to string labels.
+* `--dict=./data/pre-imdb/dict.txt` : the dictionary file.
+* `--batch_size=1` : the batch size during prediction.
+
+
+Prediction result of our example:
+
+```
+Loading parameters from model_output/pass-00002/
+predicting label is pos
+```
+
+The sample `10007_10.txt` comes from the folder `./data/aclImdb/test/pos`, so its true label is pos. The predicted label is also pos, so the prediction is correct.
+## Summary
+In this chapter, we used sentiment analysis as an example to introduce end-to-end short text classification with deep learning models, and showed how to implement the models in PaddlePaddle. We also briefly introduced two models for text processing: CNN and RNN. In the following chapters we will see how these models can be applied to other tasks.
+## Reference
+1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
+2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
+3. Yann N. Dauphin, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J] arXiv preprint arXiv:1612.08083, 2016.
+4. Siegelmann H T, Sontag E D. [On the computational power of neural nets](http://research.cs.queensu.ca/home/akl/cisc879/papers/SELECTED_PAPERS_FROM_VARIOUS_SOURCES/05070215382317071.pdf)[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 440-449.
+5. Hochreiter S, Schmidhuber J. [Long short-term memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)[J]. Neural computation, 1997, 9(8): 1735-1780.
+6. Bengio Y, Simard P, Frasconi P. [Learning long-term dependencies with gradient descent is difficult](http://www-dsi.ing.unifi.it/~paolo/ps/tnn-94-gradient.pdf)[J]. IEEE transactions on neural networks, 1994, 5(2): 157-166.
+7. Graves A. [Generating sequences with recurrent neural networks](http://arxiv.org/pdf/1308.0850)[J]. arXiv preprint arXiv:1308.0850, 2013.
+8. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://arxiv.org/pdf/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
+9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
+
+
+ This tutorial was created by PaddlePaddle and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+# Word2Vec
+
+This is intended as a reference tutorial. The source code of this tutorial lives on [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/word2vec).
+
+For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
+
+## Background Introduction
+
+This section introduces the concept of **word embedding**, which is a vector representation of words. It is a popular technique used in natural language processing. Word embeddings support many Internet services, including search engines, advertising systems, and recommendation systems.
+
+### One-Hot Vectors
+
+Building these services requires us to quantify the similarity between two words or paragraphs. This calls for a new representation of all the words to make them more suitable for computation. An obvious way to achieve this is through the vector space model, where every word is represented as a **one-hot vector**.
+
+For each word, its vector representation has the corresponding entry in the vector as 1, and all other entries as 0. The lengths of one-hot vectors match the size of the dictionary. Each entry of a vector corresponds to the presence (or absence) of a word in the dictionary.
+
+One-hot vectors are intuitive, yet they have limited usefulness. Take the example of an Internet advertising system: suppose a customer enters the query "Mother's Day", while an ad bids for the keyword "carnations". Because the one-hot vectors of these two words are perpendicular, the distance between them (whether measured by Euclidean distance or cosine similarity) would indicate little relevance. However, *we* know that these two queries are connected semantically, since people often give their mothers bundles of carnations on Mother's Day. This discrepancy is due to the low information capacity of each vector. That is, comparing the one-hot representations of two words does not assess their relevance sufficiently. To calculate their similarity accurately, we need more information, which could be learned from large amounts of data through machine learning methods.
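+
+To make the orthogonality concrete, take a toy dictionary of size 4 (an illustrative assumption) in which "Mother's Day" has index 1 and "carnations" has index 3. Their one-hot vectors are
+
+$$u = [1, 0, 0, 0]^T, \quad v = [0, 0, 1, 0]^T$$
+
+so the cosine similarity and the Euclidean distance are
+
+$$\cos(u, v) = \frac{u^T v}{\|u\| \|v\|} = \frac{0}{1 \cdot 1} = 0, \quad \|u - v\| = \sqrt{1^2 + 1^2} = \sqrt{2}$$
+
+The same two values result for any pair of distinct words, so one-hot vectors cannot tell related pairs from unrelated ones.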
+
+Like many machine learning models, word embeddings can represent knowledge in various ways. A word embedding model can project a one-hot vector to an embedding vector of lower dimension, e.g. $embedding(mother's day) = [0.3, 4.2, -1.5, ...], embedding(carnations) = [0.2, 5.6, -2.3, ...]$. Mapping one-hot vectors onto an embedded vector space has the potential to bring the embedding vectors of similar words (either semantically or usage-wise) closer to each other, so that the cosine similarity between the corresponding vectors for words like "Mother's Day" and "carnations" is no longer zero.
+
+A word embedding model could be a probabilistic model, a co-occurrence matrix model, or a neural network. Before people started using neural networks to generate word embeddings, the traditional method was to calculate a co-occurrence matrix $X$ of words. Here, $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ is the number of times the $i$-th and $j$-th words of the vocabulary $V$ co-occur in the whole corpus, and $|V|$ is the size of the vocabulary. By performing a matrix decomposition on $X$, e.g. Singular Value Decomposition \[[5](#References)\],
+
+$$X = USV^T$$
+
+the resulting $U$ can be seen as the word embedding of all the words.
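+
+In practice the decomposition is often truncated. As a sketch under that assumption (the rank $k$ is a symbol introduced here for illustration), keeping only the $k$ largest singular values gives
+
+$$X \approx U_k S_k V_k^T$$
+
+where $U_k$ is a $|V| \times k$ matrix, so that row $i$ of $U_k$ serves as a $k$-dimensional embedding of the $i$-th word.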
+
+However, this method suffers from many drawbacks:
+
+1) Since many pairs of words never co-occur, the co-occurrence matrix is sparse. To achieve good matrix factorization results, further treatment of word frequencies is needed;
+2) The matrix is large, frequently on the order of $10^6 \times 10^6$;
+3) Stop words (like "although", "a", ...) need to be filtered out manually; otherwise these frequent words will degrade the matrix factorization.
+
+The neural network based model does not require storing huge hash tables of statistics on the whole corpus. It obtains the word embedding by learning from semantic information, and hence avoids the aforementioned problems of the traditional method. In this chapter, we will introduce the details of the neural network word embedding model and how to train such a model in PaddlePaddle.
+
+## Results Demonstration
+
+In this section, after training the word embedding model, we use the data visualization algorithm t-SNE\[[4](#reference)\] to project the word embedding vectors onto a two-dimensional space and plot them (see figure below). The figure shows that semantically relevant words -- *a*, *the*, and *these* or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business* or *decision* and *japan* -- are far from each other.
+
+
+
+ Figure 1. Two-dimensional projection of word embeddings
+
+
+### Cosine Similarity
+
+In addition, we know that the cosine similarity between two vectors falls in the range $[-1,1]$. Specifically, the cosine similarity is 1 when the vectors point in the same direction, 0 when the vectors are perpendicular, and -1 when they point in opposite directions. That is, the cosine similarity between two vectors grows with their relevance. So we can calculate the cosine similarity of two word embedding vectors to represent their relevance:
+
+```
+please input two words: big huge
+similarity: 0.899180685161
+
+please input two words: from company
+similarity: -0.0997506977351
+```
+
+The above results could be obtained by running `calculate_dis.py`, which loads the words in the dictionary and their corresponding trained word embeddings. For detailed instruction, see section [Model Application](#Model Application).
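+
+The similarity computation itself is short. The following plain-Python sketch is an illustration of the formula and not the code of `calculate_dis.py`; the function name `cosine_similarity` and the toy vectors are assumptions made for this example.
+
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # perpendicular
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite directions
```
+
+The three calls cover the three cases described above: a value close to 1, a value of 0, and a value close to -1.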
+
+
+## Model Overview
+
+In this section, we will introduce three word embedding models: N-gram model, CBOW, and Skip-gram, which all output the probability of each word given its immediate context.
+
+For N-gram model, we will first introduce the concept of language model, and implement it using PaddlePaddle in section [Model Training](#Model Training).
+
+The latter two models, which became popular recently, are neural word embedding models developed by Tomas Mikolov at Google \[[3](#reference)\]. Despite their apparent simplicity, these models train very well.
+
+### Language Model
+
+Before diving into word embedding models, we will first introduce the concept of **language model**. Language models build the joint probability function $P(w_1, ..., w_T)$ of a sentence, where $w_i$ is the i-th word in the sentence. The goal is to give higher probabilities to meaningful sentences, and lower probabilities to meaningless constructions.
+
+In general, models that generate the probability of a sequence can be applied to many fields, like machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition. Take information retrieval, for example. If you search for "how long is a football bame" (where bame is a medical noun), the search engine asks whether you meant "how long is a football game" instead. This is because the probability of "how long is a football bame" is very low according to the language model; in addition, among all of the words easily confused with "bame", "game" builds the most probable sentence.
+
+#### Target Probability
+For language model's target probability $P(w_1, ..., w_T)$, if the words in the sentence were to be independent, the joint probability of the whole sentence would be the product of each word's probability:
+
+$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$
+
+However, the probability of a word in a sentence typically depends on the words before it, so canonical language models are built on conditional probabilities:
+
+$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
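+
+For example, for the three-word sentence "I like tea" (a sentence chosen here purely for illustration), the factorization reads
+
+$$P(\text{I}, \text{like}, \text{tea}) = P(\text{I}) \, P(\text{like} | \text{I}) \, P(\text{tea} | \text{I}, \text{like})$$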
+
+
+### N-gram neural model
+
+In computational linguistics, n-gram is an important method to represent text. An n-gram represents a contiguous sequence of n consecutive items given a text. Based on the desired application scenario, each item could be a letter, a syllable or a word. The N-gram model is also an important method in statistical language modeling. When training language models with n-grams, the first (n-1) words of an n-gram are used to predict the *n*th word.
+
+Yoshua Bengio and other scientists describe how to train a word embedding model using a neural network in the famous 2003 paper A Neural Probabilistic Language Model \[[1](#reference)\]. The Neural Network Language Model (NNLM) described in the paper learns the language model and word embedding simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on a large corpus, the model learns the word embedding; then, it computes the probability of the whole sentence, using the embedding. This type of language model can overcome the **curse of dimensionality**, i.e. the number of possible word sequences grows exponentially with their length, so a sequence met at test time is likely to differ from every sequence seen in training. Note that the term *neural network language model* is ill-defined, so we will not use the name NNLM but only refer to it as *N-gram neural model* in this section.
+
+We have previously described the language model using conditional probability, where the probability of the *t*-th word in a sentence depends on all $t-1$ words before it. Furthermore, since words further back have less impact on a word, we can assume that every word is only affected by its previous $n-1$ words, which gives:
+
+$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$
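+
+With $n=3$ (a trigram model) and a sentence of $T=5$ words, for instance, the product expands to
+
+$$P(w_1, ..., w_5) = P(w_3|w_2, w_1) \, P(w_4|w_3, w_2) \, P(w_5|w_4, w_3)$$
+
+Each factor conditions on only two preceding words, however long the sentence is.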
+
+Given some real corpus in which all sentences are meaningful, the n-gram model should maximize the following objective function:
+
+$$\frac{1}{T}\sum_t \log f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
+
+where $f(w_t, w_{t-1}, ..., w_{t-n+1};\theta)$ represents the conditional probability of the current word $w_t$ given its previous $n-1$ words, and $R(\theta)$ represents the parameter regularization term.
+
+
+
+ Figure 2. N-gram neural network model
+
+
+
+Figure 2 shows the N-gram neural network model. From the bottom up, the model has the following components:
+
+ - For each sample, the model gets input $w_{t-n+1},...,w_{t-1}$, and outputs the probability that the $t$-th word is each of the $|V|$ words in the dictionary.
+
+ Every input word $w_{t-n+1},...,w_{t-1}$ is first transformed into its word embedding $C(w_{t-n+1}),...,C(w_{t-1})$ through a transformation matrix.
+
+ - All the word embeddings are concatenated into a single vector, which is mapped (nonlinearly) into the hidden representation of the $t$-th word:
+
+ $$g=U\tanh(\theta^Tx + b_1) + Wx + b_2$$
+
+ where $x$ is the large vector concatenated from all the word embeddings representing the context; $\theta$, $U$, $b_1$, $b_2$ and $W$ are parameters connecting word embedding layers to the hidden layers. $g$ represents the unnormalized probability of the output word, $g_i$ represents the unnormalized probability of the output word being the i-th word in the dictionary.
+
+ - Based on the definition of softmax, using normalized $g_i$, the probability that the output word is $w_t$ is represented as:
+
+ $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$
+
+ - The cost of the entire network is a multi-class cross-entropy and can be described by the following loss function
+
+ $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log(\text{softmax}(g_k^i))$$
+
+ where $y_k^i$ represents the true label for the $k$-th class in the $i$-th sample ($0$ or $1$), $softmax(g_k^i)$ represents the softmax probability for the $k$-th class in the $i$-th sample.
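+
+The last two steps can be sketched in plain Python. Because the label $y^i$ is one-hot, the inner sum of the loss keeps only the term of the true word, which is what `cross_entropy` computes for one sample. The function names and the toy scores are assumptions made for this illustration; this is not PaddlePaddle code.
+
```python
import math


def softmax(scores):
    """Normalize the unnormalized scores g into probabilities."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target_index):
    """Loss of one sample whose true word has index target_index."""
    return -math.log(softmax(scores)[target_index])


g = [2.0, 1.0, 0.1]  # toy scores for a 3-word dictionary
probs = softmax(g)
print(probs, cross_entropy(g, 0))
```
+
+The probabilities sum to one, and the loss shrinks as the score of the true word grows relative to the others.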
+
+### Continuous Bag-of-Words model(CBOW)
+
+CBOW model predicts the current word based on the N words both before and after it. When $N=2$, the model is as the figure below:
+
+
+
+ Figure 3. CBOW model
+
+
+Specifically, by ignoring the order of words in the sequence, CBOW uses the average value of the word embedding of the context to predict the current word:
+
+$$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
+
+where $x_t$ is the word embedding of the $t$-th word, the classification score vector is $z=U \cdot \text{context}$, the final classification $y$ uses softmax, and the loss function uses multi-class cross-entropy.
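+
+The averaging and scoring steps can be sketched as follows. This is a plain-Python illustration with toy numbers; the function name `cbow_scores` and the matrix values are assumptions, and the softmax step is left out.
+
```python
def cbow_scores(context_embeddings, U):
    """Average the context embeddings, then score every dictionary word."""
    dim = len(context_embeddings[0])
    context = [sum(vec[d] for vec in context_embeddings) / len(context_embeddings)
               for d in range(dim)]
    # z = U * context: one score per row of U (one row per dictionary word).
    return [sum(u_d * c_d for u_d, c_d in zip(row, context)) for row in U]


# Four context words (N=2 on each side) with 2-dimensional toy embeddings.
context_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy dictionary of 3 words
print(cbow_scores(context_words, U))
```
+
+Shuffling `context_words` leaves the result unchanged, which reflects that CBOW ignores the order of the context words.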
+
+### Skip-gram model
+
+The advantage of CBOW is that it smooths over the word embeddings of the context and reduces noise, so it is very effective on small datasets. Skip-gram uses a word to predict its context and obtains multiple contexts for the given word, so it can be used on larger datasets.
+
+
+
+ Figure 4. Skip-gram model
+
+
+As illustrated in the figure above, the skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (the $n$ words before and the $n$ words after the given word), and then combines the softmax classification losses of all those $2n$ words.
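+
+The training samples implied by this description are pairs of a given word and one of its context words. The plain-Python sketch below enumerates them; the function name `skipgram_pairs` and the example sentence are assumptions made for this illustration, and words near the sentence boundary simply get fewer than $2n$ context words.
+
```python
def skipgram_pairs(words, n):
    """Yield (center, context) pairs with up to n words on each side."""
    for t, center in enumerate(words):
        lo, hi = max(0, t - n), min(len(words), t + n + 1)
        for j in range(lo, hi):
            if j != t:
                yield center, words[j]


sentence = ['the', 'quick', 'brown', 'fox']
print(list(skipgram_pairs(sentence, 1)))
```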
+
+## Data Preparation
+
+## Model Configuration
+
+
+ Figure 5. N-gram neural network model in model configuration
+
+
+
+## Model Training
+
+## Model Application
+
+## Conclusion
+
+This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.
+
+In information retrieval, the relevance between the query and document keyword can be computed through the cosine similarity of their word embeddings. In grammar analysis and semantic analysis, a previously trained word embedding can initialize models for better performance. In document classification, clustering the word embedding can group synonyms in the documents. We hope that readers can use word embedding models in their work after reading this chapter.
+
+
+## References
+1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. journal of machine learning research, 2003, 3(Feb): 1137-1155.
+2. Mikolov T, Kombrink S, Deoras A, et al. [Rnnlm-recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201.
+3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013.
+4. Maaten L, Hinton G. [Visualizing data using t-SNE](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf)[J]. Journal of Machine Learning Research, 2008, 9(Nov): 2579-2605.
+5. [Singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition). Wikipedia.
+
+
+ This tutorial was created by PaddlePaddle and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+