diff --git a/build.sh b/build.sh
index 84ec951367790bc47daea94c49a11c6f3b10100f..8497a3db15496faba30b245db9189c1916d1f374 100755
--- a/build.sh
+++ b/build.sh
@@ -1,6 +1,9 @@
#!/bin/bash
-for file in `find . -name '*.md' | grep -v '^./README.md'`
-do
- bash .tmpl/convert-markdown-into-html.sh $file > `dirname $file`/index.html
+for i in $(du -a | grep '\.\/.\+\/README.md' | cut -f 2); do
+ .tmpl/convert-markdown-into-html.sh $i > $(dirname $i)/index.html
+done
+
+for i in $(du -a | grep '\.\/.\+\/README.en.md' | cut -f 2); do
+ .tmpl/convert-markdown-into-html.sh $i > $(dirname $i)/index.en.html
done
diff --git a/fit_a_line/index.en.html b/fit_a_line/index.en.html
new file mode 100644
index 0000000000000000000000000000000000000000..b2492b2c8d0ab1126ba444acc669102bc02ebdfb
--- /dev/null
+++ b/fit_a_line/index.en.html
@@ -0,0 +1,251 @@
+# Linear Regression
+Let us begin the tutorial with a classical problem called Linear Regression \[[1](#References)\]. In this chapter, we will train a model from a realistic dataset to predict house prices. Some important concepts in Machine Learning will be covered through this example.
+
+The source code for this tutorial is at [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/fit_a_line). If this is your first time using PaddlePaddle, please refer to the [Install Guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
+
+## Problem
+Suppose we have a dataset of $n$ houses. Each house $i$ has $d$ properties and a price $y_i$. A property $x_{i,j}$ describes one aspect of house $i$: for example, the number of rooms, the number of schools or hospitals in the neighborhood, the nearby traffic conditions, etc. Our task is to predict $y_i$ given a set of properties $\{x_{i,1}, ..., x_{i,d}\}$. We assume that the price is a linear combination of all the properties, i.e.,
+
+$$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, i=1,\ldots,n$$
+
+where the weights $\omega_1, \ldots, \omega_d$ and the bias $b$ are the model parameters we want to estimate. Once they are learned, given the properties of a house, we will be able to predict its price. This model is called Linear Regression; namely, we regress a value as a linear combination of several input values. In practice this linear assumption rarely holds exactly, because the real relationship between house properties and prices is much more complicated. However, thanks to its simple formulation, which makes training and analysis easy, Linear Regression has been applied to many real problems, and it remains an important topic in classical Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\].
+
+## Results Demonstration
+We first show the training results of our model. We use the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) to train a linear model and predict house prices in Boston. The figure below shows the model's predictions for some house prices. The $X$ coordinate of each point represents the actual median price of a certain type of house, while the $Y$ coordinate represents the value predicted by our linear model. A point lies exactly on the dotted line when $X=Y$; in other words, the more precisely the model predicts, the closer the point is to the dotted line.
+
+
+ Figure 1. Predicted Value vs. Actual Value
+
+
+## Model Overview
+
+### Model Definition
+
+In the UCI Housing Data Set, there are 13 house properties $x_{i,j}$ that are related to the median house price $y_i$. Thus our model is:
+
+$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$
+
+where $\hat{Y}$ is the predicted value, written with a hat to distinguish it from the actual value $Y$. The model parameters to be learned are $\omega_1, \ldots, \omega_{13}$ and $b$, where the $\omega$'s are called the weights and $b$ is called the bias.
+
+Now we need an optimization goal so that, with the learned parameters, $\hat{Y}$ is as close to $Y$ as possible. Here we introduce the concept of a [Loss Function (Cost Function)](https://en.wikipedia.org/wiki/Loss_function). A Loss Function has the property that, given any pair of an actual value $y_i$ and a predicted value $\hat{y_i}$, its output is always non-negative. This non-negative value reflects the model error.
+
+For Linear Regression, the most common Loss Function is [Mean Square Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) which has the following form:
+
+$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
+
+For a dataset of size $n$, MSE is the average of the $n$ squared prediction errors.
+
+### Training
+
+After defining our model, we have several major steps for the training:
+1. Initialize the parameters, including the weights $\omega$ and the bias $b$. For example, we can sample their initial values from a distribution with mean 0 and standard deviation 1.
+2. Feed forward to compute the network output and the Loss Function.
+3. [Backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors from the output layer back to the input layer, updating the model parameters with the corresponding gradients along the way.
+4. Repeat steps 2-3 until the loss falls below a predefined threshold or the maximum number of iterations is reached (see the sketch below).
+
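+To make these steps concrete, here is a minimal NumPy sketch of the training loop for our linear model; it is illustrative only, as PaddlePaddle carries out these steps for us below:
+
+```python
+import numpy as np
+
+def train_linear_regression(X, Y, lr=0.01, num_passes=100):
+    n, d = X.shape
+    w, b = np.zeros(d), 0.0            # step 1: initialize parameters
+    for _ in range(num_passes):        # step 4: repeat
+        Y_hat = X.dot(w) + b           # step 2: feed forward
+        grad = 2.0 / n * (Y_hat - Y)   # step 3: gradient of the MSE loss
+        w -= lr * X.T.dot(grad)        # update the parameters
+        b -= lr * grad.sum()
+    return w, b
+```
+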
+## Data Preparation
+Follow the command below to prepare data:
+```bash
+cd data && python prepare_data.py
+```
+This command downloads the dataset from the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) page and performs some [preprocessing](#Preprocessing). The dataset is then split into a training set and a test set.
+
+The dataset contains 506 lines in total, each describing the properties and the median price of a certain type of house in Boston. The meanings of the attributes are listed below:
+
+
+| Property Name | Explanation | Data Type |
+| ------| ------ | ------ |
+| CRIM | per capita crime rate by town | Continuous|
+| ZN | proportion of residential land zoned for lots over 25,000 sq.ft. | Continuous |
+| INDUS | proportion of non-retail business acres per town | Continuous |
+| CHAS | Charles River dummy variable | Discrete, 1 if tract bounds river; 0 otherwise|
+| NOX | nitric oxides concentration (parts per 10 million) | Continuous |
+| RM | average number of rooms per dwelling | Continuous |
+| AGE | proportion of owner-occupied units built prior to 1940 | Continuous |
+| DIS | weighted distances to five Boston employment centres | Continuous |
+| RAD | index of accessibility to radial highways | Continuous |
+| TAX | full-value property-tax rate per $10,000 | Continuous |
+| PTRATIO | pupil-teacher ratio by town | Continuous |
+| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | Continuous |
+| LSTAT | % lower status of the population | Continuous |
+| MEDV | Median value of owner-occupied homes in $1000's | Continuous |
+
+The last entry is the median house price.
+
+### Preprocessing
+#### Continuous and Discrete Data
+We define a feature vector of length 13 for each house, where each entry corresponds to a property of that house. Our first observation is that, among the 13 dimensions, 12 are continuous and 1 is discrete. Note that although a discrete value is also written as digits such as 0, 1, or 2, it has a quite different meaning from a continuous value, because the difference between two discrete values has no numerical meaning. For example, if we use 0, 1, and 2 to represent `red`, `green`, and `blue` respectively, we cannot say that `blue` differs from `red` more than `green` does, even though the numerical difference between `red` and `blue` is larger. Therefore, when handling a discrete feature with $d$ possible values, we usually convert it to $d$ binary features, where the $j$-th feature indicates whether the original value is the $j$-th possible one (one-hot encoding). Alternatively, we can map the discrete feature to a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary value, no preprocessing is needed.
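+
+As a quick sketch, one-hot encoding a discrete feature with $d$ possible values looks like this:
+
+```python
+import numpy as np
+# one-hot encode the value 2 (e.g., `blue`) out of d=3 possible values
+d, value = 3, 2
+one_hot = np.zeros(d)
+one_hot[value] = 1.0  # -> array([0., 0., 1.])
+```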
+
+#### Feature Normalization
+Another observation is the huge difference among the value ranges of the 13 features (Figure 2). For example, feature B has a value range of [0.32, 396.90], while feature NOX has a range of [0.3850, 0.8170]. For effective optimization, we need data normalization. The goal of normalization is to scale each feature into roughly the same value range, for example [-0.5, 0.5]. Here we adopt a standard way of normalization: subtracting the mean value from the feature and dividing the result by the original value range.
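+
+As a minimal sketch, assuming `features` is an $n \times 13$ NumPy array, this normalization amounts to:
+
+```python
+# assuming `features` is an n x 13 NumPy array
+mean = features.mean(axis=0)
+value_range = features.max(axis=0) - features.min(axis=0)
+features = (features - mean) / value_range
+```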
+
+There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling):
+- A value range that is too large or too small might cause floating number overflow or underflow during computation.
+- Different value ranges might give different features different importance to the model (at least early in training), an assumption that is usually unwarranted. It makes the optimization harder and increases the training time considerably.
+- Many Machine Learning techniques or models (e.g., L1/L2 regularization and Vector Space Model) are based on the assumption that all the features have roughly zero means and their value ranges are similar.
+
+
+
+ Figure 2. The value ranges of the features
+
+
+#### Prepare Training and Test Sets
+We split the dataset into two subsets, one for estimating the model parameters, namely, model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal of training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. You can try different split ratios to observe how the two variances change.
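+
+As a sketch of such a split, assuming `data` is a NumPy array with one sample per row (the `prepare_data.py` command below does this for us):
+
+```python
+import numpy as np
+np.random.shuffle(data)            # shuffle the samples before splitting
+split = int(len(data) * 0.8)       # 8:2 split ratio
+train, test = data[:split], data[split:]
+```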
+
+Execute the following command to split the dataset and write the training and test sets into the `train.list` and `test.list` files, so that PaddlePaddle can later read from them.
+```bash
+python prepare_data.py -r 0.8 #8:2 is the default split ratio
+```
+
+When training complex models, we usually have one more split: the validation set. Complex models usually have [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) (e.g., the number of layers in the network) that need to be set before training begins. These hyperparameters are not part of the model parameters and cannot be learned using the same Loss Function. Thus we try several sets of hyperparameters to obtain several models, compare them on the validation set to pick the best one, and finally evaluate it on the test set. Because our model is relatively simple, we skip this validation process for now.
+
+### Provide Data to PaddlePaddle
+After the data is prepared, we use a Python Data Provider to provide data for PaddlePaddle. A Data Provider is a Python function which will be called by PaddlePaddle during training. In this example, the Data Provider only needs to read the data and return it to the training process of PaddlePaddle line by line.
+
+```python
+from paddle.trainer.PyDataProvider2 import *
+import numpy as np
+#define data type and dimensionality
+@provider(input_types=[dense_vector(13), dense_vector(1)])
+def process(settings, input_file):
+ data = np.load(input_file.strip())
+ for row in data:
+ yield row[:-1].tolist(), row[-1:].tolist()
+
+```
+
+## Model Configuration
+
+### Data Definition
+We first call the function `define_py_data_sources2` to let PaddlePaddle read training and test data from `dataprovider.py` defined above. PaddlePaddle can accept configuration information from the command line; for example, here we pass a variable named `is_predict` to control whether the model has a different structure during training and prediction.
+```python
+from paddle.trainer_config_helpers import *
+
+is_predict = get_config_arg('is_predict', bool, False)
+
+define_py_data_sources2(
+ train_list='data/train.list',
+ test_list='data/test.list',
+ module='dataprovider',
+ obj='process')
+
+```
+
+### Algorithm Settings
+Next we need to set the details of the optimization algorithm. Since the Linear Regression model is simple, we only need to set `batch_size`, which defines how many samples are used for each update of the parameters.
+```python
+settings(batch_size=2)
+```
+
+### Network
+Finally, we use `fc_layer` and `LinearActivation` to represent the Linear Regression model.
+```python
+#input data of 13 dimensional house information
+x = data_layer(name='x', size=13)
+
+y_predict = fc_layer(
+ input=x,
+ param_attr=ParamAttr(name='w'),
+ size=1,
+ act=LinearActivation(),
+ bias_attr=ParamAttr(name='b'))
+
+if not is_predict: #when training, we use MSE (i.e., regression_cost) as the Loss Function
+ y = data_layer(name='y', size=1)
+ cost = regression_cost(input=y_predict, label=y)
+ outputs(cost) #output MSE to view the loss change
+else: #during test, output the prediction value
+ outputs(y_predict)
+```
+
+## Training Model
+We can run the PaddlePaddle command-line trainer in the root directory of the code. The configuration file is named `trainer_config.py`. We train for 30 passes and save the results in the directory `output`:
+```bash
+./train.sh
+```
+
+## Use Model
+Now we can use the trained model to make predictions.
+```bash
+python predict.py
+```
+Here by default we use the model in `output/pass-00029` for prediction, and compare the actual house price with the predicted one. The result is shown in `predictions.png`.
+If you want to use another model or test on other data, you can pass in a new model path or data path:
+```bash
+python predict.py -m output/pass-00020 -t data/housing.test.npy
+```
+
+## Summary
+In this chapter, we introduced the Linear Regression model using the UCI Housing Data Set as an example, and showed how to train and test it with PaddlePaddle. Many more complex models and techniques are derived from this simple linear model; it is therefore important to understand how it works.
+
+
+## References
+1. https://en.wikipedia.org/wiki/Linear_regression
+2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. Springer, Berlin: Springer series in statistics, 2001.
+3. Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012.
+4. Bishop C M. Pattern recognition and machine learning[M]. Springer, 2006.
+
+
+ This tutorial was created by PaddlePaddle and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+Image Classification
+=======================
+
+The source code of this chapter is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). First-time users should refer to the PaddlePaddle [Installation Tutorial](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html) for installation instructions.
+
+## Background
+
+Compared to words, images provide information that is more vivid and easier to understand, often with more artistic appeal. They are an important medium for people to convey and exchange ideas. In this chapter, we focus on one of the essential problems in image recognition -- image classification.
+
+Image classification distinguishes images of different categories based on their semantic meaning. It is a core problem in computer vision, and is also the foundation of other higher level computer vision tasks such as object detection, image segmentation, object tracking, action recognition, etc. Image classification has applications in many areas such as face recognition and intelligent video analysis in security systems, traffic scene recognition in transportation systems, content-based image retrieval and automatic photo indexing in web services, image classification in medicine, etc.
+
+In image classification, we first encode the whole image using handcrafted or learned features, and then determine the object category with a classifier. Thus, feature extraction plays an important role. Prior to deep learning, the BoW (Bag of Words) model was the most popular method for object classification. BoW was introduced in NLP, where a sentence is represented as a bag of words (words, phrases, or characters) extracted from training sentences. In the context of image classification, the BoW model requires constructing a dictionary of visual words. The simplest BoW framework consists of three steps: **feature extraction**, **feature encoding**, and **classifier design**.
+
+The deep learning approach to image classification learns hierarchical features automatically, through supervised or unsupervised learning, instead of crafting or selecting image features manually. Convolutional Neural Networks (CNNs) have made significant progress in image classification. They keep all the image information by taking raw pixels as input, extract low-level and high-level abstract features through convolution operations, and output the classification results directly. This end-to-end learning fashion leads to good performance and wide applicability.
+
+In this chapter, we focus on introducing deep learning-based image classification methods, and on explaining how to train a CNN model using PaddlePaddle.
+
+## Demonstration
+
+Image classification tasks can be general or fine-grained. Figure 1 demonstrates the results of general image classification -- the trained model can correctly recognize the main objects in the images.
+
+
+
+Figure 1. General image classification
+
+
+
+Figure 2 demonstrates the results of fine-grained image classification -- flower recognition, which requires correct recognition of flower categories.
+
+
+
+Figure 2. Fine-grained image classification
+
+
+
+A good model should recognize objects of different categories correctly, and should also classify images correctly when they are taken from different points of view, under different illumination, or with object distortion or partial occlusion (we call these image disturbances). Figure 3 shows some disturbed images; a good model, like a human, should classify them correctly.
+
+
+
+Figure 3. Disturbed images [22]
+
+
+## Model Overview
+
+A large amount of research in image classification is built upon public datasets such as [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/) and [ImageNet](http://image-net.org/), and many image classification algorithms are evaluated and compared on them. PASCAL VOC is a computer vision competition started in 2005, and ImageNet is a dataset introduced with the Large Scale Visual Recognition Challenge (ILSVRC) in 2010. In this chapter, we introduce some image classification models submitted to these competitions.
+
+Before 2012, traditional image classification followed the three steps described in the Background section. A complete model construction usually involved the following stages: low-level feature extraction, feature encoding, spatial constraint or feature clustering, classifier design, and model ensembling.
+
+ 1). **Low-level feature extraction**: extract large numbers of local features at fixed strides and scales. Popular local features include the Scale-Invariant Feature Transform (SIFT) [1], the Histogram of Oriented Gradients (HOG) [2], Local Binary Patterns (LBP) [3], etc. A common practice is to employ multiple feature descriptors in order not to miss too much information.
+ 2). **Feature encoding**: low-level features contain a large amount of redundancy and noise. To improve the robustness of the features, a feature transformation is employed to encode the low-level features; this is called feature encoding. Common methods include vector quantization [4], sparse coding [5], locality-constrained linear coding [6], Fisher vector encoding [7], etc.
+ 3). **Spatial constraint**: spatial constraint or feature clustering is usually applied after feature encoding to extract the maximum or average of each dimension within spatial regions. Pyramid feature matching -- a popular feature-clustering method -- divides an image uniformly into patches and performs feature clustering in each patch.
+ 4). **Classification**: after the above steps, an image is described by a vector of fixed dimension, and a classifier can then assign it to a category. Common classifiers include Support Vector Machines (SVM), random forests, etc. Kernel SVMs were the most popular classifiers and achieved very good performance in traditional image classification tasks. A minimal sketch of steps 2) and 3) is given below.
+
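+In this sketch, vector quantization is followed by average pooling; the `codebook` is assumed to be learned beforehand, e.g., by k-means, and the resulting histogram is the image descriptor handed to the classifier of step 4:
+
+```python
+import numpy as np
+
+def bow_encode(local_features, codebook):
+    # feature encoding: assign each local descriptor to its nearest codeword
+    dists = ((local_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
+    assignments = dists.argmin(axis=1)
+    # feature pooling: turn the assignments into a fixed-length histogram
+    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
+    return hist / hist.sum()
+```
+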
+This method was widely used in the PASCAL VOC competitions [18]. NEC Labs (http://www.nec-labs.com/) won the ILSVRC 2010 championship by employing SIFT and LBP features, two non-linear encoders, and an SVM classifier [8].
+
+The CNN model AlexNet, proposed by Alex Krizhevsky et al. [9], made a breakthrough in ILSVRC 2012: it outperformed traditional methods dramatically and won the championship. This was also the first time a deep learning method was used for large-scale image classification. Since AlexNet, a series of CNN models have been proposed and have steadily advanced the state of the art on ImageNet, as shown in Figure 4. With deeper and more sophisticated architectures, the Top-5 error rate has dropped to around 3.5%, while the error rate of human raters on the same ImageNet data is 5.1%. This means that deep learning models now surpass human raters at this image classification task.
+
+Figure 4. Top-5 error rates of image classification models on ImageNet
+
+### CNN
+
+Traditional CNNs consist of convolutional and fully-connected layers, and employ a softmax multi-category classifier with a cross-entropy loss function. Figure 5 shows a typical CNN. We first introduce its common components.
+
+
+
+Figure 5. A CNN example [20]
+
+
+- Convolutional layer: applies the convolution operation to extract low-level and high-level features, exploiting local correlation and spatial invariance.
+
+- Pooling layer: down-samples feature maps by extracting the local maximum (max-pooling) or average (avg-pooling) of each patch in the feature map. Down-sampling, a common operation in image processing, filters out high-frequency information.
+
+- Fully-connected layer: fully connects the neurons of two adjacent layers.
+
+- Non-linear activation: convolutional and fully-connected layers are usually followed by a non-linear activation, such as sigmoid, tanh, or ReLU, to enhance expressive capability. ReLU is the most commonly used activation in CNNs.
+
+- Dropout [10]: at each training stage, individual nodes are dropped from the network with a certain probability in order to improve generalization and avoid overfitting.
+
+Because the parameters of every layer are updated during training, the distributions of each layer's inputs keep changing, which requires careful tuning of hyper-parameters. In 2015, Sergey Ioffe and Christian Szegedy proposed the Batch Normalization (BN) algorithm [14], which normalizes the features of each batch within a layer, yielding relatively stable input distributions for every layer. BN not only acts as a regularizer, but also reduces the need for careful hyper-parameter design. Experiments show that BN accelerates training convergence, and it has been widely used in later, deeper models.
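+
+As a simplified sketch of the forward computation, where the scale $\gamma$ and shift $\beta$ are parameters learned with the rest of the network:
+
+```python
+import numpy as np
+
+def batch_norm(x, gamma, beta, eps=1e-5):
+    # normalize each feature over the batch dimension; x: [batch, features]
+    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
+    # restore expressiveness with the learned scale and shift
+    return gamma * x_hat + beta
+```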
+
+We introduce the network architectures of VGG, GoogleNet and ResNet in the following sections.
+
+### VGG
+
+The Oxford Visual Geometry Group (VGG) proposed the VGG network in ILSVRC 2014 [11]. The model is deeper and wider than previous architectures. It comprises five main groups of convolution operations, with max-pooling layers between adjacent groups. Each group contains a series of 3x3 convolutional layers whose number of kernels stays the same within the group and increases from 64 in the first group to 512 in the last. The total number of learnable layers can be 11, 13, 16, or 19, depending on the number of convolutional layers in each group. Figure 6 illustrates a 16-layer VGG. The VGG architecture is relatively simple and has been adopted by many papers, such as the first one to surpass human-level performance on ImageNet [19].
+
+
+
+Figure 6. VGG16 model for ImageNet
+
+
+### GoogleNet
+
+GoogleNet [12] won the championship in ILSVRC 2014. Before introducing this model, let's get familiar with the Network in Network (NIN) model [13], from which GoogleNet borrowed some ideas, and with the Inception blocks upon which GoogleNet is built.
+
+The NIN model has two main characteristics: 1) it replaces single-layer convolutions with Multi-Layer Perceptron Convolutions (MLPconv). An MLPconv -- a tiny multi-layer convolutional network -- enhances non-linearity by adding several 1x1 convolutional layers after linear ones. 2) In traditional CNNs, the last few layers are usually fully-connected, with a large number of parameters. NIN instead replaces all fully-connected layers with convolutional layers whose feature maps have the same size as the category dimension, followed by global average pooling. This replacement significantly reduces the number of parameters.
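+
+As a sketch, global average pooling simply collapses each feature map into a single number:
+
+```python
+import numpy as np
+
+def global_avg_pool(feature_maps):
+    # feature_maps: [channels, height, width] -> [channels]
+    return feature_maps.mean(axis=(1, 2))
+```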
+
+Figure 7 depicts two Inception blocks. Figure 7(a) is the simplest design, whose output is a concatenation of the features from three convolutional layers and one pooling layer. The disadvantage of this design is that the pooling layer does not change the number of channels, so the number of outputs keeps growing: after several such blocks, the number of outputs and parameters becomes larger and larger, raising the computational complexity. To overcome this drawback, the Inception block in Figure 7(b) employs three 1x1 convolutional layers to reduce the dimension, i.e., the number of channels, while also improving the non-linearity of the network.
+
+
+
+Figure 7. Inception block
+
+
+GoogleNet consists of multiple stacked Inception blocks followed by an avg-pooling layer as in NIN, in place of traditional fully-connected layers. The difference from NIN is that GoogleNet adds a fully-connected layer after the avg-pooling layer to output a vector of category size. Besides these two characteristics, the features from the middle layers of GoogleNet are also very discriminative, so GoogleNet inserts two auxiliary classifiers into the model to enhance the gradient and regularization during backpropagation. The loss function of the whole network is the weighted sum of these three classifiers.
+
+Figure 8 illustrates the architecture of GoogleNet, which consists of 22 layers: it starts with three regular convolutional layers, followed by three groups of sub-networks -- the first group contains two Inception blocks, the second five, and the third two -- and ends with an average pooling and a fully-connected layer.
+
+
+
+Figure 8. GoogleNet[12]
+
+
+The above model is the first version of GoogleNet, or GoogleNet-v1. GoogleNet-v2 [14] introduces the BN layer; GoogleNet-v3 [16] further factorizes some convolutional layers, which increases non-linearity and network depth; GoogleNet-v4 [17] leads to the design ideas of ResNet, introduced in the next section. The evolution from v1 to v4 improved accuracy consistently. We will not go into the details of the v2 to v4 architectures.
+
+### ResNet
+
+Residual Network (ResNet) [15] won three championships in ILSVRC 2015 -- image classification, object localization, and object detection. The authors of ResNet proposed a residual learning approach to ease the difficulty of training deeper networks: as network depth increases, accuracy otherwise degrades. Building on the design ideas of BN, small convolutional kernels, and fully convolutional networks, ResNet reformulates the layers as residual blocks. Each block contains two branches: one directly connects the input to the output, and the other performs two to three convolutions, computing a residual function with reference to the layer inputs. The outputs of the two branches are then added up.
+
+Figure 9 illustrates the building blocks of ResNet. The left one is the basic block, consisting of two 3x3 convolutional layers with the same number of channels. The right one is the bottleneck block, in which the first 1x1 convolutional layer reduces the dimension from 256 to 64 and the last 1x1 convolutional layer increases it back from 64 to 256, so that the number of input and output channels of the middle 3x3 convolutional layer (64) is relatively small.
+
+
+
+Figure 9. Residual block
+
+
+Figure 10 illustrates ResNets with 50, 101, and 152 layers, respectively. All three networks use bottleneck blocks, with different numbers of repetitions. ResNet converges very fast and can be trained with hundreds or thousands of layers.
+
+
+
+Figure 10. ResNet model for ImageNet
+
+
+
+## Data Preparation
+
+### Data description and downloading
+
+Commonly used public datasets for image classification are CIFAR (https://www.cs.toronto.edu/~kriz/cifar.html), ImageNet (http://image-net.org/), COCO (http://mscoco.org/), etc. Those for fine-grained image classification include CUB-200-2011 (http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), Stanford Dogs (http://vision.stanford.edu/aditya86/ImageNetDogs/), Oxford Flowers (http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among them, ImageNet is the largest, and most research results are reported on it, as mentioned in the Model Overview section. Since 2010, the ImageNet data has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories: 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images, 50 per category on average.
+
+Since ImageNet is too large to download and train on efficiently, we use CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images. Figure 11 shows all the classes of CIFAR10 as well as 10 images randomly sampled from each.
+
+
+
+Figure 11. CIFAR10 dataset[21]
+
+
+The following command is used for downloading data and calculating the mean image used for data preprocessing.
+
+```bash
+./data/get_data.sh
+```
+
+### Data provider for PaddlePaddle
+
+We use the Python interface to provide data to PaddlePaddle. The file `dataprovider.py` below is a complete example for CIFAR10.
+
+- The `initializer` function initializes the data provider: it loads the mean image and defines the two input types -- image and label.
+
+- The `process` function sends preprocessed data to PaddlePaddle. The preprocessing performed here includes data augmentation by random horizontal flipping, and subtracting the mean image from the raw image.
+
+```python
+import numpy as np
+import cPickle
+from paddle.trainer.PyDataProvider2 import *
+
+def initializer(settings, mean_path, is_train, **kwargs):
+ settings.is_train = is_train
+ settings.input_size = 3 * 32 * 32
+ settings.mean = np.load(mean_path)['mean']
+ settings.input_types = {
+ 'image': dense_vector(settings.input_size),
+ 'label': integer_value(10)
+ }
+
+
+@provider(init_hook=initializer, pool_size=50000)
+def process(settings, file_list):
+ with open(file_list, 'r') as fdata:
+ for fname in fdata:
+ fo = open(fname.strip(), 'rb')
+ batch = cPickle.load(fo)
+ fo.close()
+ images = batch['data']
+ labels = batch['labels']
+ for im, lab in zip(images, labels):
+ if settings.is_train and np.random.randint(2):
+ # randomly flip the image horizontally for data augmentation
+ im = im.reshape(3, 32, 32)
+ im = im[:, :, ::-1]
+ im = im.flatten()
+ # subtract the mean image computed during data preparation
+ im = im - settings.mean
+ yield {
+ 'image': im.astype('float32'),
+ 'label': int(lab)
+ }
+```
+
+## Model Config
+
+### Data Definition
+
+In the model config file, the function `define_py_data_sources2` sets the argument `module` to the dataprovider file from which data is loaded, and `args` to the mean image file. If the config file is used for prediction, the argument `train_list` does not need to be set.
+
+```python
+from paddle.trainer_config_helpers import *
+
+is_predict = get_config_arg("is_predict", bool, False)
+if not is_predict:
+ define_py_data_sources2(
+ train_list='data/train.list',
+ test_list='data/test.list',
+ module='dataprovider',
+ obj='process',
+ args={'mean_path': 'data/mean.meta'})
+```
+
+### Algorithm Settings
+
+In the model config file, the function `settings` specifies the optimization algorithm, batch size, learning rate, momentum, and L2 regularization.
+
+```python
+settings(
+ batch_size=128,
+ learning_rate=0.1 / 128.0,
+ learning_rate_decay_a=0.1,
+ learning_rate_decay_b=50000 * 100,
+ learning_rate_schedule='discexp',
+ learning_method=MomentumOptimizer(0.9),
+ regularization=L2Regularization(0.0005 * 128),)
+```
+
+The learning rate adjustment policy is defined by the variables `learning_rate_decay_a` ($a$), `learning_rate_decay_b` ($b$) and `learning_rate_schedule`. In this example, the discrete exponential method is used for adjusting the learning rate. The formula is as follows:
+$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
+where $n$ is the number of processed samples and $lr_{0}$ is the learning rate set in `settings`.
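+
+For instance, a plain-Python sketch of this schedule:
+
+```python
+def discexp_lr(lr0, a, b, n):
+    # discrete exponential learning rate after n processed samples
+    return lr0 * a ** (n // b)
+
+# with the settings above: discexp_lr(0.1 / 128.0, 0.1, 50000 * 100, n)
+```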
+
+### Model Architecture
+
+Here we provide the config files for the VGG and ResNet models.
+
+#### VGG
+
+First we define the VGG network. Since the image size and the amount of data in CIFAR10 are relatively small compared to ImageNet, we use a small version of the VGG network for CIFAR10. The convolution groups incorporate BN and dropout operations.
+
+1. Define input data and its dimension
+
+ The input to the network is defined as `data_layer`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10.
+
+ ```python
+ datadim = 3 * 32 * 32
+ classdim = 10
+ data = data_layer(name='image', size=datadim)
+ ```
+
+2. Define VGG main module
+
+ ```python
+ net = vgg_bn_drop(data)
+ ```
+ The input to the VGG main module comes from the data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the detailed definition:
+
+ ```python
+ def vgg_bn_drop(input):
+ def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
+ return img_conv_group(
+ input=ipt,
+ num_channels=num_channels_,
+ pool_size=2,
+ pool_stride=2,
+ conv_num_filter=[num_filter] * groups,
+ conv_filter_size=3,
+ conv_act=ReluActivation(),
+ conv_with_batchnorm=True,
+ conv_batchnorm_drop_rate=dropouts,
+ pool_type=MaxPooling())
+
+ conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
+ conv2 = conv_block(conv1, 128, 2, [0.4, 0])
+ conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
+ conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
+ conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
+
+ drop = dropout_layer(input=conv5, dropout_rate=0.5)
+ fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
+ bn = batch_norm_layer(
+ input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
+ fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
+ return fc2
+
+ ```
+
+ 2.1. First we define a convolution block, `conv_block`. The convolution kernel is 3x3 and the pooling window is 2x2 with stride 2. `dropouts` specifies the dropout probability after each convolution. The function `img_conv_group` is defined in `paddle.trainer_config_helpers` and consists of a series of `Conv->BN->ReLU->Dropout` operations and one `Pooling`.
+
+
+ 2.2. Five groups of convolutions follow. The first two groups perform two convolutions each, while the last three groups perform three convolutions each. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for that layer.
+
+
+ 2.3. The last two layers are fully-connected layers of dimension 512.
+
+3. Define Classifier
+
+ The VGG network above extracts high-level features and maps them to a vector whose size equals the number of categories. A softmax classifier then computes the probability of the image belonging to each category.
+
+ ```python
+ out = fc_layer(input=net, size=classdim, act=SoftmaxActivation())
+ ```
+
+4. Define Loss Function and Outputs
+
+ In supervised learning, the labels of the training images are also defined in a `data_layer`. During training, cross-entropy is used as the loss function and is the output of the network; during testing, the outputs are the probabilities computed by the classifier.
+
+ ```python
+ if not is_predict:
+ lbl = data_layer(name="label", size=classdim)
+ cost = classification_cost(input=out, label=lbl)
+ outputs(cost)
+ else:
+ outputs(out)
+ ```
+
+### ResNet
+
+The first, third and fourth steps of a ResNet config are the same as those of VGG. The second step is the main module:
+
+```python
+net = resnet_cifar10(data, depth=56)
+```
+
+Here are some basic functions used in `resnet_cifar10`:
+
+ - `conv_bn_layer` : a convolutional layer followed by BN.
+ - `shortcut` : the shortcut branch in a residual block: a 1x1 convolution when the numbers of input and output channels differ, and a direct connection otherwise.
+ - `basicblock` : a basic residual module as shown on the left of Figure 9, consisting of two sequential 3x3 convolutions and one shortcut branch.
+ - `bottleneck` : a bottleneck module as shown on the right of Figure 9, consisting of two 1x1 convolutions with one 3x3 convolution in between on the main branch, plus a shortcut branch.
+ - `layer_warp` : a group of residual modules made of several stacked blocks. In each group, the stride of the first residual block may differ from the rest, in order to reduce the feature map size along the horizontal and vertical directions.
+
+```python
+def conv_bn_layer(input,
+ ch_out,
+ filter_size,
+ stride,
+ padding,
+ active_type=ReluActivation(),
+ ch_in=None):
+ tmp = img_conv_layer(
+ input=input,
+ filter_size=filter_size,
+ num_channels=ch_in,
+ num_filters=ch_out,
+ stride=stride,
+ padding=padding,
+ act=LinearActivation(),
+ bias_attr=False)
+ return batch_norm_layer(input=tmp, act=active_type)
+
+
+def shortcut(ipt, n_in, n_out, stride):
+ if n_in != n_out:
+ return conv_bn_layer(ipt, n_out, 1, stride, 0, LinearActivation())
+ else:
+ return ipt
+
+def basicblock(ipt, ch_out, stride):
+ ch_in = ipt.num_filters
+ tmp = conv_bn_layer(ipt, ch_out, 3, stride, 1)
+ tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, LinearActivation())
+ short = shortcut(ipt, ch_in, ch_out, stride)
+ return addto_layer(input=[tmp, short], act=ReluActivation())
+
+def bottleneck(ipt, ch_out, stride):
+ ch_in = ipt.num_filters
+ tmp = conv_bn_layer(ipt, ch_out, 1, stride, 0)
+ tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1)
+ tmp = conv_bn_layer(tmp, ch_out * 4, 1, 1, 0, LinearActivation())
+ short = shortcut(ipt, ch_in, ch_out, stride)
+ return addto_layer(input=[tmp, short], act=ReluActivation())
+
+def layer_warp(block_func, ipt, features, count, stride):
+ tmp = block_func(ipt, features, stride)
+ for i in range(1, count):
+ tmp = block_func(tmp, features, 1)
+ return tmp
+
+```
+
+The following are the components of `resnet_cifar10`:
+
+1. The lowest level is `conv_bn_layer`.
+2. The middle level consists of three `layer_warp`, each of which uses the left residual block in Figure 9.
+3. The last level is average pooling layer.
+
+Note: besides the first convolutional layer and the last fully-connected layer, the total number of layers in the three `layer_warp` groups must be divisible by 6; that is, the depth of `resnet_cifar10` must satisfy $(depth - 2) \% 6 == 0$. For example, depth 56 gives $n = 9$ blocks per group, i.e., $3 \times 9 \times 2 = 54$ convolutional layers, plus the first convolutional layer and the final fully-connected layer.
+
+```python
+def resnet_cifar10(ipt, depth=56):
+ # depth should be one of 20, 32, 44, 56, 110, 1202
+ assert (depth - 2) % 6 == 0
+ n = (depth - 2) / 6
+ conv1 = conv_bn_layer(ipt,
+ ch_in=3,
+ ch_out=16,
+ filter_size=3,
+ stride=1,
+ padding=1)
+ res1 = layer_warp(basicblock, conv1, 16, n, 1)
+ res2 = layer_warp(basicblock, res1, 32, n, 2)
+ res3 = layer_warp(basicblock, res2, 64, n, 2)
+ pool = img_pool_layer(input=res3,
+ pool_size=8,
+ stride=1,
+ pool_type=AvgPooling())
+ return pool
+```
+
+## Model Training
+
+We can train the model by running the script train.sh, which specifies the config file, device type, number of threads, number of passes, path to the trained models, etc.:
+
+``` bash
+sh train.sh
+```
+
+Here is an example script `train.sh`:
+
+```bash
+#cfg=models/resnet.py
+cfg=models/vgg.py
+output=output
+log=train.log
+
+paddle train \
+ --config=$cfg \
+ --use_gpu=true \
+ --trainer_count=1 \
+ --log_period=100 \
+ --num_passes=300 \
+ --save_dir=$output \
+ 2>&1 | tee $log
+```
+
+- `--config=$cfg` : specifies config file. The default is `models/vgg.py`.
+- `--use_gpu=true` : uses the GPU for training. Set it to false for CPU training.
+- `--trainer_count=1` : specifies the number of threads or GPUs.
+- `--log_period=100` : specifies the number of batches between two logs.
+- `--save_dir=$output` : specifies the path for saving trained models.
+
+Here is an example log after training for one pass. The average error rate is 0.79958 on the training set and 0.7858 on the test set.
+
+```text
+TrainerInternal.cpp:165] Batch=300 samples=38400 AvgCost=2.07708 CurrentCost=1.96158 Eval: classification_error_evaluator=0.81151 CurrentEval: classification_error_evaluator=0.789297
+TrainerInternal.cpp:181] Pass=0 Batch=391 samples=50000 AvgCost=2.03348 Eval: classification_error_evaluator=0.79958
+Tester.cpp:115] Test samples=10000 cost=1.99246 Eval: classification_error_evaluator=0.7858
+```
+
+Figure 12 shows the curve of the training error rate, which indicates that the model converges at around Pass 200 with an error rate of 8.54%.
+
+
+
+Figure 12. The error rate of VGG model on CIFAR10
+
+
+## Model Application
+
+After training is done, the model from each pass is saved in `output/pass-%05d`. For example, the model of Pass 300 is saved in `output/pass-00299`. The script `classify.py` can be used to extract features and classify an image. Its default config file is `models/vgg.py`.
+
+
+### Prediction
+
+We can run the following script to predict the category of an image. The default device is the GPU; to use the CPU, add the `-c` flag.
+
+```bash
+python classify.py --job=predict --model=output/pass-00299 --data=image/dog.png # -c
+```
+
+Here is the result:
+
+```text
+Label of image/dog.png is: 5
+```
+
+### Feature Extraction
+
+We can run the following command to extract features from an image. Here `job` should be `extract`, and the default layer is the first convolutional layer. Figure 13 shows the 64 feature maps output by the first convolutional layer of the VGG model.
+
+```bash
+python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png # -c
+```
+
+
+
+Figure 13. Visualization of convolutional layer feature maps
+
+
+## Conclusion
+
+Traditional image classification methods involve multiple stages of processing, and their frameworks are very complicated. In contrast, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduced three models -- VGG, GoogleNet, and ResNet -- provided PaddlePaddle config files for training VGG and ResNet on CIFAR10, and explained how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedures for configuration and training are the same, and you are welcome to give it a try.
+
+
+## Reference
+
+[1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004.
+
+[2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
+
+[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
+
+[4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
+
+[5] B. Olshausen, D. Field, [Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?](http://redwood.psych.cornell.edu/papers/olshausen_field_1997.pdf), Vision Research, vol. 37, pp. 3311-3325, 1997.
+
+[6] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. (2010). [Locality-constrained Linear Coding for image classification](http://ieeexplore.ieee.org/abstract/document/5540018/). In CVPR.
+
+[7] Perronnin, F., Sánchez, J., & Mensink, T. (2010). [Improving the fisher kernel for large-scale image classification](http://dl.acm.org/citation.cfm?id=1888101). In ECCV (4).
+
+[8] Lin, Y., Lv, F., Cao, L., Zhu, S., Yang, M., Cour, T., Yu, K., and Huang, T. (2011). [Large-scale image classification: Fast feature extraction and SVM training](http://ieeexplore.ieee.org/document/5995477/). In CVPR.
+
+[9] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). [ImageNet classification with deep convolutional neural networks](http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf). In NIPS.
+
+[10] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). arXiv preprint arXiv:1207.0580, 2012.
+
+[11] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. [Return of the Devil in the Details: Delving Deep into Convolutional Nets](https://arxiv.org/abs/1405.3531). BMVC, 2014.
+
+[12] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., [Going deeper with convolutions](https://arxiv.org/abs/1409.4842). In: CVPR. (2015)
+
+[13] Lin, M., Chen, Q., and Yan, S. [Network in network](https://arxiv.org/abs/1312.4400). In Proc. ICLR, 2014.
+
+[14] S. Ioffe and C. Szegedy. [Batch normalization: Accelerating deep network training by reducing internal covariate shift](https://arxiv.org/abs/1502.03167). In ICML, 2015.
+
+[15] K. He, X. Zhang, S. Ren, J. Sun. [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385). CVPR 2016.
+
+[16] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. [Rethinking the inception architecture for computer vision](https://arxiv.org/abs/1512.00567). In: CVPR. (2016).
+
+[17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016).
+
+[18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective](http://link.springer.com/article/10.1007/s11263-014-0733-5). International Journal of Computer Vision, 111(1), 98-136, 2015.
+
+[19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015.
+
+[20] http://deeplearning.net/tutorial/lenet.html
+
+[21] https://www.cs.toronto.edu/~kriz/cifar.html
+
+[22] http://cs231n.github.io/classification/
+
+
+ This tutorial was created by PaddlePaddle and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+# Semantic Role Labeling
+
+The source code of this chapter is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/label_semantic_roles).
+
+## Background
+
+Natural Language Analysis contains three components: Lexical Analysis, Syntactic Analysis, and Semantic Analysis. Semantic Role Labeling (SRL) is one form of Shallow Semantic Analysis. The predicate of a sentence expresses a property that a subject has or is characterized by -- such as what it does, what it is, or how it is -- and mostly corresponds to the core of an event. A noun associated with a predicate is called an Argument. Semantic roles express the abstract roles that arguments of a predicate can take in the event, such as Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal, Source, etc.
+
+In the following example, “遇到” (met) is the Predicate (“Pred”), “小明” (Xiao Ming) is the Agent, “小红” (Xiao Hong) is the Patient, “昨天” (yesterday) is the Time when the event occurred, and “公园” (park) is the Location where the event occurred.
+
+$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mbox{Time}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$
+
+Instead of analyzing semantic information in depth, the goal of Semantic Role Labeling is to identify the relation between the predicate and the other constituents, e.g., the predicate-argument structure, in terms of specific semantic roles. It is an important intermediate step in a wide range of natural language understanding tasks (Information Extraction, Discourse Analysis, DeepQA, etc.). The predicates are assumed to be given; the task is only to identify the arguments and their semantic roles.
+
+Standard SRL systems are mostly built on top of syntactic analysis and contain five steps:
+
+1. Construct a syntactic parse tree, as shown in Fig. 1.
+2. Identify the candidate arguments of the given predicate from the parse tree.
+3. Prune the most unlikely candidate arguments.
+4. Identify the arguments, usually treated as a binary classification problem.
+5. Label the arguments with their multi-class semantic roles. Steps 2-3 usually rely on hand-designed features based on the syntactic analysis of step 1.
+
+
+
+
+Fig 1. Syntactic parse tree
+
+
+In the figure: 核心关系 (head relation) -> HED; 定中关系 (attributive) -> ATT; 主谓关系 (subject-verb) -> SBV; 状中结构 (adverbial) -> ADV; 介宾关系 (preposition-object) -> POB; 右附加关系 (right adjunct) -> RAD; 动宾关系 (verb-object) -> VOB; 标点 (punctuation) -> WP
+
+
+However, a complete syntactic analysis needs to identify the relations among all constituents, and the performance of SRL is sensitive to the precision of the syntactic analysis, which makes SRL a very challenging task. To reduce this complexity while still obtaining some syntactic structure information, shallow syntactic analysis was proposed. Shallow syntactic analysis, also called partial parsing or chunking, does not require constructing a complete parse tree; it only identifies some independent constituents with relatively simple structure, such as verb phrases (chunks). To avoid depending on highly accurate syntactic trees, some work \[[1](#Reference)\] proposed semantic-chunking-based SRL methods, which convert SRL into a sequence tagging problem. Sequence tagging usually adopts the BIO representation: for the words forming a chunk of type A, the first word receives the tag B-A (Begin), the remaining words receive the tag I-A (Inside), and words outside any chunk receive the tag O (Outside).
+
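+As a tiny sketch of this scheme (using a hypothetical chunk format, not the dataset format):
+
+```python
+def to_bio(chunks):
+    # chunks: list of (words, type) pairs; type None marks words outside any chunk
+    tags = []
+    for words, typ in chunks:
+        if typ is None:
+            tags += ['O'] * len(words)
+        else:
+            tags += ['B-' + typ] + ['I-' + typ] * (len(words) - 1)
+    return tags
+
+# to_bio([(['in', 'the', 'park'], 'AM-LOC')]) -> ['B-AM-LOC', 'I-AM-LOC', 'I-AM-LOC']
+```
+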
+The BIO representation of the above example is shown in Fig. 2.
+
+
+
+Fig 2. BIO represention
+
+
+In the figure: 输入序列 -> input sequence; 语块 -> chunk; 标注序列 -> label sequence; 角色 -> role
+
+This example illustrates the simplicity of sequence tagging: (1) shallow syntactic analysis lowers the precision requirement on syntactic analysis; (2) pruning candidate arguments is no longer needed; (3) argument identification and labeling are performed at the same time. This unified approach simplifies the procedure, reduces the risk of accumulating errors, and often boosts performance further.
+
+In this tutorial, our SRL system is built as an end-to-end neural network. It takes only text sequences as input, without using any syntactic parsing results or complex hand-designed features. We use the public dataset of the [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) to illustrate the task: given a sentence and its predicates, identify the corresponding arguments and their semantic roles by sequence tagging.
+
+## Model
+
+Recurrent Neural Networks (RNNs) are important tools for sequence modeling and have been used successfully in several natural language processing tasks. Unlike feed-forward neural networks, RNNs can model dependencies between elements of a sequence. LSTMs, a variant of RNNs, are designed to model long-term dependencies in long sequences. We introduced them in [understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment). In this chapter, we continue to use LSTMs to solve SRL problems.
+
+### Stacked Recurrent Neural Network
+
+Deep neural networks extract hierarchical representations: higher layers form more abstract and complex representations on top of lower layers. An LSTM unfolded in time is deep, because a computational path from the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. However, the computation carried out at each single time-step is only a linear transformation, which makes LSTMs a relatively shallow model. Deep LSTMs are typically constructed by stacking multiple LSTM layers on top of each other, taking the output of a lower LSTM layer at time $t$ as the input of the upper LSTM layer at time $t$. Deep, hierarchical neural networks can be much more efficient at representing some functions and at modeling varying-length dependencies \[[2](#Reference)\].
+
+
+However, deep LSTMs increase the number of nonlinear steps the gradient has to traverse when propagated back in depth. For example, a 4-layer LSTM can be trained properly, but performance degrades as the number of layers increases to 4-8. Conventional LSTMs prevent backpropagated errors from vanishing or exploding along the time dimension by means of shortcut connections that skip the intermediate nonlinear steps; deep LSTMs can adopt shortcut connections along the depth dimension as well.
+
+
+The operation of a single LSTM cell contains three parts: (1) input-to-hidden: map the input $x$ to the inputs of the forget gate, input gate, memory cell, and output gate by a linear transformation (i.e., matrix mapping); (2) hidden-to-hidden: compute the forget gate, input gate, and output gate, and update the memory cell -- this is the main part of the LSTM; (3) hidden-to-output: typically an activation operation on the hidden state. On top of the stacked LSTMs described above, we add a shortcut connection: the input-to-hidden result of the previous layer is taken as a new input, with another learned linear transformation.
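+
+For reference, one standard formulation of these computations is the following (notation may differ slightly across implementations):
+
+$$i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)$$
+$$f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)$$
+$$o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)$$
+$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)$$
+$$h_t = o_t \odot \tanh(c_t)$$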
+
+Fig. 3 illustrates the final stacked recurrent neural network.
+
+
+
+Fig 3. Stacked Recurrent Neural Networks
+
+
+In the figure: 线性变换 -> linear transformation; 输入层到隐层 -> input-to-hidden
+
+### Bidirectional Recurrent Neural Network
+
+ LSTMs can summarize the history of inputs seen up to the current time-step, but they cannot see the future. In most natural language processing tasks, however, the entire sentence is available. Sequence learning can therefore be more effective if future context is encoded as well as history.
+
+To address this drawback, we can design a bidirectional recurrent network with a minor modification: each higher LSTM layer processes the sequence in the direction opposite to that of the layer below, i.e., the stacked LSTM layers operate left-to-right, right-to-left, left-to-right, ..., along the depth. Therefore, from the second layer on, the LSTM at time-step $t$ can see both the history and the future. Fig. 4 illustrates this bidirectional recurrent network.
+
+
+
+
+Fig 4. Bidirectional LSTMs
+
+
+In the figure: 线性变换 -> linear transformation; 输入层到隐层 -> input-to-hidden; 正向处理输出序列 -> process the output sequence in the forward direction; 反向处理上一层序列 -> process the previous layer's sequence in the backward direction
+
+Note that this bidirectional RNN is different from the one proposed by Bengio et al. for machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce that variant in the chapter on [machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md).
+
+### Conditional Random Field
+
+The basic pipeline for solving problems with neural networks is: 1) all lower layers learn representations; 2) the top layer solves the final task. In SRL tasks, a CRF is built on top of the network for the final tag sequence prediction, taking the representations provided by the last LSTM layer as input.
+
+
+A CRF is an undirected probabilistic graphical model with nodes denoting random variables and edges denoting dependencies between them. For simplicity, CRFs model the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ is the input sequence and $Y = (y_1, y_2, ... , y_n)$ is the label sequence. Decoding searches for the sequence $Y$ that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$.
+
+Sequence tagging tasks only treat the input and output as linear sequences, with no extra dependency assumptions on the graphical model. The graphical model is thus a simple chain, which yields a linear chain Conditional Random Field, shown in Fig. 5.
+
+
+
+Fig 5. Linear Chain Conditional Random Field used in SRL tasks
+
+
+By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
+
+$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
+
+
+where $Z(X)$ is a normalization constant; $t_j$ is a feature function defined on edges, called a transition feature, which depends on $y_i$ and $y_{i-1}$ and represents the transition from $y_{i-1}$ to $y_i$ given the input sequence $X$; $s_k$ is a feature function defined on nodes, called a state feature, which depends on $y_i$ and represents the probability of $y_i$ given the input sequence $X$; and $\lambda_j$ and $\mu_k$ are the weights of $t_j$ and $s_k$ respectively. In fact, $t$ and $s$ can be written in the same form and summed over all positions $i$: $f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$, where $f$ is called a feature function. Thus, $P(Y|X)$ can be written as:
+
+$$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$
+
+where $\omega$ are the feature-function weights that the CRF model learns. At training time, given input sequences and label sequences $D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$, the weights are estimated by maximum likelihood, i.e., by minimizing the following regularized negative log-likelihood:
+
+
+$$L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
+
+
+This objective function can be optimized via back-propagation in an end-to-end manner. At decoding time, given an input sequence $X$, we search for the sequence $\bar{Y}$ that maximizes the conditional probability $P(Y|X)$ using a decoding method such as Viterbi or beam search.
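+
+To make the decoding step concrete, below is a minimal NumPy sketch of Viterbi decoding for a linear chain CRF, assuming dense arrays for the state (node) scores and the transition (edge) scores; it is for illustration only, since PaddlePaddle's `crf_decoding_layer` performs this search internally.
+
+```python
+import numpy as np
+
+def viterbi_decode(unary, trans):
+    """unary: [T, L] state scores for T steps and L tags;
+    trans: [L, L] transition scores (trans[i, j] scores tag i -> tag j).
+    Returns the highest-scoring tag sequence Y* = argmax_Y P(Y|X)."""
+    T, L = unary.shape
+    score = unary[0].copy()               # best score of a path ending in each tag
+    backptr = np.zeros((T, L), dtype=int)
+    for t in range(1, T):
+        # candidate[i, j]: best path with tag i at step t-1 and tag j at step t
+        candidate = score[:, None] + trans + unary[t][None, :]
+        backptr[t] = candidate.argmax(axis=0)
+        score = candidate.max(axis=0)
+    best = [int(score.argmax())]          # best final tag
+    for t in range(T - 1, 0, -1):         # follow the back-pointers
+        best.append(int(backptr[t, best[-1]]))
+    return best[::-1]
+```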
+
+### DB-LSTM SRL model
+
+Given a sentence and its predicates, an SRL task aims to identify the arguments of each given predicate and their semantic roles. If a sentence contains $n$ predicates, it is processed $n$ times. A first model could be as follows:
+
+1. Construct the inputs;
+ - input 1: predicate, input 2: sentence
+ - expand input 1 into a sequence of the same length as input 2, in one-hot representation;
+2. Convert the one-hot sequences from step 1 into real-vector sequences via a lookup table;
+3. Learn the representation of the input by feeding the real-vector sequences from step 2 into the network;
+4. Take the representations from step 3 as input and the label sequence as the supervision signal, and perform sequence tagging
+
+We could use the method above as-is. Here, we propose some modifications by introducing two simple but effective features:
+
+- predicate context (ctx-p): A single predicate word cannot fully describe the predicate, especially when the same word appears more than once in a sentence. With an expanded context, this ambiguity can be largely eliminated. We therefore extract the $n$ words before and after the predicate to construct a context window.
+
+- region mark ($m_r$): a binary feature with $m_r = 1$ if the word lies inside the predicate context region, and $m_r = 0$ otherwise (see the sketch below).
+
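+As a concrete illustration of these two features, the following sketch (plain Python; the function name is hypothetical, since the actual extraction in this tutorial is done by `extract_dict_feature.py`) builds the ctx-p window and the region marks for one predicate:
+
+```python
+def predicate_features(words, pred_idx, n=2):
+    """words: list of tokens; pred_idx: position of the predicate.
+    Returns the predicate context window and the region mark for each word."""
+    lo, hi = max(0, pred_idx - n), min(len(words), pred_idx + n + 1)
+    ctx_p = words[lo:hi]  # window of up to 2n+1 words, clipped at sentence edges
+    m_r = [1 if lo <= i < hi else 0 for i in range(len(words))]
+    return ctx_p, m_r
+
+# predicate_features(['A', 'record', 'date', 'has', "n't", 'been', 'set', '.'], 6)
+# -> ctx_p = ["n't", 'been', 'set', '.'], m_r = [0, 0, 0, 0, 1, 1, 1, 1]
+```
+
+This matches the training sample shown in the data preparation section below, where only the words inside the window around the predicate "set" carry the region mark 1.
+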
+After modification, the model is as follows:
+
+1. Construct the inputs
+ - input 1: sentence; input 2: predicate sequence; input 3: predicate context, i.e., the $n$ words before and after the predicate, in one-hot representation; input 4: region mark, which annotates whether each word lies in the predicate context region
+ - expand inputs 2~3 into sequences of the same length as input 1
+2. Convert inputs 1~4 into real-vector sequences via lookup tables; inputs 1 and 3 share the same lookup table, while inputs 2 and 4 each have their own
+3. Feed the four real-vector sequences from step 2 into bidirectional LSTMs and train the LSTMs to update the representations
+4. Take the representation from step 3 as the input of a CRF and the label sequence as the supervision signal, and perform sequence tagging
+
+
+
+
+Fig 6. DB-LSTM for SRL tasks
+
+
+论元-> argu
+谓词-> pred
+谓词上下文-> ctx-p
+谓词上下文区域标记-> $m_r$
+输入-> input
+原句-> sentence
+反向LSTM-> LSTM Reverse
+
+## Data Preparation
+### Data Introduction and Download
+
+In this tutorial, we use the dataset released by the [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL shared task as an example. Running `sh ./get_data.sh` automatically downloads the raw data from the official website. Note that the training and development sets of the CoNLL 2005 SRL task were not made freely available after the competition; currently, only the test set can be obtained, comprising section 23 of the Wall Street Journal corpus and section 3 of the Brown corpus. In this tutorial, we use the WSJ portion of the test set as training data to explain the model. However, since the number of samples in the test set is far from sufficient, please consider purchasing the full dataset if you want to train a usable neural-network SRL system.
+
+Besides the text itself, the raw data also contains part-of-speech tags, named entities, syntactic parse trees and other information. In this tutorial, we use the data under the test.wsj folder for training and testing, and only the words folder (text sequences) and the props folder (labeled results) are used. The data directories used in this tutorial are as follows:
+
+```text
+conll05st-release/
+└── test.wsj
+ ├── props # labeled results
+ └── words # input text sequences
+```
+
+The labels come from the annotations of the Penn TreeBank \[[7](#Reference)\] and PropBank \[[8](#Reference)\]. The PropBank labels differ from those used in the example at the beginning of this article, but the principle is the same; for an explanation of the label semantics, please refer to \[[9](#Reference)\].
+
+Besides the data, `get_data.sh` also downloads the following resources:
+
+| File | Description |
+|---|---|
+| word_dict | dictionary of the input sentences, 44068 words in total |
+| label_dict | dictionary of the labels, 106 labels in total |
+| predicate_dict | dictionary of the predicates, 3162 words in total |
+| emb | a pre-trained word embedding table, 32-dimensional |
+
+We trained a language model on English Wikipedia to obtain word embeddings for initializing the SRL model; these embeddings are not updated during SRL training. For more on language models and word embeddings, please refer to the [Word Embedding](https://github.com/PaddlePaddle/book/blob/develop/word2vec/README.md) tutorial. The corpus used to train the language model has 995,000,000 tokens, and the dictionary size is limited to 4,900,000 words. About 5% of the words in the CoNLL 2005 training corpus are not among these 4,900,000 words; we treat them all as out-of-vocabulary words and represent them with `<unk>`.
+
+### Data Preprocessing
+After downloading the data, the script calls two sub-scripts, `extract_pair.py` and `extract_dict_feature.py`, for preprocessing. The former performs step 1 below; the latter performs steps 2~4:
+
+1. Merge the text sequence and the label sequence into a single record;
+2. If a sentence contains $n$ predicates, it is processed $n$ times into $n$ independent training samples, each with a different predicate;
+3. Extract the predicate context and construct the predicate-context region mark;
+4. Construct the labels in BIO format (sketched below);
+
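+The BIO construction in step 4 amounts to the following (a sketch; `to_bio` and the span format are hypothetical, as the actual conversion is done by `extract_dict_feature.py`):
+
+```python
+def to_bio(length, spans):
+    """spans: list of (start, end, role) argument spans, end exclusive.
+    Returns one BIO tag per word."""
+    tags = ['O'] * length
+    for start, end, role in spans:
+        tags[start] = 'B-' + role        # first word of the span
+        for i in range(start + 1, end):
+            tags[i] = 'I-' + role        # remaining words of the span
+    return tags
+
+# to_bio(8, [(0, 3, 'A1'), (4, 5, 'AM-NEG'), (6, 7, 'V')])
+# -> ['B-A1', 'I-A1', 'I-A1', 'O', 'B-AM-NEG', 'O', 'B-V', 'O']
+```
+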
+The file `data/feature` is the processed model input. Each line is one training sample with 9 columns separated by "\t": word sequence, predicate, predicate context (5 columns), predicate-context region mark, and label sequence. The table below shows one training sample.
+
+| Word sequence | Predicate | Predicate context (window = 5) | Region mark | Label sequence |
+|---|---|---|---|---|
+| A | set | n't been set . × | 0 | B-A1 |
+| record | set | n't been set . × | 0 | I-A1 |
+| date | set | n't been set . × | 0 | I-A1 |
+| has | set | n't been set . × | 0 | O |
+| n't | set | n't been set . × | 1 | B-AM-NEG |
+| been | set | n't been set . × | 1 | O |
+| set | set | n't been set . × | 1 | B-V |
+| . | set | n't been set . × | 1 | O |
+
+### Providing Data to PaddlePaddle
+1. Use the `hook` function to define the formats of the input fields for PaddlePaddle.
+
+    ```python
+    def hook(settings, word_dict, label_dict, predicate_dict, **kwargs):
+        settings.word_dict = word_dict            # dictionary of the word sequences
+        settings.label_dict = label_dict          # dictionary of the label sequences
+        settings.predicate_dict = predicate_dict  # dictionary of the predicates
+
+        # All input features are one-hot sequences, represented in PaddlePaddle
+        # as the integer_value_sequence type.
+        # input_types is a dictionary with one entry per data_layer in the config;
+        # each key is exactly the name of a data_layer.
+
+        settings.input_types = {
+            'word_data': integer_value_sequence(len(word_dict)),       # word sequence
+            'ctx_n2_data': integer_value_sequence(len(word_dict)),     # 1st word of the predicate context
+            'ctx_n1_data': integer_value_sequence(len(word_dict)),     # 2nd word of the predicate context
+            'ctx_0_data': integer_value_sequence(len(word_dict)),      # 3rd word of the predicate context
+            'ctx_p1_data': integer_value_sequence(len(word_dict)),     # 4th word of the predicate context
+            'ctx_p2_data': integer_value_sequence(len(word_dict)),     # 5th word of the predicate context
+            'verb_data': integer_value_sequence(len(predicate_dict)),  # predicate
+            'mark_data': integer_value_sequence(2),                    # predicate-context region mark
+            'target': integer_value_sequence(len(label_dict))          # label sequence
+        }
+    ```
+
+2. Use the `process` function to feed samples to PaddlePaddle one by one; we only need to decide how to return a single training sample from the raw data file.
+
+    ```python
+    def process(settings, file_name):
+        with open(file_name, 'r') as fdata:
+            for line in fdata:
+                sentence, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark, label = \
+                    line.strip().split('\t')
+
+                # the sentence text
+                words = sentence.split()
+                sen_len = len(words)
+                word_slot = [settings.word_dict.get(w, UNK_IDX) for w in words]
+
+                # a single predicate, expanded here into a sequence as long as the sentence
+                predicate_slot = [settings.predicate_dict.get(predicate)] * sen_len
+
+                # In this tutorial we use a predicate context window of 5:
+                # the predicate plus the two words before and after it.
+                # Each word of the window is expanded into a sequence as long as the sentence.
+                ctx_n2_slot = [settings.word_dict.get(ctx_n2, UNK_IDX)] * sen_len
+                ctx_n1_slot = [settings.word_dict.get(ctx_n1, UNK_IDX)] * sen_len
+                ctx_0_slot = [settings.word_dict.get(ctx_0, UNK_IDX)] * sen_len
+                ctx_p1_slot = [settings.word_dict.get(ctx_p1, UNK_IDX)] * sen_len
+                ctx_p2_slot = [settings.word_dict.get(ctx_p2, UNK_IDX)] * sen_len
+
+                # the predicate-context region mark, a binary feature
+                marks = mark.split()
+                mark_slot = [int(w) for w in marks]
+
+                label_list = label.split()
+                label_slot = [settings.label_dict.get(w) for w in label_list]
+                yield {
+                    'word_data': word_slot,
+                    'ctx_n2_data': ctx_n2_slot,
+                    'ctx_n1_data': ctx_n1_slot,
+                    'ctx_0_data': ctx_0_slot,
+                    'ctx_p1_data': ctx_p1_slot,
+                    'ctx_p2_data': ctx_p2_slot,
+                    'verb_data': predicate_slot,
+                    'mark_data': mark_slot,
+                    'target': label_slot
+                }
+    ```
+
+## Model Configuration
+
+### Data Definition
+
+First, the data is read from the data provider via `define_py_data_sources2`. The config file reads three dictionaries: the dictionary of the input word sequences, the dictionary of the labels, and the dictionary of the predicates, and passes them to the data provider, which uses them to convert the corresponding text inputs into one-hot sequences.
+
+```python
+define_py_data_sources2(
+    train_list=train_list_file,
+    test_list=test_list_file,
+    module='dataprovider',
+    obj='process',
+    args={
+        'word_dict': word_dict,            # dictionary of the input word sequences
+        'label_dict': label_dict,          # dictionary of the labels
+        'predicate_dict': predicate_dict   # dictionary of the predicates
+    }
+)
+```
+### Algorithm Configuration
+
+Here we specify the training hyperparameters: $L_2$ regularization, the learning rate and the batch size, and we use stochastic gradient descent with momentum as the optimization algorithm.
+
+```python
+settings(
+    batch_size=150,
+    learning_method=MomentumOptimizer(momentum=0),
+    learning_rate=2e-2,
+    regularization=L2Regularization(8e-4),
+    model_average=ModelAverage(average_window=0.5, max_average_window=10000)
+)
+```
+
+### Model Structure
+
+1. Define the input data dimensions and the model hyperparameters.
+
+    ```python
+    mark_dict_len = 2    # dimension of the region mark; a binary 0/1 feature, hence dimension 2
+    word_dim = 32        # dimension of the word embeddings
+    mark_dim = 5         # the region mark is mapped to a real vector via a lookup table; this is its dimension
+    hidden_dim = 512     # dimension of the LSTM hidden vector: 512 / 4
+    depth = 8            # depth of the stacked LSTM
+
+    word = data_layer(name='word_data', size=word_dict_len)
+    predicate = data_layer(name='verb_data', size=pred_len)
+
+    ctx_n2 = data_layer(name='ctx_n2_data', size=word_dict_len)
+    ctx_n1 = data_layer(name='ctx_n1_data', size=word_dict_len)
+    ctx_0 = data_layer(name='ctx_0_data', size=word_dict_len)
+    ctx_p1 = data_layer(name='ctx_p1_data', size=word_dict_len)
+    ctx_p2 = data_layer(name='ctx_p2_data', size=word_dict_len)
+    mark = data_layer(name='mark_data', size=mark_dict_len)
+
+    if not is_predict:
+        target = data_layer(name='target', size=label_dict_len)  # the label sequence is only defined for training and testing
+    ```
+Note that `hidden_dim = 512` actually specifies the dimension of the LSTM hidden vector as 128 (512 / 4); for details, please refer to the description of [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) in the official PaddlePaddle documentation.
+
+2. Map the word sequence, the predicate, the predicate context and the region mark to real-vector embedding sequences via lookup tables.
+
+    ```python
+
+    # In this tutorial we load pre-trained word embeddings and set is_static=True,
+    # which keeps the embedding table fixed while training the SRL model
+    emb_para = ParameterAttribute(name='emb', initial_std=0., is_static=True)
+
+    word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
+    emb_layers = [
+        embedding_layer(
+            size=word_dim, input=x, param_attr=emb_para) for x in word_input
+    ]
+    # the predicate has its own, trainable lookup table
+    # (definition restored here; the full configuration is in db_lstm.py)
+    predicate_embedding = embedding_layer(size=word_dim, input=predicate)
+    emb_layers.append(predicate_embedding)
+    mark_embedding = embedding_layer(
+        name='word_ctx-in_embedding', size=mark_dim, input=mark, param_attr=std_0)
+    emb_layers.append(mark_embedding)
+    ```
+
+3. Eight LSTM units learn over the input sequences in alternating forward/backward order.
+
+ ```python
+    # parameters specified by std_0 are initialized from a zero-mean Gaussian; std_0 initializes the LSTM biases
+ std_0 = ParameterAttribute(initial_std=0.)
+
+ hidden_0 = mixed_layer(
+ name='hidden0',
+ size=hidden_dim,
+ bias_attr=std_default,
+ input=[
+ full_matrix_projection(
+ input=emb, param_attr=std_default) for emb in emb_layers
+ ])
+ lstm_0 = lstmemory(
+ name='lstm0',
+ input=hidden_0,
+ act=ReluActivation(),
+ gate_act=SigmoidActivation(),
+ state_act=SigmoidActivation(),
+ bias_attr=std_0,
+ param_attr=lstm_para_attr)
+ input_tmp = [hidden_0, lstm_0]
+
+ for i in range(1, depth):
+ mix_hidden = mixed_layer(
+ name='hidden' + str(i),
+ size=hidden_dim,
+ bias_attr=std_default,
+ input=[
+ full_matrix_projection(
+ input=input_tmp[0], param_attr=hidden_para_attr),
+ full_matrix_projection(
+ input=input_tmp[1], param_attr=lstm_para_attr)
+ ])
+ lstm = lstmemory(
+ name='lstm' + str(i),
+ input=mix_hidden,
+ act=ReluActivation(),
+ gate_act=SigmoidActivation(),
+ state_act=SigmoidActivation(),
+ reverse=((i % 2) == 1),
+ bias_attr=std_0,
+ param_attr=lstm_para_attr)
+
+ input_tmp = [mix_hidden, lstm]
+ ```
+
+4. Take the output of the topmost stacked LSTM together with the input-to-hidden projection feeding that LSTM unit, and map them through a fully connected layer to the dimension of the label dictionary, giving the final feature representation.
+
+ ```python
+ feature_out = mixed_layer(
+ name='output',
+ size=label_dict_len,
+ bias_attr=std_default,
+ input=[
+ full_matrix_projection(
+ input=input_tmp[0], param_attr=hidden_para_attr),
+ full_matrix_projection(
+ input=input_tmp[1], param_attr=lstm_para_attr)
+ ], )
+ ```
+
+5. A CRF layer sits at the end of the network to perform the sequence labeling.
+
+ ```python
+ crf_l = crf_layer(
+ name='crf',
+ size=label_dict_len,
+ input=feature_out,
+ label=target,
+ param_attr=ParameterAttribute(
+ name='crfw', initial_std=default_std, learning_rate=mix_hidden_lr))
+ ```
+
+## Model Training
+Execute `sh train.sh` to train the model; the script trains for 150 passes in total.
+
+```bash
+paddle train \
+ --config=./db_lstm.py \
+ --save_dir=./output \
+ --trainer_count=1 \
+ --dot_period=500 \
+ --log_period=10 \
+ --num_passes=200 \
+ --use_gpu=false \
+ --show_parameter_stats_period=10 \
+ --test_all_data_in_one_period=1 \
+2>&1 | tee 'train.log'
+```
+
+A sample of the training log follows.
+
+```text
+I1224 18:11:53.661479 1433 TrainerInternal.cpp:165] Batch=880 samples=145305 AvgCost=2.11541 CurrentCost=1.8645 Eval: __sum_evaluator_0__=0.607942 CurrentEval: __sum_evaluator_0__=0.59322
+I1224 18:11:55.254021 1433 TrainerInternal.cpp:165] Batch=885 samples=146134 AvgCost=2.11408 CurrentCost=1.88156 Eval: __sum_evaluator_0__=0.607299 CurrentEval: __sum_evaluator_0__=0.494572
+I1224 18:11:56.867604 1433 TrainerInternal.cpp:165] Batch=890 samples=146987 AvgCost=2.11277 CurrentCost=1.88839 Eval: __sum_evaluator_0__=0.607203 CurrentEval: __sum_evaluator_0__=0.590856
+I1224 18:11:58.424069 1433 TrainerInternal.cpp:165] Batch=895 samples=147793 AvgCost=2.11129 CurrentCost=1.84247 Eval: __sum_evaluator_0__=0.607099 CurrentEval: __sum_evaluator_0__=0.588089
+I1224 18:12:00.006893 1433 TrainerInternal.cpp:165] Batch=900 samples=148611 AvgCost=2.11148 CurrentCost=2.14526 Eval: __sum_evaluator_0__=0.607882 CurrentEval: __sum_evaluator_0__=0.749389
+I1224 18:12:00.164089 1433 TrainerInternal.cpp:181] Pass=0 Batch=901 samples=148647 AvgCost=2.11195 Eval: __sum_evaluator_0__=0.60793
+```
+After 150 passes, the average error is about 0.0516055.
+
+## Model Application
+
+Training for $N$ passes produces $N$ models, from which we need to select the best one for prediction. The usual practice is to tune on a development set and pick the model that optimizes the performance metric of interest. The `predict.sh` script in this tutorial simply selects the pass with the fewest labeling errors on the test set (here, pass-00100) for prediction.
+
+For prediction, we need to remove `crf_layer` from the config and replace it with `crf_decoding_layer`, as follows:
+
+```python
+crf_dec_l = crf_decoding_layer(
+ name='crf_dec_l',
+ size=label_dict_len,
+ input=feature_out,
+ param_attr=ParameterAttribute(name='crfw'))
+```
+
+Run `python predict.py` to perform prediction with the specified model.
+
+```bash
+python predict.py
+     -c db_lstm.py          # config file
+     -w output/pass-00100   # path to the model used for prediction
+     -l data/targetDict.txt # dictionary of the labels
+     -p data/verbDict.txt   # dictionary of the predicates
+     -d data/wordDict.txt   # dictionary of the input word sequences
+     -i data/feature        # path to the input data
+     -o predict.res         # path of the output file for the labeling results
+```
+
+After prediction finishes, the result file specified by the -o option contains output in the following format: each line is one sample with 2 columns separated by "\t"; the first column is the input text and the second column is the labeling result. The semantic role label of each argument can be read directly from the BIO tags.
+
+```text
+The interest-only securities were priced at 35 1\/2 to yield 10.72 % . B-A0 I-A0 I-A0 O O O O O O B-V B-A1 I-A1 O
+```
+
+## Conclusion
+
+Semantic role labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we used SRL as an example to introduce how to perform sequence tagging with PaddlePaddle. The models presented are from our published paper \[[10](#Reference)\]. We only used the test data for illustration, since the training data of the CoNLL 2005 dataset is not fully public. We aimed at an end-to-end neural network model with fewer dependencies on natural language processing tools that is comparable to, or even better than, traditional models. Please see our paper for more information and discussion.
+
+## Reference
+1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
+2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
+3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
+4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014.
+5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](http://www.jmlr.org/papers/volume15/doppa14a/source/biblio.bib.old)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
+6. Li Hang. 统计学习方法 (Statistical Learning Methods)[M]. Tsinghua University Press, Beijing, 2012.
+7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330.
+8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106.
+9. Carreras X, Màrquez L. [Introduction to the CoNLL-2005 shared task: Semantic role labeling](http://www.cs.upc.edu/~srlconll/st05/papers/intro.pdf)[C]//Proceedings of the Ninth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2005: 152-164.
+10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
+
+
+ This tutorial was created by PaddlePaddle and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+# Machine Translation
+
+Source codes are located at [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/machine_translation). Please refer to the PaddlePaddle [installation tutorial](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html) if you are a first time user.
+
+## Background
+
+Machine translation (MT) aims to translate between different languages by computer. The language to be translated from is called the source language, and the language to be translated into is called the target language. Machine translation, the process of translating from the source language to the target language, is an important research field in natural language processing.
+
+Early machine translation systems were mainly rule-based, relying on translation rules between two languages provided by language experts. This type of approach poses a great challenge to the experts, as it is hardly possible to cover all the rules of even one language, let alone two or more. The main difficulty faced by conventional machine translation is therefore obtaining a complete rule set \[[1](#References)\].
+
+
+To address the problems mentioned above, statistical machine translation was developed, in which the translation rules are learned from a large-scale corpus instead of being designed by humans. While it overcomes the bottleneck of knowledge acquisition, it still faces many challenges: 1) hand-crafted features can hardly cover all linguistic variations; 2) it is difficult to use global features; 3) it relies heavily on pre-processing such as word alignment, word segmentation, tokenization, rule extraction and syntactic parsing, and the errors introduced at each step accumulate, increasingly degrading the translation.
+
+The recent development of deep learning provides new solutions to these challenges. Deep-learning-based machine translation techniques fall into two main categories: 1) techniques based on a statistical machine translation system but with key components, e.g., the language model or the reordering model, improved with neural networks (see the left part of Figure 1); 2) techniques that map directly from the source language to the target language with a neural network, known as end-to-end neural machine translation (NMT).
+
+
+
+Figure 1. Neural Network based Machine Translation.
+
+
+
+This tutorial will mainly introduce the NMT model and how to use PaddlePaddle to train an NMT model.
+
+## Illustrative Results
+
+Taking Chinese-to-English translation as an example, after training the model, given the following segmented sentence in Chinese
+```text
+这些 是 希望 的 曙光 和 解脱 的 迹象 .
+```
+with a beam-search size of 3, the generated translations are as follows:
+```text
+0 -5.36816 these are signs of hope and relief .
+1 -6.23177 these are the light of hope and relief .
+2 -7.7914 these are the light of hope and the relief of hope .
+```
+- The first column is the id of the generated sentence; the second column is the score of the generated sentence (in descending order), where a larger value indicates better quality; the last column is the generated sentence itself.
+- There are two special tokens: `<e>` denotes the end of a sentence, and `<unk>` denotes an unknown word, i.e., a word not contained in the training dictionary.
+
+## Overview of the Model
+
+This section introduces the Gated Recurrent Unit (GRU), bidirectional recurrent neural networks, the Encoder-Decoder framework used in NMT, the attention mechanism, and the beam search algorithm.
+
+### GRU
+
+We have already introduced RNNs and LSTMs in the chapter on [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md).
+Compared to a simple RNN, the LSTM adds a memory cell, an input gate, a forget gate and an output gate. Together, these gates and the memory cell greatly improve the ability to handle long-term dependencies.
+
+The GRU \[[2](#References)\], proposed by Cho et al., is a simplification of the LSTM and an extension of the simple RNN, as shown in the figure below. A GRU unit has only two gates (their update equations are sketched below):
+- reset gate: when it is closed, the history is discarded, i.e., irrelevant historical information has no effect on the future output.
+- update gate: it combines the input gate and the forget gate and controls the impact of the history on the hidden output. The history is carried over when the update gate is close to 1.
+
+
+
+Figure 2. A GRU Gate.
+
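+As a concrete reference, one common formulation of the GRU update (following Cho et al. \[[2](#References)\]) is
+
+\begin{align}
+r_t&=\sigma \left ( W_r x_t + U_r h_{t-1} \right )\\\\
+z_t&=\sigma \left ( W_z x_t + U_z h_{t-1} \right )\\\\
+\tilde{h}_t&=\text{tanh}\left ( W x_t + U\left ( r_t \odot h_{t-1} \right ) \right )\\\\
+h_t&=z_t \odot h_{t-1} + \left ( 1-z_t \right ) \odot \tilde{h}_t
+\end{align}
+
+where $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication: a reset gate $r_t$ near 0 discards the history when forming the candidate state $\tilde{h}_t$, and an update gate $z_t$ near 1 copies $h_{t-1}$ forward unchanged.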
+
+Generally speaking, sequences with short-distance dependencies have active reset gates, while sequences with long-distance dependencies have active update gates.
+In addition, Chung et al. \[[3](#References)\] have shown empirically that although the GRU has fewer parameters, it performs similarly to the LSTM on several different tasks.
+
+### Bi-directional Recurrent Neural Network
+
+We have already introduced one instance of a bidirectional RNN in the chapter on [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/label_semantic_roles/README.md). Here we present a bidirectional RNN with a different architecture, proposed by Bengio et al. in \[[2](#References),[4](#References)\]. This model takes a sequence as input and outputs a fixed-dimensional feature vector at each step, encoding the context information at the corresponding time step.
+
+Specifically, this bidirectional RNN processes the input sequence in the original and the reverse order, and concatenates the output feature vectors at each time step as the final output. Thus the output node at each time step contains information from both the past and the future. The figure below shows an unrolled bidirectional RNN. The network contains a forward and a backward RNN with six weight matrices: the weight matrices from the input to the forward and backward hidden layers ($W_1, W_3$), from the hidden layers to themselves ($W_2, W_5$), and from the forward and backward hidden layers to the output layer ($W_4, W_6$). Note that there are no connections between the forward and backward hidden layers.
+
+
+
+### Encoder-Decoder Framework
+
+The Encoder-Decoder framework \[[2](#References)\] addresses the mapping of one sequence to another, where both sequences can have arbitrary lengths. The source sequence is encoded into a vector by the encoder, which is then decoded into a target sequence by a decoder that maximizes the predictive probability. Both the encoder and the decoder are typically implemented as RNNs.
+
+
+
+Figure 4. Encoder-Decoder Framework.
+
+
+#### Encoder
+
+There are three steps for encoding a sentence:
+
+1. One-hot vector representation of words. Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in R^{\left | V \right |},i=1,2,...,T$, where $w_i$ has the same dimensionality as the dictionary size $\left | V \right |$, with a one at the entry corresponding to the word's location in the dictionary and zeros elsewhere.
+
+2. Word embedding as a representation in a low-dimensional semantic space. A one-hot vector representation has two problems: 1) its dimensionality is typically large, leading to the curse of dimensionality; 2) it is hard to capture relationships between words, i.e., semantic similarities. It is therefore useful to project the one-hot vector into a low-dimensional semantic space as a dense vector of fixed dimension, i.e., $s_i=Cw_i$ for the $i$-th word, where $C\in R^{K\times \left | V \right |}$ is the projection matrix and $K$ is the dimensionality of the word embedding.
+
+3. Encoding of the source sequence via an RNN. This can be described mathematically as $h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$, where $h_0$ is a zero vector, $\varnothing _\theta$ is a nonlinear activation function, and $\mathbf{h}=\left \{ h_1,..., h_T \right \}$ is the sequential encoding of the first $T$ words of the source sequence. The whole sentence can be represented either by the encoding vector at the last time step $T$ of $\mathbf{h}$, or by temporal pooling over $\mathbf{h}$.
+
+A bidirectional RNN can also be used in step 3 for a richer sentence encoding, implemented here with a bidirectional GRU. The forward GRU encodes the source sequence in its original order $(x_1,x_2,...,x_T)$, generating the hidden states $(\overrightarrow{h_1},\overrightarrow{h_2},...,\overrightarrow{h_T})$. Similarly, the backward GRU encodes the source sequence in reverse order $(x_T,x_{T-1},...,x_1)$, generating $(\overleftarrow{h_1},\overleftarrow{h_2},...,\overleftarrow{h_T})$. For each word $x_i$, the complete hidden state is the concatenation of the corresponding hidden states of the two GRUs, i.e., $h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T}$.
+
+
+
+Figure 5. Encoder using bi-directional GRU.
+
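+A minimal NumPy sketch of the plain (unidirectional) encoding in step 3, assuming a simple $\text{tanh}$ RNN cell for $\varnothing _\theta$ (the tutorial itself uses GRUs):
+
+```python
+import numpy as np
+
+def rnn_encode(S, W_s, W_h):
+    """S: [T, K] word embeddings s_i; W_s: [d, K]; W_h: [d, d].
+    Returns H = [h_1, ..., h_T] with h_i = tanh(W_s s_i + W_h h_{i-1}), h_0 = 0."""
+    h, H = np.zeros(W_h.shape[0]), []
+    for s in S:
+        h = np.tanh(W_s @ s + W_h @ h)   # h_i = phi_theta(h_{i-1}, s_i)
+        H.append(h)
+    return np.stack(H)
+
+# Sentence vector: the last state H[-1], or a temporal pooling such as H.mean(axis=0).
+```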
+
+#### Decoder
+
+The goal of the decoder is to maximize the probability of the next correct word in the target language. The main idea is as follows:
+
+1. At each time step $i$, given the encoding vector (or context vector) $c$ of the source sentence, the $i$-th word $u_i$ of the ground-truth target sentence and the RNN hidden state $z_i$, the next hidden state $z_{i+1}$ is computed as:
+
+ $$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$
+    where $\phi _{\theta '}$ is a nonlinear activation function and $c=q\mathbf{h}$ is the context vector of the source sentence. Without [attention](#Attention Mechanism), if the [encoder](#Encoder) outputs the encoding vector at the last time step of the source sentence, $c$ can simply be defined as $c=h_T$. $u_i$ denotes the $i$-th word of the target sentence, and $u_0$ denotes the start-of-sentence token (i.e., `<s>`), marking the beginning of decoding. $z_i$ is the RNN hidden state at time step $i$, and $z_0$ is an all-zero vector.
+
+2. Compute the probability $p_{i+1}$ of the $(i+1)$-th word in the target sentence by normalizing $z_{i+1}$ with `softmax` as follows
+
+ $$p\left ( u_{i+1}|u_{<i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
+
+    where $W_sz_{i+1}+b_z$ scores each candidate word, and softmax normalizes these scores into the probability distribution $p_{i+1}$ over the $(i+1)$-th word.
+
+3. Compute the cost according to $p_{i+1}$ and $u_{i+1}$.
+4. Repeat steps 1~3 until all the words in the target sentence have been processed.
+
+At generation time, machine translation produces a target-language sentence for a source sentence according to the pre-trained model. Decoding during generation differs somewhat from decoding during training; please refer to the [Beam Search Algorithm](#Beam Search Algorithm) for details.
+
+### Attention Mechanism
+
+The fixed-dimensional vector representation produced by the encoding stage has a few problems: 1) it is very challenging to encode both the semantic and the syntactic information of a sentence into a fixed-dimensional vector regardless of sentence length; 2) intuitively, when translating, we pay more attention to the parts of the source sentence most relevant to the current output, and this focus shifts as translation proceeds. With a fixed-dimensional vector, all the information in the source sentence receives equal attention, which is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced the attention mechanism, which decodes based on different fragments of the source sequence at each step, addressing the difficulty of learning features for long sentences. A decoder with attention is explained below.
+
+Different from the simple decoder, $z_{i+1}$ is computed as:
+
+$$z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$$
+
+It is observed that for each word $u_i$ in the target language sentence, there is a corresponding context vector $c_i$ as the encoding of the source sentence, which is computed as:
+
+$$c_i=\sum _{j=1}^{T}a_{ij}h_j, a_i=\left[ a_{i1},a_{i2},...,a_{iT}\right ]$$
+
+Note that the attention mechanism is realized as a weighted average over the RNN hidden states $h_j$. The weight $a_{ij}$ denotes how strongly the $i$-th word of the target sentence attends to the $j$-th word of the source sentence, and is calculated as
+
+\begin{align}
+a_{ij}&=\frac{exp(e_{ij})}{\sum_{k=1}^{T}exp(e_{ik})}\\\\
+e_{ij}&=align(z_i,h_j)\\\\
+\end{align}
+
+where $align$ is an alignment model measuring how well the $i$-th word of the target sentence matches the $j$-th word of the source sentence. Concretely, this fitness is computed from the $i$-th hidden state $z_i$ of the decoder RNN and the $j$-th annotation vector $h_j$ of the source sentence. Conventional alignment models use hard alignment, where each word of the target sentence explicitly corresponds to one or more words of the source sentence. The attention model uses soft alignment, where any word of the source sentence may be related to any word of the target sentence with a real-valued strength computed by the model; this makes alignment differentiable, so it can be incorporated into the NMT framework and trained via back-propagation.
+
+
+
+Figure 6. Decoder with Attention Mechanism.
+
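+A minimal NumPy sketch of one attention step, assuming the additive alignment model $e_{ij}=v^{T}\text{tanh}(W_{z}z_i+W_{h}h_j)$ of Bahdanau et al. \[[4](#References)\]; all names here are illustrative, and `simple_attention` in the configuration below plays this role:
+
+```python
+import numpy as np
+
+def attention_context(z_i, H, v, W_z, W_h):
+    """z_i: decoder state; H: [T, d] encoder states h_j;
+    v: [k]; W_z: [k, dim(z)]; W_h: [k, d]. Returns the context vector c_i."""
+    e = np.array([v @ np.tanh(W_z @ z_i + W_h @ h_j) for h_j in H])  # scores e_ij
+    e = np.exp(e - e.max())
+    a = e / e.sum()        # attention weights a_ij (softmax over j)
+    return a @ H           # c_i = sum_j a_ij * h_j
+```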
+
+### Beam Search Algorithm
+
+[Beam search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is typically used when the solution space is huge (e.g., in machine translation or speech recognition) and there is not enough memory for all possible solutions. For example, if we want to translate "`你好`" into English, even with only three words in the dictionary (`<s>`, `<e>`, `hello`) it is still possible to generate an infinite number of sentences, since the word `hello` can appear any number of times. Beam search finds a good translation among them.
+
+Beam search builds a search tree using breadth-first search and sorts the nodes at each level of the tree according to a heuristic cost (in this tutorial, the sum of the log-probabilities of the generated words), keeping only a fixed number of nodes given by the pre-specified beam size (or beam width). Only the higher-quality nodes are expanded at the next level, which reduces the space and time requirements significantly, though without any guarantee of finding the globally optimal solution.
+
+When beam search is used in decoding, the goal is to maximize the probability of the generated sequence. The procedure is as follows:
+
+1. At each time step $i$, compute the hidden state $z_{i+1}$ of the next time step according to the context vector $c$ of the source sentence, the $i$-th generated word $u_i$ of the target sentence and the RNN hidden state $z_i$.
+2. Normalize $z_{i+1}$ with `softmax` to get the probability $p_{i+1}$ of the $(i+1)$-th word of the target sentence.
+3. Sample the word $u_{i+1}$ according to $p_{i+1}$.
+4. Repeat steps 1~3 until the end-of-sentence token `<e>` is generated or the maximum sentence length is reached.
+
+Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in the [Decoder](#Decoder). As each step is greedy, there is no guarantee of a global optimum.
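+
+A minimal sketch of this loop, assuming a `step` function that returns the log-probabilities of candidate next words given a prefix (all names are illustrative, not PaddlePaddle's `beam_search` API):
+
+```python
+def beam_search_decode(step, bos_id, eos_id, beam_size, max_length):
+    """step(prefix) -> list of (word_id, log_prob) for the next word."""
+    beams = [([bos_id], 0.0)]                  # (prefix, sum of word log-probs)
+    finished = []
+    for _ in range(max_length):
+        candidates = []
+        for prefix, score in beams:
+            for word, logp in step(prefix):
+                candidates.append((prefix + [word], score + logp))
+        # keep only the beam_size best partial translations at this level
+        candidates.sort(key=lambda c: c[1], reverse=True)
+        beams = []
+        for prefix, score in candidates[:beam_size]:
+            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
+        if not beams:                          # every beam has ended with <e>
+            break
+    return max(finished + beams, key=lambda c: c[1])
+```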
+
+## Data Preparation
+
+### Download and Uncompression
+
+This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as the test and generation sets.
+
+Run the following command in Linux to obtain the data:
+```bash
+cd data
+./wmt14_data.sh
+```
+There are three folders in the downloaded dataset `data/wmt14`:
+
+| Folder Name | French-English Parallel Corpus | Number of Files | Size of Files |
+|---|---|---|---|
+| train | ccb2_pc30.src, ccb2_pc30.trg, etc. | 12 | 3.55G |
+| test | ntst1213.src, ntst1213.trg | 2 | 1636k |
+| gen | ntst14.src, ntst14.trg | 2 | 864k |
+- `XXX.src` is the source file in French and `XXX.trg` is the target file in English. Each row of a file contains one sentence.
+- `XXX.src` and `XXX.trg` have the same number of rows, with a one-to-one correspondence between the sentences at the same row of the two files.
+
+### User Defined Dataset (Optional)
+
+To use your own dataset, put it under the `data` folder and organize it as follows
+```text
+user_dataset
+├── train
+│ ├── train_file1.src
+│ ├── train_file1.trg
+│ └── ...
+├── test
+│ ├── test_file1.src
+│ ├── test_file1.trg
+│ └── ...
+├── gen
+│ ├── gen_file1.src
+│ ├── gen_file1.trg
+│ └── ...
+```
+
+Explanation of the directories:
+- First level: `user_dataset`: the name of the user-defined dataset.
+- Second level: `train`, `test` and `gen`: these names must not be changed.
+- Third level: parallel corpora in the source and target languages, with the suffixes `.src` and `.trg` respectively.
+
+### Data Pre-processing
+
+There are two steps for pre-processing:
+- Merge each pair of source and target parallel corpus files into one file (as sketched after this list):
+  - Merge each `XXX.src` and `XXX.trg` file pair into `XXX`
+  - The $i$-th row of `XXX` is the concatenation of the $i$-th row of `XXX.src` and the $i$-th row of `XXX.trg`, separated by '\t'.
+
+- Create the source and target dictionaries, each containing **DICTSIZE** words: the (DICTSIZE - 3) most frequent words from the corpus plus the 3 special tokens `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown/out-of-vocabulary word).
+
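+The merge step amounts to the following (a sketch, assuming one sentence per line; the actual work is done by `preprocess.py`):
+
+```python
+def merge_pair(src_path, trg_path, out_path):
+    # The i-th output line is "<i-th source line>\t<i-th target line>".
+    with open(src_path) as src, open(trg_path) as trg, open(out_path, 'w') as out:
+        for s, t in zip(src, trg):
+            out.write(s.rstrip('\n') + '\t' + t.rstrip('\n') + '\n')
+```
+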
+`preprocess.py` is used for pre-processing:
+```bash
+python preprocess.py -i INPUT [-d DICTSIZE] [-m]
+```
+- `-i INPUT`: path to the original dataset.
+- `-d DICTSIZE`: number of words in the dictionary. If unspecified, the dictionary contains all the words appearing in the input dataset.
+- `-m --mergeDict`: merge the source and target dictionaries, making the two dictionaries identical.
+
+The specific command to run the script is as follows:
+```bash
+python preprocess.py -i data/wmt14 -d 30000
+```
+You will see the following messages after a few minutes:
+```text
+concat parallel corpora for dataset
+build source dictionary for train data
+build target dictionary for train data
+dictionary size is 30000
+```
+The pre-processed data is located at `data/pre-wmt14`:
+```text
+pre-wmt14
+├── train
+│ └── train
+├── test
+│ └── test
+├── gen
+│ └── gen
+├── train.list
+├── test.list
+├── gen.list
+├── src.dict
+└── trg.dict
+```
+- `train`, `test` and `gen`: contains French-English parallel corpus for training, testing and generation. Each row from each file is separated into two columns with a “\t”, where the first column is the sequence in French and the second one is in English.
+- `train.list`, `test.list` and `gen.list`: record respectively the path to `train`, `test` and `gen` folders.
+- `src.dict` and `trg.dict`: source (French) and target (English) dictionary. Each dictionary contains 30000 words (29997 most frequent words and 3 special tokens).
+
+### Providing Data to PaddlePaddle
+
+We use `dataprovider.py` to provide data to PaddlePaddle as follows:
+
+1. Import PyDataProvider2 package from PaddlePaddle and define three special tokens:
+
+ ```python
+ from paddle.trainer.PyDataProvider2 import *
+    UNK_IDX = 2      # out-of-vocabulary word
+    START = "<s>"    # begin of sequence
+    END = "<e>"      # end of sequence
+ ```
+2. Use the initialization function `hook` to define the input data types (`input_types`) for training and generation:
+    - Training: three input sequences, where the "source language sequence" and the "target language sequence" are inputs and the "target language next-word sequence" is the label.
+    - Generation: two input sequences, where the "source language sequence" is the input and the "source language sequence id" is the id of the input data (optional).
+
+    `src_dict_path` in the `hook` function is the path to the source language dictionary, while `trg_dict_path` is the path to the target language dictionary. `is_generating` is passed from the model config file. For more details on the usage of the `hook` function please refer to [Model Config](#Model Config).
+
+ ```python
+ def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list,
+ **kwargs):
+ # job_mode = 1: training 0: generation
+ settings.job_mode = not is_generating
+
+ def fun(dict_path): # load dictionary according to the path
+ out_dict = dict()
+ with open(dict_path, "r") as fin:
+ out_dict = {
+ line.strip(): line_count
+ for line_count, line in enumerate(fin)
+ }
+ return out_dict
+
+ settings.src_dict = fun(src_dict_path)
+ settings.trg_dict = fun(trg_dict_path)
+
+ if settings.job_mode: #training
+ settings.input_types = {
+ 'source_language_word': #source language sequence
+ integer_value_sequence(len(settings.src_dict)),
+ 'target_language_word': #target language sequence
+ integer_value_sequence(len(settings.trg_dict)),
+ 'target_language_next_word': #target language next word sequence
+ integer_value_sequence(len(settings.trg_dict))
+ }
+ else: #generation
+ settings.input_types = {
+ 'source_language_word': #source language sequence
+ integer_value_sequence(len(settings.src_dict)),
+ 'sent_id': #source language sequence id
+ integer_value_sequence(len(open(file_list[0], "r").readlines()))
+ }
+ ```
+3. Use the `process` function to open the file `file_name`, read each row, convert the data to match `input_types`, and `yield` each sample back to the PaddlePaddle process. More specifically:
+
+    - add `<s>` to the beginning and `<e>` to the end of each source language sequence, producing "source_language_word".
+    - add `<s>` to the beginning of each target language sequence, producing "target_language_word".
+    - add `<e>` to the end of each target language sequence, producing "target_language_next_word".
+
+ ```python
+ def _get_ids(s, dictionary): # get the location of each word from the source language sequence in the dictionary
+ words = s.strip().split()
+ return [dictionary[START]] + \
+ [dictionary.get(w, UNK_IDX) for w in words] + \
+ [dictionary[END]]
+
+ @provider(init_hook=hook, pool_size=50000)
+ def process(settings, file_name):
+ with open(file_name, 'r') as f:
+ for line_count, line in enumerate(f):
+ line_split = line.strip().split('\t')
+ if settings.job_mode and len(line_split) != 2:
+ continue
+ src_seq = line_split[0]
+ src_ids = _get_ids(src_seq, settings.src_dict)
+
+ if settings.job_mode:
+ trg_seq = line_split[1]
+ trg_words = trg_seq.split()
+ trg_ids = [settings.trg_dict.get(w, UNK_IDX) for w in trg_words]
+
+                    # sequences longer than 80 words are removed during training to avoid an overly deep RNN
+ if len(src_ids) > 80 or len(trg_ids) > 80:
+ continue
+ trg_ids_next = trg_ids + [settings.trg_dict[END]]
+ trg_ids = [settings.trg_dict[START]] + trg_ids
+ yield {
+ 'source_language_word': src_ids,
+ 'target_language_word': trg_ids,
+ 'target_language_next_word': trg_ids_next
+ }
+ else:
+ yield {'source_language_word': src_ids, 'sent_id': [line_count]}
+ ```
+Note: As the training data is 3.55G, for machines with limited memory it is recommended to use `pool_size` to limit the number of data samples held in memory.
+
+## Model Config
+
+### Data Definition
+
+1. Specify the paths to the data and to the source/target dictionaries. `is_generating` accepts an argument passed from the command line and denotes whether the current configuration is for training (the default) or for generation. See [Usage and Results](#Usage and Results).
+
+ ```python
+ import os
+ from paddle.trainer_config_helpers import *
+
+ data_dir = "./data/pre-wmt14" # data path
+ src_lang_dict = os.path.join(data_dir, 'src.dict') # path to the source language dictionary
+ trg_lang_dict = os.path.join(data_dir, 'trg.dict') # path to the target language dictionary
+ is_generating = get_config_arg("is_generating", bool, False) # config mode
+ ```
+2. Use `define_py_data_sources2` to get data from `dataprovider.py`, and use the `args` variable to pass in the source/target language dictionary paths and the config mode.
+
+ ```python
+ if not is_generating:
+ train_list = os.path.join(data_dir, 'train.list')
+ test_list = os.path.join(data_dir, 'test.list')
+ else:
+ train_list = None
+ test_list = os.path.join(data_dir, 'gen.list')
+
+ define_py_data_sources2(
+ train_list,
+ test_list,
+ module="dataprovider",
+ obj="process",
+ args={
+ "src_dict_path": src_lang_dict, # source language dictionary path
+ "trg_dict_path": trg_lang_dict, # target language dictionary path
+ "is_generating": is_generating # config mode
+ })
+ ```
+
+### Algorithm Configuration
+
+```python
+settings(
+ learning_method = AdamOptimizer(),
+ batch_size = 50,
+ learning_rate = 5e-4)
+```
+This tutorial uses stochastic gradient descent with the Adam optimizer and a learning rate of 5e-4. Note that `batch_size = 50` denotes generating 50 sequences at a time.
+
+### Model Structure
+1. Define some global variables
+
+ ```python
+ source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary
+ target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary
+ word_vector_dim = 512 # dimensionality of word vector
+ encoder_size = 512 # dimensionality of the hidden state of encoder GRU
+ decoder_size = 512 # dimentionality of the hidden state of decoder GRU
+
+ if is_generating:
+ beam_size=3 # beam size for the beam search algorithm
+ max_length=250 # maximum length for the generated sentence
+ gen_trans_file = get_config_arg("gen_trans_file", str, None) # generate file
+ ```
+
+2. Implement Encoder as follows:
+
+    2.1 Input the one-hot vector representation $\mathbf{w}$ of the source language sentence, converted by `dataprovider.py`
+
+ ```python
+ src_word_id = data_layer(name='source_language_word', size=source_dict_dim)
+ ```
+ 2.2 Map the one-hot vector into a word vector $\mathbf{s}$ in a low-dimensional semantic space
+
+ ```python
+ src_embedding = embedding_layer(
+ input=src_word_id,
+ size=word_vector_dim,
+ param_attr=ParamAttr(name='_source_language_embedding'))
+ ```
+    2.3 Use a bidirectional GRU to encode the source language sequence, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$
+
+ ```python
+ src_forward = simple_gru(input=src_embedding, size=encoder_size)
+ src_backward = simple_gru(
+ input=src_embedding, size=encoder_size, reverse=True)
+ encoded_vector = concat_layer(input=[src_forward, src_backward])
+ ```
+
+3. Implement Attention-based Decoder as follows:
+
+ 3.1 Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
+
+ ```python
+ with mixed_layer(size=decoder_size) as encoded_proj:
+ encoded_proj += full_matrix_projection(input=encoded_vector)
+ ```
+ 3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
+
+ ```python
+ backward_first = first_seq(input=src_backward)
+ with mixed_layer(
+ size=decoder_size,
+ act=TanhActivation(), ) as decoder_boot:
+ decoder_boot += full_matrix_projection(input=backward_first)
+ ```
+    3.3 Define the computation at each time step of the decoder RNN, i.e., predict the probability $p_{i+1}$ of the $(i+1)$-th word from the current context vector $c_i$, the decoder hidden state $z_i$ and the $i$-th target-language word $u_i$.
+
+    - decoder_mem records the hidden state $z_i$ of the previous time step, with decoder_boot as its initial state.
+    - context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is the encoder states $h_j$ and enc_proj is their projection (c.f. 3.1). $a_{ij}$ is calculated inside `simple_attention`.
+    - decoder_inputs fuses $c_i$ with the representation of the current word $u_i$.
+    - gru_step computes $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$ via `gru_step_layer`.
+    - Finally, softmax normalization is used to compute the word probabilities $p\left ( u_i|u_{<i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$, and the output is returned.
+
+ ```python
+ def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
+ decoder_mem = memory(
+ name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
+
+ context = simple_attention(
+ encoded_sequence=enc_vec,
+ encoded_proj=enc_proj,
+ decoder_state=decoder_mem, )
+
+ with mixed_layer(size=decoder_size * 3) as decoder_inputs:
+ decoder_inputs += full_matrix_projection(input=context)
+ decoder_inputs += full_matrix_projection(input=current_word)
+
+ gru_step = gru_step_layer(
+ name='gru_decoder',
+ input=decoder_inputs,
+ output_mem=decoder_mem,
+ size=decoder_size)
+
+ with mixed_layer(
+ size=target_dict_dim, bias_attr=True,
+ act=SoftmaxActivation()) as out:
+ out += full_matrix_projection(input=gru_step)
+ return out
+ ```
+4. Differences in the decoder between training and generation
+
+    4.1 Define the decoder name and the first two inputs of `gru_decoder_with_attention`. Note that `StaticInput` is used for both inputs. Please refer to the [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
+
+ ```python
+ decoder_group_name = "decoder_group"
+ group_input1 = StaticInput(input=encoded_vector, is_seq=True)
+ group_input2 = StaticInput(input=encoded_proj, is_seq=True)
+ group_inputs = [group_input1, group_input2]
+ ```
+ 4.2 In training mode:
+
+    - the target-language word embedding trg_embedding is passed to `gru_decoder_with_attention` as current_word.
+ - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
+ - the sequence of next words from the target language is used as label (lbl)
+ - multi-class cross-entropy (`classification_cost`) is used to calculate the cost
+
+ ```python
+ if not is_generating:
+ trg_embedding = embedding_layer(
+ input=data_layer(
+ name='target_language_word', size=target_dict_dim),
+ size=word_vector_dim,
+ param_attr=ParamAttr(name='_target_language_embedding'))
+ group_inputs.append(trg_embedding)
+
+ decoder = recurrent_group(
+ name=decoder_group_name,
+ step=gru_decoder_with_attention,
+ input=group_inputs)
+
+ lbl = data_layer(name='target_language_next_word', size=target_dict_dim)
+ cost = classification_cost(input=decoder, label=lbl)
+ outputs(cost)
+ ```
+ 4.3 In generation mode:
+
+    - during generation, since the decoder RNN takes the word vector generated at the previous time step as input, `GeneratedInput` is used to implement this automatically. Please refer to the [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
+    - `beam_search` calls `gru_decoder_with_attention` to generate word ids
+    - `seqtext_printer_evaluator` writes the generated sentences to `gen_trans_file` according to `trg_lang_dict`
+
+ ```python
+ else:
+ trg_embedding = GeneratedInput(
+ size=target_dict_dim,
+ embedding_name='_target_language_embedding',
+ embedding_size=word_vector_dim)
+ group_inputs.append(trg_embedding)
+
+ beam_gen = beam_search(
+ name=decoder_group_name,
+ step=gru_decoder_with_attention,
+ input=group_inputs,
+ bos_id=0,
+ eos_id=1,
+ beam_size=beam_size,
+ max_length=max_length)
+
+ seqtext_printer_evaluator(
+ input=beam_gen,
+ id_input=data_layer(
+ name="sent_id", size=1),
+ dict_file=trg_lang_dict,
+ result_file=gen_trans_file)
+ outputs(beam_gen)
+ ```
+Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
+
+
+## Model Training
+
+Training can be done with the following command:
+
+```bash
+./train.sh
+```
+where `train.sh` contains
+
+```bash
+paddle train \
+--config='seqToseq_net.py' \
+--save_dir='model' \
+--use_gpu=false \
+--num_passes=16 \
+--show_parameter_stats_period=100 \
+--trainer_count=4 \
+--log_period=10 \
+--dot_period=5 \
+2>&1 | tee 'train.log'
+```
+- config: configuration file for the network
+- save_dir: path to save the trained model
+- use_gpu: whether to use GPU for training; CPU is used here
+- num_passes: number of training passes. In PaddlePaddle, one pass means one complete pass over all the data in the training set
+- show_parameter_stats_period: here we show the statistics of parameters every 100 batches
+- trainer_count: the number of CPU processes or GPU devices
+- log_period: here we print log every 10 batches
+- dot_period: we print one "." every 5 batches
+
+The training loss is printed every 10 batches, and you will see messages like these:
+```text
+I0719 19:16:45.952062 15563 TrainerInternal.cpp:160] Batch=10 samples=500 AvgCost=198.475 CurrentCost=198.475 Eval: classification_error_evaluator=0.737155 CurrentEval: classification_error_evaluator=0.737155
+I0719 19:17:56.707319 15563 TrainerInternal.cpp:160] Batch=20 samples=1000 AvgCost=157.479 CurrentCost=116.483 Eval: classification_error_evaluator=0.698392 CurrentEval: classification_error_evaluator=0.659065
+.....
+```
+- AvgCost: the average cost from batch-0 to the current batch.
+- CurrentCost: the cost of the current batch.
+- classification\_error\_evaluator (Eval): the average per-word classification error rate from batch-0 to the current batch.
+- classification\_error\_evaluator (CurrentEval): the per-word classification error rate of the current batch.
+
+The model training is considered successful when the classification\_error\_evaluator drops below 0.35.
+
+## Model Usage
+
+### Download Pre-trained Model
+
+As training an NMT model is very time-consuming, we provide a pre-trained model (pass-00012, ~205M). The model was trained on a cluster of 50 physical nodes (each with two 6-core CPUs) for 16 passes (about 5 days), with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU score](#BLEU Score) of 26.92. Run the following command to download the model:
+```bash
+cd pretrained
+./wmt14_model.sh
+```
+
+### Usage and Results
+
+Run the following command to perform translation from French to English:
+
+```bash
+./gen.sh
+```
+where `gen.sh` contains:
+
+```bash
+paddle train \
+--job=test \
+--config='seqToseq_net.py' \
+--save_dir='pretrained/wmt14_model' \
+--use_gpu=true \
+--num_passes=13 \
+--test_pass=12 \
+--trainer_count=1 \
+--config_args=is_generating=1,gen_trans_file="gen_result" \
+2>&1 | tee 'translation/gen.log'
+```
+Parameters that differ from training are listed as follows:
+- job: set the mode to testing.
+- save_dir: path to the pre-trained model.
+- num_passes and test_pass: load the model parameters of passes $i\in \left [ test\_pass,num\_passes-1 \right ]$. Here we only load `data/wmt14_model/pass-00012`.
+- config_args: pass self-defined command-line parameters to the model configuration. `is_generating=1` selects generation mode, and `gen_trans_file="gen_result"` names the output file.
+
+For translation results please refer to [Illustrative Results](#Illustrative Results).
+
+### BLEU Evaluation
+
+BLEU (Bilingual Evaluation Understudy) is a widely used metric for automatic machine translation evaluation, proposed by the IBM Watson Research Center in 2002 \[[5](#References)\]. The basic idea is that the closer a machine translation is to one produced by a human expert, the better the translation system performs.
+To measure this closeness, n-gram precision over sentences is used, comparing the number of matched n-grams; more matches lead to a higher BLEU score.
+
+[Moses](http://www.statmt.org/moses/) is an open-source machine translation system; we use its [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) script for BLEU evaluation. Run the following command to download it:
+```bash
+./moses_bleu.sh
+```
+BLEU evaluation can be performed with the `eval_bleu` script as follows, where FILE is the file to be evaluated, BEAMSIZE is the beam size, and `data/wmt14/gen/ntst14.trg` is used as the reference translation by default.
+```bash
+./eval_bleu.sh FILE BEAMSIZE
+```
+Specifically, the script is run as follows
+```bash
+./eval_bleu.sh gen_result 3
+```
+You will see the following message as output
+```text
+BLEU = 26.92
+```
+
+## Summary
+
+End-to-end neural machine translation is a recently developed approach to machine translation. In this chapter, we introduced the typical Encoder-Decoder framework and the attention mechanism. Since NMT is a typical sequence-to-sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstractive summarization and single-turn dialogue can all be solved with the model presented in this chapter.
+
+## References
+
+1. Koehn P. [Statistical machine translation](https://books.google.com.hk/books?id=4v_Cx1wIMLkC&printsec=frontcover&hl=zh-CN&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)[M]. Cambridge University Press, 2009.
+2. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://www.aclweb.org/anthology/D/D14/D14-1179.pdf)[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1724-1734.
+3. Chung J, Gulcehre C, Cho K H, et al. [Empirical evaluation of gated recurrent neural networks on sequence modeling](https://arxiv.org/abs/1412.3555)[J]. arXiv preprint arXiv:1412.3555, 2014.
+4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[C]//Proceedings of ICLR 2015, 2015.
+5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318.
+
+
+This tutorial was created by PaddlePaddle and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+
+# Word2Vec
+
+This is intended as a reference tutorial. The source code for this tutorial is at [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/word2vec).
+
+For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
+
+## Background Introduction
+
+This section introduces the concept of **word embedding**, which is a vector representation of words. It is a popular technique used in natural language processing. Word embeddings support many Internet services, including search engines, advertising systems, and recommendation systems.
+
+### One-Hot Vectors
+
+Building these services requires us to quantify the similarity between two words or paragraphs. This calls for a new representation of all the words to make them more suitable for computation. An obvious way to achieve this is through the vector space model, where every word is represented as a **one-hot vector**.
+
+For each word, its one-hot vector has a 1 in the entry corresponding to that word and 0 in every other entry. The length of a one-hot vector equals the size of the dictionary, and each entry corresponds to the presence (or absence) of a word in the dictionary.
+
+One-hot vectors are intuitive, yet they have limited usefulness. Take the example of an Internet advertising system: suppose a customer enters the query "Mother's Day", while an ad bids for the keyword "carnations". Because the one-hot vectors of these two words are perpendicular, their Euclidean distance and cosine similarity indicate little relevance. However, *we* know that these two queries are connected semantically, since people often gift their mothers bundles of carnation flowers on Mother's Day. This discrepancy is due to the low information capacity of each vector: comparing the one-hot representations of two words does not assess their relevance sufficiently. To calculate their similarity accurately, we need more information, which can be learned from large amounts of data through machine learning methods.
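+To make this concrete, here is a minimal numpy sketch; the five-word dictionary and its indices are assumptions for illustration only:
+
+```python
+import numpy as np
+
+# Toy dictionary mapping words to indices (assumed for illustration).
+vocab = {"mother's day": 0, "carnations": 1, "game": 2, "big": 3, "huge": 4}
+
+def one_hot(word):
+    vec = np.zeros(len(vocab))
+    vec[vocab[word]] = 1.0
+    return vec
+
+u, v = one_hot("mother's day"), one_hot("carnations")
+# The dot product (hence the cosine similarity) of two distinct
+# one-hot vectors is always 0: they are perpendicular.
+print(np.dot(u, v))  # 0.0
+```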
+
+In machine learning, various kinds of "knowledge" are represented by various models, and the word embedding model is one of them. It projects a one-hot vector onto an embedding vector of lower dimension, e.g. $embedding(mother's day) = [0.3, 4.2, -1.5, ...], embedding(carnations) = [0.2, 5.6, -2.3, ...]$. Mapping one-hot vectors onto an embedded vector space has the potential to bring the embedding vectors of similar words (either semantically or usage-wise) closer to each other, so that the cosine similarity between the vectors for words like "Mother's Day" and "carnations" is no longer zero.
+
+A word embedding model could be a probabilistic model, a co-occurrence matrix model, or a neural network. Before people started using neural networks to generate word embeddings, the traditional method was to compute a co-occurrence matrix $X$ of words. Here, $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ is the number of co-occurrences of the $i$-th and $j$-th words of the vocabulary $V$ in the corpus, and $|V|$ is the size of the vocabulary. By performing matrix decomposition on $X$, e.g. Singular Value Decomposition \[[5](#References)\],
+
+$$X = USV^T$$
+
+the rows of the resulting $U$ can be regarded as the word embeddings of all the words.
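+As a sketch of this classical approach, numpy's SVD yields the decomposition directly; the toy co-occurrence counts below are made up for illustration:
+
+```python
+import numpy as np
+
+# Toy 4x4 co-occurrence matrix over a 4-word vocabulary.
+X = np.array([[0, 2, 1, 0],
+              [2, 0, 3, 1],
+              [1, 3, 0, 2],
+              [0, 1, 2, 0]], dtype=float)
+
+U, S, Vt = np.linalg.svd(X)
+k = 2                          # keep the 2 leading singular directions
+embeddings = U[:, :k] * S[:k]  # one k-dimensional embedding per word
+print(embeddings)
+```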
+
+However, this method suffers from several drawbacks:
+1) Since many pairs of words do not co-occur, the co-occurrence matrix is sparse. To achieve good performance in the matrix factorization, further treatment of word frequencies is needed;
+2) The matrix is large, frequently on the order of $10^6 \times 10^6$;
+3) We need to manually filter out stop words (like "although", "a", ...), otherwise these frequent words will distort the matrix factorization.
+
+A neural-network-based model does not require storing huge tables of statistics over the whole corpus. It obtains the word embedding by learning from semantic information, and hence avoids the aforementioned problems of the traditional method. In this chapter, we will introduce the details of neural network word embedding models and how to train such models in PaddlePaddle.
+
+## Results Demonstration
+
+After training a word embedding model, we can use the data visualization algorithm t-SNE \[[4](#References)\] to project the word embedding vectors onto a two-dimensional space (see the figure below). From the figure, we can see that semantically relevant words -- *a*, *the*, and *these*, or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business*, or *decision* and *japan* -- are far from each other.
+
+
+
 Figure 1. Two-dimensional projection of word embeddings
+
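+A projection like the one in Figure 1 can be produced with, for example, scikit-learn's t-SNE implementation; in this sketch the embedding matrix and word list are random stand-ins for data loaded from a trained model:
+
+```python
+import numpy as np
+from sklearn.manifold import TSNE
+
+# Stand-ins: in practice, load these from a trained model.
+embeddings = np.random.rand(100, 32)   # num_words x embedding_dim
+words = ["word%d" % i for i in range(100)]
+
+points = TSNE(n_components=2).fit_transform(embeddings)
+for word, (x, y) in zip(words, points):
+    print(word, x, y)
+```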
+
+### Cosine Similarity
+
+On the other hand, we know that the cosine similarity between two vectors falls in $[-1,1]$. Specifically, it is 1 when the vectors point in the same direction, 0 when they are perpendicular, and -1 when they point in opposite directions. That is, the cosine similarity between two word embedding vectors scales with their relevance, so we can use it to measure how related two words are:
+
+```text
+please input two words: big huge
+similarity: 0.899180685161
+
+please input two words: from company
+similarity: -0.0997506977351
+```
+
+The above results can be obtained by running `calculate_dis.py`, which loads the words in the dictionary and their corresponding trained word embeddings. For detailed instructions, see section [Model Application](#Model Application).
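+The similarity score itself is simple to compute; a minimal sketch, independent of the `calculate_dis.py` script, is:
+
+```python
+import numpy as np
+
+def cosine_similarity(a, b):
+    """Cosine of the angle between vectors a and b, in [-1, 1]."""
+    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
+
+print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0
+```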
+
+
+## Model Overview
+
+In this section, we will introduce three word embedding models: the N-gram model, CBOW, and Skip-gram, all of which model the probability of a word given its context.
+
+For the N-gram model, we will first introduce the concept of a language model, and then implement it using PaddlePaddle in section [Model Training](#Model Training).
+
+The latter two models, which became popular recently, are neural word embedding models developed by Tomas Mikolov and his team at Google \[[3](#References)\]. Despite their apparent simplicity, these models train very well.
+
+### Language Model
+
+Before diving into word embedding models, we will first introduce the concept of a **language model**. Language models build the joint probability function $P(w_1, ..., w_T)$ of a sentence, where $w_i$ is the $i$-th word in the sentence. The goal is to assign higher probabilities to meaningful sentences and lower probabilities to meaningless constructions.
+
+In general, models that assign a probability to a sequence can be applied in many fields, such as machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition. Take information retrieval, for example: if you search for "how long is a football bame" (where "bame" is a medical noun), the search engine will ask whether you meant "how long is a football game" instead. This is because, according to the language model, the probability of "how long is a football bame" is very low; moreover, among all of the words easily confused with "bame", "game" builds the most probable sentence.
+
+#### Target Probability
+The language model's target probability is $P(w_1, ..., w_T)$. If the words in a sentence were independent, the joint probability of the whole sentence would be the product of each word's probability:
+
+$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$
+
+However, the probability of each word in a sentence typically depends on the words before it, so canonical language models construct the target probability from conditional probabilities:
+
+$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
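+For example, for a three-word sentence, the factorization unfolds as:
+
+$$P(w_1, w_2, w_3) = P(w_1)P(w_2 | w_1)P(w_3 | w_1, w_2)$$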
+
+
+### N-gram neural model
+
+In computational linguistics, the n-gram is an important method for representing text. An n-gram is a contiguous sequence of $n$ consecutive items from a given text; depending on the application, each item can be a letter, a syllable, or a word. The n-gram model is also an important method in statistical language modeling. When training a language model with n-grams, the first $n-1$ words of each n-gram are used to predict the $n$-th word.
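+As an illustration, the following sketch turns a tokenized sentence into (context, target) training pairs, where the first $n-1$ words predict the $n$-th:
+
+```python
+def ngram_training_pairs(tokens, n):
+    """(context, target) pairs: the first n-1 words predict the n-th."""
+    return [(tuple(tokens[i:i + n - 1]), tokens[i + n - 1])
+            for i in range(len(tokens) - n + 1)]
+
+print(ngram_training_pairs("the cat sat on the mat".split(), 3))
+# [(('the', 'cat'), 'sat'), (('cat', 'sat'), 'on'),
+#  (('sat', 'on'), 'the'), (('on', 'the'), 'mat')]
+```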
+
+Yoshua Bengio and his colleagues described how to train a word embedding model using a neural network in the famous 2003 paper A Neural Probabilistic Language Model \[[1](#References)\]. The Neural Network Language Model (NNLM) described in the paper learns the language model and the word embedding simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on a large corpus, the model learns the word embeddings and then computes the probability of whole sentences using those embeddings. This type of language model can overcome the **curse of dimensionality**, i.e., model inaccuracy caused by the mismatch between training and testing data. Note that the term *neural network language model* is ill-defined, so we do not use the name NNLM here, but refer to the model as the *N-gram neural model*.
+
+We have previously described the language model using conditional probabilities, where the probability of the $t$-th word in a sentence depends on all $t-1$ words before it. Furthermore, since words further back have less influence on a word, and each word within an n-gram is only affected by its previous $n-1$ words, we have:
+
+$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$
+
+Given some real corpus in which all sentences are meaningful, the n-gram model should maximize the following objective function:
+
+$$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
+
+where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ represents the conditional probability of the current word $w_t$ given its previous $n-1$ words, and $R(\theta)$ is the regularization term.
+
+
+
+ Figure 2. N-gram neural network model
+
+
+
+Figure 2 shows the N-gram neural network model. From the bottom up, the model has the following components:
+
+ - For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs, for each of the $|V|$ words in the dictionary, the probability that it is the $t$-th word.
+
+   Every input word $w_{t-n+1},...w_{t-1}$ is first transformed into a word embedding $C(w_{t-n+1}),...C(w_{t-1})$ through a transformation matrix.
+
+ - All the word embeddings are concatenated into a single vector, which is mapped (nonlinearly) to a hidden representation of the $t$-th word:
+
+   $$g=U\tanh(\theta^Tx + b_1) + Wx + b_2$$
+
+   where $x$ is the large vector concatenated from all the word embeddings representing the context; $\theta$, $U$, $b_1$, $b_2$ and $W$ are parameters connecting the word embedding layer to the hidden and output layers. $g$ is the vector of unnormalized probabilities over the output words, and $g_i$ is the unnormalized probability of the output word being the $i$-th word in the dictionary.
+
+ - Based on the definition of softmax, with $g_i$ normalized, the probability that the output word is $w_t$ is:
+
+   $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_{i=1}^{|V|} e^{g_i}}$$
+
+ - The cost of the entire network is the multi-class cross-entropy, given by the following loss function (a minimal numpy sketch of the whole forward pass follows this list):
+
+   $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log\big(\text{softmax}(g_k^{i})\big)$$
+
+   where $y_k^i$ is the true label ($0$ or $1$) for the $k$-th class of the $i$-th sample, and $\text{softmax}(g_k^i)$ is the softmax probability of the $k$-th class of the $i$-th sample.
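+The promised sketch: a minimal numpy forward pass under the definitions above. All dimensions and the random initialization are assumptions for illustration, not the configuration used later in this chapter:
+
+```python
+import numpy as np
+
+V, m, n, h = 1000, 32, 5, 64              # vocab size, embedding dim, n-gram size, hidden size
+rng = np.random.RandomState(0)
+
+C = rng.randn(V, m) * 0.01                # word embedding matrix
+theta = rng.randn((n - 1) * m, h) * 0.01  # embedding-to-hidden weights
+b1 = np.zeros(h)
+U = rng.randn(V, h) * 0.01                # hidden-to-output weights
+W = rng.randn(V, (n - 1) * m) * 0.01      # direct embedding-to-output weights
+b2 = np.zeros(V)
+
+def forward(context_ids):
+    """P(w_t | previous n-1 words), given their dictionary indices."""
+    x = C[context_ids].reshape(-1)        # concatenated context embeddings
+    g = U.dot(np.tanh(x.dot(theta) + b1)) + W.dot(x) + b2
+    e = np.exp(g - g.max())               # numerically stable softmax
+    return e / e.sum()
+
+p = forward([3, 17, 42, 8])               # four context words for a 5-gram
+print(p.shape, p.sum())                   # (1000,) 1.0
+```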
+
+### Continuous Bag-of-Words model (CBOW)
+
+The CBOW model predicts the current word from the $N$ words both before and after it. When $N=2$, the model is as shown in the figure below:
+
+
+
+ Figure 3. CBOW model
+
+
+Specifically, ignoring the order of words in the sequence, CBOW uses the average of the context words' embeddings to predict the current word:
+
+$$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
+
+where $x_t$ is the word embedding of the $t$-th word, the classification score vector is $z = U \cdot \text{context}$, the final classification $y$ uses softmax, and the loss function is the multi-class cross-entropy.
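+Under these definitions, a minimal numpy sketch of the CBOW scoring step (all dimensions and values assumed for illustration) looks like this:
+
+```python
+import numpy as np
+
+V, m = 1000, 32                     # vocab size, embedding dimension
+rng = np.random.RandomState(0)
+C = rng.randn(V, m) * 0.01          # word embedding matrix
+U = rng.randn(V, m) * 0.01          # output projection matrix
+
+def cbow_probs(context_ids):
+    """Predict the middle word from the average of its context embeddings."""
+    context = C[context_ids].mean(axis=0)  # order of context words is ignored
+    z = U.dot(context)                     # classification score vector
+    e = np.exp(z - z.max())
+    return e / e.sum()                     # softmax over the vocabulary
+
+p = cbow_probs([5, 9, 14, 21])      # ids of w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}
+print(p.argmax())                   # id of the most probable middle word
+```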
+
+### Skip-gram model
+
+The advantage of CBOW is that it smooths over the word embeddings of the context, reducing noise, so it is very effective on small datasets. Skip-gram uses a word to predict its context, yielding multiple context words for each given word, so it can exploit larger datasets.
+
+
+
+ Figure 4. Skip-gram model
+
+
+As illustrated in the figure above, the skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including the $n$ words before and the $n$ words after the given word), and then combines the classification losses of all those $2n$ words via softmax.
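+The following sketch shows how skip-gram turns a tokenized sentence into (center word, context word) training pairs, with the window size $n$ assumed to be 2:
+
+```python
+def skipgram_pairs(tokens, n=2):
+    """(center, context) pairs: each word predicts up to 2n neighbours."""
+    pairs = []
+    for i, center in enumerate(tokens):
+        for j in range(max(0, i - n), min(len(tokens), i + n + 1)):
+            if j != i:
+                pairs.append((center, tokens[j]))
+    return pairs
+
+print(skipgram_pairs("the cat sat on the mat".split()))
+# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ...]
+```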
+
+## Data Preparation
+
+## Model Configuration
+
+
+ Figure 5. N-gram neural network model in model configuration
+
+
+
+## Model Training
+
+## Model Application
+
+## Conclusion
+
+This chapter introduced word embeddings, the relationship between language models and word embeddings, and how to train neural networks to learn word embeddings.
+
+In information retrieval, the relevance between a query and the keywords of a document can be computed through the cosine similarity of their word embeddings. In syntactic and semantic analysis, previously trained word embeddings can initialize models for better performance. In document classification, clustering word embeddings can group synonyms in the documents. We hope that readers can use word embedding models in their work after reading this chapter.
+
+
+## References
+1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. Journal of Machine Learning Research, 2003, 3(Feb): 1137-1155.
+2. Mikolov T, Kombrink S, Deoras A, et al. [RNNLM - recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201.
+3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013.
+4. Maaten L, Hinton G. [Visualizing data using t-SNE](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf)[J]. Journal of Machine Learning Research, 2008, 9(Nov): 2579-2605.
+5. [Singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition). Wikipedia.
+
+
+This tutorial was created by PaddlePaddle and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+