Commit 53a6faaf, author: gongweibao

Merge remote-tracking branch 'upstream/develop' into develop

deprecated
*~
pandoc.template pandoc.template
.DS_Store .DS_Store
\ No newline at end of file
- repo: https://github.com/Lucas-C/pre-commit-hooks.git
sha: c25201a00e6b0514370501050cf2a8538ac12270
hooks:
- id: remove-crlf
- repo: https://github.com/reyoung/mirrors-yapf.git - repo: https://github.com/reyoung/mirrors-yapf.git
sha: v0.13.2 sha: v0.13.2
hooks: hooks:
- id: yapf - id: yapf
files: (.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$ # Bazel BUILD files follow Python syntax. files: (.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$ # Bazel BUILD files follow Python syntax.
- repo: https://github.com/pre-commit/pre-commit-hooks - repo: https://github.com/pre-commit/pre-commit-hooks
sha: 7539d8bd1a00a3c1bfd34cdb606d3a6372e83469 sha: v0.7.1
hooks: hooks:
- id: check-merge-conflict - id: check-merge-conflict
- id: check-symlinks - id: check-symlinks
- id: detect-private-key - id: detect-private-key
- id: end-of-file-fixer - id: end-of-file-fixer
files: \.md$
- id: trailing-whitespace
files: \.md$
- repo: git://github.com/Lucas-C/pre-commit-hooks
sha: v1.0.1
hooks:
- id: forbid-crlf
files: \.md$
- id: remove-crlf
files: \.md$
- id: forbid-tabs
files: \.md$
- id: remove-tabs
files: \.md$
- repo: local
hooks:
- id: convert-markdown-into-html
name: convert-markdown-into-html
description: "Convert README.md into index.html and README.en.md into index.en.html"
entry: python pre-commit-hooks/convert_markdown_into_html.py
language: system
files: \.md$
...@@ -1093,7 +1093,7 @@ function escape(html, encode) {
}
function unescape(html) {
  // explicitly match decimal, hex, and named HTML entities
  return html.replace(/&(#(?:\d+)|(?:#x[0-9A-Fa-f]+)|(?:\w+));?/g, function(_, n) {
    n = n.toLowerCase();
    if (n === 'colon') return ':';
......
#!/bin/bash
for i in $(du -a | grep '\.\/.\+\/README.md' | cut -f 2); do
.tmpl/convert-markdown-into-html.sh $i > $(dirname $i)/index.html
done
for i in $(du -a | grep '\.\/.\+\/README.en.md' | cut -f 2); do
.tmpl/convert-markdown-into-html.sh $i > $(dirname $i)/index.en.html
done
# Linear Regression

Let us begin the tutorial with a classical problem called Linear Regression \[[1](#References)\]. In this chapter, we will train a model from a realistic dataset to predict home prices. Some important concepts in Machine Learning will be covered through this example.

The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/fit_a_line). For instructions on getting started with PaddlePaddle, see the [PaddlePaddle installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).

## Problem Setup

Suppose we have a dataset of $n$ real estate properties. These real estate properties will be referred to as *homes* in this chapter for clarity.

Each home is associated with $d$ attributes. The attributes describe characteristics such as the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic condition nearby.

In our problem setup, the attribute $x_{i,j}$ denotes the $j$th characteristic of the $i$th home. In addition, $y_i$ denotes the price of the $i$th home. Our task is to predict $y_i$ given a set of attributes $\{x_{i,1}, ..., x_{i,d}\}$. We assume that the price of a home is a linear combination of all of its attributes, namely,

$$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, \quad i=1,\ldots,n$$

where $\vec{\omega}$ and $b$ are the model parameters we want to estimate. Once they are learned, we will be able to predict the price of a home, given the attributes associated with it. We call this model **Linear Regression**. In other words, we want to regress a value against several values linearly. In practice, a linear model is often too simplistic to capture the real relationships between the variables. Yet, because Linear Regression is easy to train and analyze, it has been applied to a large number of real problems. As a result, it is an important topic in many classic Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\].
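As a rough illustration of the formula above, the sketch below computes a predicted price as a weighted sum of attributes plus a bias. The function name `predict_price` and all numeric values are made up for illustration; they do not come from the tutorial's code or dataset.

```python
def predict_price(attributes, weights, bias):
    """Return sum_j weights[j] * attributes[j] + bias for a single home."""
    if len(attributes) != len(weights):
        raise ValueError("attributes and weights must have the same length")
    return sum(w * x for w, x in zip(weights, attributes)) + bias


# Hypothetical home with d = 3 attributes and made-up parameters.
home = [6.0, 2.0, 1.0]
weights = [3.0, 1.5, -2.0]
bias = 4.0

price = predict_price(home, weights, bias)
print(price)
```

Learning the model amounts to choosing `weights` and `bias` so that such predictions match the observed prices as closely as possible.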
## Results Demonstration

We first show the result of our model. The [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) is used to train a linear model to predict home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of similar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. The dotted line represents points where $X=Y$. When reading the diagram, the more accurate the model's predictions, the closer the points lie to the dotted line.

<p align="center">
<img src = "image/predictions_en.png" width=400><br/>
Figure 1. Predicted Value vs. Actual Value
</p>
## Model Overview

### Model Definition

In the UCI Housing Data Set, there are 13 home attributes $\{x_{i,j}\}$ that are related to the median home price $y_i$, which we aim to predict. Thus, our model can be written as:

$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$

where $\hat{Y}$ denotes the predicted value, to distinguish it from the actual value $Y$. The model learns the parameters $\omega_1, \ldots, \omega_{13}, b$, where the entries of $\vec{\omega}$ are the **weights** and $b$ is the **bias**.

Now we need an objective to optimize, so that the learned parameters make $\hat{Y}$ as close to $Y$ as possible. For this we introduce the concept of a [Loss Function (Cost Function)](https://en.wikipedia.org/wiki/Loss_function). Given any pair of an actual value $y_i$ and a predicted value $\hat{y_i}$, a loss function must output a non-negative value. This value reflects the magnitude of the model error.

For Linear Regression, the most common loss function is the [Mean Square Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error), which has the following form:

$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$

That is, for a dataset of size $n$, MSE is the average of the squared prediction errors.
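The MSE definition can be checked on a few numbers by hand. The following is a minimal sketch in plain Python; the helper name `mean_squared_error` and the sample values are illustrative assumptions, not part of the tutorial's code.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and actual values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n


# Made-up predictions and targets: the squared errors are 1, 0 and 4.
predicted = [2.0, 4.0, 6.0]
actual = [1.0, 4.0, 8.0]

mse = mean_squared_error(predicted, actual)
print(mse)
```

Because every term is squared, the result is non-negative, and it is zero only when every prediction equals its target.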
### Training

After setting up our model, there are several major steps to go through to train it:

1. Initialize the parameters, including the weights $\vec{\omega}$ and the bias $b$. For example, we can sample them from a distribution with mean $0$ and standard deviation $1$.
2. Feedforward. Evaluate the network output and compute the corresponding loss.
3. [Backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors are propagated from the output layer back to the input layer, and the model parameters are updated according to the corresponding errors.
4. Repeat steps 2 and 3 until the loss is below a predefined threshold or the maximum number of repeats is reached.
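The four steps above can be sketched without any framework. The example below assumes plain full-batch gradient descent on the MSE loss; the function name, the learning rate, the stopping values, and the toy data are all illustrative choices, and this is not how PaddlePaddle implements its trainer.

```python
import random


def train_linear_regression(xs, ys, learning_rate=0.05, max_passes=2000,
                            loss_threshold=1e-6, seed=0):
    """Fit y ~ w . x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    # Step 1: initialize weights and bias from a normal distribution N(0, 1).
    weights = [rng.gauss(0.0, 1.0) for _ in range(d)]
    bias = rng.gauss(0.0, 1.0)
    loss = float("inf")
    for _ in range(max_passes):
        # Step 2: forward pass, compute predictions and the MSE loss.
        preds = [sum(w * x for w, x in zip(weights, row)) + bias for row in xs]
        errors = [p - y for p, y in zip(preds, ys)]
        loss = sum(e * e for e in errors) / n
        # Step 4: stop once the loss is below the threshold.
        if loss < loss_threshold:
            break
        # Step 3: gradients of the MSE with respect to each parameter,
        # followed by the parameter update.
        grad_w = [2.0 / n * sum(e * row[j] for e, row in zip(errors, xs))
                  for j in range(d)]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias, loss


# Toy data generated from y = 2 * x + 1 (one attribute per sample).
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]

weights, bias, loss = train_linear_regression(xs, ys)
print(weights, bias, loss)
```

For a single linear layer, "backpropagation" reduces to the two gradient expressions in step 3; deeper networks apply the same chain rule layer by layer.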
## Dataset

### Python Dataset Modules

Our program starts with importing necessary packages:

```python
import paddle.v2 as paddle
import paddle.v2.dataset.uci_housing as uci_housing
```

We encapsulated the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) in our Python module `uci_housing`. This module can

1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if it has not been downloaded yet, and
2. [preprocess](#preprocessing) the dataset.
### An Introduction of the Dataset

The UCI housing dataset has 506 instances. Each instance describes the attributes of a house in suburban Boston. The attributes are explained below:

| Attribute Name | Characteristic | Data Type |
| ------| ------ | ------ |
| CRIM | per capita crime rate by town | Continuous|
| ZN | proportion of residential land zoned for lots over 25,000 sq.ft. | Continuous |
...@@ -70,113 +84,115 @@ The dataset contains 506 lines in total, each line describing the properties and
| LSTAT | % lower status of the population | Continuous |
| MEDV | Median value of owner-occupied homes in $1000's | Continuous |

The last entry is the median home price.
### Preprocessing

#### Continuous and Discrete Data

We define a feature vector of length 13 for each home, where each entry corresponds to an attribute. Our first observation is that, among the 13 dimensions, there are 12 continuous dimensions and 1 discrete dimension.

Note that although a discrete value is also written as a number such as 0, 1, or 2, its meaning differs drastically from that of a continuous value: the numeric difference between two discrete values has no meaning. For example, suppose $0$, $1$, and $2$ are used to represent the colors *Red*, *Green*, and *Blue* respectively. Judging from the numeric representation, *Red* differs more from *Blue* than it does from *Green*. Yet it is not true that *Blue* is more different from *Red* than *Green* is. Therefore, when handling a discrete feature that has $d$ possible values, we usually convert it to $d$ new features, each taking a binary value, $0$ or $1$, indicating whether the corresponding original value is absent or present. Alternatively, the discrete feature can be mapped onto a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary discrete value, we do not need to do any preprocessing.
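The conversion of a discrete feature with $d$ possible values into $d$ binary features is commonly called one-hot encoding. A minimal sketch using the color example follows; the helper name `one_hot` is an illustrative assumption and is not part of the tutorial's code.

```python
def one_hot(value, categories):
    """Encode `value` as a list of 0/1 flags, one flag per known category."""
    if value not in categories:
        raise ValueError("unknown category: %r" % (value,))
    return [1 if value == c else 0 for c in categories]


colors = ["red", "green", "blue"]

encoded = one_hot("green", colors)
print(encoded)
```

In this representation every pair of distinct colors differs in exactly two positions, so no color is artificially "closer" to another.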
#### Feature Normalization

We also observe a huge difference among the value ranges of the 13 features (Figure 2). For instance, the values of feature *B* fall in $[0.32, 396.90]$, whereas those of feature *NOX* fall in $[0.3850, 0.8170]$. Effective optimization requires data normalization. The goal of data normalization is to scale the values of each feature into roughly the same range, perhaps $[-0.5, 0.5]$. Here, we adopt a popular normalization technique where we subtract the mean value from the feature value and divide the result by the width of the original range.

There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling):

- A value range that is too large or too small might cause floating number overflow or underflow during computation.
- Different value ranges can make features appear to differ in *importance* to the model (at least at the beginning of the training process). Such an implicit assumption about the data is often unreasonable; it makes the optimization harder, which in turn increases training time.
- Many machine learning techniques or models (e.g., *L1/L2 regularization* and *Vector Space Model*) assume that all the features have roughly zero means and that their value ranges are similar.

<p align="center">
<img src = "image/ranges_en.png" width=550><br/>
Figure 2. The value ranges of the features
</p>
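The normalization described in this section (subtract the mean, then divide by the width of the original range) can be sketched for a single feature column as follows. The function name `normalize_feature` and the sample values are illustrative assumptions; the handling of a constant column is a choice made for this sketch only.

```python
def normalize_feature(values):
    """Scale one feature column as (value - mean) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    mean = sum(values) / len(values)
    return [(v - mean) / (hi - lo) for v in values]


# Made-up raw values for one feature.
raw = [10.0, 20.0, 30.0, 40.0]

scaled = normalize_feature(raw)
print(scaled)
```

After scaling, the column has zero mean and its values span a range of width 1, regardless of the original units.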
#### Prepare Training and Test Sets

We split the dataset in two, one for adjusting the model parameters, namely, for model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$.

When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process, such as the number of layers in the network. Because hyperparameters are not part of the model parameters, they cannot be trained using the same loss function. Thus we will try several sets of hyperparameters to train several models and cross-validate them on the validation set to pick the best one; finally, the selected trained model is tested on the test set. Because our model is relatively simple, we will omit this validation process.
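An $8:2$ split can be sketched as below. This is only an illustration of the idea: the function name `split_dataset`, the shuffling step, and the fixed seed are assumptions of this sketch, and the `uci_housing` module may split the data differently.

```python
import random


def split_dataset(rows, train_ratio=0.8, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 506 instances of the dataset (indices only).
rows = list(range(506))

train_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(test_rows))
```

Every instance ends up in exactly one of the two subsets, so the test error is measured on data the model never saw during training.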
## Training

`fit_a_line/trainer.py` demonstrates the training using [PaddlePaddle](http://paddlepaddle.org).

### Initialize PaddlePaddle

```python
paddle.init(use_gpu=False, trainer_count=1)
```
### Model Configuration

Linear regression is essentially a fully-connected layer with linear activation:

```python
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
y_predict = paddle.layer.fc(input=x,
                            size=1,
                            act=paddle.activation.Linear())
y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
cost = paddle.layer.regression_cost(input=y_predict, label=y)
```

### Create Parameters

```python
parameters = paddle.parameters.create(cost)
```
### Create Trainer

```python
optimizer = paddle.optimizer.Momentum(momentum=0)

trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=optimizer)
```
### Feeding Data

PaddlePaddle provides the [reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader) for loading training data. A reader may return multiple columns, and we need a Python dictionary to specify the mapping from column index to data layers.

```python
feeding={'x': 0, 'y': 1}
```

Moreover, an event handler is provided to print the training progress:

```python
# event_handler to print training and testing info
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print("Pass %d, Batch %d, Cost %f" % (
                event.pass_id, event.batch_id, event.cost))

    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(
            reader=paddle.batch(
                uci_housing.test(), batch_size=2),
            feeding=feeding)
        print("Test %d, Cost %f" % (event.pass_id, result.cost))
```
### Start Training

```python
trainer.train(
    reader=paddle.batch(
        paddle.reader.shuffle(
            uci_housing.train(), buf_size=500),
        batch_size=2),
    feeding=feeding,
    event_handler=event_handler,
    num_passes=30)
```
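To make the nesting of a shuffling reader inside a batching reader easier to follow, here is a plain-Python sketch of the general idea: a reader is a function that returns an iterator over samples, and decorators wrap one reader into another. The names `toy_reader`, `shuffle_reader`, and `batch_reader` are hypothetical, and this sketch is not PaddlePaddle's implementation.

```python
import random


def toy_reader():
    """Hypothetical reader: yields (features, label) samples one at a time."""
    for i in range(5):
        yield [float(i)], float(2 * i + 1)


def shuffle_reader(reader, buf_size, seed=0):
    """Wrap a reader so that samples are shuffled within buffers of buf_size."""
    def wrapped():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for s in buf:
                    yield s
                buf = []
        # Flush whatever is left in the final, partially filled buffer.
        rng.shuffle(buf)
        for s in buf:
            yield s
    return wrapped


def batch_reader(reader, batch_size):
    """Wrap a reader so that it yields lists of up to batch_size samples."""
    def wrapped():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    return wrapped


batched = batch_reader(shuffle_reader(toy_reader, buf_size=500), batch_size=2)
batches = list(batched())
print([len(b) for b in batches])
```

In each sample, column 0 holds the features and column 1 holds the label, which is the kind of column-to-layer mapping a dictionary such as `{'x': 0, 'y': 1}` expresses.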
## Summary

This chapter introduces *Linear Regression* and how to train and test this model with PaddlePaddle, using the UCI Housing Data Set. Because a large number of more complex models and techniques are derived from linear regression, it is important to understand its underlying theory and limitations.

## References

...@@ -186,4 +202,4 @@ In this chapter, we have introduced the Linear Regression model using the UCI Ho
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.

<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Common Creative License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> This tutorial was created and published under the [Creative Commons BY-NC-SA 4.0 License](http://creativecommons.org/licenses/by-nc-sa/4.0/).
...@@ -39,16 +39,16 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
### Training Process

After defining the model structure, we train the model through the following steps:

1. Initialize the parameters, including the weights $\omega_i$ and the bias $b$ (for example, with zero mean and unit variance).
2. Run forward propagation to compute the network output and the loss function.
3. Backpropagate the error according to the loss function ([backpropagation](https://en.wikipedia.org/wiki/Backpropagation)), passing the network error from the output layer back through the preceding layers and updating the parameters in the network.
4. Repeat steps 2 and 3 until the training error reaches a specified level or the number of training passes reaches a set value.

## Dataset

### Dataset Interface Encapsulation

First, load the required packages:

```python
import paddle.v2 as paddle
...@@ -59,9 +59,8 @@ import paddle.v2.dataset.uci_housing as uci_housing
The `uci_housing` module encapsulates:

1. The data download process. The downloaded data is saved to ~/.cache/paddle/dataset/uci_housing/housing.data.
2. The [data preprocessing](#数据预处理) process.

### Dataset Introduction
...@@ -105,25 +104,23 @@ import paddle.v2.dataset.uci_housing as uci_housing ...@@ -105,25 +104,23 @@ import paddle.v2.dataset.uci_housing as uci_housing
我们将数据集分割为两份:一份用于调整模型的参数,即进行模型的训练,模型在这份数据集上的误差被称为**训练误差**;另外一份被用来测试,模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据,所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素:更多的训练数据会降低参数估计的方差,从而得到更可信的模型;而更多的测试数据会降低测试误差的方差,从而得到更可信的测试误差。我们这个例子中设置的分割比例为$8:2$ 我们将数据集分割为两份:一份用于调整模型的参数,即进行模型的训练,模型在这份数据集上的误差被称为**训练误差**;另外一份被用来测试,模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据,所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素:更多的训练数据会降低参数估计的方差,从而得到更可信的模型;而更多的测试数据会降低测试误差的方差,从而得到更可信的测试误差。我们这个例子中设置的分割比例为$8:2$
在更复杂的模型训练过程中,我们往往还会多使用一种数据集:验证集。因为复杂的模型中常常还有一些超参数([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization))需要调节,所以我们会尝试多种超参数的组合来分别训练多个模型,然后对比它们在验证集上的表现选择相对最好的一组超参数,最后才使用这组参数下训练的模型在测试集上评估测试误差。由于本章训练的模型比较简单,我们暂且忽略掉这个过程。 在更复杂的模型训练过程中,我们往往还会多使用一种数据集:验证集。因为复杂的模型中常常还有一些超参数([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization))需要调节,所以我们会尝试多种超参数的组合来分别训练多个模型,然后对比它们在验证集上的表现选择相对最好的一组超参数,最后才使用这组参数下训练的模型在测试集上评估测试误差。由于本章训练的模型比较简单,我们暂且忽略掉这个过程。
## 训练 ## 训练
fit_a_line下trainer.py演示了训练的整体过程
### 初始化paddlepaddle `fit_a_line/trainer.py`演示了训练的整体过程。
### 初始化PaddlePaddle
```python ```python
# init
paddle.init(use_gpu=False, trainer_count=1) paddle.init(use_gpu=False, trainer_count=1)
``` ```
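### Model Configuration ### Model Configuration
Use `fc_layer` and `LinearActivation` to represent the linear regression model itself. A linear regression model is simply a fully-connected layer (`fc_layer`) with a linear activation function (`LinearActivation`):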
```python ```python
# input data: 13-dimensional housing information
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13)) x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
y_predict = paddle.layer.fc(input=x, y_predict = paddle.layer.fc(input=x,
size=1, size=1,
...@@ -131,17 +128,15 @@ y_predict = paddle.layer.fc(input=x, ...@@ -131,17 +128,15 @@ y_predict = paddle.layer.fc(input=x,
y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1)) y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
cost = paddle.layer.regression_cost(input=y_predict, label=y) cost = paddle.layer.regression_cost(input=y_predict, label=y)
``` ```
### Create Parameters ### Create Parameters
```python ```python
# create parameters
parameters = paddle.parameters.create(cost) parameters = paddle.parameters.create(cost)
``` ```
### Create trainer ### Create Trainer
```python ```python
# create optimizer
optimizer = paddle.optimizer.Momentum(momentum=0) optimizer = paddle.optimizer.Momentum(momentum=0)
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(cost=cost,
...@@ -149,14 +144,20 @@ trainer = paddle.trainer.SGD(cost=cost, ...@@ -149,14 +144,20 @@ trainer = paddle.trainer.SGD(cost=cost,
update_equation=optimizer) update_equation=optimizer)
``` ```
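### Read Data and Print Intermediate Training Information ### Read Data and Print Intermediate Training Information
In the program, we obtain training or test data through the reader interface and print intermediate training information through the event handler
feeding sets the indices of the training and test data, and the reader uses these indices to tell them apart. PaddlePaddle provides a
[reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
for reading data. The data a reader returns can contain multiple columns, and we need a Python dict to map column
indices to the data layers in the network.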
```python ```python
feeding={'x': 0, feeding={'x': 0, 'y': 1}
'y': 1} ```
In addition, we can provide an event handler to print the training progress:
```python
# event_handler to print training and testing info # event_handler to print training and testing info
def event_handler(event): def event_handler(event):
if isinstance(event, paddle.event.EndIteration): if isinstance(event, paddle.event.EndIteration):
...@@ -171,10 +172,10 @@ def event_handler(event): ...@@ -171,10 +172,10 @@ def event_handler(event):
feeding=feeding) feeding=feeding)
print "Test %d, Cost %f" % (event.pass_id, result.cost) print "Test %d, Cost %f" % (event.pass_id, result.cost)
``` ```
### Start Training
### Start Training
```python ```python
# training
trainer.train( trainer.train(
reader=paddle.batch( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
...@@ -185,13 +186,6 @@ trainer.train( ...@@ -185,13 +186,6 @@ trainer.train(
num_passes=30) num_passes=30)
``` ```
## Run the Training Program in bash
**Make sure the path to the paddle installation package is set correctly**
```bash
python train.py
```
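## Summary ## Summary
In this chapter, using the Boston housing price dataset, we introduced the basic concepts of the linear regression model and how to train and test it with PaddlePaddle. Many models and techniques evolve from the simple linear regression model, so it is important to understand the principles and limitations of linear models. In this chapter, using the Boston housing price dataset, we introduced the basic concepts of the linear regression model and how to train and test it with PaddlePaddle. Many models and techniques evolve from the simple linear regression model, so it is important to understand the principles and limitations of linear models.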
......
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -40,61 +41,75 @@ ...@@ -40,61 +41,75 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Linear Regression # Linear Regression
Let us begin the tutorial with a classical problem called Linear Regression \[[1](#References)\]. In this chapter, we will train a model from a realistic dataset to predict house prices. Some important concepts in Machine Learning will be covered through this example. Let us begin the tutorial with a classical problem called Linear Regression \[[1](#References)\]. In this chapter, we will train a model from a realistic dataset to predict home prices. Some important concepts in Machine Learning will be covered through this example.
The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/fit_a_line). For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
The source code for this tutorial is at [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/fit_a_line). If this is your first time using PaddlePaddle, please refer to the [Install Guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html). ## Problem Setup
Suppose we have a dataset of $n$ real estate properties. These real estate properties will be referred to as *homes* in this chapter for clarity.
## Problem Each home is associated with $d$ attributes. The attributes describe characteristics such as the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic condition nearby.
Suppose we have a dataset of $n$ houses. Each house $i$ has $d$ properties and the price $y_i$. A property $x_{i,d}$ describes one aspect of the house, for example, the number of rooms in the house, the number of schools or hospitals in the neighborhood, the nearby traffic condition, etc. Our task is to predict $y_i$ given a set of properties $\{x_{i,1}, ..., x_{i,d}\}$. We assume that the price is a linear combination of all the properties, i.e.,
In our problem setup, the attribute $x_{i,j}$ denotes the $j$th characteristic of the $i$th home. In addition, $y_i$ denotes the price of the $i$th home. Our task is to predict $y_i$ given a set of attributes $\{x_{i,1}, ..., x_{i,d}\}$. We assume that the price of a home is a linear combination of all of its attributes, namely,
$$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, i=1,\ldots,n$$ $$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, i=1,\ldots,n$$
where $\omega_{d}$ and $b$ are the model parameters we want to estimate. Once they are learned, given a set of properties of a house, we will be able to predict a price for that house. The model we have here is called Linear Regression, namely, we want to regress a value as a linear combination of several values. In practice this linear model for our problem is hardly true, because the real relationship between the house properties and the price is much more complicated. However, due to its simple formulation which makes the model training and analysis easy, Linear Regression has been applied to lots of real problems. It is always an important topic in many classical Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\]. where $\vec{\omega}$ and $b$ are the model parameters we want to estimate. Once they are learned, we will be able to predict the price of a home, given the attributes associated with it. We call this model **Linear Regression**. In other words, we want to regress a value against several values linearly. In practice, a linear model is often too simplistic to capture the real relationships between the variables. Yet, because Linear Regression is easy to train and analyze, it has been applied to a large number of real problems. As a result, it is an important topic in many classic Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\].
## Results Demonstration ## Results Demonstration
We first show the training result of our model. We use the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) to train a linear model and predict the house prices in Boston. The figure below shows the predictions the model makes for some house prices. The $X$ coordinate of each point represents the median value of the prices of a certain type of houses, while the $Y$ coordinate represents the predicted value by our linear model. When $X=Y$, the point lies exactly on the dotted line. In other words, the more precise the model predicts, the closer the point is to the dotted line. We first show the result of our model. The [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) is used to train a linear model to predict the home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of similar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. The dotted line represents points where $X=Y$. The more precisely the model predicts, the closer the point is to the dotted line.
<p align="center"> <p align="center">
<img src = "image/predictions_en.png" width=400><br/> <img src = "image/predictions_en.png" width=400><br/>
Figure 1. Predicted Value V.S. Actual Value Figure 1. Predicted Value V.S. Actual Value
</p> </p>
## Model Overview ## Model Overview
### Model Definition ### Model Definition
In the UCI Housing Data Set, there are 13 house properties $x_{i,d}$ that are related to the median house price $y_i$. Thus our model is: In the UCI Housing Data Set, there are 13 home attributes $\{x_{i,j}\}$ that are related to the median home price $y_i$, which we aim to predict. Thus, our model can be written as:
$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$ $$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$
where $\hat{Y}$ is the predicted value used to differentiate from the actual value $Y$. The model parameters to be learned are: $\omega_1, \ldots, \omega_{13}, b$, where $\omega$ are called the weights and $b$ is called the bias. where $\hat{Y}$ is the predicted value used to differentiate from actual value $Y$. The model learns parameters $\omega_1, \ldots, \omega_{13}, b$, where the entries of $\vec{\omega}$ are **weights** and $b$ is **bias**.
Now we need an optimization goal, so that with the learned parameters, $\hat{Y}$ is close to $Y$ as much as possible. Here we introduce the concept of [Loss Function (Cost Function)](https://en.wikipedia.org/wiki/Loss_function). The Loss Function has such property: given any pair of the actual value $y_i$ and the predicted value $\hat{y_i}$, its output is always non-negative. This non-negative value reflects the model error. Now we need an objective to optimize, so that the learned parameters can make $\hat{Y}$ as close to $Y$ as possible. Let's refer to the concept of [Loss Function (Cost Function)](https://en.wikipedia.org/wiki/Loss_function). A loss function must output a non-negative value, given any pair of the actual value $y_i$ and the predicted value $\hat{y_i}$. This value reflects the magnitude of the model error.
For Linear Regression, the most common Loss Function is [Mean Square Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) which has the following form: For Linear Regression, the most common loss function is [Mean Square Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) which has the following form:
$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
For a dataset of size $n$, MSE is the average value of the $n$ predicted errors. That is, for a dataset of size $n$, MSE is the average of the squared prediction errors.
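To make the formula concrete, here is a minimal sketch of MSE in plain Python. The function name `mean_squared_error` and the sample prices are made up for illustration; they are not part of the tutorial's code or of PaddlePaddle.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / float(n)


# Made-up predicted and actual prices (in $1000's), purely for illustration.
predicted_prices = [24.0, 21.0, 30.0]
actual_prices = [25.0, 21.0, 28.0]
example_mse = mean_squared_error(predicted_prices, actual_prices)
print(example_mse)
```

Squaring each error keeps every term non-negative, which is the property of a loss function described above.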
### Training ### Training
After defining our model, we have several major steps for the training: After setting up our model, there are several major steps to go through to train it:
1. Initialize the parameters including the weights $\omega$ and the bias $b$. For example, we can set their mean values as 0s, and their standard deviations as 1s. 1. Initialize the parameters including the weights $\vec{\omega}$ and the bias $b$. For example, we can set their mean values as $0$s, and their standard deviations as $1$s.
2. Feedforward to compute the network output and the Loss Function. 2. Feedforward. Evaluate the network output and compute the corresponding loss.
3. Backward to [backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors. 3. [Backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors.
4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of repeats is reached. 4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of repeats is reached.
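The four steps above can be sketched with plain gradient descent on a tiny linear model. This is an illustrative stand-in that assumes full-batch updates and a fixed learning rate; it is not how PaddlePaddle implements training, and all names (`train_linear_regression`, `learning_rate`, `loss_threshold`) are hypothetical.

```python
import random


def train_linear_regression(features, targets, learning_rate=0.1,
                            max_passes=5000, loss_threshold=1e-6, seed=0):
    """Fit y = w . x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    n = len(features)
    d = len(features[0])

    # Step 1: initialize weights and bias from a distribution with
    # mean 0 and standard deviation 1.
    weights = [rng.gauss(0.0, 1.0) for _ in range(d)]
    bias = rng.gauss(0.0, 1.0)
    loss = float("inf")

    for _ in range(max_passes):
        # Step 2: feedforward, then compute the loss (MSE).
        predictions = [sum(w * x for w, x in zip(weights, row)) + bias
                       for row in features]
        errors = [p - t for p, t in zip(predictions, targets)]
        loss = sum(e * e for e in errors) / float(n)

        # Step 4: stop once the loss is below the threshold.
        if loss < loss_threshold:
            break

        # Step 3: gradients of the MSE with respect to each parameter,
        # followed by the parameter update.
        grad_w = [2.0 / n * sum(e * row[j] for e, row in zip(errors, features))
                  for j in range(d)]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b

    return weights, bias, loss


# Toy data generated from y = 2x + 1, so the true parameters are known.
toy_features = [[0.0], [0.25], [0.5], [0.75], [1.0]]
toy_targets = [2.0 * row[0] + 1.0 for row in toy_features]
learned_weights, learned_bias, final_loss = train_linear_regression(
    toy_features, toy_targets)
print(learned_weights, learned_bias, final_loss)
```

For a single linear layer, "backpropagation" reduces to the two gradient expressions in step 3; deeper networks apply the chain rule layer by layer.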
## Data Preparation ## Dataset
Follow the command below to prepare data:
```bash ### Python Dataset Modules
cd data && python prepare_data.py
Our program starts with importing necessary packages:
```python
import paddle.v2 as paddle
import paddle.v2.dataset.uci_housing as uci_housing
``` ```
This line of code will download the dataset from the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) and perform some [preprocessing](#Preprocessing). The dataset is split into a training set and a test set.
The dataset contains 506 lines in total, each line describing the properties and the median price of a certain type of houses in Boston. The meaning of each line is below: We encapsulated the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) in our Python module `uci_housing`. This module can
1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if it is not already there, and
2. [preprocess](#preprocessing) the dataset.
| Property Name | Explanation | Data Type | ### An Introduction of the Dataset
The UCI housing dataset has 506 instances. Each instance describes the attributes of a house in suburban Boston. The attributes are explained below:
| Attribute Name | Characteristic | Data Type |
| ------| ------ | ------ | | ------| ------ | ------ |
| CRIM | per capita crime rate by town | Continuous| | CRIM | per capita crime rate by town | Continuous|
| ZN | proportion of residential land zoned for lots over 25,000 sq.ft. | Continuous | | ZN | proportion of residential land zoned for lots over 25,000 sq.ft. | Continuous |
...@@ -111,113 +126,115 @@ The dataset contains 506 lines in total, each line describing the properties and ...@@ -111,113 +126,115 @@ The dataset contains 506 lines in total, each line describing the properties and
| LSTAT | % lower status of the population | Continuous | | LSTAT | % lower status of the population | Continuous |
| MEDV | Median value of owner-occupied homes in $1000's | Continuous | | MEDV | Median value of owner-occupied homes in $1000's | Continuous |
The last entry is the median house price. The last entry is the median home price.
### Preprocessing ### Preprocessing
#### Continuous and Discrete Data #### Continuous and Discrete Data
We define a feature vector of length 13 for each house, where each entry of the feature vector corresponds to a property of that house. Our first observation is that among the 13 dimensions, there are 12 continuous dimensions and 1 discrete dimension. Note that although a discrete value is also written as digits such as 0, 1, or 2, it has a quite different meaning from a continuous value. The reason is that the difference between two discrete values has no practical meaning. For example, if we use 0, 1, and 2 to represent `red`, `green`, and `blue` respectively, although the numerical difference between `red` and `green` is smaller than that between `red` and `blue`, we cannot say that the extent to which `blue` is different from `red` is greater than the extent to which `green` is different from `red`. Therefore, when handling a discrete feature that has $d$ possible values, we will usually convert it to $d$ new features where each feature can only take 0 or 1, indicating whether the original $d$th value is present or not. Or we can map the discrete feature to a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary discrete value, we do not need to do any preprocessing. We define a feature vector of length 13 for each home, where each entry corresponds to an attribute. Our first observation is that, among the 13 dimensions, there are 12 continuous dimensions and 1 discrete dimension.
Note that although a discrete value is also written as a numeric value such as 0, 1, or 2, its meaning differs drastically from that of a continuous value. The linear difference between two discrete values has no meaning. For example, suppose $0$, $1$, and $2$ are used to represent the colors *Red*, *Green*, and *Blue* respectively. Judging from the numeric representation of these colors, *Red* differs more from *Blue* than it does from *Green*. Yet in actuality, it is not true that the extent to which *Blue* is different from *Red* is greater than the extent to which *Green* is different from *Red*. Therefore, when handling a discrete feature that has $d$ possible values, we usually convert it to $d$ new features where each feature takes a binary value, $0$ or $1$, indicating whether the corresponding original value is absent or present. Alternatively, the discrete feature can be mapped onto a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary discrete value, we do not need to do any preprocessing.
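The conversion of a discrete feature with $d$ possible values into $d$ binary features is commonly called one-hot encoding. A minimal sketch follows, using the color example from the text; the helper name `one_hot` is hypothetical and not part of the tutorial's code.

```python
def one_hot(value, categories):
    """Return one binary feature per category: 1 if it matches value, else 0."""
    if value not in categories:
        raise ValueError("unknown category: %r" % (value,))
    return [1 if value == category else 0 for category in categories]


colors = ["Red", "Green", "Blue"]
encoded_red = one_hot("Red", colors)
encoded_blue = one_hot("Blue", colors)
print(encoded_red, encoded_blue)
```

In this representation every pair of distinct colors differs in exactly two positions, so no color is numerically "closer" to another.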
#### Feature Normalization #### Feature Normalization
Another observation we have is that there is a huge difference among the value ranges of the 13 features (Figure 2). For example, feature B has a value range of [0.32, 396.90] while feature NOX has a range of [0.3850, 0.8170]. For an effective optimization, here we need data normalization. The goal of data normalization is to scale each feature into roughly the same value range, for example [-0.5, 0.5]. In this example, we adopt a standard way of normalization: subtracting the mean value from the feature and dividing the result by the original value range. We also observe a huge difference among the value ranges of the 13 features (Figure 2). For instance, the values of feature *B* fall in $[0.32, 396.90]$, whereas those of feature *NOX* fall in $[0.3850, 0.8170]$. An effective optimization would require data normalization. The goal of data normalization is to scale the values of each feature into roughly the same range, perhaps $[-0.5, 0.5]$. Here, we adopt a popular normalization technique where we subtract the mean value from the feature value and divide the result by the width of the original range.
There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling): There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling):
- A value range that is too large or too small might cause floating number overflow or underflow during computation. - A value range that is too large or too small might cause floating number overflow or underflow during computation.
- Different value ranges might result in different importances of different features to the model (at least in the beginning of the training process), which however is an unreasonable assumption. Such assumption makes the optimization more difficult and increases the training time a lot. - Different value ranges might result in varying *importances* of different features to the model (at least in the beginning of the training process). This assumption about the data is often unreasonable, making the optimization difficult, which in turn results in increased training time.
- Many Machine Learning techniques or models (e.g., L1/L2 regularization and Vector Space Model) are based on the assumption that all the features have roughly zero means and their value ranges are similar. - Many machine learning techniques or models (e.g., *L1/L2 regularization* and *Vector Space Model*) assume that all the features have roughly zero means and their value ranges are similar.
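The normalization described above (subtract the mean, then divide by the width of the original range) can be sketched for a single feature column as follows. The helper `normalize_column` and the sample values are made up for illustration; this is not the preprocessing code inside `uci_housing`.

```python
def normalize_column(values):
    """Shift a feature to zero mean and scale it by its original range."""
    mean = sum(values) / float(len(values))
    value_range = max(values) - min(values)
    if value_range == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / value_range for v in values]


# Made-up values within the NOX range quoted in the text.
nox_like = [0.385, 0.538, 0.601, 0.817]
normalized = normalize_column(nox_like)
print(normalized)
```

After this step each feature has zero mean and spans an interval of width 1, so features such as *B* and *NOX* end up on comparable scales.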
<p align="center"> <p align="center">
<img src = "image/ranges_en.png" width=550><br/> <img src = "image/ranges_en.png" width=550><br/>
Figure 2. The value ranges of the features Figure 2. The value ranges of the features
</p> </p>
#### Prepare Training and Test Sets #### Prepare Training and Test Sets
We split the dataset into two subsets, one for estimating the model parameters, namely, model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal of training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. You can try different split ratios to observe how the two variances change. We split the dataset in two, one for adjusting the model parameters, namely, for model training, and the other for model testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict new outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$.
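An $8:2$ split can be sketched in a few lines. This assumes the rows are shuffled before being cut, which may differ from what the dataset module does; `split_dataset` and its parameters are hypothetical names used only for illustration.

```python
import random


def split_dataset(rows, train_ratio=0.8, seed=0):
    """Shuffle the rows, then cut them into a training part and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# 506 stand-in row indices, matching the size of the housing dataset.
train_rows, test_rows = split_dataset(range(506))
print(len(train_rows), len(test_rows))
```

Changing `train_ratio` is a simple way to see the trade-off described above between the variance of the parameter estimates and the variance of the test error.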
When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process, such as the number of layers in the network. Because hyperparameters are not part of the model parameters, they cannot be trained using the same loss function. Thus we will try several sets of hyperparameters to train several models and cross-validate them on the validation set to pick the best one; finally, the selected trained model is tested on the test set. Because our model is relatively simple, we will omit this validation process.
## Training
`fit_a_line/trainer.py` demonstrates the training using [PaddlePaddle](http://paddlepaddle.org).
### Initialize PaddlePaddle
Execute the following command to split the dataset and write the training and test set into the `train.list` and `test.list` files, so that later PaddlePaddle can read from them.
```python ```python
python prepare_data.py -r 0.8 #8:2 is the default split ratio paddle.init(use_gpu=False, trainer_count=1)
``` ```
When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process begins. These hyperparameters are not part of the model parameters and cannot be trained using the same Loss Function (e.g., the number of layers in the network). Thus we will try several sets of hyperparameters to get several models, and compare these trained models on the validation set to pick the best one, and finally test it on the test set. Because our model is relatively simple in this problem, we ignore this validation process for now. ### Model Configuration
### Provide Data to PaddlePaddle Linear regression is essentially a fully-connected layer with linear activation:
After the data is prepared, we use a Python Data Provider to provide data for PaddlePaddle. A Data Provider is a Python function which will be called by PaddlePaddle during training. In this example, the Data Provider only needs to read the data and return it to the training process of PaddlePaddle line by line.
```python ```python
from paddle.trainer.PyDataProvider2 import * x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
import numpy as np y_predict = paddle.layer.fc(input=x,
#define data type and dimensionality size=1,
@provider(input_types=[dense_vector(13), dense_vector(1)]) act=paddle.activation.Linear())
def process(settings, input_file): y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
data = np.load(input_file.strip()) cost = paddle.layer.regression_cost(input=y_predict, label=y)
for row in data: ```
yield row[:-1].tolist(), row[-1:].tolist() ### Create Parameters
```python
parameters = paddle.parameters.create(cost)
``` ```
## Model Configuration ### Create Trainer
### Data Definition
We first call the function `define_py_data_sources2` to let PaddlePaddle read training and test data from the `dataprovider.py` in the above. PaddlePaddle can accept configuration info from the command line, for example, here we pass a variable named `is_predict` to control the model to have different structures during training and test.
```python ```python
from paddle.trainer_config_helpers import * optimizer = paddle.optimizer.Momentum(momentum=0)
is_predict = get_config_arg('is_predict', bool, False) trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
```
define_py_data_sources2( ### Feeding Data
train_list='data/train.list',
test_list='data/test.list',
module='dataprovider',
obj='process')
``` PaddlePaddle provides the
[reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
for loading training data. A reader may return multiple columns, and we need a Python dictionary to specify the mapping from column index to data layers.
### Algorithm Settings
Next we need to set the details of the optimization algorithm. Due to the simplicity of the Linear Regression model, we only need to set the `batch_size` which defines how many samples are used every time for updating the parameters.
```python ```python
settings(batch_size=2) feeding={'x': 0, 'y': 1}
``` ```
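To illustrate what the `feeding` dictionary expresses, the sketch below builds a toy reader in plain Python and uses the mapping to pick columns by data-layer name. It does not import PaddlePaddle; `toy_reader_creator` and `select_columns` are hypothetical helpers, and the real framework performs this mapping internally in its own way.

```python
def toy_reader_creator(rows):
    """Return a reader: a no-argument function yielding one item at a time."""
    def reader():
        for features, price in rows:
            # Column 0 holds the 13 features, column 1 holds the price.
            yield features, price
    return reader


def select_columns(item, feeding):
    """Map each data-layer name to the column of the item it should receive."""
    return dict((name, item[index]) for name, index in feeding.items())


feeding = {'x': 0, 'y': 1}

# Two made-up instances: 13 feature values and one price each.
toy_rows = [([0.1] * 13, [24.0]), ([0.2] * 13, [21.6])]
first_item = next(toy_reader_creator(toy_rows)())
named_columns = select_columns(first_item, feeding)
print(sorted(named_columns.keys()))
```

Here the data layer named `x` receives column 0 of every item and the layer named `y` receives column 1, which is what `feeding={'x': 0, 'y': 1}` declares.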
### Network Moreover, an event handler is provided to print the training progress:
Finally, we use `fc_layer` and `LinearActivation` to represent the Linear Regression model.
```python ```python
#input data of 13 dimensional house information # event_handler to print training and testing info
x = data_layer(name='x', size=13) def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
y_predict = fc_layer( if event.batch_id % 100 == 0:
input=x, print "Pass %d, Batch %d, Cost %f" % (
param_attr=ParamAttr(name='w'), event.pass_id, event.batch_id, event.cost)
size=1,
act=LinearActivation(), if isinstance(event, paddle.event.EndPass):
bias_attr=ParamAttr(name='b')) result = trainer.test(
reader=paddle.batch(
if not is_predict: #when training, we use MSE (i.e., regression_cost) as the Loss Function uci_housing.test(), batch_size=2),
y = data_layer(name='y', size=1) feeding=feeding)
cost = regression_cost(input=y_predict, label=y) print "Test %d, Cost %f" % (event.pass_id, result.cost)
outputs(cost) #output MSE to view the loss change
else: #during test, output the prediction value
outputs(y_predict)
``` ```
## Training Model ### Start Training
We can run the PaddlePaddle command line trainer in the root directory of the code. Here we name the configuration file as `trainer_config.py`. We train 30 passes and save the result in the directory `output`:
```bash
./train.sh
```
## Use Model ```python
Now we can use the trained model to do prediction. trainer.train(
```bash reader=paddle.batch(
python predict.py paddle.reader.shuffle(
``` uci_housing.train(), buf_size=500),
Here by default we use the model in `output/pass-00029` for prediction, and compare the actual house price with the predicted one. The result is shown in `predictions.png`. batch_size=2),
If you want to use another model or test on other data, you can pass in a new model path or data path: feeding=feeding,
```bash event_handler=event_handler,
python predict.py -m output/pass-00020 -t data/housing.test.npy num_passes=30)
``` ```
## Summary ## Summary
In this chapter, we have introduced the Linear Regression model using the UCI Housing Data Set as an example. We have shown how to train and test this model with PaddlePaddle. Many more complex models and techniques are derived from this simple linear model, thus it is important for us to understand how it works. This chapter introduces *Linear Regression* and how to train and test this model with PaddlePaddle, using the UCI Housing Data Set. Because a large number of more complex models and techniques are derived from linear regression, it is important to understand its underlying theory and limitations.
## References ## References
...@@ -227,7 +244,8 @@ In this chapter, we have introduced the Linear Regression model using the UCI Ho ...@@ -227,7 +244,8 @@ In this chapter, we have introduced the Linear Regression model using the UCI Ho
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128. 4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> This tutorial was created and published under the [Creative Commons License 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/).
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -246,6 +264,6 @@ marked.setOptions({ ...@@ -246,6 +264,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -86,14 +87,25 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$ ...@@ -86,14 +87,25 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
3. Backpropagate the error according to the loss function ([backpropagation](https://en.wikipedia.org/wiki/Backpropagation)): pass the network error from the output layer back toward the input, layer by layer, and update the network parameters.
4. Repeat steps 2-3 until the training error reaches the required level or the number of training passes reaches the preset value.
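The loop described above can be sketched in plain Python for a one-feature linear model trained on the MSE loss. This is an illustrative sketch only: the function name, the learning rate, and the toy data are assumptions and are not part of the tutorial's PaddlePaddle code.

```python
def train_linear_regression(xs, ys, lr=0.1, num_passes=500):
    """Fit y ~ w*x + b by full-batch gradient descent on the MSE loss."""
    w, b = 0.0, 0.0                      # initialize the parameters
    n = len(xs)
    losses = []
    for _ in range(num_passes):          # repeat until the pass limit is reached
        preds = [w * x + b for x in xs]  # forward pass
        errs = [p - y for p, y in zip(preds, ys)]
        losses.append(sum(e * e for e in errs) / n)  # MSE
        # Gradients of the MSE with respect to w and b (the backward step)
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errs, xs))
        grad_b = 2.0 / n * sum(errs)
        w -= lr * grad_w                 # parameter update
        b -= lr * grad_b
    return w, b, losses


# Toy data generated from y = 2x + 1 (hypothetical, noise-free)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b, losses = train_linear_regression(xs, ys)
```

On this toy data the loss shrinks from pass to pass and the learned `w` and `b` approach the values used to generate the data.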
## Dataset
### Wrapping the dataset interface
First, load the required packages:
## 数据准备 ```python
执行以下命令来准备数据: import paddle.v2 as paddle
```bash import paddle.v2.dataset.uci_housing as uci_housing
cd data && python prepare_data.py
``` ```
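This code downloads the data from the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) and [preprocesses](#数据预处理) it; finally the data is split into a training set and a test set.
We bring in the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) through the uci_housing module.
The uci_housing module encapsulates:
1. Downloading the data. The downloaded file is saved at ~/.cache/paddle/dataset/uci_housing/housing.data.
2. [Preprocessing the data](#数据预处理).
### About the dataset
The dataset has 506 rows. Each row describes one category of houses in the Boston suburbs together with the median price of houses in that category. The attributes are as follows:
| Attribute | Description | Type |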
...@@ -131,89 +143,89 @@ cd data && python prepare_data.py ...@@ -131,89 +143,89 @@ cd data && python prepare_data.py
</p> </p>
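#### Organizing the training and test sets
We split the dataset into two parts. One part is used to adjust the model's parameters, that is, to train the model, and the model's error on it is called the **training error**. The other part is used for testing, and the model's error on it is called the **test error**. The goal of training is to predict unseen data by learning regularities from the training data, so the test error is the better indicator of model performance. Two factors govern the split ratio: more training data lowers the variance of the parameter estimates and gives a more reliable model, while more test data lowers the variance of the test error and gives a more reliable error estimate. This example uses a split ratio of $8:2$, a common choice; interested readers can try other settings and observe how the two errors change.
More complex training setups often use a third dataset, the validation set. Complex models usually have hyperparameters ([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)) to tune, so we train several models with different hyperparameter combinations, compare them on the validation set to pick the best combination, and only then evaluate the model trained with that combination on the test set. Because the model in this chapter is simple, we skip this step for now.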
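As an illustration of the $8:2$ split, the sketch below divides 506 row indices into a training part and a test part. The helper name and the shuffling step are assumptions made for this example; the source does not say how `uci_housing` orders the rows before splitting.

```python
import random


def split_dataset(rows, train_ratio=0.8, seed=0):
    """Return (train, test) lists holding train_ratio and 1 - train_ratio of rows."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)    # assumption: shuffle before splitting
    cut = int(len(rows) * train_ratio)   # number of training rows
    return rows[:cut], rows[cut:]


rows = list(range(506))                  # stand-in for the 506 dataset rows
train_rows, test_rows = split_dataset(rows)
```

Changing `train_ratio` trades training data against test data, which is the variance trade-off the paragraph above describes.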
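## Training
`fit_a_line/trainer.py` demonstrates the overall training process.
### Initializing PaddlePaddle
Run the following command to split the dataset and write the paths of the training and test sets into train.list and test.list, which PaddlePaddle reads.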
```python ```python
python prepare_data.py -r 0.8 #默认使用8:2的比例进行分割 paddle.init(use_gpu=False, trainer_count=1)
``` ```
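More complex training setups often use a third dataset, the validation set. Complex models usually have hyperparameters ([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)) to tune, so we train several models with different hyperparameter combinations, compare them on the validation set to pick the best combination, and only then evaluate the model trained with that combination on the test set. Because the model in this chapter is simple, we skip this step for now.
### Model configuration
### Providing data to PaddlePaddle
A linear regression model is simply a fully-connected layer (`fc_layer`) with a linear activation function (`LinearActivation`):
Once the data is ready, we use a Python data provider to feed it to PaddlePaddle's training process. A data provider is a Python function that PaddlePaddle's training process calls. In this example it only needs to read the saved data and return it to the training process row by row.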
```python ```python
from paddle.trainer.PyDataProvider2 import * x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
import numpy as np y_predict = paddle.layer.fc(input=x,
#定义数据的类型和维度 size=1,
@provider(input_types=[dense_vector(13), dense_vector(1)]) act=paddle.activation.Linear())
def process(settings, input_file): y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
data = np.load(input_file.strip()) cost = paddle.layer.regression_cost(input=y_predict, label=y)
for row in data: ```
yield row[:-1].tolist(), row[-1:].tolist() ### 创建参数
```python
parameters = paddle.parameters.create(cost)
``` ```
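## Model configuration explained
### Creating the Trainer
### Data definition
First, use `define_py_data_sources2` to configure PaddlePaddle to read training and test data from the `dataprovider.py` above. PaddlePaddle accepts configuration passed on the command line; here, for example, we pass a variable named `is_predict` to control how the model structure differs between training and testing.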
```python ```python
from paddle.trainer_config_helpers import * optimizer = paddle.optimizer.Momentum(momentum=0)
is_predict = get_config_arg('is_predict', bool, False) trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
```
define_py_data_sources2( ### 读取数据且打印训练的中间信息
train_list='data/train.list',
test_list='data/test.list',
module='dataprovider',
obj='process')
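``` PaddlePaddle provides a
[reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
for reading data. The data a reader returns can have multiple columns, so we need a Python dict that maps column
indices to the data layers in the network.
### Algorithm configuration
Next, specify the details of the optimization algorithm. Because the linear regression model is simple, we only need to set the basic `batch_size`, which specifies how many samples are used to compute the gradient at each parameter update.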
```python ```python
settings(batch_size=2) feeding={'x': 0, 'y': 1}
``` ```
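To make the reader idea concrete, here is a small pure-Python sketch of readers as zero-argument functions that yield samples, with stand-ins for shuffling and batching. `make_reader`, `shuffle`, and `batch` below are hypothetical helpers written for this illustration, not PaddlePaddle's implementation; in particular, keeping the final partial batch is an assumption.

```python
import random


def make_reader(samples):
    """Reader creator: returns a zero-argument function that yields samples."""
    def reader():
        for sample in samples:
            yield sample
    return reader


def shuffle(reader, buf_size, seed=0):
    """Wrap a reader so samples are shuffled within buffers of buf_size."""
    rng = random.Random(seed)

    def shuffled():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                yield from buf
                buf = []
        rng.shuffle(buf)
        yield from buf                   # flush the remaining samples
    return shuffled


def batch(reader, batch_size):
    """Wrap a reader so it yields lists of batch_size samples."""
    def batched():
        cur = []
        for sample in reader():
            cur.append(sample)
            if len(cur) == batch_size:
                yield cur
                cur = []
        if cur:                          # assumption: keep the last partial batch
            yield cur
    return batched


# Each sample is a (features, price) tuple; feeding maps layer names to columns.
feeding = {'x': 0, 'y': 1}
samples = [([float(i)] * 13, [float(i)]) for i in range(5)]
batches = list(batch(make_reader(samples), batch_size=2)())
first = batches[0][0]
x_value = first[feeding['x']]            # the 13-dimensional feature vector
```

The `feeding` dict plays the same role as in the tutorial: column 0 of each sample goes to the layer named `x` and column 1 to the layer named `y`.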
### Network structure
In addition, we can provide an event handler to print the training progress:
Finally, use `fc_layer` and `LinearActivation` to express the linear regression model itself.
```python ```python
#输入数据,13维的房屋信息 # event_handler to print training and testing info
x = data_layer(name='x', size=13) def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
y_predict = fc_layer( if event.batch_id % 100 == 0:
input=x, print "Pass %d, Batch %d, Cost %f" % (
param_attr=ParamAttr(name='w'), event.pass_id, event.batch_id, event.cost)
size=1,
act=LinearActivation(), if isinstance(event, paddle.event.EndPass):
bias_attr=ParamAttr(name='b')) result = trainer.test(
reader=paddle.batch(
if not is_predict: #训练时,我们使用MSE,即regression_cost作为损失函数 uci_housing.test(), batch_size=2),
y = data_layer(name='y', size=1) feeding=feeding)
cost = regression_cost(input=y_predict, label=y) print "Test %d, Cost %f" % (event.pass_id, result.cost)
outputs(cost) #训练时输出MSE来监控损失的变化
else: #测试时,输出预测值
outputs(y_predict)
``` ```
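## Training the model
### Starting training
Run PaddlePaddle's command-line training program from the root directory of the corresponding code. Here the model configuration file is `trainer_config.py`, training runs for 30 passes, and the results are saved under `output`.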
```bash
./train.sh
```
## Applying the model ```python
Now let's see how to use the trained model for prediction. trainer.train(
```bash reader=paddle.batch(
python predict.py paddle.reader.shuffle(
``` uci_housing.train(), buf_size=500),
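By default this uses the model saved in `output/pass-00029` for prediction and compares the house prices in the data with the predicted results; the result is saved in `predictions.png`. batch_size=2),
If you want to predict with a different model or different data, just pass in the new paths: feeding=feeding,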
```bash event_handler=event_handler,
python predict.py -m output/pass-00020 -t data/housing.test.npy num_passes=30)
``` ```
## Summary
...@@ -228,6 +240,7 @@ python predict.py -m output/pass-00020 -t data/housing.test.npy ...@@ -228,6 +240,7 @@ python predict.py -m output/pass-00020 -t data/housing.test.npy
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -246,6 +259,6 @@ marked.setOptions({ ...@@ -246,6 +259,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -40,6 +41,7 @@ ...@@ -40,6 +41,7 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
TODO: Write about https://github.com/PaddlePaddle/Paddle/tree/develop/demo/gan TODO: Write about https://github.com/PaddlePaddle/Paddle/tree/develop/demo/gan
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -58,6 +60,6 @@ marked.setOptions({ ...@@ -58,6 +60,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -39,6 +40,7 @@ ...@@ -39,6 +40,7 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -57,6 +59,6 @@ marked.setOptions({ ...@@ -57,6 +59,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
...@@ -248,48 +248,48 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -248,48 +248,48 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
The input to the network is defined as `data_layer`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images with three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10.
```python
datadim = 3 * 32 * 32
classdim = 10
data = data_layer(name='image', size=datadim)
```
2. Define VGG main module
```python
net = vgg_bn_drop(data, num_channels=3)
```
The input to the VGG main module comes from the data layer. `vgg_bn_drop` defines a 16-layer VGG network in which each convolutional layer is followed by BN and dropout layers. Here is the definition in detail:
```python
def vgg_bn_drop(input, num_channels):
    # One VGG block: `groups` 3x3 convolutions with BN and dropout, then 2x2 max pooling
    def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
        return img_conv_group(
            input=ipt,
            num_channels=num_channels_,
            pool_size=2,
            pool_stride=2,
            conv_num_filter=[num_filter] * groups,
            conv_filter_size=3,
            conv_act=ReluActivation(),
            conv_with_batchnorm=True,
            conv_batchnorm_drop_rate=dropouts,
            pool_type=MaxPooling())

    # Only the first block needs the number of input channels
    conv1 = conv_block(input, 64, 2, [0.3, 0], num_channels)
    conv2 = conv_block(conv1, 128, 2, [0.4, 0])
    conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
    conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
    conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])

    drop = dropout_layer(input=conv5, dropout_rate=0.5)
    fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
    bn = batch_norm_layer(
        input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
    fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
    return fc2
```
2.1. First, define a convolution block, `conv_block`. The convolution kernel is 3x3, and the pooling window is 2x2 with stride 2. `dropouts` specifies the dropout probability after each convolution. The function `img_conv_group`, defined in `paddle.trainer_config_helpers`, consists of a series of `Conv->BN->ReLU->Dropout` stages followed by one `Pooling` stage.
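As a rough check of what the five blocks do to a 32x32 CIFAR10 image, the sketch below tracks the spatial size through each block. It assumes that every 3x3 convolution uses a padding of 1, so that only pooling changes the size; the source does not state the padding, and the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output width of a convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output width of a 2x2 pooling window with stride 2."""
    return (size - window) // stride + 1


def vgg_feature_sizes(size=32, groups=(2, 2, 3, 3, 3)):
    """Spatial size after each of the five conv blocks."""
    sizes = []
    for n_convs in groups:
        for _ in range(n_convs):
            size = conv_out(size)        # 3x3 conv keeps the size under padding 1
        size = pool_out(size)            # pooling halves it
        sizes.append(size)
    return sizes


sizes = vgg_feature_sizes()
num_conv_layers = sum((2, 2, 3, 3, 3))
```

Under these assumptions the feature map shrinks from 32x32 to 1x1 after the fifth block, and the five blocks together hold 13 convolution layers.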
...@@ -303,22 +303,22 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -303,22 +303,22 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
The VGG network above extracts high-level features. A fully-connected layer then maps them to a vector with one entry per category, and the softmax function turns that vector into the probability of the image belonging to each category.
```python
out = fc_layer(input=net, size=classdim, act=SoftmaxActivation())
```
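For reference, the softmax function itself can be written in a few lines of plain Python. This is a generic sketch of the formula, not PaddlePaddle's `SoftmaxActivation`; subtracting the maximum score is a standard trick to avoid overflow.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])         # hypothetical scores for three classes
```

Larger scores map to larger probabilities, and the ordering of the classes is preserved.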
4. Define Loss Function and Outputs
In supervised learning, the labels of the training images are also defined with `data_layer`. During training, cross-entropy is used as the loss function and as the output of the network; during testing, the outputs are the probabilities calculated by the classifier.
```python
if not is_predict:
    lbl = data_layer(name="label", size=classdim)
    cost = classification_cost(input=out, label=lbl)
    outputs(cost)
else:
    outputs(out)
```
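The cross-entropy loss for a single image can be sketched as the negative log of the probability assigned to the true class. This is the textbook formula, shown under the assumption that the classifier outputs are already softmax probabilities; it is not the code behind `classification_cost`.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss for one sample: -log of the true class probability."""
    return -math.log(probs[label])


probs = [0.7, 0.2, 0.1]                   # hypothetical classifier output
loss_confident = cross_entropy(probs, 0)  # true class has high probability
loss_wrong = cross_entropy(probs, 2)      # true class has low probability
```

The loss is small when the classifier puts high probability on the correct class and grows as that probability approaches zero.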
### ResNet
......
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
The source code for this tutorial is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). First-time users should refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
## Background
Compared with text, images convey information that is more vivid, easier to understand, and more artistic, and they are an important medium through which people pass on and exchange information. In this tutorial we focus on an important problem in image recognition: image classification.
...@@ -51,7 +51,7 @@ ...@@ -51,7 +51,7 @@
2). **Feature encoding**: Low-level features contain a great deal of redundancy and noise. To make the feature representation more robust, a feature transformation algorithm is applied to encode the low-level features; this is called feature encoding. Common encodings include vector quantization \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector encoding \[[7](#参考文献)\].
3). **Spatial feature constraints**: Feature encoding is usually followed by spatial feature constraints, also known as **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which yields a representation with a degree of feature invariance. Pyramid feature matching is a commonly used feature pooling method: it divides the image evenly into blocks and pools features within each block.
4). **Classification with a classifier**: After the preceding steps an image can be described by a fixed-dimensional vector, which is then passed to a classifier. Commonly used classifiers include SVM (Support Vector Machine) and random forests. Kernel SVMs are the most widely used and perform well on traditional image classification tasks.
This approach was widely used in the image classification algorithms of the PASCAL VOC competition \[[18](#参考文献)\]. [NEC Labs](http://www.nec-labs.com/) won the image classification task at ILSVRC2010 using SIFT and LBP features, two nonlinear encoders, and an SVM classifier \[[8](#参考文献)\].
The CNN model that Alex Krizhevsky proposed at ILSVRC 2012 \[[9](#参考文献)\] was a historic breakthrough: it outperformed traditional methods by a wide margin and won ILSVRC2012. The model is known as AlexNet, and it was the first application of deep learning to large-scale image classification. A series of CNN models followed AlexNet and kept improving results on ImageNet, as shown in Figure 4. As models grew deeper and their structure more refined, the Top-5 error rate fell steadily to around 3.5%. On the same ImageNet dataset the human error rate is about 5.1%, so on this benchmark current deep learning models recognize images more accurately than humans.
...@@ -67,8 +67,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得 ...@@ -67,8 +67,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
<p align="center"> <p align="center">
<img src="image/lenet.png"><br/> <img src="image/lenet.png"><br/>
Figure 5. An example CNN network [20]
</p> </p>
- Convolution layer: performs convolution to extract features from low level to high level, exploiting the local correlation and spatial invariance of images.
- Pooling layer: performs downsampling by taking the maximum (max-pooling) or the mean (avg-pooling) of local blocks in the feature maps output by the convolution. Downsampling is also a common image-processing operation that filters out some unimportant high-frequency information.
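The pooling operation described above can be illustrated with a small pure-Python sketch on a 4x4 feature map. The function name `pool2d` and the sample values are made up for this example.

```python
def pool2d(fmap, window=2, stride=2, mode="max"):
    """Downsample a 2-D feature map with max or average pooling."""
    rows = range(0, len(fmap) - window + 1, stride)
    cols = range(0, len(fmap[0]) - window + 1, stride)
    out = []
    for r in rows:
        out_row = []
        for c in cols:
            # Collect the values inside the current window
            block = [fmap[r + i][c + j]
                     for i in range(window) for j in range(window)]
            out_row.append(max(block) if mode == "max"
                           else sum(block) / len(block))
        out.append(out_row)
    return out


fmap = [[1, 2, 5, 6],
        [3, 4, 7, 8],
        [9, 10, 13, 14],
        [11, 12, 15, 16]]
max_pooled = pool2d(fmap)
avg_pooled = pool2d(fmap, mode="avg")
```

Each 2x2 block of the input collapses to a single value, so the 4x4 map becomes 2x2.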
...@@ -108,7 +108,7 @@ GoogleNet整体网络结构如图8所示,总共22层网络:开始由3层普 ...@@ -108,7 +108,7 @@ GoogleNet整体网络结构如图8所示,总共22层网络:开始由3层普
<p align="center"> <p align="center">
<img src="image/googlenet.jpeg" ><br/> <img src="image/googlenet.jpeg" ><br/>
Figure 8. GoogleNet [12]
</p> </p>
...@@ -174,7 +174,7 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -174,7 +174,7 @@ paddle.init(use_gpu=False, trainer_count=1)
1. Define the data input and its dimensions
The network input is defined as `data_layer`, which for image classification is the image pixel data. CIFAR10 images are 32x32 RGB color images with 3 channels, so the input size is 3072 (3x32x32) and the number of classes is 10.
```python ```python
datadim = 3 * 32 * 32 datadim = 3 * 32 * 32
classdim = 10 classdim = 10
...@@ -189,7 +189,7 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -189,7 +189,7 @@ paddle.init(use_gpu=False, trainer_count=1)
net = vgg_bn_drop(image) net = vgg_bn_drop(image)
``` ```
The input to the VGG core module is the data layer. `vgg_bn_drop` defines a 16-layer VGG structure in which each convolution is followed by BN and Dropout layers. The detailed definition is as follows:
```python ```python
def vgg_bn_drop(input): def vgg_bn_drop(input):
def conv_block(ipt, num_filter, groups, dropouts, num_channels=None): def conv_block(ipt, num_filter, groups, dropouts, num_channels=None):
...@@ -220,11 +220,11 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -220,11 +220,11 @@ paddle.init(use_gpu=False, trainer_count=1)
fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear()) fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear())
return fc2 return fc2
``` ```
2.1. First define a group of convolutions, `conv_block`. The kernel size is 3x3, the pooling window is 2x2 with a stride of 2, `groups` sets how many consecutive convolutions each VGG block performs, and `dropouts` specifies the Dropout probabilities. The `img_conv_group` used here is a predefined module in `paddle.networks`, made up of several `Conv->BN->ReLU->Dropout` groups and one `Pooling` group.
2.2. Five groups of convolutions, that is, five `conv_block`s. The first and second groups use two consecutive convolutions; the third, fourth, and fifth use three. The Dropout probability after the last convolution of each group is 0, meaning no Dropout is applied there.
2.3. Finally, two 512-dimensional fully-connected layers follow.
3. Define the classifier
...@@ -240,7 +240,7 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -240,7 +240,7 @@ paddle.init(use_gpu=False, trainer_count=1)
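4. Define the loss function and network output
Supervised training also requires the class label of each input image, which is likewise defined through `paddle.layer.data`. Training uses multi-class cross-entropy as the loss function and as the network output; at prediction time the network output is the probability produced by the classifier.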
```python ```python
lbl = paddle.layer.data( lbl = paddle.layer.data(
name="label", type=paddle.data_type.integer_value(classdim)) name="label", type=paddle.data_type.integer_value(classdim))
...@@ -305,9 +305,9 @@ def layer_warp(block_func, ipt, features, count, stride): ...@@ -305,9 +305,9 @@ def layer_warp(block_func, ipt, features, count, stride):
The connection structure of `resnet_cifar10` has the following main stages.
1. The input at the bottom connects to one `conv_bn_layer`, a convolution layer with BN.
2. Three groups of residual modules follow, configured below as three `layer_warp` groups, each built from the residual module on the left of Figure 10.
3. Finally, average pooling is applied to the network and that layer is returned.
Note: excluding the first convolution layer and the last fully-connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6; that is, the depth of `resnet_cifar10` must satisfy $(depth - 2) % 6 == 0$.
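The depth constraint can be read as follows, assuming each residual module contains two parameterized convolution layers (an assumption about the module in Figure 10, which is not shown here): three groups of n modules give 6n layers, plus the first convolution and the final fully-connected layer, so depth = 6n + 2. The helper below is a hypothetical sketch that checks the constraint and recovers n.

```python
def blocks_per_group(depth):
    """Number of residual modules in each of the three groups."""
    if (depth - 2) % 6 != 0:
        raise ValueError("depth must satisfy (depth - 2) % 6 == 0")
    return (depth - 2) // 6


n = blocks_per_group(32)                 # a depth of 32 gives 5 modules per group
```

Depths such as 20, 32, or 56 satisfy the constraint, while a depth of 30 is rejected.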
...@@ -452,7 +452,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123} ...@@ -452,7 +452,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
[2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[3] Ahonen, T., Hadid, A., and Pietikäinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
[4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
......
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
The source code for this tutorial is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). First-time users should refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
## Background
Compared with text, images convey information that is more vivid, easier to understand, and more artistic, and they are an important medium through which people pass on and exchange information. In this tutorial we focus on an important problem in image recognition: image classification.
...@@ -51,7 +51,7 @@ ...@@ -51,7 +51,7 @@
2). **Feature encoding**: Low-level features contain a great deal of redundancy and noise. To make the feature representation more robust, a feature transformation algorithm is applied to encode the low-level features; this is called feature encoding. Common encodings include vector quantization \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector encoding \[[7](#参考文献)\].
3). **Spatial feature constraints**: Feature encoding is usually followed by spatial feature constraints, also known as **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which yields a representation with a degree of feature invariance. Pyramid feature matching is a commonly used feature pooling method: it divides the image evenly into blocks and pools features within each block.
4). **Classification with a classifier**: After the preceding steps an image can be described by a fixed-dimensional vector, which is then passed to a classifier. Commonly used classifiers include SVM (Support Vector Machine) and random forests. Kernel SVMs are the most widely used and perform well on traditional image classification tasks.
This approach was widely used in the image classification algorithms of the PASCAL VOC competition \[[18](#参考文献)\]. [NEC Labs](http://www.nec-labs.com/) won the image classification task at ILSVRC2010 using SIFT and LBP features, two nonlinear encoders, and an SVM classifier \[[8](#参考文献)\].
The CNN model that Alex Krizhevsky proposed at ILSVRC 2012 \[[9](#参考文献)\] was a historic breakthrough: it outperformed traditional methods by a wide margin and won ILSVRC2012. The model is known as AlexNet, and it was the first application of deep learning to large-scale image classification. A series of CNN models followed AlexNet and kept improving results on ImageNet, as shown in Figure 4. As models grew deeper and their structure more refined, the Top-5 error rate fell steadily to around 3.5%. On the same ImageNet dataset the human error rate is about 5.1%, so on this benchmark current deep learning models recognize images more accurately than humans.
...@@ -67,8 +67,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得 ...@@ -67,8 +67,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
<p align="center"> <p align="center">
<img src="image/lenet.png"><br/> <img src="image/lenet.png"><br/>
Figure 5. An example CNN network [20]
</p> </p>
- Convolution layer: performs convolution to extract features from low level to high level, exploiting the local correlation and spatial invariance of images.
- Pooling layer: performs downsampling by taking the maximum (max-pooling) or the mean (avg-pooling) of local blocks in the feature maps output by the convolution. Downsampling is also a common image-processing operation that filters out some unimportant high-frequency information.
...@@ -108,7 +108,7 @@ GoogleNet整体网络结构如图8所示,总共22层网络:开始由3层普 ...@@ -108,7 +108,7 @@ GoogleNet整体网络结构如图8所示,总共22层网络:开始由3层普
<p align="center"> <p align="center">
<img src="image/googlenet.jpeg" ><br/> <img src="image/googlenet.jpeg" ><br/>
Figure 8. GoogleNet [12]
</p> </p>
...@@ -245,7 +245,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$ ...@@ -245,7 +245,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
1. Define the data input and its dimensions
The network input is defined as `data_layer`, which for image classification is the image pixel data. CIFAR10 images are 32x32 RGB color images with 3 channels, so the input size is 3072 (3x32x32) and the number of classes is 10.
```python ```python
datadim = 3 * 32 * 32 datadim = 3 * 32 * 32
classdim = 10 classdim = 10
...@@ -258,7 +258,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$ ...@@ -258,7 +258,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
net = vgg_bn_drop(data) net = vgg_bn_drop(data)
``` ```
The input to the VGG core module is the data layer. `vgg_bn_drop` defines a 16-layer VGG structure in which each convolution is followed by BN and Dropout layers. The detailed definition is as follows:
```python ```python
def vgg_bn_drop(input, num_channels): def vgg_bn_drop(input, num_channels):
def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None): def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
...@@ -273,26 +273,26 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$ ...@@ -273,26 +273,26 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
conv_with_batchnorm=True, conv_with_batchnorm=True,
conv_batchnorm_drop_rate=dropouts, conv_batchnorm_drop_rate=dropouts,
pool_type=MaxPooling()) pool_type=MaxPooling())
conv1 = conv_block(input, 64, 2, [0.3, 0], 3) conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
conv2 = conv_block(conv1, 128, 2, [0.4, 0]) conv2 = conv_block(conv1, 128, 2, [0.4, 0])
conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0]) conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0]) conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0]) conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
drop = dropout_layer(input=conv5, dropout_rate=0.5) drop = dropout_layer(input=conv5, dropout_rate=0.5)
fc1 = fc_layer(input=drop, size=512, act=LinearActivation()) fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
bn = batch_norm_layer( bn = batch_norm_layer(
input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5)) input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
fc2 = fc_layer(input=bn, size=512, act=LinearActivation()) fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
return fc2 return fc2
``` ```
2.1. First define a group of convolutions, `conv_block`. The kernel size is 3x3, the pooling window is 2x2 with a stride of 2, `groups` sets how many consecutive convolutions each VGG block performs, and `dropouts` specifies the Dropout probabilities. The `img_conv_group` used here is a predefined module in `paddle.trainer_config_helpers`, made up of several `Conv->BN->ReLU->Dropout` groups and one `Pooling` group.
2.2. Five groups of convolutions, that is, five `conv_block`s. The first and second groups use two consecutive convolutions; the third, fourth, and fifth use three. The Dropout probability after the last convolution of each group is 0, meaning no Dropout is applied there.
2.3. Finally, two 512-dimensional fully-connected layers follow.
3. Define the classifier
...@@ -306,7 +306,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$ ...@@ -306,7 +306,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
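4. Define the loss function and network output
Supervised training also requires the class label of each input image, which is likewise defined through `data_layer`. Training uses multi-class cross-entropy as the loss function and as the network output; at prediction time the network output is the probability produced by the classifier.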
```python ```python
if not is_predict: if not is_predict:
lbl = data_layer(name="label", size=class_num) lbl = data_layer(name="label", size=class_num)
...@@ -383,9 +383,9 @@ def layer_warp(block_func, ipt, features, count, stride): ...@@ -383,9 +383,9 @@ def layer_warp(block_func, ipt, features, count, stride):
The connection structure of `resnet_cifar10` has the following main stages.
1. The input at the bottom connects to one `conv_bn_layer`, a convolution layer with BN.
2. Three groups of residual modules follow, configured below as three `layer_warp` groups, each built from the residual module on the left of Figure 10.
3. Finally, average pooling is applied to the network and that layer is returned.
Note: excluding the first convolution layer and the last fully-connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6; that is, the depth of `resnet_cifar10` must satisfy $(depth - 2) % 6 == 0$.
...@@ -487,7 +487,7 @@ python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png ...@@ -487,7 +487,7 @@ python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png
<p align="center"> <p align="center">
<img src="image/fea_conv0.png" width="500"><br/> <img src="image/fea_conv0.png" width="500"><br/>
Figure 13. Visualization of convolutional features
</p> </p>
## Summary
...@@ -501,7 +501,7 @@ python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png ...@@ -501,7 +501,7 @@ python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png
[2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005. [2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28. [3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
[4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003. [4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
......
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -289,48 +290,48 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -289,48 +290,48 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
The network input is defined with `data_layer`; in image classification it holds the image pixels. CIFAR10 images are 32x32 color images with three channels, so the size of the input data is 3072 (3x32x32) and the number of categories is 10.
```python
datadim = 3 * 32 * 32
classdim = 10
data = data_layer(name='image', size=datadim)
```
2. Define VGG main module
```python
net = vgg_bn_drop(data)
```
The input to the VGG main module comes from the data layer. `vgg_bn_drop` defines a 16-layer VGG network in which each convolutional layer is followed by BN and dropout layers. Here is the definition in detail:
```python
def vgg_bn_drop(input, num_channels=3):
    def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
        return img_conv_group(
            input=ipt,
            num_channels=num_channels_,
            pool_size=2,
            pool_stride=2,
            conv_num_filter=[num_filter] * groups,
            conv_filter_size=3,
            conv_act=ReluActivation(),
            conv_with_batchnorm=True,
            conv_batchnorm_drop_rate=dropouts,
            pool_type=MaxPooling())

    # Only the first block is given the channel count of the raw image.
    conv1 = conv_block(input, 64, 2, [0.3, 0], num_channels)
    conv2 = conv_block(conv1, 128, 2, [0.4, 0])
    conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
    conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
    conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])

    drop = dropout_layer(input=conv5, dropout_rate=0.5)
    fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
    bn = batch_norm_layer(
        input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
    fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
    return fc2
```
2.1. First, define a convolution block, `conv_block`. The convolution kernel is 3x3, and the pooling window is 2x2 with stride 2. `groups` sets the number of consecutive convolutions in each VGG block, and `dropouts` gives the dropout probability after each convolution. The function `img_conv_group`, predefined in `paddle.trainer_config_helpers`, consists of a series of `Conv->BN->ReLu->Dropout` groups followed by one `Pooling` layer.
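To see how the five blocks shrink a CIFAR10 image, the sketch below traces the feature-map size through the network. It is an illustration only: it assumes each 3x3 convolution pads its input so that height and width are unchanged (the padding is not stated here), and the names `VGG_BLOCKS` and `trace_shapes` are made up for this example.

```python
# (num_filter, groups) for the five conv_block calls in vgg_bn_drop.
VGG_BLOCKS = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]


def trace_shapes(size=32, pool_size=2, pool_stride=2):
    """Return (channels, height, width) after each block.

    Assumption: the 3x3 convolutions preserve height and width, so only
    the pooling layer at the end of each block changes the spatial size.
    """
    shapes = []
    for num_filter, _groups in VGG_BLOCKS:
        size = (size - pool_size) // pool_stride + 1
        shapes.append((num_filter, size, size))
    return shapes


shapes = trace_shapes()
num_conv_layers = sum(groups for _, groups in VGG_BLOCKS)
print(shapes)
print(num_conv_layers)
```

Under this assumption the last block outputs a 512x1x1 map. Counting the convolutions together with the two 512-dimensional fully connected layers and the classifier layer is one way to arrive at the 16 layers.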
...@@ -344,22 +345,22 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -344,22 +345,22 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
The VGG network above extracts high-level features and maps them to a vector whose size equals the number of categories. A softmax classifier then computes the probability that the image belongs to each category.
```python
out = fc_layer(input=net, size=class_num, act=SoftmaxActivation())
```
4. Define Loss Function and Outputs
In supervised learning, the labels of the training images are also defined with `data_layer`. During training, cross-entropy is used as the loss function and as the output of the network; during testing, the outputs are the probabilities calculated by the classifier.
```python
if not is_predict:
    lbl = data_layer(name="label", size=class_num)
    cost = classification_cost(input=out, label=lbl)
    outputs(cost)
else:
    outputs(out)
```
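The classifier and loss in steps 3 and 4 can be illustrated numerically. The following is a minimal pure-Python sketch of softmax followed by multi-class cross-entropy for a single sample; the function names and the example scores are invented for illustration and are not part of the PaddlePaddle API.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Multi-class cross-entropy for one sample with an integer label."""
    return -math.log(probs[label])


scores = [2.0, 1.0, 0.1]  # hypothetical classifier outputs for 3 classes
probs = softmax(scores)
loss = cross_entropy(probs, label=0)
print(probs, loss)
```

The loss is small when the probability assigned to the true label is close to 1 and grows as that probability approaches 0, which is why it works as a training objective for the classifier.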
### ResNet ### ResNet
...@@ -589,6 +590,7 @@ Traditional image classification methods involve multiple stages of processing a ...@@ -589,6 +590,7 @@ Traditional image classification methods involve multiple stages of processing a
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -607,6 +609,6 @@ marked.setOptions({ ...@@ -607,6 +609,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -44,7 +45,7 @@ ...@@ -44,7 +45,7 @@
The source code for this tutorial is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). First-time users should refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
## Background
Compared with text, images convey information that is more vivid, easier to understand, and more artistic, and they are an important medium through which people pass on and exchange information. This tutorial focuses on one important problem in image recognition: image classification.
...@@ -92,7 +93,7 @@ ...@@ -92,7 +93,7 @@
2). **Feature encoding**: Low-level features contain a great deal of redundancy and noise. To make the feature representation more robust, a feature transformation algorithm is applied to encode the low-level features; this is called feature encoding. Common encodings include vector quantization \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector encoding \[[7](#参考文献)\].
3). **Spatial feature constraint**: Feature encoding is usually followed by a spatial feature constraint, also called **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which yields a representation with a degree of feature invariance. Pyramid feature matching is a common pooling method: it divides the image into uniform blocks and pools features within each block.
4). **Classification with a classifier**: After the preceding steps, an image can be described by a fixed-dimension vector, which is then passed to a classifier. Commonly used classifiers include SVM (Support Vector Machine) and random forests. Kernel SVMs are the most widely used and perform well on traditional image classification tasks.
This approach was widely used among image classification algorithms in the PASCAL VOC competition \[[18](#参考文献)\]. [NEC Labs](http://www.nec-labs.com/) won the ILSVRC2010 image classification task using SIFT and LBP features, two non-linear encoders, and an SVM classifier \[[8](#参考文献)\].
The CNN model \[[9](#参考文献)\] that Alex Krizhevsky proposed at ILSVRC 2012 achieved a historic breakthrough: it outperformed traditional methods by a wide margin and won ILSVRC2012. The model is known as AlexNet, and it was also the first use of deep learning for large-scale image classification. Since AlexNet, a series of CNN models has kept improving results on ImageNet, as shown in Figure 4. As models have grown deeper and their structures more refined, the Top-5 error rate has fallen to around 3.5%. On the same ImageNet dataset, the human error rate is about 5.1%, which means that current deep learning models already surpass the recognition ability of the human eye.
...@@ -108,8 +109,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得 ...@@ -108,8 +109,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
<p align="center"> <p align="center">
<img src="image/lenet.png"><br/> <img src="image/lenet.png"><br/>
Figure 5. Example of a CNN [20]
</p> </p>
- Convolution layer: performs convolution to extract features from low level to high level, exploiting the local correlation and spatial invariance of images.
- Pooling layer: performs downsampling by taking the maximum (max-pooling) or the average (avg-pooling) of local blocks in the convolution output feature map. Downsampling is also a common operation in image processing and filters out some unimportant high-frequency information.
...@@ -149,7 +150,7 @@ GoogleNet整体网络结构如图8所示,总共22层网络:开始由3层普 ...@@ -149,7 +150,7 @@ GoogleNet整体网络结构如图8所示,总共22层网络:开始由3层普
<p align="center"> <p align="center">
<img src="image/googlenet.jpeg" ><br/> <img src="image/googlenet.jpeg" ><br/>
Figure 8. GoogleNet [12]
</p> </p>
...@@ -215,7 +216,7 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -215,7 +216,7 @@ paddle.init(use_gpu=False, trainer_count=1)
1. Define the data input and its dimensions
The network input is defined as `data_layer`, which in image classification holds the image pixels. CIFAR10 images are 32x32 RGB color images with three channels, so the input size is 3072 (3x32x32) and the number of classes is 10.
```python ```python
datadim = 3 * 32 * 32 datadim = 3 * 32 * 32
classdim = 10 classdim = 10
...@@ -230,7 +231,7 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -230,7 +231,7 @@ paddle.init(use_gpu=False, trainer_count=1)
net = vgg_bn_drop(image) net = vgg_bn_drop(image)
``` ```
The input to the core VGG module is the data layer. `vgg_bn_drop` defines a 16-layer VGG structure in which every convolution is followed by a BN layer and a Dropout layer. The detailed definition is as follows:
```python ```python
def vgg_bn_drop(input): def vgg_bn_drop(input):
def conv_block(ipt, num_filter, groups, dropouts, num_channels=None): def conv_block(ipt, num_filter, groups, dropouts, num_channels=None):
...@@ -261,11 +262,11 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -261,11 +262,11 @@ paddle.init(use_gpu=False, trainer_count=1)
fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear()) fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear())
return fc2 return fc2
``` ```
2.1. First, define a group of convolutional layers, `conv_block`. The convolution kernel is 3x3, the pooling window is 2x2, and the pooling stride is 2. `groups` sets how many consecutive convolutions each VGG block performs, and `dropouts` gives the dropout probabilities. The `img_conv_group` used here is a module predefined in `paddle.networks`; it consists of several `Conv->BN->ReLu->Dropout` groups followed by one `Pooling` layer.
2.2. Five groups of convolutions, i.e. five `conv_block`s. The first and second groups each apply two consecutive convolutions; the third, fourth, and fifth groups each apply three. The dropout probability after the last convolution in each group is 0, i.e. no dropout is applied there.
2.3. Finally, two fully connected layers of 512 dimensions each follow.
3. Define the classifier
...@@ -281,7 +282,7 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -281,7 +282,7 @@ paddle.init(use_gpu=False, trainer_count=1)
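4. Define the loss function and the network output
In supervised training, the class label of each input image is also required, and it is likewise defined through `paddle.layer.data`. During training, multi-class cross-entropy serves as the loss function and as the network output; at prediction time, the network output is the probability distribution produced by the classifier.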
```python ```python
lbl = paddle.layer.data( lbl = paddle.layer.data(
name="label", type=paddle.data_type.integer_value(classdim)) name="label", type=paddle.data_type.integer_value(classdim))
...@@ -346,9 +347,9 @@ def layer_warp(block_func, ipt, features, count, stride): ...@@ -346,9 +347,9 @@ def layer_warp(block_func, ipt, features, count, stride):
The structure of `resnet_cifar10` consists of the following steps.
1. The bottom input is connected to a `conv_bn_layer`, i.e. a convolutional layer with BN.
2. Three groups of residual modules follow, configured below as three `layer_warp` groups; each group is built from the residual module on the left of Figure 10.
3. Finally, average pooling is applied to the network and that layer is returned.
Note: excluding the first convolutional layer and the last fully connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6, i.e. the depth of `resnet_cifar10` must satisfy $(depth - 2) % 6 == 0$.
...@@ -493,7 +494,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123} ...@@ -493,7 +494,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
[2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005. [2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28. [3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
[4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003. [4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
...@@ -535,6 +536,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123} ...@@ -535,6 +536,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -553,6 +555,6 @@ marked.setOptions({ ...@@ -553,6 +555,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -39,6 +40,7 @@ ...@@ -39,6 +40,7 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -57,6 +59,6 @@ marked.setOptions({ ...@@ -57,6 +59,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -39,6 +40,7 @@ ...@@ -39,6 +40,7 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -57,6 +59,6 @@ marked.setOptions({ ...@@ -57,6 +59,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<meta http-equiv="refresh" content="0; url=https://github.com/paddlepaddle/book" /> <script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'] ],
processEscapes: true
},
"HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js" async></script>
<script type="text/javascript" src="../.tmpl/marked.js">
</script>
<link href="http://cdn.bootcss.com/highlight.js/9.9.0/styles/darcula.min.css" rel="stylesheet">
<script src="http://cdn.bootcss.com/highlight.js/9.9.0/highlight.min.js"></script>
<link href="http://cdn.bootcss.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" rel="stylesheet">
<link href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" rel="stylesheet">
<link href="../.tmpl/github-markdown.css" rel='stylesheet'>
</head> </head>
<style type="text/css" >
.markdown-body {
box-sizing: border-box;
min-width: 200px;
max-width: 980px;
margin: 0 auto;
padding: 45px;
}
</style>
<body> <body>
<a href="https://github.com/paddlepaddle/book">Please access github home page</a>
<div id="context" class="container markdown-body">
</div>
<!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'>
# Introduction to Deep Learning
1. Getting Started [[fit_a_line](fit_a_line/)] [[html](http://book.paddlepaddle.org/fit_a_line)]
1. Recognizing Digits [[recognize_digits](recognize_digits/)] [[html](http://book.paddlepaddle.org/recognize_digits)]
1. Image Classification [[image_classification](image_classification/)] [[html](http://book.paddlepaddle.org/image_classification)]
1. Word Vectors [[word2vec](word2vec/)] [[html](http://book.paddlepaddle.org/word2vec)]
1. Sentiment Analysis [[understand_sentiment](understand_sentiment/)] [[html](http://book.paddlepaddle.org/understand_sentiment)]
1. Semantic Role Labeling [[label_semantic_roles](label_semantic_roles/)] [[html](http://book.paddlepaddle.org/label_semantic_roles)]
1. Machine Translation [[machine_translation](machine_translation/)] [[html](http://book.paddlepaddle.org/machine_translation)]
1. Personalized Recommendation [[recommender_system](recommender_system/)] [[html](http://book.paddlepaddle.org/recommender_system)]
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div>
<!-- You can change the lines below now. -->
<script type="text/javascript">
marked.setOptions({
renderer: new marked.Renderer(),
gfm: true,
breaks: false,
smartypants: true,
highlight: function(code, lang) {
code = code.replace(/&amp;/g, "&")
code = code.replace(/&gt;/g, ">")
code = code.replace(/&lt;/g, "<")
code = code.replace(/&nbsp;/g, " ")
return hljs.highlightAuto(code, [lang]).value;
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
</script>
</body> </body>
...@@ -22,34 +22,20 @@ Standard SRL system mostly builds on top of Syntactic Analysis and contains five ...@@ -22,34 +22,20 @@ Standard SRL system mostly builds on top of Syntactic Analysis and contains five
<div align="center"> <div align="center">
<img src="image/dependency_parsing.png" width = "80%" align=center /><br> <img src="image/dependency_parsing_en.png" width = "80%" align=center /><br>
Fig 1. Syntactic parse tree Fig 1. Syntactic parse tree
</div> </div>
核心关系-> HED
定中关系-> ATT
主谓关系-> SBV
状中结构-> ADV
介宾关系-> POB
右附加关系-> RAD
动宾关系-> VOB
标点-> WP
However, complete syntactic analysis requires identifying the relations among all constituents, and the performance of SRL is sensitive to the precision of syntactic analysis, which makes SRL a very challenging task. To reduce the complexity while still obtaining some syntactic structure information, we often use shallow syntactic analysis, also called partial parsing or chunking. Unlike complete syntactic analysis, which requires constructing the complete parse tree, shallow syntactic analysis only needs to identify some independent components with relatively simple structure, such as verb phrases (chunks). To avoid the difficulty of constructing a highly accurate syntactic tree, some work\[[1](#Reference)\] proposed SRL methods based on semantic chunking, which convert SRL into a sequence tagging problem.
Sequence tagging tasks classify syntactic chunks using the BIO representation. For the syntactic chunks that form a chunk of type A, the first receives the tag B-A (Begin), the remaining ones receive the tag I-A (Inside), and all chunks outside receive the tag O.
The BIO representation of the above example is shown in Fig. 2.
<div align="center"> <div align="center">
<img src="image/bio_example.png" width = "90%" align=center /><br> <img src="image/bio_example_en.png" width = "90%" align=center /><br>
Fig 2. BIO representation
</div> </div>
输入序列-> input sequence
语块-> chunk
标注序列-> label sequence
角色-> role
This example illustrates the simplicity of sequence tagging: (1) shallow syntactic analysis lowers the precision required of syntactic analysis; (2) the step of pruning candidate arguments is removed; (3) argument identification and tagging are completed at the same time. Such a unified method simplifies the procedure, reduces the risk of accumulating errors, and further boosts performance.
In this tutorial, our SRL system is built as an end-to-end neural network. It takes only text sequences as input, without any syntactic parsing results or complex hand-designed features. We use the public dataset from the [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) as an example: given a sentence with its predicates marked, identify the corresponding arguments and their semantic roles by sequence tagging.
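The BIO scheme used for sequence tagging can be sketched in a few lines. This is an illustrative example only: the function `chunks_to_bio`, the sentence, and the role labels are made up here, and tokens outside every chunk are tagged with a plain `O`, as is common in BIO tagging.

```python
def chunks_to_bio(num_tokens, chunks):
    """Build a BIO tag sequence.

    chunks is a list of (start, end, label) spans with end exclusive;
    spans are assumed not to overlap.
    """
    tags = ["O"] * num_tokens
    for start, end, label in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


tokens = ["The", "cat", "chased", "a", "small", "mouse"]
chunks = [(0, 2, "A0"), (2, 3, "V"), (3, 6, "A1")]
for token, tag in zip(tokens, chunks_to_bio(len(tokens), chunks)):
    print(token, tag)
```

Because every token receives exactly one tag, identifying an argument and labeling its role reduce to a single per-token classification, which is what the network in this tutorial is trained to do.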
...@@ -70,14 +56,11 @@ The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map in ...@@ -70,14 +56,11 @@ The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map in
Fig.3 illustrates the final stacked recurrent neural network.
<p align="center"> <p align="center">
<img src="./image/stacked_lstm.png" width = "40%" align=center><br> <img src="./image/stacked_lstm_en.png" width = "40%" align=center><br>
Fig 3. Stacked Recurrent Neural Networks Fig 3. Stacked Recurrent Neural Networks
</p> </p>
线性变换-> linear transformation
输入层到隐层-> input-to-hidden
### Bidirectional Recurrent Neural Network ### Bidirectional Recurrent Neural Network
LSTMs can summarize the history of the inputs seen so far, but they cannot see the future. In most NLP (natural language processing) tasks, the entire sentence is available at once. Sequence learning can therefore be more effective if future context is encoded in the same way as history.
...@@ -85,16 +68,11 @@ LSTMs can summarize the history of previous inputs seen up to now, but can not s ...@@ -85,16 +68,11 @@ LSTMs can summarize the history of previous inputs seen up to now, but can not s
To address this drawback, we can design a bidirectional recurrent neural network by making a minor modification to the stacked network: each higher LSTM layer processes the sequence in the direction opposite to the layer below it, i.e., the deep LSTM runs left-to-right, right-to-left, left-to-right, ..., layer by layer. As a result, from the second layer onward, the LSTM at time step $t$ can see both the history and the future. Fig. 4 illustrates the bidirectional recurrent neural network.
<p align="center"> <p align="center">
<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br> <img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br>
Fig 4. Bidirectional LSTMs Fig 4. Bidirectional LSTMs
</p> </p>
线性变换-> linear transformation
输入层到隐层-> input-to-hidden
正向处理输出序列->process sequence in the forward direction
反向处理上一层序列-> process sequence from the previous layer in backward direction
Note that this bidirectional RNN is different from the one proposed by Bengio et al. for machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce that other bidirectional RNN in the later chapter on [machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md).
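The claim that every position sees both history and future from the second layer onward can be checked with a small sketch. It does not run a real LSTM: it only tracks which input positions each output position can depend on when layers alternate direction. The function name `stacked_receptive_fields` and the toy sizes are illustrative assumptions, not part of the tutorial code.

```python
def stacked_receptive_fields(seq_len, depth):
    """For each layer, list the set of input positions every time step can depend on."""
    # Before the first layer, position t depends only on input t.
    fields = [{t} for t in range(seq_len)]
    per_layer = []
    for layer in range(depth):
        # Even layers run left-to-right, odd layers run right-to-left.
        reverse = (layer % 2 == 1)
        order = range(seq_len - 1, -1, -1) if reverse else range(seq_len)
        new_fields = [None] * seq_len
        seen = set()  # what the recurrent state has absorbed so far in this pass
        for t in order:
            seen |= fields[t]
            new_fields[t] = set(seen)
        fields = new_fields
        per_layer.append(fields)
    return per_layer


layers = stacked_receptive_fields(seq_len=5, depth=2)
print(sorted(layers[0][2]))  # first layer, position 2: history only
print(sorted(layers[1][2]))  # second layer, position 2: history and future
```

In this sketch the first layer at position 2 depends on positions 0 to 2 only, while every position in the second layer depends on the whole sequence.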
### Conditional Random Field ### Conditional Random Field
...@@ -106,12 +84,12 @@ CRF is a probabilistic graph model (undirected) with nodes denoting random varia ...@@ -106,12 +84,12 @@ CRF is a probabilistic graph model (undirected) with nodes denoting random varia
Sequence tagging tasks treat both input and output as linear sequences and make no additional conditional independence assumptions on the graph model. The graph for sequence tagging is therefore a simple chain, which gives a Linear-Chain Conditional Random Field, shown in Fig.5.
<p align="center"> <p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br> <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
Fig 5. Linear Chain Conditional Random Field used in SRL tasks Fig 5. Linear Chain Conditional Random Field used in SRL tasks
</p> </p>
By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
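A minimal sketch of this formula for a toy problem may help. It assumes the weighted transition features $\sum_j \lambda_j t_j$ are already collapsed into a table `trans[prev][cur]` and the weighted state features $\sum_k \mu_k s_k$ into a table `emit[i][label]`; it also assumes no transition score is applied at the first position. These names and numbers are hypothetical, and $Z(X)$ is computed by brute-force enumeration, which only works for tiny inputs.

```python
import itertools
import math


def sequence_score(labels, emit, trans):
    """Unnormalized log-score of one label sequence: state scores plus transition scores."""
    score = emit[0][labels[0]]
    for i in range(1, len(labels)):
        score += trans[labels[i - 1]][labels[i]] + emit[i][labels[i]]
    return score


def crf_probability(labels, emit, trans):
    """p(Y | X) for a linear-chain CRF, with Z(X) computed by enumeration."""
    num_labels = len(trans)
    n = len(emit)
    z = sum(
        math.exp(sequence_score(y, emit, trans))
        for y in itertools.product(range(num_labels), repeat=n)
    )
    return math.exp(sequence_score(labels, emit, trans)) / z


# Toy scores: 3 positions, 2 labels (all values are made up for illustration).
emit = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.5]]
trans = [[0.6, -0.4], [-0.2, 0.8]]

p = crf_probability((0, 1, 1), emit, trans)
total = sum(
    crf_probability(y, emit, trans)
    for y in itertools.product(range(2), repeat=3)
)
print(p, total)
```

Because $Z(X)$ sums over every label sequence, the probabilities of all sequences add up to one. Practical implementations replace the enumeration with the forward algorithm.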
...@@ -155,19 +133,11 @@ After modification, the model is as follows: ...@@ -155,19 +133,11 @@ After modification, the model is as follows:
4. Take the representation from step 3 as the input of the CRF and the label sequence as the supervision signal, and perform sequence tagging.
<div align="center"> <div align="center">
<img src="image/db_lstm_network.png" width = "60%" align=center /><br> <img src="image/db_lstm_en.png" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks Fig 6. DB-LSTM for SRL tasks
</div> </div>
论元-> argu
谓词-> pred
谓词上下文-> ctx-p
谓词上下文区域标记-> $m_r$
输入-> input
原句-> sentence
反向LSTM-> LSTM Reverse
## Data Preparation ## Data Preparation
In this tutorial, we use the open dataset of the [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task as an example. Note that the training set and development set of the CoNLL 2005 SRL task are not free to download after the competition. Currently only the test set can be obtained; it includes Section 23 of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ data from the test set as the training dataset to explain the model. However, since this training set is small, consider paying for the full corpus if you want to train a usable neural network SRL system.
...@@ -259,10 +229,10 @@ def d_type(value_range): ...@@ -259,10 +229,10 @@ def d_type(value_range):
# word sequence
word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
# predicate
predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
# 5 features for predicate context
ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
...@@ -274,12 +244,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len)) ...@@ -274,12 +244,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
# label sequence
target = paddle.layer.data(name='target', type=d_type(label_dict_len))
```
Special note: hidden_dim = 512 means that the LSTM hidden vector has 128 dimensions (512/4). Please refer to the PaddlePaddle official documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory).
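The 512/4 relationship can be sketched in plain Python. The usual explanation, stated here as an assumption rather than taken from the lstmemory documentation, is that the input-to-hidden projection for the three gates and the cell candidate is computed before the LSTM layer, so the layer receives a vector four times as wide as its hidden state. The helper `split_gate_inputs` is hypothetical, and the order of the four slices is not specified here.

```python
hidden_dim = 512             # width of the projection fed into the LSTM layer
lstm_size = hidden_dim // 4  # width of the LSTM hidden state


def split_gate_inputs(projection):
    """Split one projected vector into four equal slices (illustrative helper)."""
    assert len(projection) % 4 == 0
    size = len(projection) // 4
    return [projection[i * size:(i + 1) * size] for i in range(4)]


slices = split_gate_inputs([0.0] * hidden_dim)
print(lstm_size, [len(s) for s in slices])
```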
- 2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences.
```python
# The word-vector lookup table is pre-trained, so we do not update it here.
# Setting is_static to True prevents the lookup table from being updated during training.
...@@ -405,7 +375,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec]) ...@@ -405,7 +375,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec])
```
We can print out the parameter names. A name is generated automatically if none is specified.
```python
print(parameters.keys())
```
......
...@@ -52,7 +52,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb ...@@ -52,7 +52,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb
Fig. 3 shows the resulting stacked recurrent neural network.
<p align="center"> <p align="center">
<img src="./image/stacked_lstm.png" width = "40%" align=center><br> <img src="./image/stacked_lstm.png" width = "40%" align=center><br>
Fig 3. Stacked recurrent neural network based on LSTM
</p> </p>
...@@ -63,7 +63,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb ...@@ -63,7 +63,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb
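To overcome this drawback, we can design a bidirectional recurrent network unit. The idea is simple and direct: make a small modification to the stacked recurrent neural network from the previous section by stacking multiple LSTM units so that each layer learns from the output sequence of the layer below in alternating order: forward, backward, forward, and so on. From the second layer onward, the LSTM unit at time step $t$ can therefore always see both historical and future information. Fig. 4 shows the LSTM-based bidirectional recurrent neural network.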
<p align="center"> <p align="center">
<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br> <img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
Fig 4. Bidirectional recurrent neural network based on LSTM
</p> </p>
...@@ -78,7 +78,7 @@ CRF是一种概率化结构模型,可以看作是一个概率无向图模型 ...@@ -78,7 +78,7 @@ CRF是一种概率化结构模型,可以看作是一个概率无向图模型
A sequence tagging task only needs to treat both the input and the output as linear sequences. Moreover, because the input sequence is used only as a condition and no conditional independence assumptions are made about it, there is no graph structure among the elements of the input sequence. In summary, sequence tagging uses the CRF defined on a chain graph shown in Fig. 5, known as the Linear Chain Conditional Random Field.
<p align="center"> <p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br> <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
Fig 5. Linear chain conditional random field used in sequence tagging tasks
</p> </p>
...@@ -122,7 +122,7 @@ $$L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \fra ...@@ -122,7 +122,7 @@ $$L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \fra
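3. The 4 embedding sequences from step 2 are the input of the bidirectional LSTM model; the LSTM model learns a feature representation of the input sequences and produces a new sequence of feature representations;
4. The CRF takes the features learned by the LSTM in step 3 as input and the label sequence as the supervision signal, and completes the sequence tagging;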
<div align="center"> <div align="center">
<img src="image/db_lstm_network.png" width = "60%" align=center /><br> <img src="image/db_lstm_network.png" width = "60%" align=center /><br>
Fig 6. Deep bidirectional LSTM model for the SRL task
</div> </div>
...@@ -161,7 +161,7 @@ conll05st-release/ ...@@ -161,7 +161,7 @@ conll05st-release/
After preprocessing, a training sample contains 9 features: sentence sequence, predicate, predicate context (5 columns), predicate context region mark, and label sequence. The table below shows one training sample.
| Sentence sequence | Predicate | Predicate context (window = 5) | Predicate context region mark | Label sequence |
|---|---|---|---|---|
| A | set | n't been set . × | 0 | B-A1 |
| record | set | n't been set . × | 0 | I-A1 |
...@@ -214,7 +214,7 @@ word_dim = 32 # 词向量维度 ...@@ -214,7 +214,7 @@ word_dim = 32 # 词向量维度
mark_dim = 5 # the predicate context region mark is mapped to a real vector through a lookup table; this is its dimension
hidden_dim = 512 # dimension of the LSTM hidden vector: 512 / 4
depth = 8 # depth of the stacked LSTM
# A sample has 9 features in total; 9 data layers are defined below, each of type integer_value_sequence, i.e., a sequence of integer IDs.
def d_type(size):
    return paddle.data_type.integer_value_sequence(size)
...@@ -222,10 +222,10 @@ def d_type(size): ...@@ -222,10 +222,10 @@ def d_type(size):
# sentence sequence
word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
# predicate
predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
# 5 features for predicate context
ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
...@@ -237,12 +237,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len)) ...@@ -237,12 +237,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
# label sequence
target = paddle.layer.data(name='target', type=d_type(label_dict_len))
```
Note in particular that hidden_dim = 512 specifies an LSTM hidden vector of 128 dimensions. For details, see the description of [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) in the PaddlePaddle official documentation.
- 2. The sentence sequence, predicate, predicate context, and predicate context region mark are converted through lookup tables into embedding sequences of real-valued vectors.
```python
# In this tutorial we load pre-trained word vectors, so is_static=True is set here.
# When is_static is True, the lookup table is not updated while the SRL model is trained.
...@@ -369,7 +369,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec]) ...@@ -369,7 +369,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec])
```
The parameter names can be printed. If a name is not specified in the network configuration, one is generated by default.
```python
print(parameters.keys())
```
......
...@@ -20,7 +20,7 @@ from optparse import OptionParser ...@@ -20,7 +20,7 @@ from optparse import OptionParser
def read_labels(props_file):
    '''
    A sentence may have more than one verb, and each verb has its own label sequence.
    label[] is a 3-dimensional list.
    The first dimension stores the label sequences of all sentences; its length is the number of sentences.
    The second dimension stores all label sequences of one sentence.
    The third dimension stores the label of each word.
......
label_semantic_roles/image/bio_example.png (379.1 KB → 31.7 KB)
label_semantic_roles/image/dependency_parsing.png (186.0 KB → 58.2 KB)
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -63,34 +64,20 @@ Standard SRL system mostly builds on top of Syntactic Analysis and contains five ...@@ -63,34 +64,20 @@ Standard SRL system mostly builds on top of Syntactic Analysis and contains five
<div align="center"> <div align="center">
<img src="image/dependency_parsing.png" width = "80%" align=center /><br> <img src="image/dependency_parsing_en.png" width = "80%" align=center /><br>
Fig 1. Syntactic parse tree Fig 1. Syntactic parse tree
</div> </div>
核心关系-> HED
定中关系-> ATT
主谓关系-> SBV
状中结构-> ADV
介宾关系-> POB
右附加关系-> RAD
动宾关系-> VOB
标点-> WP
However, complete syntactic analysis requires identifying the relations among all constituents, and the performance of SRL is sensitive to the precision of syntactic analysis, which makes SRL a very challenging task. To reduce the complexity while still obtaining some syntactic structure information, we often use shallow syntactic analysis, also called partial parsing or chunking. Unlike complete syntactic analysis, which requires constructing the complete parse tree, shallow syntactic analysis only needs to identify some independent components with relatively simple structure, such as verb phrases (chunks). To avoid the difficulty of constructing a highly accurate syntactic tree, some work \[[1](#Reference)\] proposed SRL methods based on semantic chunking, which cast SRL as a sequence tagging problem.
Sequence tagging tasks classify syntactic chunks using the BIO representation. For syntactic chunks forming a chunk of type A, the first chunk receives the tag B-A (Begin), the remaining ones receive the tag I-A (Inside), and all chunks outside receive the tag O.
The BIO representation of the above example is shown in Fig.2.
<div align="center"> <div align="center">
<img src="image/bio_example.png" width = "90%" align=center /><br> <img src="image/bio_example_en.png" width = "90%" align=center /><br>
Fig 2. BIO representation
</div> </div>
输入序列-> input sequence
语块-> chunk
标注序列-> label sequence
角色-> role
This example illustrates the simplicity of sequence tagging: (1) shallow syntactic analysis lowers the precision required of syntactic analysis; (2) the step of pruning candidate arguments is no longer needed; (3) argument identification and argument labeling are completed at the same time. Such a unified method simplifies the procedure, reduces the risk of accumulating errors, and can further improve performance.
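The BIO scheme described above can be sketched in a few lines. The helper `spans_to_bio` and the span boundaries are illustrative assumptions; the sentence is the one used in the training sample table, and tokens outside every argument get the plain `O` tag.

```python
def spans_to_bio(num_tokens, spans):
    """Convert labeled spans into BIO tags.

    spans: list of (start, end, role) with `end` exclusive; spans must not overlap.
    """
    tags = ["O"] * num_tokens
    for start, end, role in spans:
        tags[start] = "B-" + role          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + role          # remaining tokens of the chunk
    return tags


tokens = ["A", "record", "date", "has", "n't", "been", "set", "."]
# Assumed argument spans for the predicate "set" (for illustration only).
spans = [(0, 3, "A1"), (4, 5, "AM-NEG"), (6, 7, "V")]
bio_tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```

Argument identification and role labeling happen in one pass here: a single tag per token encodes both where a chunk starts and which role it plays.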
In this tutorial, our SRL system is built as an end-to-end system via a neural network. It takes only text sequences as input, without any syntactic parsing results or complex hand-designed features. We use the public dataset of the [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) to illustrate the task: given a sentence with its predicates marked, identify the corresponding arguments and their semantic roles by sequence tagging.
...@@ -111,14 +98,11 @@ The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map in ...@@ -111,14 +98,11 @@ The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map in
Fig.3 illustrates the final stacked recurrent neural network.
<p align="center"> <p align="center">
<img src="./image/stacked_lstm.png" width = "40%" align=center><br> <img src="./image/stacked_lstm_en.png" width = "40%" align=center><br>
Fig 3. Stacked Recurrent Neural Networks Fig 3. Stacked Recurrent Neural Networks
</p> </p>
线性变换-> linear transformation
输入层到隐层-> input-to-hidden
### Bidirectional Recurrent Neural Network ### Bidirectional Recurrent Neural Network
LSTMs can summarize the history of the inputs seen so far, but they cannot see the future. In most NLP (natural language processing) tasks, the entire sentence is available at once. Sequence learning can therefore be more effective if future context is encoded in the same way as history.
...@@ -126,16 +110,11 @@ LSTMs can summarize the history of previous inputs seen up to now, but can not s ...@@ -126,16 +110,11 @@ LSTMs can summarize the history of previous inputs seen up to now, but can not s
To address this drawback, we can design a bidirectional recurrent neural network by making a minor modification to the stacked network: each higher LSTM layer processes the sequence in the direction opposite to the layer below it, i.e., the deep LSTM runs left-to-right, right-to-left, left-to-right, ..., layer by layer. As a result, from the second layer onward, the LSTM at time step $t$ can see both the history and the future. Fig. 4 illustrates the bidirectional recurrent neural network.
<p align="center"> <p align="center">
<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br> <img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br>
Fig 4. Bidirectional LSTMs Fig 4. Bidirectional LSTMs
</p> </p>
线性变换-> linear transformation
输入层到隐层-> input-to-hidden
正向处理输出序列->process sequence in the forward direction
反向处理上一层序列-> process sequence from the previous layer in backward direction
Note that this bidirectional RNN is different from the one proposed by Bengio et al. for machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce that other bidirectional RNN in the later chapter on [machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md).
### Conditional Random Field ### Conditional Random Field
...@@ -147,12 +126,12 @@ CRF is a probabilistic graph model (undirected) with nodes denoting random varia ...@@ -147,12 +126,12 @@ CRF is a probabilistic graph model (undirected) with nodes denoting random varia
Sequence tagging tasks treat both input and output as linear sequences and make no additional conditional independence assumptions on the graph model. The graph for sequence tagging is therefore a simple chain, which gives a Linear-Chain Conditional Random Field, shown in Fig.5.
<p align="center"> <p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br> <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
Fig 5. Linear Chain Conditional Random Field used in SRL tasks Fig 5. Linear Chain Conditional Random Field used in SRL tasks
</p> </p>
By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
...@@ -196,19 +175,11 @@ After modification, the model is as follows: ...@@ -196,19 +175,11 @@ After modification, the model is as follows:
4. Take the representation from step 3 as the input of the CRF and the label sequence as the supervision signal, and perform sequence tagging.
<div align="center"> <div align="center">
<img src="image/db_lstm_network.png" width = "60%" align=center /><br> <img src="image/db_lstm_en.png" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks Fig 6. DB-LSTM for SRL tasks
</div> </div>
论元-> argu
谓词-> pred
谓词上下文-> ctx-p
谓词上下文区域标记-> $m_r$
输入-> input
原句-> sentence
反向LSTM-> LSTM Reverse
## Data Preparation ## Data Preparation
In this tutorial, we use the open dataset of the [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task as an example. Note that the training set and development set of the CoNLL 2005 SRL task are not free to download after the competition. Currently only the test set can be obtained; it includes Section 23 of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ data from the test set as the training dataset to explain the model. However, since this training set is small, consider paying for the full corpus if you want to train a usable neural network SRL system.
...@@ -300,10 +271,10 @@ def d_type(value_range): ...@@ -300,10 +271,10 @@ def d_type(value_range):
# word sequence
word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
# predicate
predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
# 5 features for predicate context
ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
...@@ -315,12 +286,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len)) ...@@ -315,12 +286,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
# label sequence
target = paddle.layer.data(name='target', type=d_type(label_dict_len))
```
Special note: hidden_dim = 512 means that the LSTM hidden vector has 128 dimensions (512/4). Please refer to the PaddlePaddle official documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory).
- 2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences. - 2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences.
```python ```python
# Since the word vector lookup table is pre-trained, we won't update it this time.
# Setting is_static to True prevents the lookup table from being updated during training.
...@@ -446,7 +417,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec]) ...@@ -446,7 +417,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec])
``` ```
We can print out the parameter names. If a name is not specified in the network configuration, one is generated automatically.
```python ```python
print parameters.keys() print parameters.keys()
``` ```
...@@ -542,6 +513,7 @@ Semantic Role Labeling is an important intermediate step in a wide range of natu ...@@ -542,6 +513,7 @@ Semantic Role Labeling is an important intermediate step in a wide range of natu
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -560,6 +532,6 @@ marked.setOptions({ ...@@ -560,6 +532,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -93,7 +94,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb ...@@ -93,7 +94,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb
Figure 3 shows the resulting stacked recurrent neural network structure.
<p align="center"> <p align="center">
<img src="./image/stacked_lstm.png" width = "40%" align=center><br> <img src="./image/stacked_lstm.png" width = "40%" align=center><br>
Figure 3. Structure of the stacked recurrent neural network based on LSTM
</p> </p>
...@@ -104,7 +105,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb ...@@ -104,7 +105,7 @@ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mb
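To overcome this limitation, we can design a bidirectional recurrent network unit. The idea is simple and direct: make a small modification to the stacked recurrent neural network from the previous section by stacking multiple LSTM units so that each layer learns the output sequence of the previous layer in alternating order: forward, backward, forward, and so on. As a result, from the second layer onward, the LSTM unit at time $t$ can always see both historical and future information. Figure 4 shows the structure of the bidirectional recurrent neural network based on LSTM.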
<p align="center"> <p align="center">
<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br> <img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
Figure 4. Structure of the bidirectional recurrent neural network based on LSTM
</p> </p>
...@@ -119,7 +120,7 @@ CRF是一种概率化结构模型,可以看作是一个概率无向图模型 ...@@ -119,7 +120,7 @@ CRF是一种概率化结构模型,可以看作是一个概率无向图模型
A sequence labeling task only needs to consider the case where both the input and the output are linear sequences. Moreover, because the input sequence is used only as a condition and no conditional independence assumptions are made about it, there is no graph structure among the elements of the input sequence. In summary, sequence labeling uses a CRF defined on a chain graph, as shown in Figure 5, which is called a Linear Chain Conditional Random Field.
<p align="center"> <p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br> <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
Figure 5. Linear chain conditional random field used in sequence labeling tasks
</p> </p>
...@@ -163,7 +164,7 @@ $$L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \fra ...@@ -163,7 +164,7 @@ $$L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \fra
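3. The 4 embedding sequences from step 2 are the input to the bidirectional LSTM model; the LSTM model learns a feature representation of the input sequence and produces a new sequence of feature representations;
4. The CRF takes the features learned by the LSTM in step 3 as input and the label sequence as the supervision signal to complete sequence labeling;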
<div align="center"> <div align="center">
<img src="image/db_lstm_network.png" width = "60%" align=center /><br> <img src="image/db_lstm_network.png" width = "60%" align=center /><br>
Figure 6. Deep bidirectional LSTM model for the SRL task
</div> </div>
...@@ -202,7 +203,7 @@ conll05st-release/ ...@@ -202,7 +203,7 @@ conll05st-release/
After preprocessing, a training sample contains 9 features: the sentence sequence, the predicate, the predicate context (5 columns), the predicate context region mark, and the label sequence. The table below shows an example training sample.
| Sentence sequence | Predicate | Predicate context (window = 5) | Predicate context region mark | Label sequence |
|---|---|---|---|---|
| A | set | n't been set . × | 0 | B-A1 |
| record | set | n't been set . × | 0 | I-A1 |
...@@ -255,7 +256,7 @@ word_dim = 32 # 词向量维度 ...@@ -255,7 +256,7 @@ word_dim = 32 # 词向量维度
mark_dim = 5 # the predicate context region mark is mapped to a real-valued vector via a lookup table; this is the dimension of that vector
hidden_dim = 512 # dimension of the LSTM hidden vector: 512 / 4
depth = 8 # depth of the stacked LSTM
# A sample has 9 features in total. Below we define 9 data layers, each of type integer_value_sequence, i.e. a sequence of integer IDs.
def d_type(size): def d_type(size):
return paddle.data_type.integer_value_sequence(size) return paddle.data_type.integer_value_sequence(size)
...@@ -263,10 +264,10 @@ def d_type(size): ...@@ -263,10 +264,10 @@ def d_type(size):
# sentence sequence
word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
# predicate
predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
# 5 features for the predicate context
ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
...@@ -278,12 +279,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len)) ...@@ -278,12 +279,12 @@ mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
# label sequence
target = paddle.layer.data(name='target', type=d_type(label_dict_len))
``` ```
Note in particular that `hidden_dim = 512` specifies an LSTM hidden vector of 128 dimensions. For details, see the description of [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) in the official PaddlePaddle documentation.
- 2. Convert the sentence sequence, predicate, predicate context, and predicate context region mark into sequences of real-valued embedding vectors via lookup tables.
```python ```python
# In this tutorial we load pre-trained word embeddings, so is_static=True is set here.
# With is_static set to True, the lookup table is not updated while the SRL model is trained.
...@@ -410,7 +411,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec]) ...@@ -410,7 +411,7 @@ parameters = paddle.parameters.create([crf_cost, crf_dec])
``` ```
We can print the parameter names. If a name is not specified in the network configuration, one is generated by default.
```python ```python
print parameters.keys() print parameters.keys()
``` ```
...@@ -509,6 +510,7 @@ trainer.train( ...@@ -509,6 +510,7 @@ trainer.train(
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -527,6 +529,6 @@ marked.setOptions({ ...@@ -527,6 +529,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
...@@ -19,7 +19,7 @@ function get_best_pass() { ...@@ -19,7 +19,7 @@ function get_best_pass() {
cat $1 | grep -Pzo 'Test .*\n.*pass-.*' | \ cat $1 | grep -Pzo 'Test .*\n.*pass-.*' | \
sed -r 'N;s/Test.* cost=([0-9]+\.[0-9]+).*\n.*pass-([0-9]+)/\1 \2/g' | \ sed -r 'N;s/Test.* cost=([0-9]+\.[0-9]+).*\n.*pass-([0-9]+)/\1 \2/g' | \
sort -n | head -n 1 sort -n | head -n 1
} }
log=train.log log=train.log
LOG=`get_best_pass $log` LOG=`get_best_pass $log`
...@@ -28,11 +28,11 @@ best_model_path="output/pass-${LOG[1]}" ...@@ -28,11 +28,11 @@ best_model_path="output/pass-${LOG[1]}"
config_file=db_lstm.py config_file=db_lstm.py
dict_file=./data/wordDict.txt dict_file=./data/wordDict.txt
label_file=./data/targetDict.txt label_file=./data/targetDict.txt
predicate_dict_file=./data/verbDict.txt predicate_dict_file=./data/verbDict.txt
input_file=./data/feature input_file=./data/feature
output_file=predict.res output_file=predict.res
python predict.py \ python predict.py \
-c $config_file \ -c $config_file \
-w $best_model_path \ -w $best_model_path \
......
...@@ -9,19 +9,19 @@ Machine translation (MT) leverages computers to translate from one language to a ...@@ -9,19 +9,19 @@ Machine translation (MT) leverages computers to translate from one language to a
Early machine translation systems were mainly rule-based, i.e., they relied on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in even one language, so specifying all possible rules for two or more languages is an even greater challenge. Hence, a major challenge in conventional machine translation has been the difficulty of obtaining a complete rule set \[[1](#References)\].
To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example: To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example:
1. human designed features cannot cover all possible linguistic variations; 1. human designed features cannot cover all possible linguistic variations;
2. it is difficult to use global features; 2. it is difficult to use global features;
3. the techniques heavily rely on pre-processing techniques like word alignment, word segmentation and tokenization, rule-extraction and syntactic parsing etc. The error introduced in any of these steps could accumulate and impact translation quality. 3. the techniques heavily rely on pre-processing techniques like word alignment, word segmentation and tokenization, rule-extraction and syntactic parsing etc. The error introduced in any of these steps could accumulate and impact translation quality.
The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are: The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1); 1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
2. techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT). 2. techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT).
...@@ -57,7 +57,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent ...@@ -57,7 +57,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent
We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter. We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter.
Compared to a simple RNN, the LSTM adds a memory cell, an input gate, a forget gate, and an output gate. These gates, combined with the memory cell, greatly improve the ability to handle long-term dependencies.
GRU\[[2](#References)\], proposed by Cho et al., is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
A GRU unit has only two gates: A GRU unit has only two gates:
- reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output. - reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output.
- update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1. - update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1.
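The two gates can be sketched in a few lines of plain Python. This is a minimal illustration, not PaddlePaddle's GRU layer: the parameter names (`W_r`, `U_z`, and so on) and the toy values are assumptions, and the blending rule follows the convention stated above, where an update gate close to 1 carries the previous hidden state forward.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, params):
    """One GRU time step on plain Python lists."""
    def affine(gate, hidden):
        wx = matvec(params["W_" + gate], x)
        uh = matvec(params["U_" + gate], hidden)
        return [a + b + c for a, b, c in zip(wx, uh, params["b_" + gate])]

    r = [sigmoid(v) for v in affine("r", h_prev)]  # reset gate
    z = [sigmoid(v) for v in affine("z", h_prev)]  # update gate
    # A closed reset gate (r near 0) hides the history from the candidate state.
    gated_history = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(v) for v in affine("h", gated_history)]
    # An update gate near 1 passes the previous hidden state through.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


# Toy parameters (hypothetical): input size 1, hidden size 2.
params = {
    "W_r": [[0.1], [0.2]], "U_r": [[0.0, 0.0], [0.0, 0.0]], "b_r": [0.0, 0.0],
    "W_z": [[0.3], [0.4]], "U_z": [[0.0, 0.0], [0.0, 0.0]], "b_z": [50.0, 50.0],
    "W_h": [[0.5], [-0.3]], "U_h": [[1.0, 0.0], [0.0, 1.0]], "b_h": [0.0, 0.0],
}
h_prev = [0.7, -0.2]
h_keep = gru_step([1.0], h_prev, params)  # update gate saturated near 1

# Close both gates: the new state no longer depends on the history.
closed = dict(params, b_z=[-50.0, -50.0], b_r=[-50.0, -50.0])
h_a = gru_step([1.0], [0.7, -0.2], closed)
h_b = gru_step([1.0], [5.0, 3.0], closed)
```

With the update gate saturated near 1, `h_keep` matches `h_prev`; with both gates closed, `h_a` and `h_b` agree even though they started from different histories.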
...@@ -96,20 +96,20 @@ There are three steps for encoding a sentence: ...@@ -96,20 +96,20 @@ There are three steps for encoding a sentence:
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\in R^{\left | V \right |},i=1,2,...,T$, where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the position of the word in the dictionary and zero elsewhere.
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with the one-hot vector representation:
  * the dimensionality of the vector is typically large, leading to the curse of dimensionality;
  * it is hard to capture the relationships between words, i.e., semantic similarities.

  Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, where $C\in R^{K\times \left | V \right |}$ is the projection matrix and $K$ is the dimensionality of the word embedding vector.
3. Encoding of the source sequence via RNN: This can be described mathematically as:
$$h_i=\phi _\theta \left ( h_{i-1}, s_i \right )$$
where $h_0$ is a zero vector, $\phi _\theta$ is a non-linear activation function, and $\mathbf{h}=\left \{ h_1,..., h_T \right \}$ is the sequential encoding of the first $T$ words from the source sequence. The vector representation of the whole sentence can be taken as the encoding vector at the last time step $T$ from $\mathbf{h}$, or obtained by temporal pooling over $\mathbf{h}$.
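The three encoding steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the tiny matrices `C`, `W`, and `U` are made-up values, and a plain `tanh` recurrence stands in for the encoder's non-linear function, whereas the tutorial's model uses a GRU.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def one_hot(index, vocab_size):
    """Step 1: a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def embed(C, w):
    """Step 2: s_i = C w_i, which simply selects one column of C."""
    return matvec(C, w)


def encode(embeddings, W, U):
    """Step 3: h_i = tanh(W s_i + U h_{i-1}), starting from a zero vector h_0."""
    h = [0.0] * len(U)
    states = []
    for s in embeddings:
        pre = [a + b for a, b in zip(matvec(W, s), matvec(U, h))]
        h = [math.tanh(v) for v in pre]
        states.append(h)
    return states


# Hypothetical sizes: |V| = 4 words, K = 2 embedding dimensions, 2 hidden units.
C = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]
W = [[0.5, -0.5], [0.25, 0.75]]
U = [[0.1, 0.0], [0.0, 0.1]]

word_ids = [2, 0, 3]                       # a source sentence of T = 3 words
embeddings = [embed(C, one_hot(i, 4)) for i in word_ids]
states = encode(embeddings, W, U)
sentence_vector = states[-1]               # encoding at the last time step T
mean_pooled = [sum(col) / len(states) for col in zip(*states)]  # temporal pooling
```

Because $w_i$ is one-hot, the product $Cw_i$ is just a column lookup, which is why embedding layers are implemented as table lookups rather than as matrix multiplications.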
...@@ -142,8 +142,8 @@ The generation process of machine translation is to translate the source sentenc ...@@ -142,8 +142,8 @@ The generation process of machine translation is to translate the source sentenc
### Attention Mechanism ### Attention Mechanism
There are a few problems with the fixed dimensional vector representation from the encoding stage: There are a few problems with the fixed dimensional vector representation from the encoding stage:
* It is very challenging to encode both the semantic and syntactic information of a sentence with a fixed dimensional vector regardless of the length of the sentence.
* Intuitively, when translating a sentence, we typically pay more attention to the parts of the source sentence that are more relevant to the current translation. Moreover, the focus changes as the translation proceeds. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention, which is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced the attention mechanism, which can decode based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. The decoder with attention is explained below.
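As a rough sketch of the idea, the context vector can be computed as a weighted sum of the encoder states, with weights obtained from a softmax over alignment scores. The dot-product score below is an assumption chosen for brevity; it is swapped in for the learned alignment model of Bahdanau et al., and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(z, encoder_states):
    """Return attention weights and the context vector for decoder state z."""
    # Alignment score between the decoder state and each encoder state
    # (dot product here; an assumption, not the learned alignment network).
    scores = [sum(a * b for a, b in zip(z, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 3.0]
weights, context = attention_context(decoder_state, encoder_states)
```

Here the second encoder state is most similar to the decoder state, so it receives the largest weight and dominates the context vector. A different decoder state would shift the focus to other source positions.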
Different from the simple decoder, $z_i$ is computed as: Different from the simple decoder, $z_i$ is computed as:
...@@ -172,7 +172,7 @@ Figure 6. Decoder with Attention Mechanism ...@@ -172,7 +172,7 @@ Figure 6. Decoder with Attention Mechanism
[Beam Search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is typically used when the solution space is huge (e.g., for machine translation or speech recognition) and there is not enough memory for all the possible solutions. For example, if we want to translate “`<s>你好<e>`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, in which the word `hello` can appear a different number of times. Beam search can be used to find a good translation among them.
Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed. Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed.
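The search described above can be sketched as follows, using the three-word dictionary from the example. The scoring function `toy_model` and its probabilities are made up for illustration; in a real system the log probabilities would come from the decoder.

```python
import math


def beam_search(next_log_probs, start, end, beam_size, max_len):
    """Keep the beam_size best partial sequences at each level of the tree."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                # Score of a sequence = sum of log probabilities of its words.
                candidates.append((score + logp, seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((score, seq))  # complete sentence
            else:
                beams.append((score, seq))     # expand further at the next level
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[0])


def toy_model(seq):
    """Hypothetical next-word distribution over {hello, <e>}."""
    if seq[-1] == "<s>":
        return {"hello": math.log(0.9), "<e>": math.log(0.1)}
    return {"hello": math.log(0.4), "<e>": math.log(0.6)}


best_score, best_seq = beam_search(toy_model, "<s>", "<e>", beam_size=2, max_len=4)
```

Only `beam_size` candidates survive at each level, so memory and time stay bounded even though the number of possible sentences is unbounded. The price is that a sequence pruned early cannot be recovered later.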
The goal is to maximize the probability of the generated sequence when using beam search in decoding. The procedure is as follows:
...@@ -452,7 +452,7 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn ...@@ -452,7 +452,7 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary
word_vector_dim = 512 # dimensionality of word vector word_vector_dim = 512 # dimensionality of word vector
encoder_size = 512 # dimensionality of the hidden state of encoder GRU encoder_size = 512 # dimensionality of the hidden state of encoder GRU
decoder_size = 512 # dimensionality of the hidden state of decoder GRU
if is_generating: if is_generating:
......
...@@ -93,7 +93,7 @@ GRU\[[2](#参考文献)\]是Cho等人在LSTM上提出的简化版本,也是RNN ...@@ -93,7 +93,7 @@ GRU\[[2](#参考文献)\]是Cho等人在LSTM上提出的简化版本,也是RNN
When training a machine translation model, the goal of the decoding stage is to maximize the probability of the next correct target-language word. The approach is:
1. At each time step, compute the next hidden state $z_{i+1}$ from the encoding of the source sentence (also called the context vector) $c$, the $i$-th word $u_i$ of the ground-truth target sequence, and the RNN hidden state $z_i$ at time $i$. The formula is:
$$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$
where $\phi _{\theta '}$ is a non-linear activation function and $c=q\mathbf{h}$ is the context vector of the source sentence. Without the [attention mechanism](#注意力机制), if the output of the [encoder](#编码器) is the last element of the encoded source sentence, we can define $c=h_T$. $u_i$ is the $i$-th word of the target sequence, and $u_0$ is the start mark `<s>` of the target sequence, which indicates the start of decoding. $z_i$ is the hidden state of the decoding RNN at time $i$, and $z_0$ is an all-zero vector.
...@@ -150,17 +150,19 @@ e_{ij}&=align(z_i,h_j)\\\\ ...@@ -150,17 +150,19 @@ e_{ij}&=align(z_i,h_j)\\\\
Note: $z_{i+1}$ and $p_{i+1}$ are computed with the same formulas as in the [decoder](#解码器). Because each generation step is greedy, a globally optimal solution is not guaranteed.
## 数据准备 ## 数据介绍
### 下载与解压缩 ### 下载与解压缩
本教程使用[WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/)数据集中的[bitexts(after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz)作为训练集,[dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz)作为测试集和生成集。 本教程使用[WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/)数据集中的[bitexts(after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz)作为训练集,[dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz)作为测试集和生成集。
在Linux下,只需简单地运行以下命令: 在Linux下,只需简单地运行以下命令:
```bash ```bash
cd data cd data
./wmt14_data.sh ./wmt14_data.sh
``` ```
得到的数据集`data/wmt14`包含如下三个文件夹: 得到的数据集`data/wmt14`包含如下三个文件夹:
<p align = "center"> <p align = "center">
<table> <table>
...@@ -198,29 +200,6 @@ cd data ...@@ -198,29 +200,6 @@ cd data
- `XXX.src` is the source French file and `XXX.trg` is the target English file; each line of a file holds one sentence.
- `XXX.src` and `XXX.trg` have the same number of lines, and the sentences on line $i$ of the two files correspond to each other one-to-one.
### 用户自定义数据集(可选)
如果您想使用自己的数据集,只需按照如下方式组织,并将它们放在`data`目录下:
```text
user_dataset
├── train
│   ├── train_file1.src
│   ├── train_file1.trg
│   └── ...
├── test
│   ├── test_file1.src
│   ├── test_file1.trg
│   └── ...
├── gen
│   ├── gen_file1.src
│   ├── gen_file1.trg
│   └── ...
```
- 一级目录`user_dataset`:用户自定义的数据集名字。
- 二级目录`train``test``gen`:必须使用这三个文件夹名字。
- 三级目录:存放源语言到目标语言的平行语料库文件,后缀名必须使用`.src``.trg`
### 数据预处理 ### 数据预处理
我们的预处理流程包括两步: 我们的预处理流程包括两步:
...@@ -229,243 +208,99 @@ user_dataset ...@@ -229,243 +208,99 @@ user_dataset
- Line $i$ of `XXX` is the concatenation of line $i$ of `XXX.src` and line $i$ of `XXX.trg`, separated by '\t'.
- Create the "source dictionary" and the "target dictionary" for the training data. Each dictionary has **DICTSIZE** words: the (DICTSIZE - 3) most frequent words in the corpus plus 3 special symbols, `<s>` (start of sequence), `<e>` (end of sequence), and `<unk>` (out-of-vocabulary word).
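The dictionary construction can be sketched as below. This is a minimal illustration rather than the project's preprocessing script: the function names `build_dict` and `to_ids` are hypothetical, and placing the three special symbols at indices 0 to 2 and breaking frequency ties alphabetically are assumptions made for determinism.

```python
from collections import Counter

SPECIAL_TOKENS = ["<s>", "<e>", "<unk>"]  # start, end, out-of-vocabulary


def build_dict(sentences, dict_size):
    """Map the 3 special symbols plus the (dict_size - 3) most frequent words to IDs."""
    counts = Counter(word for s in sentences for word in s.split())
    # Sort by descending frequency, then alphabetically (tie-break is an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [word for word, _ in ranked[:dict_size - len(SPECIAL_TOKENS)]]
    return dict((word, idx) for idx, word in enumerate(SPECIAL_TOKENS + kept))


def to_ids(sentence, dictionary):
    """Convert a sentence to IDs; unknown words map to <unk>."""
    unk = dictionary["<unk>"]
    return [dictionary.get(word, unk) for word in sentence.split()]


corpus = ["le chat dort", "le chien dort", "le chat mange"]
src_dict = build_dict(corpus, dict_size=5)
ids = to_ids("le chien mange", src_dict)
```

With `dict_size=5`, only the two most frequent words are kept next to the special symbols, so `chien` and `mange` both fall back to the `<unk>` ID.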
预处理可以使用`preprocess.py` ### 示例数据
```python
python preprocess.py -i INPUT [-d DICTSIZE] [-m]
```
- `-i INPUT`:输入的原始数据集路径。
- `-d DICTSIZE`:指定的字典单词数,如果没有设置,字典会包含输入数据集中的所有单词。
- `-m --mergeDict`:合并“源字典”和“目标字典”,即这两个字典的内容完全一样。
本教程的具体命令如下: 因为完整的数据集数据量较大,为了验证训练流程,PaddlePaddle接口paddle.dataset.wmt14中默认提供了一个经过预处理的[较小规模的数据集](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz)
```python
python preprocess.py -i data/wmt14 -d 30000
```
请耐心等待几分钟的时间,您会在屏幕上看到:
```text
concat parallel corpora for dataset
build source dictionary for train data
build target dictionary for train data
dictionary size is 30000
```
预处理好的数据集存放在`data/pre-wmt14`目录下:
```text
pre-wmt14
├── train
│   └── train
├── test
│   └── test
├── gen
│   └── gen
├── train.list
├── test.list
├── gen.list
├── src.dict
└── trg.dict
```
- `train``test``gen`:分别包含了法英平行语料库的训练、测试和生成数据。其每个文件的每一行以“\t”分为两列,第一列是法语序列,第二列是对应的英语序列。
- `train.list``test.list``gen.list`:分别记录了`train``test``gen`文件夹中的文件路径。
- `src.dict``trg.dict`:源(法语)和目标(英语)字典。每个字典都含有30000个单词,包括29997个最高频单词和3个特殊符号。
### 提供数据给PaddlePaddle 该数据集有193319条训练数据,6003条测试数据,词典长度为30000。因为数据规模限制,使用该数据集训练出来的模型效果无法保证。
我们通过`dataprovider.py`将数据提供给PaddlePaddle。具体步骤如下: ## 训练流程说明
1. 首先,引入PaddlePaddle的PyDataProvider2包,并定义三个特殊符号。 ### paddle初始化
```python ```python
from paddle.trainer.PyDataProvider2 import * # 加载 paddle的python包
UNK_IDX = 2 #未登录词 import paddle.v2 as paddle
START = "<s>" #序列的开始
END = "<e>" #序列的结束
```
2. 其次,使用初始化函数`hook`,分别定义了训练模式和生成模式下的数据输入格式(`input_types`)。
- 训练模式:有三个输入序列,其中“源语言序列”和“目标语言序列”作为输入数据,“目标语言的下一个词序列”作为标签数据。
- 生成模式:有两个输入序列,其中“源语言序列”作为输入数据,“源语言序列编号”作为输入数据的编号(该输入非必须,可以省略)。
`hook`函数中的`src_dict_path`是源语言字典路径,`trg_dict_path`是目标语言字典路径,`is_generating`(训练或生成模式)是从模型配置中传入的对象。`hook`函数的具体调用方式请见[模型配置说明](#模型配置说明)
```python
def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list,
         **kwargs):
    # job_mode = 1: training mode; 0: generating mode
    settings.job_mode = not is_generating

    def fun(dict_path):  # load a dictionary from the given path
        out_dict = dict()
        with open(dict_path, "r") as fin:
            out_dict = {
                line.strip(): line_count
                for line_count, line in enumerate(fin)
            }
        return out_dict

    settings.src_dict = fun(src_dict_path)
    settings.trg_dict = fun(trg_dict_path)

    if settings.job_mode:  # training mode
        settings.input_types = {
            'source_language_word':  # source language sequence
            integer_value_sequence(len(settings.src_dict)),
            'target_language_word':  # target language sequence
            integer_value_sequence(len(settings.trg_dict)),
            'target_language_next_word':  # next-word sequence of the target language
            integer_value_sequence(len(settings.trg_dict))
        }
    else:  # generating mode
        settings.input_types = {
            'source_language_word':  # source language sequence
            integer_value_sequence(len(settings.src_dict)),
            'sent_id':  # index of the source language sequence
            integer_value_sequence(len(open(file_list[0], "r").readlines()))
        }
```
3. 最后,使用`process`函数打开文本文件`file_name`,读取每一行,将行中的数据转换成与`input_types`一致的格式,再用`yield`关键字返回给PaddlePaddle进程。具体来说,
- 在源语言序列的每句话前面补上开始符号`<s>`、末尾补上结束符号`<e>`,得到“source_language_word”;
- 在目标语言序列的每句话前面补上`<s>`,得到“target_language_word”;
- 在目标语言序列的每句话末尾补上`<e>`,作为目标语言的下一个词序列(“target_language_next_word”)。
```python
def _get_ids(s, dictionary):  # look up the dictionary index of every word in the source sequence
    words = s.strip().split()
    return [dictionary[START]] + \
           [dictionary.get(w, UNK_IDX) for w in words] + \
           [dictionary[END]]


@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
    with open(file_name, 'r') as f:
        for line_count, line in enumerate(f):
            line_split = line.strip().split('\t')
            if settings.job_mode and len(line_split) != 2:
                continue
            src_seq = line_split[0]
            src_ids = _get_ids(src_seq, settings.src_dict)

            if settings.job_mode:
                trg_seq = line_split[1]
                trg_words = trg_seq.split()
                trg_ids = [settings.trg_dict.get(w, UNK_IDX) for w in trg_words]

                # In training mode, drop the sample if either sequence is longer
                # than 80 words, to keep the RNN from becoming too deep.
                if len(src_ids) > 80 or len(trg_ids) > 80:
                    continue
                trg_ids_next = trg_ids + [settings.trg_dict[END]]
                trg_ids = [settings.trg_dict[START]] + trg_ids
                yield {
                    'source_language_word': src_ids,
                    'target_language_word': trg_ids,
                    'target_language_next_word': trg_ids_next
                }
            else:
                yield {'source_language_word': src_ids, 'sent_id': [line_count]}
```
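To make the one-position shift between the three training sequences concrete, the sketch below builds them for a single sentence pair. It is a standalone illustration rather than part of `dataprovider.py`: the helper `make_training_sample` and the two toy dictionaries are hypothetical, and only the token conventions (`<s>`, `<e>`, `UNK_IDX = 2`) are taken from the code above.

```python
START, END, UNK_IDX = "<s>", "<e>", 2


def make_training_sample(src_line, trg_line, src_dict, trg_dict):
    """Build the three id sequences used in training mode for one sentence pair."""
    # Source: <s> w1 ... wn <e>
    src_ids = ([src_dict[START]] +
               [src_dict.get(w, UNK_IDX) for w in src_line.split()] +
               [src_dict[END]])
    trg_core = [trg_dict.get(w, UNK_IDX) for w in trg_line.split()]
    return {
        "source_language_word": src_ids,
        # Decoder input: <s> u1 ... um
        "target_language_word": [trg_dict[START]] + trg_core,
        # Decoder label, shifted by one position: u1 ... um <e>
        "target_language_next_word": trg_core + [trg_dict[END]],
    }


# Toy dictionaries (hypothetical); "monde" is deliberately missing from the source one.
toy_src_dict = {"<s>": 0, "<e>": 1, "<unk>": 2, "bonjour": 3}
toy_trg_dict = {"<s>": 0, "<e>": 1, "<unk>": 2, "hello": 3, "world": 4}
sample = make_training_sample("bonjour monde", "hello world",
                              toy_src_dict, toy_trg_dict)
print(sample)
```

At every position the label sequence holds the word that the decoder should predict after reading the input sequence up to that position.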
注意:由于本示例中的训练数据有3.55G,对于内存较小的机器,不能一次性加载进内存,所以推荐使用`pool_size`变量来设置内存中暂存的数据条数。
## 模型配置说明 # 配置只使用cpu,并且使用一个cpu进行训练
paddle.init(use_gpu=False, trainer_count=1)
```
### 数据定义 ### 数据定义
1. 首先,定义数据集路径和源/目标语言字典路径,并用`is_generating`变量定义当前配置是训练模式(默认)还是生成模式。该变量接受从命令行传入的参数,使用方法见[应用命令与结果](#应用命令与结果) 首先要定义词典大小,数据生成和网络配置都需要用到。然后获取wmt14的dataset reader。
```python
import os
from paddle.trainer_config_helpers import *
data_dir = "./data/pre-wmt14" # 数据集路径
src_lang_dict = os.path.join(data_dir, 'src.dict') # 源语言字典路径
trg_lang_dict = os.path.join(data_dir, 'trg.dict') # 目标语言字典路径
is_generating = get_config_arg("is_generating", bool, False) # 配置模式
```
2. 其次,通过`define_py_data_sources2`函数从`dataprovider.py`中读取数据,并用`args`变量传入源/目标语言的字典路径以及配置模式。
```python
if not is_generating:
train_list = os.path.join(data_dir, 'train.list')
test_list = os.path.join(data_dir, 'test.list')
else:
train_list = None
test_list = os.path.join(data_dir, 'gen.list')
define_py_data_sources2(
train_list,
test_list,
module="dataprovider",
obj="process",
args={
"src_dict_path": src_lang_dict, # 源语言字典路径
"trg_dict_path": trg_lang_dict, # 目标语言字典路径
"is_generating": is_generating # 配置模式
})
```
### 算法配置
```python ```python
settings( # source and target dict dim.
learning_method = AdamOptimizer(), dict_size = 30000
batch_size = 50,
learning_rate = 5e-4) feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
``` ```
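The reader pipeline above wraps a sample generator twice: once for shuffling inside a buffer and once for grouping into mini-batches. The pure-Python sketch below mimics that composition so the data flow is easier to follow. The names `shuffle_reader`, `batch_reader`, and `toy_reader` are hypothetical stand-ins, not PaddlePaddle functions, and details such as whether the last incomplete batch is kept are assumptions that may differ from the real `paddle.batch`.

```python
import random


def shuffle_reader(reader, buf_size, seed=None):
    """Wrap a reader so that samples are shuffled within a buffer of buf_size."""
    rng = random.Random(seed)

    def new_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item

    return new_reader


def batch_reader(reader, batch_size):
    """Wrap a reader so that it yields lists of batch_size samples."""
    def new_reader():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # assumption: the final, smaller batch is kept
            yield batch

    return new_reader


def toy_reader():
    # Each sample is (source ids, target ids, next-word ids), i.e. the
    # positions 0, 1, 2 referenced by the `feeding` dictionary.
    for i in range(12):
        yield ([i], [i, i], [i, i])


batches = list(
    batch_reader(shuffle_reader(toy_reader, buf_size=8, seed=0), batch_size=5)())
print([len(b) for b in batches])
```

The `feeding` dictionary then tells the trainer which position of each sample tuple feeds which data layer.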
本教程使用默认的SGD随机梯度下降算法和Adam学习方法,并指定学习率为5e-4。注意:生成模式下的`batch_size = 50`,表示同时生成50条序列。
### 模型结构 ### 模型结构
1. 首先,定义了一些全局变量。 1. 首先,定义了一些全局变量。
```python ```python
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # 源语言字典维度 source_dict_dim = dict_size # 源语言字典维度
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # 目标语言字典维度 target_dict_dim = dict_size # 目标语言字典维度
word_vector_dim = 512 # 词向量维度 word_vector_dim = 512 # 词向量维度
encoder_size = 512 # 编码器中的GRU隐层大小 encoder_size = 512 # 编码器中的GRU隐层大小
decoder_size = 512 # 解码器中的GRU隐层大小 decoder_size = 512 # 解码器中的GRU隐层大小
if is_generating:
beam_size=3 # 柱搜索算法中的宽度
max_length=250 # 生成句子的最大长度
gen_trans_file = get_config_arg("gen_trans_file", str, None) # 生成后的文件
``` ```
2. 其次实现编码器框架分为三步 2. 其次实现编码器框架分为三步
2.1 传入已经在`dataprovider.py`转换成one-hot vector表示的源语言序列$\mathbf{w}$。 2.1 将在dataset reader中生成的用每个单词在字典中的索引表示的源语言序列
转换成one-hot vector表示的源语言序列$\mathbf{w}$,其类型为integer_value_sequence
```python ```python
src_word_id = data_layer(name='source_language_word', size=source_dict_dim) src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
2.2 将上述编码映射到低维语言空间的词向量$\mathbf{s}$。 2.2 将上述编码映射到低维语言空间的词向量$\mathbf{s}$。
```python ```python
src_embedding = embedding_layer( src_embedding = paddle.layer.embedding(
input=src_word_id, input=src_word_id,
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
2.3 用双向GRU编码源语言序列,拼接两个GRU的编码结果得到$\mathbf{h}$。 2.3 用双向GRU编码源语言序列,拼接两个GRU的编码结果得到$\mathbf{h}$。
```python ```python
src_forward = simple_gru(input=src_embedding, size=encoder_size) src_forward = paddle.networks.simple_gru(
src_backward = simple_gru( input=src_embedding, size=encoder_size)
input=src_embedding, size=encoder_size, reverse=True) src_backward = paddle.networks.simple_gru(
encoded_vector = concat_layer(input=[src_forward, src_backward]) input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
3. Next, define the attention-based decoder. This takes three steps:
3.1 Pass the encoded source sequence (see 2.3) through a feed-forward neural network to obtain its projection.
```python ```python
with mixed_layer(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += full_matrix_projection(input=encoded_vector) encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
``` ```
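3.2 Construct the initial state of the decoder RNN. The decoder predicts the target sequence step by step but has no initial value at time 0, so it needs to be initialized. Here, the last state of the backward (reverse-order) encoding of the source sequence, which is the first element of `src_backward`, is passed through a non-linear mapping and used as the initial value, i.e., $c_0=h_T$.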
```python ```python
backward_first = first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
with mixed_layer( with paddle.layer.mixed(
size=decoder_size, size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
act=TanhActivation(), ) as decoder_boot: decoder_boot += paddle.layer.full_matrix_projection(
decoder_boot += full_matrix_projection(input=backward_first) input=backward_first)
``` ```
3.3 Define the behavior of the RNN at each time step of decoding: given the source-language context vector $c_i$ at the current step, the decoder hidden state $z_i$, and the $i$-th target word $u_i$, predict the probability $p_{i+1}$ of the $(i+1)$-th word.
...@@ -473,43 +308,47 @@ settings( ...@@ -473,43 +308,47 @@ settings(
- context calls the `simple_attention` function to implement $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$, enc_proj is the projection of $h_j$ (see 3.1), and the computation of the weights $a_{ij}$ is encapsulated in `simple_attention`.
- decoder_inputs fuses the representations of $c_i$ and the current target word current_word (i.e., $u_i$).
- gru_step calls the `gru_step_layer` function to apply the activation on decoder_inputs and decoder_mem, implementing $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$.
- Finally, softmax normalization computes the word probabilities and out is returned, implementing $p\left ( u_i|u_{<i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$.
```python ```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word): def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
decoder_mem = memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = simple_attention(
encoded_sequence=enc_vec, context = paddle.networks.simple_attention(
encoded_proj=enc_proj, encoded_sequence=enc_vec,
decoder_state=decoder_mem, ) encoded_proj=enc_proj,
decoder_state=decoder_mem)
with mixed_layer(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=context) with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=current_word) decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
gru_step = gru_step_layer( input=current_word)
name='gru_decoder',
input=decoder_inputs, gru_step = paddle.layer.gru_step(
output_mem=decoder_mem, name='gru_decoder',
size=decoder_size) input=decoder_inputs,
output_mem=decoder_mem,
with mixed_layer( size=decoder_size)
size=target_dict_dim, bias_attr=True,
act=SoftmaxActivation()) as out: with paddle.layer.mixed(
out += full_matrix_projection(input=gru_step) size=target_dict_dim,
return out bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
``` ```
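The context vector $c_i=\sum_{j=1}^{T}a_{ij}h_j$ used inside the decoder step can be illustrated without any framework. The sketch below is a plain-Python approximation under stated assumptions: the function names are hypothetical, and the additive scoring form $e_{ij}=v^\top\tanh(\text{proj}(z_i)+\text{proj}(h_j))$ follows the general idea of Bahdanau-style attention rather than the exact internals of `simple_attention`.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_scores(enc_proj, dec_state_proj, v):
    """Score each source position j against the projected decoder state."""
    return [
        sum(vk * math.tanh(d + p) for vk, d, p in zip(v, dec_state_proj, proj_j))
        for proj_j in enc_proj
    ]


def attention_context(enc_vec, scores):
    """Return the weights a_ij and the context c_i = sum_j a_ij * h_j."""
    weights = softmax(scores)
    dim = len(enc_vec[0])
    context = [sum(w * h[k] for w, h in zip(weights, enc_vec)) for k in range(dim)]
    return weights, context


# Toy values (hypothetical): two source positions, 2-dimensional encodings.
enc_vec = [[1.0, 0.0], [0.0, 1.0]]   # h_1, h_2
enc_proj = [[0.0, 0.0], [0.0, 0.0]]  # projections of h_1, h_2 (see 3.1)
dec_state_proj = [0.0, 0.0]          # projected decoder state z_i
v = [1.0, 1.0]

scores = additive_scores(enc_proj, dec_state_proj, v)
weights, context = attention_context(enc_vec, scores)
print(weights, context)
```

With identical scores the weights are uniform, so the context is simply the average of the encoder states; as the scores diverge, the context moves toward the highest-scoring source position.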
4. Differences between calling the decoder in training mode and in generating mode.
4.1 Define the name of the decoder group and the first two inputs of the `gru_decoder_with_attention` function. Note: both inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.
```python ```python
decoder_group_name = "decoder_group" decoder_group_name = "decoder_group"
group_input1 = StaticInput(input=encoded_vector, is_seq=True) group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = StaticInput(input=encoded_proj, is_seq=True) group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2] group_inputs = [group_input1, group_input2]
``` ```
4.2 训练模式下的解码器调用: 4.2 训练模式下的解码器调用:
...@@ -519,99 +358,85 @@ settings( ...@@ -519,99 +358,85 @@ settings(
- 最后,用多类交叉熵损失函数`classification_cost`来计算损失值。 - 最后,用多类交叉熵损失函数`classification_cost`来计算损失值。
```python ```python
if not is_generating: trg_embedding = paddle.layer.embedding(
trg_embedding = embedding_layer( input=paddle.layer.data(
input=data_layer( name='target_language_word',
name='target_language_word', size=target_dict_dim), type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_target_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding) group_inputs.append(trg_embedding)
decoder = recurrent_group( # For decoder equipped with attention mechanism, in training,
name=decoder_group_name, # target embeding (the groudtruth) is the data input,
step=gru_decoder_with_attention, # while encoded source sequence is accessed to as an unbounded memory.
input=group_inputs) # Here, the StaticInput defines a read-only memory
# for the recurrent_group.
lbl = data_layer(name='target_language_next_word', size=target_dict_dim) decoder = paddle.layer.recurrent_group(
cost = classification_cost(input=decoder, label=lbl) name=decoder_group_name,
outputs(cost) step=gru_decoder_with_attention,
``` input=group_inputs)
4.3 生成模式下的解码器调用:
lbl = paddle.layer.data(
- 首先,在序列生成任务中,由于解码阶段的RNN总是引用上一时刻生成出的词的词向量,作为当前时刻的输入,因此,使用`GeneratedInput`来自动完成这一过程。具体说明可见[GeneratedInput文档](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入)。 name='target_language_next_word',
- 其次,使用`beam_search`函数循环调用`gru_decoder_with_attention`函数,生成出序列id。 type=paddle.data_type.integer_value_sequence(target_dict_dim))
- 最后,使用`seqtext_printer_evaluator`函数,根据目标字典`trg_lang_dict`,打印出完整的句子保存在`gen_trans_file`中。 cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```python
else:
trg_embedding = GeneratedInput(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
beam_gen = beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
seqtext_printer_evaluator(
input=beam_gen,
id_input=data_layer(
name="sent_id", size=1),
dict_file=trg_lang_dict,
result_file=gen_trans_file)
outputs(beam_gen)
``` ```
Note: the configuration provided here makes some simplifications relative to the paper by Bahdanau et al. \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133).
### 参数定义
## 训练模型 首先依据模型配置的`cost`定义模型参数。
可以通过以下命令来训练模型: ```python
# create parameters
parameters = paddle.parameters.create(cost)
```
```bash 可以打印参数名字,如果在网络配置中没有指定名字,则默认生成。
./train.sh
```python
for param in parameters.keys():
print param
``` ```
其中`train.sh` 的内容为:
```bash ### 训练模型
paddle train \ 1. 构造trainer
--config='seqToseq_net.py' \
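--save_dir='model' \ Construct the trainer from the optimization objective `cost`, the network topology, and the model parameters. An optimization method must also be specified at construction time; here the `paddle.trainer.SGD` trainer is given the Adam optimizer (`paddle.optimizer.Adam`) as its update equation.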
--use_gpu=false \
--num_passes=16 \ ```python
--show_parameter_stats_period=100 \ optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
--trainer_count=4 \ trainer = paddle.trainer.SGD(cost=cost,
--log_period=10 \ parameters=parameters,
--dot_period=5 \ update_equation=optimizer)
2>&1 | tee 'train.log'
``` ```
- config: 设置神经网络的配置文件。
- save_dir: 设置保存模型的输出路径。 2. 构造event_handler
- use_gpu: 是否使用GPU训练,这里使用CPU。
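- num_passes: the number of passes. In PaddlePaddle, one pass is one complete training run over all samples in the dataset. A custom callback function can be used to monitor various states during training, such as the error rate. The code below uses `event.batch_id % 10 == 0` to print a log entry every 10 batches, including the cost and other information.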
- show_parameter_stats_period: 这里每隔100个batch显示一次参数统计信息。 ```python
- trainer_count: 设置CPU线程数或者GPU设备数。 def event_handler(event):
- log_period: 这里每隔10个batch打印一次日志。 if isinstance(event, paddle.event.EndIteration):
- dot_period: here a dot "." is printed every 5 batches.
print "Pass %d, Batch %d, Cost %f, %s" % (
训练的损失函数每隔10个batch打印一次,您将会看到如下消息: event.pass_id, event.batch_id, event.cost, event.metrics)
```text ```
I0719 19:16:45.952062 15563 TrainerInternal.cpp:160] Batch=10 samples=500 AvgCost=198.475 CurrentCost=198.475 Eval: classification_error_evaluator=0.737155 CurrentEval: classification_error_evaluator=0.737155 3. 启动训练:
I0719 19:17:56.707319 15563 TrainerInternal.cpp:160] Batch=20 samples=1000 AvgCost=157.479 CurrentCost=116.483 Eval: classification_error_evaluator=0.698392 CurrentEval: classification_error_evaluator=0.659065
..... ```python
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=10000,
feeding=feeding)
```
训练开始后,可以观察到event_handler输出的日志如下:
```text
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
...
``` ```
- AvgCost:从第0个batch到当前batch的平均损失值。
- CurrentCost:当前batch的损失值。
- classification\_error\_evaluator(Eval):从第0个评估到当前评估中,每个单词的预测错误率。
- classification\_error\_evaluator(CurrentEval):当前评估中,每个单词的预测错误率。
当classification\_error\_evaluator的值低于0.35时,模型就训练成功了。
## 应用模型 ## 应用模型
...@@ -625,30 +450,7 @@ cd pretrained ...@@ -625,30 +450,7 @@ cd pretrained
### 应用命令与结果 ### 应用命令与结果
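The following command performs French-to-English translation: The new API does not yet support the translation (generation) step of machine translation; stay tuned.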
```bash
./gen.sh
```
其中`gen.sh` 的内容为:
```bash
paddle train \
--job=test \
--config='seqToseq_net.py' \
--save_dir='pretrained/wmt14_model' \
--use_gpu=true \
--num_passes=13 \
--test_pass=12 \
--trainer_count=1 \
--config_args=is_generating=1,gen_trans_file="gen_result" \
2>&1 | tee 'translation/gen.log'
```
与训练命令不同的参数如下:
- job:设置任务的模式为测试。
- save_dir:设置存放预训练模型的路径。
- num_passes and test_pass: load the model parameters of pass $i \in \left [ test\_pass,num\_passes-1 \right ]$; here only `data/wmt14_model/pass-00012` is loaded.
- config_args:将命令行中的自定义参数传递给模型配置。`is_generating=1`表示当前为生成模式,`gen_trans_file="gen_result"`表示生成结果的存储文件。
翻译结果请见[效果展示](#效果展示) 翻译结果请见[效果展示](#效果展示)
......
import paddle.v2 as paddle


def seqToseq_net(source_dict_dim, target_dict_dim):
    ### Network Architecture
    word_vector_dim = 512  # dimension of word vector
    decoder_size = 512  # dimension of hidden unit in GRU Decoder network
    encoder_size = 512  # dimension of hidden unit in GRU Encoder network

    #### Encoder
    src_word_id = paddle.layer.data(
        name='source_language_word',
        type=paddle.data_type.integer_value_sequence(source_dict_dim))
    src_embedding = paddle.layer.embedding(
        input=src_word_id,
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
    src_forward = paddle.networks.simple_gru(
        input=src_embedding, size=encoder_size)
    src_backward = paddle.networks.simple_gru(
        input=src_embedding, size=encoder_size, reverse=True)
    encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])

    #### Decoder
    with paddle.layer.mixed(size=decoder_size) as encoded_proj:
        encoded_proj += paddle.layer.full_matrix_projection(
            input=encoded_vector)

    backward_first = paddle.layer.first_seq(input=src_backward)

    with paddle.layer.mixed(
            size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
        decoder_boot += paddle.layer.full_matrix_projection(
            input=backward_first)

    def gru_decoder_with_attention(enc_vec, enc_proj, current_word):

        decoder_mem = paddle.layer.memory(
            name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)

        context = paddle.networks.simple_attention(
            encoded_sequence=enc_vec,
            encoded_proj=enc_proj,
            decoder_state=decoder_mem)

        with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
            decoder_inputs += paddle.layer.full_matrix_projection(input=context)
            decoder_inputs += paddle.layer.full_matrix_projection(
                input=current_word)

        gru_step = paddle.layer.gru_step(
            name='gru_decoder',
            input=decoder_inputs,
            output_mem=decoder_mem,
            size=decoder_size)

        with paddle.layer.mixed(
                size=target_dict_dim,
                bias_attr=True,
                act=paddle.activation.Softmax()) as out:
            out += paddle.layer.full_matrix_projection(input=gru_step)
        return out

    decoder_group_name = "decoder_group"
    group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
    group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
    group_inputs = [group_input1, group_input2]

    trg_embedding = paddle.layer.embedding(
        input=paddle.layer.data(
            name='target_language_word',
            type=paddle.data_type.integer_value_sequence(target_dict_dim)),
        size=word_vector_dim,
        param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
    group_inputs.append(trg_embedding)

    # For a decoder equipped with the attention mechanism, in training,
    # the target embedding (the ground truth) is the data input,
    # while the encoded source sequence is accessed as an unbounded memory.
    # Here, the StaticInput defines a read-only memory
    # for the recurrent_group.
    decoder = paddle.layer.recurrent_group(
        name=decoder_group_name,
        step=gru_decoder_with_attention,
        input=group_inputs)

    lbl = paddle.layer.data(
        name='target_language_next_word',
        type=paddle.data_type.integer_value_sequence(target_dict_dim))
    cost = paddle.layer.classification_cost(input=decoder, label=lbl)

    return cost


def main():
    paddle.init(use_gpu=False, trainer_count=1)

    # source and target dict dim.
    dict_size = 30000
    source_dict_dim = target_dict_dim = dict_size

    # define network topology
    cost = seqToseq_net(source_dict_dim, target_dict_dim)
    parameters = paddle.parameters.create(cost)

    # define optimize method and trainer
    optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
    trainer = paddle.trainer.SGD(cost=cost,
                                 parameters=parameters,
                                 update_equation=optimizer)

    # define data reader
    feeding = {
        'source_language_word': 0,
        'target_language_word': 1,
        'target_language_next_word': 2
    }

    wmt14_reader = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
        batch_size=5)

    # define event_handler callback
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 10 == 0:
                print "Pass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics)

    # start to train
    trainer.train(
        reader=wmt14_reader,
        event_handler=event_handler,
        num_passes=10000,
        feeding=feeding)


if __name__ == '__main__':
    main()
...@@ -32,17 +32,17 @@ rm dev+test.tgz ...@@ -32,17 +32,17 @@ rm dev+test.tgz
# separate the dev and test dataset
mkdir test gen
mv dev/ntst1213.* test
mv dev/ntst14.* gen
rm -rf dev
set +x
# rename the suffix, .fr->.src, .en->.trg
for dir in train test gen
do
  filelist=`ls $dir`
  cd $dir
  for file in $filelist
  do
    if [ ${file##*.} = "fr" ]; then
      mv $file ${file/%fr/src}
    elif [ ${file##*.} = 'en' ]; then
......
...@@ -31,7 +31,7 @@ else ...@@ -31,7 +31,7 @@ else
print $3;
read_pos += (2 + res_num);
}}' res_num=$beam_size $gen_file >$top1
fi
# evaluate bleu value
bleu_script=multi-bleu.perl
......
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -50,19 +51,19 @@ Machine translation (MT) leverages computers to translate from one language to a ...@@ -50,19 +51,19 @@ Machine translation (MT) leverages computers to translate from one language to a
Early machine translation systems were mainly rule-based, i.e., they relied on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one language, so it is even more of a challenge for language experts to specify all possible rules in two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty of obtaining a complete rule set \[[1](#References)\].
To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example:
1. human designed features cannot cover all possible linguistic variations;
2. it is difficult to use global features;
3. the techniques rely heavily on pre-processing steps such as word alignment, word segmentation and tokenization, rule extraction, and syntactic parsing. The error introduced in any of these steps can accumulate and impact translation quality.
The recent development of deep learning provides new solutions to these challenges. The two main categories of deep learning based machine translation techniques are:
1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
2. techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT).
...@@ -98,7 +99,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent ...@@ -98,7 +99,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent
We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter.
Compared to a simple RNN, the LSTM adds a memory cell, an input gate, a forget gate, and an output gate. These gates combined with the memory cell greatly improve the ability to handle long-term dependencies.
GRU\[[2](#References)\], proposed by Cho et al., is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
A GRU unit has only two gates:
- reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output.
- update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1.
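The effect of the two gates can be seen in a scalar GRU step. The sketch below is a minimal illustration under stated assumptions: the hidden state is one-dimensional, the parameter names (`w_z`, `u_z`, and so on) are hypothetical, and the update rule $h_t = z\,h_{t-1} + (1-z)\,\tilde{h}_t$ follows the formulation of Cho et al., which matches the statement above that history is passed over when the update gate is close to 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds the (hypothetical) weights and biases."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate decides how much history enters it.
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # Update gate close to 1 keeps the old state, close to 0 takes the candidate.
    return z * h_prev + (1.0 - z) * h_cand


base = {"w_z": 0.0, "u_z": 0.0, "b_z": 0.0,
        "w_r": 0.0, "u_r": 0.0, "b_r": 0.0,
        "w_h": 1.0, "u_h": 0.0, "b_h": 0.0}

keep_history = dict(base, b_z=50.0)   # update gate saturated near 1
drop_history = dict(base, b_z=-50.0)  # update gate saturated near 0

h_keep = gru_step(x=0.5, h_prev=0.7, p=keep_history)
h_drop = gru_step(x=0.5, h_prev=0.7, p=drop_history)
print(h_keep, h_drop)
```

With the update gate saturated near 1 the previous state 0.7 is carried through unchanged, while near 0 the new state is driven by the current input alone.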
...@@ -137,20 +138,20 @@ There are three steps for encoding a sentence: ...@@ -137,20 +138,20 @@ There are three steps for encoding a sentence:
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i \in R^{\left | V \right |},i=1,2,...,T$, where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with the one-hot vector representation:
  * the dimensionality of the vector is typically large, leading to the curse of dimensionality;
  * it is hard to capture the relationships between words, i.e., semantic similarities.
  Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, where $C \in R^{K\times \left | V \right |}$ is the projection matrix and $K$ is the dimensionality of the word embedding vector.
3. Encoding of the source sequence via RNN: This can be described mathematically as:
$$h_i=\phi _\theta \left ( h_{i-1}, s_i \right )$$
where $h_0$ is a zero vector, $\phi _\theta$ is a non-linear activation function, and $\mathbf{h}=\left \{ h_1,..., h_T \right \}$ is the sequential encoding of the first $T$ words from the source sequence. The vector representation of the whole sentence can be represented as the encoding vector at the last time step $T$ from $\mathbf{h}$, or by temporal pooling over $\mathbf{h}$.
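Step 2 above can be checked with a tiny example: multiplying the projection matrix $C$ by a one-hot vector $w_i$ simply selects one column of $C$, which is why an embedding layer can be implemented as a table lookup. The values and helper names below are made up for illustration.

```python
def one_hot(index, vocab_size):
    """Return the one-hot vector w_i for a word at the given dictionary index."""
    return [1.0 if k == index else 0.0 for k in range(vocab_size)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Toy projection matrix C with K = 2 rows and |V| = 4 columns (hypothetical values).
C = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]

w = one_hot(2, vocab_size=4)    # the word at dictionary index 2
s = matvec(C, w)                # s_i = C w_i
column = [row[2] for row in C]  # direct lookup of column 2
print(s, column)
```

Both computations give the same dense vector, so in practice the multiplication is replaced by the cheaper lookup.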
...@@ -183,8 +184,8 @@ The generation process of machine translation is to translate the source sentenc ...@@ -183,8 +184,8 @@ The generation process of machine translation is to translate the source sentenc
### Attention Mechanism
There are a few problems with the fixed dimensional vector representation from the encoding stage:
* It is very challenging to encode both the semantic and syntactic information of a sentence with a fixed dimensional vector regardless of the length of the sentence.
* Intuitively, when translating a sentence, we typically pay more attention to the parts in the source sentence more relevant to the current translation. Moreover, the focus changes along the process of the translation. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention. This is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced the attention mechanism, which can decode based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. The decoder with attention is explained in the following.
Different from the simple decoder, $z_i$ is computed as:
...@@ -213,7 +214,7 @@ Figure 6. Decoder with Attention Mechanism ...@@ -213,7 +214,7 @@ Figure 6. Decoder with Attention Mechanism
[Beam Search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is typically used when the solution space is huge (e.g., for machine translation or speech recognition) and there is not enough memory for all the possible solutions. For example, if we want to translate “`<s>你好<e>`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, in which the word `hello` can appear a different number of times. Beam search can be used to find a good translation among them.
Beam search builds a search tree using breadth-first search and sorts the nodes according to a heuristic cost (the sum of the log probabilities of the generated words) at each level of the tree. Only a fixed number of nodes, given by the pre-specified beam size (or beam width), are kept. Thus, only the nodes with the highest scores are expanded at the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed.
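A minimal sketch of this search, assuming a hypothetical `step_log_probs(prefix)` callback that stands in for the decoder and returns the log probability of each candidate next word. The toy model over the three-word dictionary is invented for illustration and is not the tutorial's network.

```python
import math

def beam_search(step_log_probs, start, end, beam_size, max_length):
    """Return (score, words) of the best finished sequence that was found."""
    beams = [(0.0, [start])]  # (sum of log probabilities, words so far)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, prefix in beams:
            for word, logp in step_log_probs(prefix).items():
                candidates.append((score + logp, prefix + [word]))
        # Keep only the beam_size best partial sequences at this level.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    # Fall back to unfinished beams only if nothing reached the end marker.
    pool = finished or beams
    return max(pool, key=lambda c: c[0])

def toy_model(prefix):
    # After <s>, "hello" is likely; after a "hello", ending is likely.
    if prefix[-1] == "<s>":
        return {"hello": math.log(0.9), "<e>": math.log(0.1)}
    return {"hello": math.log(0.2), "<e>": math.log(0.8)}

score, words = beam_search(toy_model, "<s>", "<e>", beam_size=2, max_length=5)
```

With `beam_size=1` the same function degenerates to greedy decoding, which is one way to see why a larger beam can find sequences that a greedy search misses.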
The goal of decoding with beam search is to maximize the probability of the generated sequence. The procedure is as follows:
...@@ -493,7 +494,7 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn ...@@ -493,7 +494,7 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of the target language dictionary
word_vector_dim = 512 # dimensionality of word vector
encoder_size = 512 # dimensionality of the hidden state of encoder GRU
decoder_size = 512 # dimensionality of the hidden state of decoder GRU
if is_generating:
...@@ -764,6 +765,7 @@ End-to-end neural machine translation is a recently developed way to perform mac ...@@ -764,6 +765,7 @@ End-to-end neural machine translation is a recently developed way to perform mac
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -782,6 +784,6 @@ marked.setOptions({ ...@@ -782,6 +784,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -134,7 +135,7 @@ GRU\[[2](#参考文献)\]是Cho等人在LSTM上提出的简化版本,也是RNN ...@@ -134,7 +135,7 @@ GRU\[[2](#参考文献)\]是Cho等人在LSTM上提出的简化版本,也是RNN
During training for machine translation, the goal of the decoding stage is to maximize the probability of the next correct target-language word. The idea is:
1. At each time step, the next hidden state $z_{i+1}$ is computed from the encoding of the source sentence (also called the context vector) $c$, the $i$-th word $u_i$ of the ground-truth target sequence, and the RNN hidden state $z_i$ at time $i$:
$$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$
where $\phi _{\theta '}$ is a non-linear activation function and $c=q\mathbf{h}$ is the context vector of the source sentence. Without the [attention mechanism](#注意力机制), if the output of the [encoder](#编码器) is the last element of the encoded source sentence, $c$ can be defined as $c=h_T$. $u_i$ is the $i$-th word of the target sequence, and $u_0$ is the start marker `<s>` of the target sequence, indicating the start of decoding. $z_i$ is the hidden state of the decoder RNN at time $i$, and $z_0$ is an all-zero vector.
...@@ -191,17 +192,19 @@ e_{ij}&=align(z_i,h_j)\\\\ ...@@ -191,17 +192,19 @@ e_{ij}&=align(z_i,h_j)\\\\
Note: $z_{i+1}$ and $p_{i+1}$ are computed with the same formulas as in the [decoder](#解码器). Because every step of generation is greedy, a globally optimal solution is not guaranteed.
## Data Preparation ## Data Overview
### Download and Extraction
This tutorial uses [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) from the [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) dataset as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the test set and the generation set.
On Linux, simply run the following commands:
```bash ```bash
cd data cd data
./wmt14_data.sh ./wmt14_data.sh
``` ```
The resulting dataset `data/wmt14` contains the following three folders:
<p align = "center"> <p align = "center">
<table> <table>
...@@ -239,29 +242,6 @@ cd data ...@@ -239,29 +242,6 @@ cd data
- `XXX.src` is the French source file and `XXX.trg` is the English target file; each line of a file holds one sentence.
- `XXX.src` and `XXX.trg` have the same number of lines, and the sentences on line $i$ of the two files correspond one-to-one.
### User-Defined Dataset (Optional)
If you want to use your own dataset, organize it as follows and place it under the `data` directory:
```text
user_dataset
├── train
│   ├── train_file1.src
│   ├── train_file1.trg
│   └── ...
├── test
│   ├── test_file1.src
│   ├── test_file1.trg
│   └── ...
├── gen
│   ├── gen_file1.src
│   ├── gen_file1.trg
│   └── ...
```
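- First-level directory `user_dataset`: the name of the user-defined dataset.
- Second-level directories `train`, `test` and `gen`: these three folder names must be used.
- Third-level directories: hold the source-to-target parallel corpus files, whose suffixes must be `.src` and `.trg`.
### Data Preprocessing
Our preprocessing pipeline consists of two steps: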
...@@ -270,243 +250,99 @@ user_dataset ...@@ -270,243 +250,99 @@ user_dataset
- Line $i$ of `XXX` is line $i$ of `XXX.src` joined with line $i$ of `XXX.trg`, separated by '\t'.
- Create the "source dictionary" and the "target dictionary" of the training data. Each dictionary has **DICTSIZE** words: the (DICTSIZE - 3) most frequent words in the corpus, plus 3 special symbols `<s>` (start of sequence), `<e>` (end of sequence) and `<unk>` (out-of-vocabulary word).
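The two preprocessing steps can be sketched as follows. This is an illustrative sketch, not the contents of `preprocess.py`: the function names `merge_parallel` and `build_dict` are hypothetical, and placing `<s>`, `<e>` and `<unk>` at IDs 0, 1 and 2 is an assumption made for the example.

```python
from collections import Counter

START, END, UNK = "<s>", "<e>", "<unk>"

def merge_parallel(src_lines, trg_lines):
    """Join line i of the source file with line i of the target file by a tab."""
    assert len(src_lines) == len(trg_lines)
    return [s.strip() + "\t" + t.strip() for s, t in zip(src_lines, trg_lines)]

def build_dict(sentences, dict_size):
    """Map words to IDs: 3 special symbols plus the (dict_size - 3) most frequent words."""
    counts = Counter(w for line in sentences for w in line.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [START, END, UNK] + [w for w, _ in ranked[:dict_size - 3]]
    return {w: i for i, w in enumerate(words)}

src = ["le chat dort", "le chien dort"]
trg = ["the cat sleeps", "the dog sleeps"]
merged = merge_parallel(src, trg)
src_dict = build_dict(src, dict_size=5)

# Words outside the dictionary fall back to the <unk> ID.
ids = [src_dict.get(w, src_dict[UNK]) for w in "le chat dort".split()]
```

Here `chat` is not among the two most frequent words, so it is mapped to the `<unk>` ID while `le` and `dort` keep their own IDs.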
Preprocessing can be done with `preprocess.py`: ### Sample Data
```python
python preprocess.py -i INPUT [-d DICTSIZE] [-m]
```
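- `-i INPUT`: path of the input raw dataset.
- `-d DICTSIZE`: number of words in the dictionary; if not set, the dictionary contains all words in the input dataset.
- `-m --mergeDict`: merge the "source dictionary" and the "target dictionary", so that the two dictionaries have exactly the same content.
The specific command for this tutorial is: Because the full dataset is large, PaddlePaddle's paddle.dataset.wmt14 interface provides by default a preprocessed [smaller dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) for verifying the training process.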
```python
python preprocess.py -i data/wmt14 -d 30000
```
Please wait a few minutes; you will then see the following on the screen:
```text
concat parallel corpora for dataset
build source dictionary for train data
build target dictionary for train data
dictionary size is 30000
```
The preprocessed dataset is stored in the `data/pre-wmt14` directory:
```text
pre-wmt14
├── train
│   └── train
├── test
│   └── test
├── gen
│   └── gen
├── train.list
├── test.list
├── gen.list
├── src.dict
└── trg.dict
```
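- `train`, `test` and `gen`: contain the training, test and generation data of the French-English parallel corpus, respectively. Each line of each file is split by "\t" into two columns: the first is the French sequence and the second is the corresponding English sequence.
- `train.list`, `test.list` and `gen.list`: record the file paths in the `train`, `test` and `gen` folders, respectively.
- `src.dict` and `trg.dict`: the source (French) and target (English) dictionaries. Each dictionary contains 30000 words: the 29997 most frequent words and 3 special symbols.
### Providing Data to PaddlePaddle This dataset contains 193319 training samples and 6003 test samples, with a dictionary size of 30000. Because of its limited size, the quality of models trained on this dataset cannot be guaranteed.
We provide data to PaddlePaddle through `dataprovider.py`. The steps are as follows: ## Training Workflow
1. First, import PaddlePaddle's PyDataProvider2 package and define three special symbols. ### Initializing Paddle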
```python ```python
from paddle.trainer.PyDataProvider2 import * # load the paddle Python package
UNK_IDX = 2 # out-of-vocabulary word import paddle.v2 as paddle
START = "<s>" # start of sequence
END = "<e>" # end of sequence
```
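2. Next, the initialization function `hook` defines the input data format (`input_types`) for training mode and for generation mode.
- Training mode: there are three input sequences. The "source language sequence" and the "target language sequence" are the input data, and the "target language next-word sequence" is the label data.
- Generation mode: there are two input sequences. The "source language sequence" is the input data, and the "source language sequence ID" is the index of the input data (this input is optional and can be omitted).
In `hook`, `src_dict_path` is the path of the source language dictionary, `trg_dict_path` is the path of the target language dictionary, and `is_generating` (training or generation mode) is an object passed in from the model configuration. For how `hook` is invoked, see [Model Configuration](#模型配置说明).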
```python
def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list,
         **kwargs):
    # job_mode = 1: training mode; 0: generation mode
    settings.job_mode = not is_generating
    def fun(dict_path):  # load a dictionary from its path
        out_dict = dict()
        with open(dict_path, "r") as fin:
            out_dict = {
                line.strip(): line_count
                for line_count, line in enumerate(fin)
            }
        return out_dict
    settings.src_dict = fun(src_dict_path)
    settings.trg_dict = fun(trg_dict_path)
    if settings.job_mode:  # training mode
        settings.input_types = {
            'source_language_word':  # source language sequence
            integer_value_sequence(len(settings.src_dict)),
            'target_language_word':  # target language sequence
            integer_value_sequence(len(settings.trg_dict)),
            'target_language_next_word':  # target language next-word sequence
            integer_value_sequence(len(settings.trg_dict))
        }
    else:  # generation mode
        settings.input_types = {
            'source_language_word':  # source language sequence
            integer_value_sequence(len(settings.src_dict)),
            'sent_id':  # source language sequence ID
            integer_value_sequence(len(open(file_list[0], "r").readlines()))
        }
```
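3. Finally, the `process` function opens the text file `file_name`, reads each line, converts the data in the line into the format matching `input_types`, and returns it to the PaddlePaddle process with the `yield` keyword. Specifically,
- prepend the start symbol `<s>` and append the end symbol `<e>` to each sentence of the source language sequence to obtain "source_language_word";
- prepend `<s>` to each sentence of the target language sequence to obtain "target_language_word";
- append `<e>` to each sentence of the target language sequence to obtain the target language next-word sequence ("target_language_next_word").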
```python
def _get_ids(s, dictionary): # look up the dictionary position of each word in the source language sequence
words = s.strip().split()
return [dictionary[START]] + \
[dictionary.get(w, UNK_IDX) for w in words] + \
[dictionary[END]]
@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
with open(file_name, 'r') as f:
for line_count, line in enumerate(f):
line_split = line.strip().split('\t')
if settings.job_mode and len(line_split) != 2:
continue
src_seq = line_split[0]
src_ids = _get_ids(src_seq, settings.src_dict)
if settings.job_mode:
trg_seq = line_split[1]
trg_words = trg_seq.split()
trg_ids = [settings.trg_dict.get(w, UNK_IDX) for w in trg_words]
# In training mode, drop the sample if either sequence is longer than 80 words, to keep the RNN from becoming too deep.
if len(src_ids) > 80 or len(trg_ids) > 80:
continue
trg_ids_next = trg_ids + [settings.trg_dict[END]]
trg_ids = [settings.trg_dict[START]] + trg_ids
yield {
'source_language_word': src_ids,
'target_language_word': trg_ids,
'target_language_next_word': trg_ids_next
}
else:
yield {'source_language_word': src_ids, 'sent_id': [line_count]}
```
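Note: the training data in this example is 3.55 GB. Machines with limited memory cannot load it all at once, so we recommend using the `pool_size` variable to set the number of samples buffered in memory.
## Model Configuration # Use CPU only, with a single CPU for training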
paddle.init(use_gpu=False, trainer_count=1)
```
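### Data Definition
1. First, define the dataset path and the source/target language dictionary paths, and use the `is_generating` variable to define whether the current configuration is training mode (default) or generation mode. This variable accepts a parameter passed from the command line; for usage, see [Commands and Results](#应用命令与结果). First define the dictionary size, which both data generation and the network configuration need. Then obtain the wmt14 dataset reader.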
```python
import os
from paddle.trainer_config_helpers import *
data_dir = "./data/pre-wmt14" # dataset path
src_lang_dict = os.path.join(data_dir, 'src.dict') # source language dictionary path
trg_lang_dict = os.path.join(data_dir, 'trg.dict') # target language dictionary path
is_generating = get_config_arg("is_generating", bool, False) # configuration mode
```
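2. Next, read data from `dataprovider.py` with the `define_py_data_sources2` function, and use the `args` variable to pass in the source/target language dictionary paths and the configuration mode.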
```python
if not is_generating:
train_list = os.path.join(data_dir, 'train.list')
test_list = os.path.join(data_dir, 'test.list')
else:
train_list = None
test_list = os.path.join(data_dir, 'gen.list')
define_py_data_sources2(
train_list,
test_list,
module="dataprovider",
obj="process",
args={
"src_dict_path": src_lang_dict, # source language dictionary path
"trg_dict_path": trg_lang_dict, # target language dictionary path
"is_generating": is_generating # configuration mode
})
```
### Algorithm Configuration
```python ```python
settings( # source and target dict dim.
learning_method = AdamOptimizer(), dict_size = 30000
batch_size = 50,
learning_rate = 5e-4) feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
``` ```
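To make the reader composition above easier to follow, here is a plain-Python sketch of what a buffered shuffle and a batching decorator do. These are simplified stand-ins written for illustration, not the implementations of `paddle.reader.shuffle` or `paddle.batch`; the `seed` parameter, the toy reader, and the choice to emit a final short batch are assumptions.

```python
import random

def shuffle(reader, buf_size, seed=None):
    """Return a reader that shuffles samples within buffers of buf_size."""
    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        rng.shuffle(buf)
        for item in buf:  # flush the last, partially filled buffer
            yield item
    return shuffled_reader

def batch(reader, batch_size):
    """Return a reader that groups consecutive samples into lists."""
    def batch_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # emit the final short batch (an assumption of this sketch)
            yield current
    return batch_reader

def toy_reader():
    # Each sample mimics (source ids, target ids, next-word ids).
    for i in range(7):
        yield ([0, i, 1], [0, i], [i, 1])

batches = list(batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)())
```

Each sample is a tuple whose positions 0, 1 and 2 line up with the indices given in the `feeding` dictionary.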
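This tutorial uses the default SGD (stochastic gradient descent) algorithm with the Adam learning method, and sets the learning rate to 5e-4. Note: `batch_size = 50` in generation mode means that 50 sequences are generated at the same time.
### Model Structure
1. First, some global variables are defined.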
```python ```python
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # source language dictionary dimension source_dict_dim = dict_size # source language dictionary dimension
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # target language dictionary dimension target_dict_dim = dict_size # target language dictionary dimension
word_vector_dim = 512 # word vector dimension
encoder_size = 512 # hidden size of the encoder GRU
decoder_size = 512 # hidden size of the decoder GRU
if is_generating:
beam_size=3 # beam width for beam search
max_length=250 # maximum length of a generated sentence
gen_trans_file = get_config_arg("gen_trans_file", str, None) # output file for generated results
``` ```
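2. Next, implement the encoder. It takes three steps:
2.1 Pass in the source language sequence $\mathbf{w}$, already converted to one-hot vectors in `dataprovider.py`. 2.1 Convert the source language sequence, which the dataset reader produces as the index of each word in the dictionary,
into the one-hot vector representation $\mathbf{w}$; its type is integer_value_sequence.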
```python ```python
src_word_id = data_layer(name='source_language_word', size=source_dict_dim) src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
2.2 Map the above encoding to word vectors $\mathbf{s}$ in a low-dimensional language space.
```python ```python
src_embedding = embedding_layer( src_embedding = paddle.layer.embedding(
input=src_word_id, input=src_word_id,
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
2.3 Encode the source language sequence with a bidirectional GRU, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$.
```python ```python
src_forward = simple_gru(input=src_embedding, size=encoder_size) src_forward = paddle.networks.simple_gru(
src_backward = simple_gru( input=src_embedding, size=encoder_size)
input=src_embedding, size=encoder_size, reverse=True) src_backward = paddle.networks.simple_gru(
encoded_vector = concat_layer(input=[src_forward, src_backward]) input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
3. Then, define the attention-based decoder. It takes three steps:
3.1 Pass the encoded source language sequence (see 2.3) through a feed-forward neural network to obtain its projection.
```python ```python
with mixed_layer(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += full_matrix_projection(input=encoded_vector) encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
``` ```
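3.2 Construct the initial state of the decoder RNN. The decoder needs to predict the target sequence over time, but there is no initial value at time 0, so we initialize it. Here the last state of the reverse encoding of the source language sequence is passed through a non-linear mapping and used as the initial value, i.e., $c_0=h_T$.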
```python ```python
backward_first = first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
with mixed_layer( with paddle.layer.mixed(
size=decoder_size, size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
act=TanhActivation(), ) as decoder_boot: decoder_boot += paddle.layer.full_matrix_projection(
decoder_boot += full_matrix_projection(input=backward_first) input=backward_first)
``` ```
3.3 Define the RNN behavior at each time step of decoding: given the current source-language context vector $c_i$, the decoder hidden state $z_i$ and the $i$-th target word $u_i$, predict the probability $p_{i+1}$ of the $(i+1)$-th word.
...@@ -514,43 +350,47 @@ settings( ...@@ -514,43 +350,47 @@ settings(
- context calls the `simple_attention` function to implement $c_i=\sum _{j=1}^{T}a_{ij}h_j$. Here enc_vec is $h_j$, enc_proj is the projection of $h_j$ (see 3.1), and the computation of the weights $a_{ij}$ is encapsulated in `simple_attention`.
- decoder_inputs fuses the representation of $c_i$ with that of the current target word current_word (i.e., $u_i$).
- gru_step calls the `gru_step_layer` function to apply the activation over decoder_inputs and decoder_mem, implementing $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$.
- Finally, softmax normalization computes the word probabilities and the result out is returned, implementing $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$.
```python ```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word): def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
decoder_mem = memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = simple_attention(
encoded_sequence=enc_vec, context = paddle.networks.simple_attention(
encoded_proj=enc_proj, encoded_sequence=enc_vec,
decoder_state=decoder_mem, ) encoded_proj=enc_proj,
decoder_state=decoder_mem)
with mixed_layer(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=context) with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=current_word) decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
gru_step = gru_step_layer( input=current_word)
name='gru_decoder',
input=decoder_inputs, gru_step = paddle.layer.gru_step(
output_mem=decoder_mem, name='gru_decoder',
size=decoder_size) input=decoder_inputs,
output_mem=decoder_mem,
with mixed_layer( size=decoder_size)
size=target_dict_dim, bias_attr=True,
act=SoftmaxActivation()) as out: with paddle.layer.mixed(
out += full_matrix_projection(input=gru_step) size=target_dict_dim,
return out bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
``` ```
4. Differences in how the decoder is invoked in training mode and in generation mode.
4.1 Define the name of the decoder group and the first two inputs of the `gru_decoder_with_attention` function. Note: these two inputs use `StaticInput`; for details see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入).
```python ```python
decoder_group_name = "decoder_group" decoder_group_name = "decoder_group"
group_input1 = StaticInput(input=encoded_vector, is_seq=True) group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = StaticInput(input=encoded_proj, is_seq=True) group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2] group_inputs = [group_input1, group_input2]
``` ```
4.2 Decoder invocation in training mode:
...@@ -560,99 +400,85 @@ settings( ...@@ -560,99 +400,85 @@ settings(
- Finally, the multi-class cross-entropy loss function `classification_cost` is used to compute the loss.
```python ```python
if not is_generating: trg_embedding = paddle.layer.embedding(
trg_embedding = embedding_layer( input=paddle.layer.data(
input=data_layer( name='target_language_word',
name='target_language_word', size=target_dict_dim), type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_target_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding) group_inputs.append(trg_embedding)
decoder = recurrent_group( # For decoder equipped with attention mechanism, in training,
name=decoder_group_name, # target embeding (the groudtruth) is the data input,
step=gru_decoder_with_attention, # while encoded source sequence is accessed to as an unbounded memory.
input=group_inputs) # Here, the StaticInput defines a read-only memory
# for the recurrent_group.
lbl = data_layer(name='target_language_next_word', size=target_dict_dim) decoder = paddle.layer.recurrent_group(
cost = classification_cost(input=decoder, label=lbl) name=decoder_group_name,
outputs(cost) step=gru_decoder_with_attention,
``` input=group_inputs)
4.3 Decoder invocation in generation mode:
lbl = paddle.layer.data(
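- First, in sequence generation, the decoder RNN always takes the word vector of the word generated at the previous time step as the input of the current time step, so `GeneratedInput` is used to do this automatically. For details see the [GeneratedInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入). name='target_language_next_word',
- Next, the `beam_search` function repeatedly calls the `gru_decoder_with_attention` function to generate the sequence IDs. type=paddle.data_type.integer_value_sequence(target_dict_dim))
- Finally, the `seqtext_printer_evaluator` function prints the complete sentences, according to the target dictionary `trg_lang_dict`, and saves them in `gen_trans_file`. cost = paddle.layer.classification_cost(input=decoder, label=lbl)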
```python
else:
trg_embedding = GeneratedInput(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
beam_gen = beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
seqtext_printer_evaluator(
input=beam_gen,
id_input=data_layer(
name="sent_id", size=1),
dict_file=trg_lang_dict,
result_file=gen_trans_file)
outputs(beam_gen)
``` ```
Note: the configuration we provide simplifies the one in Bahdanau's paper \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133).
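As a rough illustration of the attention step used inside the decoder, the weighted sum $c_i=\sum_j a_{ij}h_j$ can be sketched in plain Python. The dot-product score below is an assumption made for brevity; it is not how `simple_attention` computes its alignment scores, and the names `attention_context` and `dot` are hypothetical.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(z, H, score):
    """Return c_i = sum_j a_ij h_j and the weights a_ij = softmax_j(score(z_i, h_j))."""
    weights = softmax([score(z, h) for h in H])
    dim = len(H[0])
    context = [sum(a * h[k] for a, h in zip(weights, H)) for k in range(dim)]
    return context, weights

def dot(u, v):
    # Stand-in alignment score; a learned alignment model would replace this.
    return sum(a * b for a, b in zip(u, v))

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # encoder states h_1..h_3
z = [2.0, 0.0]                            # current decoder state z_i
context, weights = attention_context(z, H, dot)
```

Encoder states that score higher against the decoder state receive larger weights, so the context vector changes from one decoding step to the next as the decoder state changes.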
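### Parameter Definition
## Training the Model First, define the model parameters according to the `cost` of the model configuration.
The model can be trained with the following command: ```python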
# create parameters
parameters = paddle.parameters.create(cost)
```
```bash The parameter names can be printed; if a name is not specified in the network configuration, one is generated by default.
./train.sh
```python
for param in parameters.keys():
print param
``` ```
where the content of `train.sh` is:
```bash ### Training the Model
paddle train \ 1. Construct the trainer
--config='seqToseq_net.py' \
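--save_dir='model' \ Construct the trainer for training from the optimization objective cost, the network topology and the model parameters. An optimization method must also be specified at construction time; the most basic SGD method is used here.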
--use_gpu=false \
--num_passes=16 \ ```python
--show_parameter_stats_period=100 \ optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
--trainer_count=4 \ trainer = paddle.trainer.SGD(cost=cost,
--log_period=10 \ parameters=parameters,
--dot_period=5 \ update_equation=optimizer)
2>&1 | tee 'train.log'
``` ```
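- config: the configuration file of the neural network.
- save_dir: the output path for saving the model. 2. Construct the event_handler
- use_gpu: whether to train on GPU; CPU is used here.
- num_passes: the number of passes. In PaddlePaddle, one pass means one complete round of training over all samples in the dataset. A custom callback function can be used to evaluate various states during training, such as the error rate. The code below uses event.batch_id % 10 == 0 to print a log every 10 batches, including information such as the cost.
- show_parameter_stats_period: parameter statistics are shown every 100 batches here. ```python
- trainer_count: the number of CPU threads or GPU devices. def event_handler(event):
- log_period: a log is printed every 10 batches here. if isinstance(event, paddle.event.EndIteration):
- dot_period: a dot "." is printed every 5 batches here. if event.batch_id % 10 == 0: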
print "Pass %d, Batch %d, Cost %f, %s" % (
The training loss is printed every 10 batches, and you will see messages like the following: event.pass_id, event.batch_id, event.cost, event.metrics)
```text ```
I0719 19:16:45.952062 15563 TrainerInternal.cpp:160] Batch=10 samples=500 AvgCost=198.475 CurrentCost=198.475 Eval: classification_error_evaluator=0.737155 CurrentEval: classification_error_evaluator=0.737155 3. Start training:
I0719 19:17:56.707319 15563 TrainerInternal.cpp:160] Batch=20 samples=1000 AvgCost=157.479 CurrentCost=116.483 Eval: classification_error_evaluator=0.698392 CurrentEval: classification_error_evaluator=0.659065
..... ```python
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=10000,
feeding=feeding)
```
After training starts, the log output by event_handler looks like this:
```text
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
...
``` ```
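- AvgCost: the average loss from batch 0 to the current batch.
- CurrentCost: the loss of the current batch.
- classification\_error\_evaluator (Eval): the per-word prediction error rate from evaluation 0 to the current evaluation.
- classification\_error\_evaluator (CurrentEval): the per-word prediction error rate in the current evaluation.
Training is considered successful once the value of classification\_error\_evaluator falls below 0.35.
## Applying the Model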
...@@ -666,30 +492,7 @@ cd pretrained ...@@ -666,30 +492,7 @@ cd pretrained
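### Commands and Results
French-to-English translation can be performed with the following command: The new API does not yet support the translation process of machine translation; please stay tuned.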
```bash
./gen.sh
```
其中`gen.sh` 的内容为:
```bash
paddle train \
--job=test \
--config='seqToseq_net.py' \
--save_dir='pretrained/wmt14_model' \
--use_gpu=true \
--num_passes=13 \
--test_pass=12 \
--trainer_count=1 \
--config_args=is_generating=1,gen_trans_file="gen_result" \
2>&1 | tee 'translation/gen.log'
```
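The parameters that differ from the training command are:
- job: sets the job mode to test.
- save_dir: the path of the pretrained model.
- num_passes and test_pass: load the model parameters of pass $i\in \left [ test\_pass,num\_passes-1 \right ]$; here only `data/wmt14_model/pass-00012` is loaded.
- config_args: passes custom command-line parameters to the model configuration. `is_generating=1` means generation mode, and `gen_trans_file="gen_result"` is the file that stores the generated results.
For the translation results, see [Results](#效果展示).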
...@@ -728,6 +531,7 @@ BLEU = 26.92 ...@@ -728,6 +531,7 @@ BLEU = 26.92
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -746,6 +550,6 @@ marked.setOptions({ ...@@ -746,6 +550,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
...@@ -20,4 +20,4 @@ wget http://paddlepaddle.bj.bcebos.com/model_zoo/wmt14_model.tar.gz ...@@ -20,4 +20,4 @@ wget http://paddlepaddle.bj.bcebos.com/model_zoo/wmt14_model.tar.gz
# untar the model # untar the model
tar -zxvf wmt14_model.tar.gz tar -zxvf wmt14_model.tar.gz
rm wmt14_model.tar.gz rm wmt14_model.tar.gz
markdown_file=$1 import argparse
import re
import sys
# Notice: the single-quotes around EOF below make outputs HEAD = """
# verbatim. c.f. http://stackoverflow.com/a/9870274/724872
cat <<'EOF'
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -10,8 +10,8 @@ cat <<'EOF' ...@@ -10,8 +10,8 @@ cat <<'EOF'
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -44,11 +44,9 @@ cat <<'EOF' ...@@ -44,11 +44,9 @@ cat <<'EOF'
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
EOF """
cat $markdown_file TAIL = """
cat <<'EOF'
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -67,7 +65,31 @@ marked.setOptions({ ...@@ -67,7 +65,31 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
EOF """
def convert_markdown_into_html(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('filenames', nargs='*', help='Filenames to fix')
    args = parser.parse_args(argv)
    retv = 0
    for filename in args.filenames:
        # README.md -> index.html, README.en.md -> index.en.html
        with open(
                re.sub(r"README", "index", re.sub(r"\.md$", ".html", filename)),
                "w") as output:
            output.write(HEAD)
            with open(filename) as input:
                for line in input:
                    output.write(line)
            output.write(TAIL)
    return retv


if __name__ == '__main__':
    sys.exit(convert_markdown_into_html())
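The filename mapping used by the hook above can be checked in isolation. The following is a minimal sketch, assuming a hypothetical helper named `output_path` (it is not part of the commit) that applies the same two `re.sub` calls:

```python
import re


def output_path(markdown_path):
    """Hypothetical helper: map a Markdown path to an HTML path
    with the same two substitutions the hook applies."""
    html_path = re.sub(r"\.md$", ".html", markdown_path)
    return re.sub(r"README", "index", html_path)


print(output_path("README.md"))
print(output_path("README.en.md"))
```

Note that the second substitution is not anchored, so it would also rewrite `README` if that string appeared in a directory name within the path.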
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -39,6 +40,7 @@ ...@@ -39,6 +40,7 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -57,6 +59,6 @@ marked.setOptions({ ...@@ -57,6 +59,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
...@@ -87,7 +87,7 @@ Fig. 5 Pooling layer<br/> ...@@ -87,7 +87,7 @@ Fig. 5 Pooling layer<br/>
A pooling layer performs downsampling. Its main purpose is to reduce computation by shrinking the feature maps, which also reduces the number of parameters in the layers that follow and helps prevent overfitting to some extent. A pooling layer is usually added after a convolutional layer. There are several types of pooling, such as max pooling and average pooling. Max pooling partitions the input into rectangular regions and outputs the maximum value of each region (Fig. 5).
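
Max pooling as described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `max_pool_2d`, the non-overlapping windows, and the toy 4x4 input are all made up for this example and are not taken from PaddlePaddle.

```python
def max_pool_2d(image, pool_size=2):
    """Non-overlapping max pooling over a 2-D list of numbers.

    Assumes the height and width of `image` are multiples of `pool_size`.
    """
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows, pool_size):
        pooled_row = []
        for c in range(0, cols, pool_size):
            # Collect the values inside one rectangular region.
            window = [image[i][j]
                      for i in range(r, r + pool_size)
                      for j in range(c, c + pool_size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
]
print(max_pool_2d(feature_map))  # [[6, 4], [7, 9]]
```

The 4x4 input becomes a 2x2 output, so every later layer sees a quarter as many values.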
#### LeNet-5 Network #### LeNet-5 Network
<p align="center"> <p align="center">
<img src="image/cnn_en.png"><br/> <img src="image/cnn_en.png"><br/>
...@@ -227,7 +227,7 @@ trainer = paddle.trainer.SGD(cost=cost, ...@@ -227,7 +227,7 @@ trainer = paddle.trainer.SGD(cost=cost,
Then we specify the training data `paddle.dataset.movielens.train()` and the testing data `paddle.dataset.movielens.test()`. These two functions are *reader creators*: calling one returns a *reader*. A reader is a Python function that, when called, returns a Python generator that yields data instances.
Here `shuffle` is a reader decorator. It takes a reader A as its parameter and returns a new reader B, where B calls A to read `buffer_size` data instances into a buffer at a time, shuffles the buffer, and yields the instances in it. For more thoroughly shuffled data, use a larger buffer size.
`batch` is a special decorator whose input is a reader and whose output is a *batch reader*, which yields a minibatch at a time instead of a single instance.
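
The reader, reader creator, and decorator concepts above can be illustrated without PaddlePaddle at all. The sketch below is a simplified re-implementation for explanation only; the names `reader_creator`, `shuffle`, and `batch` and their parameters are assumptions modeled on the description, not the library's actual code.

```python
import random


def reader_creator(n):
    """Reader creator: returns a reader that yields the integers 0..n-1."""
    def reader():
        for i in range(n):
            yield i
    return reader


def shuffle(reader, buffer_size):
    """Reader decorator: buffer `buffer_size` instances, shuffle, then yield."""
    def shuffled_reader():
        buf = []
        for instance in reader():
            buf.append(instance)
            if len(buf) >= buffer_size:
                random.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the last, partially filled buffer.
        random.shuffle(buf)
        for item in buf:
            yield item
    return shuffled_reader


def batch(reader, batch_size):
    """Decorator that turns a reader into a batch reader."""
    def batch_reader():
        minibatch = []
        for instance in reader():
            minibatch.append(instance)
            if len(minibatch) == batch_size:
                yield minibatch
                minibatch = []
        if minibatch:
            yield minibatch
    return batch_reader


train_reader = batch(shuffle(reader_creator(10), buffer_size=4), batch_size=3)
batches = list(train_reader())
print([len(b) for b in batches])  # [3, 3, 3, 1]
```

Because shuffling only happens inside each buffer, a small `buffer_size` keeps instances close to their original positions, which is why a larger buffer gives better-mixed data.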
......
...@@ -56,7 +56,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -56,7 +56,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
1. After the first hidden layer, we obtain $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function; common choices include sigmoid, tanh, and ReLU.
2. After the second hidden layer, we obtain $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, the output layer yields $Y=softmax(W_3H_2 + b_3)$, which is the final classification result vector.
Figure 3 shows the network structure of the multilayer perceptron. Weights are drawn as blue lines and biases as red lines, and +1 indicates that the coefficient of the bias parameter is 1.
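
The three steps above can be traced with a small forward pass in plain Python. This is a sketch under stated assumptions: the helper names (`affine`, `relu`, `softmax`, `mlp_forward`), the choice of ReLU for $\phi$, and the toy weights are all illustrative and do not come from the tutorial's PaddlePaddle code.

```python
import math


def affine(W, x, b):
    """Compute W x + b for a weight matrix W (list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, params, phi=relu):
    """H1 = phi(W1 x + b1), H2 = phi(W2 H1 + b2), Y = softmax(W3 H2 + b3)."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = phi(affine(W1, x, b1))
    h2 = phi(affine(W2, h1, b2))
    return softmax(affine(W3, h2, b3))


# Toy parameters: 2 inputs -> 3 hidden -> 2 hidden -> 3 classes.
params = [
    ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.6]], [0.05, -0.05]),
    ([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]),
]
y = mlp_forward([1.0, 2.0], params)
print(y)
```

The output `y` is a probability vector: its entries are positive and sum to 1, one entry per class.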
......
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -128,7 +129,7 @@ Fig. 5 Pooling layer<br/> ...@@ -128,7 +129,7 @@ Fig. 5 Pooling layer<br/>
A pooling layer performs downsampling. Its main purpose is to reduce computation by shrinking the feature maps, which also reduces the number of parameters in the layers that follow and helps prevent overfitting to some extent. A pooling layer is usually added after a convolutional layer. There are several types of pooling, such as max pooling and average pooling. Max pooling partitions the input into rectangular regions and outputs the maximum value of each region (Fig. 5).
#### LeNet-5 Network #### LeNet-5 Network
<p align="center"> <p align="center">
<img src="image/cnn_en.png"><br/> <img src="image/cnn_en.png"><br/>
...@@ -268,7 +269,7 @@ trainer = paddle.trainer.SGD(cost=cost, ...@@ -268,7 +269,7 @@ trainer = paddle.trainer.SGD(cost=cost,
Then we specify the training data `paddle.dataset.movielens.train()` and the testing data `paddle.dataset.movielens.test()`. These two functions are *reader creators*: calling one returns a *reader*. A reader is a Python function that, when called, returns a Python generator that yields data instances.
Here `shuffle` is a reader decorator. It takes a reader A as its parameter and returns a new reader B, where B calls A to read `buffer_size` data instances into a buffer at a time, shuffles the buffer, and yields the instances in it. For more thoroughly shuffled data, use a larger buffer size.
`batch` is a special decorator whose input is a reader and whose output is a *batch reader*, which yields a minibatch at a time instead of a single instance.
...@@ -338,6 +339,7 @@ This tutorial describes a few basic Deep Learning models viz. Softmax regression ...@@ -338,6 +339,7 @@ This tutorial describes a few basic Deep Learning models viz. Softmax regression
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This book</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -356,6 +358,6 @@ marked.setOptions({ ...@@ -356,6 +358,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -97,7 +98,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -97,7 +98,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
1. After the first hidden layer, we obtain $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function; common choices include sigmoid, tanh, and ReLU.
2. After the second hidden layer, we obtain $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, the output layer yields $Y=softmax(W_3H_2 + b_3)$, which is the final classification result vector.
Figure 3 shows the network structure of the multilayer perceptron. Weights are drawn as blue lines and biases as red lines, and +1 indicates that the coefficient of the bias parameter is 1.
...@@ -340,6 +341,7 @@ trainer.train( ...@@ -340,6 +341,7 @@ trainer.train(
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -358,6 +360,6 @@ marked.setOptions({ ...@@ -358,6 +360,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
.idea
.ipynb_checkpoints
...@@ -72,7 +72,7 @@ Given the feature vectors of users and movies, we compute the relevance using co ...@@ -72,7 +72,7 @@ Given the feature vectors of users and movies, we compute the relevance using co
<img src="image/rec_regression_network_en.png" width="90%" ><br/> <img src="image/rec_regression_network_en.png" width="90%" ><br/>
Figure 3. A hybrid recommendation model. Figure 3. A hybrid recommendation model.
</p> </p>
## Dataset ## Dataset
......
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
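},
"source": [
"# Personalized Recommendation\n",
"\n",
"The source code for this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/recommender_system). First-time users should refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).\n",
"\n",
"## Background\n",
"\n",
"With the continuous development of network technology and the ever-growing scale of e-commerce, the number and variety of products are increasing rapidly, and users must spend a great deal of time to find the products they want to buy. This is the problem of information overload. Recommender systems emerged to address it.\n",
"\n",
"A personalized recommender system is a subset of information filtering systems and can be used in many domains, such as movies, music, e-commerce, and feed recommendation. It analyzes and mines user behavior to discover each user's individual needs and interests, and recommends information or products the user may be interested in. Unlike a search engine, a recommender system does not require users to describe their needs precisely; instead, it models their historical behavior and proactively provides information that matches their interests and needs.\n",
"\n",
"Traditional recommender system methods mainly include:\n",
"\n",
"- Collaborative Filtering Recommendation: this method collects and analyzes users' historical behavior, activities, and preferences, computes the similarity between one user and other users, and uses the weighted ratings given to a product by users similar to the target user to predict how much the target user will like that product. Its advantage is that it can recommend new products the user has not browsed. Its drawbacks are the cold-start problem for new users with no recorded behavior, and the sparsity problem caused by too little user-product interaction data, which makes it hard for the model to find similar users.\n",
"- Content-based Filtering Recommendation[[1](#参考文献)]: this method uses the content descriptions of products to abstract meaningful features, and makes recommendations by computing the similarity between a user's interests and the product descriptions. Its advantage is that it is simple and direct: it does not rely on other users' ratings, but measures product similarity through product attributes and recommends products similar to those the user is interested in. Its drawback is that it also suffers from the cold-start problem for new users with no recorded behavior.\n",
"- Hybrid Recommendation[[2](#参考文献)]: this method combines different inputs and techniques to make recommendations, so that each technique compensates for the weaknesses of the others.\n",
"\n",
"Collaborative filtering is one of the most widely used techniques and can be divided into several subclasses: user-based recommendation[[3](#参考文献)], item-based recommendation[[4](#参考文献)], social-network-based recommendation[[5](#参考文献)], model-based recommendation, and so on. The GroupLens system[[3](#参考文献)], released by the University of Minnesota in 1994, is generally regarded as the milestone that made recommender systems a relatively independent research area. It was the first system to propose completing the recommendation task through collaborative filtering, and collaborative filtering based on this model led the development of recommender systems for more than a decade.\n",
"\n",
"Deep learning is good at extracting features automatically. It can learn multi-level abstract feature representations and learn from heterogeneous or cross-domain content, which can alleviate the cold-start problem of recommender systems to some extent[[6](#参考文献)]. This tutorial introduces deep learning models for personalized recommendation and shows how to implement them with PaddlePaddle.\n",
"\n",
"## Demonstration\n",
"\n",
"We use a dataset containing user information, movie information, and movie ratings as the application scenario for personalized recommendation. Once the model is trained, we only need to input a user ID and a movie ID to obtain a matching score (in the range [1, 5]; a higher score indicates greater interest). We can then rank all movies by their predicted scores and recommend to the user the movies they may be interested in.\n",
"\n",
"```\n",
"Input movie_id: 1962\n",
"Input user_id: 1\n",
"Prediction Score is 4.25\n",
"```\n",
"\n",
"## Model Overview\n",
"\n",
"In this chapter, we first introduce YouTube's video recommender system[[7](#参考文献)], and then the hybrid recommendation model that we implement.\n",
"\n",
"### YouTube's Deep Neural Network Recommender System\n",
"\n",
"YouTube is the world's largest website for uploading, sharing, and discovering videos. Its recommender system recommends personalized content from an ever-growing video corpus to more than one billion users. The whole system consists of two neural networks: a candidate generation network and a ranking network. The candidate generation network generates hundreds of candidates from a corpus of millions of videos, and the ranking network scores and ranks the candidates, outputting the few dozen highest-ranked results. The system structure is shown in Figure 1:\n",
"\n",
"<p align=\"center\">\n",
"<img src=\"image/YouTube_Overview.png\" width=\"70%\" ><br/>\n",
"Figure 1. Structure of the YouTube recommender system\n",
"</p>\n",
"\n",
"#### Candidate Generation Network\n",
"\n",
"The candidate generation network models recommendation as a multi-class classification problem with an extremely large number of classes. For a YouTube user, it uses the watch history (video IDs), search tokens, demographic information (such as geographic location and login device), binary features (such as gender and whether the user is logged in), and continuous features (such as user age) to classify over all videos in the corpus. This yields a result for every class (that is, the recommendation probability of every video), and the network finally outputs the few hundred videos with the highest probabilities.\n",
"\n",
"First, historical information such as the watch history and search tokens is mapped to vectors and averaged to obtain fixed-length representations. Demographic features are also fed in to improve recommendations for new users, and binary and continuous features are normalized to the range [0, 1]. Next, all feature representations are concatenated into a single vector and fed into a nonlinear multilayer perceptron (MLP; see the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/recognize_digits/README.md) tutorial). Finally, during training the MLP output is passed to a softmax for classification; during prediction, the similarity between the user's combined feature (the MLP output) and every video is computed, and the $k$ videos with the highest scores are taken as the output of the candidate generation network. Figure 2 shows the structure of the candidate generation network.\n",
"\n",
"<p align=\"center\">\n",
"<img src=\"image/Deep_candidate_generation_model_architecture.png\" width=\"70%\" ><br/>\n",
"Figure 2. Structure of the candidate generation network\n",
"</p>\n",
"\n",
"For a user $U$, the probability that the video $\\omega$ the user will watch at this moment is video $i$ is:\n",
"\n",
"$$P(\\omega=i|u)=\\frac{e^{v_{i}u}}{\\sum_{j \\in V}e^{v_{j}u}}$$\n",
"\n",
"Here $u$ is the feature representation of user $U$, $V$ is the set of videos in the corpus, and $v_i$ is the feature representation of the $i$-th video. $u$ and $v_i$ are vectors of equal length, and their dot product can be implemented with a fully connected layer.\n",
"\n",
"Because the softmax has a very large number of classes, two measures keep the computation efficient: 1) during training, negative sampling of classes reduces the number of classes actually computed to a few thousand; 2) during recommendation (prediction), the softmax normalization is skipped (this does not affect the ranking), which reduces the scoring problem to a nearest neighbor search in dot product space, and the $k$ videos nearest to $u$ are taken as the generated candidates.\n",
"\n",
"#### Ranking Network\n",
"The ranking network has a structure similar to the candidate generation network, but its goal is to score and rank the candidates more finely. As with feature extraction in traditional ad ranking, a large number of relevant features are constructed for video ranking (such as video ID and last watch time). These features are processed in the same way as in the candidate generation network. The difference is that the top of the ranking network is a weighted logistic regression, which scores all candidate videos, sorts them from high to low, and returns the higher-scoring videos to the user.\n",
"\n",
"### Hybrid Recommendation Model\n",
"\n",
"In the movie recommender system described below:\n",
"\n",
"1. First, user features and movie features are used as the inputs to the neural network, where:\n",
"\n",
" - The user features combine four attributes: user ID, gender, occupation, and age.\n",
"\n",
" - The movie features combine three attributes: movie ID, movie genre ID, and movie title.\n",
"\n",
"2. For the user features, the user ID is mapped to a vector representation of dimension 256 and fed into a fully connected layer; the other three attributes are processed similarly. The feature representations of the four attributes are then each passed through a fully connected layer and summed.\n",
"\n",
"3. For the movie features, the movie ID is processed in the same way as the user ID, the movie genre ID is fed directly into a fully connected layer as a vector, and the movie title is converted into a fixed-length vector representation by a text convolutional neural network (see [Chapter 5](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md)). The feature representations of the three attributes are then each passed through a fully connected layer and summed.\n",
"\n",
"4. After obtaining the vector representations of the user and the movie, we compute their cosine similarity as the score of the recommender system. Finally, the square of the difference between this similarity score and the user's true rating is used as the loss function of the regression model.\n",
"\n",
"<p align=\"center\">\n",
"\n",
"<img src=\"image/rec_regression_network.png\" width=\"90%\" ><br/>\n",
"Figure 3. A hybrid recommendation model. \n",
"</p> "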
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
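},
"source": [
"## Data Preparation\n",
"\n",
"### Dataset Introduction and Download\n",
"\n",
"We use the [MovieLens 1M dataset (ml-1m)](http://files.grouplens.org/datasets/movielens/ml-1m.zip) as an example. The ml-1m dataset contains 1,000,000 ratings of 4,000 movies from 6,000 users (ratings are integers from 1 to 5) and was collected by the GroupLens Research lab.\n",
"\n",
"Paddle provides a module in its API that loads the data automatically: `paddle.dataset.movielens`"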
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"import paddle.v2 as paddle\n",
"paddle.init(use_gpu=False)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Run this block to show dataset's documentation\n",
"# help(paddle.dataset.movielens)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The raw data contains movie features, user features, and users' ratings of movies.\n",
"\n",
"For example, the features of one movie are:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<MovieInfo id(1), title(Toy Story ), categories(['Animation', \"Children's\", 'Comedy'])>\n"
]
}
],
"source": [
"movie_info = paddle.dataset.movielens.movie_info()\n",
"print movie_info.values()[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This means that the movie ID is 1, the title is *Toy Story*, and the movie belongs to three categories: Animation, Children's, and Comedy."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<UserInfo id(1), gender(F), age(1), job(10)>\n"
]
}
],
"source": [
"user_info = paddle.dataset.movielens.user_info()\n",
"print user_info.values()[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
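},
"source": [
"This means that the user ID is 1, the user is female and under 18 years old, and the occupation ID is 10.\n",
"\n",
"\n",
"Ages are encoded using the following groups:\n",
"* 1: \"Under 18\"\n",
"* 18: \"18-24\"\n",
"* 25: \"25-34\"\n",
"* 35: \"35-44\"\n",
"* 45: \"45-49\"\n",
"* 50: \"50-55\"\n",
"* 56: \"56+\"\n",
"\n",
"The occupation is chosen from the following options:\n",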
"* 0: \"other\" or not specified\n",
"* 1: \"academic/educator\"\n",
"* 2: \"artist\"\n",
"* 3: \"clerical/admin\"\n",
"* 4: \"college/grad student\"\n",
"* 5: \"customer service\"\n",
"* 6: \"doctor/health care\"\n",
"* 7: \"executive/managerial\"\n",
"* 8: \"farmer\"\n",
"* 9: \"homemaker\"\n",
"* 10: \"K-12 student\"\n",
"* 11: \"lawyer\"\n",
"* 12: \"programmer\"\n",
"* 13: \"retired\"\n",
"* 14: \"sales/marketing\"\n",
"* 15: \"scientist\"\n",
"* 16: \"self-employed\"\n",
"* 17: \"technician/engineer\"\n",
"* 18: \"tradesman/craftsman\"\n",
"* 19: \"unemployed\"\n",
"* 20: \"writer\""
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Each training or testing sample has the form <user features> + <movie features> + rating.\n",
"\n",
"For example, we can fetch the first training sample:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]\n"
]
}
],
"source": [
"train_set_creator = paddle.dataset.movielens.train()\n",
"train_sample = next(train_set_creator())\n",
"uid = train_sample[0]\n",
"mov_id = train_sample[len(user_info[uid].value())]\n",
"print \"User %s rates Movie %s with Score %s\"%(user_info[uid], movie_info[mov_id], train_sample[-1])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"That is, user 1 gave movie 1193 a rating of 5."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Model Configuration\n",
"\n",
"Next, we configure the model according to the format of the input data."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"uid = paddle.layer.data(\n",
" name='user_id',\n",
" type=paddle.data_type.integer_value(\n",
" paddle.dataset.movielens.max_user_id() + 1))\n",
"usr_emb = paddle.layer.embedding(input=uid, size=32)\n",
"\n",
"usr_gender_id = paddle.layer.data(\n",
" name='gender_id', type=paddle.data_type.integer_value(2))\n",
"usr_gender_emb = paddle.layer.embedding(input=usr_gender_id, size=16)\n",
"\n",
"usr_age_id = paddle.layer.data(\n",
" name='age_id',\n",
" type=paddle.data_type.integer_value(\n",
" len(paddle.dataset.movielens.age_table)))\n",
"usr_age_emb = paddle.layer.embedding(input=usr_age_id, size=16)\n",
"\n",
"usr_job_id = paddle.layer.data(\n",
" name='job_id',\n",
" type=paddle.data_type.integer_value(paddle.dataset.movielens.max_job_id(\n",
" ) + 1))\n",
"usr_job_emb = paddle.layer.embedding(input=usr_job_id, size=16)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
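},
"source": [
"As the code above shows, each user has four input features: `user_id`, `gender_id`, `age_id`, and `job_id`. Each of these features is a plain integer. To make them easier for the subsequent neural network to process, we borrow the idea from language models in NLP and look up an embedding for each discrete integer value, producing `usr_emb`, `usr_gender_emb`, `usr_age_emb`, and `usr_job_emb` respectively."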
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"usr_combined_features = paddle.layer.fc(\n",
" input=[usr_emb, usr_gender_emb, usr_age_emb, usr_job_emb],\n",
" size=200,\n",
" act=paddle.activation.Tanh())"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Then we feed all the user features into a fully connected (fc) layer, which fuses them into a single 200-dimensional feature."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Next, we apply a similar transformation to each movie feature. The network configuration is:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"mov_id = paddle.layer.data(\n",
" name='movie_id',\n",
" type=paddle.data_type.integer_value(\n",
" paddle.dataset.movielens.max_movie_id() + 1))\n",
"mov_emb = paddle.layer.embedding(input=mov_id, size=32)\n",
"\n",
"mov_categories = paddle.layer.data(\n",
" name='category_id',\n",
" type=paddle.data_type.sparse_binary_vector(\n",
" len(paddle.dataset.movielens.movie_categories())))\n",
"\n",
"mov_categories_hidden = paddle.layer.fc(input=mov_categories, size=32)\n",
"\n",
"\n",
"movie_title_dict = paddle.dataset.movielens.get_movie_title_dict()\n",
"mov_title_id = paddle.layer.data(\n",
" name='movie_title',\n",
" type=paddle.data_type.integer_value_sequence(len(movie_title_dict)))\n",
"mov_title_emb = paddle.layer.embedding(input=mov_title_id, size=32)\n",
"mov_title_conv = paddle.networks.sequence_conv_pool(\n",
" input=mov_title_emb, hidden_size=32, context_len=3)\n",
"\n",
"mov_combined_features = paddle.layer.fc(\n",
" input=[mov_emb, mov_categories_hidden, mov_title_conv],\n",
" size=200,\n",
" act=paddle.activation.Tanh())"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
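},
"source": [
"The movie ID and movie genre are each mapped to their corresponding hidden feature layers. The movie title is a sequence of words represented as a sequence of IDs. After it passes through a convolution layer, we obtain a feature for each time window (sequence features), and downsampling over the time dimension then yields a fixed-size feature. The whole process is implemented in `sequence_conv_pool`.\n",
"\n",
"Finally, the movie features are fused into `mov_combined_features`."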
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"inference = paddle.layer.cos_sim(a=usr_combined_features, b=mov_combined_features, size=1, scale=5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
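},
"source": [
"Next, we use cosine similarity to measure the similarity between the user features and the movie features, and fit (regress) this similarity to the user's rating."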
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"cost = paddle.layer.regression_cost(\n",
" input=inference,\n",
" label=paddle.layer.data(\n",
" name='score', type=paddle.data_type.dense_vector(1)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"At this point, our optimization objective is the `cost` in this network configuration."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
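},
"source": [
"## Model Training\n",
"\n",
"### Define Parameters\n",
"A neural network model can be understood simply as a network topology plus parameters. In the previous section we defined the optimization objective `cost`, which determines the network topology. Before training the model, we first need to create the parameters, as follows:"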
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[INFO 2017-03-06 17:12:13,284 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]\n",
"[INFO 2017-03-06 17:12:13,287 networks.py:1478] The output order is [__regression_cost_0__]\n"
]
}
],
"source": [
"parameters = paddle.parameters.create(cost)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
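},
"source": [
"`parameters` is the collection of all parameters of the model. It is a Python dict, so we can list the names of all parameters in the network. Because we did not specify parameter names when defining the model, the names here were generated automatically. We could also name each parameter explicitly to ease future maintenance."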
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[u'___fc_layer_2__.wbias', u'___fc_layer_2__.w2', u'___embedding_layer_3__.w0', u'___embedding_layer_5__.w0', u'___embedding_layer_2__.w0', u'___embedding_layer_1__.w0', u'___fc_layer_1__.wbias', u'___fc_layer_0__.wbias', u'___fc_layer_1__.w0', u'___fc_layer_0__.w2', u'___fc_layer_0__.w3', u'___fc_layer_0__.w0', u'___fc_layer_0__.w1', u'___fc_layer_2__.w1', u'___fc_layer_2__.w0', u'___embedding_layer_4__.w0', u'___sequence_conv_pool_0___conv_fc.w0', u'___embedding_layer_0__.w0', u'___sequence_conv_pool_0___conv_fc.wbias']\n"
]
}
],
"source": [
"print parameters.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
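},
"source": [
"### Create the Trainer\n",
"\n",
"Next, we create a local trainer from the network topology and the model parameters. When creating it, we also need to specify the optimization method; here we use Adam as the optimization algorithm."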
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[INFO 2017-03-06 17:12:13,378 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]\n",
"[INFO 2017-03-06 17:12:13,379 networks.py:1478] The output order is [__regression_cost_0__]\n"
]
}
],
"source": [
"trainer = paddle.trainer.SGD(cost=cost, parameters=parameters, \n",
" update_equation=paddle.optimizer.Adam(learning_rate=1e-4))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
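},
"source": [
"### Training\n",
"\n",
"Now we start the training process.\n",
"\n",
"We use the dataset readers provided by Paddle directly: `paddle.dataset.movielens.train()` and `paddle.dataset.movielens.test()` serve as the training and testing datasets respectively, and `reader_dict` specifies how each field of a data instance maps to a data layer.\n",
"\n",
"For example, the `reader_dict` here means that the `user_id` data layer uses element 0 of each instance from the reader, the `gender_id` data layer uses element 1, and so on.\n",
"\n",
"Training is fully automatic. We can use an event_handler to monitor the training process, run tests, and so on. Here the event_handler plots the training and testing error curves and also saves the model."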
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzsnXd4HNX1v987s0VarbptuVvucpcLYGOaKQm2CRACKQQI\nEEpCEgL8EkIKIUC+hEBCEkIIMSWVTmgBjI0pxmCwccW9N7mrd22b3x+zMzu7Oyutula+7/P4sXbq\n3dndzz333HPOFZqmIZFIJJLUR+nuBkgkEomkY5CCLpFIJL0EKegSiUTSS5CCLpFIJL0EKegSiUTS\nS5CCLpFIJL0EKegSiUTSS5CCLpFIJL0EKegSiUTSS3B05c369OmjFRYWduUtJRKJJOVZvXp1qaZp\nfVs6rksFvbCwkFWrVnXlLSUSiSTlEULsS+Y46XKRSCSSXoIUdIlEIuklSEGXSCSSXkKX+tDt8Pv9\nlJSU0NjY2N1N6TWkpaUxePBgnE5ndzdFIpF0Id0u6CUlJWRmZlJYWIgQorubk/JomkZZWRklJSUM\nHz68u5sjkUi6kG53uTQ2NpKfny/FvIMQQpCfny9HPBLJCUi3CzogxbyDkc9TIjkx6RGC3hIVdT7K\napu6uxkSiUTSo0kJQa9q8FNe5+uUa5eVlVFcXExxcTH9+/dn0KBB5mufL7l7XnPNNWzbtq1V933z\nzTeZPn06EyZMoLi4mJ/85CetbvuaNWt4++23W32eRCLpnXT7pGgyCAGhTlrLOj8/n3Xr1gHwq1/9\nCq/Xy49+9KOoYzRNQ9M0FMW+//v73//eqnuuX7+eW265hTfffJMxY8YQDAZZsGBBq9u+Zs0aNm7c\nyPnnn9/qcyUSSe8jJSx0RQhCWicpegJ27tzJ+PHj+eY3v8mECRM4fPgwN9xwAzNmzGDChAncc889\n5rGnnXYa69atIxAIkJOTwx133MGUKVOYNWsWx44di7v2b3/7W+68807GjBkDgKqqfPe73wVgz549\nzJkzh8mTJ3PeeedRUlICwHPPPcfEiROZMmUKc+bMoaGhgXvuuYenn36a4uJiXnrppS54KhKJpCfT\noyz0u/+3ic2HquO2+wIhAqEQHlfrmzt+YBZ3fWlCm9qzdetW/vWvfzFjxgwA7r//fvLy8ggEAsyZ\nM4dLL72U8ePHR51TVVXFmWeeyf33389tt93GU089xR133BF1zMaNG/n5z39ue8+bbrqJ6667jm9+\n85ssWLCAW265hZdeeom7776bDz74gIKCAiorK0lPT+eXv/wlGzdu5I9//GOb3p9EIuldpISFjoCu\ntc91Ro4caYo5wLPPPsu0adOYNm0aW7ZsYfPmzXHnpKenM3fuXACmT5/O3r17W3XPFStW8PWvfx2A\nq666imXLlgEwe/ZsrrrqKp544glCoVAb35FEIunN9CgLPZElfbS6kaPVjUwalN2lIXkZGRnm3zt2\n7OBPf/oTK1euJCcnhyuuuMI21tvlcpl/q6pKIBCIO2bChAmsXr2aCROSHzk8/vjjrFixgjfeeINp\n06axdu3aVr4biUTS20kJC10Ja3hnTYwmQ3V1NZmZmWRlZXH48GEWLVrU5mvdfvvt3HvvvezcuROA\nYDDIY489BsDMmTN54YUXAPjPf/7DGWecAcDu3buZOXMm9957L7m5uRw8eJDMzExqamra+c4kEklv\noUdZ6IkwrPKQpqHSPUkz06ZNY/z48RQVFTFs2DBmz57d5mtNnTqV3//+93z1q181rfyLLroIgL/8\n5S9ce+21/OY3v6GgoMCMoLn11lvZs2cPmqbxhS98gYkTJ1JQUMCDDz7I1KlT+fnPf86ll17a/jcq\nkUhSFqF1YfTIjBkztNgFLrZs2cK4ceOaPa+8zkdJRT1F/TNxOdTObGKvIZnnKpFIUgMhxGpN02a0\ndJx0uUgkEkkvIUUEPeJykUgkEok9qSHoYRM9
JE10iUQiSUhKCLoattCDUtAlEokkIakh6GELPShd\nLhKJRJKQ1BJ0aaFLJBJJQlJC0BUBAtEpgt4R5XMBnnrqKY4cOWK7T9M0HnjgAcaOHUtxcTEnnXQS\nTz/9dKvb+vLLL7N169ZWnyeRSE4MUiaxSFU6R9CTKZ+bDE899RTTpk2jf//+cfv+8pe/8P7777Nq\n1SoyMzOpqqritddea/U9Xn75ZRRFoaioqNXnSiSS3k9KWOhAWNC79p7//Oc/OfnkkykuLuamm24i\nFAoRCAS48sormTRpEhMnTuThhx/m+eefZ926dXzta1+ztezvu+8+HnvsMTIzMwHIzs7mqquuAmDx\n4sUUFxczadIkrr/+evPcH//4x4wfP57Jkyfzk5/8hGXLlvHWW29x6623Ulxc3OqiXxKJpPfTsyz0\nhXfAkQ22u4b6A4AAZyszRftPgrn3t7opGzdu5JVXXmH58uU4HA5uuOEGnnvuOUaOHElpaSkbNujt\nrKysJCcnhz//+c888sgjFBcXR12nvLwcv9/PsGHD4u5RX1/Ptddey9KlSxk5cqRZMveyyy7jrbfe\nYtOmTQghzHvMmzePSy+9lIsvvrjV70cikfR+UsNC14IodK15vmTJEj777DNmzJhBcXExS5cuZdeu\nXYwaNYpt27Zx8803s2jRIrKzs9t8jy1btjBmzBhGjhwJ6OVyP/zwQ/Ly8lAUheuvv55XXnklquqj\nRCKRJKJnWeiJLOnSnSj+JvaKoYztn9klTdE0jWuvvZZ77703bt/nn3/OwoUL+ctf/sJ///vfZpeP\ny8vLw+l0sn//foYOHZrUvZ1OJ6tWreKdd97hxRdf5K9//SuLFy9u83uRSCQnBqlhoadl49R8qKGm\nLrvlueeeywsvvEBpaSmgR8Ps37+f48ePo2kal112Gffccw9r1qwBaLaU7R133MFNN91k7q+urubf\n//4348aNY8eOHezevRvQy+WeeeaZ1NTUUF1dzQUXXMAf/vAHs/a5LJcrkUiao2dZ6IlIy4bqErxa\nHZqW3yWLXEyaNIm77rqLc889l1AohNPp5LHHHkNVVb797W+jaRpCCH77298CcM0113DdddeRnp7O\nypUroxa6+MEPfkBdXR3Tp0/H5XLhdDq5/fbb8Xg8PPnkk1xyySUEg0FOOeUUrr/+eo4dO8Yll1xC\nU1MToVCIhx56CIBvfOMb3Hjjjfz+97/n1VdfpbCwsNOfg0QiSR1aLJ8rhHgKuAA4pmnaxPC2POB5\noBDYC3xV07SKlm7W1vK5AP4jW/AFNdIGjDMTjSSJkeVzJZLeQ0eWz/0HcH7MtjuAdzVNGw28G37d\nqfidmXhoIhRIPtlHIpFITiRaFHRN0z4EymM2XwT8M/z3P4FOj6MLuLIQAmiq7uxbSSQSSUrS1knR\nAk3TDof/PgIUtKcRSa2a5EjHpzkQjVXtudUJQVeuQiWRSHoO7Y5y0XT1SKggQogbhBCrhBCrjh8/\nHrc/LS2NsrKyFkVIUQTVeFD9tRAKtrfZvRZN0ygrKyMtLa27myKRSLqYtka5HBVCDNA07bAQYgBw\nLNGBmqYtABaAPikau3/w4MGUlJRgJ/ZWfIEQVTU19BVVUBoEp6eNTe/9pKWlMXjw4O5uhkQi6WLa\nKuivA98C7g//3/pKU2GcTifDhw9v8bh9ZXV85cElbM78Pq7xF8CX/9rWW0okEkmvpEWXixDiWeAT\nYKwQokQI8W10IT9PCLEDODf8ulPJcDsI4OBAn9Ng+9sQDHT2LSUSiSSlaNFC1zTtGwl2ndPBbWkW\nr1tv6o7cMxh5+C04sAIKZ3dlEyQSiaRHkxqp/4DboeBQBFszTgbVBdve6u4mSSQSSY8iZQRdCIHL\noVCrpcPwM2DrmyDD8yQSicQkZQQdwKkq+IMhGDsPKvbAsS3d3SSJRCLpMaScoPuCmi7oANve7N4G\nSSQSSQ8i
pQTdpQrdQs8aAIOmw1bpR5dIJBKDlBJ0p0MhYCwsOnYeHFoD1YebP0kikUhOEFJL0FUF\nfzA8EVo0X/9fRrtIJBIJkIKC7jMs9L5FkDtcCrpEIpGESSlBN33oAELoVvqeD6FJLssmkUgkKSXo\nZtiiwdh5EPTBziXd1yiJRCLpIaSeoAcsyURDToH0PBntIpFIJKSaoDssPnQA1QFjzocdiyDo776G\nSSQSSQ8gpQQ9yoduUDQPGqtg3/LuaZREIpH0EFJK0ON86AAjzwZHmox2kUgkJzwpKOgxBblcGTDi\nLN2PLot1SSSSE5iUE3RfIBS/Y+w8qNoPRzZ0faMkEomkh5BSgu5yCAIhO0GfCwjpdpFIJCc0KSXo\nti4XAG8/GHKyXiNdIpFITlBST9DtXC6gu12OfA6VB7q2URKJRNJDSClBdzkUGgNB+51msa6FXdcg\niUQi6UGklKB73Q78QY0mO1HvMxryR8tFLyQSyQlLSgl6ZpoDgJrGgP0BRfNg70fQUNmFrZJIJJKe\nQe8S9LHzIRSQxbokEskJSWoJutsJQE1jgrotg2dARl8Z7SKRSE5IUkvQW7LQFVUv1rVzCQR8Xdgy\niUQi6X5STNBbsNBBj3Zpqoa9y7qoVRKJRNIzSDFB1y306kQWOuh1XZwemTUqkUhOOFJS0GubE3Rn\nul6BURbrkkgkJxgpJehedws+dIOx86DmEBxa2wWtkkgkkp5BSgm6Q1XwuNTmfeigT4wKRbpdJBLJ\nCUVKCTroVnqLFnpGPgydJdcalUgkJxQpJ+iZaQ5qmpJYP3TsPDi2CSr2dnqbJBKJpCeQgoLubNlC\nB70MAEgrXSKRnDCkoKAn4XIByBsBfcdJP7pEIjlhSElBr21KQtBBt9L3LYf68s5tlEQikfQAUk7Q\n3Q7VvnyuHWPngxaEHYs7t1ESiUTSA2iXoAshbhVCbBJCbBRCPCuESOuohiXClWihaDsGTgVvf1ms\nSyKRnBC0WdCFEIOAm4EZmqZNBFTg6x3VsES4HK0QdEXRF5De+S74Gzu3YRKJRNLNtNfl4gDShRAO\nwAMcan+TmqdVgg56sS5/Hez5sPMaJZFIJD2ANgu6pmkHgd8B+4HDQJWmaXHOaiHEDUKIVUKIVceP\nH297S8O4HAq+YCsEffgZ4PLKpekkEkmvpz0ul1zgImA4MBDIEEJcEXucpmkLNE2boWnajL59+7a9\npWHcDgV/UCMUSrLwlsMNo87RF48OtaIjkEgkkhSjPS6Xc4E9mqYd1zTND7wMnNoxzUqMy6E3uVVW\netEFUHsUDq7upFZJJBJJ99MeQd8PzBRCeIQQAjgH2NIxzUqMS9Wb3NQaP/ro80Co0u0ikUh6Ne3x\noa8AXgLWABvC11rQQe1KiNuw0AMhqhv9yble0nOhcLYsAyCRSHo17Ypy0TTtLk3TijRNm6hp2pWa\npjV1VMMS4XaoAFTU+5j8q8Xc//bW5E4cOx9Kt0HZrk5snUQikXQfKZcpavjQS2v1vuP1dUlGSprF\nuqTbRSKR9E5SVtD9Qd3VEkp2mbmcoVAwSRbrkkgkvZbUE/TwpKg/PCnaqlVDi+bBgRVQV9rxDZNI\nJJJuJuUE3e3Um7z5cDXQynWgx84DLQTb3+6ElkkkEkn3knKC7lD0Jj/0zvbwllYo+oApkDVYRrtI\nJJJeScoJ+uTB2QCkhS31VlnoQujFuna9B776TmidRCKRdB8pJ+gZbgcnD88jJ90FtNKHDrofPdAA\nuz/o6KZJJBJJt5Jygg7gVIW5yIXWKhMdGHYauLNk1qhEIul1pKSgOxTFTP1PtkZX5GQXjP4CbHsb\nQkmufCSRSCQpQEoKum6hh8MWW2uhg+52qS+FAys7uGUSiUTSfaSkoDsUhWDYNG+DnMOo80BxSreL\nRCLpVaSmoKsi8qItip6WBcNP18MX22LhSyQSSQ8kJQXdqUaa3WY5HjsPyn
dB6faWj5VIJJIUICUF\n3aFELPQ2+dBBF3SQxbokEkmvITUFvSMs9OxBMKBYFuuSSCS9hpQUdKfFh550tUU7iuZDySqoOdoB\nrZJIJJLuJSUF3ajnAu2c0xw7D9Bg+8J2t0kikUi6m5QUdKuF3q4YlYIJep10WaxLIpH0AlJS0Nsd\ntmgghL403e4PoKm2vc2SSCSSbiU1Bd3qcmmfja5njQab9AqMEolEksKkpKBHuVzamxc09FRIy5HR\nLhKJJOVJSUHvkLBFA9UBY87XVzEKBtp7NYlEIuk2UlPQOyKxyErRPGiogP2ftP9aEolE0k2kpKC7\nHJFmt7p8rh0jzwHVLd0uEokkpUlJQVeEaPmg1uD2wogz9TIAsliXRCJJUVJS0Hcfr+v4i46dB5X7\n4Njmjr+2RCKRdAEpKegnFeZ2/EXHztX/l0lGEokkRUlJQZ87aUDU60Aw1P6LZvaHQTPkohcSiSRl\nSUlBj6W6sYPCDYvmwaG1UH2oY64nkUgkXUivEPTKel/HXGjsfP1/Ge0ikUhSkF4h6C+sKumYC/Ud\nC3kjpB9dIpGkJL1C0B9buqtjLiSEHu2y50NorO6Ya0okEkkX0SsEvV+mu+MuVjQfQn7YuaTjrimR\nSCRdQMoL+swReQzISe+4Cw45BTz50o8ukUhSjpQVdMMqdzlUQh2S/x9GUWHMXNi+GIL+jruuRCKR\ndDLtEnQhRI4Q4iUhxFYhxBYhxKyOalhL/Pe7p/LApZNxqQqBjhR00MMXm6pg70cde12JRCLpRNpr\nof8JeFvTtCJgCrCl/U1KjiF5Hr46YwgORXSshQ4wYg440qXbRSKRpBRtFnQhRDZwBvAkgKZpPk3T\nKjuqYcmiqoJAqAMyRa24PDByjh6+KIt1SSSSFKE9Fvpw4DjwdyHEWiHEE0KIjA5qV9KoQhDsaAsd\n9PDF6hI48nnHX1sikUg6gfYIugOYBvxV07SpQB1wR+xBQogbhBCrhBCrjh8/3o7bJWiEIgh2hhU9\n5nxAyCQjiUSSMrRH0EuAEk3TVoRfv4Qu8FFomrZA07QZmqbN6Nu3bztuZ4+qCILBThB0b189hFEW\n65JIJClCmwVd07QjwAEhxNjwpnOALi8mriqi46NcDIrmwZENULm/c64vkUgkHUh7o1x+ADwthPgc\nKAbua3+TWoeqCEKdNXFpFuta2DnXl0gkkg6kXYKuadq6sDtlsqZpF2uaVtFRDUsWR2da6H1GQZ8x\n+tJ0EolE0sNJ2UxRA6WzfOgGY+fBvo+hocsjMiUSiaRVpLygd1qUi0HRfAgFYMc7nXcPiUQi6QBS\nXtBVpf2p/8dqGimvS7BIxqAZkNGPmvWvUXjHm6zZ3+VeJYlEIkmKXiDotDux6OT/e5dp9yawwBUF\nxs7Fvfc9XPh5Y/3hdt1LIpFIOoteIOgKwZCG1sluF1ewjlnK5o4vMyCRSCQdRMoLukMRAHRWoAsA\nw8/Er6ZznrIKf2dOwEokEkk7SHlBV8OC3qmWszONI31nc666hlAw0Hn3kUgkknbQawQ9FIKbn13L\n5Y9/anvc/rJ6bn52LfW+tgny4f5n019U0L9ua5vbKpFIJJ1Jygu6w2Khv77+EMt3lQHgD4ai/Opf\nW/AJr68/xJbDNQmvdeerG9ly2H5x6NKBZxHQFMZVy0UvJBJJzyTlBV0RuqDf91ZkbY3S2iZG/3wh\n/1y+19x2uKoRAF8gsWvm35/u4+q/r7Tdp6XlsUoby6Tajzug1RKJRNLxpLygO1Rd0J9decDctr+8\nHoBX1h2KO74ll8vR6ibqmuKPURXBO8HpDPLtgfI97WmypAto9Af5dHdZdzdDIulSUl7QDQvdSiAc\nieJQBOsOVHLP/zZjHFbvC7Z4zQcXbbPZqrE4NF3/Uy5N1+N5cNE2vr7gUzYfsnehSSS9kZQX9EAw\n3oXiD29TFcFX//YJT328x1xJriEJQa
9u8MdtC2lwQCvggKNQLnqRAhyt1l1sieZEJJLeSMoLer0/\nXqAr63VBdqoCNcaCf339oSh/ux12keZGNupnaTNh/3KoL29bgyVdwpA8DwD7wu43ieREIOUF3c7i\nPl6jW2eqophRMAYf7SxlwYe7m72mXdapUXN9hWsWaCHYvqitTe6xNPqDbQ7r7Gk4Vf2rXW8zHyKR\n9FZSXtDtfOKltXqhLaciUNV4H3tbMAR9qzIKMgf2yqXp5vzuA8b/snd0VE3hkVsgpPHP5XubjW6y\ncqiygVfWlnRm0ySSTsPR3Q1oL3aCfrymCdAjYGIt9GSwd7no/wdCGoydC+ufA38DONNbff2eihHa\n2RtoDAv6858doMEfpLLezw/PHd3ieZc//il7y+qZO3EAaU61s5spkXQoKW+hn1SYG7ft+VV6CKND\nUcxM0lhaW8zLsND9wZC+1qi/DnYvbWVrJV1FU9gibwgLe3ldU1LnHQlPpra3gqdE0h2kvKB/eeog\nHv7GVNt9GhoOxf4ttraGeih8fFWDHwpPB1dmr3S7QO8Qs8aYyfJkP29BOPNYFmGTpCApL+hCCAZm\np5mvf3fZFPNvXyBkJh4BWI315n6wdsa7sSpSRb0fHG4YfS5se1svItPLsAvbtEPTOrlscTto9Ed/\nLq0VaH8v/FwlvZ+UF3QAlyPyNpwWAW8KhKJ86F+ZNtj822cTv27w+vpDvLwmemLMMPDMybWx86Hu\nGBxc1Z6m90gqkxT04T99i5+9srGTW9M2GgNttNDDXxdpoUtSkV4h6EaIGhDlYmkKhKL2zSjM5YLJ\nAwD7hCQrt72wXnevhAlZBKHeF4DR54HigK29w+1iJOJA8v5mgGdX7u+M5rSbplgLPUmL2+j+/S18\nPySSnkivEPTmLHS3ZZ/LoXDqyD5AxGJrzmVgjcm2+pXLan2QngPDZsO2tyipqOf8P37IzmOJKzn2\nVKrq/VTV+7nk0eXmth1Ha1s8r6tdLcdrmthQUpX08W230I3qndJCl6QevUPQVaugR/5ef6CS7RZx\nqm0MmD51w3XSnC5Zk5ZCmtVCD28vmg+l29m2aS1bj9Twk/9uaNf76A5O+c0S5vz+Aw5WNpjbNiVR\n/6Sr9W7un5bxpUeSL10c70NvncXd2uMlkp5ArxB0qxXuiEkkavAHOXdcAbedN4YLiweZFvzd/9tM\noz9oTnYa/OOak7j3oglAdIy7VdCbDOtv7FwAhh7/AIDV+yo65g21g0AwxCWPfsyH24/H7dtfVs8b\nn0cqUJZU1NPoD1Fe54s6zup+SURXuyRKa5N3A0F8+5KJ3Fm1N1LOIdWXGgyGNKobk5sLkfQeeoWg\nJ/KhG2SlO7j5nNFkpzvN/Uu2HOU/n+6L+6FPHZpLYZ8MIBLDDNEWqWn95QyF/pPoU7LE3NfZlt35\nf/yQL/7hw4T7y+t9rNlfya3Pr4vbd8Gfl/H9Z9aaryvqIj/4AeFIoYIsd7MTxgY9PbQx9nNoyYWy\neNMRLn3sE2rDpQJSfTHwO1/byORfLe5xI41dx2spvONN1uzvfuOnN9IrBN0V5SePTyRyO+xdMvW+\nYJTlDbr7xuPSMwQve+wTc7tVwJqs/tmx88kpW0s+un/3UGXEuv3zuzu44okVSb+Pel+gxS/61iM1\nbDua2FdvxFHbiXJ1Y1iswvusotXoDzJ5cDZDcj1xafKfl1Tyv/XRteW7y8ecbEcS276Wolb2xxTx\nSnUL/fnP9OS6njYXYIwcX1t7sJtb0jvpVYI+ZXC2rYXudkRSuK2TpoFgKM4X7FAF6c5IRQRj8s8a\n5fL/XljPpkPhCbqi+Qg0zlZ1y/d4bUTQf//Odj7aWdpi+49UNbLpUBW3PLeOSx5dHucCaQ2GSDfn\nEjGyKK0/9tqmAG6HgsuhmPsNLnzkY37w7FpCIY3CO97kNwu3dJvll2xNllgBt7O4//7xHh5arNe+\nj6
2r39Ms29ZidHwdMZLaeayGrz72ie3CL63FeMo9q5tpO1c9tTJqZbTuplcIulNVeO6Gmfzz2pPj\nfOgA/bLc5t8Oi4X+0uoSfvisLsQ3njGCv105HafFQoeI28X6uzhW08T8hz9i+c5S6D+JuvSBfEFZ\nDUTqyLSGG/+9ivkPf8SyHbr4x2Y5tgZDyJqzMI3rW0XfH9RwO1TcDiWhaBrJNk8s29NtLpekBT0J\nC/3u/23m4fd2EgxpxFaI6GmWbVvpiPdx/8KtrNxbbq7X2x4UY1H3HpqQ1hqWbD7Kh9uPc9frm7q7\nKSa9QtABZo7IJ8fjinKpGPTPimSSWi30Q1WNvLv1mH5MdhpfnNAfIErQjVj02MlTgMufWAFCcKjg\nLE5TNpBGU1KC/p9P97H1SDVLtx+n8I432RxehMHoPGwWYUoaQ6SDIY2PdtiPDgwLPFaUDQs9kWga\n20Oa1qxQaJrGFU+sYMnmo61ufywHKxsoqYi4Q5Lx7wMEQ8370KvqI/MHO47VmOGKBr0lDj3UIR1T\nx4mw8Zx7gZ5z3b/ikwrrmgKc99BS1h2o7IYW9SJBN7Crrhgt6PZv2VrEK91G0JuLuy7pN4d04eMM\n5XOOVscLeuy5v3h1I+f/cRkvhP2csdZ0e36DVuG64slo/73xFg0LPdZqTXOquBxq9ByBBaOdmhZ9\nbuyIwhcM8dHOUtsvfGuZff97nPbb96OunQwtuVzK6yNurdIaX1wn2lsyRTvCQjeeTUfkHhjfwVQf\nACV6FusPVLLjWC33L2x+EZ3OotcJup1gF1hqvSQqp2u10DyuiA/dsOSaczEcyplGleZhvmutaW1b\nWWoJIbR+EXI8TtvrBROIibUNiSyv5ixLo9MyLPTYY90OpVmXi1XorQK5PWaSNlm3SFswrv3Z3nLe\n/PxwwuNihexgRUP0fst7r/cF4iz0VI9yMegI15hiCnq7L2VO2vfUGkDJUt1gP59gvCtBO4bZ7eDE\nEPRkLHTLD1pVBI9fNQNo3uViEMDB+6FizlbWsmG/7md89IOd5v6r//4ZwZDGQ4u3RblkEgm6tTDU\n+1uP8cwKPb3eGkYZmwlpntuMZWkIumFRx7lcnPaTogZ1TVZBj5y77UjbBd0fDPGr1zdxuKqh5YPD\n1/7dom1c9tgnfO+ZNSzfZe9WihXkino/Ryz13q3tb/AH435+3RHlomkaL6w6kNS6t8nS3Pc2WYwJ\n4454IhELPbUF3ZqIZ8V4W+1xm7aHXifodpOiXnfE4k7scol+PW1oDqoi+CBsXWsapDkTl+J9JziD\nzFAVI5oGVn3cAAAgAElEQVQ2A/qEq5Xlu0p5+L2d/PTlSDapdSQQdT2LmFzzj8/42Sv6OdZSBHYL\ne+jnNmOhixgLPc6H3vykqDXKwdrGmsZoayVRh2DHsh3H+cfyvdz9+uakjvcHQzzyfqSzvPxx+7BQ\nO1eDGZlEdPvrmoLxLpdusNA3HKzi9pc+5/+9GJ9DYPDoBzv1yfgkSTTaaw2iA0VYaaUPfUNJFff8\nb3OPs+gb/IksdL2dsVFTXUWvE3Rngvrn5v4ES9LFDrnzvW6umjWM51bu50B5PcGQFrfgNOhWVTAU\nYmloMkHh4ByxilBIY1z/rKjjjGJRNRZRjBVCg0RiYrXcEllxzVmWSoyFHiv+poWeoFOwCnp0XH70\n8a2x0I1nYNcR25FMZxEMabaCsdUykrA+Y7t1VLvDQjdE4K0NRxIK2ANvb9Mn45OkIzom47fRgfOr\nSV/ror98xFMf74kr5dDdJJpjCUkLvWNx2iQWWelv8adbsRPr88YVENL04VUwpJmCaMUXDBEIadTi\n4UD2DM5TVlHvC8StlGQMfa0/1LIE6ex2XxZN06JcHgkt9GZ+wI4YH3qsFetWFdyqbqHbCUqtRdCt\nbqHYSdTmJi41TeNfn+zlUGUDoZDGD5/TrVGv28EvX9tI4R3NV69s
SiKk0+4ZpDmVqGFylMvFF4xz\nP3XHpKi1TTuOtVwgLRk6xoceFvQ2Xsuubn6yFrdxy+4YMdU2BTiWoAxGR7iyOoN2C7oQQhVCrBVC\nvNERDWov1sSiN28+jSW3nRG1P5Gbw26pOmc4Yckf1AXO7pj6pqA5rN3X9yyGK0e5+O6neD0ms9L4\nYVl/E4nqk9h9eRv8wahhXlldE2W1TWiaxpMf7THdCclMiv7ohfUs31UaJ1r5Xjfu8DqadqJcl6D6\nZKz1FFu61sq+snp++dombn1+XVStkQy3g399si/heQaJOjIrdiKW4XLwya4yM7IoyuXiC8ZZ5O0J\nW/zjku0tdkx2WNtdlWRN+hav2RGhhuH/k40wiuXiR5cz5e7FQKRTaG2rumPE9JVHl3Pyfe/a7kvU\nURodVeyIv6voCAv9h0D3xOjYYHWpTBiYzah+mXHHXH1qId84eUjUNjvr2/C3+4Mhgppm6xer9wdN\ny+pwwRwAM8nISkTQLRZ6goxQO+uwpjEQJWYr95Qz/ddLePKjPdz7xmbmP/xRuK3R51q/eEb7a5oC\nfP+ZtXEdx9B8j1m50s61UWsZIVgFL95Cj7yubvTzajjN+2h1I9f84zNAH5JWWmLBMyyhos1ZgnvL\n6uK2xVp7dj/+dJfKntI6bv/v50B0p9ngC8S5n5IR9JdWl8RZcBsPVvHHJTtaPNcO63ejPROj1o6y\nI0Yaxte+rdFL6w9URspOJFG22o72dLB/fncHq/eVt3jcE8t2mx0PYJbYsGtrYkHX/+8mj0v7BF0I\nMRiYDzzRMc1pP8n0jL+6cAK/uWQyV84cZm6zi2Z0mqV2NZr8IdtjXl93iGBIw6EI/N4BrA+N4Dw1\nXtAjSTmRbaUJkpDsJvRqGv1Rgm5UdnxzQyR0LxjS4n7Axn1LKuqjBGNIbnrcscPyPLjDE792VnZz\nPvSXVpew8WCV+drgpy9v4Jbn17HlcDU/f2UDe0p1QR6U46HCEguuWkZWVnfO4Nz0qDb8+s1426E2\nJiXd7sdmTRaD6Gdc7wvGPfOWfPXldT5+9OJ6rv77Z1HbL/hzpMTvna9uNEsLJIP182iwcS0l6/Iw\nPgfoWJdLR4Sj2o1Uk6E99/79O9v5yl8/afG4X7+5haoGf5yA24UoJpogNjqeVPWh/xG4HehRMxbf\nmjWMp687pcXj7r14ovm3nQ/dKOp1rKaRF1eXUFobb1F/XlJJIKS7Y1yqYHFwBlOVnfxAfZnL1XeZ\nq6xglrIJV/kWCijHEYqI+OEE/jnjS2EtY1vTGIiy2oy2WIfmD7y9Nc7qNiZAT/vt+5TW+jh3XD9O\nH90HIUTcsQNz0unr1cskHKiILlYFMVEuUS6XID96cb0pZlYxNEIFqxv8+CyC1TfTHWWhW9tiFbZk\nBCnWPWEX6ZPujBH0YLSgx1qAdhOlVox2HWom3PLfn+7j4fd2JtwfS0sW+r8+2ZvUdezCM9/ZfDSp\nssh2mBZ6O7Nnj1Y38sE2PTO7tREzbU2QakuHVlrri/pOldqs4JVo5GOMDrvLQrd3KCeBEOIC4Jim\naauFEGc1c9wNwA0AQ4cObevtWsXdF01s+aAwqiISTngaLpdjNtmfBrVNAYIhfe1Sh6LwRmgm39be\n4v85X4o+8GP4UhpQBg1uFxV4qdQyqdC84b+9VJBJpeal764DhLRCvvPPrRQKLxVaJpV1TaYPO8Ol\nUtukf+GsCzp/sO04owuiXUyHqxr5zn8iIwaHooBDUBrwxbkm0pwq04flArB6bwXThubGvVeDf3y8\n1/z75TXRlfOs1pQxyqmoj7Z8NDQqGyIdpLUt1h9LMj/k2ExV45w0p0JjeGTltgj6Ux/t4Z439DDJ\nDJdKZYMv7gdqTEA3+oMcqmygf3Za1PyL0QFYz2vJjaBpGk0Bvf78gg9386Mvjo0KqY2NjY/lV/9L\nLrTT2qGGNI1gSOP6f61ieJ8M
3v/RWUldw4qRJONvp4U+90/LzMJzyei59Xk253KpawqQ4baXskRZ\nz81x0v8tiXpdVutjZN/oYxJ1SIZh0l0+9DYLOjAbuFAIMQ9IA7KEEP/RNO0K60Gapi0AFgDMmDGj\nx00Nq0IQxN4/bgh6cxNLVQ1+00J3OhT2af2Z1rQANz7OGuJkb8kBckUtV07O5KMN2xmb6cdXW0Yu\nNeSIWvootfTX9pOj1JJDLarQYCWwEl6J1BQj9LyCz5nFTFc6jY4syuq9HHNmUNPkpVT1UokXt68P\neUePMk4cD3cUmTz87g5W7In4Dx2qQFUFTYH4yA6Afllp5GW4bH3Vy3dGijMttVlAwyBa0PVn+J3/\nrCYrzSJeQS3KQo8qFGax1pOxsBr9IR5buosrZg7D63aY57hUXdBVRUSVUP7zexEfd99MNxV1/qh7\nQqTzOv2B981ksL33z49rr3VkcfFfPm62nY9+sIsHF22jj9dNaW0TXxhfwKmj+pj7rS6V9vjQrZFA\ngaBmiprh7mqJJZuPsmJPGVfPHs6gnHRTWNtroVuriGpJTItaO6ZELpfV+8r5yl8/4e9Xn8Scon5x\n++3CHf+2dBfnT+zPsPyMZJrNsZr4kU0iQ8MwTAz78OZn17JqbznLf3pOUvdqL20WdE3Tfgr8FCBs\nof8oVsxTAVUREIxPLIKIGDX346pq8Os+dFXBabHym3AxbWIRQW9/Vu8r58WGHD4IDmGEM4PdgTr6\neF2U1vooyHKb9V8EITKp548XDmWQq4HfvLycXGrJFbWcOUQlV9Sxr+QAwxxN5DWWMULZRy41ZKjh\nEUQDsBLOtnQEDTtd/NIdGQ1kHO5LnSOLnQ0uhh0YzCVKI9VkUEcaHBwA7izGeuqorqrUzShLR9dc\nHXYr1h+itdRCdaM1MSlEbfh1ptsRJehRFrple7pTtbVcX1t3kMeX7WFfWT2/unC8eS23U4XGAIqI\nFnRrO/pmujlY0WBjoesTpYmKrRkC0+gPcd9bW/jZvHGsb2HN0xdW6RE2RnRTrJ++JQs9WazXDYY0\ncz4kGaNR0zSzBs+SLcd4/0dnmQlobfFjJ/L7JxOFaP3dJbLQPwlXgFyxpzyBoEc/x/I6H79ZuJWn\nV+znw9vntNwI4J7/beb7z6xl491fNEdUVkNjRJ9IxxBpp/6wjWi33y3axo++ODap+7WH9ljovQKH\nKsBvn9llRHw0VwfaaqE7YnqF7HQng3PT+XS3xgfbjkddyygAluFyAPoPXEOhGi/V6UNpUBU+CEX8\ns/vS+jG6IJOnDuzh7CH9eHvTEXOfGx/Z1DExN8Cl4zy8/ukmckUNudSSI2rN0UCuqGVg02489dXM\nDFaj7g1xlsvS4Mf/D4BnAWqAuwW4vHzqdlCrpVNLGnVaOnWkUUs6tcbf4f/rSINNfvIONzJNHKaW\ndLZvO042+v6A5evmD2nU+gK4HArpLjWmlK+9hd4vy82+Mt23n+l2mElaZeH5hGdX7uf1dQd59oaZ\nQGQiVBd01faafbxuNhysMjuOr580hHUHKqltCsQJ7s5jNWbUlHXfgg9387N542iJ2CzlWNG2tuvB\nRdt4YdUBlv44OdGxEiXommZa1sk4Ae6wrItr9MXGs2lNBrBBXYK5iGTCKWfdHwkZTGQRG4ECjy3d\nxW3njYla7AbiBd1wlewvr+eR93bw/bNHt9iOY+FOvbzWFyfow/I9ZKbrJTye+mgPHyfI4n3k/Z2p\nI+iapn0AfNAR1+pqjEzFrPT4uipGklJ9Amspx+OkusGPP6D70GOzULPSnTgUEfWlrg/7Zj3hRTQy\nw26IkX0z2HVcHxIHQhrlddHDvNqmAPW+AB6XSlZ69MfWhItjuHivAt5bDnBywvf7jeKhuB0Kr6zZ\nz1XT8ln02WZcgRq8opHnvjURmmp4cfkWjpWW8r1T+1NbU8nSz7aTIRrw0kiGaCCPGjJowKs0kE
Ej\nbmH50b64gLOJHiUYNGpOakmnTkvDuS0Ln+rhNIeDpqAH7/4cRjqC1JGOZ9UW/Nm5qOlZnK5to0Kk\nUUsa/UQf6ghRgZdRBTms3a+XKLU+3zpfkM/26hFAQ/M87Curp8EfjLLQrfTNdNPoD1HTFKCP18X9\nX5nMNxZ8yrIdpXELjZz70Icsu30OGW4Hv317a8JnnAiXmpzYGBidF7RuwQ2r3zgYClksdMHOY7U8\n+v5OHrh0cpwBAvB8eBQBMCTPA0Q62JYmiu2wJsNZsbO4j1U3cvNza3nk8mn08bqj3CWJ/PfWyK+F\nGw9zUfGgqP3Wa+gJTpF9v1u8PSlBN7DafFa3nlGq2ZiXAb0z7I5yBSe8hT6mwMv2o7VMGZwTty/W\n5ZLjcZp+379dOZ09pXXcv3ArJRUNug/d8gPJz3Axa0Q+60sqo75ENTEW+oDsdK47fQSnDM8zkxju\nem0jF0weGNWWRn+Qel8Qj1MlM82+qFcyOFXd/dAYgFqRwWFlACMGj2bK8DwYq1uZuw5u5ckDuxnR\ndyrfXbwGOKv5axLQBV408tEPZ/Dqim28smIbGTSQIRrx0oDX8neGaGS0A9yhevqKKjzaEbJrmpio\n1uEVjbD8ZfPaj6mAYVzXos/WALXHvRxzeSknC3EwnzkONxVkUqZl4diwnjlKiNMyxrJX1FCuZeFO\nkEFsRPUcr2kyk9I+2a0P4//24a644yvqffxm4RZzMZLW4HQ0b6HbWaGh8IR9rFGxaNMRs35/LNaQ\nU6sPXQDX/uMz9pfXc9OcUYzq5222vb5AiAZL0lWZTZRXSyTqBOwSxJ76eC+f7i7n+c8O8L05o6Lb\nkqBDs7pl7KJ4rEXs/EGt1VEvTlXY5jUYna/LoWDXNCFo18pjbeWEF/TnbphFgz9omwVq+H/fCy+C\n8ejl08w6GqeOzDdL636yu4zCfI8p6H28blb94lwgcW0Zwx3gcih8aUq0eNf5glGWklMV1PuCNPiC\npLtU06pvCw4lXCI3GCIQ1HCogte+NzvqmKL+mfiDGt99ek3c+TeeMYK/fbjbfH3vRRO487VNVJJJ\npZYJ/SdywOtmaci+xIJJaeReIU1jZF8vCzceQRDinvMLeeTtdRSk+RG+Wq6YmscXRnmpqijn8XfX\nk0cN43L8+KuPkUsNg/xHOUutII9qXCIIR+FbLmArfDs8UghsdHGzW48YKtMyTfGfcWQkR9UG0sv6\nkYMXjg5giKuGQz4P//l0v23TrYtrWzF+/D85vyjOgj9a3cj6mEUPYifs7PzNDf4gGW5HnNvvF69u\nTCzosT70QMSHbqydmqimkZXlu8oY98u3mT0qH4DjCTKbmyPRXEC1TSasYdFuP1qTVLIYRI/O7OY7\nrKOgQCjU6hICuR6X6XKJmucJRQS9vinITptSDYer2hYm2h5OeEHPy3Al3BcbemQdojpVJcpNo/vQ\n9eOtE4F2HQVEC3pLZKe7qPcFwy4XR1SoW2txqgK3U0XT9DhpOyYOyk54/qTBkX3Lbp/DgOw07nwt\nsgRXTaM/ztc6bkAWW2zqxIOe8t/oj/wgNBTufHs/kMfAfrpb5ezcMWRPHc2xozX8e3EBALNz8/m4\nTLekx+TpoyzQ8NJAcX6Q2vKj3H1eAf9+dw251HD2UIX9B0rIE9XkiRoGc5w8pYbsHYuY5AQM78Zf\n72SZAqRBpZZBuZZJOVlmR9B/5TK+WO1noOKgjEx9O1k01VUyJCedCYNzKMz3xL1PY9FmK/e+sZmv\nnTTE/DwNkXjoq1O47YX1gG7J2gm6Xd6EgS/Gh25Y6FZRbE0qvdHxxArmZ3vLKav1MbrAy8i+9tZ+\noqJatoIe/v+1dYf44TnRrpBEk6JWCz22fRV1PkrKI/NQ/mB84p1BovLNeRkRQbeOoIzO16kqBLUA\n5z60NOo8gTAFffLg7LhFyDuLE17QW4NTFcwelc/HO8twKC
KqnnlIi/hIrZUDreI+f/IAc1GG9HBM\ns1Xvn7p6Btf+I36Vn+x0B5X1eqZoukuN8wf/8JzR/Ond5NLN05xqnC83lpF9M+ib6ba1eMZY4tzz\nva44P+z3n1nL2P6RY9KdKs9efwrvbD7Kj1/6PO56GeEww02H4gXf8JEbz9M6sZmTHumII35aQS0e\nNje6KNcyqR82k5eCelvSho/iz3vik3xe++7JXPfXxeSJaqbkBXlg7kCoL+Oh1z41xT+PagaJUiYp\nu+mz8WOuDvkh1g548BYW4qRhVw7K0T7826mao4D6JRs4+H4p5yhZlGnZlJJNqZZFI26++finvPq9\n2QghTJGwukIMC7M2xhedwE4AYn3omm3Wr2Gp+gIhTvq/Jdxz0QQujBkpGhjiW1rbZLqAjlY3ctlj\nn5jtXXLbmYC+8PbEQdmcVJgX1f5Y7GrVWEU7duWvRIJudakci/m+nnLfu1GumkAw3kLfeLCKDQer\nEsaV53sjH7Sdhe52KLYjq5CmmZ3EqH7eqEqfnYkU9FbgVBUWXDmD/eX1OFSFbIuF3ugPmsJjtcpV\ni7h/afJAU9A9zkgEhsGsEZGYZCvZ6U4OVTZS7wvSx+uKsupvO28MN58zmu+fPYrRP1/Y4nvwuFQz\nvR/gipnxyV5CCK6cOYyH3tlubrv61EJuPHNElP/ertDZ0u3Ho2b605wKOR6XbUgZ6Ik9jS3EXBvP\n01qP/vJThtI/O40nP9pDjaV2SabbYfourW6FtJhMUYO+2V6Ok8NxLYe09GyYeBoAj7wy0ExPv/OC\n8dwbnvC698Lx/Pb1VeSKGvKp4aopXpZv2MbPzuzHG59uYFymn+GeBjzlBxjMcfKVGjwfLeK3NtMe\ntVoapceyOfC7vmypTqNo4BBudcDQnXuYpxylVMsmcHQgftcwth+OHtIbiXB1TQF++domfnD2KArD\n4XNNgZCZVKX70G0EPWypVtbrWZH3vrGFeZMGAHDVrGHsK6s3cw2MzFN/UKOqwU9uhiuq9LP1+d8d\nTn7ae/98bnthXcKQ3zpfEF8gxDMr9jGmfyZDcj1ReQlldbGCHhHNYzWN9PW6EUJE+cRjDZBYv3sg\nFL8WrpHdHDsiMMj1RATdLoPZpSq2cx+BkMbR6kZURTA0z4MvECIQDNlORHckUtBbgVNVyHA7GDdA\nr3UeK+jGMNgq6FYL3WpZG5OiVrdOIvdMjsdFgz9InS/AUJcnStDnTuxvtu2SaYPisjaH98mISihJ\nd6lRP8bzxtv7YWPbMn1YLgOy022PjcX6BTfmFaw/DCtet6PFyoLGM7Ra6G6Hwk1njeTJj/ZExZX3\ny3JTc1x/3S8z4sdP5FqzWmBW0V/9i/OYeu87AORYPuc7X98MeKjVPByggAsHj+eldf25esJp/OHT\nFXypcCBfGN8/aj3Xu+eO5LGFK+kjqrh4tJOtu3bRh2ryRRV9RBX51dUMFUcZVLqT76tVqB+8wqNG\ns56/F4CLNZXT3dmUaVmUatk0+PNg8fu8s8NP6CCsZhyFp08Fbz98Ph8ZLgeNfl/Yhx4vqoa1acSY\nq0pEpPpnp3HIUmq4pilgjtiO1zbx6e4yc37FiEZZs78iqlBZeZ0v6rvoUESc8C3ceDgq+3Xa0Ehg\nQkVY3H998UR+8epG/vPpPi6dPpj9ZfWc8eD7/HRuETeeOTKqJkxLPn5/eN7IDjuXGOjBDQaBUIja\npgAOJdKROFXF1p0SCOnlrjNcqulSq/cHyZKC3nOIXYTBKuhuh2rG1kb70BXLMYqZSGQcY3WFJpqo\nMu5TUecj3aVGRdOoCToMgHOK+vHwN6Yy4a5F5rY0p0p+RiSmMNsmXBOIyuq878uToiZuh/fJSFjL\nPRajrYk6q76Z7rihcixGp2cdWaiKsJ1/KMhKY9fxOoSIrn2fSNDdDn2SuaYxECXouRku3OHl+BIt\nFQiR59fg1y1Op6rEfU
/uWrgLyOewls/MguG8uH2IzZXg/NH9WbzpEKtvm8a+/ft44OVl3HlWX15c\nukYXfqr1/0UVRcFDaCuWc3GwiYtdwObwP+BJoFzzUurKpuz1bEIZfbnLkUZp2N1TpmXx3CtH6Pel\n2TS69QnPo9VNpjVtlLGwMmFgFh9sO87xmiYeWBQpODYs38PqfRVc8ujyqOPX7q+Iep2Z5jBF2iB2\ngZfPLYlZxpyB8XzXhSeUDTfGki1HufHMkVGTp5X1fpoCwaiO38pr6w5x6sh8231HEtS5ybP8VvxB\njYl3LWJQTjrfDI9sE82BBcPi73U7TOOtwRckqx0RaskgBb0VxEaseFwqUwZns76kigy3aoYnWt0S\nVnF3ORTuuWgitz6/jj6Z+hfFqnOJ6j+Ygl7vx+OK9oE7ojqM6C9yXoYrrsZFulNl3qSIVZ5I0L92\n0lBzsjPDHX3dRbeckVTqNrS8EtGofl7bCAErZuanI3pS2u7H1C/8XPtluqP25zcz+V2QlUZNY23c\nEoMORdBE4rVfIfL86sPhfU5VaTaCJCdmpJLrcZpCpyqCEAqKtx+iIIPloQquXuXmaLAg7jqF+R7e\nuvk0Tr7rVfqIKq6Y6OG6qV427tjFOys3kC908e8jquhXv51JahVZwmJJVgD/0v/c5HZTqmVT80g+\nL7qCDFnj5aymEFc6/QRR0BAMrvJyhbMBnnPxa6FQ6QyioTCwIYN9zkZCKIQ0QRCFEIJhn7zKPY4K\n83yP4qLKESSEYh4zcftSblbLzGOCKJw9sT+LtxxnzJ5VXKWWMXrfNi5XSwghCKwuo6CikS8ruxlS\n54WNhzmpfie5SgMBVIKoNG1zsXB7GQPyvEwVO8Pb9Xu+vPggZ3xjOoPFMYKaSgCFIJH/g5bXWrhu\nYWZMuQrQF7yxTorGMjA7DX9Q04MY3A4zACKZWv7tRQp6C4zom8HucMJP7GpIQgjuvGA8lz72CV63\ng6L+mdxy7mi+dlLEAlNjBP2LE/qz+Z7z+efyvfo1ksjfs0bTpLvUKKGy+uiTiZjxuNSojiMngaC7\nHAqnj+7Dsh2lcR1FMvcxsH7hH7tielShsB99YQwXFw/i3XBYaCKsiTEGqiJsQ0KLh+Tw6rpDZke2\n+NYzqKjz2SaOGfTLdLPzWG1cRUbjs0vU6QFkh8W+wRfAFwzhcihRo7K448PXUgQ8dfVJ/Or1Taag\nGxNzqhoZfcRODoJeUri6MYAvqK+UVat5WK8MZAkD2Zo5jj8Fx3DB5AG88fnhqPNc+Mm3uHqKvI18\nc6KHRSs/J19UM9hXT5PmQxMqCkGcIkAaIRQ08kWIelGL6tPwOAV9hV/f7nOQJ5pQCSEUDZUQKiG8\nhxUuUH1haQzhCIBQgyhCP0ZoIdTdGsWxj3YnnOIE9sLZTmAN3Gcc8z8oBP7gQs9kfkkv9Ro1Qf0i\nXBz+8xWb5DZeho/stscQ0gQBFJT3nFzm1jsbz0tuVrpDBFDIWJHGfFeQjF1pXO8KEUQhgIrqcOLU\nHPhLFUS5g6aQYMzaXJYMDJDtGweMaPnm7UAKegu89//OMlefiR2GQmSyJjfDhRCCW84dE7XfOntu\nFUZje3PRCgZDLDXBM1yOKEFN5KNPhHHuNbML+fvHe5OKaXcnWBzb4Bfzx9nWKYdoQT9/YrS//trT\nhqMoosWoG+uCGZHrCtsKmV+eOpiFG49wx9wiIBKVY3URPXbFNL7zn0iMfUGW7pqJnTg12h67/Rsn\nD+XZlXqMuiHQhh/fpYqozyQW41oXFQ/irLH9yE6PTDwbk5eqEHF14K143Q6OVjdGhQSW1zWZNVgA\nfvzFsXGC7sPJ4bDrBw12KOnMKprEr5evBCBLdVDtD/DrUyayZn9FlA/8lS+fyoOLtrF8VxmzBueb\nyVd/+PIUbn1+fVwbvzNzJI8tjSRmnTI8jxV7yrn61ELOLurHVU+t5MbTh7Ng2S7UcKcx
ZZCXX8wr\n4orHP+ErUwfy2tr9PHfdyZSU1/Lzl9fz72tPorSmnp+8tJ7RfdJ58qppfOdfKzhQWqN3GgT5w2UT\nueOltagEyXELGpp8fPOkgbz02T5UQnz/rEIe/2AHqgjiIIRKEAdBVELceNpQ/v7RLn270Ld/YXQf\nlm49jEqIcwbl8dH2I6hoTMzysKO+kpHpaZTU15jXKsr1UFFbj6YFUUNNZBLC6w8xyhmEdHtXUEci\nBb0V2A2lTyrM5epTC/nOmSNtz7EOs6xCbK5sEuNmeeArk81VdQysceEZbkeUcEeNAGKE0ehKJg3K\nZkN40QPj+Dvnj+f2LxYlNeveUp9z3ekjeHzZbltrsjn3Q1q4g2tR0G2iNBJZwdkeJ8/fOCtue77X\nzXM3zGT8wKy4hUX6ZekmW2yHaDyrWP//5MHZPLsyfL+woD+zQhd4l0OxHYYbGM/DmCD0WjpUI8RP\nUSDT5eSGM0awwJLEZWCEwVknkw9WRCYxhWg+v8LAFwxFRaEYnZLuQ49+z5lpTk4ensfyXWVRGa+D\nc+o8XYAAABQSSURBVONj7oEoMVcVEZV3YTyf6ibdbRMIuzc8GVmoaZnU4KE0mE4FWahZ/chQcjhC\nCceUvlS7/OzXjuASXug7lu3aYXZrkUn/43nT+DS85kCOcHLhKQNRivqxcIW+EMlXhp3Ef0PRi5KA\nPiFbf3Ixjy79IGr7yOLp/HqjPqp0j53EzzbrtW5uGjmSBQd3c8WIYfzjyF7z+PtPmcQ7m49ypLqR\nkAaDctJ54lszbJ9RZ9DrFonuTOx+qA5V4VcXTki4+LQ1IcQqGIaFHus2/+pJ8RNmRk0NgAHZabjU\nSE8f66O3438/OM2MIDDeg6IIc7KmJZLxlhuz/pdMja6l0Zy4GRZ2Swt724XdtXZkAjBzRD5Zac64\nNhWEo2Fi72OE8cWGZ55vydA0LG5j0s5uUtSK0TkYtVmsk2TG/Y2RYFbM6MmlKrx9y+nMHKFP7BlV\nG8cNyGKvpe5LdrozqeQzfzAUtU6ttY2xHX1WmiNcSC66RonHpfLFCfE+fivpTtW8ntuh4Ap/3tZw\nR9DnE4zP0vjduFTVnP8orW0yM08DwRC+QCguwuQPllDbJn8oLu8imSX5Mi3PLtcyf/KzVyKFy6ob\n/SiKiOvsHapCmkul0R+krimA1935VrkVKeitoKUJPjusxYnsLHS7Ko9PfmsGv79sSuQ8yxdyUE56\ntA+9GWGz5koYgtucO6A9GBbnD8+NjueNvd8v5sdXJWxO9MHeQjeewZLbzmDlz85tVVtjP0djgro6\nRmB+MX8cK392TpwP3TrRHOt3h8TlHiAi1oar7teWVbOM92k8stiaPWlOhaL+WaabzBD06cOi6xDl\nelxxIz87F44vELJ1lzhs3EaZaU7TALC6EV2qws0JYrgNzhwbWR3CpSqmQRK7dGCOx2V+F+rCIwen\nQ5AfrrdTXuczBd0f1PjmE5/GhUIu3xWp2W8UZbO+l0QJSkII8/cyMCfyrPpl2RtqR6qabEcyAj23\nwszsbkdWd1uQgt4KmvuhJsJanMgqXEaUiJ28njOugK9MH2w5L3LUwJz0qNdWv74rZvLSGoliWJuD\ncpKLJYdIxmKiGHIrRocRK3BThkSLzXWnx08KfTls1b9z6xk8cvnUuP1WMXrmulP49cUTzWiWUf0y\nzYnJZImdCzEEOzYe3qEq5g/6sSumm9udzUxE7z5e12xp2IjLRRcWQ6wgsgC2IcaxVTWNVZeMCV4j\nkSY2P+AKy1q5BvleNzv+by7Xzh5ubksUdaEqSpyhkeZUzGgna7ihQ1UY2dfLMJtyBwPDo9biwTmm\nVe92KuaIrDYmbDEvI5I0Z1joTlUhJ92JIvTiYIaLKBAKmVU1m8PtUKJGG82tbTAs38Mdc4t46pqT\nzG3G9yyWfWV1qELEdZwa+oiurilgFtPrSqSgtwK7
SbiWuOXcMZw+ug/P3zAzytK7qHgQw/I9XDWr\nMOG5U8NuEiEEz1x/ChcXD4zLFG3OQrdywxkj2Hj3FxNaHHbcMbeIf1xzEsVD4itRxhI0U6EjX+D/\nfvdUfmxTA3rhD0/n8asifsXJg3PYe/98RhdkxlWZHNXPGzU/ceqoPlwxc1jcDynT7Ug6+ibWZ28M\n6e0mvQ2sE7rNLS92zezCZj8HI2yxIDP+c4iNy46NWTbanWVa6HpGbKzonDI8L+7a7rDv+tunD4/b\nF4tDEXHFsYQQpuvJuri3UxWkOVUW3XJG3HWM+QGPJaRXt9D19xH7fnM9EXeY0dk4VQVFEeRluCir\n85nzDK2pRWPtgBOVwhbh9/idM0dGGT2JlrbbW1aHqgoaYqpJappGhlu30JsCoRYDCjoaOSmaBN89\nayR//SC+lGoy9M9O49/fjl+wuiArrcXFC/797VPM7LtTR/bh1JF6aYBEUS7NRUYIIVpd1MvtUDlr\nrH3KfizpTv1L7HIozBqRz8wR+eb6pLGMG5BlZtvasfTHZ/HG54d5cNE2RvfzJjVxu+rOc5NapxLi\nXS4TBmbx07lFXBzj/48l0+3g2tOaF8QR4SJVr31vNoX5Gfzrk7383uLXnT4slz99vZhzx0X8zm/d\nfDrzHl4Wd63YUEujozBcMYbLpW+MoBsug+V3nM0fl2znhVUl5qSkM8YoMfIorChCRC3obWD40Cst\n1SZdFt94LMZozet2mKMWlyOSGBe7AlaOJ2Ks1Jo+dP11XoaLI1UN5GUYORnxpWmvPrWQf4TDgQ0O\nVzVGddSbDja/qpQdsXWNFKF3KJX1fg6GM2ozXCp1viCaplvoZjVGVVroPY6fnF8UtZ5kV+F1O0yB\nsOK2fEmsowZrJUQgudnMDuL5G2dyx9wi0l0qz94wM86X3hqG5WeYE3/JVqlzO9SE9VpiiRV0IQQ3\nnjnSDF9MxIa7v8it50XCUo0ELbvswylDcsj2OLlsRmSSe3L487moeFCU5ZcodDTWQjdGKsZ2o6RD\nH2+0oBsTeQNz0s2wzSHhaJTYztHu++VQhDlp+4OzR/H2LacDkXIVNU3xbkS7UYuxz+NymCn6boeS\ncAST63GZAm4IqGFd98tM4/1tx1m4QV+py67zvtUSMnz1qYWWdkTa9lyCFP/mlud79/+dGfXa6rYv\nCUcXGYELGhoZlmCD1uRsdATSQk9BEn1JPC4Hd31pPDuP1fL0Cvta3p3FqH6Z5vJsHcGYAl1oLi5u\n3mpuC8ZcyDkJCoYlw+775pki8MS3ZjD+l4tsjzPEa8qQnLi68waJyiJYhd5qUBjbV+/TfcjWjuyp\nq2dEiasxKT86/DxjOzO7kFFVFebk4fA+GRT110dTdiO82EU7rBh1YjJcqunCsQvrzEpzUN0YICs9\n3m1mPJtrZhfy0c5SdpfW4QlPOsZidW9cOn0wI/t5uXDyQMptrPlYYhP8nrn+FLMTtM6dffSTOZz2\n2/fN18P7ZLD1SA3D8j1sPVKjW+iW55RM3fmORAp6CtJcr3/N7OH8d3UJT6/Y35UGeoeTmeZk133z\nEopde1AUwbLb58S5Klp7DYN0p8o5Rf3M+h5WcjNcPPyNqcwaYV9DBBJHTyXKbvW4VFRLgSirxRub\n1fvt04ejCD0ZCuIn9h2qYMltZ7LpUBU/fG6dvk0RZqVC63fNboLQLmrqri+Nx6kqZmy+x+2I+NAd\nSlwn0MfrproxYL4vK0bnZJ3HuXjqIPPaVqydk9ft4Mrw5HAygh4bnWC4NyH684mNu3/wsil896yR\nPG1ZDMXa8SUbUttRSJdLCtKSyNmV8U1FOrP9Q/I8SbtoWkIIwZNXn8TZRfbx2BdOGdhs55Eoeioj\nQZ6AECLKerdaprEC4nU7+ME5o02rOLbzGDcgi1H9vHzJMhmtKgJfwAhzjVwvx+OMMybsQk5PC09c\nG1Z+mlMxQx3d
DiVKeN+8+TQe+lox540vYFh+hu37hehEqZkxneOkQdk8c/0pUZ2sxxL/HWzlKkWx\nGJ2WMe9h/V563Q4mD85hRqE+XzS6wGvOV4B0uUg6gLkTB7D21MqENZ4lPYtEFnpz0TRWUbRa5S0J\niNWifuk7s8yJa6sYOhTFFGOXJelLCEFBlpsDllWA7Dpdw58cCEU6BWv9cKsbYsJAfV7BGvX0wo2z\n+OrfPom6phCCa2cPZ1Q/LyP7RoT/kcunMnfigLh2ZLisyUEth902hxCCT356ttmpvPzdU7nkr8uj\nFgS5dPpgZo3MZ3Cuh61HIou1SEGXJMUjl09lVYI4XJdDz16VpAZWK/edW+PD/+ww4sQvnT6Y7HSn\nuYB5otKxBtZOYkZhfHgj6CJtxMnbZdVaBd3KvRdN0DMlwyOfmSPy2VNaR67HGVXqormOCuBkm7BL\ngF9+aXzctsL8DNtOxZoPke9188jlU/n+M2sT3rOlsaA11n/KkBx23Tcv+nwhTHfMKMtEc1dHuUhB\nT1EumDwwLmZbkppYBWl0QfTE8k1njYwq/RB7juGbz07XBT1ZN1WihBnQrXh/wL48bHOlhK+Myam4\n+8IJXH/6cPK9bjPJraPcaINz0ympaEjoo47NGenfQgRTC31Mq7BGEkkLXSI5wWiuHMPt5xfZbs/x\nODlY2WAm75w2qg/7yvYnNQn38k2nMsymkzBQLevlxka2GOGWXxhfwPzJA5q9j8uhmGGRhsulo4Tz\n6lML+fWbW8hNoggZkHARa4Nkyli3BpeqmOWUuxIp6BJJN9OSC8IOw59rLHv4qwsn8NUZQ2yt+Vim\nDbVP+DJQFcEDl07mzA2HmTAwOgHMsNinDs3lolaElBqx20bfde9FExg/MHFy2bLb5zS7NOG3TxvO\n5acMtV3X1o7cDBf3XjSBrHSnGc3z2vdms+1oDbfbLF7eXtyOsKB38pJzscgoF4kkBZkaDuMzLECn\nqsTVzWkrDkWQ43HxzVPiSywYr1pbRfCscIGu/mFf9JWzCpk+zN5XDvrEqrVsdCzWUgTJcuWsQi4q\nHvT/2zvfGKmuMg4/P5ZdsLTuQkHcAC2g2xqsCoQgWCS1jasQU/uhiUtMJFXTRE1aUqOBNmli/KQf\njDUxto3/+kFrtVolREVsm/jnA3X5Vyi4slVMIdBdbdoaP9Ht64f7zjI7zu7szN6ZOXfyPslkzj33\nzr3PbM6+c+57zz2Xfp9j5n2r+rjlhszrE1VmOZ0LpZFHPTVmEs2b6KEHQQG557YBblrRywcHltbe\nuE5mk+euN5Vwz60D7Np8Xc27cVvBb+/dPtn7f9tbFzblLvAFk3P9x0XRIAhqML9rHoPvfnvtDeug\nu0tcnrAZJylrlHnz1JJg/ocvfWjK9L7V6L2qu+4ZOuuldC2j1lz/eRMBPQgS4KGh9QzkOHVCI3R3\nzePyxMSU59RWkudokGZwXZVpfNtB6Qxm4s3W3q8dOfQgSICPr18x40XCVjB5N+kMKZetPhHZDcvb\n++OTOu/x/H9edyPPluihB0EA1H5yFGQP4f7AO5YmkQtPma/ecRN3bFhRc7hk3kQPPQgCAHr9CUmV\nj3WrJIJ5bRZ2d3HzO/O/YF2L6KEHQQDAD+/azFPHLkw+Oi4oHhHQgyAAsrHftR74HKRNpFyCIAg6\nhAjoQRAEHULDAV3SKknPSjot6QVJ9+YpFgRBENTHXHLobwBfNLOjkq4Bjkg6ZGanc3ILgiAI6qDh\nHrqZXTSzo17+D3AGyP+JvkEQBMGsyCWHLmk1sAE4XGXd3ZKGJQ2Pj4/ncbggCIKgCnMO6JKuBn4O\n7DGz1yvXm9mjZrbJzDYtW7ZsrocLgiAIpmFOAV1SN1kw/5GZ/SIfpSAIgqARZDWmmpz2g9nM948B\nr5jZnll+Zhz4Z0MHhKXAvxr8bDsokm+4No8i+RbJFYrlO1fX682sZopjLgF9G/
BH4CTwplffb2a/\nbmiHtY83bGabmrHvZlAk33BtHkXyLZIrFMu3Va4ND1s0sz9Bzk9WDYIgCBom7hQNgiDoEIoU0B9t\nt0CdFMk3XJtHkXyL5ArF8m2Ja8M59CAIgiAtitRDD4IgCGagEAFd0kcljUgalbS3TQ7flzQm6VRZ\n3RJJhySd9ffFXi9J33Lf5yVtLPvMbt/+rKTdTXKtOnFawr4LJT0n6YT7fsXr10g67F5PSOrx+gW+\nPOrrV5fta5/Xj0j6SDN8/Thdko5JOlAA13OSTko6LmnY61JtC32SnpT0V0lnJG1N2PVG/5uWXq9L\n2tNWXzNL+gV0AS8Ca4Ee4ASwrg0e24GNwKmyuq8De728F/ial3cCvyEbBbQFOOz1S4C/+/tiLy9u\ngms/sNHL1wB/A9Yl7Cvgai93k00hsQX4KTDk9Q8Dn/Py54GHvTwEPOHldd4+FgBrvN10Nak93Af8\nGDjgyym7ngOWVtSl2hYeAz7r5R6gL1XXCu8u4BJwfTt9m/YFc/xDbQUOli3vA/a1yWU1UwP6CNDv\n5X5gxMuPALsqtwN2AY+U1U/ZronevwI+XARf4CrgKPB+shsx5le2A+AgsNXL8307VbaN8u1ydlwJ\nPA3cChzwYyfp6vs+x/8H9OTaAtAL/AO/tpeyaxX3QeDP7fYtQsplBfBS2fJ50pnVcbmZXfTyJWC5\nl6dzbvl30dSJ05L19RTGcWAMOETWY33VzN6ocuxJL1//GnBtC32/CXyZKzfUXZuwK4ABv5N0RNLd\nXpdiW1gDjAM/8HTWdyUtStS1kiHgcS+3zbcIAb0QWPbTmtSQIc0wcVpqvmY2YWbryXq/m4F3tVmp\nKpI+BoyZ2ZF2u9TBNjPbCOwAviBpe/nKhNrCfLK05nfMbAPwX7KUxSQJuU7i10tuB35Wua7VvkUI\n6BeAVWXLK70uBV6W1A/g72NeP51zy76Lqk+clqxvCTN7FXiWLG3RJ6l0N3P5sSe9fH0v8O8W+d4M\n3C7pHPATsrTLQ4m6AmBmF/x9DHiK7AczxbZwHjhvZqVpuJ8kC/ApupazAzhqZi/7ctt8ixDQ/wIM\n+CiCHrJTm/1tdiqxHyhdkd5Nlqsu1X/Kr2pvAV7zU7CDwKCkxX7le9DrckWSgO8BZ8zsGwXwXSap\nz8tvIcv3nyEL7HdO41v6HncCz3hPaD8w5CNL1gADwHN5uprZPjNbaWarydriM2b2yRRdASQtUvZE\nMTx9MQicIsG2YGaXgJck3ehVtwGnU3StYBdX0i0lr/b4NvNCQY4XHHaSjdR4EXigTQ6PAxeBy2Q9\nic+Q5UKfBs4CvweW+LYCvu2+J4FNZfv5NDDqr7ua5LqN7DTveeC4v3Ym7Pte4Jj7ngIe9Pq1ZEFu\nlOx0doHXL/TlUV+/tmxfD/j3GAF2NLlN3MKVUS5JurrXCX+9UPr/SbgtrAeGvS38kmzUR5KufpxF\nZGdcvWV1bfONO0WDIAg6hCKkXIIgCIJZEAE9CIKgQ4iAHgRB0CFEQA+CIOgQIqAHQRB0CBHQgyAI\nOoQI6EEQBB1CBPQgCIIO4X9iGnorp+WAJQAAAABJRU5ErkJggg==\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x119bae4d0>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x119bae4d0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"\n",
"import matplotlib.pyplot as plt\n",
"from IPython import display\n",
"import cPickle\n",
"\n",
"feeding = {\n",
" 'user_id': 0,\n",
" 'gender_id': 1,\n",
" 'age_id': 2,\n",
" 'job_id': 3,\n",
" 'movie_id': 4,\n",
" 'category_id': 5,\n",
" 'movie_title': 6,\n",
" 'score': 7\n",
"}\n",
"\n",
"step=0\n",
"\n",
"train_costs=[],[]\n",
"test_costs=[],[]\n",
"\n",
"def event_handler(event):\n",
" global step\n",
" global train_costs\n",
" global test_costs\n",
" if isinstance(event, paddle.event.EndIteration):\n",
" need_plot = False\n",
" if step % 10 == 0: # every 10 batches, record a train cost\n",
" train_costs[0].append(step)\n",
" train_costs[1].append(event.cost)\n",
" \n",
" if step % 1000 == 0: # every 1000 batches, record a test cost\n",
" result = trainer.test(reader=paddle.batch(\n",
" paddle.dataset.movielens.test(), batch_size=256))\n",
" test_costs[0].append(step)\n",
" test_costs[1].append(result.cost)\n",
" \n",
" if step % 100 == 0: # every 100 batches, update cost plot\n",
" plt.plot(*train_costs)\n",
" plt.plot(*test_costs)\n",
" plt.legend(['Train Cost', 'Test Cost'], loc='upper left')\n",
" display.clear_output(wait=True)\n",
" display.display(plt.gcf())\n",
" plt.gcf().clear()\n",
" step += 1\n",
"\n",
"trainer.train(\n",
" reader=paddle.batch(\n",
" paddle.reader.shuffle(\n",
" paddle.dataset.movielens.train(), buf_size=8192),\n",
" batch_size=256),\n",
" event_handler=event_handler,\n",
" feeding=feeding,\n",
" num_passes=2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## 应用模型\n",
"\n",
"在训练了几轮以后,您可以对模型进行推断。我们可以使用任意一个用户ID和电影ID,来预测该用户对该电影的评分。示例程序为:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[INFO 2017-03-06 17:17:08,132 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title]\n",
"[INFO 2017-03-06 17:17:08,134 networks.py:1478] The output order is [__cos_sim_0__]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Predict] User 234 Rating Movie 345 With Score 4.16\n"
]
}
],
"source": [
"import copy\n",
"user_id = 234\n",
"movie_id = 345\n",
"\n",
"user = user_info[user_id]\n",
"movie = movie_info[movie_id]\n",
"\n",
"feature = user.value() + movie.value()\n",
"\n",
"infer_dict = copy.copy(feeding)\n",
"del infer_dict['score']\n",
"\n",
"prediction = paddle.infer(output=inference, parameters=parameters, input=[feature], feeding=infer_dict)\n",
"score = (prediction[0][0] + 5.0) / 2\n",
"print \"[Predict] User %d Rating Movie %d With Score %.2f\"%(user_id, movie_id, score)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## 总结\n",
"\n",
"本章介绍了传统的推荐系统方法和YouTube的深度神经网络推荐系统,并以电影推荐为例,使用PaddlePaddle训练了一个个性化推荐神经网络模型。推荐系统几乎涵盖了电商系统、社交网络、广告推荐、搜索引擎等领域的方方面面,而在图像处理、自然语言处理等领域已经发挥重要作用的深度学习技术,也将会在推荐系统领域大放异彩。\n",
"\n",
"## 参考文献\n",
"\n",
"1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.\n",
"2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.\n",
"3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.\n",
"4. Sarwar, Badrul, et al. \"[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)\" *Proceedings of the 10th international conference on World Wide Web*. ACM, 2001.\n",
"5. Kautz, Henry, Bart Selman, and Mehul Shah. \"[Referral Web: combining social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)\" Communications of the ACM 40.3 (1997): 63-65. APA\n",
"6. Yuan, Jianbo, et al. [\"Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach.\"](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).\n",
"7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.\n",
"\n",
"<br/>\n",
"<a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"><img alt=\"知识共享许可协议\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png\" /></a><br /><span xmlns:dct=\"http://purl.org/dc/terms/\" href=\"http://purl.org/dc/dcmitype/Text\" property=\"dct:title\" rel=\"dct:type\">本教程</span> 由 <a xmlns:cc=\"http://creativecommons.org/ns#\" href=\"http://book.paddlepaddle.org\" property=\"cc:attributionName\" rel=\"cc:attributionURL\">PaddlePaddle</a> 创作,采用 <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
...@@ -82,8 +82,8 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ ...@@ -82,8 +82,8 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
<p align="center"> <p align="center">
<img src="image/rec_regression_network.png" width="90%" ><br/> <img src="image/rec_regression_network.png" width="90%" ><br/>
图3. 融合推荐模型 图3. 融合推荐模型
</p> </p>
## 数据准备 ## 数据准备
...@@ -91,278 +91,330 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ ...@@ -91,278 +91,330 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
我们以 [MovieLens 百万数据集(ml-1m)](http://files.grouplens.org/datasets/movielens/ml-1m.zip)为例进行介绍。ml-1m 数据集包含了 6,000 位用户对 4,000 部电影的 1,000,000 条评价(评分范围 1~5 分,均为整数),由 GroupLens Research 实验室搜集整理。 我们以 [MovieLens 百万数据集(ml-1m)](http://files.grouplens.org/datasets/movielens/ml-1m.zip)为例进行介绍。ml-1m 数据集包含了 6,000 位用户对 4,000 部电影的 1,000,000 条评价(评分范围 1~5 分,均为整数),由 GroupLens Research 实验室搜集整理。
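You can run `data/getdata.sh` to download the data. If the download succeeds, the directory `data/ml-1m` will contain the following files: The Paddle API provides a module that loads the data automatically: `paddle.dataset.movielens`.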
```python
import paddle.v2 as paddle
paddle.init(use_gpu=False)
``` ```
movies.dat ratings.dat users.dat README
```python
# Run this block to show dataset's documentation
# help(paddle.dataset.movielens)
``` ```
- movies.dat:电影特征数据,格式为`电影ID::电影名称::电影类型` 在原始数据中包含电影的特征数据,用户的特征数据,和用户对电影的评分。
- ratings.dat:评分数据,格式为`用户ID::电影ID::评分::时间戳`
- users.dat:用户特征数据,格式为`用户ID::性别::年龄::职业::邮编`
- README:数据集的详细描述
### 数据预处理 例如,其中某一个电影特征为:
首先安装 Python 第三方库(推荐使用 Virtualenv):
```shell ```python
pip install -r data/requirements.txt movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0]
``` ```
其次在预处理`./preprocess.sh`过程中,我们将字段配置文件`data/config.json`转化为meta配置文件`meta_config.json`,并生成对应的meta文件`meta.bin`,以完成数据文件的序列化。然后再将`ratings.dat`分为训练集、测试集两部分,把它们的地址写入`train.list``test.list` <MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>
运行成功后目录`./data` 新增以下文件:
``` 这表示,电影的id是1,标题是《Toy Story》,该电影被分为到三个类别中。这三个类别是动画,儿童,喜剧。
meta_config.json meta.bin ratings.dat.train ratings.dat.test train.list test.list
```
- meta.bin: meta文件是Python的pickle对象, 存储着电影和用户信息。
- meta_config.json: meta配置文件,用来具体描述如何解析数据集中的每一个字段,由字段配置文件生成。
- ratings.dat.train和ratings.dat.test: 训练集和测试集,训练集已经随机打乱。
- train.list和test.list: 训练集和测试集的文件地址列表。
### 提供数据给 PaddlePaddle ```python
user_info = paddle.dataset.movielens.user_info()
print user_info.values()[0]
```
<UserInfo id(1), gender(F), age(1), job(10)>
This means the user's ID is 1, the user is female and under 18, and the job ID is 10.
Age is encoded with the following buckets:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
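To make the age encoding concrete, here is a minimal sketch of how a raw age code could be turned into a small consecutive index that an embedding layer can consume. The list `AGE_TABLE` and the helpers `age_to_index` and `index_to_label` are illustrative names, not part of the PaddlePaddle API. The sketch assumes that the dataset module stores the position of the age code in such a table, which would explain why the model later sizes its age input with `len(paddle.dataset.movielens.age_table)`.

```python
# Illustrative sketch (hypothetical names): map MovieLens age codes to
# consecutive indices 0..6 so they can index an embedding table.
AGE_TABLE = [1, 18, 25, 35, 45, 50, 56]  # age codes from the list above
AGE_LABELS = ["Under 18", "18-24", "25-34", "35-44", "45-49", "50-55", "56+"]


def age_to_index(age_code):
    """Return the position of an age code in AGE_TABLE."""
    return AGE_TABLE.index(age_code)


def index_to_label(index):
    """Return the human-readable bucket for an index."""
    return AGE_LABELS[index]


idx = age_to_index(25)
print("%d -> %s" % (idx, index_to_label(idx)))
```

Using an index rather than the raw code keeps the embedding table at seven rows instead of 57.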
The job is chosen from the following options:
* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
Each training/test sample has the form <user features> + <movie features> + rating.
For example, we fetch the first training sample:
我们使用 Python 接口传递数据给系统,下面 `dataprovider.py` 给出了完整示例。
```python ```python
from paddle.trainer.PyDataProvider2 import * train_set_creator = paddle.dataset.movielens.train()
from common_utils import meta_to_header train_sample = next(train_set_creator())
uid = train_sample[0]
def __list_to_map__(lst): # 将list转为map mov_id = train_sample[len(user_info[uid].value())]
ret_val = dict() print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1])
for each in lst:
k, v = each
ret_val[k] = v
return ret_val
def hook(settings, meta, **kwargs): # 读取meta.bin
# 定义电影特征
movie_headers = list(meta_to_header(meta, 'movie'))
settings.movie_names = [h[0] for h in movie_headers]
headers = movie_headers
# 定义用户特征
user_headers = list(meta_to_header(meta, 'user'))
settings.user_names = [h[0] for h in user_headers]
headers.extend(user_headers)
# 加载评分信息
headers.append(("rating", dense_vector(1)))
settings.input_types = __list_to_map__(headers)
settings.meta = meta
@provider(init_hook=hook, cache=CacheType.CACHE_PASS_IN_MEM)
def process(settings, filename):
with open(filename, 'r') as f:
for line in f:
# 从评分文件中读取评分
user_id, movie_id, score = map(int, line.split('::')[:-1])
# 将评分平移到[-2, +2]范围内的整数
score = float(score - 3)
movie_meta = settings.meta['movie'][movie_id]
user_meta = settings.meta['user'][user_id]
# 添加电影ID与电影特征
outputs = [('movie_id', movie_id - 1)]
for i, each_meta in enumerate(movie_meta):
outputs.append((settings.movie_names[i + 1], each_meta))
# 添加用户ID与用户特征
outputs.append(('user_id', user_id - 1))
for i, each_meta in enumerate(user_meta):
outputs.append((settings.user_names[i + 1], each_meta))
# 添加评分
outputs.append(('rating', [score]))
# 将数据返回给 paddle
yield __list_to_map__(outputs)
``` ```
User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]
That is, user 1 gave movie 1193 a rating of 5.
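The index arithmetic in the snippet above relies on the layout of a sample. The following sketch spells that layout out with a hand-written sample. The values in `sample` and the helper `split_sample` are made up for illustration, and the field order is an assumption taken from the input order that the training log in this tutorial prints (`user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score`).

```python
# Illustrative sketch: how one sample is laid out as
# <user features> + <movie features> + score. All values are made up.
FIELDS = ["user_id", "gender_id", "age_id", "job_id",
          "movie_id", "category_id", "movie_title", "score"]
NUM_USER_FIELDS = 4  # user_id, gender_id, age_id, job_id

sample = [1, 1, 0, 10, 1193, [7], [10, 20, 30], [5.0]]


def split_sample(sample):
    """Split a flat sample into user features, movie features and score."""
    user = sample[:NUM_USER_FIELDS]
    movie = sample[NUM_USER_FIELDS:-1]
    score = sample[-1]
    return user, movie, score


user, movie, score = split_sample(sample)
# The movie ID is the first element after the user features, which is
# the position that train_sample[len(user_info[uid].value())] reads.
print("user=%s movie_id=%s score=%s" % (user, movie[0], score))
```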
## 模型配置说明 ## 模型配置说明
### 数据定义 下面我们开始根据输入数据的形式配置模型。
加载`meta.bin`文件并定义通过`define_py_data_sources2`从dataprovider中读入数据:
```python ```python
from paddle.trainer_config_helpers import * uid = paddle.layer.data(
name='user_id',
type=paddle.data_type.integer_value(
paddle.dataset.movielens.max_user_id() + 1))
usr_emb = paddle.layer.embedding(input=uid, size=32)
usr_gender_id = paddle.layer.data(
name='gender_id', type=paddle.data_type.integer_value(2))
usr_gender_emb = paddle.layer.embedding(input=usr_gender_id, size=16)
usr_age_id = paddle.layer.data(
name='age_id',
type=paddle.data_type.integer_value(
len(paddle.dataset.movielens.age_table)))
usr_age_emb = paddle.layer.embedding(input=usr_age_id, size=16)
usr_job_id = paddle.layer.data(
name='job_id',
type=paddle.data_type.integer_value(paddle.dataset.movielens.max_job_id(
) + 1))
usr_job_emb = paddle.layer.embedding(input=usr_job_id, size=16)
```
try: 如上述代码所示,对于每个用户,我们输入4维特征。其中包括`user_id`,`gender_id`,`age_id`,`job_id`。这几维特征均是简单的整数值。为了后续神经网络处理这些特征方便,我们借鉴NLP中的语言模型,将这几维离散的整数值,变换成embedding取出。分别形成`usr_emb`, `usr_gender_emb`, `usr_age_emb`, `usr_job_emb`
import cPickle as pickle
except ImportError:
import pickle
is_predict = get_config_arg('is_predict', bool, False)
META_FILE = 'data/meta.bin' ```python
usr_combined_features = paddle.layer.fc(
input=[usr_emb, usr_gender_emb, usr_age_emb, usr_job_emb],
size=200,
act=paddle.activation.Tanh())
```
# 加载 meta 文件 然后,我们对于所有的用户特征,均输入到一个全连接层(fc)中。将所有特征融合为一个200维度的特征。
with open(META_FILE, 'rb') as f:
meta = pickle.load(f)
if not is_predict: 进而,我们对每一个电影特征做类似的变换,网络配置为:
define_py_data_sources2(
'data/train.list',
'data/test.list', ```python
module='dataprovider', mov_id = paddle.layer.data(
obj='process', name='movie_id',
args={'meta': meta}) type=paddle.data_type.integer_value(
paddle.dataset.movielens.max_movie_id() + 1))
mov_emb = paddle.layer.embedding(input=mov_id, size=32)
mov_categories = paddle.layer.data(
name='category_id',
type=paddle.data_type.sparse_binary_vector(
len(paddle.dataset.movielens.movie_categories())))
mov_categories_hidden = paddle.layer.fc(input=mov_categories, size=32)
movie_title_dict = paddle.dataset.movielens.get_movie_title_dict()
mov_title_id = paddle.layer.data(
name='movie_title',
type=paddle.data_type.integer_value_sequence(len(movie_title_dict)))
mov_title_emb = paddle.layer.embedding(input=mov_title_id, size=32)
mov_title_conv = paddle.networks.sequence_conv_pool(
input=mov_title_emb, hidden_size=32, context_len=3)
mov_combined_features = paddle.layer.fc(
input=[mov_emb, mov_categories_hidden, mov_title_conv],
size=200,
act=paddle.activation.Tanh())
``` ```
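### Algorithm Configuration The movie ID and the movie categories are each mapped to their own hidden feature layer. The movie title is a sequence of words represented as a sequence of IDs; after it passes through the convolution layer we obtain a feature for each time window (a sequence feature), which is then downsampled along the time dimension into a fixed-size feature. The whole procedure is implemented by `sequence_conv_pool`.
Finally, the movie features are fused into `mov_combined_features`.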
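As a rough illustration of the convolution-then-pooling idea described above, the sketch below slides a window over a sequence of word vectors, projects each window with a weight matrix, and takes the maximum over time so that titles of any length yield a vector of the same size. It is plain Python, not PaddlePaddle's implementation; the function name `conv_pool`, the zero padding, the `tanh` nonlinearity and the toy weights are all assumptions made for this example.

```python
import math


def conv_pool(seq, weights, context_len=3):
    """Window-based convolution over time followed by max pooling.

    seq:     list of word vectors (each a list of floats), length T
    weights: hidden_size rows, each of length context_len * emb_dim
    Returns a list of hidden_size floats, independent of T.
    """
    emb_dim = len(seq[0])
    pad = [[0.0] * emb_dim] * (context_len // 2)
    padded = pad + seq + pad  # zero padding keeps one window per word
    windows = []
    for t in range(len(seq)):
        # Concatenate the context_len vectors starting at padded position t.
        window = [x for vec in padded[t:t + context_len] for x in vec]
        windows.append([
            math.tanh(sum(w * x for w, x in zip(row, window)))
            for row in weights
        ])
    # Max over the time dimension gives a fixed-size feature.
    return [max(col) for col in zip(*windows)]


# Toy example: emb_dim=2, context_len=3, hidden_size=2.
weights = [[0.1] * 6, [-0.1] * 6]
short_title = [[1.0, 0.0], [0.0, 1.0]]
long_title = short_title + [[1.0, 1.0], [0.5, 0.5]]
print("%d %d" % (len(conv_pool(short_title, weights)),
                 len(conv_pool(long_title, weights))))
```

Both titles map to a vector of length `hidden_size`, which is what allows the title feature to be combined with the fixed-size ID and category features in one fully connected layer.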
这里我们设置了batch size、网络初始学习率和RMSProp自适应优化方法。
```python ```python
settings( inference = paddle.layer.cos_sim(a=usr_combined_features, b=mov_combined_features, size=1, scale=5)
batch_size=1600, learning_rate=1e-3, learning_method=RMSPropOptimizer())
``` ```
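### Model Structure Next, we use cosine similarity to measure how similar the user features and the movie features are, and fit (regress) this similarity to the user's rating.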
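To see how a scaled cosine similarity can be related to a star rating, here is a small self-contained sketch in plain Python. With `scale=5` the similarity lies in [-5, 5], and the inference code in this tutorial maps a prediction back to a rating with `(prediction + 5.0) / 2`. The forward mapping `rating * 2 - 5` used below is simply the inverse of that formula and is an assumption about how the training labels are scaled. All helper names are hypothetical.

```python
import math


def scaled_cos_sim(a, b, scale=5.0):
    """Cosine similarity of two vectors multiplied by `scale`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return scale * dot / (norm_a * norm_b)


def rating_to_label(rating):
    """Assumed label scaling: inverse of the tutorial's inference formula."""
    return rating * 2.0 - 5.0


def prediction_to_rating(prediction):
    """Map a network output back to a rating, as in the inference code."""
    return (prediction + 5.0) / 2


user_vec = [0.2, 0.9, -0.1]   # made-up user feature vector
movie_vec = [0.1, 0.8, 0.0]   # made-up movie feature vector
pred = scaled_cos_sim(user_vec, movie_vec)
print("similarity=%.3f rating=%.2f" % (pred, prediction_to_rating(pred)))
```

Under this assumed scaling, ratings 1 to 5 correspond to labels -3 to 5, all of which fall inside the range the scaled similarity can reach.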
1. 定义数据输入和参数维度。
```python ```python
movie_meta = meta['movie']['__meta__']['raw_meta'] cost = paddle.layer.regression_cost(
user_meta = meta['user']['__meta__']['raw_meta'] input=inference,
label=paddle.layer.data(
name='score', type=paddle.data_type.dense_vector(1)))
```
movie_id = data_layer('movie_id', size=movie_meta[0]['max']) # 电影ID 至此,我们的优化目标就是这个网络配置中的`cost`了。
title = data_layer('title', size=len(movie_meta[1]['dict'])) # 电影名称
genres = data_layer('genres', size=len(movie_meta[2]['dict'])) # 电影类型
user_id = data_layer('user_id', size=user_meta[0]['max']) # 用户ID
gender = data_layer('gender', size=len(user_meta[1]['dict'])) # 用户性别
age = data_layer('age', size=len(user_meta[2]['dict'])) # 用户年龄
occupation = data_layer('occupation', size=len(user_meta[3]['dict'])) # 用户职业
embsize = 256 # 向量维度 ## 训练模型
```
2. 构造“电影”特征。 ### 定义参数
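A neural network model can be thought of simply as a network topology plus parameters. In the previous section we defined the optimization target `cost`, which also defines the network topology. Before training the model, we first need to create its parameters: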
```python
# 电影ID和电影类型分别映射到其对应的特征隐层(256维)。
movie_id_emb = embedding_layer(input=movie_id, size=embsize)
movie_id_hidden = fc_layer(input=movie_id_emb, size=embsize)
genres_emb = fc_layer(input=genres, size=embsize) ```python
parameters = paddle.parameters.create(cost)
```
# 对于电影名称,一个ID序列表示的词语序列,在输入卷积层后, [INFO 2017-03-06 17:12:13,284 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]
# 将得到每个时间窗口的特征(序列特征),然后通过在时间维度 [INFO 2017-03-06 17:12:13,287 networks.py:1478] The output order is [__regression_cost_0__]
# 降采样得到固定维度的特征,整个过程在text_conv_pool实现
title_emb = embedding_layer(input=title, size=embsize)
title_hidden = text_conv_pool(
input=title_emb, context_len=5, hidden_size=embsize)
# 将三个属性的特征表示分别全连接并相加,结果即是电影特征的最终表示
movie_feature = fc_layer(
input=[movie_id_hidden, title_hidden, genres_emb], size=embsize)
```
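3. Build the "user" features. `parameters` is the set of all parameters of the model. It works like a Python dict, so we can list the names of all parameters in this network. Because we did not name the parameters when defining the model, the names here are generated automatically. We could also name each parameter explicitly to make later maintenance easier.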
```python
# 将用户ID,性别,职业,年龄四个属性分别映射到其特征隐层。
user_id_emb = embedding_layer(input=user_id, size=embsize)
user_id_hidden = fc_layer(input=user_id_emb, size=embsize)
gender_emb = embedding_layer(input=gender, size=embsize) ```python
gender_hidden = fc_layer(input=gender_emb, size=embsize) print parameters.keys()
```
age_emb = embedding_layer(input=age, size=embsize) [u'___fc_layer_2__.wbias', u'___fc_layer_2__.w2', u'___embedding_layer_3__.w0', u'___embedding_layer_5__.w0', u'___embedding_layer_2__.w0', u'___embedding_layer_1__.w0', u'___fc_layer_1__.wbias', u'___fc_layer_0__.wbias', u'___fc_layer_1__.w0', u'___fc_layer_0__.w2', u'___fc_layer_0__.w3', u'___fc_layer_0__.w0', u'___fc_layer_0__.w1', u'___fc_layer_2__.w1', u'___fc_layer_2__.w0', u'___embedding_layer_4__.w0', u'___sequence_conv_pool_0___conv_fc.w0', u'___embedding_layer_0__.w0', u'___sequence_conv_pool_0___conv_fc.wbias']
age_hidden = fc_layer(input=age_emb, size=embsize)
occup_emb = embedding_layer(input=occupation, size=embsize)
occup_hidden = fc_layer(input=occup_emb, size=embsize)
# 同样将这四个属性分别全连接并相加形成用户特征的最终表示。 ### 构造训练(trainer)
user_feature = fc_layer(
input=[user_id_hidden, gender_hidden, age_hidden, occup_hidden],
size=embsize)
```
4. 计算余弦相似度,定义损失函数和网络输出 下面,我们根据网络拓扑结构和模型参数来构造出一个本地训练(trainer)。在构造本地训练的时候,我们还需要指定这个训练的优化方法。这里我们使用Adam来作为优化算法
```python
similarity = cos_sim(a=movie_feature, b=user_feature, scale=2)
# 训练时,采用regression_cost作为损失函数计算回归误差代价,并作为网络的输出。 ```python
# 预测时,网络的输出即为余弦相似度。 trainer = paddle.trainer.SGD(cost=cost, parameters=parameters,
if not is_predict: update_equation=paddle.optimizer.Adam(learning_rate=1e-4))
lbl=data_layer('rating', size=1) ```
cost=regression_cost(input=similarity, label=lbl)
outputs(cost)
else:
outputs(similarity)
```
## 训练模型 [INFO 2017-03-06 17:12:13,378 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]
[INFO 2017-03-06 17:12:13,379 networks.py:1478] The output order is [__regression_cost_0__]
执行`sh train.sh` 开始训练模型,将日志写入文件 `log.txt` 并打印在屏幕上。其中指定了总共需要执行 50 个pass。
```shell
set -e
paddle train \
--config=trainer_config.py \ # 神经网络配置文件
--save_dir=./output \ # 模型保存路径
--use_gpu=false \ # 是否使用GPU(默认不使用)
--trainer_count=4\ # 一台机器上面的线程数量
--test_all_data_in_one_period=true \ # 每个训练周期训练一次所有数据,否则每个训练周期测试batch_size个batch数据
--log_period=100 \ # 训练log_period个batch后打印日志
--dot_period=1 \ # 每训练dot_period个batch后打印一个"."
--num_passes=50 2>&1 | tee 'log.txt'
```
成功的输出类似如下: ### 训练
```bash 下面我们开始训练过程。
I0117 01:01:48.585651 9998 TrainerInternal.cpp:165] Batch=100 samples=160000 AvgCost=0.600042 CurrentCost=0.600042 Eval: CurrentEval:
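................................................................................................... We use the dataset readers that Paddle provides directly: `paddle.dataset.movielens.train()` and `paddle.dataset.movielens.test()` serve as the training and test datasets, respectively. The `feeding` dict specifies which element of each sample goes to which data layer.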
I0117 01:02:53.821918 9998 TrainerInternal.cpp:165] Batch=200 samples=320000 AvgCost=0.602855 CurrentCost=0.605668 Eval: CurrentEval:
................................................................................................... For example, the `feeding` dict here says that the data layer `user_id` uses element 0 of each sample produced by the reader, the data layer `gender_id` uses element 1, and so on.
I0117 01:03:58.937922 9998 TrainerInternal.cpp:165] Batch=300 samples=480000 AvgCost=0.605199 CurrentCost=0.609887 Eval: CurrentEval:
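................................................................................................... The training process is fully automatic. We can use an `event_handler` to monitor training or to run tests along the way. Here the `event_handler` plots the training cost and test cost curves.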
I0117 01:05:04.083251 9998 TrainerInternal.cpp:165] Batch=400 samples=640000 AvgCost=0.608693 CurrentCost=0.619175 Eval: CurrentEval:
...................................................................................................
I0117 01:06:09.155859 9998 TrainerInternal.cpp:165] Batch=500 samples=800000 AvgCost=0.613273 CurrentCost=0.631591 Eval: CurrentEval: ```python
.................................................................I0117 01:06:51.109654 9998 TrainerInternal.cpp:181] %matplotlib inline
Pass=49 Batch=565 samples=902826 AvgCost=0.614772 Eval:
I0117 01:07:04.205142 9998 Tester.cpp:115] Test samples=97383 cost=0.721995 Eval: import matplotlib.pyplot as plt
I0117 01:07:04.205281 9998 GradientMachine.cpp:113] Saving parameters to ./output/pass-00049 from IPython import display
import cPickle
feeding = {
'user_id': 0,
'gender_id': 1,
'age_id': 2,
'job_id': 3,
'movie_id': 4,
'category_id': 5,
'movie_title': 6,
'score': 7
}
step=0
train_costs=[],[]
test_costs=[],[]
def event_handler(event):
global step
global train_costs
global test_costs
if isinstance(event, paddle.event.EndIteration):
need_plot = False
if step % 10 == 0: # every 10 batches, record a train cost
train_costs[0].append(step)
train_costs[1].append(event.cost)
if step % 1000 == 0: # every 1000 batches, record a test cost
result = trainer.test(reader=paddle.batch(
paddle.dataset.movielens.test(), batch_size=256))
test_costs[0].append(step)
test_costs[1].append(result.cost)
if step % 100 == 0: # every 100 batches, update cost plot
plt.plot(*train_costs)
plt.plot(*test_costs)
plt.legend(['Train Cost', 'Test Cost'], loc='upper left')
display.clear_output(wait=True)
display.display(plt.gcf())
plt.gcf().clear()
step += 1
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
paddle.dataset.movielens.train(), buf_size=8192),
batch_size=256),
event_handler=event_handler,
feeding=feeding,
num_passes=2)
``` ```
![png](./image/output_32_0.png)
## 应用模型 ## 应用模型
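After a few passes of training, you can evaluate the model. Run the following command to pick the best model, i.e. the pass with the lowest training error. After a few passes of training, you can run inference with the model. We can take any user ID and movie ID and predict the rating that user would give that movie. An example program: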
```shell
./evaluate.py log.txt
```
您将看到: ```python
import copy
user_id = 234
movie_id = 345
```shell user = user_info[user_id]
Best pass is 00036, error is 0.719281, which means predict get error as 0.424052 movie = movie_info[movie_id]
evaluating from pass output/pass-00036
``` feature = user.value() + movie.value()
预测任何用户对于任何一部电影评价的命令如下: infer_dict = copy.copy(feeding)
del infer_dict['score']
```shell prediction = paddle.infer(output=inference, parameters=parameters, input=[feature], feeding=infer_dict)
python prediction.py 'output/pass-00036/' score = (prediction[0][0] + 5.0) / 2
print "[Predict] User %d Rating Movie %d With Score %.2f"%(user_id, movie_id, score)
``` ```
预测程序将读取用户的输入,然后输出预测分数。您会看到如下命令行界面: [INFO 2017-03-06 17:17:08,132 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title]
[INFO 2017-03-06 17:17:08,134 networks.py:1478] The output order is [__cos_sim_0__]
[Predict] User 234 Rating Movie 345 With Score 4.16
```
Input movie_id: 1962
Input user_id: 1
Prediction Score is 4.25
```
## 总结 ## 总结
......
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -113,7 +114,7 @@ Given the feature vectors of users and movies, we compute the relevance using co ...@@ -113,7 +114,7 @@ Given the feature vectors of users and movies, we compute the relevance using co
<img src="image/rec_regression_network_en.png" width="90%" ><br/> <img src="image/rec_regression_network_en.png" width="90%" ><br/>
Figure 3. A hybrid recommendation model. Figure 3. A hybrid recommendation model.
</p> </p>
## Dataset ## Dataset
...@@ -150,6 +151,7 @@ This tutorial goes over traditional approaches in recommender system and a deep ...@@ -150,6 +151,7 @@ This tutorial goes over traditional approaches in recommender system and a deep
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> was created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">the PaddlePaddle community</a> and published under <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Common Creative 4.0 License</a> <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> was created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">the PaddlePaddle community</a> and published under <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Common Creative 4.0 License</a>
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -168,6 +170,6 @@ marked.setOptions({ ...@@ -168,6 +170,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -123,8 +124,8 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ ...@@ -123,8 +124,8 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
<p align="center"> <p align="center">
<img src="image/rec_regression_network.png" width="90%" ><br/> <img src="image/rec_regression_network.png" width="90%" ><br/>
图3. 融合推荐模型 图3. 融合推荐模型
</p> </p>
## 数据准备 ## 数据准备
...@@ -132,278 +133,330 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ ...@@ -132,278 +133,330 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
我们以 [MovieLens 百万数据集(ml-1m)](http://files.grouplens.org/datasets/movielens/ml-1m.zip)为例进行介绍。ml-1m 数据集包含了 6,000 位用户对 4,000 部电影的 1,000,000 条评价(评分范围 1~5 分,均为整数),由 GroupLens Research 实验室搜集整理。 我们以 [MovieLens 百万数据集(ml-1m)](http://files.grouplens.org/datasets/movielens/ml-1m.zip)为例进行介绍。ml-1m 数据集包含了 6,000 位用户对 4,000 部电影的 1,000,000 条评价(评分范围 1~5 分,均为整数),由 GroupLens Research 实验室搜集整理。
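You can run `data/getdata.sh` to download the data. If the download succeeds, the directory `data/ml-1m` will contain the following files: The Paddle API provides a module that loads the data automatically: `paddle.dataset.movielens`.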
```python
import paddle.v2 as paddle
paddle.init(use_gpu=False)
``` ```
movies.dat ratings.dat users.dat README
```python
# Run this block to show dataset's documentation
# help(paddle.dataset.movielens)
``` ```
- movies.dat:电影特征数据,格式为`电影ID::电影名称::电影类型` 在原始数据中包含电影的特征数据,用户的特征数据,和用户对电影的评分。
- ratings.dat:评分数据,格式为`用户ID::电影ID::评分::时间戳`
- users.dat:用户特征数据,格式为`用户ID::性别::年龄::职业::邮编`
- README:数据集的详细描述
### 数据预处理 例如,其中某一个电影特征为:
首先安装 Python 第三方库(推荐使用 Virtualenv):
```shell ```python
pip install -r data/requirements.txt movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0]
``` ```
其次在预处理`./preprocess.sh`过程中,我们将字段配置文件`data/config.json`转化为meta配置文件`meta_config.json`,并生成对应的meta文件`meta.bin`,以完成数据文件的序列化。然后再将`ratings.dat`分为训练集、测试集两部分,把它们的地址写入`train.list`和`test.list`。 <MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>
运行成功后目录`./data` 新增以下文件: 这表示,电影的id是1,标题是《Toy Story》,该电影被分为到三个类别中。这三个类别是动画,儿童,喜剧。
```python
user_info = paddle.dataset.movielens.user_info()
print user_info.values()[0]
``` ```
meta_config.json meta.bin ratings.dat.train ratings.dat.test train.list test.list
<UserInfo id(1), gender(F), age(1), job(10)>
This means the user's ID is 1, the user is female and under 18, and the job ID is 10.
Age is encoded with the following buckets:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
The job is chosen from the following options:
* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
Each training/test sample has the form <user features> + <movie features> + rating.
For example, we fetch the first training sample:
```python
train_set_creator = paddle.dataset.movielens.train()
train_sample = next(train_set_creator())
uid = train_sample[0]
mov_id = train_sample[len(user_info[uid].value())]
print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1])
``` ```
- meta.bin: meta文件是Python的pickle对象, 存储着电影和用户信息。 User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]
- meta_config.json: meta配置文件,用来具体描述如何解析数据集中的每一个字段,由字段配置文件生成。
- ratings.dat.train和ratings.dat.test: 训练集和测试集,训练集已经随机打乱。
- train.list和test.list: 训练集和测试集的文件地址列表。
### Feeding Data to PaddlePaddle That is, user 1 gave movie 1193 a rating of 5.
## 模型配置说明
下面我们开始根据输入数据的形式配置模型。
我们使用 Python 接口传递数据给系统,下面 `dataprovider.py` 给出了完整示例。
```python ```python
from paddle.trainer.PyDataProvider2 import * uid = paddle.layer.data(
from common_utils import meta_to_header name='user_id',
type=paddle.data_type.integer_value(
def __list_to_map__(lst): # 将list转为map paddle.dataset.movielens.max_user_id() + 1))
ret_val = dict() usr_emb = paddle.layer.embedding(input=uid, size=32)
for each in lst:
k, v = each usr_gender_id = paddle.layer.data(
ret_val[k] = v name='gender_id', type=paddle.data_type.integer_value(2))
return ret_val usr_gender_emb = paddle.layer.embedding(input=usr_gender_id, size=16)
def hook(settings, meta, **kwargs): # 读取meta.bin usr_age_id = paddle.layer.data(
# 定义电影特征 name='age_id',
movie_headers = list(meta_to_header(meta, 'movie')) type=paddle.data_type.integer_value(
settings.movie_names = [h[0] for h in movie_headers] len(paddle.dataset.movielens.age_table)))
headers = movie_headers usr_age_emb = paddle.layer.embedding(input=usr_age_id, size=16)
# 定义用户特征 usr_job_id = paddle.layer.data(
user_headers = list(meta_to_header(meta, 'user')) name='job_id',
settings.user_names = [h[0] for h in user_headers] type=paddle.data_type.integer_value(paddle.dataset.movielens.max_job_id(
headers.extend(user_headers) ) + 1))
usr_job_emb = paddle.layer.embedding(input=usr_job_id, size=16)
# 加载评分信息
headers.append(("rating", dense_vector(1)))
settings.input_types = __list_to_map__(headers)
settings.meta = meta
@provider(init_hook=hook, cache=CacheType.CACHE_PASS_IN_MEM)
def process(settings, filename):
with open(filename, 'r') as f:
for line in f:
# 从评分文件中读取评分
user_id, movie_id, score = map(int, line.split('::')[:-1])
# 将评分平移到[-2, +2]范围内的整数
score = float(score - 3)
movie_meta = settings.meta['movie'][movie_id]
user_meta = settings.meta['user'][user_id]
# 添加电影ID与电影特征
outputs = [('movie_id', movie_id - 1)]
for i, each_meta in enumerate(movie_meta):
outputs.append((settings.movie_names[i + 1], each_meta))
# 添加用户ID与用户特征
outputs.append(('user_id', user_id - 1))
for i, each_meta in enumerate(user_meta):
outputs.append((settings.user_names[i + 1], each_meta))
# 添加评分
outputs.append(('rating', [score]))
# 将数据返回给 paddle
yield __list_to_map__(outputs)
``` ```
## 模型配置说明 如上述代码所示,对于每个用户,我们输入4维特征。其中包括`user_id`,`gender_id`,`age_id`,`job_id`。这几维特征均是简单的整数值。为了后续神经网络处理这些特征方便,我们借鉴NLP中的语言模型,将这几维离散的整数值,变换成embedding取出。分别形成`usr_emb`, `usr_gender_emb`, `usr_age_emb`, `usr_job_emb`。
### 数据定义 ```python
usr_combined_features = paddle.layer.fc(
input=[usr_emb, usr_gender_emb, usr_age_emb, usr_job_emb],
size=200,
act=paddle.activation.Tanh())
```
然后,我们对于所有的用户特征,均输入到一个全连接层(fc)中。将所有特征融合为一个200维度的特征。
进而,我们对每一个电影特征做类似的变换,网络配置为:
加载`meta.bin`文件并定义通过`define_py_data_sources2`从dataprovider中读入数据:
```python ```python
from paddle.trainer_config_helpers import * mov_id = paddle.layer.data(
name='movie_id',
type=paddle.data_type.integer_value(
paddle.dataset.movielens.max_movie_id() + 1))
mov_emb = paddle.layer.embedding(input=mov_id, size=32)
mov_categories = paddle.layer.data(
name='category_id',
type=paddle.data_type.sparse_binary_vector(
len(paddle.dataset.movielens.movie_categories())))
mov_categories_hidden = paddle.layer.fc(input=mov_categories, size=32)
movie_title_dict = paddle.dataset.movielens.get_movie_title_dict()
mov_title_id = paddle.layer.data(
name='movie_title',
type=paddle.data_type.integer_value_sequence(len(movie_title_dict)))
mov_title_emb = paddle.layer.embedding(input=mov_title_id, size=32)
mov_title_conv = paddle.networks.sequence_conv_pool(
input=mov_title_emb, hidden_size=32, context_len=3)
mov_combined_features = paddle.layer.fc(
input=[mov_emb, mov_categories_hidden, mov_title_conv],
size=200,
act=paddle.activation.Tanh())
```
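try: The movie ID and the movie categories are each mapped to their own hidden feature layer. The movie title is a sequence of words represented as a sequence of IDs; after it passes through the convolution layer we obtain a feature for each time window (a sequence feature), which is then downsampled along the time dimension into a fixed-size feature. The whole procedure is implemented by `sequence_conv_pool`.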
import cPickle as pickle
except ImportError:
import pickle
is_predict = get_config_arg('is_predict', bool, False) 最后再将电影的特征融合进`mov_combined_features`中。
META_FILE = 'data/meta.bin'
# 加载 meta 文件 ```python
with open(META_FILE, 'rb') as f: inference = paddle.layer.cos_sim(a=usr_combined_features, b=mov_combined_features, size=1, scale=5)
meta = pickle.load(f) ```
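if not is_predict: Next, we use cosine similarity to measure how similar the user features and the movie features are, and fit (regress) this similarity to the user's rating.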
define_py_data_sources2(
'data/train.list',
'data/test.list', ```python
module='dataprovider', cost = paddle.layer.regression_cost(
obj='process', input=inference,
args={'meta': meta}) label=paddle.layer.data(
name='score', type=paddle.data_type.dense_vector(1)))
``` ```
### 算法配置 至此,我们的优化目标就是这个网络配置中的`cost`了。
## 训练模型
### 定义参数
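A neural network model can be thought of simply as a network topology plus parameters. In the previous section we defined the optimization target `cost`, which also defines the network topology. Before training the model, we first need to create its parameters: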
这里我们设置了batch size、网络初始学习率和RMSProp自适应优化方法。
```python ```python
settings( parameters = paddle.parameters.create(cost)
batch_size=1600, learning_rate=1e-3, learning_method=RMSPropOptimizer())
``` ```
### 模型结构 [INFO 2017-03-06 17:12:13,284 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]
[INFO 2017-03-06 17:12:13,287 networks.py:1478] The output order is [__regression_cost_0__]
1. 定义数据输入和参数维度。
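```python `parameters` is the set of all parameters of the model. It works like a Python dict, so we can list the names of all parameters in this network. Because we did not name the parameters when defining the model, the names here are generated automatically. We could also name each parameter explicitly to make later maintenance easier.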
movie_meta = meta['movie']['__meta__']['raw_meta']
user_meta = meta['user']['__meta__']['raw_meta']
movie_id = data_layer('movie_id', size=movie_meta[0]['max']) # 电影ID
title = data_layer('title', size=len(movie_meta[1]['dict'])) # 电影名称
genres = data_layer('genres', size=len(movie_meta[2]['dict'])) # 电影类型
user_id = data_layer('user_id', size=user_meta[0]['max']) # 用户ID
gender = data_layer('gender', size=len(user_meta[1]['dict'])) # 用户性别
age = data_layer('age', size=len(user_meta[2]['dict'])) # 用户年龄
occupation = data_layer('occupation', size=len(user_meta[3]['dict'])) # 用户职业
embsize = 256 # 向量维度 ```python
``` print parameters.keys()
```
2. 构造“电影”特征。 [u'___fc_layer_2__.wbias', u'___fc_layer_2__.w2', u'___embedding_layer_3__.w0', u'___embedding_layer_5__.w0', u'___embedding_layer_2__.w0', u'___embedding_layer_1__.w0', u'___fc_layer_1__.wbias', u'___fc_layer_0__.wbias', u'___fc_layer_1__.w0', u'___fc_layer_0__.w2', u'___fc_layer_0__.w3', u'___fc_layer_0__.w0', u'___fc_layer_0__.w1', u'___fc_layer_2__.w1', u'___fc_layer_2__.w0', u'___embedding_layer_4__.w0', u'___sequence_conv_pool_0___conv_fc.w0', u'___embedding_layer_0__.w0', u'___sequence_conv_pool_0___conv_fc.wbias']
```python
# Map the movie ID and the movie genres to their respective hidden feature layers (256 dimensions).
movie_id_emb = embedding_layer(input=movie_id, size=embsize)
movie_id_hidden = fc_layer(input=movie_id_emb, size=embsize)
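genres_emb = fc_layer(input=genres, size=embsize) ### Constructing the Trainer
# The movie title is a word sequence represented as a sequence of IDs. After the convolution layer, Next, we construct a local trainer from the network topology and the model parameters. When constructing it, we also need to specify the optimization method; here we use Adam as the optimization algorithm.
# we obtain a feature for each time window (sequence features), which are then
# downsampled along the time dimension into a fixed-size feature. text_conv_pool implements the whole process.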
title_emb = embedding_layer(input=title, size=embsize)
title_hidden = text_conv_pool(
input=title_emb, context_len=5, hidden_size=embsize)
# Apply a fully connected layer to each of the three attribute features and sum them; the result is the final movie feature.
movie_feature = fc_layer(
input=[movie_id_hidden, title_hidden, genres_emb], size=embsize)
```
3. Construct the "user" features. ```python
trainer = paddle.trainer.SGD(cost=cost, parameters=parameters,
update_equation=paddle.optimizer.Adam(learning_rate=1e-4))
```
```python [INFO 2017-03-06 17:12:13,378 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]
# Map the four attributes (user ID, gender, occupation, age) to their respective hidden feature layers. [INFO 2017-03-06 17:12:13,379 networks.py:1478] The output order is [__regression_cost_0__]
user_id_emb = embedding_layer(input=user_id, size=embsize)
user_id_hidden = fc_layer(input=user_id_emb, size=embsize)
gender_emb = embedding_layer(input=gender, size=embsize)
gender_hidden = fc_layer(input=gender_emb, size=embsize)
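age_emb = embedding_layer(input=age, size=embsize) ### Training
age_hidden = fc_layer(input=age_emb, size=embsize)
occup_emb = embedding_layer(input=occupation, size=embsize) Now we start the training process.
occup_hidden = fc_layer(input=occup_emb, size=embsize)
# Likewise, apply a fully connected layer to each of these four attributes and sum them to form the final user feature. We use the dataset readers provided by Paddle directly: `paddle.dataset.movielens.train()` and `paddle.dataset.movielens.test()` serve as the training and test datasets, respectively. The `feeding` dictionary specifies which element of each sample goes to which data layer.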
user_feature = fc_layer(
input=[user_id_hidden, gender_hidden, age_hidden, occup_hidden],
size=embsize)
```
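4. Compute the cosine similarity, and define the loss function and the network output. For example, the `feeding` dictionary here says that the data layer `user_id` uses element 0 of each sample from the reader, the data layer `gender_id` uses element 1, and so on.
```python Training is fully automatic. We can use an event_handler to observe the training process, run tests, and so on. Here the event_handler plots the training and test error curves and saves the model.
similarity = cos_sim(a=movie_feature, b=user_feature, scale=2)
# During training, regression_cost is used as the loss function to compute the regression error, and it is the network output.
# During prediction, the network output is the cosine similarity.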
if not is_predict:
lbl=data_layer('rating', size=1)
cost=regression_cost(input=similarity, label=lbl)
outputs(cost)
else:
outputs(similarity)
```
## Training the Model ```python
%matplotlib inline
import matplotlib.pyplot as plt
from IPython import display
import cPickle
feeding = {
'user_id': 0,
'gender_id': 1,
'age_id': 2,
'job_id': 3,
'movie_id': 4,
'category_id': 5,
'movie_title': 6,
'score': 7
}
Run `sh train.sh` to start training the model; the log is written to `log.txt` and also printed to the screen. The script specifies a total of 50 passes. step=0
```shell train_costs=[],[]
set -e test_costs=[],[]
paddle train \
--config=trainer_config.py \ # neural network configuration file def event_handler(event):
--save_dir=./output \ # directory where models are saved global step
--use_gpu=false \ # whether to use the GPU (off by default) global train_costs
--trainer_count=4\ # number of threads on one machine global test_costs
--test_all_data_in_one_period=true \ # test on all data once per training period; otherwise test on batch_size batches per period if isinstance(event, paddle.event.EndIteration):
--log_period=100 \ # print a log line every log_period batches need_plot = False
--dot_period=1 \ # print a "." every dot_period batches if step % 10 == 0: # every 10 batches, record a train cost
--num_passes=50 2>&1 | tee 'log.txt' train_costs[0].append(step)
train_costs[1].append(event.cost)
if step % 1000 == 0: # every 1000 batches, record a test cost
result = trainer.test(reader=paddle.batch(
paddle.dataset.movielens.test(), batch_size=256))
test_costs[0].append(step)
test_costs[1].append(result.cost)
if step % 100 == 0: # every 100 batches, update cost plot
plt.plot(*train_costs)
plt.plot(*test_costs)
plt.legend(['Train Cost', 'Test Cost'], loc='upper left')
display.clear_output(wait=True)
display.display(plt.gcf())
plt.gcf().clear()
step += 1
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
paddle.dataset.movielens.train(), buf_size=8192),
batch_size=256),
event_handler=event_handler,
feeding=feeding,
num_passes=2)
``` ```
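The `feeding` dictionary in the training code maps each data layer name to a position in a sample. The following is a minimal plain-Python sketch of that lookup, not PaddlePaddle's own implementation; the helper `select_fields` and the sample values are hypothetical and only illustrate the indexing.

```python
# Sketch only: plain-Python illustration of the name-to-index mapping.
# The sample values below are made up; they are not taken from MovieLens.
feeding = {
    'user_id': 0, 'gender_id': 1, 'age_id': 2, 'job_id': 3,
    'movie_id': 4, 'category_id': 5, 'movie_title': 6, 'score': 7,
}


def select_fields(sample, feeding):
    """Map each data-layer name to the element of `sample` at its index."""
    return {name: sample[index] for name, index in feeding.items()}


# One hypothetical sample, ordered as in `feeding`.
sample = (1, 0, 2, 10, 1193, [7], [23, 5, 9], [5.0])
fields = select_fields(sample, feeding)

# At inference time there is no label, so the 'score' entry is dropped.
infer_feeding = dict(feeding)
del infer_feeding['score']
infer_fields = select_fields(sample[:7], infer_feeding)
```

Copying the dictionary before deleting `'score'` keeps the training mapping intact, which mirrors the `copy.copy(feeding)` call in the inference example.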
Successful output looks like this:
![png](./image/output_32_0.png)
```bash
I0117 01:01:48.585651 9998 TrainerInternal.cpp:165] Batch=100 samples=160000 AvgCost=0.600042 CurrentCost=0.600042 Eval: CurrentEval:
...................................................................................................
I0117 01:02:53.821918 9998 TrainerInternal.cpp:165] Batch=200 samples=320000 AvgCost=0.602855 CurrentCost=0.605668 Eval: CurrentEval:
...................................................................................................
I0117 01:03:58.937922 9998 TrainerInternal.cpp:165] Batch=300 samples=480000 AvgCost=0.605199 CurrentCost=0.609887 Eval: CurrentEval:
...................................................................................................
I0117 01:05:04.083251 9998 TrainerInternal.cpp:165] Batch=400 samples=640000 AvgCost=0.608693 CurrentCost=0.619175 Eval: CurrentEval:
...................................................................................................
I0117 01:06:09.155859 9998 TrainerInternal.cpp:165] Batch=500 samples=800000 AvgCost=0.613273 CurrentCost=0.631591 Eval: CurrentEval:
.................................................................I0117 01:06:51.109654 9998 TrainerInternal.cpp:181]
Pass=49 Batch=565 samples=902826 AvgCost=0.614772 Eval:
I0117 01:07:04.205142 9998 Tester.cpp:115] Test samples=97383 cost=0.721995 Eval:
I0117 01:07:04.205281 9998 GradientMachine.cpp:113] Saving parameters to ./output/pass-00049
```
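## Applying the Model ## Applying the Model
After a few passes of training, you can evaluate the model. Run the following command to obtain the best model by choosing the pass with the smallest training error. After a few passes of training, you can run inference with the model. We can take any user ID and movie ID and predict the rating that user would give that movie. An example program: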
```shell
./evaluate.py log.txt
```
You will see: ```python
import copy
user_id = 234
movie_id = 345
```shell user = user_info[user_id]
Best pass is 00036, error is 0.719281, which means predict get error as 0.424052 movie = movie_info[movie_id]
evaluating from pass output/pass-00036
```
The command for predicting any user's rating of any movie is: feature = user.value() + movie.value()
```shell infer_dict = copy.copy(feeding)
python prediction.py 'output/pass-00036/' del infer_dict['score']
prediction = paddle.infer(output=inference, parameters=parameters, input=[feature], feeding=infer_dict)
score = (prediction[0][0] + 5.0) / 2
print "[Predict] User %d Rating Movie %d With Score %.2f"%(user_id, movie_id, score)
``` ```
The prediction program reads the user's input and prints the predicted score. You will see a command-line interface like this: [INFO 2017-03-06 17:17:08,132 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title]
[INFO 2017-03-06 17:17:08,134 networks.py:1478] The output order is [__cos_sim_0__]
[Predict] User 234 Rating Movie 345 With Score 4.16
```
Input movie_id: 1962
Input user_id: 1
Prediction Score is 4.25
```
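The inference example converts the network output into a rating with `(prediction[0][0] + 5.0) / 2`. A minimal sketch of that rescaling follows, assuming the output is a cosine similarity scaled into the range [-5, 5]; that range is inferred from the constants in the formula and is not stated in the source. The function name `rescale_similarity` is hypothetical.

```python
def rescale_similarity(similarity, scale=5.0):
    """Linearly map a value in [-scale, scale] onto [0, scale].

    Assumption: the network output is a cosine similarity multiplied
    by `scale`, so it lies in [-scale, scale].
    """
    return (similarity + scale) / 2.0


lowest = rescale_similarity(-5.0)   # most dissimilar user/movie pair
highest = rescale_similarity(5.0)   # most similar user/movie pair
example = rescale_similarity(3.32)  # an arbitrary intermediate value
```

Under this assumption the endpoints map to 0 and 5, and every intermediate similarity lands proportionally in between.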
## Summary ## Summary
...@@ -421,6 +474,7 @@ Prediction Score is 4.25 ...@@ -421,6 +474,7 @@ Prediction Score is 4.25
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -439,6 +493,6 @@ marked.setOptions({ ...@@ -439,6 +493,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
...@@ -17,9 +17,9 @@ set -e ...@@ -17,9 +17,9 @@ set -e
UNAME_STR=`uname` UNAME_STR=`uname`
if [[ ${UNAME_STR} == 'Linux' ]]; then if [[ ${UNAME_STR} == 'Linux' ]]; then
SHUF_PROG='shuf' SHUF_PROG='shuf'
else else
SHUF_PROG='gshuf' SHUF_PROG='gshuf'
fi fi
......
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -39,6 +40,7 @@ ...@@ -39,6 +40,7 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -57,6 +59,6 @@ marked.setOptions({ ...@@ -57,6 +59,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -39,6 +40,7 @@ ...@@ -39,6 +40,7 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -57,6 +59,6 @@ marked.setOptions({ ...@@ -57,6 +59,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
...@@ -22,7 +22,7 @@ For a piece of text, BOW model ignores its word order, grammar and syntax, and r ...@@ -22,7 +22,7 @@ For a piece of text, BOW model ignores its word order, grammar and syntax, and r
In this chapter, we introduce our deep learning model which handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework, and has large performance improvement over traditional methods \[[1](#Reference)\]. In this chapter, we introduce our deep learning model which handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework, and has large performance improvement over traditional methods \[[1](#Reference)\].
## Model Overview ## Model Overview
The models used in this chapter are a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network), each with some specific extensions. The models used in this chapter are a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network), each with some specific extensions.
### Convolutional Neural Networks for Texts (CNN) ### Convolutional Neural Networks for Texts (CNN)
...@@ -33,10 +33,10 @@ CNN mainly contains convolution and pooling operation, with various extensions. ...@@ -33,10 +33,10 @@ CNN mainly contains convolution and pooling operation, with various extensions.
<p align="center"> <p align="center">
<img src="image/text_cnn_en.png" width = "80%" align="center"/><br/> <img src="image/text_cnn_en.png" width = "80%" align="center"/><br/>
Figure 1. CNN for text modeling. Figure 1. CNN for text modeling.
</p> </p>
Assume the sentence has length $n$ and the $i$-th word has embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality. Assume the sentence has length $n$ and the $i$-th word has embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality.
First, we concatenate the words together: we piece every $h$ words as a window of length $h$: $x_{i:i+h-1}$. It refers to $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $i$ is the first word in the window, ranging from $1$ to $n-h+1$: $x_{i:i+h-1}\in\mathbb{R}^{hk}$. First, we concatenate the words together: we piece every $h$ words as a window of length $h$: $x_{i:i+h-1}$. It refers to $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $i$ is the first word in the window, ranging from $1$ to $n-h+1$: $x_{i:i+h-1}\in\mathbb{R}^{hk}$.
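The windowing step can be sketched in plain Python as follows. This is an illustrative sketch rather than the chapter's implementation: the function name `concat_windows` and the toy embeddings are hypothetical, and the code uses 0-based indices where the text uses 1-based ones.

```python
def concat_windows(embeddings, h):
    """Concatenate every h consecutive word embeddings into one vector.

    `embeddings` is a list of n vectors of length k. The result holds
    n - h + 1 windows, each a flat list of length h * k.
    """
    n = len(embeddings)
    windows = []
    for i in range(n - h + 1):  # i is the (0-based) first word of the window
        window = []
        for vec in embeddings[i:i + h]:
            window.extend(vec)
        windows.append(window)
    return windows


# Toy sentence with n = 4 words and embedding dimensionality k = 2.
sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
windows = concat_windows(sentence, h=3)
```

With $n=4$, $h=3$, and $k=2$ this yields $n-h+1=2$ windows, each of length $hk=6$.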
...@@ -60,7 +60,7 @@ RNN is an effective model for sequential data. Theoretical, the computational a ...@@ -60,7 +60,7 @@ RNN is an effective model for sequential data. Theoretical, the computational a
<p align="center"> <p align="center">
<img src="image/rnn.png" width = "60%" align="center"/><br/> <img src="image/rnn.png" width = "60%" align="center"/><br/>
Figure 2. An illustration of an unrolled RNN across “time”. Figure 2. An illustration of an unrolled RNN across “time”.
</p> </p>
As shown in Figure 2, we unroll an RNN: at the $t$-th time step, the network takes the $t$-th input vector and the latent state from the previous time step, $h_{t-1}$, as inputs and computes the latent state of the current step. The whole process is repeated until all inputs are consumed. If we regard the RNN as a function $f$, it can be formulated as: As shown in Figure 2, we unroll an RNN: at the $t$-th time step, the network takes the $t$-th input vector and the latent state from the previous time step, $h_{t-1}$, as inputs and computes the latent state of the current step. The whole process is repeated until all inputs are consumed. If we regard the RNN as a function $f$, it can be formulated as:
...@@ -140,7 +140,7 @@ If it runs successfully, `./data/pre-imdb` will contain: ...@@ -140,7 +140,7 @@ If it runs successfully, `./data/pre-imdb` will contain:
dict.txt labels.list test.list test_part_000 train.list train_part_000 dict.txt labels.list test.list test_part_000 train.list train_part_000
``` ```
* test\_part\_000 and train\_part\_000: the complete labeled test and training sets; the training set is shuffled. * test\_part\_000 and train\_part\_000: the complete labeled test and training sets; the training set is shuffled.
* train.list and test.list: training and testing file-list (containing list of file names). * train.list and test.list: training and testing file-list (containing list of file names).
* dict.txt: dictionary generated from training set. * dict.txt: dictionary generated from training set.
* labels.list: class label, 0 stands for negative while 1 for positive. * labels.list: class label, 0 stands for negative while 1 for positive.
...@@ -239,7 +239,7 @@ gradient_clipping_threshold=25) ...@@ -239,7 +239,7 @@ gradient_clipping_threshold=25)
### Model Structure ### Model Structure
We use PaddlePaddle to implement two classification algorithms based on the models described above: [Text-CNN](#Text-CNN(CNN)) and [Stacked-bidirectional LSTM](#Stacked-bidirectional LSTM(Stacked Bidirectional LSTM)). We use PaddlePaddle to implement two classification algorithms based on the models described above: [Text-CNN](#Text-CNN(CNN)) and [Stacked-bidirectional LSTM](#Stacked-bidirectional LSTM(Stacked Bidirectional LSTM)).
#### Implementation of Text CNN #### Implementation of Text CNN
```python ```python
def convolution_net(input_dim, def convolution_net(input_dim,
class_dim=2, class_dim=2,
...@@ -477,7 +477,7 @@ predicting label is pos ...@@ -477,7 +477,7 @@ predicting label is pos
`10007_10.txt` is in the folder `./data/aclImdb/test/pos` and the predicted label is also pos, so the prediction is correct. `10007_10.txt` is in the folder `./data/aclImdb/test/pos` and the predicted label is also pos, so the prediction is correct.
## Summary ## Summary
In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters we will see how these models can be applied in other tasks. In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters we will see how these models can be applied in other tasks.
## Reference ## Reference
1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014. 1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014. 2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
......
...@@ -57,7 +57,7 @@ $$\hat c=max(c)$$ ...@@ -57,7 +57,7 @@ $$\hat c=max(c)$$
$$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$ $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$
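where $W_{xh}$ is the input-to-hidden weight matrix, $W_{hh}$ is the hidden-to-hidden weight matrix, $b_h$ is the hidden-layer bias vector, and $\sigma$ is the $sigmoid$ function. where $W_{xh}$ is the input-to-hidden weight matrix, $W_{hh}$ is the hidden-to-hidden weight matrix, $b_h$ is the hidden-layer bias vector, and $\sigma$ is the $sigmoid$ function.
In natural language processing, words (in one-hot representation) are usually first mapped to their word embeddings, which then serve as the input $x_t$ of the recurrent neural network at each time step. In addition, other layers can be connected on top of the hidden layer of the recurrent neural network as needed. For example, the hidden-layer output of one recurrent neural network can be fed as input to another to build a deep (stacked) recurrent neural network, or the hidden state at the last time step can be extracted as the sentence representation and passed to a classification model. In natural language processing, words (in one-hot representation) are usually first mapped to their word embeddings, which then serve as the input $x_t$ of the recurrent neural network at each time step. In addition, other layers can be connected on top of the hidden layer of the recurrent neural network as needed. For example, the hidden-layer output of one recurrent neural network can be fed as input to another to build a deep (stacked) recurrent neural network, or the hidden state at the last time step can be extracted as the sentence representation and passed to a classification model.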
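The recurrence $h_t=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$ can be sketched in plain Python as below. This is a minimal illustration, not PaddlePaddle code: the helper names (`sigmoid`, `matvec`, `rnn_step`, `run_rnn`), the zero initial state, and the toy weights are all assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step: h_t = sigmoid(W_xh x_t + W_hh h_prev + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [sigmoid(a + b + c)
            for a, b, c in zip(from_input, from_state, b_h)]


def run_rnn(inputs, W_xh, W_hh, b_h):
    """Unroll the RNN over all inputs, starting from a zero state."""
    h = [0.0] * len(b_h)  # assumed initial hidden state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


# Toy setup: 2-dimensional inputs, 2-dimensional hidden state, zero weights.
zeros = [[0.0, 0.0], [0.0, 0.0]]
states = run_rnn([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], zeros, zeros, [0.0, 0.0])
last_state = states[-1]  # could serve as a sentence representation
```

With all-zero weights every hidden unit equals $\sigma(0)=0.5$, which makes the sketch easy to check by hand; `last_state` corresponds to taking the final hidden state as the sentence representation.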
### Long Short-Term Memory (LSTM) ### Long Short-Term Memory (LSTM)
......
...@@ -33,7 +33,7 @@ echo "Unzipping..." ...@@ -33,7 +33,7 @@ echo "Unzipping..."
tar -zxvf aclImdb_v1.tar.gz tar -zxvf aclImdb_v1.tar.gz
unzip master.zip unzip master.zip
#move train and test set to imdb_data directory #move train and test set to imdb_data directory
#in order to process when training #in order to process when training
mkdir -p imdb/train mkdir -p imdb/train
mkdir -p imdb/test mkdir -p imdb/test
......
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -63,7 +64,7 @@ For a piece of text, BOW model ignores its word order, grammar and syntax, and r ...@@ -63,7 +64,7 @@ For a piece of text, BOW model ignores its word order, grammar and syntax, and r
In this chapter, we introduce our deep learning model which handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework, and has large performance improvement over traditional methods \[[1](#Reference)\]. In this chapter, we introduce our deep learning model which handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework, and has large performance improvement over traditional methods \[[1](#Reference)\].
## Model Overview ## Model Overview
The models used in this chapter are a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network), each with some specific extensions. The models used in this chapter are a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network), each with some specific extensions.
### Convolutional Neural Networks for Texts (CNN) ### Convolutional Neural Networks for Texts (CNN)
...@@ -74,10 +75,10 @@ CNN mainly contains convolution and pooling operation, with various extensions. ...@@ -74,10 +75,10 @@ CNN mainly contains convolution and pooling operation, with various extensions.
<p align="center"> <p align="center">
<img src="image/text_cnn_en.png" width = "80%" align="center"/><br/> <img src="image/text_cnn_en.png" width = "80%" align="center"/><br/>
Figure 1. CNN for text modeling. Figure 1. CNN for text modeling.
</p> </p>
Assume the sentence has length $n$ and the $i$-th word has embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality. Assume the sentence has length $n$ and the $i$-th word has embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality.
First, we concatenate the words together: we piece every $h$ words as a window of length $h$: $x_{i:i+h-1}$. It refers to $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $i$ is the first word in the window, ranging from $1$ to $n-h+1$: $x_{i:i+h-1}\in\mathbb{R}^{hk}$. First, we concatenate the words together: we piece every $h$ words as a window of length $h$: $x_{i:i+h-1}$. It refers to $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $i$ is the first word in the window, ranging from $1$ to $n-h+1$: $x_{i:i+h-1}\in\mathbb{R}^{hk}$.
...@@ -101,7 +102,7 @@ RNN is an effective model for sequential data. Theoretical, the computational a ...@@ -101,7 +102,7 @@ RNN is an effective model for sequential data. Theoretical, the computational a
<p align="center"> <p align="center">
<img src="image/rnn.png" width = "60%" align="center"/><br/> <img src="image/rnn.png" width = "60%" align="center"/><br/>
Figure 2. An illustration of an unrolled RNN across “time”. Figure 2. An illustration of an unrolled RNN across “time”.
</p> </p>
As shown in Figure 2, we unroll an RNN: at the $t$-th time step, the network takes the $t$-th input vector and the latent state from the previous time step, $h_{t-1}$, as inputs and computes the latent state of the current step. The whole process is repeated until all inputs are consumed. If we regard the RNN as a function $f$, it can be formulated as: As shown in Figure 2, we unroll an RNN: at the $t$-th time step, the network takes the $t$-th input vector and the latent state from the previous time step, $h_{t-1}$, as inputs and computes the latent state of the current step. The whole process is repeated until all inputs are consumed. If we regard the RNN as a function $f$, it can be formulated as:
...@@ -181,7 +182,7 @@ If it runs successfully, `./data/pre-imdb` will contain: ...@@ -181,7 +182,7 @@ If it runs successfully, `./data/pre-imdb` will contain:
dict.txt labels.list test.list test_part_000 train.list train_part_000 dict.txt labels.list test.list test_part_000 train.list train_part_000
``` ```
* test\_part\_000 and train\_part\_000: the complete labeled test and training sets; the training set is shuffled. * test\_part\_000 and train\_part\_000: the complete labeled test and training sets; the training set is shuffled.
* train.list and test.list: training and testing file-list (containing list of file names). * train.list and test.list: training and testing file-list (containing list of file names).
* dict.txt: dictionary generated from training set. * dict.txt: dictionary generated from training set.
* labels.list: class label, 0 stands for negative while 1 for positive. * labels.list: class label, 0 stands for negative while 1 for positive.
...@@ -280,7 +281,7 @@ gradient_clipping_threshold=25) ...@@ -280,7 +281,7 @@ gradient_clipping_threshold=25)
### Model Structure ### Model Structure
We use PaddlePaddle to implement two classification algorithms based on the models described above: [Text-CNN](#Text-CNN(CNN)) and [Stacked-bidirectional LSTM](#Stacked-bidirectional LSTM(Stacked Bidirectional LSTM)). We use PaddlePaddle to implement two classification algorithms based on the models described above: [Text-CNN](#Text-CNN(CNN)) and [Stacked-bidirectional LSTM](#Stacked-bidirectional LSTM(Stacked Bidirectional LSTM)).
#### Implementation of Text CNN #### Implementation of Text CNN
```python ```python
def convolution_net(input_dim, def convolution_net(input_dim,
class_dim=2, class_dim=2,
...@@ -518,7 +519,7 @@ predicting label is pos ...@@ -518,7 +519,7 @@ predicting label is pos
`10007_10.txt` is in the folder `./data/aclImdb/test/pos` and the predicted label is also pos, so the prediction is correct. `10007_10.txt` is in the folder `./data/aclImdb/test/pos` and the predicted label is also pos, so the prediction is correct.
## Summary ## Summary
In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters we will see how these models can be applied in other tasks. In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters we will see how these models can be applied in other tasks.
## Reference ## Reference
1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014. 1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014. 2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
...@@ -532,6 +533,7 @@ In this chapter, we use sentiment analysis as an example to introduce applying d ...@@ -532,6 +533,7 @@ In this chapter, we use sentiment analysis as an example to introduce applying d
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -550,6 +552,6 @@ marked.setOptions({ ...@@ -550,6 +552,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -98,7 +99,7 @@ $$\hat c=max(c)$$ ...@@ -98,7 +99,7 @@ $$\hat c=max(c)$$
$$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$ $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$
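where $W_{xh}$ is the input-to-hidden weight matrix, $W_{hh}$ is the hidden-to-hidden weight matrix, $b_h$ is the hidden-layer bias vector, and $\sigma$ is the $sigmoid$ function. where $W_{xh}$ is the input-to-hidden weight matrix, $W_{hh}$ is the hidden-to-hidden weight matrix, $b_h$ is the hidden-layer bias vector, and $\sigma$ is the $sigmoid$ function.
In natural language processing, words (in one-hot representation) are usually first mapped to their word embeddings, which then serve as the input $x_t$ of the recurrent neural network at each time step. In addition, other layers can be connected on top of the hidden layer of the recurrent neural network as needed. For example, the hidden-layer output of one recurrent neural network can be fed as input to another to build a deep (stacked) recurrent neural network, or the hidden state at the last time step can be extracted as the sentence representation and passed to a classification model. In natural language processing, words (in one-hot representation) are usually first mapped to their word embeddings, which then serve as the input $x_t$ of the recurrent neural network at each time step. In addition, other layers can be connected on top of the hidden layer of the recurrent neural network as needed. For example, the hidden-layer output of one recurrent neural network can be fed as input to another to build a deep (stacked) recurrent neural network, or the hidden state at the last time step can be extracted as the sentence representation and passed to a classification model.
### Long Short-Term Memory (LSTM) ### Long Short-Term Memory (LSTM)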
...@@ -353,6 +354,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.11432000249624252} ...@@ -353,6 +354,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.11432000249624252}
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -371,6 +373,6 @@ marked.setOptions({ ...@@ -371,6 +373,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
...@@ -24,7 +24,7 @@ from optparse import OptionParser ...@@ -24,7 +24,7 @@ from optparse import OptionParser
from paddle.utils.preprocess_util import * from paddle.utils.preprocess_util import *
""" """
Usage: run the following command to show the help message.
python preprocess.py -h python preprocess.py -h
""" """
......
...@@ -36,8 +36,8 @@ The neural network based model does not require storing huge hash tables of stat ...@@ -36,8 +36,8 @@ The neural network based model does not require storing huge hash tables of stat
In this section, after training the word embedding model, we can use the data visualization algorithm t-SNE\[[4](#reference)\] to project the word embedding vectors onto a two-dimensional space and plot them (see figure below). The figure shows that semantically related words -- *a*, *the*, and *these*, or *big* and *huge* -- are close to each other in the projected space, while unrelated words -- *say* and *business*, or *decision* and *japan* -- are far apart.
<p align="center"> <p align="center">
<img src = "image/2d_similarity.png" width=400><br/> <img src = "image/2d_similarity.png" width=400><br/>
Figure 1. Two-dimensional projection of word embeddings
</p> </p>
### Cosine Similarity ### Cosine Similarity
...@@ -70,16 +70,16 @@ Before diving into word embedding models, we will first introduce the concept of ...@@ -70,16 +70,16 @@ Before diving into word embedding models, we will first introduce the concept of
In general, models that generate the probability of a sequence can be applied to many fields, such as machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition. Take information retrieval as an example. If you search for "how long is a football bame" (where bame is a medical noun), the search engine will ask whether you meant "how long is a football game" instead. This is because the probability of "how long is a football bame" is very low according to the language model, and among all the words easily confused with "bame", "game" yields the most probable sentence.
#### Target Probability #### Target Probability
For a language model's target probability $P(w_1, ..., w_T)$, if the words in the sentence were independent, the joint probability of the whole sentence would be the product of each word's probability:
$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$
However, the frequency of a word in a sentence typically depends on the words before it, so canonical language models define the target probability using conditional probabilities:
$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
### N-gram neural model ### N-gram neural model
In computational linguistics, the n-gram is an important method for representing text. An n-gram is a contiguous sequence of n items from a given text. Depending on the application, each item can be a letter, a syllable, or a word. The n-gram model is also an important method in statistical language modeling. When training a language model with n-grams, the first n-1 words of each n-gram are used to predict the *n*th word.
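As a minimal sketch of the idea above, the following splits a tokenized sentence into n-grams and separates each one into its (n-1)-word context and the word to predict. The helper name `ngram_samples` and the example sentence are illustrative assumptions, not part of the tutorial's code.

```python
def ngram_samples(tokens, n):
    """Yield (context, target) pairs: the first n-1 words of each n-gram and its n-th word."""
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        yield tuple(gram[:-1]), gram[-1]


sentence = "i have a dream that one day".split()
samples = list(ngram_samples(sentence, 5))
for context, target in samples:
    print(context, "->", target)
```

With a 7-word sentence and n=5, this yields three training samples, each using four words of history.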
...@@ -89,47 +89,47 @@ We have previously described language model using conditional probability, where ...@@ -89,47 +89,47 @@ We have previously described language model using conditional probability, where
$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$
Given a real corpus in which all sentences are meaningful, the n-gram model should maximize the following objective function:
$$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ is the conditional probability of the current word $w_t$ given its previous $n-1$ words, and $R(\theta)$ is the parameter regularization term.
<p align="center"> <p align="center">
<img src="image/nnlm_en.png" width=500><br/> <img src="image/nnlm_en.png" width=500><br/>
Figure 2. N-gram neural network model Figure 2. N-gram neural network model
</p> </p>
Figure 2 shows the N-gram neural network model. From the bottom up, the model has the following components:
- For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs, for each of the `|V|` words in the dictionary, the probability that it is the $t$-th word.
  Every input word $w_{t-n+1},...w_{t-1}$ is first mapped to its word embedding $C(w_{t-n+1}),...C(w_{t-1})$ through a mapping matrix.
- All the word embeddings are then concatenated into a single vector, which is passed through a nonlinear mapping to obtain the hidden representation of the context words:
  $$g = U\tanh(\theta^Tx + b_1) + Wx + b_2$$
  where $x$ is the large vector concatenated from all the word embeddings and represents the context; $\theta$, $U$, $b_1$, $b_2$ and $W$ are the parameters connecting the word embedding layer to the hidden layer. $g$ holds the unnormalized probabilities of all output words, and $g_i$ is the unnormalized probability of the $i$-th word in the dictionary.
- By the definition of softmax, normalizing $g_i$ gives the probability that the output word is $w_t$:
  $$P(w_t | w_{t-n+1}, ..., w_{t-1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$
- The cost of the entire network is the multi-class cross-entropy, given by the following loss function:
  $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log(softmax(g_k^i))$$
  where $y_k^i$ is the true label for the $k$-th class in the $i$-th sample ($0$ or $1$), and $softmax(g_k^i)$ is the softmax probability for the $k$-th class in the $i$-th sample.
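The components listed above can be sketched in NumPy as a single forward pass. This is only an illustration under assumed toy sizes (`V`, `D`, `H`) and randomly initialized parameters; it is not the PaddlePaddle configuration used later in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H, n = 10, 4, 8, 5                    # toy vocabulary size, embedding dim, hidden size, n-gram order

C = rng.normal(size=(V, D))                 # embedding matrix: row i is C(w_i)
theta = rng.normal(size=((n - 1) * D, H))   # embedding-to-hidden weights
b1 = np.zeros(H)
U = rng.normal(size=(V, H))                 # hidden-to-output weights
W = rng.normal(size=(V, (n - 1) * D))       # direct embedding-to-output weights
b2 = np.zeros(V)


def forward(context_ids):
    """Return the softmax distribution over the vocabulary for one context."""
    x = C[context_ids].reshape(-1)                      # concatenate the n-1 embeddings
    g = U @ np.tanh(theta.T @ x + b1) + W @ x + b2      # unnormalized scores g
    e = np.exp(g - g.max())                             # subtract max for numerical stability
    return e / e.sum()


context_ids = [1, 3, 5, 7]
target_id = 2
probs = forward(context_ids)
loss = -np.log(probs[target_id])            # cross-entropy against a one-hot label
print(probs.sum(), loss)
```

Because the label is one-hot, the inner sum of the cross-entropy reduces to the negative log-probability of the true word.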
### Continuous Bag-of-Words model (CBOW)
The CBOW model predicts the current word from the N words before it and the N words after it. When $N=2$, the model is as shown in the figure below:
<p align="center"> <p align="center">
<img src="image/cbow_en.png" width=250><br/> <img src="image/cbow_en.png" width=250><br/>
Figure 3. CBOW model Figure 3. CBOW model
</p> </p>
Specifically, CBOW ignores the order of the context words and uses the average of their word embeddings to predict the current word:
...@@ -138,30 +138,30 @@ $$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$ ...@@ -138,30 +138,30 @@ $$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
where $x_t$ is the word embedding of the $t$-th word, the classification score vector is $z=U*\text{context}$, the final classification $y$ uses softmax, and the loss function is the multi-class cross-entropy.
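A small NumPy sketch, under assumed toy dimensions and random parameters, shows why CBOW is insensitive to word order: averaging the context embeddings gives the same prediction for any permutation of the context.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 4                    # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, D))     # word embeddings, one row per word
U = rng.normal(size=(V, D))     # output projection


def cbow_probs(context_ids):
    """Average the context embeddings, score every word, and apply softmax."""
    context = E[context_ids].mean(axis=0)
    z = U @ context
    e = np.exp(z - z.max())
    return e / e.sum()


p1 = cbow_probs([1, 2, 4, 5])
p2 = cbow_probs([5, 4, 2, 1])   # same context words in a different order
print(np.allclose(p1, p2))
```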
### Skip-gram model ### Skip-gram model
The advantage of CBOW is that it smooths over the word embeddings of the context and reduces noise, so it is very effective on small datasets. Skip-gram instead uses a word to predict its context, which yields many context samples for the given word, so it can be used on larger datasets.
<p align="center"> <p align="center">
<img src="image/skipgram_en.png" width=250><br/> <img src="image/skipgram_en.png" width=250><br/>
Figure 4. Skip-gram model Figure 4. Skip-gram model
</p> </p>
As illustrated in the figure above, the skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (the $n$ words before and the $n$ words after the given word), and then combines the softmax classification losses of all those $2n$ words.
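To make the "one word predicts its context" idea concrete, the sketch below enumerates the (center, context) training pairs that a skip-gram model would see. The helper `skipgram_pairs` and the sample sentence are hypothetical; words near the sentence boundary simply get fewer than $2n$ context words here, which is one possible convention.

```python
def skipgram_pairs(tokens, n):
    """Return (center, context) pairs using up to n words on each side of the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - n), min(len(tokens), i + n + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, 2)
for center, context in pairs:
    print(center, "->", context)
```

Each center word contributes several training samples, which is why skip-gram produces many more samples per word than CBOW.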
## Data Preparation ## Data Preparation
## Model Configuration ## Model Configuration
<p align="center"> <p align="center">
<img src="image/ngram.en.png" width=400><br/> <img src="image/ngram.en.png" width=400><br/>
Figure 5. N-gram neural network model in model configuration Figure 5. N-gram neural network model in model configuration
</p> </p>
## Model Training ## Model Training
## Model Application ## Model Application
## Conclusion ## Conclusion
This chapter introduces word embeddings, the relationship between language models and word embeddings, and how to train neural networks to learn word embeddings.
......
# Word Embedding
The source code for this tutorial is in [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/word2vec). For first-time use, please refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
...@@ -6,7 +7,7 @@ ...@@ -6,7 +7,7 @@
In this chapter we introduce the vector representation of words, also known as word embedding. Word embedding is a common operation in natural language processing and a common underlying technology behind Internet services such as search engines, advertising systems, and recommender systems.
In these Internet services, we often need to compare the relevance of two words or two pieces of text. To make such comparisons, we first need to represent words in a form suitable for computers to process. Perhaps the most natural choice is the vector space model.
In this approach, each word is represented as a real-valued vector (one-hot vector) whose length is the size of the dictionary. Each dimension corresponds to one word in the dictionary; the value at the dimension of the word itself is 1, and all other elements are 0.
One-hot vectors are natural but of limited use. For example, in an Internet advertising system, suppose the user's query is "Mother's Day" and an ad's keyword is "carnation". Common sense tells us that the two are related, since people often give their mothers a bouquet of carnations on Mother's Day. However, any distance measure between the corresponding one-hot vectors, whether Euclidean distance or cosine similarity, treats the two words as completely unrelated because the vectors are orthogonal. The root cause of this counterintuitive conclusion is that each word by itself carries too little information, so the two words alone are not enough to judge accurately whether they are related. To compute relevance precisely, we need more information: knowledge distilled from large amounts of data by machine learning methods.
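The orthogonality argument can be checked with a few lines of NumPy. The three-word vocabulary below is an assumption chosen only for illustration; with one-hot vectors, a related pair and an unrelated pair are indistinguishable under both measures.

```python
import numpy as np

vocab = ["mother's day", "carnation", "football"]


def one_hot(word):
    """Return the one-hot vector of a word in the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


a, b, c = (one_hot(w) for w in vocab)
sim_related = cosine_similarity(a, b)         # "mother's day" vs "carnation"
sim_unrelated = cosine_similarity(a, c)       # "mother's day" vs "football"
dist_related = float(np.linalg.norm(a - b))
dist_unrelated = float(np.linalg.norm(a - c))
print(sim_related, sim_unrelated, dist_related, dist_unrelated)
```

Both cosine similarities are zero and both Euclidean distances are equal, so the representation carries no relevance information at all.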
...@@ -68,7 +69,7 @@ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$ ...@@ -68,7 +69,7 @@ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
### N-gram neural model ### N-gram neural model
In computational linguistics, the n-gram is an important text representation method that denotes n consecutive items in a text. Depending on the application, each item can be a letter, a word, or a syllable. The n-gram model is also an important method in statistical language modeling; when training a language model with n-grams, the n-1 preceding words of each n-gram are generally used to predict the n-th word.
...@@ -84,39 +85,39 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$ ...@@ -84,39 +85,39 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ is the conditional probability of the current word $w_t$ given the previous n-1 words, and $R(\theta)$ is the parameter regularization term.
<p align="center"> <p align="center">
<img src="image/nnlm.png" width=500><br/> <img src="image/nnlm.png" width=500><br/>
Figure 2. N-gram neural network model
</p> </p>
Figure 2 shows the N-gram neural network model. From the bottom up, the model consists of the following parts:
- For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs, for each of the `|V|` words in the dictionary, the probability that it is the t-th word of the sentence.
  Each input word $w_{t-n+1},...w_{t-1}$ is first mapped to its word embedding $C(w_{t-n+1}),...C(w_{t-1})$ through a mapping matrix.
- The word embeddings of all words are then concatenated into one large vector, which is passed through a nonlinear mapping to obtain the hidden representation of the history words:
  $$g = U\tanh(\theta^Tx + b_1) + Wx + b_2$$
  where $x$ is the large vector concatenated from all the word embeddings and represents the text history features; $\theta$, $U$, $b_1$, $b_2$ and $W$ are the parameters connecting the word embedding layer to the hidden layer. $g$ holds the unnormalized probabilities of all output words, and $g_i$ is the unnormalized output probability of the $i$-th word in the dictionary.
- By the definition of softmax, normalizing $g_i$ gives the probability of generating the target word $w_t$:
  $$P(w_t | w_{t-n+1}, ..., w_{t-1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$
- The cost of the entire network is the multi-class cross-entropy, expressed as
  $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log(softmax(g_k^i))$$
  where $y_k^i$ is the true label (0 or 1) of the $k$-th class for the $i$-th sample, and $softmax(g_k^i)$ is the softmax output probability of the $k$-th class for the $i$-th sample.
### Continuous Bag-of-Words model (CBOW)

The CBOW model predicts the current word from its context (N words on each side). When N=2, the model is as shown in the figure below:
<p align="center"> <p align="center">
<img src="image/cbow.png" width=250><br/> <img src="image/cbow.png" width=250><br/>
Figure 3. CBOW model
</p> </p>
...@@ -127,11 +128,11 @@ $$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$ ...@@ -127,11 +128,11 @@ $$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
where $x_t$ is the word embedding of the $t$-th word, the classification score vector is $z=U*context$, the final classification $y$ uses softmax, and the loss function is the multi-class cross-entropy.
### Skip-gram model ### Skip-gram model
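The advantage of CBOW is that it smooths the distribution of context words over their word embeddings and removes noise, so it is very effective on small datasets. Skip-gram, in contrast, uses a word to predict its context and thereby obtains many context samples for the current word, so it can be used on larger datasets.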
<p align="center"> <p align="center">
<img src="image/skipgram.png" width=250><br/> <img src="image/skipgram.png" width=250><br/>
Figure 4. Skip-gram model
</p> </p>
...@@ -165,7 +166,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去 ...@@ -165,7 +166,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
</table> </table>
</p> </p>
### Data Preprocessing
This chapter trains a 5-gram model, which means that during PaddlePaddle training the first 4 words of each sample are used to predict the 5th word. PaddlePaddle provides the Python package `paddle.dataset.imikolov` for the PTB dataset, which downloads and preprocesses the data automatically.
...@@ -186,7 +187,7 @@ dream that one day <e> ...@@ -186,7 +187,7 @@ dream that one day <e>
The structure of the model in this configuration is shown in the figure below:
<p align="center"> <p align="center">
<img src="image/ngram.png" width=400><br/> <img src="image/ngram.png" width=400><br/>
Figure 5. N-gram neural network model in the model configuration
</p> </p>
...@@ -208,8 +209,8 @@ N = 5 # 训练5-Gram ...@@ -208,8 +209,8 @@ N = 5 # 训练5-Gram
Next, define the network structure:
- Map the $n-1$ words preceding $w_t$, namely $w_{t-n+1},...w_{t-1}$, to D-dimensional word embeddings through a $|V|\times D$ matrix (D=32 in this example).
```python
def wordemb(inlayer):
    wordemb = paddle.layer.table_projection(
        input=inlayer,
...@@ -225,54 +226,54 @@ def wordemb(inlayer): ...@@ -225,54 +226,54 @@ def wordemb(inlayer):
- Define the data type and name of each input layer.

```python
paddle.init(use_gpu=False, trainer_count=3)  # initialize PaddlePaddle
word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict)
# Each input layer accepts integer data in the range [0, dict_size)
firstword = paddle.layer.data(
    name="firstw", type=paddle.data_type.integer_value(dict_size))
secondword = paddle.layer.data(
    name="secondw", type=paddle.data_type.integer_value(dict_size))
thirdword = paddle.layer.data(
    name="thirdw", type=paddle.data_type.integer_value(dict_size))
fourthword = paddle.layer.data(
    name="fourthw", type=paddle.data_type.integer_value(dict_size))
nextword = paddle.layer.data(
    name="fifthw", type=paddle.data_type.integer_value(dict_size))

Efirst = wordemb(firstword)
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)
```
- Concatenate these n-1 word embeddings with concat_layer into one large vector that serves as the history text feature.

```python
contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
```
- Pass the history text feature through a fully connected layer to obtain the hidden text feature.

```python
hidden1 = paddle.layer.fc(input=contextemb,
                          size=hiddensize,
                          act=paddle.activation.Sigmoid(),
                          layer_attr=paddle.attr.Extra(drop_rate=0.5),
                          bias_attr=paddle.attr.Param(learning_rate=2),
                          param_attr=paddle.attr.Param(
                              initial_std=1. / math.sqrt(embsize * 8),
                              learning_rate=1))
```
- Pass the hidden text feature through another fully connected layer that maps it to a $|V|$-dimensional vector, and normalize with softmax to obtain the generation probabilities of the `|V|` words.

```python
predictword = paddle.layer.fc(input=hidden1,
                              size=dict_size,
                              bias_attr=paddle.attr.Param(learning_rate=2),
                              act=paddle.activation.Softmax())
```
- The loss function of the network is the multi-class cross-entropy; the `classification_cost` function can be called directly.
...@@ -288,11 +289,11 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword) ...@@ -288,11 +289,11 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword)
- Regularization: a technique for preventing the network from overfitting; L2 regularization is used here.

```python
parameters = paddle.parameters.create(cost)
adam_optimizer = paddle.optimizer.Adam(
    learning_rate=3e-3,
    regularization=paddle.optimizer.L2Regularization(8e-4))
trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
```
Next, we start the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide the training and test datasets, respectively. Each of these functions returns a reader. In PaddlePaddle, a reader is a Python function that returns a Python generator each time it is called.
...@@ -300,113 +301,95 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword) ...@@ -300,113 +301,95 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword)
`paddle.batch` takes a reader as input and outputs a batched reader. In PaddlePaddle, a reader yields one training sample at a time, whereas a batched reader yields one minibatch at a time.
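The reader and batched-reader contract described above can be illustrated in plain Python. The `batch` function below is a hypothetical stand-in written for this explanation, not PaddlePaddle's implementation; in particular, emitting a final smaller minibatch is an assumption of this sketch.

```python
def batch(reader, batch_size):
    """Wrap a reader (a function returning a generator of samples) into a batched reader."""
    def batched_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) == batch_size:
                yield buf
                buf = []
        if buf:  # emit the final, possibly smaller, minibatch
            yield buf
    return batched_reader


def toy_reader():
    """A reader: each call returns a fresh generator over the samples."""
    for i in range(7):
        yield (i, i + 1)


batches = list(batch(toy_reader, 3)())
print([len(b) for b in batches])
```

Because a reader is a function rather than a generator object, it can be called again at the start of every pass to iterate over the data from the beginning.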
```python
import gzip

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)

    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(
            paddle.batch(
                paddle.dataset.imikolov.test(word_dict, N), 32))
        print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
        with gzip.open("model_%d.tar.gz"%event.pass_id, 'w') as f:
            parameters.to_tar(f)

trainer.train(
    paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
    num_passes=100,
    event_handler=event_handler)
```
Training is fully automatic. The log printed by event_handler looks like the following:

    ...
    Pass 0, Batch 25000, Cost 4.251861, {'classification_error_evaluator': 0.84375}
    Pass 0, Batch 25100, Cost 4.847692, {'classification_error_evaluator': 0.8125}
    Pass 0, Testing metrics {'classification_error_evaluator': 0.7417652606964111}

```text
.............................
I1222 09:27:16.477841 12590 TrainerInternal.cpp:162]  Batch=3000 samples=300000 AvgCost=5.36135 CurrentCost=5.36135 Eval: classification_error_evaluator=0.818653  CurrentEval: classification_error_evaluator=0.818653
.............................
I1222 09:27:22.416700 12590 TrainerInternal.cpp:162]  Batch=6000 samples=600000 AvgCost=5.29301 CurrentCost=5.22467 Eval: classification_error_evaluator=0.814542  CurrentEval: classification_error_evaluator=0.81043
.............................
I1222 09:27:28.343756 12590 TrainerInternal.cpp:162]  Batch=9000 samples=900000 AvgCost=5.22494 CurrentCost=5.08876 Eval: classification_error_evaluator=0.810088  CurrentEval: classification_error_evaluator=0.80118
..I1222 09:27:29.128582 12590 TrainerInternal.cpp:179]  Pass=0 Batch=9296 samples=929600 AvgCost=5.21786 Eval: classification_error_evaluator=0.809647
I1222 09:27:29.627616 12590 Tester.cpp:111]  Test samples=73760 cost=4.9594 Eval: classification_error_evaluator=0.79676
I1222 09:27:29.627713 12590 GradientMachine.cpp:112] Saving parameters to model/pass-00000
```
After 30 passes, we obtain an average error rate of classification_error_evaluator=0.735611.
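## Model Application
After training, we can load the model parameters, use the trained word embeddings to initialize other models, or inspect the parameters for downstream use.
### Initializing Other Models
Trained model parameters can be used to initialize other models, as follows:
On the PaddlePaddle training command line, use `--init_model_path` to specify the location of the initialization model and `--load_missing_parameter_strategy` to specify the initialization strategy for the new model's parameters other than the word embeddings. Note that the new model must share the parameter names of the initialized parameters with the original model.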
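### Viewing Word Embeddings

Parameters trained by PaddlePaddle are stored in binary format in the folder of the corresponding training pass. We provide the file `format_convert.py` to convert between the binary files produced by PaddlePaddle training and text-format feature files.

```bash
python format_convert.py --b2t -i INPUT -o OUTPUT -d DIM
```
where INPUT is the name of the input (binary) word embedding model, OUTPUT is the name of the output text model, and DIM is the dimension of the word embedding parameters.

For example:

```bash
python format_convert.py --b2t -i model/pass-00029/_proj -o model/pass-00029/_proj.txt -d 32
```
The converted text file looks like this:

```text
0,4,62496
-0.7444070,-0.1846171,-1.5771370,0.7070392,2.1963732,-0.0091410, ......
-0.0721337,-0.2429973,-0.0606297,0.1882059,-0.2072131,-0.7661019, ......
......
```
The first line describes the format of the PaddlePaddle output file and contains three attributes:<br/>
1) the PaddlePaddle version number, 0 in this example;<br/>
2) the number of bytes per floating-point number, 4 in this example;<br/>
3) the total number of parameters, 62496 (i.e. 1953*32) in this example;<br/>
Each subsequent line gives, in order, the features of one word in the dictionary, separated by commas.

Parameters trained by PaddlePaddle can be retrieved directly with `parameters.get()`. For example, the embedding of the word "word" is obtained as follows:

```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)

print embeddings[word_dict['word']]
```

    [-0.38961065 -0.02392169 -0.00093231  0.36301503  0.13538605  0.16076435
     -0.0678709   0.1090285   0.42014077 -0.24119169 -0.31847557  0.20410083
      0.04910378  0.19021918 -0.0122014  -0.04099389 -0.16924137  0.1911236
     -0.10917275  0.13068172 -0.23079982  0.42699069 -0.27679482 -0.01472992
      0.2069038   0.09005053 -0.3282454   0.12717034 -0.24218646  0.25304323
      0.19072419 -0.24286366]

### Modifying Word Embeddings

We can modify the word embeddings and convert them back to the PaddlePaddle binary parameter format:

```bash
python format_convert.py --t2b -i INPUT -o OUTPUT
```
where INPUT is the name of the input text word embedding model and OUTPUT is the name of the output binary word embedding model.

The input text format is as follows (note that it does not include the format description on the first line of the binary-to-text output above):

```text
-0.7444070,-0.1846171,-1.5771370,0.7070392,2.1963732,-0.0091410, ......
-0.0721337,-0.2429973,-0.0606297,0.1882059,-0.2072131,-0.7661019, ......
......
```

The retrieved embedding is a standard numpy matrix. We can modify this numpy matrix and then assign it back.

```python
def modify_embedding(emb):
    # Add your modification here.
    pass

modify_embedding(embeddings)
parameters.set("_proj", embeddings)
```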
### Computing the Cosine Distance Between Words

The distance between two vectors can be measured by their cosine, which lies in the interval $[-1,1]$; the larger the cosine between two vectors, the closer they are. Here we implement the distance measure between different words in `calculate_dis.py`.
Usage:

```bash
python calculate_dis.py VOCABULARY EMBEDDINGLAYER
```

where `VOCABULARY` is the dictionary and `EMBEDDINGLAYER` is the word embedding model. For example:

```bash
python calculate_dis.py data/vocabulary.txt model/pass-00029/_proj.txt
```

```python
from scipy import spatial

emb_1 = embeddings[word_dict['world']]
emb_2 = embeddings[word_dict['would']]

print spatial.distance.cosine(emb_1, emb_2)
```

    0.99375076448
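Note that `scipy.spatial.distance.cosine` returns the cosine distance, that is, 1 minus the cosine similarity, so a value close to 1 such as the one printed above means the two embeddings are nearly orthogonal. The following NumPy sketch reproduces that definition on toy vectors; the function name `cosine_distance` is an illustrative assumption.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cos(u, v): 0 for the same direction, 1 for orthogonal, 2 for opposite vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


u = np.array([1.0, 0.0])
d_same = cosine_distance(u, np.array([2.0, 0.0]))    # same direction
d_orth = cosine_distance(u, np.array([0.0, 3.0]))    # orthogonal
d_opp = cosine_distance(u, np.array([-1.0, 0.0]))    # opposite direction
print(d_same, d_orth, d_opp)
```

Under this convention a smaller value means the two words are closer, the opposite of reading the raw cosine.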
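## Conclusion
In this chapter, we introduced word embeddings, the relationship between language models and word embeddings, and how to obtain word embeddings by training a neural network model. In information retrieval, the relevance between a query and document keywords can be judged by the cosine of the angle between their vectors. In syntactic and semantic analysis, trained word embeddings can be used to initialize models for better results. In document classification, word embeddings make it possible to group synonyms in documents by clustering. We hope that after this chapter readers will be able to apply word embeddings in research in related fields.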
......
...@@ -30,25 +30,25 @@ import struct ...@@ -30,25 +30,25 @@ import struct
def binary2text(input, output, paraDim): def binary2text(input, output, paraDim):
""" """
Convert a binary parameter file of embedding model to be a text file. Convert a binary parameter file of embedding model to be a text file.
input: the name of input binary parameter file, the format is: input: the name of input binary parameter file, the format is:
1) the first 16 bytes is filehead: 1) the first 16 bytes is filehead:
version(4 bytes): version of paddle, default = 0 version(4 bytes): version of paddle, default = 0
floatSize(4 bytes): sizeof(float) = 4 floatSize(4 bytes): sizeof(float) = 4
paraCount(8 bytes): total number of parameter paraCount(8 bytes): total number of parameter
2) the next (paraCount * 4) bytes is parameters, each has 4 bytes 2) the next (paraCount * 4) bytes is parameters, each has 4 bytes
output: the name of output text parameter file, for example: output: the name of output text parameter file, for example:
0,4,32156096 0,4,32156096
-0.7845433,1.1937413,-0.1704215,... -0.7845433,1.1937413,-0.1704215,...
0.0000909,0.0009465,-0.0008813,... 0.0000909,0.0009465,-0.0008813,...
... ...
the format is: the format is:
1) the first line is filehead: 1) the first line is filehead:
version=0, floatSize=4, paraCount=32156096 version=0, floatSize=4, paraCount=32156096
2) other lines print the parameters
a) each line prints paraDim parameters separated by ','
b) there are paraCount/paraDim lines (embedding words)
paraDim: dimension of parameters paraDim: dimension of parameters
""" """
fi = open(input, "rb") fi = open(input, "rb")
fo = open(output, "w") fo = open(output, "w")
...@@ -78,7 +78,7 @@ def binary2text(input, output, paraDim): ...@@ -78,7 +78,7 @@ def binary2text(input, output, paraDim):
def get_para_count(input):
    """
    Compute the total number of embedding parameters in the input text file.
    input: the name of the input text file
    """
    numRows = 1
...@@ -96,14 +96,14 @@ def text2binary(input, output, paddle_head=True): ...@@ -96,14 +96,14 @@ def text2binary(input, output, paddle_head=True):
    Convert a text parameter file of an embedding model to a binary file.
    input: the name of the input text parameter file, for example:
        -0.7845433,1.1937413,-0.1704215,...
        0.0000909,0.0009465,-0.0008813,...
        ...
        the format is:
        1) it doesn't have a file header
        2) each line stores the same number of parameters,
           separated by commas ','
    output: the name of the output binary parameter file; the format is:
        1) the first 16 bytes are the file header:
            version(4 bytes), floatSize(4 bytes), paraCount(8 bytes)
        2) the next (paraCount * 4) bytes are the parameters, 4 bytes each
    """
...@@ -127,7 +127,7 @@ def text2binary(input, output, paddle_head=True): ...@@ -127,7 +127,7 @@ def text2binary(input, output, paddle_head=True):
def main():
    """
    Main entry for running format_convert.py
    """
    usage = "usage: \n" \
            "python %prog --b2t -i INPUT -o OUTPUT -d DIM \n" \
......
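The 16-byte header described in the docstrings above can be exercised with Python's `struct` module. The following is a minimal sketch, not the code of `format_convert.py`: the little-endian format string `<iIQ` and the helper names `write_params` and `read_params` are assumptions chosen for illustration.

```python
import io
import struct

# Assumption: little-endian int32 version, uint32 floatSize, uint64 paraCount.
HEADER_FORMAT = "<iIQ"


def write_params(stream, values):
    """Write the 16-byte header followed by the values as 4-byte floats."""
    stream.write(struct.pack(HEADER_FORMAT, 0, 4, len(values)))
    stream.write(struct.pack("<%df" % len(values), *values))


def read_params(stream):
    """Read the header, then read paraCount floats of floatSize bytes each."""
    header_size = struct.calcsize(HEADER_FORMAT)
    version, float_size, count = struct.unpack(
        HEADER_FORMAT, stream.read(header_size))
    values = struct.unpack("<%df" % count, stream.read(count * float_size))
    return version, float_size, list(values)


# Round trip through an in-memory buffer instead of a real parameter file.
buffer = io.BytesIO()
write_params(buffer, [0.5, -1.25, 2.0])
buffer.seek(0)
version, float_size, values = read_params(buffer)
```

The three sample values are exactly representable as 32-bit floats, so they survive the round trip unchanged; arbitrary decimals would come back rounded to float precision.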
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -77,8 +78,8 @@ The neural network based model does not require storing huge hash tables of stat ...@@ -77,8 +78,8 @@ The neural network based model does not require storing huge hash tables of stat
In this section, after training the word embedding model, we can use the data visualization algorithm t-SNE\[[4](#reference)\] to project the word embedding vectors onto a two-dimensional space and plot them (see the figure below). The figure shows that semantically related words, such as *a*, *the*, and *these*, or *big* and *huge*, lie close to each other in the projected space, while unrelated words, such as *say* and *business*, or *decision* and *japan*, lie far apart.

<p align="center">
    <img src = "image/2d_similarity.png" width=400><br/>
    Figure 1. Two-dimensional projection of word embeddings
</p>

### Cosine Similarity
...@@ -111,16 +112,16 @@ Before diving into word embedding models, we will first introduce the concept of ...@@ -111,16 +112,16 @@ Before diving into word embedding models, we will first introduce the concept of
In general, models that assign a probability to a sequence can be applied in many fields, such as machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition. Take information retrieval as an example. If you search for "how long is a football bame" (where bame is a medical noun), the search engine asks whether you meant "how long is a football game" instead. This is because the language model assigns a very low probability to "how long is a football bame", and among all the words easily confused with "bame", "game" yields the most probable sentence.

#### Target Probability

For a language model's target probability $P(w_1, ..., w_T)$, if the words in the sentence were independent, the joint probability of the whole sentence would be the product of each word's probability:

$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$

However, each word in a sentence typically depends on the words before it, so canonical language models define the target probability with conditional probabilities:

$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
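The chain-rule factorization can be sketched in a few lines. This is an illustration only: the table `COND_PROB` and the floor `UNSEEN` are hypothetical numbers, not the output of a trained model, and the sum is taken in log space to avoid underflow on long sentences.

```python
import math

# Hypothetical conditional probabilities P(word | history); not from a real model.
COND_PROB = {
    ("how", ()): 0.5,
    ("long", ("how",)): 0.4,
}
UNSEEN = 1e-6  # assumed floor for (word, history) pairs missing from the table


def sentence_log_prob(tokens):
    """Chain rule in log space: sum of log P(w_t | w_1, ..., w_{t-1})."""
    total = 0.0
    for t, word in enumerate(tokens):
        history = tuple(tokens[:t])
        total += math.log(COND_PROB.get((word, history), UNSEEN))
    return total


likely = sentence_log_prob(["how", "long"])
unlikely = sentence_log_prob(["how", "bame"])
```

Comparing `likely` with `unlikely` mirrors the spelling-suggestion example: the sequence containing the unseen word receives a far lower score.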
### N-gram neural model

In computational linguistics, the n-gram is an important way to represent text. An n-gram is a contiguous sequence of n items from a given text. Depending on the application, each item can be a letter, a syllable, or a word. The n-gram model is also an important method in statistical language modeling. When training language models with n-grams, the first (n-1) words of an n-gram are used to predict the *n*th word.
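As a small sketch of how word-level n-grams become training samples, the helpers below (hypothetical names, not part of PaddlePaddle) slide a window of length n over a token list and split each window into a context of n-1 words and a target word.

```python
def ngrams(tokens, n):
    """Return every contiguous n-gram in `tokens` as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def context_target_pairs(tokens, n):
    """Split each n-gram into (first n-1 words, n-th word)."""
    return [(gram[:-1], gram[-1]) for gram in ngrams(tokens, n)]


sentence = "i have a dream that one day".split()
pairs = context_target_pairs(sentence, 5)
```

A text shorter than n items produces no n-grams at all, which is why short sentences contribute nothing to training unless they are padded.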
...@@ -130,47 +131,47 @@ We have previously described language model using conditional probability, where ...@@ -130,47 +131,47 @@ We have previously described language model using conditional probability, where
$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$

Given some real corpus in which all sentences are meaningful, the n-gram model should maximize the following objective function:

$$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$

where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ represents the conditional probability of the current word $w_t$ given its previous $n-1$ words, and $R(\theta)$ is the parameter regularization term.

<p align="center">
    <img src="image/nnlm_en.png" width=500><br/>
    Figure 2. N-gram neural network model
</p>

Figure 2 shows the N-gram neural network model. From the bottom up, the model has the following components:

- For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs, for each of the `|V|` words in the dictionary, the probability that it is the t-th word.

  Every input word $w_{t-n+1},...w_{t-1}$ is first mapped to its word embedding $C(w_{t-n+1}),...C(w_{t-1})$ through a transformation matrix.

- All the word embeddings are concatenated into a single vector, which is mapped (nonlinearly) into a hidden representation of the context:

  $$g=U\tanh(\theta^Tx + b_1) + Wx + b_2$$

  where $x$ is the large vector concatenated from all the word embeddings, representing the context; $\theta$, $U$, $b_1$, $b_2$ and $W$ are the parameters connecting the word embedding layer to the hidden layer. $g$ holds the unnormalized probabilities of all output words, and $g_i$ is the unnormalized probability of the output word being the i-th word in the dictionary.

- Based on the definition of softmax, normalizing $g_i$ gives the probability that the output word is $w_t$:

  $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$

- The cost of the entire network is a multi-class cross-entropy, described by the following loss function:

  $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log(softmax(g_k^i))$$

  where $y_k^i$ represents the true label for the $k$-th class in the $i$-th sample ($0$ or $1$), and $softmax(g_k^i)$ represents the softmax probability for the $k$-th class in the $i$-th sample.
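The forward pass and loss listed above can be traced with a toy example. This is a pure-Python sketch under stated assumptions: the sizes (context vector of length 4, 2 hidden units, |V| = 3) and all weights are made up, `theta_t` stores $\theta^T$ row by row, and none of the helper names come from PaddlePaddle.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nnlm_forward(x, theta_t, b1, U, W, b2):
    """Compute g = U tanh(theta^T x + b1) + W x + b2 and softmax(g)."""
    hidden = [math.tanh(h + b) for h, b in zip(matvec(theta_t, x), b1)]
    g = [u + w + b for u, w, b in zip(matvec(U, hidden), matvec(W, x), b2)]
    return g, softmax(g)


def cross_entropy(probs, target_index):
    """Loss for one sample whose one-hot label is 1 at target_index."""
    return -math.log(probs[target_index])


# Toy values (assumptions): x is the concatenated context embedding.
x = [0.1, -0.2, 0.3, 0.4]
theta_t = [[0.5, 0.1, -0.3, 0.2], [-0.4, 0.2, 0.1, 0.0]]
b1 = [0.0, 0.1]
U = [[0.3, -0.1], [0.2, 0.4], [-0.5, 0.1]]
W = [[0.0] * 4, [0.1] * 4, [-0.1] * 4]
b2 = [0.0, 0.0, 0.0]

g, probs = nnlm_forward(x, theta_t, b1, U, W, b2)
loss = cross_entropy(probs, target_index=1)
```

Because the label vector is one-hot, the inner sum over $k$ in $J(\theta)$ keeps only the term for the true word, which is what `cross_entropy` computes for a single sample.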
### Continuous Bag-of-Words model (CBOW)

The CBOW model predicts the current word from the N words before it and the N words after it. When $N=2$, the model is shown in the figure below:

<p align="center">
    <img src="image/cbow_en.png" width=250><br/>
    Figure 3. CBOW model
</p>

Specifically, by ignoring the order of the words in the sequence, CBOW uses the average of the context word embeddings to predict the current word:
...@@ -179,30 +180,30 @@ $$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$ ...@@ -179,30 +180,30 @@ $$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
where $x_t$ is the word embedding of the t-th word, the classification score vector is $z=U*\text{context}$, the final classification $y$ uses softmax, and the loss function is multi-class cross-entropy.
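A minimal sketch of the averaging and scoring steps, assuming 2-dimensional embeddings and a 3-word vocabulary (all numbers and helper names are illustrative):

```python
def average(vectors):
    """Element-wise mean of equally sized vectors."""
    count = float(len(vectors))
    return [sum(column) / count for column in zip(*vectors)]


def scores(U, context):
    """z = U * context: one score per vocabulary word."""
    return [sum(u * c for u, c in zip(row, context)) for row in U]


# Embeddings of x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2} (made-up values).
window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
context = average(window)

# One row of U per vocabulary word (made-up values).
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = scores(U, context)
```

Since the context is an average, shuffling `window` leaves `context` unchanged, which is the sense in which CBOW ignores word order.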
### Skip-gram model

The advantage of CBOW is that it smooths over the word embeddings of the context and reduces noise, so it is very effective on small datasets. Skip-gram uses a word to predict its context, which yields multiple context samples for each given word, so it can be used on larger datasets.

<p align="center">
    <img src="image/skipgram_en.png" width=250><br/>
    Figure 4. Skip-gram model
</p>

As illustrated in the figure above, the skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (the $n$ words before and the $n$ words after the given word), and then combines the softmax classification losses of all those $2n$ words.
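To make the "one word predicts up to $2n$ neighbors" idea concrete, the sketch below enumerates the (center, context) training pairs for a token list. The function name `skipgram_pairs` is hypothetical, and truncating the window at sentence boundaries is an assumption made for this illustration.

```python
def skipgram_pairs(tokens, n):
    """Return (center, context) pairs for up to n words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - n)
        stop = min(len(tokens), i + n + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "quick", "brown", "fox"], 1)
```

Words in the interior of a sentence contribute the full $2n$ pairs, while words near either end contribute fewer.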
## Data Preparation
## Model Configuration
<p align="center">
    <img src="image/ngram.en.png" width=400><br/>
    Figure 5. N-gram neural network model in model configuration
</p>
## Model Training
## Model Application
## Conclusion
This chapter introduces word embeddings, the relationship between language models and word embeddings, and how to train neural networks to learn word embeddings.
...@@ -219,6 +220,7 @@ In information retrieval, the relevance between the query and document keyword c ...@@ -219,6 +220,7 @@ In information retrieval, the relevance between the query and document keyword c
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -237,6 +239,6 @@ marked.setOptions({ ...@@ -237,6 +239,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>
<html> <html>
<head> <head>
<script type="text/x-mathjax-config"> <script type="text/x-mathjax-config">
...@@ -5,8 +6,8 @@ ...@@ -5,8 +6,8 @@
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"], extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"], jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: { tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ], inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ], displayMath: [ ['$$','$$'] ],
processEscapes: true processEscapes: true
}, },
"HTML-CSS": { availableFonts: ["TeX"] } "HTML-CSS": { availableFonts: ["TeX"] }
...@@ -39,6 +40,7 @@ ...@@ -39,6 +40,7 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Word Embeddings

The source code for this tutorial is in [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/word2vec). If you are using PaddlePaddle for the first time, please refer to the [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
...@@ -47,7 +49,7 @@ ...@@ -47,7 +49,7 @@
This chapter introduces the vector representation of words, also known as word embedding. Word embedding is a common operation in natural language processing and a common underlying technology behind Internet services such as search engines, advertising systems, and recommender systems.

In these Internet services, we often need to compare the relevance of two words or two pieces of text. To make such comparisons, we first need to represent words in a form that computers can process. The most natural choice is probably the vector space model.

In this approach, each word is represented as a real-valued vector (a one-hot vector) whose length equals the dictionary size. Each dimension corresponds to one word in the dictionary; the value is 1 in the dimension of the word itself and 0 everywhere else.

Although natural, the one-hot vector has limited use. For example, in an Internet advertising system, suppose the user's query is "Mother's Day" and an advertisement's keyword is "carnation". Common sense tells us that the two words are related, since people often give their mothers a bouquet of carnations on Mother's Day. However, under any distance metric between the corresponding one-hot vectors, whether Euclidean distance or cosine similarity, the two words appear completely unrelated because the vectors are orthogonal. The root cause of this counterintuitive conclusion is that each word by itself carries too little information, so two words alone are not enough to decide accurately whether they are related. To compute relevance accurately, we need more information: knowledge induced from large amounts of data by machine learning methods.
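The orthogonality argument can be checked directly. The sketch below uses a hypothetical three-word dictionary; the helper names are illustrative and not part of any library.

```python
import math


def one_hot(index, size):
    """Vector of length `size` with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical dictionary for illustration.
vocab = {"mother's day": 0, "carnation": 1, "football": 2}
mothers_day = one_hot(vocab["mother's day"], len(vocab))
carnation = one_hot(vocab["carnation"], len(vocab))
football = one_hot(vocab["football"], len(vocab))

related = cosine(mothers_day, carnation)
unrelated = cosine(mothers_day, football)
```

Both similarities are zero and every pair of distinct words is the same Euclidean distance apart, so one-hot vectors cannot rank "carnation" as closer to "Mother's Day" than "football".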
...@@ -109,7 +111,7 @@ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$ ...@@ -109,7 +111,7 @@ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
### N-gram neural model

In computational linguistics, the n-gram is an important way to represent text: it denotes n consecutive items in a text. Depending on the application, each item can be a letter, a word, or a syllable. The n-gram model is also an important method in statistical language modeling. When training a language model with n-grams, the preceding n-1 words of each n-gram are generally used to predict the n-th word.
...@@ -125,39 +127,39 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$ ...@@ -125,39 +127,39 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ is the conditional probability of the current word $w_t$ given the preceding n-1 words, and $R(\theta)$ is the parameter regularization term.

<p align="center">
    <img src="image/nnlm.png" width=500><br/>
    Figure 2. N-gram neural network model
</p>

Figure 2 shows the N-gram neural network model. From the bottom up, the model consists of the following parts:

- For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs, for each of the `|V|` words in the dictionary, the probability that it is the t-th word of the sentence.

  Each input word $w_{t-n+1},...w_{t-1}$ is first mapped to its word embedding $C(w_{t-n+1}),...C(w_{t-1})$ through a mapping matrix.

- The word embeddings of all the words are then concatenated into one large vector, which passes through a nonlinear mapping to produce the hidden representation of the history words:

  $$g=U\tanh(\theta^Tx + b_1) + Wx + b_2$$

  where $x$ is the large vector concatenated from all the word embeddings, representing the features of the text history; $\theta$, $U$, $b_1$, $b_2$ and $W$ are the parameters connecting the word embedding layer to the hidden layer. $g$ holds the unnormalized probabilities of all output words, and $g_i$ is the unnormalized output probability of the $i$-th word in the dictionary.

- By the definition of softmax, normalizing $g_i$ gives the probability of generating the target word $w_t$:

  $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$

- The cost of the whole network is the multi-class cross-entropy, expressed as:

  $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log(softmax(g_k^i))$$

  where $y_k^i$ is the true label (0 or 1) of the $k$-th class for the $i$-th sample, and $softmax(g_k^i)$ is the softmax output probability of the $k$-th class for the $i$-th sample.
### Continuous Bag-of-Words model (CBOW)

The CBOW model predicts the current word from its context (N words on each side). When N=2, the model is shown in the figure below:

<p align="center">
    <img src="image/cbow.png" width=250><br/>
    Figure 3. CBOW model
</p>
...@@ -168,11 +170,11 @@ $$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$ ...@@ -168,11 +170,11 @@ $$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
where $x_t$ is the word embedding of the $t$-th word, the classification score vector is $z=U*context$, the final classification $y$ uses softmax, and the loss function is multi-class cross-entropy.

### Skip-gram model

The advantage of CBOW is that it smooths the distribution of the context words over the word embeddings and removes noise, so it is effective on small datasets. Skip-gram, by contrast, uses a word to predict its context, which yields many context samples for the current word, so it can be used on larger datasets.

<p align="center">
    <img src="image/skipgram.png" width=250><br/>
    Figure 4. Skip-gram model
</p>
...@@ -206,7 +208,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去 ...@@ -206,7 +208,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
</table> </table>
</p> </p>
### Data Preprocessing

This chapter trains a 5-gram model, which means that during PaddlePaddle training the first 4 words of each sample are used to predict the 5th word. PaddlePaddle provides the Python package `paddle.dataset.imikolov` for the PTB dataset, which downloads and preprocesses the data automatically.
...@@ -227,7 +229,7 @@ dream that one day <e> ...@@ -227,7 +229,7 @@ dream that one day <e>
The model structure for this configuration is shown in the figure below:

<p align="center">
    <img src="image/ngram.png" width=400><br/>
    Figure 5. N-gram neural network model in the model configuration
</p>
...@@ -249,8 +251,8 @@ N = 5 # 训练5-Gram ...@@ -249,8 +251,8 @@ N = 5 # 训练5-Gram
Next, define the network structure:

- Map the $n-1$ words before $w_t$, namely $w_{t-n+1},...w_{t-1}$, to D-dimensional word embeddings through a $|V|\times D$ matrix (D=32 in this example).
```python ```python
def wordemb(inlayer): def wordemb(inlayer):
wordemb = paddle.layer.table_projection( wordemb = paddle.layer.table_projection(
input=inlayer, input=inlayer,
...@@ -266,54 +268,54 @@ def wordemb(inlayer): ...@@ -266,54 +268,54 @@ def wordemb(inlayer):
- Define the data type and name of each input layer.

```python
paddle.init(use_gpu=False, trainer_count=3)  # initialize PaddlePaddle
word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict)
# Each input layer accepts integer data in the range [0, dict_size)
firstword = paddle.layer.data(
    name="firstw", type=paddle.data_type.integer_value(dict_size))
secondword = paddle.layer.data(
    name="secondw", type=paddle.data_type.integer_value(dict_size))
thirdword = paddle.layer.data(
    name="thirdw", type=paddle.data_type.integer_value(dict_size))
fourthword = paddle.layer.data(
    name="fourthw", type=paddle.data_type.integer_value(dict_size))
nextword = paddle.layer.data(
    name="fifthw", type=paddle.data_type.integer_value(dict_size))

Efirst = wordemb(firstword)
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)
```
- Concatenate these n-1 word embeddings with concat_layer into one large vector that serves as the history text feature.

```python
contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
```

- Pass the history text feature through a fully connected layer to obtain the hidden text feature.

```python
hidden1 = paddle.layer.fc(input=contextemb,
                          size=hiddensize,
                          act=paddle.activation.Sigmoid(),
                          layer_attr=paddle.attr.Extra(drop_rate=0.5),
                          bias_attr=paddle.attr.Param(learning_rate=2),
                          param_attr=paddle.attr.Param(
                                initial_std=1. / math.sqrt(embsize * 8),
                                learning_rate=1))
```

- Pass the hidden text feature through another fully connected layer to map it to a $|V|$-dimensional vector, and normalize it with softmax to obtain the generation probabilities of the `|V|` words.

```python
predictword = paddle.layer.fc(input=hidden1,
                              size=dict_size,
                              bias_attr=paddle.attr.Param(learning_rate=2),
                              act=paddle.activation.Softmax())
```

- The loss function of the network is multi-class cross-entropy; the `classification_cost` function can be called directly.
...@@ -329,11 +331,11 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword) ...@@ -329,11 +331,11 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword)
- Regularization: a technique to prevent the network from overfitting. L2 regularization is used here.

```python
parameters = paddle.parameters.create(cost)
adam_optimizer = paddle.optimizer.Adam(
    learning_rate=3e-3,
    regularization=paddle.optimizer.L2Regularization(8e-4))
trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
```

Next, we start the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide the training and test datasets respectively. Each of these functions returns a reader. In PaddlePaddle, a reader is a Python function that returns a Python generator each time it is called.
...@@ -341,113 +343,95 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword) ...@@ -341,113 +343,95 @@ cost = paddle.layer.classification_cost(input=predictword, label=nextword)
The input of `paddle.batch` is a reader and its output is a batched reader. In PaddlePaddle, a reader yields one training sample at a time, while a batched reader yields one mini-batch at a time.
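The reader and batched-reader relationship can be illustrated without PaddlePaddle. The sketch below is an assumption-laden stand-in, not the library's implementation: `batch` and `toy_reader` are hypothetical names, and emitting a final, smaller mini-batch is a design choice made for this example.

```python
def batch(reader, batch_size):
    """Wrap a reader so that it yields lists of up to batch_size samples."""
    def batched_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # emit the final, possibly smaller, mini-batch
            yield current
    return batched_reader


def toy_reader():
    """A reader: each call returns a fresh generator of samples."""
    for i in range(5):
        yield (i, i + 1)


batches = list(batch(toy_reader, 2)())
```

Note that both `toy_reader` and the value returned by `batch` are functions; calling them creates a new generator, which is what lets a trainer iterate over the data once per pass.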
```python
import gzip

def event_handler(event):
    # Log the training cost every 100 batches.
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)

    # At the end of each pass, evaluate on the test set and save the parameters.
    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(
                    paddle.batch(
                        paddle.dataset.imikolov.test(word_dict, N), 32))
        print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
        with gzip.open("model_%d.tar.gz"%event.pass_id, 'w') as f:
            parameters.to_tar(f)

trainer.train(
    paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
    num_passes=100,
    event_handler=event_handler)
```
The training process is fully automatic. The log printed by event_handler looks similar to the following:

...
Pass 0, Batch 25000, Cost 4.251861, {'classification_error_evaluator': 0.84375}
Pass 0, Batch 25100, Cost 4.847692, {'classification_error_evaluator': 0.8125}
Pass 0, Testing metrics {'classification_error_evaluator': 0.7417652606964111}
```text
.............................
I1222 09:27:16.477841 12590 TrainerInternal.cpp:162] Batch=3000 samples=300000 AvgCost=5.36135 CurrentCost=5.36135 Eval: classification_error_evaluator=0.818653 CurrentEval: class
ification_error_evaluator=0.818653
.............................
I1222 09:27:22.416700 12590 TrainerInternal.cpp:162] Batch=6000 samples=600000 AvgCost=5.29301 CurrentCost=5.22467 Eval: classification_error_evaluator=0.814542 CurrentEval: class
ification_error_evaluator=0.81043
.............................
I1222 09:27:28.343756 12590 TrainerInternal.cpp:162] Batch=9000 samples=900000 AvgCost=5.22494 CurrentCost=5.08876 Eval: classification_error_evaluator=0.810088 CurrentEval: class
ification_error_evaluator=0.80118
..I1222 09:27:29.128582 12590 TrainerInternal.cpp:179] Pass=0 Batch=9296 samples=929600 AvgCost=5.21786 Eval: classification_error_evaluator=0.809647
I1222 09:27:29.627616 12590 Tester.cpp:111] Test samples=73760 cost=4.9594 Eval: classification_error_evaluator=0.79676
I1222 09:27:29.627713 12590 GradientMachine.cpp:112] Saving parameters to model/pass-00000
```
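After 30 passes, the average error rate is classification_error_evaluator=0.735611. After 30 passes, the average error rate is classification_error_evaluator=0.735611.
## Applying the Model ## Applying the Model
After training, we can load the model parameters and use the trained word vectors to initialize other models, or convert the parameters from binary to text format for later use. After training, we can load the model parameters and use the trained word vectors to initialize other models, or inspect the parameters for later use.
### Initializing Other Models
The trained model parameters can be used to initialize other models, as follows:
On the PaddlePaddle training command line, use `--init_model_path` to specify the location of the initializing model, and use `--load_missing_parameter_strategy` to specify how the new model's parameters other than the word vectors are initialized. Note that the new model must use the same parameter names as the original model for the parameters being initialized.
### Viewing Word Vectors ### Viewing Word Vectors
Parameters trained by PaddlePaddle are stored in binary format in the folder for the corresponding training pass. We provide `format_convert.py` to convert between PaddlePaddle's binary result files and text-format feature files.
```bash Parameters trained by PaddlePaddle can be retrieved directly with `parameters.get()`. For example, the word vector of the word `word` can be viewed as follows: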
python format_convert.py --b2t -i INPUT -o OUTPUT -d DIM
```
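Here, INPUT is the name of the input (binary) word vector model, OUTPUT is the name of the output text model, and DIM is the dimension of the word vector parameters.
For example: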
```bash ```python
python format_convert.py --b2t -i model/pass-00029/_proj -o model/pass-00029/_proj.txt -d 32 embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
```
The converted text file looks like this:
```text print embeddings[word_dict['word']]
0,4,62496
-0.7444070,-0.1846171,-1.5771370,0.7070392,2.1963732,-0.0091410, ......
-0.0721337,-0.2429973,-0.0606297,0.1882059,-0.2072131,-0.7661019, ......
......
``` ```
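The first line describes the format of the PaddlePaddle output file and contains 3 attributes:<br/> [-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
1) the PaddlePaddle version number, 0 in this example;<br/> -0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
2) the number of bytes per floating-point number, 4 in this example;<br/> 0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
3) the total number of parameters, 62496 in this example (i.e. 1953*32);<br/> -0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
Each line from the second onward gives, in order, the features of one word in the dictionary, separated by commas. 0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323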
0.19072419 -0.24286366]
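The lookup above relies on viewing the parameter as a matrix with one row per dictionary entry. The following is a minimal sketch of that indexing using plain numpy; the toy `word_dict`, the value of `embsize`, and the `flat` array standing in for `parameters.get("_proj")` are all assumptions made for illustration and are not taken from the trained model.

```python
import numpy as np

# Hypothetical toy dictionary mapping each word to its row index.
word_dict = {"hello": 0, "word": 1, "<unk>": 2}
embsize = 4

# Stand-in for the parameter returned by parameters.get("_proj");
# here it is just a flat array holding len(word_dict) * embsize values.
flat = np.arange(len(word_dict) * embsize, dtype="float32")

# One row per word, one column per embedding dimension.
embeddings = flat.reshape(len(word_dict), embsize)

# The word's id selects its row.
vec = embeddings[word_dict["word"]]
print(vec)
```

The reshape only succeeds when the parameter holds exactly `len(word_dict) * embsize` values, so a mismatch between the dictionary and the model shows up immediately as an error.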
### Modifying Word Vectors
We can modify the word vectors and convert them back to the PaddlePaddle binary parameter format as follows:
```bash
python format_convert.py --t2b -i INPUT -o OUTPUT
```
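Here, INPUT is the name of the input text word vector model, and OUTPUT is the name of the output binary word vector model. ### Modifying Word Vectors
The input text format is as follows (note that it does not include the format description line that appears first after the binary-to-text conversion): The retrieved embedding is a standard numpy matrix. We can modify this numpy matrix and then assign it back.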
```text
-0.7444070,-0.1846171,-1.5771370,0.7070392,2.1963732,-0.0091410, ...... ```python
-0.0721337,-0.2429973,-0.0606297,0.1882059,-0.2072131,-0.7661019, ...... def modify_embedding(emb):
...... # Add your modification here.
pass
modify_embedding(embeddings)
parameters.set("_proj", embeddings)
``` ```
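As one possible body for `modify_embedding`, the sketch below L2-normalizes every row in place. The choice of normalization is purely illustrative, and the random matrix is a stand-in (an assumption) for the array obtained from `parameters.get("_proj")`.

```python
import numpy as np

def modify_embedding(emb):
    """L2-normalize each row of emb in place (one hypothetical modification)."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing all-zero rows by zero
    emb /= norms             # in-place update, so the caller's array changes

# Stand-in for parameters.get("_proj").reshape(len(word_dict), embsize).
embeddings = np.random.RandomState(0).randn(5, 32).astype("float32")
modify_embedding(embeddings)
print(np.linalg.norm(embeddings, axis=1))
```

The update is done in place because the function returns nothing; a version that built a new array would have to return it, and the caller would then pass that result to `parameters.set`.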
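### Computing the Cosine Distance Between Words ### Computing the Cosine Distance Between Words
The distance between two vectors can be expressed by their cosine, which lies in the interval $[-1,1]$: the larger the cosine between two vectors, the closer they are. We implement the distance measure between words in `calculate_dis.py`. The distance between two vectors can be expressed by their cosine, which lies in the interval $[-1,1]$: the larger the cosine between two vectors, the closer they are. We implement the distance measure between words in `calculate_dis.py`.
Usage: Usage: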
```bash
python calculate_dis.py VOCABULARY EMBEDDINGLAYER`
```
Here, `VOCABULARY` is the dictionary and `EMBEDDINGLAYER` is the word vector model. For example: ```python
from scipy import spatial
emb_1 = embeddings[word_dict['world']]
emb_2 = embeddings[word_dict['would']]
```bash print spatial.distance.cosine(emb_1, emb_2)
python calculate_dis.py data/vocabulary.txt model/pass-00029/_proj.txt
``` ```
0.99375076448
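Note that scipy's `spatial.distance.cosine` returns the cosine distance, defined as one minus the cosine of the angle, so its range is $[0,2]$ and smaller values mean closer vectors. A printed value near 1, such as the one above, therefore corresponds to a cosine near 0. The following sketch spells out both quantities with plain numpy; the two short vectors are made-up examples, not trained embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """One minus the cosine similarity, in [0, 2]."""
    return 1.0 - cosine_similarity(u, v)

# Made-up vectors separated by a 45 degree angle.
emb_1 = np.array([1.0, 0.0])
emb_2 = np.array([1.0, 1.0])

print(cosine_similarity(emb_1, emb_2))
print(cosine_distance(emb_1, emb_2))
```

When ranking neighbours of a word, sort by descending similarity or, equivalently, by ascending distance.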
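## Summary ## Summary
In this chapter, we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training a neural network model. In information retrieval, the cosine of the angle between vectors can be used to judge the relevance between a query and document keywords. In syntactic and semantic analysis, trained word vectors can be used to initialize models for better results. In document classification, word vectors make it possible to group synonyms in documents by clustering. We hope that after this chapter you will be able to apply word vectors in your own research in related fields. In this chapter, we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training a neural network model. In information retrieval, the cosine of the angle between vectors can be used to judge the relevance between a query and document keywords. In syntactic and semantic analysis, trained word vectors can be used to initialize models for better results. In document classification, word vectors make it possible to group synonyms in documents by clustering. We hope that after this chapter you will be able to apply word vectors in your own research in related fields.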
...@@ -461,6 +445,7 @@ python calculate_dis.py data/vocabulary.txt model/pass-00029/_proj.txt ...@@ -461,6 +445,7 @@ python calculate_dis.py data/vocabulary.txt model/pass-00029/_proj.txt
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
...@@ -479,6 +464,6 @@ marked.setOptions({ ...@@ -479,6 +464,6 @@ marked.setOptions({
} }
}); });
document.getElementById("context").innerHTML = marked( document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML) document.getElementById("markdown").innerHTML)
</script> </script>
</body> </body>