Commit 23a236bc, author: gongweibao

Merge remote-tracking branch 'upstream/develop' into develop

- repo: https://github.com/reyoung/mirrors-yapf.git - repo: https://github.com/pre-commit/mirrors-yapf.git
sha: v0.13.2 sha: v0.16.0
hooks: hooks:
- id: yapf - id: yapf
files: (.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$ # Bazel BUILD files follow Python syntax. files: \.py$
- repo: https://github.com/pre-commit/pre-commit-hooks - repo: https://github.com/pre-commit/pre-commit-hooks
sha: v0.7.1 sha: a11d9314b22d8f8c7556443875b731ef05965464
hooks: hooks:
- id: check-merge-conflict - id: check-merge-conflict
- id: check-symlinks - id: check-symlinks
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
files: \.md$ files: \.md$
- id: trailing-whitespace - id: trailing-whitespace
files: \.md$ files: \.md$
- repo: git://github.com/Lucas-C/pre-commit-hooks - repo: https://github.com/Lucas-C/pre-commit-hooks
sha: v1.0.1 sha: v1.0.1
hooks: hooks:
- id: forbid-crlf - id: forbid-crlf
...@@ -24,11 +24,11 @@ ...@@ -24,11 +24,11 @@
files: \.md$ files: \.md$
- id: remove-tabs - id: remove-tabs
files: \.md$ files: \.md$
- repo: local - repo: local
hooks: hooks:
- id: convert-markdown-into-html - id: convert-markdown-into-html
name: convert-markdown-into-html name: convert-markdown-into-html
description: "Convert README.md into index.html and README.en.md into index.en.html" description: Convert README.md into index.html and README.en.md into index.en.html
entry: python pre-commit-hooks/convert_markdown_into_html.py entry: python pre-commit-hooks/convert_markdown_into_html.py
language: system language: system
files: \.md$ files: \.md$
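The narrowing of the yapf hook's `files` pattern can be checked with a short sketch. It assumes that pre-commit treats `files` as a Python regular expression searched against each path; the file names are hypothetical and only serve as illustration.

```python
import re

# Patterns copied from the two sides of the diff above.
OLD_FILES = r"(.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$"
NEW_FILES = r"\.py$"


def selected(pattern, paths):
    """Return the paths a hook with this `files` pattern would run on.

    Assumption: the pattern is applied with re.search, as a Python regex.
    """
    return [p for p in paths if re.search(pattern, p)]


# Hypothetical file names, for illustration only.
paths = ["train.py", "third_party/BUILD", "rules.bzl", "README.md"]

old_hits = selected(OLD_FILES, paths)
new_hits = selected(NEW_FILES, paths)

print(old_hits)  # ['train.py', 'third_party/BUILD', 'rules.bzl']
print(new_hits)  # ['train.py']
```

Under that assumption, the new pattern limits yapf to Python sources and no longer touches Bazel files.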
# Deep Learning with PaddlePaddle
1. [Fit a Line](http://book.paddlepaddle.org/fit_a_line/index.en.html)
1. [Recognize Digits](http://book.paddlepaddle.org/recognize_digits/index.en.html)
1. [Image Classification](http://book.paddlepaddle.org/image_classification/index.en.html)
1. [Word to Vector](http://book.paddlepaddle.org/word2vec/index.en.html)
1. [Understand Sentiment](http://book.paddlepaddle.org/understand_sentiment/index.en.html)
1. [Label Semantic Roles](http://book.paddlepaddle.org/label_semantic_roles/index.en.html)
1. [Machine Translation](http://book.paddlepaddle.org/machine_translation/index.en.html)
1. [Recommender System](http://book.paddlepaddle.org/recommender_system/index.en.html)
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
# Introduction to Deep Learning # Introduction to Deep Learning
1. Getting Started [[fit_a_line](fit_a_line/)] [[html](http://book.paddlepaddle.org/fit_a_line)] 1. [Getting Started](http://book.paddlepaddle.org/fit_a_line)
1. Recognize Digits [[recognize_digits](recognize_digits/)] [[html](http://book.paddlepaddle.org/recognize_digits)] 1. [Recognize Digits](http://book.paddlepaddle.org/recognize_digits)
1. Image Classification [[image_classification](image_classification/)] [[html](http://book.paddlepaddle.org/image_classification)] 1. [Image Classification](http://book.paddlepaddle.org/image_classification)
1. Word Vectors [[word2vec](word2vec/)] [[html](http://book.paddlepaddle.org/word2vec)] 1. [Word Vectors](http://book.paddlepaddle.org/word2vec)
1. Sentiment Analysis [[understand_sentiment](understand_sentiment/)] [[html](http://book.paddlepaddle.org/understand_sentiment)] 1. [Sentiment Analysis](http://book.paddlepaddle.org/understand_sentiment)
1. Semantic Role Labeling [[label_semantic_roles](label_semantic_roles/)] [[html](http://book.paddlepaddle.org/label_semantic_roles)] 1. [Semantic Role Labeling](http://book.paddlepaddle.org/label_semantic_roles)
1. Machine Translation [[machine_translation](machine_translation/)] [[html](http://book.paddlepaddle.org/machine_translation)] 1. [Machine Translation](http://book.paddlepaddle.org/machine_translation)
1. Personalized Recommendation [[recommender_system](recommender_system/)] [[html](http://book.paddlepaddle.org/recommender_system)] 1. [Personalized Recommendation](http://book.paddlepaddle.org/recommender_system)
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
...@@ -202,4 +202,4 @@ This chapter introduces *Linear Regression* and how to train and test this model ...@@ -202,4 +202,4 @@ This chapter introduces *Linear Regression* and how to train and test this model
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128. 4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Common Creative License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> This tutorial was created and published with [Creative Common License 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/). This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
...@@ -244,7 +244,7 @@ This chapter introduces *Linear Regression* and how to train and test this model ...@@ -244,7 +244,7 @@ This chapter introduces *Linear Regression* and how to train and test this model
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128. 4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Common Creative License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> This tutorial was created and published with [Creative Common License 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/). This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -18,9 +18,8 @@ def main(): ...@@ -18,9 +18,8 @@ def main():
# create optimizer # create optimizer
optimizer = paddle.optimizer.Momentum(momentum=0) optimizer = paddle.optimizer.Momentum(momentum=0)
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(
parameters=parameters, cost=cost, parameters=parameters, update_equation=optimizer)
update_equation=optimizer)
feeding = {'x': 0, 'y': 1} feeding = {'x': 0, 'y': 1}
...@@ -33,16 +32,14 @@ def main(): ...@@ -33,16 +32,14 @@ def main():
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test( result = trainer.test(
reader=paddle.batch( reader=paddle.batch(uci_housing.test(), batch_size=2),
uci_housing.test(), batch_size=2),
feeding=feeding) feeding=feeding)
print "Test %d, Cost %f" % (event.pass_id, result.cost) print "Test %d, Cost %f" % (event.pass_id, result.cost)
# training # training
trainer.train( trainer.train(
reader=paddle.batch( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(uci_housing.train(), buf_size=500),
uci_housing.train(), buf_size=500),
batch_size=2), batch_size=2),
feeding=feeding, feeding=feeding,
event_handler=event_handler, event_handler=event_handler,
......
Image Classification Image Classification
======================= =======================
The source code of this chapter is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). For the first-time users, please refer to PaddlePaddle[Installation Tutorial](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html) for installation instructions. The source code of this chapter is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). For the first-time users, please refer to PaddlePaddle [Installation Tutorial](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html) for installation instructions.
## Background ## Background
...@@ -135,146 +135,73 @@ Figure 10. ResNet model for ImageNet ...@@ -135,146 +135,73 @@ Figure 10. ResNet model for ImageNet
</p> </p>
## Data Preparation ## Dataset
### Data description and downloading
Commonly used public datasets for image classification are CIFAR(https://www.cs.toronto.edu/~kriz/cifar.html), ImageNet(http://image-net.org/), COCO(http://mscoco.org/), etc. Those used for fine-grained image classification are CUB-200-2011(http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), Stanford Dog(http://vision.stanford.edu/aditya86/ImageNetDogs/), Oxford-flowers(http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among them, ImageNet is the largest, and most research results are reported on ImageNet, as mentioned in the Model Overview section. Since 2010, the ImageNet data has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories. There are 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images with 50 images per category on average. Commonly used public datasets for image classification are CIFAR(https://www.cs.toronto.edu/~kriz/cifar.html), ImageNet(http://image-net.org/), COCO(http://mscoco.org/), etc. Those used for fine-grained image classification are CUB-200-2011(http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), Stanford Dog(http://vision.stanford.edu/aditya86/ImageNetDogs/), Oxford-flowers(http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among them, ImageNet is the largest, and most research results are reported on ImageNet, as mentioned in the Model Overview section. Since 2010, the ImageNet data has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories. There are 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images with 50 images per category on average.
Since ImageNet is too large to be downloaded and trained efficiently, we use CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Figure 11 shows all the classes in CIFAR10 as well as 10 images randomly sampled from each category. Since ImageNet is too large to be downloaded and trained efficiently, we use CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Figure 11 shows all the classes in CIFAR-10 as well as 10 images randomly sampled from each category.
<p align="center"> <p align="center">
<img src="image/cifar.png" width="350"><br/> <img src="image/cifar.png" width="350"><br/>
Figure 11. CIFAR10 dataset[21] Figure 11. CIFAR10 dataset[21]
</p> </p>
The following command is used for downloading data and calculating the mean image used for data preprocessing. The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens` and `wmt14`, so there is no need to download and preprocess CIFAR-10 manually.
```bash
./data/get_data.sh
```
### Data provider for PaddlePaddle After issuing the command `python train.py`, training will start immediately. The following sections unpack the details of how it works.
We use Python interface for providing data to PaddlePaddle. The following file dataprovider.py is a complete example for CIFAR10. ## Model Structure
- 'initializer' function performs initialization of dataprovider: loading the mean image, defining two input types -- image and label. ### Initialize PaddlePaddle
- 'process' function sends preprocessed data to PaddlePaddle. Data preprocessing performed in this function includes data perturbation, random horizontal flipping, deducting mean image from the raw image. We must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).
```python ```python
import numpy as np import sys
import cPickle import paddle.v2 as paddle
from paddle.trainer.PyDataProvider2 import *
def initializer(settings, mean_path, is_train, **kwargs):
settings.is_train = is_train
settings.input_size = 3 * 32 * 32
settings.mean = np.load(mean_path)['mean']
settings.input_types = {
'image': dense_vector(settings.input_size),
'label': integer_value(10)
}
@provider(init_hook=initializer, pool_size=50000)
def process(settings, file_list):
with open(file_list, 'r') as fdata:
for fname in fdata:
fo = open(fname.strip(), 'rb')
batch = cPickle.load(fo)
fo.close()
images = batch['data']
labels = batch['labels']
for im, lab in zip(images, labels):
if settings.is_train and np.random.randint(2):
im = im.reshape(3, 32, 32)
im = im[:,:,::-1]
im = im.flatten()
im = im - settings.mean
yield {
'image': im.astype('float32'),
'label': int(lab)
}
```
## Model Config # PaddlePaddle init
paddle.init(use_gpu=False, trainer_count=1)
### Data Definition
In model config file, function `define_py_data_sources2` sets argument 'module' to dataprovider file for loading data, 'args' to mean image file. If the config file is used for prediction, then there is no need to set argument 'train_list'.
```python
from paddle.trainer_config_helpers import *
is_predict = get_config_arg("is_predict", bool, False)
if not is_predict:
define_py_data_sources2(
train_list='data/train.list',
test_list='data/test.list',
module='dataprovider',
obj='process',
args={'mean_path': 'data/mean.meta'})
```
### Algorithm Settings
In model config file, function 'settings' specifies optimization algorithm, batch size, learning rate, momentum and L2 regularization.
```python
settings(
batch_size=128,
learning_rate=0.1 / 128.0,
learning_rate_decay_a=0.1,
learning_rate_decay_b=50000 * 100,
learning_rate_schedule='discexp',
learning_method=MomentumOptimizer(0.9),
regularization=L2Regularization(0.0005 * 128),)
``` ```
The learning rate adjustment policy can be defined with variables `learning_rate_decay_a`($a$), `learning_rate_decay_b`($b$) and `learning_rate_schedule`. In this example, discrete exponential method is used for adjusting learning rate. The formula is as follows, As alluded to in section [Model Overview](#model-overview), here we provide the implementations of both VGG and ResNet models.
$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
where $n$ is the number of processed samples, $lr_{0}$ is the learning_rate set in 'settings'.
### Model Architecture
Here we provide the cofig files for VGG and ResNet models.
#### VGG ### VGG
First we define VGG network. Since the images in CIFAR10 are smaller and fewer than those in ImageNet, we use a small version of the VGG network for CIFAR10. Convolution groups incorporate BN and dropout operations. First, we use a VGG network. Since the images in CIFAR10 are smaller and fewer than those in ImageNet, we use a small version of the VGG network for CIFAR10. Convolution groups incorporate BN and dropout operations.
1. Define input data and its dimension 1. Define input data and its dimension
The input to the network is defined as `data_layer`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10. The input to the network is defined as `paddle.layer.data`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10.
```python ```python
datadim = 3 * 32 * 32 datadim = 3 * 32 * 32
classdim = 10 classdim = 10
data = data_layer(name='image', size=datadim) image = paddle.layer.data(
name="image", type=paddle.data_type.dense_vector(datadim))
``` ```
2. Define VGG main module 2. Define VGG main module
```python ```python
net = vgg_bn_drop(data) net = vgg_bn_drop(image)
``` ```
The input to VGG main module is from data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail: The input to VGG main module is from the data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail:
```python ```python
def vgg_bn_drop(input, num_channels): def vgg_bn_drop(input):
def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None): def conv_block(ipt, num_filter, groups, dropouts, num_channels=None):
return img_conv_group( return paddle.networks.img_conv_group(
input=ipt, input=ipt,
num_channels=num_channels_, num_channels=num_channels,
pool_size=2, pool_size=2,
pool_stride=2, pool_stride=2,
conv_num_filter=[num_filter] * groups, conv_num_filter=[num_filter] * groups,
conv_filter_size=3, conv_filter_size=3,
conv_act=ReluActivation(), conv_act=paddle.activation.Relu(),
conv_with_batchnorm=True, conv_with_batchnorm=True,
conv_batchnorm_drop_rate=dropouts, conv_batchnorm_drop_rate=dropouts,
pool_type=MaxPooling()) pool_type=paddle.pooling.Max())
conv1 = conv_block(input, 64, 2, [0.3, 0], 3) conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
conv2 = conv_block(conv1, 128, 2, [0.4, 0]) conv2 = conv_block(conv1, 128, 2, [0.4, 0])
...@@ -282,16 +209,17 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -282,16 +209,17 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0]) conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0]) conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
drop = dropout_layer(input=conv5, dropout_rate=0.5) drop = paddle.layer.dropout(input=conv5, dropout_rate=0.5)
fc1 = fc_layer(input=drop, size=512, act=LinearActivation()) fc1 = paddle.layer.fc(input=drop, size=512, act=paddle.activation.Linear())
bn = batch_norm_layer( bn = paddle.layer.batch_norm(
input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5)) input=fc1,
fc2 = fc_layer(input=bn, size=512, act=LinearActivation()) act=paddle.activation.Relu(),
layer_attr=paddle.attr.Extra(drop_rate=0.5))
fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear())
return fc2 return fc2
``` ```
2.1. First, define a convolution block, `conv_block`. The default convolution kernel is 3x3, and the default pooling size is 2x2 with stride 2. `dropouts` specifies the dropout probability for each convolution in the group. Function `img_conv_group`, defined in `paddle.trainer_config_helpers`, consists of a series of `Conv->BN->ReLu->Dropout` operations followed by a `Pooling`. 2.1. First, define a convolution block, `conv_block`. The default convolution kernel is 3x3, and the default pooling size is 2x2 with stride 2. `dropouts` specifies the dropout probability for each convolution in the group. Function `img_conv_group`, defined in `paddle.networks`, consists of a series of `Conv->BN->ReLu->Dropout` operations followed by a `Pooling`.
2.2. Five groups of convolutions. The first two groups perform two convolutions, while the last three groups perform three convolutions. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for this layer. 2.2. Five groups of convolutions. The first two groups perform two convolutions, while the last three groups perform three convolutions. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for this layer.
...@@ -309,15 +237,12 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -309,15 +237,12 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
4. Define Loss Function and Outputs 4. Define Loss Function and Outputs
In the context of supervised learning, labels of training images are defined in `data_layer`, too. During training, cross-entropy is used as loss function and as the output of the network; During testing, the outputs are the probabilities calculated in the classifier. In the context of supervised learning, labels of training images are defined in `paddle.layer.data`, too. During training, cross-entropy is used as loss function and as the output of the network; During testing, the outputs are the probabilities calculated in the classifier.
```python ```python
if not is_predict: lbl = paddle.layer.data(
lbl = data_layer(name="label", size=class_num) name="label", type=paddle.data_type.integer_value(classdim))
cost = classification_cost(input=out, label=lbl) cost = paddle.layer.classification_cost(input=out, label=lbl)
outputs(cost)
else:
outputs(out)
``` ```
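As a rough illustration of the cross-entropy loss mentioned in step 4, the sketch below computes it for a single example in plain Python 3, independent of PaddlePaddle. The helper name `cross_entropy` and the uniform probability vector are assumptions made for illustration; this is not how `classification_cost` is implemented internally.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability that the classifier assigns to the true class."""
    return -math.log(probs[label])


classdim = 10

# An untrained classifier that guesses uniformly over the 10 classes.
uniform = [1.0 / classdim] * classdim
loss = cross_entropy(uniform, 3)
print(round(loss, 4))  # 2.3026, i.e. ln(10)
```

A uniform guess over 10 classes costs ln(10), about 2.30, which is a useful reference point when reading the cost values of the first training batches.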
### ResNet ### ResNet
...@@ -325,13 +250,13 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -325,13 +250,13 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
The first, third and fourth steps of a ResNet are the same as a VGG. The second one is the main module. The first, third and fourth steps of a ResNet are the same as a VGG. The second one is the main module.
```python ```python
net = resnet_cifar10(data, depth=56) net = resnet_cifar10(data, depth=32)
``` ```
Here are some basic functions used in `resnet_cifar10`: Here are some basic functions used in `resnet_cifar10`:
- `conv_bn_layer` : convolutional layer followed by BN. - `conv_bn_layer` : convolutional layer followed by BN.
- `shortcut` : the shortcut branch in a residual block. There are two kinds of shortcuts: 1x1 convolution used when the number of channels between input and output are different; direct connection used otherwise. - `shortcut` : the shortcut branch in a residual block. There are two kinds of shortcuts: 1x1 convolution used when the number of channels between input and output is different; direct connection used otherwise.
- `basicblock` : a basic residual module as shown in the left of Figure 9, consisting of two sequential 3x3 convolutions and one "shortcut" branch. - `basicblock` : a basic residual module as shown in the left of Figure 9, consisting of two sequential 3x3 convolutions and one "shortcut" branch.
- `bottleneck` : a bottleneck module as shown in the right of Figure 9, consisting of two 1x1 convolutions with one 3x3 convolution in between, and a "shortcut" branch. - `bottleneck` : a bottleneck module as shown in the right of Figure 9, consisting of two 1x1 convolutions with one 3x3 convolution in between, and a "shortcut" branch.
...@@ -343,47 +268,38 @@ def conv_bn_layer(input, ...@@ -343,47 +268,38 @@ def conv_bn_layer(input,
filter_size, filter_size,
stride, stride,
padding, padding,
active_type=ReluActivation(), active_type=paddle.activation.Relu(),
ch_in=None): ch_in=None):
tmp = img_conv_layer( tmp = paddle.layer.img_conv(
input=input, input=input,
filter_size=filter_size, filter_size=filter_size,
num_channels=ch_in, num_channels=ch_in,
num_filters=ch_out, num_filters=ch_out,
stride=stride, stride=stride,
padding=padding, padding=padding,
act=LinearActivation(), act=paddle.activation.Linear(),
bias_attr=False) bias_attr=False)
return batch_norm_layer(input=tmp, act=active_type) return paddle.layer.batch_norm(input=tmp, act=active_type)
def shortcut(ipt, n_in, n_out, stride): def shortcut(ipt, n_in, n_out, stride):
if n_in != n_out: if n_in != n_out:
return conv_bn_layer(ipt, n_out, 1, stride, 0, LinearActivation()) return conv_bn_layer(ipt, n_out, 1, stride, 0,
paddle.activation.Linear())
else: else:
return ipt return ipt
def basicblock(ipt, ch_out, stride): def basicblock(ipt, ch_out, stride):
ch_in = ipt.num_filters ch_in = ch_out * 2
tmp = conv_bn_layer(ipt, ch_out, 3, stride, 1) tmp = conv_bn_layer(ipt, ch_out, 3, stride, 1)
tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, LinearActivation()) tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, paddle.activation.Linear())
short = shortcut(ipt, ch_in, ch_out, stride)
return addto_layer(input=[ipt, short], act=ReluActivation())
def bottleneck(ipt, ch_out, stride):
ch_in = ipt.num_filter
tmp = conv_bn_layer(ipt, ch_out, 1, stride, 0)
tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1)
tmp = conv_bn_layer(tmp, ch_out * 4, 1, 1, 0, LinearActivation())
short = shortcut(ipt, ch_in, ch_out, stride) short = shortcut(ipt, ch_in, ch_out, stride)
return addto_layer(input=[ipt, short], act=ReluActivation()) return paddle.layer.addto(input=[tmp, short], act=paddle.activation.Relu())
def layer_warp(block_func, ipt, features, count, stride): def layer_warp(block_func, ipt, features, count, stride):
tmp = block_func(ipt, features, stride) tmp = block_func(ipt, features, stride)
for i in range(1, count): for i in range(1, count):
tmp = block_func(tmp, features, 1) tmp = block_func(tmp, features, 1)
return tmp return tmp
``` ```
The following are the components of `resnet_cifar10`: The following are the components of `resnet_cifar10`:
...@@ -395,106 +311,131 @@ The following are the components of `resnet_cifar10`: ...@@ -395,106 +311,131 @@ The following are the components of `resnet_cifar10`:
Note: besides the first convolutional layer and the last fully-connected layer, the total number of layers in the three `layer_warp` calls should be divisible by 6; that is, the depth of `resnet_cifar10` should satisfy $(depth - 2) % 6 == 0$. Note: besides the first convolutional layer and the last fully-connected layer, the total number of layers in the three `layer_warp` calls should be divisible by 6; that is, the depth of `resnet_cifar10` should satisfy $(depth - 2) % 6 == 0$.
```python ```python
def resnet_cifar10(ipt, depth=56): def resnet_cifar10(ipt, depth=32):
# depth should be one of 20, 32, 44, 56, 110, 1202 # depth should be one of 20, 32, 44, 56, 110, 1202
assert (depth - 2) % 6 == 0 assert (depth - 2) % 6 == 0
n = (depth - 2) / 6 n = (depth - 2) / 6
nStages = {16, 64, 128} nStages = {16, 64, 128}
conv1 = conv_bn_layer(ipt, conv1 = conv_bn_layer(
ch_in=3, ipt, ch_in=3, ch_out=16, filter_size=3, stride=1, padding=1)
ch_out=16,
filter_size=3,
stride=1,
padding=1)
res1 = layer_warp(basicblock, conv1, 16, n, 1) res1 = layer_warp(basicblock, conv1, 16, n, 1)
res2 = layer_warp(basicblock, res1, 32, n, 2) res2 = layer_warp(basicblock, res1, 32, n, 2)
res3 = layer_warp(basicblock, res2, 64, n, 2) res3 = layer_warp(basicblock, res2, 64, n, 2)
pool = img_pool_layer(input=res3, pool = paddle.layer.img_pool(
pool_size=8, input=res3, pool_size=8, stride=1, pool_type=paddle.pooling.Avg())
stride=1,
pool_type=AvgPooling())
return pool return pool
``` ```
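The depth constraint and the choice of `pool_size=8` can be checked with a little arithmetic. The following plain Python 3 sketch does not use PaddlePaddle; the helper names are hypothetical, and the output-size formula is the standard floor-mode convolution formula, assumed here rather than taken from the library.

```python
def blocks_per_stage(depth):
    """Number of basic blocks in each of the three stages."""
    assert (depth - 2) % 6 == 0, "depth must satisfy (depth - 2) % 6 == 0"
    return (depth - 2) // 6


def conv_out_size(size, filter_size, stride, padding):
    """Standard convolution output size in floor mode (an assumption here)."""
    return (size + 2 * padding - filter_size) // stride + 1


def stage_sizes(input_size=32):
    """Spatial size of the feature map after res1, res2 and res3."""
    size = conv_out_size(input_size, 3, 1, 1)  # conv1: 3x3, stride 1, padding 1
    sizes = []
    for stride in (1, 2, 2):  # stride of the first block in each stage
        size = conv_out_size(size, 3, stride, 1)
        sizes.append(size)
    return sizes


for depth in (20, 32, 44, 56, 110, 1202):
    n = blocks_per_stage(depth)
    # 3 stages * n blocks * 2 convolutions, plus the first conv and the final fc
    assert 3 * n * 2 + 2 == depth

print(blocks_per_stage(32))  # 5
print(stage_sizes())         # [32, 16, 8]
```

With a 32x32 input, the last stage produces 8x8 feature maps, so an 8x8 average pooling reduces each channel to a single value.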
## Model Training ## Model Training
We can train the model by running the script train.sh, which specifies config file, device type, number of threads, number of passes, path to the trained models, etc, ### Define Parameters
``` bash First, we create the model parameters according to the previous model configuration `cost`.
sh train.sh
```
Here is an example script `train.sh`: ```python
# Create parameters
```bash parameters = paddle.parameters.create(cost)
#cfg=models/resnet.py
cfg=models/vgg.py
output=output
log=train.log
paddle train \
--config=$cfg \
--use_gpu=true \
--trainer_count=1 \
--log_period=100 \
--num_passes=300 \
--save_dir=$output \
2>&1 | tee $log
``` ```
- `--config=$cfg` : specifies config file. The default is `models/vgg.py`. ### Create Trainer
- `--use_gpu=true` : uses GPU for training. If use CPU,set it to be false.
- `--trainer_count=1` : specifies the number of threads or GPUs.
- `--log_period=100` : specifies the number of batches between two logs.
- `--save_dir=$output` : specifies the path for saving trained models.
Here is an example log after training for one pass. The average error rates are 0.79958 on training set and 0.7858 on validation set. Before creating the trainer, we also need to configure the optimization algorithm.
Here we specify the `Momentum` optimization algorithm via `paddle.optimizer`.
```text ```python
TrainerInternal.cpp:165] Batch=300 samples=38400 AvgCost=2.07708 CurrentCost=1.96158 Eval: classification_error_evaluator=0.81151 CurrentEval: classification_error_evaluator=0.789297 # Create optimizer
TrainerInternal.cpp:181] Pass=0 Batch=391 samples=50000 AvgCost=2.03348 Eval: classification_error_evaluator=0.79958 momentum_optimizer = paddle.optimizer.Momentum(
Tester.cpp:115] Test samples=10000 cost=1.99246 Eval: classification_error_evaluator=0.7858 momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0002 * 128),
learning_rate=0.1 / 128.0,
learning_rate_decay_a=0.1,
learning_rate_decay_b=50000 * 100,
learning_rate_schedule='discexp',
batch_size=128)
# Create trainer
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=momentum_optimizer)
``` ```
Figure 12 shows the curve of training error rate, which indicates it converges at Pass 200 with error rate 8.54%. The learning rate adjustment policy can be defined with variables `learning_rate_decay_a`($a$), `learning_rate_decay_b`($b$) and `learning_rate_schedule`. In this example, discrete exponential method is used for adjusting learning rate. The formula is as follows,
$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
where $n$ is the number of processed samples, $lr_{0}$ is the learning_rate.
<p align="center"> ### Training
<img src="image/plot_en.png" width="400" ><br/>
Figure 12. The error rate of VGG model on CIFAR10
</p>
## Model Application `cifar.train10()` yields records during each pass; after shuffling, the records are grouped into batches for training.
After training is done, the model from each pass is saved in `output/pass-%05d`. For example, the model of Pass 300 is saved in `output/pass-00299`. The script `classify.py` can be used to extract features and to classify an image. The default config file of this script is `models/vgg.py`. ```python
reader=paddle.batch(
paddle.reader.shuffle(
paddle.dataset.cifar.train10(), buf_size=50000),
batch_size=128)
```
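To make the shuffle-then-batch composition concrete, here is a minimal plain Python 3 sketch of the two reader decorators. It only mimics the idea and is not PaddlePaddle's code: the functions `shuffle`, `batch` and `toy_reader` are written for illustration, the `seed` parameter is an addition, and it is an assumption that a final, smaller batch is yielded.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Wrap a reader so that records are shuffled within a buffer of buf_size."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for r in buf:
                    yield r
                buf = []
        if buf:  # flush the remaining records
            rng.shuffle(buf)
            for r in buf:
                yield r

    return shuffled_reader


def batch(reader, batch_size):
    """Wrap a reader so that it yields lists of up to batch_size records."""
    def batch_reader():
        current = []
        for record in reader():
            current.append(record)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # assumption: the last partial batch is kept
            yield current

    return batch_reader


def toy_reader():
    """Hypothetical dataset of 10 (feature, label) records."""
    for i in range(10):
        yield (i, i % 2)


batches = list(batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=4)())
print([len(b) for b in batches])  # [4, 4, 2]
```

A buffer as large as the dataset, as in `buf_size=50000` for the 50000 CIFAR-10 training images, shuffles the whole pass at once.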
`feeding` specifies the correspondence between the fields of each yielded record and the `paddle.layer.data` layers. For instance,
the first column of data generated by `cifar.train10()` corresponds to the `image` layer.
```python
feeding={'image': 0,
'label': 1}
```
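Conceptually, `feeding` maps each data layer name to a position in the record. The plain Python 3 sketch below illustrates that mapping; the `route` helper and the sample record are hypothetical and are not part of the PaddlePaddle API.

```python
feeding = {'image': 0, 'label': 1}


def route(record, feeding):
    """Pick the field of a record that belongs to each named data layer."""
    return {name: record[index] for name, index in feeding.items()}


# Hypothetical record: a flattened 3x32x32 image and an integer class label.
record = ([0.0] * (3 * 32 * 32), 7)

fed = route(record, feeding)
print(len(fed['image']), fed['label'])  # 3072 7
```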
### Prediction Callback function `event_handler` will be called during training when a pre-defined event happens.
We can run the following script to predict the category of an image. The default device is GPU. If to use CPU, set `-c`.
```bash ```python
python classify.py --job=predict --model=output/pass-00299 --data=image/dog.png # -c # event handler to track training and testing process
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=paddle.batch(
paddle.dataset.cifar.test10(), batch_size=128),
feeding=feeding)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
``` ```
Here is the result: Finally, we can invoke `trainer.train` to start training:
```text ```python
Label of image/dog.png is: 5 trainer.train(
reader=reader,
num_passes=200,
event_handler=event_handler,
feeding=feeding)
``` ```
### Feature Extraction Here is an example log after training for one pass. The logged error rate is 0.6875 on the training set (at Batch 300) and 0.8852 on the test set.
We can run the following command to extract features from an image. Here `job` should be `extract` and the default layer is the first convolutional layer. Figure 13 shows the 64 feature maps output from the first convolutional layer of the VGG model. ```text
Pass 0, Batch 0, Cost 2.473182, {'classification_error_evaluator': 0.9140625}
```bash ...................................................................................................
python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png # -c Pass 0, Batch 100, Cost 1.913076, {'classification_error_evaluator': 0.78125}
...................................................................................................
Pass 0, Batch 200, Cost 1.783041, {'classification_error_evaluator': 0.7421875}
...................................................................................................
Pass 0, Batch 300, Cost 1.668833, {'classification_error_evaluator': 0.6875}
..........................................................................................
Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
``` ```
Figure 12 shows the curve of training error rate, which indicates it converges at Pass 200 with error rate 8.54%.
<p align="center"> <p align="center">
<img src="image/fea_conv0.png" width="500"><br/> <img src="image/plot_en.png" width="400" ><br/>
Figure 13. Visualization of convolution layer feature maps Figure 12. The error rate of VGG model on CIFAR10
</p> </p>
After training is done, the model from each pass is saved in `output/pass-%05d`. For example, the model of Pass 300 is saved in `output/pass-00299`.
## Conclusion ## Conclusion
Traditional image classification methods involve multiple stages of processing, and the overall framework is complicated. In contrast, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduce three models (VGG, GoogleNet, and ResNet), provide PaddlePaddle config files for training VGG and ResNet on CIFAR10, and explain how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedure for configuration and training is the same, and you are welcome to give it a try. Traditional image classification methods involve multiple stages of processing, and the overall framework is complicated. In contrast, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduce three models (VGG, GoogleNet, and ResNet), provide PaddlePaddle config files for training VGG and ResNet on CIFAR10, and explain how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedure for configuration and training is the same, and you are welcome to give it a try.
...@@ -547,4 +488,4 @@ Traditional image classification methods involve multiple stages of processing a ...@@ -547,4 +488,4 @@ Traditional image classification methods involve multiple stages of processing a
[22] http://cs231n.github.io/classification/ [22] http://cs231n.github.io/classification/
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and licensed under the <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
...@@ -252,7 +252,7 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -252,7 +252,7 @@ paddle.init(use_gpu=False, trainer_count=1)
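Steps 1, 3, and 4 of the ResNet model are the same as for the VGG model and are not repeated here. We mainly describe step 2, the core ResNet module for the CIFAR10 dataset. Steps 1, 3, and 4 of the ResNet model are the same as for the VGG model and are not repeated here. We mainly describe step 2, the core ResNet module for the CIFAR10 dataset.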
```python ```python
net = resnet_cifar10(data, depth=56) net = resnet_cifar10(image, depth=56)
``` ```
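We first introduce some basic functions used in `resnet_cifar10`, then describe how the network is connected. We first introduce some basic functions used in `resnet_cifar10`, then describe how the network is connected.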
...@@ -375,7 +375,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$ ...@@ -375,7 +375,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
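cifar.train10() yields one sample at a time; after shuffling and batching, the samples are used as the training input. cifar.train10() yields one sample at a time; after shuffling and batching, the samples are used as the training input.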
```python ```python
reader=paddle.reader.batch( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.cifar.train10(), buf_size=50000), paddle.dataset.cifar.train10(), buf_size=50000),
batch_size=128) batch_size=128)
...@@ -402,10 +402,9 @@ def event_handler(event): ...@@ -402,10 +402,9 @@ def event_handler(event):
sys.stdout.flush() sys.stdout.flush()
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test( result = trainer.test(
reader=paddle.reader.batch( reader=paddle.batch(
paddle.dataset.cifar.test10(), batch_size=128), paddle.dataset.cifar.test10(), batch_size=128),
reader_dict={'image': 0, feeding=feeding)
'label': 1})
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
``` ```
......
...@@ -44,8 +44,9 @@ def vis_square(data, fname): ...@@ -44,8 +44,9 @@ def vis_square(data, fname):
(0, 1)) # add some space between filters (0, 1)) # add some space between filters
+ ((0, 0), ) * + ((0, 0), ) *
(data.ndim - 3)) # don't pad the last dimension (if there is one) (data.ndim - 3)) # don't pad the last dimension (if there is one)
data = np.pad(data, padding, mode='constant', data = np.pad(
constant_values=1) # pad with ones (white) data, padding, mode='constant',
constant_values=1) # pad with ones (white)
# tile the filters into an image # tile the filters into an image
data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple( data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple(
range(4, data.ndim + 1))) range(4, data.ndim + 1)))
......
...@@ -43,7 +43,7 @@ ...@@ -43,7 +43,7 @@
Image Classification Image Classification
======================= =======================
The source code of this chapter is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). For the first-time users, please refer to PaddlePaddle[Installation Tutorial](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html) for installation instructions. The source code of this chapter is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification). For the first-time users, please refer to PaddlePaddle [Installation Tutorial](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html) for installation instructions.
## Background ## Background
...@@ -177,146 +177,73 @@ Figure 10. ResNet model for ImageNet ...@@ -177,146 +177,73 @@ Figure 10. ResNet model for ImageNet
</p> </p>
## Data Preparation ## Dataset
### Data description and downloading
Commonly used public datasets for image classification are CIFAR (https://www.cs.toronto.edu/~kriz/cifar.html), ImageNet (http://image-net.org/), COCO (http://mscoco.org/), etc. Those used for fine-grained image classification are CUB-200-2011 (http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), Stanford Dog (http://vision.stanford.edu/aditya86/ImageNetDogs/), Oxford-flowers (http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among them, ImageNet is the largest, and most research results are reported on it, as mentioned in the Model Overview section. Since 2010, the ImageNet data has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories. There are 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images with 50 images per category on average. Commonly used public datasets for image classification are CIFAR (https://www.cs.toronto.edu/~kriz/cifar.html), ImageNet (http://image-net.org/), COCO (http://mscoco.org/), etc. Those used for fine-grained image classification are CUB-200-2011 (http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), Stanford Dog (http://vision.stanford.edu/aditya86/ImageNetDogs/), Oxford-flowers (http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among them, ImageNet is the largest, and most research results are reported on it, as mentioned in the Model Overview section. Since 2010, the ImageNet data has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories. There are 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images with 50 images per category on average.
Since ImageNet is too large to be downloaded and trained efficiently, we use CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Figure 11 shows all the classes in CIFAR10 as well as 10 images randomly sampled from each category. Since ImageNet is too large to be downloaded and trained efficiently, we use CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Figure 11 shows all the classes in CIFAR-10 as well as 10 images randomly sampled from each category.
<p align="center"> <p align="center">
<img src="image/cifar.png" width="350"><br/> <img src="image/cifar.png" width="350"><br/>
Figure 11. CIFAR10 dataset[21] Figure 11. CIFAR10 dataset[21]
</p> </p>
The following command is used for downloading data and calculating the mean image used for data preprocessing. The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens` and `wmt14`. There is no need to download and preprocess CIFAR-10 manually.
```bash
./data/get_data.sh
```
### Data provider for PaddlePaddle After issuing the command `python train.py`, training starts immediately. The following sections walk through the details of how it works.
We use Python interface for providing data to PaddlePaddle. The following file dataprovider.py is a complete example for CIFAR10. ## Model Structure
- 'initializer' function performs initialization of dataprovider: loading the mean image, defining two input types -- image and label. ### Initialize PaddlePaddle
- 'process' function sends preprocessed data to PaddlePaddle. Data preprocessing performed in this function includes data perturbation, random horizontal flipping, deducting mean image from the raw image. We must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).
```python ```python
import numpy as np import sys
import cPickle import paddle.v2 as paddle
from paddle.trainer.PyDataProvider2 import *
def initializer(settings, mean_path, is_train, **kwargs):
settings.is_train = is_train
settings.input_size = 3 * 32 * 32
settings.mean = np.load(mean_path)['mean']
settings.input_types = {
'image': dense_vector(settings.input_size),
'label': integer_value(10)
}
@provider(init_hook=initializer, pool_size=50000)
def process(settings, file_list):
with open(file_list, 'r') as fdata:
for fname in fdata:
fo = open(fname.strip(), 'rb')
batch = cPickle.load(fo)
fo.close()
images = batch['data']
labels = batch['labels']
for im, lab in zip(images, labels):
if settings.is_train and np.random.randint(2):
im = im.reshape(3, 32, 32)
im = im[:,:,::-1]
im = im.flatten()
im = im - settings.mean
yield {
'image': im.astype('float32'),
'label': int(lab)
}
```
## Model Config # PaddlePaddle init
paddle.init(use_gpu=False, trainer_count=1)
### Data Definition
In model config file, function `define_py_data_sources2` sets argument 'module' to dataprovider file for loading data, 'args' to mean image file. If the config file is used for prediction, then there is no need to set argument 'train_list'.
```python
from paddle.trainer_config_helpers import *
is_predict = get_config_arg("is_predict", bool, False)
if not is_predict:
define_py_data_sources2(
train_list='data/train.list',
test_list='data/test.list',
module='dataprovider',
obj='process',
args={'mean_path': 'data/mean.meta'})
```
### Algorithm Settings
In model config file, function 'settings' specifies optimization algorithm, batch size, learning rate, momentum and L2 regularization.
```python
settings(
batch_size=128,
learning_rate=0.1 / 128.0,
learning_rate_decay_a=0.1,
learning_rate_decay_b=50000 * 100,
learning_rate_schedule='discexp',
learning_method=MomentumOptimizer(0.9),
regularization=L2Regularization(0.0005 * 128),)
``` ```
The learning rate adjustment policy can be defined with variables `learning_rate_decay_a`($a$), `learning_rate_decay_b`($b$) and `learning_rate_schedule`. In this example, discrete exponential method is used for adjusting learning rate. The formula is as follows, As alluded to in section [Model Overview](#model-overview), here we provide the implementations of both VGG and ResNet models.
$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
where $n$ is the number of processed samples, $lr_{0}$ is the learning_rate set in 'settings'.
### Model Architecture
Here we provide the config files for the VGG and ResNet models.
#### VGG ### VGG
First we define the VGG network. Since the image size and the number of images in CIFAR10 are relatively small compared to ImageNet, we use a small version of the VGG network for CIFAR10. Convolution groups incorporate BN and dropout operations. First, we use a VGG network. Since the image size and the number of images in CIFAR10 are relatively small compared to ImageNet, we use a small version of the VGG network for CIFAR10. Convolution groups incorporate BN and dropout operations.
1. Define input data and its dimension 1. Define input data and its dimension
The input to the network is defined as `data_layer`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10. The input to the network is defined as `paddle.layer.data`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10.
```python ```python
datadim = 3 * 32 * 32 datadim = 3 * 32 * 32
classdim = 10 classdim = 10
data = data_layer(name='image', size=datadim) image = paddle.layer.data(
name="image", type=paddle.data_type.dense_vector(datadim))
``` ```
2. Define VGG main module 2. Define VGG main module
```python ```python
net = vgg_bn_drop(data) net = vgg_bn_drop(image)
``` ```
The input to VGG main module is from data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail: The input to VGG main module is from the data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail:
```python ```python
def vgg_bn_drop(input, num_channels): def vgg_bn_drop(input):
def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None): def conv_block(ipt, num_filter, groups, dropouts, num_channels=None):
return img_conv_group( return paddle.networks.img_conv_group(
input=ipt, input=ipt,
num_channels=num_channels_, num_channels=num_channels,
pool_size=2, pool_size=2,
pool_stride=2, pool_stride=2,
conv_num_filter=[num_filter] * groups, conv_num_filter=[num_filter] * groups,
conv_filter_size=3, conv_filter_size=3,
conv_act=ReluActivation(), conv_act=paddle.activation.Relu(),
conv_with_batchnorm=True, conv_with_batchnorm=True,
conv_batchnorm_drop_rate=dropouts, conv_batchnorm_drop_rate=dropouts,
pool_type=MaxPooling()) pool_type=paddle.pooling.Max())
conv1 = conv_block(input, 64, 2, [0.3, 0], 3) conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
conv2 = conv_block(conv1, 128, 2, [0.4, 0]) conv2 = conv_block(conv1, 128, 2, [0.4, 0])
...@@ -324,16 +251,17 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -324,16 +251,17 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0]) conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0]) conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
drop = dropout_layer(input=conv5, dropout_rate=0.5) drop = paddle.layer.dropout(input=conv5, dropout_rate=0.5)
fc1 = fc_layer(input=drop, size=512, act=LinearActivation()) fc1 = paddle.layer.fc(input=drop, size=512, act=paddle.activation.Linear())
bn = batch_norm_layer( bn = paddle.layer.batch_norm(
input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5)) input=fc1,
fc2 = fc_layer(input=bn, size=512, act=LinearActivation()) act=paddle.activation.Relu(),
layer_attr=paddle.attr.Extra(drop_rate=0.5))
fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear())
return fc2 return fc2
``` ```
2.1. First, define a convolution block, `conv_block`. The default convolution kernel is 3x3, and the default pooling size is 2x2 with stride 2. `dropouts` specifies the probabilities used in the dropout operations. The function `img_conv_group`, defined in `paddle.trainer_config_helpers`, consists of a series of `Conv->BN->ReLu->Dropout` operations followed by a `Pooling`. 2.1. First, define a convolution block, `conv_block`. The default convolution kernel is 3x3, and the default pooling size is 2x2 with stride 2. `dropouts` specifies the probabilities used in the dropout operations. The function `img_conv_group`, defined in `paddle.networks`, consists of a series of `Conv->BN->ReLu->Dropout` operations followed by a `Pooling`.
2.2. Five groups of convolutions. The first two groups perform two convolutions, while the last three groups perform three convolutions. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for this layer. 2.2. Five groups of convolutions. The first two groups perform two convolutions, while the last three groups perform three convolutions. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for this layer.
...@@ -351,15 +279,12 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -351,15 +279,12 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
4. Define Loss Function and Outputs 4. Define Loss Function and Outputs
In the context of supervised learning, labels of training images are defined in `data_layer`, too. During training, cross-entropy is used as the loss function and as the output of the network; during testing, the outputs are the probabilities calculated in the classifier. In the context of supervised learning, labels of training images are defined in `paddle.layer.data`, too. During training, cross-entropy is used as the loss function and as the output of the network; during testing, the outputs are the probabilities calculated in the classifier.
```python ```python
if not is_predict: lbl = paddle.layer.data(
lbl = data_layer(name="label", size=class_num) name="label", type=paddle.data_type.integer_value(classdim))
cost = classification_cost(input=out, label=lbl) cost = paddle.layer.classification_cost(input=out, label=lbl)
outputs(cost)
else:
outputs(out)
``` ```
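To make the loss concrete, here is a minimal sketch of the cross-entropy for a single sample, assuming the classifier output is a list of class probabilities. The helper `cross_entropy` is a hypothetical illustration in plain Python and is not the implementation behind `classification_cost`.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one sample: -log of the probability of the true class.

    probs: list of class probabilities (assumed to sum to 1).
    label: integer index of the true class.
    """
    return -math.log(probs[label])


# Hypothetical classifier output over the 10 CIFAR10 classes.
probs = [0.05] * 10
probs[5] = 0.55

print("loss if the true class is 5: %f" % cross_entropy(probs, 5))
print("loss if the true class is 3: %f" % cross_entropy(probs, 3))
```

The loss is small when the classifier assigns high probability to the true class and grows as that probability approaches zero.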
### ResNet ### ResNet
...@@ -367,13 +292,13 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela ...@@ -367,13 +292,13 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
The first, third and fourth steps of a ResNet are the same as a VGG. The second one is the main module. The first, third and fourth steps of a ResNet are the same as a VGG. The second one is the main module.
```python ```python
net = resnet_cifar10(data, depth=56) net = resnet_cifar10(image, depth=32)
``` ```
Here are some basic functions used in `resnet_cifar10`: Here are some basic functions used in `resnet_cifar10`:
- `conv_bn_layer` : convolutional layer followed by BN. - `conv_bn_layer` : convolutional layer followed by BN.
- `shortcut` : the shortcut branch in a residual block. There are two kinds of shortcuts: 1x1 convolution used when the number of channels between input and output are different; direct connection used otherwise. - `shortcut` : the shortcut branch in a residual block. There are two kinds of shortcuts: 1x1 convolution used when the number of channels between input and output is different; direct connection used otherwise.
- `basicblock` : a basic residual module as shown in the left of Figure 9, consisting of two sequential 3x3 convolutions and one "shortcut" branch. - `basicblock` : a basic residual module as shown in the left of Figure 9, consisting of two sequential 3x3 convolutions and one "shortcut" branch.
- `bottleneck` : a bottleneck module as shown in the right of Figure 9, consisting of two 1x1 convolutions with one 3x3 convolution in between, plus a "shortcut" branch. - `bottleneck` : a bottleneck module as shown in the right of Figure 9, consisting of two 1x1 convolutions with one 3x3 convolution in between, plus a "shortcut" branch.
...@@ -385,47 +310,38 @@ def conv_bn_layer(input, ...@@ -385,47 +310,38 @@ def conv_bn_layer(input,
filter_size, filter_size,
stride, stride,
padding, padding,
active_type=ReluActivation(), active_type=paddle.activation.Relu(),
ch_in=None): ch_in=None):
tmp = img_conv_layer( tmp = paddle.layer.img_conv(
input=input, input=input,
filter_size=filter_size, filter_size=filter_size,
num_channels=ch_in, num_channels=ch_in,
num_filters=ch_out, num_filters=ch_out,
stride=stride, stride=stride,
padding=padding, padding=padding,
act=LinearActivation(), act=paddle.activation.Linear(),
bias_attr=False) bias_attr=False)
return batch_norm_layer(input=tmp, act=active_type) return paddle.layer.batch_norm(input=tmp, act=active_type)
def shortcut(ipt, n_in, n_out, stride): def shortcut(ipt, n_in, n_out, stride):
if n_in != n_out: if n_in != n_out:
return conv_bn_layer(ipt, n_out, 1, stride, 0, LinearActivation()) return conv_bn_layer(ipt, n_out, 1, stride, 0,
paddle.activation.Linear())
else: else:
return ipt return ipt
def basicblock(ipt, ch_out, stride): def basicblock(ipt, ch_out, stride):
ch_in = ipt.num_filters ch_in = ch_out * 2
tmp = conv_bn_layer(ipt, ch_out, 3, stride, 1) tmp = conv_bn_layer(ipt, ch_out, 3, stride, 1)
tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, LinearActivation()) tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1, paddle.activation.Linear())
short = shortcut(ipt, ch_in, ch_out, stride)
return addto_layer(input=[ipt, short], act=ReluActivation())
def bottleneck(ipt, ch_out, stride):
ch_in = ipt.num_filter
tmp = conv_bn_layer(ipt, ch_out, 1, stride, 0)
tmp = conv_bn_layer(tmp, ch_out, 3, 1, 1)
tmp = conv_bn_layer(tmp, ch_out * 4, 1, 1, 0, LinearActivation())
short = shortcut(ipt, ch_in, ch_out, stride) short = shortcut(ipt, ch_in, ch_out, stride)
return addto_layer(input=[ipt, short], act=ReluActivation()) return paddle.layer.addto(input=[tmp, short], act=paddle.activation.Relu())
def layer_warp(block_func, ipt, features, count, stride): def layer_warp(block_func, ipt, features, count, stride):
tmp = block_func(ipt, features, stride) tmp = block_func(ipt, features, stride)
for i in range(1, count): for i in range(1, count):
tmp = block_func(tmp, features, 1) tmp = block_func(tmp, features, 1)
return tmp return tmp
``` ```
The following are the components of `resnet_cifar10`: The following are the components of `resnet_cifar10`:
...@@ -437,106 +353,131 @@ The following are the components of `resnet_cifar10`: ...@@ -437,106 +353,131 @@ The following are the components of `resnet_cifar10`:
Note: besides the first convolutional layer and the last fully-connected layer, the total number of layers in the three `layer_warp` calls should be divisible by 6; that is, the depth of `resnet_cifar10` should satisfy $(depth - 2) % 6 == 0$. Note: besides the first convolutional layer and the last fully-connected layer, the total number of layers in the three `layer_warp` calls should be divisible by 6; that is, the depth of `resnet_cifar10` should satisfy $(depth - 2) % 6 == 0$.
```python ```python
def resnet_cifar10(ipt, depth=56): def resnet_cifar10(ipt, depth=32):
# depth should be one of 20, 32, 44, 56, 110, 1202 # depth should be one of 20, 32, 44, 56, 110, 1202
assert (depth - 2) % 6 == 0 assert (depth - 2) % 6 == 0
n = (depth - 2) / 6 n = (depth - 2) / 6
nStages = {16, 64, 128} nStages = {16, 64, 128}
conv1 = conv_bn_layer(ipt, conv1 = conv_bn_layer(
ch_in=3, ipt, ch_in=3, ch_out=16, filter_size=3, stride=1, padding=1)
ch_out=16,
filter_size=3,
stride=1,
padding=1)
res1 = layer_warp(basicblock, conv1, 16, n, 1) res1 = layer_warp(basicblock, conv1, 16, n, 1)
res2 = layer_warp(basicblock, res1, 32, n, 2) res2 = layer_warp(basicblock, res1, 32, n, 2)
res3 = layer_warp(basicblock, res2, 64, n, 2) res3 = layer_warp(basicblock, res2, 64, n, 2)
pool = img_pool_layer(input=res3, pool = paddle.layer.img_pool(
pool_size=8, input=res3, pool_size=8, stride=1, pool_type=paddle.pooling.Avg())
stride=1,
pool_type=AvgPooling())
return pool return pool
``` ```
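The depth constraint follows from the structure of `resnet_cifar10`: three `layer_warp` stages each stack `n` basic blocks of two convolutions, plus the first convolutional layer and the last fully-connected layer, so `depth = 6 * n + 2`. The sketch below checks this for the depths listed in the code comment; `blocks_per_stage` is a hypothetical helper introduced only for illustration.

```python
def blocks_per_stage(depth):
    """Return n, the number of basic blocks in each of the three stages."""
    if (depth - 2) % 6 != 0:
        raise ValueError("depth must satisfy (depth - 2) % 6 == 0")
    # Integer division, so the result is an int on Python 2 and Python 3.
    return (depth - 2) // 6


for depth in (20, 32, 44, 56, 110, 1202):
    n = blocks_per_stage(depth)
    # 3 stages * n blocks * 2 convolutions, plus the first conv and the last fc.
    assert 3 * n * 2 + 2 == depth
    print("depth %d -> %d blocks per stage" % (depth, n))
```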
## Model Training ## Model Training
We can train the model by running the script train.sh, which specifies config file, device type, number of threads, number of passes, path to the trained models, etc, ### Define Parameters
``` bash First, we create the model parameters according to the previous model configuration `cost`.
sh train.sh
```
Here is an example script `train.sh`: ```python
# Create parameters
```bash parameters = paddle.parameters.create(cost)
#cfg=models/resnet.py
cfg=models/vgg.py
output=output
log=train.log
paddle train \
--config=$cfg \
--use_gpu=true \
--trainer_count=1 \
--log_period=100 \
--num_passes=300 \
--save_dir=$output \
2>&1 | tee $log
``` ```
- `--config=$cfg` : specifies config file. The default is `models/vgg.py`. ### Create Trainer
- `--use_gpu=true` : uses GPU for training. If use CPU,set it to be false.
- `--trainer_count=1` : specifies the number of threads or GPUs.
- `--log_period=100` : specifies the number of batches between two logs.
- `--save_dir=$output` : specifies the path for saving trained models.
Here is an example log after training for one pass. The average error rates are 0.79958 on training set and 0.7858 on validation set. Before creating the trainer, we also need to configure the optimization algorithm.
Here we specify the `Momentum` optimization algorithm via `paddle.optimizer`.
```text ```python
TrainerInternal.cpp:165] Batch=300 samples=38400 AvgCost=2.07708 CurrentCost=1.96158 Eval: classification_error_evaluator=0.81151 CurrentEval: classification_error_evaluator=0.789297 # Create optimizer
TrainerInternal.cpp:181] Pass=0 Batch=391 samples=50000 AvgCost=2.03348 Eval: classification_error_evaluator=0.79958 momentum_optimizer = paddle.optimizer.Momentum(
Tester.cpp:115] Test samples=10000 cost=1.99246 Eval: classification_error_evaluator=0.7858 momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0002 * 128),
learning_rate=0.1 / 128.0,
learning_rate_decay_a=0.1,
learning_rate_decay_b=50000 * 100,
learning_rate_schedule='discexp',
batch_size=128)
# Create trainer
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=momentum_optimizer)
``` ```
Figure 12 shows the curve of training error rate, which indicates it converges at Pass 200 with error rate 8.54%. The learning rate adjustment policy can be defined with variables `learning_rate_decay_a`($a$), `learning_rate_decay_b`($b$) and `learning_rate_schedule`. In this example, discrete exponential method is used for adjusting learning rate. The formula is as follows,
$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
where $n$ is the number of processed samples and $lr_{0}$ is the initial learning rate (`learning_rate`).
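As a rough illustration of the formula, the sketch below evaluates the discrete exponential schedule in plain Python with the values used in this example ($a = 0.1$, $b = 50000 \times 100$). The function `discexp_learning_rate` is a hypothetical helper, and the sketch assumes that $n$ accumulates across passes over the 50000 CIFAR-10 training images.

```python
def discexp_learning_rate(lr0, a, b, n):
    """lr = lr0 * a ** floor(n / b), with n the number of processed samples."""
    return lr0 * a ** (n // b)


lr0 = 0.1 / 128.0
a = 0.1
b = 50000 * 100          # one decay step per 100 passes over 50000 images
samples_per_pass = 50000

for completed_passes in (0, 99, 100, 200):
    n = completed_passes * samples_per_pass
    lr = discexp_learning_rate(lr0, a, b, n)
    print("after %3d passes: lr = %g" % (completed_passes, lr))
```

Under these assumptions the learning rate stays constant for the first 100 passes and is then multiplied by 0.1 after every further 100 passes.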
<p align="center"> ### Training
<img src="image/plot_en.png" width="400" ><br/>
Figure 12. The error rate of VGG model on CIFAR10
</p>
## Model Application `cifar.train10()` yields one record at a time during each pass; after shuffling, the records are grouped into batches for training.
After training is done, the model from each pass is saved in `output/pass-%05d`. For example, the model of Pass 300 is saved in `output/pass-00299`. The script `classify.py` can be used to extract features and to classify an image. The default config file of this script is `models/vgg.py`. ```python
reader=paddle.batch(
paddle.reader.shuffle(
paddle.dataset.cifar.train10(), buf_size=50000),
batch_size=128)
```
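The reader composition above can be pictured with a small plain-Python sketch. The helpers `toy_shuffle`, `toy_batch`, and `toy_reader` below are simplified, hypothetical stand-ins written for illustration; they are not PaddlePaddle's implementation, and they assume that shuffling happens within a buffer of `buf_size` records.

```python
import random


def toy_shuffle(reader, buf_size):
    """Wrap a reader so that records are shuffled within buffers of buf_size."""
    def shuffled_reader():
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush the last, possibly partial, buffer.
        random.shuffle(buf)
        for item in buf:
            yield item
    return shuffled_reader


def toy_batch(reader, batch_size):
    """Wrap a reader so that it yields lists of up to batch_size records."""
    def batch_reader():
        batch = []
        for record in reader():
            batch.append(record)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    return batch_reader


def toy_reader():
    """Yield 10 fake (image, label) records, one at a time."""
    for i in range(10):
        yield ([float(i)] * 4, i % 3)


batched = toy_batch(toy_shuffle(toy_reader, buf_size=10), batch_size=4)
batches = list(batched())
print([len(b) for b in batches])
```

Each reader is a function that returns a generator, so wrappers can be nested in the same way as `paddle.batch(paddle.reader.shuffle(...))` in the snippet above.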
`feeding` specifies the correspondence between each yielded record and `paddle.layer.data`. For instance,
the first column of the data generated by `cifar.train10()` corresponds to the `image` layer.
```python
feeding={'image': 0,
'label': 1}
```
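A minimal sketch of what this mapping expresses, using a fake record in plain Python: each layer name is paired with the column index of the record that feeds it. The helper `split_record` is hypothetical and only illustrates the idea; it is not how PaddlePaddle consumes `feeding` internally.

```python
feeding = {'image': 0, 'label': 1}

# A fake record shaped like one CIFAR10 sample: (flattened image, label).
record = ([0.5] * (3 * 32 * 32), 7)


def split_record(record, feeding):
    """Map each data layer name to the record column assigned to it."""
    return dict((name, record[index]) for name, index in feeding.items())


fed = split_record(record, feeding)
print("image size: %d, label: %d" % (len(fed['image']), fed['label']))
```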
### Prediction The callback function `event_handler` is called during training whenever a pre-defined event occurs.
We can run the following script to predict the category of an image. The default device is GPU; to use CPU, set `-c`.
```bash ```python
python classify.py --job=predict --model=output/pass-00299 --data=image/dog.png # -c # event handler to track training and testing process
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=paddle.batch(
paddle.dataset.cifar.test10(), batch_size=128),
feeding=feeding)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
``` ```
Here is the result: Finally, we can invoke `trainer.train` to start training:
```text ```python
Label of image/dog.png is: 5 trainer.train(
reader=reader,
num_passes=200,
event_handler=event_handler,
feeding=feeding)
``` ```
### Feature Extraction Here is an example log after training for one pass. The logged error rate is 0.6875 on the training set (at Batch 300) and 0.8852 on the test set.
We can run the following command to extract features from an image. Here `job` should be `extract` and the default layer is the first convolutional layer. Figure 13 shows the 64 feature maps output from the first convolutional layer of the VGG model. ```text
Pass 0, Batch 0, Cost 2.473182, {'classification_error_evaluator': 0.9140625}
```bash ...................................................................................................
python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png # -c Pass 0, Batch 100, Cost 1.913076, {'classification_error_evaluator': 0.78125}
...................................................................................................
Pass 0, Batch 200, Cost 1.783041, {'classification_error_evaluator': 0.7421875}
...................................................................................................
Pass 0, Batch 300, Cost 1.668833, {'classification_error_evaluator': 0.6875}
..........................................................................................
Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
``` ```
Figure 12 shows the training error rate curve, which indicates that training converges by Pass 200 with an error rate of 8.54%.
<p align="center"> <p align="center">
<img src="image/fea_conv0.png" width="500"><br/> <img src="image/plot_en.png" width="400" ><br/>
Figure 13. Visualization of convolution layer feature maps Figure 12. The error rate of VGG model on CIFAR10
</p> </p>
After training is done, the model from each pass is saved in `output/pass-%05d`. For example, the model of Pass 300 is saved in `output/pass-00299`.
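The directory naming can be reproduced with ordinary string formatting. This is a small sketch that assumes pass ids are counted from 0, which is consistent with the example of Pass 300 being stored as `pass-00299`. The helper `pass_dir` is hypothetical.

```python
def pass_dir(pass_id, root="output"):
    """Build the directory name for a zero-based pass id, padded to five digits."""
    return "%s/pass-%05d" % (root, pass_id)


first_dir = pass_dir(0)
last_dir = pass_dir(299)  # the 300th pass under the zero-based assumption
```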
## Conclusion ## Conclusion
Traditional image classification methods involve multiple stages of processing, and the resulting framework is very complicated. In contrast, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduce three models (VGG, GoogleNet, and ResNet), provide PaddlePaddle config files for training VGG and ResNet on CIFAR10, and explain how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedure for configuration and training is the same, and you are welcome to give it a try. Traditional image classification methods involve multiple stages of processing, and the resulting framework is very complicated. In contrast, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduce three models (VGG, GoogleNet, and ResNet), provide PaddlePaddle config files for training VGG and ResNet on CIFAR10, and explain how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedure for configuration and training is the same, and you are welcome to give it a try.
...@@ -589,7 +530,7 @@ Traditional image classification methods involve multiple stages of processing a ...@@ -589,7 +530,7 @@ Traditional image classification methods involve multiple stages of processing a
[22] http://cs231n.github.io/classification/ [22] http://cs231n.github.io/classification/
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -294,7 +294,7 @@ paddle.init(use_gpu=False, trainer_count=1) ...@@ -294,7 +294,7 @@ paddle.init(use_gpu=False, trainer_count=1)
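Steps 1, 3, and 4 of the ResNet model are the same as for the VGG model and are not repeated here. We mainly introduce step 2, the core ResNet module for the CIFAR10 dataset. Steps 1, 3, and 4 of the ResNet model are the same as for the VGG model and are not repeated here. We mainly introduce step 2, the core ResNet module for the CIFAR10 dataset.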
```python ```python
net = resnet_cifar10(data, depth=56) net = resnet_cifar10(image, depth=56)
``` ```
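We first introduce some basic functions in `resnet_cifar10`, and then describe how the network is connected. We first introduce some basic functions in `resnet_cifar10`, and then describe how the network is connected.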
...@@ -417,7 +417,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$ ...@@ -417,7 +417,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
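cifar.train10() yields one sample at a time; after shuffling and batching, the samples serve as the training input. cifar.train10() yields one sample at a time; after shuffling and batching, the samples serve as the training input.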
```python ```python
reader=paddle.reader.batch( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.cifar.train10(), buf_size=50000), paddle.dataset.cifar.train10(), buf_size=50000),
batch_size=128) batch_size=128)
...@@ -444,10 +444,9 @@ def event_handler(event): ...@@ -444,10 +444,9 @@ def event_handler(event):
sys.stdout.flush() sys.stdout.flush()
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test( result = trainer.test(
reader=paddle.reader.batch( reader=paddle.batch(
paddle.dataset.cifar.test10(), batch_size=128), paddle.dataset.cifar.test10(), batch_size=128),
reader_dict={'image': 0, feeding=feeding)
'label': 1})
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
``` ```
......
...@@ -36,9 +36,8 @@ def main(): ...@@ -36,9 +36,8 @@ def main():
# option 2. vgg # option 2. vgg
net = vgg_bn_drop(image) net = vgg_bn_drop(image)
out = paddle.layer.fc(input=net, out = paddle.layer.fc(
size=classdim, input=net, size=classdim, act=paddle.activation.Softmax())
act=paddle.activation.Softmax())
lbl = paddle.layer.data( lbl = paddle.layer.data(
name="label", type=paddle.data_type.integer_value(classdim)) name="label", type=paddle.data_type.integer_value(classdim))
...@@ -75,9 +74,8 @@ def main(): ...@@ -75,9 +74,8 @@ def main():
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
# Create trainer # Create trainer
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(
parameters=parameters, cost=cost, parameters=parameters, update_equation=momentum_optimizer)
update_equation=momentum_optimizer)
trainer.train( trainer.train(
reader=paddle.batch( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
......
<html>
<head>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'] ],
processEscapes: true
},
"HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js" async></script>
<script type="text/javascript" src="../.tmpl/marked.js">
</script>
<link href="http://cdn.bootcss.com/highlight.js/9.9.0/styles/darcula.min.css" rel="stylesheet">
<script src="http://cdn.bootcss.com/highlight.js/9.9.0/highlight.min.js"></script>
<link href="http://cdn.bootcss.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" rel="stylesheet">
<link href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" rel="stylesheet">
<link href="../.tmpl/github-markdown.css" rel='stylesheet'>
</head>
<style type="text/css" >
.markdown-body {
box-sizing: border-box;
min-width: 200px;
max-width: 980px;
margin: 0 auto;
padding: 45px;
}
</style>
<body>
<div id="context" class="container markdown-body">
</div>
<!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'>
# Deep Learning with PaddlePaddle
1. [Fit a Line](http://book.paddlepaddle.org/fit_a_line/index.en.html)
1. [Recognize Digits](http://book.paddlepaddle.org/recognize_digits/index.en.html)
1. [Image Classification](http://book.paddlepaddle.org/image_classification/index.en.html)
1. [Word to Vector](http://book.paddlepaddle.org/word2vec/index.en.html)
1. [Understand Sentiment](http://book.paddlepaddle.org/understand_sentiment/index.en.html)
1. [Label Semantic Roles](http://book.paddlepaddle.org/label_semantic_roles/index.en.html)
1. [Machine Translation](http://book.paddlepaddle.org/machine_translation/index.en.html)
1. [Recommender System](http://book.paddlepaddle.org/recommender_system/index.en.html)
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div>
<!-- You can change the lines below now. -->
<script type="text/javascript">
marked.setOptions({
renderer: new marked.Renderer(),
gfm: true,
breaks: false,
smartypants: true,
highlight: function(code, lang) {
code = code.replace(/&amp;/g, "&")
code = code.replace(/&gt;/g, ">")
code = code.replace(/&lt;/g, "<")
code = code.replace(/&nbsp;/g, " ")
return hljs.highlightAuto(code, [lang]).value;
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
</script>
</body>
...@@ -42,16 +42,16 @@ ...@@ -42,16 +42,16 @@
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Introduction to Deep Learning # Introduction to Deep Learning
1. Getting Started [[fit_a_line](fit_a_line/)] [[html](http://book.paddlepaddle.org/fit_a_line)] 1. [Getting Started](http://book.paddlepaddle.org/fit_a_line)
1. Recognize Digits [[recognize_digits](recognize_digits/)] [[html](http://book.paddlepaddle.org/recognize_digits)] 1. [Recognize Digits](http://book.paddlepaddle.org/recognize_digits)
1. Image Classification [[image_classification](image_classification/)] [[html](http://book.paddlepaddle.org/image_classification)] 1. [Image Classification](http://book.paddlepaddle.org/image_classification)
1. Word Vectors [[word2vec](word2vec/)] [[html](http://book.paddlepaddle.org/word2vec)] 1. [Word Vectors](http://book.paddlepaddle.org/word2vec)
1. Sentiment Analysis [[understand_sentiment](understand_sentiment/)] [[html](http://book.paddlepaddle.org/understand_sentiment)] 1. [Sentiment Analysis](http://book.paddlepaddle.org/understand_sentiment)
1. Semantic Role Labeling [[label_semantic_roles](label_semantic_roles/)] [[html](http://book.paddlepaddle.org/label_semantic_roles)] 1. [Semantic Role Labeling](http://book.paddlepaddle.org/label_semantic_roles)
1. Machine Translation [[machine_translation](machine_translation/)] [[html](http://book.paddlepaddle.org/machine_translation)] 1. [Machine Translation](http://book.paddlepaddle.org/machine_translation)
1. Personalized Recommendation [[recommender_system](recommender_system/)] [[html](http://book.paddlepaddle.org/recommender_system)] 1. [Personalized Recommendation](http://book.paddlepaddle.org/recommender_system)
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
......
...@@ -134,7 +134,7 @@ After modification, the model is as follows: ...@@ -134,7 +134,7 @@ After modification, the model is as follows:
<div align="center"> <div align="center">
<img src="image/db_lstm_en.png" width = "60%" align=center /><br> <img src="image/db_lstm_network_en.png" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks Fig 6. DB-LSTM for SRL tasks
</div> </div>
...@@ -200,6 +200,8 @@ import numpy as np ...@@ -200,6 +200,8 @@ import numpy as np
import paddle.v2 as paddle import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05 import paddle.v2.dataset.conll05 as conll05
paddle.init(use_gpu=False, trainer_count=1)
word_dict, verb_dict, label_dict = conll05.get_dict() word_dict, verb_dict, label_dict = conll05.get_dict()
word_dict_len = len(word_dict) word_dict_len = len(word_dict)
label_dict_len = len(label_dict) label_dict_len = len(label_dict)
...@@ -212,7 +214,7 @@ print pred_len ...@@ -212,7 +214,7 @@ print pred_len
## Model configuration ## Model configuration
- 1. Define input data dimensions and model hyperparameters. - Define input data dimensions and model hyperparameters.
```python ```python
mark_dict_len = 2 # Value range of region mark. Region mark is either 0 or 1, so range is 2 mark_dict_len = 2 # Value range of region mark. Region mark is either 0 or 1, so range is 2
...@@ -247,7 +249,7 @@ target = paddle.layer.data(name='target', type=d_type(label_dict_len)) ...@@ -247,7 +249,7 @@ target = paddle.layer.data(name='target', type=d_type(label_dict_len))
Special note: hidden_dim = 512 means the LSTM hidden vector has 128 dimensions (512/4). Please refer to the official PaddlePaddle documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) Special note: hidden_dim = 512 means the LSTM hidden vector has 128 dimensions (512/4). Please refer to the official PaddlePaddle documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory)
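The 512/4 arithmetic can be written out as a tiny check. This sketch assumes the factor of 4 comes from the four per-unit projections of an LSTM (input gate, forget gate, output gate, and cell candidate) being concatenated in the layer's input. The helper `lstm_hidden_size` is hypothetical and not a PaddlePaddle function.

```python
# Assumption: four projections per LSTM unit are concatenated in the input.
GATES_PER_LSTM_UNIT = 4


def lstm_hidden_size(projection_dim):
    """Return the LSTM hidden size implied by the width of the input projection."""
    if projection_dim % GATES_PER_LSTM_UNIT != 0:
        raise ValueError("projection_dim must be divisible by 4")
    return projection_dim // GATES_PER_LSTM_UNIT


hidden_dim = 512
lstm_size = lstm_hidden_size(hidden_dim)  # 512 / 4
```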
- 2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences. - The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences.
```python ```python
...@@ -276,7 +278,7 @@ emb_layers.append(predicate_embedding) ...@@ -276,7 +278,7 @@ emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding) emb_layers.append(mark_embedding)
``` ```
- 3. 8 LSTM units will be trained in "forward / backward" order. - 8 LSTM units will be trained in "forward / backward" order.
```python ```python
hidden_0 = paddle.layer.mixed( hidden_0 = paddle.layer.mixed(
...@@ -326,7 +328,7 @@ for i in range(1, depth): ...@@ -326,7 +328,7 @@ for i in range(1, depth):
input_tmp = [mix_hidden, lstm] input_tmp = [mix_hidden, lstm]
``` ```
- 4. We will concatenate the output of the top LSTM unit with its input, and project it into a hidden layer. Then we put a fully connected layer on top of it to get the final vector representation. - We will concatenate the output of the top LSTM unit with its input, and project it into a hidden layer. Then we put a fully connected layer on top of it to get the final vector representation.
```python ```python
feature_out = paddle.layer.mixed( feature_out = paddle.layer.mixed(
...@@ -340,7 +342,7 @@ for i in range(1, depth): ...@@ -340,7 +342,7 @@ for i in range(1, depth):
], ) ], )
``` ```
- 5. We use CRF as the cost function; the parameter of the CRF cost is named `crfw`. - We use CRF as the cost function; the parameter of the CRF cost is named `crfw`.
```python ```python
crf_cost = paddle.layer.crf( crf_cost = paddle.layer.crf(
...@@ -353,7 +355,7 @@ crf_cost = paddle.layer.crf( ...@@ -353,7 +355,7 @@ crf_cost = paddle.layer.crf(
learning_rate=mix_hidden_lr)) learning_rate=mix_hidden_lr))
``` ```
- 6. The CRF decoding layer is used for evaluation and inference. It shares parameters with the CRF layer. Parameters are shared among multiple layers by giving them the same parameter name in these layers. - The CRF decoding layer is used for evaluation and inference. It shares parameters with the CRF layer. Parameters are shared among multiple layers by giving them the same parameter name in these layers.
```python ```python
crf_dec = paddle.layer.crf_decoding( crf_dec = paddle.layer.crf_decoding(
...@@ -470,4 +472,4 @@ Semantic Role Labeling is an important intermediate step in a wide range of natu ...@@ -470,4 +472,4 @@ Semantic Role Labeling is an important intermediate step in a wide range of natu
10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
...@@ -206,7 +206,7 @@ print pred_len ...@@ -206,7 +206,7 @@ print pred_len
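## Model Configuration ## Model Configuration
- 1. Define the input data dimensions and model hyperparameters. - Define the input data dimensions and model hyperparameters.
```python ```python
mark_dict_len = 2 # Dimension of the predicate-context region mark; it is a binary (0/1) feature, so the dimension is 2 mark_dict_len = 2 # Dimension of the predicate-context region mark; it is a binary (0/1) feature, so the dimension is 2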
...@@ -240,7 +240,7 @@ target = paddle.layer.data(name='target', type=d_type(label_dict_len)) ...@@ -240,7 +240,7 @@ target = paddle.layer.data(name='target', type=d_type(label_dict_len))
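Note in particular that hidden_dim = 512 specifies an LSTM hidden vector of 128 dimensions; for details, please refer to the description of [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) in the official PaddlePaddle documentation. Note in particular that hidden_dim = 512 specifies an LSTM hidden vector of 128 dimensions; for details, please refer to the description of [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) in the official PaddlePaddle documentation.
- 2. Using the vocabularies, convert the sentence sequence, predicate, predicate context, and predicate-context region mark into sequences of real-valued word embedding vectors. - Using the vocabularies, convert the sentence sequence, predicate, predicate context, and predicate-context region mark into sequences of real-valued word embedding vectors.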
```python ```python
...@@ -269,7 +269,7 @@ emb_layers.append(predicate_embedding) ...@@ -269,7 +269,7 @@ emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding) emb_layers.append(mark_embedding)
``` ```
- 3. Eight LSTM units learn from all input sequences in "forward/backward" order. - Eight LSTM units learn from all input sequences in "forward/backward" order.
```python ```python
hidden_0 = paddle.layer.mixed( hidden_0 = paddle.layer.mixed(
...@@ -319,7 +319,7 @@ for i in range(1, depth): ...@@ -319,7 +319,7 @@ for i in range(1, depth):
input_tmp = [mix_hidden, lstm] input_tmp = [mix_hidden, lstm]
``` ```
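- 4. Take the output of the last stacked LSTM together with that LSTM unit's input-to-hidden projection, and map them through a fully connected layer to the dimension of the label dictionary to obtain the final feature vector representation. - Take the output of the last stacked LSTM together with that LSTM unit's input-to-hidden projection, and map them through a fully connected layer to the dimension of the label dictionary to obtain the final feature vector representation.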
```python ```python
feature_out = paddle.layer.mixed( feature_out = paddle.layer.mixed(
...@@ -333,7 +333,7 @@ input=[ ...@@ -333,7 +333,7 @@ input=[
], ) ], )
``` ```
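- 5. At the end of the network, define a CRF layer to compute the cost, with its parameter named `crfw`; this layer requires the correct data labels (target) as input. - At the end of the network, define a CRF layer to compute the cost, with its parameter named `crfw`; this layer requires the correct data labels (target) as input.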
```python ```python
crf_cost = paddle.layer.crf( crf_cost = paddle.layer.crf(
...@@ -346,7 +346,7 @@ crf_cost = paddle.layer.crf( ...@@ -346,7 +346,7 @@ crf_cost = paddle.layer.crf(
learning_rate=mix_hidden_lr)) learning_rate=mix_hidden_lr))
``` ```
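- 6. The CRF decoding layer uses the same parameter name as the CRF layer, so the two share weights. If the correct data labels (target) are given as input, the layer counts the number of incorrect labels, which can be used to evaluate the model. If the correct labels are not given, the layer infers the optimal solution, which can be used for prediction. - The CRF decoding layer uses the same parameter name as the CRF layer, so the two share weights. If the correct data labels (target) are given as input, the layer counts the number of incorrect labels, which can be used to evaluate the model. If the correct labels are not given, the layer infers the optimal solution, which can be used for prediction.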
```python ```python
crf_dec = paddle.layer.crf_decoding( crf_dec = paddle.layer.crf_decoding(
......
...@@ -75,8 +75,7 @@ settings( ...@@ -75,8 +75,7 @@ settings(
learning_method=MomentumOptimizer(momentum=0), learning_method=MomentumOptimizer(momentum=0),
learning_rate=2e-2, learning_rate=2e-2,
regularization=L2Regularization(8e-4), regularization=L2Regularization(8e-4),
model_average=ModelAverage( model_average=ModelAverage(average_window=0.5, max_average_window=10000), )
average_window=0.5, max_average_window=10000), )
####################################### network ############################## ####################################### network ##############################
#8 features and 1 target #8 features and 1 target
...@@ -102,13 +101,12 @@ std_default = ParameterAttribute(initial_std=default_std) ...@@ -102,13 +101,12 @@ std_default = ParameterAttribute(initial_std=default_std)
predicate_embedding = embedding_layer( predicate_embedding = embedding_layer(
size=word_dim, size=word_dim,
input=predicate, input=predicate,
param_attr=ParameterAttribute( param_attr=ParameterAttribute(name='vemb', initial_std=default_std))
name='vemb', initial_std=default_std))
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2] word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [ emb_layers = [
embedding_layer( embedding_layer(size=word_dim, input=x, param_attr=emb_para)
size=word_dim, input=x, param_attr=emb_para) for x in word_input for x in word_input
] ]
emb_layers.append(predicate_embedding) emb_layers.append(predicate_embedding)
mark_embedding = embedding_layer( mark_embedding = embedding_layer(
...@@ -120,8 +118,8 @@ hidden_0 = mixed_layer( ...@@ -120,8 +118,8 @@ hidden_0 = mixed_layer(
size=hidden_dim, size=hidden_dim,
bias_attr=std_default, bias_attr=std_default,
input=[ input=[
full_matrix_projection( full_matrix_projection(input=emb, param_attr=std_default)
input=emb, param_attr=std_default) for emb in emb_layers for emb in emb_layers
]) ])
mix_hidden_lr = 1e-3 mix_hidden_lr = 1e-3
...@@ -171,10 +169,8 @@ feature_out = mixed_layer( ...@@ -171,10 +169,8 @@ feature_out = mixed_layer(
size=label_dict_len, size=label_dict_len,
bias_attr=std_default, bias_attr=std_default,
input=[ input=[
full_matrix_projection( full_matrix_projection(input=input_tmp[0], param_attr=hidden_para_attr),
input=input_tmp[0], param_attr=hidden_para_attr), full_matrix_projection(input=input_tmp[1], param_attr=lstm_para_attr)
full_matrix_projection(
input=input_tmp[1], param_attr=lstm_para_attr)
], ) ], )
if not is_predict: if not is_predict:
......
...@@ -176,7 +176,7 @@ After modification, the model is as follows: ...@@ -176,7 +176,7 @@ After modification, the model is as follows:
<div align="center"> <div align="center">
<img src="image/db_lstm_en.png" width = "60%" align=center /><br> <img src="image/db_lstm_network_en.png" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks Fig 6. DB-LSTM for SRL tasks
</div> </div>
...@@ -242,6 +242,8 @@ import numpy as np ...@@ -242,6 +242,8 @@ import numpy as np
import paddle.v2 as paddle import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05 import paddle.v2.dataset.conll05 as conll05
paddle.init(use_gpu=False, trainer_count=1)
word_dict, verb_dict, label_dict = conll05.get_dict() word_dict, verb_dict, label_dict = conll05.get_dict()
word_dict_len = len(word_dict) word_dict_len = len(word_dict)
label_dict_len = len(label_dict) label_dict_len = len(label_dict)
...@@ -254,7 +256,7 @@ print pred_len ...@@ -254,7 +256,7 @@ print pred_len
## Model configuration ## Model configuration
- 1. Define input data dimensions and model hyperparameters. - Define input data dimensions and model hyperparameters.
```python ```python
mark_dict_len = 2 # Value range of region mark. Region mark is either 0 or 1, so range is 2 mark_dict_len = 2 # Value range of region mark. Region mark is either 0 or 1, so range is 2
...@@ -289,7 +291,7 @@ target = paddle.layer.data(name='target', type=d_type(label_dict_len)) ...@@ -289,7 +291,7 @@ target = paddle.layer.data(name='target', type=d_type(label_dict_len))
Special note: hidden_dim = 512 means the LSTM hidden vector has 128 dimensions (512/4). Please refer to the official PaddlePaddle documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory). Special note: hidden_dim = 512 means the LSTM hidden vector has 128 dimensions (512/4). Please refer to the official PaddlePaddle documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory).
- 2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences. - The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences.
```python ```python
...@@ -318,7 +320,7 @@ emb_layers.append(predicate_embedding) ...@@ -318,7 +320,7 @@ emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding) emb_layers.append(mark_embedding)
``` ```
- 3. 8 LSTM units will be trained in "forward / backward" order. - 8 LSTM units will be trained in "forward / backward" order.
```python ```python
hidden_0 = paddle.layer.mixed( hidden_0 = paddle.layer.mixed(
...@@ -368,7 +370,7 @@ for i in range(1, depth): ...@@ -368,7 +370,7 @@ for i in range(1, depth):
input_tmp = [mix_hidden, lstm] input_tmp = [mix_hidden, lstm]
``` ```
- 4. We will concatenate the output of the top LSTM unit with its input, and project it into a hidden layer. Then we put a fully connected layer on top of it to get the final vector representation. - We will concatenate the output of the top LSTM unit with its input, and project it into a hidden layer. Then we put a fully connected layer on top of it to get the final vector representation.
```python ```python
feature_out = paddle.layer.mixed( feature_out = paddle.layer.mixed(
...@@ -382,7 +384,7 @@ for i in range(1, depth): ...@@ -382,7 +384,7 @@ for i in range(1, depth):
], ) ], )
``` ```
- 5. We use CRF as the cost function; the parameter of the CRF cost is named `crfw`. - We use CRF as the cost function; the parameter of the CRF cost is named `crfw`.
```python ```python
crf_cost = paddle.layer.crf( crf_cost = paddle.layer.crf(
...@@ -395,7 +397,7 @@ crf_cost = paddle.layer.crf( ...@@ -395,7 +397,7 @@ crf_cost = paddle.layer.crf(
learning_rate=mix_hidden_lr)) learning_rate=mix_hidden_lr))
``` ```
- 6. The CRF decoding layer is used for evaluation and inference. It shares parameters with the CRF layer. Parameters are shared among multiple layers by giving them the same parameter name in these layers. - The CRF decoding layer is used for evaluation and inference. It shares parameters with the CRF layer. Parameters are shared among multiple layers by giving them the same parameter name in these layers.
```python ```python
crf_dec = paddle.layer.crf_decoding( crf_dec = paddle.layer.crf_decoding(
...@@ -512,7 +514,7 @@ Semantic Role Labeling is an important intermediate step in a wide range of natu ...@@ -512,7 +514,7 @@ Semantic Role Labeling is an important intermediate step in a wide range of natu
10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -248,7 +248,7 @@ print pred_len ...@@ -248,7 +248,7 @@ print pred_len
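## Model Configuration ## Model Configuration
- 1. Define the input data dimensions and model hyperparameters. - Define the input data dimensions and model hyperparameters.
```python ```python
mark_dict_len = 2 # Dimension of the predicate-context region mark; it is a binary (0/1) feature, so the dimension is 2 mark_dict_len = 2 # Dimension of the predicate-context region mark; it is a binary (0/1) feature, so the dimension is 2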
...@@ -282,7 +282,7 @@ target = paddle.layer.data(name='target', type=d_type(label_dict_len)) ...@@ -282,7 +282,7 @@ target = paddle.layer.data(name='target', type=d_type(label_dict_len))
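Note in particular that hidden_dim = 512 specifies an LSTM hidden vector of 128 dimensions; for details, please refer to the description of [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) in the official PaddlePaddle documentation. Note in particular that hidden_dim = 512 specifies an LSTM hidden vector of 128 dimensions; for details, please refer to the description of [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) in the official PaddlePaddle documentation.
- 2. Using the vocabularies, convert the sentence sequence, predicate, predicate context, and predicate-context region mark into sequences of real-valued word embedding vectors. - Using the vocabularies, convert the sentence sequence, predicate, predicate context, and predicate-context region mark into sequences of real-valued word embedding vectors.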
```python ```python
...@@ -311,7 +311,7 @@ emb_layers.append(predicate_embedding) ...@@ -311,7 +311,7 @@ emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding) emb_layers.append(mark_embedding)
``` ```
- 3. Eight LSTM units learn from all input sequences in "forward/backward" order. - Eight LSTM units learn from all input sequences in "forward/backward" order.
```python ```python
hidden_0 = paddle.layer.mixed( hidden_0 = paddle.layer.mixed(
...@@ -361,7 +361,7 @@ for i in range(1, depth): ...@@ -361,7 +361,7 @@ for i in range(1, depth):
input_tmp = [mix_hidden, lstm] input_tmp = [mix_hidden, lstm]
``` ```
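- 4. Take the output of the last stacked LSTM together with that LSTM unit's input-to-hidden projection, and map them through a fully connected layer to the dimension of the label dictionary to obtain the final feature vector representation. - Take the output of the last stacked LSTM together with that LSTM unit's input-to-hidden projection, and map them through a fully connected layer to the dimension of the label dictionary to obtain the final feature vector representation.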
```python ```python
feature_out = paddle.layer.mixed( feature_out = paddle.layer.mixed(
...@@ -375,7 +375,7 @@ input=[ ...@@ -375,7 +375,7 @@ input=[
], ) ], )
``` ```
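- 5. At the end of the network, define a CRF layer to compute the cost, with its parameter named `crfw`; this layer requires the correct data labels (target) as input. - At the end of the network, define a CRF layer to compute the cost, with its parameter named `crfw`; this layer requires the correct data labels (target) as input.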
```python ```python
crf_cost = paddle.layer.crf( crf_cost = paddle.layer.crf(
...@@ -388,7 +388,7 @@ crf_cost = paddle.layer.crf( ...@@ -388,7 +388,7 @@ crf_cost = paddle.layer.crf(
learning_rate=mix_hidden_lr)) learning_rate=mix_hidden_lr))
``` ```
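- 6. The CRF decoding layer uses the same parameter name as the CRF layer, so the two share weights. If the correct data labels (target) are given as input, the layer counts the number of incorrect labels, which can be used to evaluate the model. If the correct labels are not given, the layer infers the optimal solution, which can be used for prediction. - The CRF decoding layer uses the same parameter name as the CRF layer, so the two share weights. If the correct data labels (target) are given as input, the layer counts the number of incorrect labels, which can be used to evaluate the model. If the correct labels are not given, the layer infers the optimal solution, which can be used for prediction.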
```python ```python
crf_dec = paddle.layer.crf_decoding( crf_dec = paddle.layer.crf_decoding(
......
...@@ -40,15 +40,14 @@ def db_lstm(): ...@@ -40,15 +40,14 @@ def db_lstm():
predicate_embedding = paddle.layer.embedding( predicate_embedding = paddle.layer.embedding(
size=word_dim, size=word_dim,
input=predicate, input=predicate,
param_attr=paddle.attr.Param( param_attr=paddle.attr.Param(name='vemb', initial_std=default_std))
name='vemb', initial_std=default_std))
mark_embedding = paddle.layer.embedding( mark_embedding = paddle.layer.embedding(
size=mark_dim, input=mark, param_attr=std_0) size=mark_dim, input=mark, param_attr=std_0)
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2] word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [ emb_layers = [
paddle.layer.embedding( paddle.layer.embedding(size=word_dim, input=x, param_attr=emb_para)
size=word_dim, input=x, param_attr=emb_para) for x in word_input for x in word_input
] ]
emb_layers.append(predicate_embedding) emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding) emb_layers.append(mark_embedding)
...@@ -109,13 +108,12 @@ def db_lstm(): ...@@ -109,13 +108,12 @@ def db_lstm():
input=input_tmp[1], param_attr=lstm_para_attr) input=input_tmp[1], param_attr=lstm_para_attr)
], ) ], )
crf_cost = paddle.layer.crf(size=label_dict_len, crf_cost = paddle.layer.crf(
input=feature_out, size=label_dict_len,
label=target, input=feature_out,
param_attr=paddle.attr.Param( label=target,
name='crfw', param_attr=paddle.attr.Param(
initial_std=default_std, name='crfw', initial_std=default_std, learning_rate=mix_hidden_lr))
learning_rate=mix_hidden_lr))
crf_dec = paddle.layer.crf_decoding( crf_dec = paddle.layer.crf_decoding(
name='crf_dec_l', name='crf_dec_l',
...@@ -151,13 +149,11 @@ def main(): ...@@ -151,13 +149,11 @@ def main():
model_average=paddle.optimizer.ModelAverage( model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000), ) average_window=0.5, max_average_window=10000), )
trainer = paddle.trainer.SGD(cost=crf_cost, trainer = paddle.trainer.SGD(
parameters=parameters, cost=crf_cost, parameters=parameters, update_equation=optimizer)
update_equation=optimizer)
reader = paddle.batch( reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(conll05.test(), buf_size=8192), batch_size=10)
conll05.test(), buf_size=8192), batch_size=10)
feeding = { feeding = {
'word_data': 0, 'word_data': 0,
......
...@@ -185,77 +185,10 @@ Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder ...@@ -185,77 +185,10 @@ Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder
## Data Preparation ## Data Preparation
### Download and Uncompression
This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set. This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set.
Run the following command in Linux to obtain the data:
```bash
cd data
./wmt14_data.sh
```
There are three folders in the downloaded dataset `data/wmt14`:
<p align = "center">
<table>
<tr>
<td>Folder Name</td>
<td>French-English Parallel Corpus</td>
<td>Number of Files</td>
<td>Size of Files</td>
</tr>
<tr>
<td>train</td>
<td>ccb2_pc30.src, ccb2_pc30.trg, etc</td>
<td>12</td>
<td>3.55G</td>
</tr>
<tr>
<td>test</td>
<td>ntst1213.src, ntst1213.trg</td>
<td>2</td>
<td>1636k</td>
</tr>
<tr>
<td>gen</td>
<td>ntst14.src, ntst14.trg</td>
<td>2</td>
<td>864k</td>
</tr>
</table>
</p>
- `XXX.src` is the source file in French and `XXX.trg` is the target file in English. Each row of the file contains one sentence.
- `XXX.src` and `XXX.trg` have the same number of rows, and there is a one-to-one correspondence between the sentences on the same row of the two files.
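The row-by-row correspondence described above can be checked with a few lines of Python. This is only a minimal sketch: `pair_sentences` is a hypothetical helper (not part of the tutorial's scripts), and it takes lists of lines rather than file paths so that it stays self-contained.

```python
def pair_sentences(src_lines, trg_lines):
    """Pair row i of the source corpus with row i of the target corpus.

    Hypothetical helper for illustration; raises if the two corpora
    do not have the same number of rows.
    """
    if len(src_lines) != len(trg_lines):
        raise ValueError("source and target must have the same number of rows")
    # Strip the trailing newline so each pair holds just the sentences.
    return [(s.rstrip("\n"), t.rstrip("\n")) for s, t in zip(src_lines, trg_lines)]


pairs = pair_sentences(["bonjour\n", "merci\n"], ["hello\n", "thanks\n"])
```

With real files, the two lists could come from calling `readlines()` on an `XXX.src` file and its matching `XXX.trg` file.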
### User Defined Dataset (Optional) ### Data Preprocessing
To use your own dataset, just put it under the `data` folder and organize it as follows
```text
user_dataset
├── train
│   ├── train_file1.src
│   ├── train_file1.trg
│   └── ...
├── test
│   ├── test_file1.src
│   ├── test_file1.trg
│   └── ...
├── gen
│   ├── gen_file1.src
│   ├── gen_file1.trg
│   └── ...
```
Explanation of the directories:
- First level: `user_dataset`: the name of the user defined dataset.
- Second level: `train`, `test` and `gen`: these names should not be changed.
- Third level: Parallel corpus in source language and target language, each with a postfix of `.src` and `.trg`.
### Data Pre-processing
There are two steps for pre-processing: There are two steps for pre-processing:
- Merge the source and target parallel corpus files into one file - Merge the source and target parallel corpus files into one file
...@@ -264,245 +197,104 @@ There are two steps for pre-processing: ...@@ -264,245 +197,104 @@ There are two steps for pre-processing:
- Create source dictionary and target dictionary, each containing **DICTSIZE** words: the (DICTSIZE - 3) most frequent words from the corpus and 3 special tokens, `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary). - Create source dictionary and target dictionary, each containing **DICTSIZE** words: the (DICTSIZE - 3) most frequent words from the corpus and 3 special tokens, `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary).
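As an illustration of the dictionary step, the sketch below keeps the (DICTSIZE - 3) most frequent words and reserves three ids for the special tokens. It is not the tutorial's `preprocess.py`: `build_dict` is a hypothetical name, and placing `<s>`, `<e>`, `<unk>` at ids 0, 1, 2 and breaking frequency ties alphabetically are assumptions made here for determinism.

```python
from collections import Counter

# Assumed id layout: <s>=0, <e>=1, <unk>=2.
SPECIAL_TOKENS = ["<s>", "<e>", "<unk>"]


def build_dict(sentences, dict_size):
    """Map words to ids, keeping the (dict_size - 3) most frequent words."""
    counts = Counter(word for line in sentences for word in line.split())
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [word for word, _ in ranked[:max(dict_size - 3, 0)]]
    return {word: idx for idx, word in enumerate(SPECIAL_TOKENS + kept)}


vocab = build_dict(["a b a", "a c b"], dict_size=5)
```

Any word that falls outside the kept set (here `c`) would be mapped to the `<unk>` id when sentences are converted to id sequences.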
`preprocess.py` is used for pre-processing: ### A Subset of Dataset
```bash
python preprocess.py -i INPUT [-d DICTSIZE] [-m]
```
- `-i INPUT`: path to the original dataset.
- `-d DICTSIZE`: number of words in the dictionary. If unspecified, the dictionary will contain all the words that appear in the input dataset.
- `-m --mergeDict`: merge the source dictionary with target dictionary, making the two dictionaries have the same content.
The specific command to run the script is as follows:
```bash
python preprocess.py -i data/wmt14 -d 30000
```
You will see the following messages after a few minutes:
```text
concat parallel corpora for dataset
build source dictionary for train data
build target dictionary for train data
dictionary size is 30000
```
The pre-processed data is located at `data/pre-wmt14`:
```text
pre-wmt14
├── train
│   └── train
├── test
│   └── test
├── gen
│   └── gen
├── train.list
├── test.list
├── gen.list
├── src.dict
└── trg.dict
```
- `train`, `test` and `gen`: contains French-English parallel corpus for training, testing and generation. Each row from each file is separated into two columns with a "\t", where the first column is the sequence in French and the second one is in English.
- `train.list`, `test.list` and `gen.list`: record respectively the path to `train`, `test` and `gen` folders.
- `src.dict` and `trg.dict`: source (French) and target (English) dictionary. Each dictionary contains 30000 words (29997 most frequent words and 3 special tokens).
### Providing Data to PaddlePaddle
We use `dataprovider.py` to provide data to PaddlePaddle as follows:
1. Import PyDataProvider2 package from PaddlePaddle and define three special tokens:
```python Because the full dataset is very large, the PaddlePaddle package `paddle.dataset.wmt14` provides a preprocessed [subset of the dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) to reduce download time.
from paddle.trainer.PyDataProvider2 import *
UNK_IDX = 2 #out of vocabulary word
START = "<s>" #begin of sequence
END = "<e>" #end of sequence
```
2. Use initialization function `hook` to define the input data types (`input_types`) for training and generation:
- Training: there are three input sequences, where "source language sequence" and "target language sequence" are input and the "target language next word sequence" is the label.
- Generation: there are two input sequences, where the "source language sequence" is the input and “source language sequence id” are the ids for the input data (optional).
`src_dict_path` in the `hook` function is the path to the source language dictionary, while `trg_dict_path` the path to target language dictionary. `is_generating` is passed from model config file. For more details on the usage of the `hook` function please refer to [Model Config](#Model Config). This subset has 193319 instances of training data and 6003 instances of test data. Dictionary size is 30000. Because of the limitation of size of the subset, the effectiveness of trained model from this subset is not guaranteed.
```python ## Training Instructions
def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list,
**kwargs):
# job_mode = 1: training 0: generation
settings.job_mode = not is_generating
def fun(dict_path): # load dictionary according to the path
out_dict = dict()
with open(dict_path, "r") as fin:
out_dict = {
line.strip(): line_count
for line_count, line in enumerate(fin)
}
return out_dict
settings.src_dict = fun(src_dict_path)
settings.trg_dict = fun(trg_dict_path)
if settings.job_mode: #training
settings.input_types = {
'source_language_word': #source language sequence
integer_value_sequence(len(settings.src_dict)),
'target_language_word': #target language sequence
integer_value_sequence(len(settings.trg_dict)),
'target_language_next_word': #target language next word sequence
integer_value_sequence(len(settings.trg_dict))
}
else: #generation
settings.input_types = {
'source_language_word': #source language sequence
integer_value_sequence(len(settings.src_dict)),
'sent_id': #source language sequence id
integer_value_sequence(len(open(file_list[0], "r").readlines()))
}
```
3. Use `process` function to open the file `file_name`, read each row of the file, convert the data to be compatible with `input_types`, and then use `yield` to return to PaddlePaddle process. More specifically
- add `<s>` to the beginning of each source language sequence and add `<e>` to the end, producing "source_language_word". ### Initialize PaddlePaddle
- add `<s>` to the beginning of each target language sequence, producing "target_language_word".
- add `<e>` to the end of each target language sequence, producing "target_language_next_word".
```python ```python
def _get_ids(s, dictionary): # get the location of each word from the source language sequence in the dictionary import sys
words = s.strip().split() import paddle.v2 as paddle
return [dictionary[START]] + \
[dictionary.get(w, UNK_IDX) for w in words] + \
[dictionary[END]]
@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
with open(file_name, 'r') as f:
for line_count, line in enumerate(f):
line_split = line.strip().split('\t')
if settings.job_mode and len(line_split) != 2:
continue
src_seq = line_split[0]
src_ids = _get_ids(src_seq, settings.src_dict)
if settings.job_mode:
trg_seq = line_split[1]
trg_words = trg_seq.split()
trg_ids = [settings.trg_dict.get(w, UNK_IDX) for w in trg_words]
# sequences longer than 80 will be removed during training to avoid an overly deep RNN.
if len(src_ids) > 80 or len(trg_ids) > 80:
continue
trg_ids_next = trg_ids + [settings.trg_dict[END]]
trg_ids = [settings.trg_dict[START]] + trg_ids
yield {
'source_language_word': src_ids,
'target_language_word': trg_ids,
'target_language_next_word': trg_ids_next
}
else:
yield {'source_language_word': src_ids, 'sent_id': [line_count]}
```
Note: The size of the training data is 3.55G. For machines with limited memories, it is recommended to use `pool_size` to set the number of data samples stored in memory.
## Model Config
### Data Definition
1. Specify the path to data and source/target dictionaries. `is_generating` accepts an argument passed from the command line and denotes whether the current configuration is for training (default) or generation. See [Usage and Results](#Usage and Results).
```python
import os
from paddle.trainer_config_helpers import *
data_dir = "./data/pre-wmt14" # data path # train with a single CPU
src_lang_dict = os.path.join(data_dir, 'src.dict') # path to the source language dictionary paddle.init(use_gpu=False, trainer_count=1)
trg_lang_dict = os.path.join(data_dir, 'trg.dict') # path to the target language dictionary ```
is_generating = get_config_arg("is_generating", bool, False) # config mode
```
2. Use `define_py_data_sources2` to get data from `dataprovider.py`, and use the `args` variable to pass in the source/target language dictionary paths and the config mode.
```python ### Define DataSet
if not is_generating:
train_list = os.path.join(data_dir, 'train.list')
test_list = os.path.join(data_dir, 'test.list')
else:
train_list = None
test_list = os.path.join(data_dir, 'gen.list')
define_py_data_sources2(
train_list,
test_list,
module="dataprovider",
obj="process",
args={
"src_dict_path": src_lang_dict, # source language dictionary path
"trg_dict_path": trg_lang_dict, # target language dictionary path
"is_generating": is_generating # config mode
})
```
### Algorithm Configuration We will define dictionary size, and create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
```python ```python
settings( # source and target dict dim.
learning_method = AdamOptimizer(), dict_size = 30000
batch_size = 50,
learning_rate = 5e-4) feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
``` ```
This tutorial will use the default SGD and Adam learning algorithm, with a learning rate of 5e-4. Note that `batch_size = 50` means each batch contains 50 sequences.
### Model Structure ### Model Configuration
1. Define some global variables 1. Define some global variables
```python ```python
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary source_dict_dim = dict_size # source language dictionary size
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary target_dict_dim = dict_size # destination language dictionary size
word_vector_dim = 512 # dimensionality of word vector word_vector_dim = 512 # word embedding dimension
encoder_size = 512 # dimensionality of the hidden state of encoder GRU encoder_size = 512 # hidden layer size of GRU in encoder
decoder_size = 512 # dimensionality of the hidden state of decoder GRU decoder_size = 512 # hidden layer size of GRU in decoder
if is_generating:
beam_size=3 # beam size for the beam search algorithm
max_length=250 # maximum length for the generated sentence
gen_trans_file = get_config_arg("gen_trans_file", str, None) # generate file
``` ```
2. Implement Encoder as follows: 1. Implement Encoder as follows:
1. Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
2.1 Input one-hot vector representations $\mathbf{w}$ converted with `dataprovider.py` from the source language sentence
```python ```python
src_word_id = data_layer(name='source_language_word', size=source_dict_dim) src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
2.2 Map the one-hot vector into a word vector $\mathbf{s}$ in a low-dimensional semantic space
1. Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
```python ```python
src_embedding = embedding_layer( src_embedding = paddle.layer.embedding(
input=src_word_id, input=src_word_id,
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
2.3 Use bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
1. Use bi-directional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
```python ```python
src_forward = simple_gru(input=src_embedding, size=encoder_size) src_forward = paddle.networks.simple_gru(
src_backward = simple_gru( input=src_embedding, size=encoder_size)
input=src_embedding, size=encoder_size, reverse=True) src_backward = paddle.networks.simple_gru(
encoded_vector = concat_layer(input=[src_forward, src_backward]) input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
3. Implement Attention-based Decoder as follows: 1. Implement Attention-based Decoder as follows:
3.1 Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network 1. Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
```python ```python
with mixed_layer(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += full_matrix_projection(input=encoded_vector) encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
``` ```
3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
1. Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
```python ```python
backward_first = first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
with mixed_layer( with paddle.layer.mixed(
size=decoder_size, size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
act=TanhActivation(), ) as decoder_boot: decoder_boot += paddle.layer.full_matrix_projection(
decoder_boot += full_matrix_projection(input=backward_first) input=backward_first)
``` ```
3.3 Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
1. Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
- decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot. - decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot.
- context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is the encoded sequence $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`. - context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is the encoded sequence $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`.
...@@ -511,181 +303,148 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn ...@@ -511,181 +303,148 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
- Softmax normalization is used in the end to compute the probability of words, i.e., $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned. - Softmax normalization is used in the end to compute the probability of words, i.e., $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned.
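To make the two formulas in this list concrete, here is a small pure-Python sketch of softmax normalization and of the weighted sum $c_i=\sum_j a_{ij}h_j$. It does not reproduce what `simple_attention` does internally; the function names `softmax` and `context_vector` and the toy scores are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, encoder_states):
    """Weighted sum of encoder states h_j using attention weights a_ij."""
    dim = len(encoder_states[0])
    return [
        sum(a * h[k] for a, h in zip(weights, encoder_states))
        for k in range(dim)
    ]


# Two source positions with equal alignment scores get equal weights.
weights = softmax([0.0, 0.0])
context = context_vector(weights, [[1.0, 2.0], [3.0, 4.0]])
```

The same `softmax` shape applies to the output layer, where the scores would be $W_sz_i+b_z$ over the target dictionary.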
```python ```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word): def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
decoder_mem = memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = simple_attention(
encoded_sequence=enc_vec, context = paddle.networks.simple_attention(
encoded_proj=enc_proj, encoded_sequence=enc_vec,
decoder_state=decoder_mem, ) encoded_proj=enc_proj,
decoder_state=decoder_mem)
with mixed_layer(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=context) with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=current_word) decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
gru_step = gru_step_layer( input=current_word)
name='gru_decoder',
input=decoder_inputs, gru_step = paddle.layer.gru_step(
output_mem=decoder_mem, name='gru_decoder',
size=decoder_size) input=decoder_inputs,
output_mem=decoder_mem,
with mixed_layer( size=decoder_size)
size=target_dict_dim, bias_attr=True,
act=SoftmaxActivation()) as out: with paddle.layer.mixed(
out += full_matrix_projection(input=gru_step) size=target_dict_dim,
return out bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
``` ```
4. Decoder differences between the training and generation
4.1 Define the name for the decoder and the first two input for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details. 1. Define the name for the decoder and the first two input for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
```python ```python
decoder_group_name = "decoder_group" decoder_group_name = "decoder_group"
group_input1 = StaticInput(input=encoded_vector, is_seq=True) group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = StaticInput(input=encoded_proj, is_seq=True) group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2] group_inputs = [group_input1, group_input2]
``` ```
4.2 In training mode:
- word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
- word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
- `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
- the sequence of next words from the target language is used as label (lbl) - the sequence of next words from the target language is used as label (lbl)
- multi-class cross-entropy (`classification_cost`) is used to calculate the cost - multi-class cross-entropy (`classification_cost`) is used to calculate the cost
```python ```python
if not is_generating: trg_embedding = paddle.layer.embedding(
trg_embedding = embedding_layer( input=paddle.layer.data(
input=data_layer( name='target_language_word',
name='target_language_word', size=target_dict_dim), type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_target_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding) group_inputs.append(trg_embedding)
decoder = recurrent_group( # For decoder equipped with attention mechanism, in training,
name=decoder_group_name, # target embedding (the ground truth) is the data input,
step=gru_decoder_with_attention, # while encoded source sequence is accessed to as an unbounded memory.
input=group_inputs) # Here, the StaticInput defines a read-only memory
# for the recurrent_group.
lbl = data_layer(name='target_language_next_word', size=target_dict_dim) decoder = paddle.layer.recurrent_group(
cost = classification_cost(input=decoder, label=lbl) name=decoder_group_name,
outputs(cost) step=gru_decoder_with_attention,
``` input=group_inputs)
4.3 In generation mode:
lbl = paddle.layer.data(
- during generation, as the decoder RNN will take the word vector generated from the previous time step as input, `GeneratedInput` is used to implement this automatically. Please refer to [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details. name='target_language_next_word',
- `beam_search` will call `gru_decoder_with_attention` to generate id type=paddle.data_type.integer_value_sequence(target_dict_dim))
- `seqtext_printer_evaluator` outputs the generated sentence to `gen_trans_file` according to `trg_lang_dict` cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```
```python
else:
trg_embedding = GeneratedInput(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
beam_gen = beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
seqtext_printer_evaluator(
input=beam_gen,
id_input=data_layer(
name="sent_id", size=1),
dict_file=trg_lang_dict,
result_file=gen_trans_file)
outputs(beam_gen)
```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details. Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
### Create Parameters
## Model Training Create every parameter that `cost` layer needs.
Training can be started with the following command:
```bash ```python
./train.sh parameters = paddle.parameters.create(cost)
``` ```
where `train.sh` contains
```bash We can get parameter names. If the parameter name is not specified during model configuration, it will be generated.
paddle train \
--config='seqToseq_net.py' \ ```python
--save_dir='model' \ for param in parameters.keys():
--use_gpu=false \ print param
--num_passes=16 \
--show_parameter_stats_period=100 \
--trainer_count=4 \
--log_period=10 \
--dot_period=5 \
2>&1 | tee 'train.log'
```
- config: configuration file for the network
- save_dir: path to save the trained model
- use_gpu: whether to use GPU for training; CPU is used here
- num_passes: number of passes for training. In PaddlePaddle, one pass means one complete pass over all the data in the training set
- show_parameter_stats_period: here we show the statistics of parameters every 100 batches
- trainer_count: the number of CPU processes or GPU devices
- log_period: here we print log every 10 batches
- dot_period: we print one "." every 5 batches
The training loss will be printed every 10 batches, and you will see messages like those below:
```text
I0719 19:16:45.952062 15563 TrainerInternal.cpp:160] Batch=10 samples=500 AvgCost=198.475 CurrentCost=198.475 Eval: classification_error_evaluator=0.737155 CurrentEval: classification_error_evaluator=0.737155
I0719 19:17:56.707319 15563 TrainerInternal.cpp:160] Batch=20 samples=1000 AvgCost=157.479 CurrentCost=116.483 Eval: classification_error_evaluator=0.698392 CurrentEval: classification_error_evaluator=0.659065
.....
``` ```
- AvgCost: average cost from batch-0 to the current batch.
- CurrentCost: the cost for the current batch
- classification\_error\_evaluator (Eval): average error rate from evaluator-0 to the current evaluator for each word
- classification\_error\_evaluator (CurrentEval): error rate for the current evaluator for each word
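The relationship between `AvgCost` and `CurrentCost` can be checked against the two log lines shown above. The sketch below assumes that each logged `CurrentCost` covers an equally sized window of batches, so the running average is a plain mean of the logged values; `running_average` is a hypothetical helper, not a PaddlePaddle function.

```python
def running_average(costs):
    """Return the mean of costs[0..n] for every prefix of the list."""
    averages = []
    total = 0.0
    for n, cost in enumerate(costs, start=1):
        total += cost
        averages.append(total / n)
    return averages


# CurrentCost values taken from the two log lines above.
current_costs = [198.475, 116.483]
avg_costs = running_average(current_costs)
```

Under that assumption the second entry comes out at about 157.479, which matches the `AvgCost` printed at `Batch=20`.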
The model training is successful when the classification\_error\_evaluator is lower than 0.35. ## Model Training
## Model Usage 1. Create trainer
### Download Pre-trained Model We need to tell trainer what to optimize, and how to optimize. Here trainer will optimize `cost` layer using stochastic gradient descent (SGD).
As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPU). We trained 16 passes (taking about 5 days) with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to download the model: ```python
```bash optimizer = paddle.optimizer.Adam(
cd pretrained learning_rate=5e-5,
./wmt14_model.sh regularization=paddle.optimizer.L2Regularization(rate=1e-3))
``` trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
```
### Usage and Results 1. Define event handler
Run the following command to perform translation from French to English: The event handler is a callback function invoked by trainer when an event happens. Here we will print log in event handler.
```bash ```python
./gen.sh def event_handler(event):
``` if isinstance(event, paddle.event.EndIteration):
where `gen.sh` contains: if event.batch_id % 10 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
```
1. Start training
```python
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=10000,
feeding=feeding)
```
```text
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
...
```
The model training is successful when the `classification_error_evaluator` is lower than 0.35.
## Model Usage
### Download Pre-trained Model
As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPU). We trained 16 passes (taking about 5 days) with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to download the model:
```bash ```bash
paddle train \ cd pretrained
--job=test \ ./wmt14_model.sh
--config='seqToseq_net.py' \
--save_dir='pretrained/wmt14_model' \
--use_gpu=true \
--num_passes=13 \
--test_pass=12 \
--trainer_count=1 \
--config_args=is_generating=1,gen_trans_file="gen_result" \
2>&1 | tee 'translation/gen.log'
``` ```
Parameters that differ from training are listed as follows:
- job: set the mode as testing.
- save_dir: path to the pre-trained model.
- num_passes and test_pass: load the model parameters from pass $i \in \left[ test\\_pass, num\\_passes-1 \right]$. Here we only load `data/wmt14_model/pass-00012`.
- config_args: pass the self-defined command line parameters to model configuration. `is_generating=1` indicates generation mode and `gen_trans_file="gen_result"` represents the file generated.
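The pass range above is easy to verify by hand. The following sketch enumerates the passes that fall in $[test\\_pass, num\\_passes-1]$ and formats them in the `pass-00012` style used in this tutorial; `passes_to_load` is a hypothetical helper, and the five-digit zero padding is inferred from that directory name.

```python
def passes_to_load(test_pass, num_passes):
    """List the pass directory names in [test_pass, num_passes - 1]."""
    # Five-digit zero padding is inferred from the name "pass-00012".
    return ["pass-%05d" % i for i in range(test_pass, num_passes)]


selected = passes_to_load(test_pass=12, num_passes=13)
```

With `--test_pass=12` and `--num_passes=13` the range contains a single value, so only one saved pass is loaded.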
For translation results please refer to [Illustrative Results](#Illustrative Results).
### BLEU Evaluation ### BLEU Evaluation
...@@ -711,7 +470,7 @@ BLEU = 26.92 ...@@ -711,7 +470,7 @@ BLEU = 26.92
## Summary ## Summary
End-to-end neural machine translation is a recently developed way to perform machine translations. In this chapter, we introduced the typical "Encoder-Decoder" framework and "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstraction generation and single-turn dialogues can all be solved with the model presented in this chapter. End-to-end neural machine translation is a recently developed way to perform machine translations. In this chapter, we introduced the typical "Encoder-Decoder" framework and "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstraction generation, and single-turn dialogues can all be solved with the model presented in this chapter.
## References ## References
...@@ -722,4 +481,4 @@ End-to-end neural machine translation is a recently developed way to perform mac ...@@ -722,4 +481,4 @@ End-to-end neural machine translation is a recently developed way to perform mac
5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318. 5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
...@@ -152,54 +152,8 @@ e_{ij}&=align(z_i,h_j)\\\\ ...@@ -152,54 +152,8 @@ e_{ij}&=align(z_i,h_j)\\\\
## Data Preparation ## Data Preparation
### Download and Uncompression
This tutorial uses the [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) from the [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) dataset as the training set, and the [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the test and generation sets. This tutorial uses the [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) from the [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) dataset as the training set, and the [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the test and generation sets.
On Linux, simply run the following commands:
```bash
cd data
./wmt14_data.sh
```
The resulting dataset `data/wmt14` contains the following three folders:
<p align = "center">
<table>
<tr>
<td>Folder Name</td>
<td>French-English Parallel Corpus</td>
<td>Number of Files</td>
<td>Size of Files</td>
</tr>
<tr>
<td>train</td>
<td>ccb2_pc30.src, ccb2_pc30.trg, etc</td>
<td>12</td>
<td>3.55G</td>
</tr>
<tr>
<td>test</td>
<td>ntst1213.src, ntst1213.trg</td>
<td>2</td>
<td>1636k</td>
</tr>
<tr>
<td>gen</td>
<td>ntst14.src, ntst14.trg</td>
<td>2</td>
<td>864k</td>
</tr>
</table>
</p>
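- `XXX.src` is the source file in French and `XXX.trg` is the target file in English. Each line of a file contains one sentence.
- `XXX.src` and `XXX.trg` have the same number of lines, and the sentences on line $i$ of the two files correspond one-to-one.
### Data Preprocessing ### Data Preprocessing
Our preprocessing pipeline consists of two steps: Our preprocessing pipeline consists of two steps: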
...@@ -220,6 +174,7 @@ cd data ...@@ -220,6 +174,7 @@ cd data
```python ```python
# Load the paddle Python package # Load the paddle Python package
import sys
import paddle.v2 as paddle import paddle.v2 as paddle
# Use CPU only, and train with a single CPU # Use CPU only, and train with a single CPU
...@@ -256,17 +211,16 @@ wmt14_reader = paddle.batch( ...@@ -256,17 +211,16 @@ wmt14_reader = paddle.batch(
decoder_size = 512 # hidden layer size of the GRU in the decoder decoder_size = 512 # hidden layer size of the GRU in the decoder
``` ```
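2. Next, implement the encoder in three steps. 1. Next, implement the encoder in three steps.
2.1 Convert the source language sequence produced by the dataset reader, in which each word is represented by its index in the dictionary, 1. The input is a sequence of words represented as a sequence of integers, where each element is the index of a word in the dictionary. We therefore define the data layer with the data type `integer_value_sequence`; each element of the sequence lies in the range `[0, source_dict_dim)`.
into the one-hot vector representation $\mathbf{w}$ of the source language sequence, whose type is integer_value_sequence.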
```python ```python
src_word_id = paddle.layer.data( src_word_id = paddle.layer.data(
name='source_language_word', name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim)) type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
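2.2 Map the encoding above to a word vector $\mathbf{s}$ in a low-dimensional language space. 1. Map the encoding above to a word vector $\mathbf{s}$ in a low-dimensional language space.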
```python ```python
src_embedding = paddle.layer.embedding( src_embedding = paddle.layer.embedding(
...@@ -274,7 +228,7 @@ wmt14_reader = paddle.batch( ...@@ -274,7 +228,7 @@ wmt14_reader = paddle.batch(
size=word_vector_dim, size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
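2.3 Encode the source language sequence with a bidirectional GRU, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$. 1. Encode the source language sequence with a bidirectional GRU, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$.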
```python ```python
src_forward = paddle.networks.simple_gru( src_forward = paddle.networks.simple_gru(
...@@ -284,16 +238,17 @@ wmt14_reader = paddle.batch( ...@@ -284,16 +238,17 @@ wmt14_reader = paddle.batch(
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward]) encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
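3. Next, define the attention-based decoder in three steps: 1. Next, define the attention-based decoder in three steps:
3.1 Pass the encoding of the source language sequence (see 2.3) through a feed-forward neural network to obtain its projection. 1. Pass the encoding of the source language sequence (see 2.3) through a feed-forward neural network to obtain its projection.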
```python ```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += paddle.layer.full_matrix_projection( encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector) input=encoded_vector)
``` ```
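3.2 Construct the initial state of the decoder RNN. The decoder has to predict the target sequence step by step, but it has no initial value at time step 0, so it needs to be initialized. Here, the last state of the backward encoding of the source language sequence is passed through a non-linear transformation and used as the initial value, i.e., $c_0=h_T$.
1. Construct the initial state of the decoder RNN. The decoder has to predict the target sequence step by step, but it has no initial value at time step 0, so it needs to be initialized. Here, the last state of the backward encoding of the source language sequence is passed through a non-linear transformation and used as the initial value, i.e., $c_0=h_T$.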
```python ```python
backward_first = paddle.layer.first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
...@@ -302,15 +257,14 @@ wmt14_reader = paddle.batch( ...@@ -302,15 +257,14 @@ wmt14_reader = paddle.batch(
decoder_boot += paddle.layer.full_matrix_projection( decoder_boot += paddle.layer.full_matrix_projection(
input=backward_first) input=backward_first)
``` ```
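3.3 Define the RNN behavior at each time step of decoding, i.e., predict the probability $p_{i+1}$ of the $(i+1)$-th word from the current source language context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th word $u_i$ of the target language.
1. Define the RNN behavior at each time step of decoding, i.e., predict the probability $p_{i+1}$ of the $(i+1)$-th word from the current source language context vector $c_i$, the decoder hidden state $z_i$, and the $i$-th word $u_i$ of the target language.
- decoder_mem records the hidden state $z_i$ from the previous time step; its initial state is decoder_boot. - decoder_mem records the hidden state $z_i$ from the previous time step; its initial state is decoder_boot.
- context calls `simple_attention` to implement $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (see 3.1). The computation of the weights $a_{ij}$ is encapsulated in `simple_attention`. - context calls `simple_attention` to implement $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (see 3.1). The computation of the weights $a_{ij}$ is encapsulated in `simple_attention`.
- decoder_inputs combines the representations of $c_i$ and the current target word current_word (i.e., $u_i$). - decoder_inputs combines the representations of $c_i$ and the current target word current_word (i.e., $u_i$).
- gru_step calls `gru_step_layer` to apply the GRU activation to decoder_inputs and decoder_mem, implementing $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$. - gru_step calls `gru_step_layer` to apply the GRU activation to decoder_inputs and decoder_mem, implementing $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$.
- Finally, softmax normalization computes the word probabilities and the result out is returned, implementing $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. - Finally, softmax normalization computes the word probabilities and the result out is returned, implementing $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$.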
```python ```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word): def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
...@@ -340,24 +294,24 @@ wmt14_reader = paddle.batch( ...@@ -340,24 +294,24 @@ wmt14_reader = paddle.batch(
out += paddle.layer.full_matrix_projection(input=gru_step) out += paddle.layer.full_matrix_projection(input=gru_step)
return out return out
``` ```
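4. Differences in how the decoder is invoked in training mode and in generation mode.
4.1 Define the name of the decoder group and the first two inputs of `gru_decoder_with_attention`. Note: both inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details. 1. Define the name of the decoder group and the first two inputs of `gru_decoder_with_attention`. Note: both inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.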
```python ```python
decoder_group_name = "decoder_group" decoder_group_name = "decoder_group"
group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True) group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True) group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2] group_inputs = [group_input1, group_input2]
``` ```
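4.2 Decoder invocation in training mode:
- First, the target language word embedding trg_embedding is passed directly to `gru_decoder_with_attention` as current_word in training mode. 1. Decoder invocation in training mode:
- Second, `recurrent_group` is used to call `gru_decoder_with_attention` repeatedly.
- Next, the next-word sequence of the target language is used as the label layer lbl, i.e., the target words to predict.
- Finally, the multi-class cross-entropy loss `classification_cost` computes the loss.
```python - First, the target language word embedding trg_embedding is passed directly to `gru_decoder_with_attention` as current_word in training mode.
- Second, `recurrent_group` is used to call `gru_decoder_with_attention` repeatedly.
- Next, the next-word sequence of the target language is used as the label layer lbl, i.e., the target words to predict.
- Finally, the multi-class cross-entropy loss `classification_cost` computes the loss.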
```python
trg_embedding = paddle.layer.embedding( trg_embedding = paddle.layer.embedding(
input=paddle.layer.data( input=paddle.layer.data(
name='target_language_word', name='target_language_word',
...@@ -380,7 +334,8 @@ wmt14_reader = paddle.batch( ...@@ -380,7 +334,8 @@ wmt14_reader = paddle.batch(
name='target_language_next_word', name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim)) type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl) cost = paddle.layer.classification_cost(input=decoder, label=lbl)
``` ```
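Note: the configuration we provide simplifies the model in Bahdanau's paper \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for details. Note: the configuration we provide simplifies the model in Bahdanau's paper \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for details.
### Parameter Definition ### Parameter Definition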
...@@ -388,7 +343,6 @@ wmt14_reader = paddle.batch( ...@@ -388,7 +343,6 @@ wmt14_reader = paddle.batch(
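First, define the model parameters from the `cost` of the model configuration. First, define the model parameters from the `cost` of the model configuration.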
```python ```python
# create parameters
parameters = paddle.parameters.create(cost) parameters = paddle.parameters.create(cost)
``` ```
...@@ -400,28 +354,36 @@ for param in parameters.keys(): ...@@ -400,28 +354,36 @@ for param in parameters.keys():
``` ```
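### Training the Model ### Training the Model
1. Construct the trainer 1. Construct the trainer
The trainer is constructed from the optimization target cost, the network topology, and the model parameters. An optimization method must also be specified at construction time; here the `paddle.trainer.SGD` trainer is used with the Adam update rule. The trainer is constructed from the optimization target cost, the network topology, and the model parameters. An optimization method must also be specified at construction time; here the `paddle.trainer.SGD` trainer is used with the Adam update rule.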
```python ```python
optimizer = paddle.optimizer.Adam(learning_rate=1e-4) optimizer = paddle.optimizer.Adam(
learning_rate=5e-5,
regularization=paddle.optimizer.L2Regularization(rate=1e-3))
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters, parameters=parameters,
update_equation=optimizer) update_equation=optimizer)
``` ```
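2. Construct the event_handler 1. Construct the event_handler
A custom callback can be used to inspect various states during training, such as the error rate. The code below uses event.batch_id % 10 == 0 to print a log, including the cost, every 10 batches. A custom callback can be used to inspect various states during training, such as the error rate. The code below uses event.batch_id % 10 == 0 to print a log, including the cost, every 10 batches.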
```python ```python
def event_handler(event): def event_handler(event):
if isinstance(event, paddle.event.EndIteration): if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0: if event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % ( print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics) event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
``` ```
3. Start training:
1. Start training:
```python ```python
trainer.train( trainer.train(
...@@ -430,30 +392,29 @@ for param in parameters.keys(): ...@@ -430,30 +392,29 @@ for param in parameters.keys():
num_passes=10000, num_passes=10000,
feeding=feeding) feeding=feeding)
``` ```
After training starts, the log printed by event_handler looks like this:
```text After training starts, the log printed by event_handler looks like this:
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283} ```text
... Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789}
.........
``` ```
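Training is considered successful when the value of `classification_error_evaluator` drops below 0.35.
## Applying the Model ## Applying the Model
### Download the Pre-trained Model ### Download the Pre-trained Model
Because training an NMT model is very time-consuming, we spent 5 days training 16 passes on a cluster of 50 physical nodes (each with two 6-core CPUs), with each pass taking 7 hours. We therefore provide a pre-trained model (pass-00012) for direct download. The model is 205MB in size and has the highest [BLEU Evaluation](#BLEU评估) score, 26.92, among the 16 models. The commands to download and extract the model are as follows: Because training an NMT model is very time-consuming, we spent 5 days training 16 passes on a cluster of 50 physical nodes (each with two 6-core CPUs), with each pass taking 7 hours. We therefore provide a pre-trained model (pass-00012) for direct download. The model is 205MB in size and has the highest [BLEU Evaluation](#BLEU评估) score, 26.92, among the 16 models. The commands to download and extract the model are as follows: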
```bash ```bash
cd pretrained cd pretrained
./wmt14_model.sh ./wmt14_model.sh
``` ```
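### Usage and Results
The new API does not yet support the translation (generation) process for machine translation; stay tuned.
For translation results, see [Illustrative Results](#效果展示)
### BLEU Evaluation ### BLEU Evaluation
BLEU (Bilingual Evaluation Understudy) is a widely used automatic evaluation metric for machine translation, proposed by IBM's Watson Research Center in 2002 \[[5](#参考文献)\]. Its basic premise is that the closer a machine translation is to a professional human translation, the better the translation system. The closeness between the machine translation and the human reference translation is measured by sentence precision, i.e., by counting the n-grams that match between the two: the more matches, the better the BLEU score. BLEU (Bilingual Evaluation Understudy) is a widely used automatic evaluation metric for machine translation, proposed by IBM's Watson Research Center in 2002 \[[5](#参考文献)\]. Its basic premise is that the closer a machine translation is to a professional human translation, the better the translation system. The closeness between the machine translation and the human reference translation is measured by sentence precision, i.e., by counting the n-grams that match between the two: the more matches, the better the BLEU score.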
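The n-gram matching idea behind BLEU can be illustrated with a small sketch. It computes only the clipped n-gram precision for one candidate and one reference; the full BLEU score also combines several n-gram orders and applies a brevity penalty, which are omitted here. The function names are illustrative and are not taken from any evaluation script used by this tutorial.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / float(total)


candidate = "the the the cat".split()
reference = "the cat sat".split()
print(modified_precision(candidate, reference, 1))  # 2 of 4 unigrams are credited: 0.5
```

Clipping is what keeps a candidate from scoring well by repeating a common word: "the" occurs three times in the candidate but is credited only once.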
......
...@@ -105,9 +105,8 @@ def main(): ...@@ -105,9 +105,8 @@ def main():
# define optimize method and trainer # define optimize method and trainer
optimizer = paddle.optimizer.Adam(learning_rate=1e-4) optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(
parameters=parameters, cost=cost, parameters=parameters, update_equation=optimizer)
update_equation=optimizer)
# define data reader # define data reader
feeding = { feeding = {
......
...@@ -227,77 +227,10 @@ Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder ...@@ -227,77 +227,10 @@ Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder
## Data Preparation ## Data Preparation
### Download and Uncompression
This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set. This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set.
Run the following command in Linux to obtain the data:
```bash
cd data
./wmt14_data.sh
```
There are three folders in the downloaded dataset `data/wmt14`:
<p align = "center">
<table>
<tr>
<td>Folder Name</td>
<td>French-English Parallel Corpus</td>
<td>Number of Files</td>
<td>Size of Files</td>
</tr>
<tr>
<td>train</td>
<td>ccb2_pc30.src, ccb2_pc30.trg, etc</td>
<td>12</td>
<td>3.55G</td>
</tr>
<tr>
<td>test</td>
<td>ntst1213.src, ntst1213.trg</td>
<td>2</td>
<td>1636k</td>
</tr>
<tr>
<td>gen</td>
<td>ntst14.src, ntst14.trg</td>
<td>2</td>
<td>864k</td>
</tr>
</table>
</p>
- `XXX.src` is the source file in French and `XXX.trg` is the target file in English. Each row of the file contains one sentence.
- `XXX.src` and `XXX.trg` have the same number of rows and there is a one-to-one correspondence between the sentences at any row from the two files.
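The row-by-row correspondence between the two files can be sketched as follows. `read_parallel` is a hypothetical helper written for this illustration, not part of the tutorial's scripts; it accepts any two iterables of lines, so open file objects for an `XXX.src` and `XXX.trg` pair could be passed in directly.

```python
def read_parallel(src_lines, trg_lines):
    """Pair row i of the source text with row i of the target text."""
    src = [line.rstrip("\n") for line in src_lines]
    trg = [line.rstrip("\n") for line in trg_lines]
    # The two files are expected to have the same number of rows.
    if len(src) != len(trg):
        raise ValueError("source has %d rows, target has %d" % (len(src), len(trg)))
    return list(zip(src, trg))


pairs = read_parallel(["bonjour\n", "merci\n"], ["hello\n", "thank you\n"])
print(pairs[1])
```

Failing loudly on a length mismatch is a deliberate choice in this sketch: a silently truncated pairing would misalign every later sentence pair.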
### User Defined Dataset (Optional)
To use your own dataset, just put it under the `data` folder and organize it as follows ### Data Preprocessing
```text
user_dataset
├── train
│   ├── train_file1.src
│   ├── train_file1.trg
│   └── ...
├── test
│   ├── test_file1.src
│   ├── test_file1.trg
│   └── ...
├── gen
│   ├── gen_file1.src
│   ├── gen_file1.trg
│   └── ...
```
Explanation of the directories:
- First level: `user_dataset`: the name of the user defined dataset.
- Second level: `train`, `test` and `gen`: these names should not be changed.
- Third level: Parallel corpus in source language and target language, each with a postfix of `.src` and `.trg`.
### Data Pre-processing
There are two steps for pre-processing: There are two steps for pre-processing:
- Merge the source and target parallel corpus files into one file - Merge the source and target parallel corpus files into one file
...@@ -306,245 +239,104 @@ There are two steps for pre-processing: ...@@ -306,245 +239,104 @@ There are two steps for pre-processing:
- Create source dictionary and target dictionary, each containing **DICTSIZE** words: the (DICTSIZE - 3) most frequent words from the corpus and 3 special tokens, `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary). - Create source dictionary and target dictionary, each containing **DICTSIZE** words: the (DICTSIZE - 3) most frequent words from the corpus and 3 special tokens, `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary).
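The dictionary construction step can be sketched in a few lines. This is an illustration under stated assumptions rather than the tutorial's preprocessing script: `build_dict` is a hypothetical name, the ids 0, 1 and 2 for `<s>`, `<e>` and `<unk>` are assumed (only the unknown-word index 2 appears elsewhere in this tutorial), and ties in frequency are broken alphabetically to keep the result deterministic.

```python
from collections import Counter

SPECIAL_TOKENS = ["<s>", "<e>", "<unk>"]  # assumed to occupy ids 0, 1, 2


def build_dict(sentences, dict_size):
    """Map the special tokens and the most frequent words to integer ids."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    keep = max(0, dict_size - len(SPECIAL_TOKENS))
    words = [word for word, _ in ranked[:keep]]
    return {word: idx for idx, word in enumerate(SPECIAL_TOKENS + words)}


word_ids = build_dict(["a b a", "b a c"], dict_size=5)
print(sorted(word_ids.items(), key=lambda kv: kv[1]))
```

Here `c` is the least frequent word and does not fit into a dictionary of size 5, so it would later be mapped to `<unk>`.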
`preprocess.py` is used for pre-processing: ### A Subset of Dataset
```python
python preprocess.py -i INPUT [-d DICTSIZE] [-m]
```
- `-i INPUT`: path to the original dataset.
- `-d DICTSIZE`: number of words in the dictionary. If unspecified, the dictionary will contain all the words that appear in the input dataset.
- `-m --mergeDict`: merge the source dictionary with target dictionary, making the two dictionaries have the same content.
The specific command to run the script is as follows: Because the full dataset is very large, the PaddlePaddle package `paddle.dataset.wmt14` provides a preprocessed [subset of the dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) to reduce download time.
```python
python preprocess.py -i data/wmt14 -d 30000
```
You will see the following messages after a few minutes:
```text
concat parallel corpora for dataset
build source dictionary for train data
build target dictionary for train data
dictionary size is 30000
```
The pre-processed data is located at `data/pre-wmt14`:
```text
pre-wmt14
├── train
│   └── train
├── test
│   └── test
├── gen
│   └── gen
├── train.list
├── test.list
├── gen.list
├── src.dict
└── trg.dict
```
- `train`, `test` and `gen`: contains French-English parallel corpus for training, testing and generation. Each row from each file is separated into two columns with a "\t", where the first column is the sequence in French and the second one is in English.
- `train.list`, `test.list` and `gen.list`: record respectively the path to `train`, `test` and `gen` folders.
- `src.dict` and `trg.dict`: source (French) and target (English) dictionary. Each dictionary contains 30000 words (29997 most frequent words and 3 special tokens).
### Providing Data to PaddlePaddle This subset has 193319 training instances and 6003 test instances. The dictionary size is 30000. Because the subset is small, the quality of a model trained on it is not guaranteed.
We use `dataprovider.py` to provide data to PaddlePaddle as follows: ## Training Instructions
1. Import PyDataProvider2 package from PaddlePaddle and define three special tokens: ### Initialize PaddlePaddle
```python ```python
from paddle.trainer.PyDataProvider2 import * import sys
UNK_IDX = 2 #out of vocabulary word import paddle.v2 as paddle
START = "<s>" #begin of sequence
END = "<e>" #end of sequence
```
2. Use initialization function `hook` to define the input data types (`input_types`) for training and generation:
- Training: there are three input sequences, where "source language sequence" and "target language sequence" are input and the "target language next word sequence" is the label.
- Generation: there are two input sequences, where the "source language sequence" is the input and “source language sequence id” are the ids for the input data (optional).
`src_dict_path` in the `hook` function is the path to the source language dictionary, while `trg_dict_path` the path to target language dictionary. `is_generating` is passed from model config file. For more details on the usage of the `hook` function please refer to [Model Config](#Model Config).
```python
def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list,
         **kwargs):
    # job_mode = 1: training 0: generation
    settings.job_mode = not is_generating

    def fun(dict_path):  # load dictionary according to the path
        out_dict = dict()
        with open(dict_path, "r") as fin:
            out_dict = {
                line.strip(): line_count
                for line_count, line in enumerate(fin)
            }
        return out_dict

    settings.src_dict = fun(src_dict_path)
    settings.trg_dict = fun(trg_dict_path)

    if settings.job_mode:  # training
        settings.input_types = {
            'source_language_word':  # source language sequence
            integer_value_sequence(len(settings.src_dict)),
            'target_language_word':  # target language sequence
            integer_value_sequence(len(settings.trg_dict)),
            'target_language_next_word':  # target language next word sequence
            integer_value_sequence(len(settings.trg_dict))
        }
    else:  # generation
        settings.input_types = {
            'source_language_word':  # source language sequence
            integer_value_sequence(len(settings.src_dict)),
            'sent_id':  # source language sequence id
            integer_value_sequence(len(open(file_list[0], "r").readlines()))
        }
```
3. Use `process` function to open the file `file_name`, read each row of the file, convert the data to be compatible with `input_types`, and then use `yield` to return to PaddlePaddle process. More specifically
- add `<s>` to the beginning of each source language sequence and add `<e>` to the end, producing "source_language_word".
- add `<s>` to the beginning of each target language sequence, producing "target_language_word".
- add `<e>` to the end of each target language sequence, producing "target_language_next_word".
```python
def _get_ids(s, dictionary):  # get the location of each word from the source language sequence in the dictionary
    words = s.strip().split()
    return [dictionary[START]] + \
           [dictionary.get(w, UNK_IDX) for w in words] + \
           [dictionary[END]]


@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
    with open(file_name, 'r') as f:
        for line_count, line in enumerate(f):
            line_split = line.strip().split('\t')
            if settings.job_mode and len(line_split) != 2:
                continue
            src_seq = line_split[0]
            src_ids = _get_ids(src_seq, settings.src_dict)
            if settings.job_mode:
                trg_seq = line_split[1]
                trg_words = trg_seq.split()
                trg_ids = [settings.trg_dict.get(w, UNK_IDX) for w in trg_words]
                # sequences longer than 80 will be removed during training to avoid an overly deep RNN.
                if len(src_ids) > 80 or len(trg_ids) > 80:
                    continue
                trg_ids_next = trg_ids + [settings.trg_dict[END]]
                trg_ids = [settings.trg_dict[START]] + trg_ids
                yield {
                    'source_language_word': src_ids,
                    'target_language_word': trg_ids,
                    'target_language_next_word': trg_ids_next
                }
            else:
                yield {'source_language_word': src_ids, 'sent_id': [line_count]}
```
Note: The size of the training data is 3.55G. For machines with limited memory, it is recommended to use `pool_size` to set the number of data samples stored in memory.
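To make the three training sequences concrete, the sketch below applies the same token-to-id rules to one sentence pair using toy dictionaries. The dictionaries, the sentence pair, and the helper name `make_sample` are made up for illustration; only `UNK_IDX`, `START`, and `END` follow the definitions given earlier in this section.

```python
UNK_IDX = 2    # out of vocabulary word
START = "<s>"  # begin of sequence
END = "<e>"    # end of sequence


def make_sample(src_sentence, trg_sentence, src_dict, trg_dict):
    """Build the three id sequences used for training from one sentence pair."""
    src_ids = ([src_dict[START]] +
               [src_dict.get(w, UNK_IDX) for w in src_sentence.split()] +
               [src_dict[END]])
    trg_ids = [trg_dict.get(w, UNK_IDX) for w in trg_sentence.split()]
    return {
        'source_language_word': src_ids,
        # The decoder input starts with <s>; the label sequence ends with <e>.
        'target_language_word': [trg_dict[START]] + trg_ids,
        'target_language_next_word': trg_ids + [trg_dict[END]],
    }


# Toy dictionaries, made up for this example.
toy_src = {"<s>": 0, "<e>": 1, "<unk>": 2, "bonjour": 3, "monde": 4}
toy_trg = {"<s>": 0, "<e>": 1, "<unk>": 2, "hello": 3, "world": 4}
sample = make_sample("bonjour le monde", "hello world", toy_src, toy_trg)
print(sample)
```

The target input and the label are the same sequence shifted by one position, so at every step the decoder sees word $u_i$ and is trained to predict word $u_{i+1}$. The word `le` is not in the toy source dictionary and therefore maps to `UNK_IDX`.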
## Model Config
### Data Definition
1. Specify the path to data and source/target dictionaries. `is_generating` accepts an argument passed from the command line and denotes whether the current configuration is for training (default) or generation. See [Usage and Results](#Usage and Results).
```python
import os
from paddle.trainer_config_helpers import *
data_dir = "./data/pre-wmt14" # data path # train with a single CPU
src_lang_dict = os.path.join(data_dir, 'src.dict') # path to the source language dictionary paddle.init(use_gpu=False, trainer_count=1)
trg_lang_dict = os.path.join(data_dir, 'trg.dict') # path to the target language dictionary ```
is_generating = get_config_arg("is_generating", bool, False) # config mode
```
2. Use `define_py_data_sources2` to get data from `dataprovider.py`, and use the `args` variable to pass in the source/target language dictionary paths and the config mode.
```python ### Define DataSet
if not is_generating:
    train_list = os.path.join(data_dir, 'train.list')
    test_list = os.path.join(data_dir, 'test.list')
else:
    train_list = None
    test_list = os.path.join(data_dir, 'gen.list')
define_py_data_sources2(
    train_list,
    test_list,
    module="dataprovider",
    obj="process",
    args={
        "src_dict_path": src_lang_dict,  # source language dictionary path
        "trg_dict_path": trg_lang_dict,  # target language dictionary path
        "is_generating": is_generating  # config mode
    })
```
### Algorithm Configuration We will define dictionary size, and create [**data reader**](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#python-data-reader-design-doc) for WMT-14 dataset.
```python ```python
settings( # source and target dict dim.
learning_method = AdamOptimizer(), dict_size = 30000
batch_size = 50,
learning_rate = 5e-4) feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
``` ```
This tutorial uses the default SGD with the Adam learning algorithm and a learning rate of 5e-4. Note that `batch_size = 50` means that each batch contains 50 sequences.
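The reader composition used in the data reader definition above (a shuffled reader wrapped in a batched reader) can be sketched in plain Python. The two functions below are simplified stand-ins written for this illustration; they are not the PaddlePaddle implementations of `paddle.reader.shuffle` and `paddle.batch`, and the `seed` parameter is an addition that makes the example reproducible.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Wrap a reader so that samples are shuffled within buffers of buf_size."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the last, partially filled buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item

    return shuffled_reader


def batch(reader, batch_size):
    """Wrap a reader so that it yields lists of up to batch_size samples."""

    def batched_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            yield current

    return batched_reader


def toy_reader():
    # Stand-in for a dataset reader: yields 12 integer "samples".
    for i in range(12):
        yield i


batches = list(batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=5)())
print([len(b) for b in batches])  # [5, 5, 2]
```

A reader here is a function that returns an iterator over samples, so wrappers can be stacked without loading the whole dataset into memory. A larger `buf_size` gives better mixing at the cost of more buffered samples.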
### Model Structure ### Model Configuration
1. Define some global variables 1. Define some global variables
```python ```python
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary source_dict_dim = dict_size # source language dictionary size
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary target_dict_dim = dict_size # target language dictionary size
word_vector_dim = 512 # dimensionality of word vector word_vector_dim = 512 # word embedding dimension
encoder_size = 512 # dimensionality of the hidden state of encoder GRU encoder_size = 512 # hidden layer size of GRU in encoder
decoder_size = 512 # dimensionality of the hidden state of decoder GRU decoder_size = 512 # hidden layer size of GRU in decoder
if is_generating:
beam_size=3 # beam size for the beam search algorithm
max_length=250 # maximum length for the generated sentence
gen_trans_file = get_config_arg("gen_trans_file", str, None) # generate file
``` ```
2. Implement Encoder as follows: 1. Implement Encoder as follows:
1. Input is a sequence of words represented by an integer word index sequence. So we define data layer of data type `integer_value_sequence`. The value range of each element in the sequence is `[0, source_dict_dim)`
2.1 Input one-hot vector representations $\mathbf{w}$ converted with `dataprovider.py` from the source language sentence
```python ```python
src_word_id = data_layer(name='source_language_word', size=source_dict_dim) src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
2.2 Map the one-hot vector into a word vector $\mathbf{s}$ in a low-dimensional semantic space
1. Map the one-hot vector (represented by word index) into a word vector $\mathbf{s}$ in a low-dimensional semantic space
```python ```python
src_embedding = embedding_layer( src_embedding = paddle.layer.embedding(
input=src_word_id, input=src_word_id,
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
2.3 Use a bidirectional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
1. Use a bidirectional GRU to encode the source language sequence, and concatenate the encoding outputs from the two GRUs to get $\mathbf{h}$
```python ```python
src_forward = simple_gru(input=src_embedding, size=encoder_size) src_forward = paddle.networks.simple_gru(
src_backward = simple_gru( input=src_embedding, size=encoder_size)
input=src_embedding, size=encoder_size, reverse=True) src_backward = paddle.networks.simple_gru(
encoded_vector = concat_layer(input=[src_forward, src_backward]) input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
3. Implement Attention-based Decoder as follows: 1. Implement Attention-based Decoder as follows:
3.1 Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network 1. Get a projection of the encoding (c.f. 2.3) of the source language sequence by passing it into a feed forward neural network
```python ```python
with mixed_layer(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += full_matrix_projection(input=encoded_vector) encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
``` ```
3.2 Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
1. Use a non-linear transformation of the last hidden state of the backward GRU on the source language sentence as the initial state of the decoder RNN $c_0=h_T$
```python ```python
backward_first = first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
with mixed_layer( with paddle.layer.mixed(
size=decoder_size, size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
act=TanhActivation(), ) as decoder_boot: decoder_boot += paddle.layer.full_matrix_projection(
decoder_boot += full_matrix_projection(input=backward_first) input=backward_first)
``` ```
3.3 Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
1. Define the computation in each time step for the decoder RNN, i.e., according to the current context vector $c_i$, hidden state for the decoder $z_i$ and the $i$-th word $u_i$ in the target language to predict the probability $p_{i+1}$ for the $i+1$-th word.
- decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot. - decoder_mem records the hidden state $z_i$ from the previous time step, with an initial state as decoder_boot.
- context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`. - context is computed via `simple_attention` as $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (c.f. 3.1). $a_{ij}$ is calculated within `simple_attention`.
...@@ -553,181 +345,148 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn ...@@ -553,181 +345,148 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
- Softmax normalization is used in the end to compute the probability of words, i.e., $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned. - Softmax normalization is used in the end to compute the probability of words, i.e., $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. The output is returned.
```python ```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word): def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
decoder_mem = memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = simple_attention(
encoded_sequence=enc_vec, context = paddle.networks.simple_attention(
encoded_proj=enc_proj, encoded_sequence=enc_vec,
decoder_state=decoder_mem, ) encoded_proj=enc_proj,
decoder_state=decoder_mem)
with mixed_layer(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=context) with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += full_matrix_projection(input=current_word) decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
gru_step = gru_step_layer( input=current_word)
name='gru_decoder',
input=decoder_inputs, gru_step = paddle.layer.gru_step(
output_mem=decoder_mem, name='gru_decoder',
size=decoder_size) input=decoder_inputs,
output_mem=decoder_mem,
with mixed_layer( size=decoder_size)
size=target_dict_dim, bias_attr=True,
act=SoftmaxActivation()) as out: with paddle.layer.mixed(
out += full_matrix_projection(input=gru_step) size=target_dict_dim,
return out bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
``` ```
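The context computation $c_i=\sum_{j=1}^{T}a_{ij}h_j$ described above can be illustrated without PaddlePaddle. The following is a minimal sketch in plain Python, not the implementation of `simple_attention`: it assumes the raw alignment scores are already given, and the names `softmax`, `attention_context`, `toy_scores`, and `toy_states` are hypothetical and used only for illustration.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(scores, encoder_states):
    """Return (weights, context), where context = sum_j a_ij * h_j."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Hypothetical toy values: three source positions, two-dimensional states.
toy_scores = [0.0, 0.0, 0.0]
toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context(toy_scores, toy_states)
```

With equal scores, every source position receives the same weight, so the context is simply the mean of the encoder states.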
4. Decoder differences between the training and generation
4.1 Define the name for the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details. 1. Define the name for the decoder and the first two inputs for `gru_decoder_with_attention`. Note that `StaticInput` is used for the two inputs. Please refer to [StaticInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for more details.
```python ```python
decoder_group_name = "decoder_group" decoder_group_name = "decoder_group"
group_input1 = StaticInput(input=encoded_vector, is_seq=True) group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = StaticInput(input=encoded_proj, is_seq=True) group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2] group_inputs = [group_input1, group_input2]
``` ```
4.2 In training mode:
1. Training mode:
- word embedding from the target langauge trg_embedding is passed to `gru_decoder_with_attention` as current_word. - word embedding from the target language trg_embedding is passed to `gru_decoder_with_attention` as current_word.
- `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way
- the sequence of next words from the target language is used as label (lbl) - the sequence of next words from the target language is used as label (lbl)
- multi-class cross-entropy (`classification_cost`) is used to calculate the cost - multi-class cross-entropy (`classification_cost`) is used to calculate the cost
```python ```python
if not is_generating: trg_embedding = paddle.layer.embedding(
trg_embedding = embedding_layer( input=paddle.layer.data(
input=data_layer( name='target_language_word',
name='target_language_word', size=target_dict_dim), type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_target_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding) group_inputs.append(trg_embedding)
decoder = recurrent_group( # For decoder equipped with attention mechanism, in training,
name=decoder_group_name, # target embeding (the groudtruth) is the data input,
step=gru_decoder_with_attention, # while encoded source sequence is accessed to as an unbounded memory.
input=group_inputs) # Here, the StaticInput defines a read-only memory
# for the recurrent_group.
lbl = data_layer(name='target_language_next_word', size=target_dict_dim) decoder = paddle.layer.recurrent_group(
cost = classification_cost(input=decoder, label=lbl) name=decoder_group_name,
outputs(cost) step=gru_decoder_with_attention,
``` input=group_inputs)
4.3 In generation mode:
lbl = paddle.layer.data(
- during generation, as the decoder RNN will take the word vector generated from the previous time step as input, `GeneratedInput` is used to implement this automatically. Please refer to [GeneratedInput Document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details. name='target_language_next_word',
- `beam_search` will call `gru_decoder_with_attention` to generate id type=paddle.data_type.integer_value_sequence(target_dict_dim))
- `seqtext_printer_evaluator` outputs the generated sentence to `gen_trans_file` according to `trg_lang_dict` cost = paddle.layer.classification_cost(input=decoder, label=lbl)
```
```python
else:
trg_embedding = GeneratedInput(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
beam_gen = beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
seqtext_printer_evaluator(
input=beam_gen,
id_input=data_layer(
name="sent_id", size=1),
dict_file=trg_lang_dict,
result_file=gen_trans_file)
outputs(beam_gen)
```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details. Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
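In training mode, the decoder reads the target sentence as `target_language_word` and is scored against `target_language_next_word`. A minimal sketch of how such a pair could be built from one target sentence is shown below. It assumes that the start and end markers use ids 0 and 1, matching `bos_id=0` and `eos_id=1` in the generation configuration; the helper `make_decoder_pair` is hypothetical and not part of PaddlePaddle.

```python
BOS_ID = 0  # assumed start-of-sentence id (bos_id=0 in the generation config)
EOS_ID = 1  # assumed end-of-sentence id (eos_id=1 in the generation config)


def make_decoder_pair(target_ids):
    """Build (current words, next words) for one target sentence.

    The decoder input is the sentence shifted right by the start marker,
    and the label is the sentence followed by the end marker.
    """
    current_words = [BOS_ID] + list(target_ids)
    next_words = list(target_ids) + [EOS_ID]
    return current_words, next_words


# Hypothetical word ids for a three-word target sentence.
current_words, next_words = make_decoder_pair([7, 4, 9])
```

At every position, the label is the word that follows the current input word, which is what the cross-entropy cost compares against the decoder output.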
### Create Parameters
## Model Training Create every parameter that `cost` layer needs.
Training can be started with the following command: ```python
parameters = paddle.parameters.create(cost)
```bash
./train.sh
``` ```
where `train.sh` contains
```bash We can get parameter names. If the parameter name is not specified during model configuration, it will be generated.
paddle train \
--config='seqToseq_net.py' \ ```python
--save_dir='model' \ for param in parameters.keys():
--use_gpu=false \ print param
--num_passes=16 \
--show_parameter_stats_period=100 \
--trainer_count=4 \
--log_period=10 \
--dot_period=5 \
2>&1 | tee 'train.log'
```
- config: configuration file for the network
- save_dir: path to save the trained model
- use_gpu: whether to use GPU for training; CPU is used here
- num_passes: number of training passes. In PaddlePaddle, one pass means one complete pass over all the data in the training set
- show_parameter_stats_period: here we show the statistics of parameters every 100 batches
- trainer_count: the number of CPU processes or GPU devices
- log_period: here we print log every 10 batches
- dot_period: we print one "." every 5 batches
The training loss will be printed every 10 batches, and you will see messages like those below:
```text
I0719 19:16:45.952062 15563 TrainerInternal.cpp:160] Batch=10 samples=500 AvgCost=198.475 CurrentCost=198.475 Eval: classification_error_evaluator=0.737155 CurrentEval: classification_error_evaluator=0.737155
I0719 19:17:56.707319 15563 TrainerInternal.cpp:160] Batch=20 samples=1000 AvgCost=157.479 CurrentCost=116.483 Eval: classification_error_evaluator=0.698392 CurrentEval: classification_error_evaluator=0.659065
.....
``` ```
- AvgCost: average cost from batch-0 to the current batch.
- CurrentCost: the cost for the current batch
- classification\_error\_evaluator (Eval): average error rate from evaluator-0 to the current evaluator for each word
- classification\_error\_evaluator (CurrentEval): error rate for the current evaluator for each word
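The relation between AvgCost and CurrentCost can be checked against the two log lines above. This is a small sketch that assumes every log period covers the same number of samples (500 here), so the running average is a plain mean; `running_average` is a hypothetical helper.

```python
def running_average(period_costs):
    """Return the running mean after each log period."""
    averages = []
    total = 0.0
    for count, cost in enumerate(period_costs, 1):
        total += cost
        averages.append(total / count)
    return averages


# CurrentCost values taken from the two log lines above.
avg_costs = running_average([198.475, 116.483])
```

The second running mean is about 157.479, which agrees with the AvgCost printed at Batch=20.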
The model training is successful when the classification\_error\_evaluator is lower than 0.35. ## Model Training
## Model Usage 1. Create trainer
### Download Pre-trained Model We need to tell the trainer what to optimize and how to optimize it. Here the trainer (`paddle.trainer.SGD`) optimizes the `cost` layer, with Adam as the update equation.
As the training of an NMT model is very time consuming, we provide a pre-trained model (pass-00012, ~205M). The model is trained with a cluster of 50 physical nodes (each node has two 6-core CPU). We trained 16 passes (taking about 5 days) with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to down load the model: ```python
```bash optimizer = paddle.optimizer.Adam(
cd pretrained learning_rate=5e-5,
./wmt14_model.sh regularization=paddle.optimizer.L2Regularization(rate=1e-3))
``` trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
```
### Usage and Results 1. Define event handler
Run the following command to perform translation from French to English: The event handler is a callback function invoked by trainer when an event happens. Here we will print log in event handler.
```bash ```python
./gen.sh def event_handler(event):
``` if isinstance(event, paddle.event.EndIteration):
where `gen.sh` contains: if event.batch_id % 10 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
```
1. Start training
```python
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=10000,
feeding=feeding)
```
```text
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283}
...
```
The model training is successful when the `classification_error_evaluator` is lower than 0.35.
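To make the 0.35 threshold concrete, the sketch below computes a per-word classification error rate in plain Python. It is an illustration under stated assumptions, not the code behind `classification_error_evaluator`: it assumes the error rate is the fraction of positions whose highest-probability word differs from the label, and all names and values are hypothetical.

```python
def classification_error(predicted_probs, labels):
    """Fraction of positions whose arg-max prediction differs from the label."""
    errors = 0
    for probs, label in zip(predicted_probs, labels):
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        if predicted != label:
            errors += 1
    return errors / float(len(labels))


# Hypothetical decoder outputs for four positions over a three-word dictionary.
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
toy_labels = [0, 1, 0, 1]
error_rate = classification_error(toy_probs, toy_labels)
training_succeeded = error_rate < 0.35
```

Two of the four toy predictions are wrong, so this example would not yet meet the threshold.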
## Model Usage
### Download Pre-trained Model
As the training of an NMT model is very time-consuming, we provide a pre-trained model (pass-00012, ~205 MB). The model was trained on a cluster of 50 physical nodes (each node has two 6-core CPUs). We trained 16 passes (taking about 5 days), with each pass taking about 7 hours. The provided model (pass-00012) has the highest [BLEU Score](#BLEU Score) of 26.92. Run the following command to download the model:
```bash ```bash
paddle train \ cd pretrained
--job=test \ ./wmt14_model.sh
--config='seqToseq_net.py' \
--save_dir='pretrained/wmt14_model' \
--use_gpu=true \
--num_passes=13 \
--test_pass=12 \
--trainer_count=1 \
--config_args=is_generating=1,gen_trans_file="gen_result" \
2>&1 | tee 'translation/gen.log'
``` ```
Parameters that differ from training are listed as follows:
- job: set the mode as testing.
- save_dir: path to the pre-trained model.
- num_passes and test_pass: load the model parameters from pass $i \in \left [ test\\_pass,num\\_passes-1 \right ]$. Here we only load `data/wmt14_model/pass-00012`.
- config_args: pass the self-defined command line parameters to model configuration. `is_generating=1` indicates generation mode and `gen_trans_file="gen_result"` represents the file generated.
For translation results please refer to [Illustrative Results](#Illustrative Results).
### BLEU Evaluation ### BLEU Evaluation
...@@ -753,7 +512,7 @@ BLEU = 26.92 ...@@ -753,7 +512,7 @@ BLEU = 26.92
## Summary ## Summary
End-to-end neural machine translation is a recently developed way to perform machine translations. In this chapter, we introduced the typical "Encoder-Decoder" framework and "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstraction generation and single-turn dialogues can all be solved with the model presented in this chapter. End-to-end neural machine translation is a recently developed way to perform machine translations. In this chapter, we introduced the typical "Encoder-Decoder" framework and "attention" mechanism. Since NMT is a typical Sequence-to-Sequence (Seq2Seq) learning problem, tasks such as query rewriting, abstraction generation, and single-turn dialogues can all be solved with the model presented in this chapter.
## References ## References
...@@ -764,7 +523,7 @@ End-to-end neural machine translation is a recently developed way to perform mac ...@@ -764,7 +523,7 @@ End-to-end neural machine translation is a recently developed way to perform mac
5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318. 5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -194,54 +194,8 @@ e_{ij}&=align(z_i,h_j)\\\\ ...@@ -194,54 +194,8 @@ e_{ij}&=align(z_i,h_j)\\\\
## Data Introduction ## Data Introduction
### Download and Extraction
This tutorial uses [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) from the [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) dataset as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the test and generation sets. This tutorial uses [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) from the [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) dataset as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the test and generation sets.
On Linux, simply run the following commands:
```bash
cd data
./wmt14_data.sh
```
The resulting dataset `data/wmt14` contains the following three folders:
<p align = "center">
<table>
<tr>
<td>Folder name</td>
<td>French-English parallel corpus files</td>
<td>Number of files</td>
<td>Size</td>
</tr>
<tr>
<td>train</td>
<td>ccb2_pc30.src, ccb2_pc30.trg, etc</td>
<td>12</td>
<td>3.55G</td>
</tr>
<tr>
<td>test</td>
<td>ntst1213.src, ntst1213.trg</td>
<td>2</td>
<td>1636k</td>
</tr>
<tr>
<td>gen</td>
<td>ntst14.src, ntst14.trg</td>
<td>2</td>
<td>864k</td>
</tr>
</table>
</p>
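- `XXX.src` is the French source file and `XXX.trg` is the English target file; each line of a file holds one sentence
- `XXX.src` and `XXX.trg` have the same number of lines, and the sentences on line $i$ of the two files correspond one-to-one.
### Data Preprocessing ### Data Preprocessing
Our preprocessing pipeline consists of two steps: Our preprocessing pipeline consists of two steps: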
...@@ -262,6 +216,7 @@ cd data ...@@ -262,6 +216,7 @@ cd data
```python ```python
# Load the paddle Python package # Load the paddle Python package
import sys
import paddle.v2 as paddle import paddle.v2 as paddle
# Use the CPU only, and train with a single CPU # Use the CPU only, and train with a single CPU
...@@ -298,17 +253,16 @@ wmt14_reader = paddle.batch( ...@@ -298,17 +253,16 @@ wmt14_reader = paddle.batch(
decoder_size = 512 # size of the GRU hidden layer in the decoder decoder_size = 512 # size of the GRU hidden layer in the decoder
``` ```
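2. Next, implement the encoder. This takes three steps: 1. Next, implement the encoder. This takes three steps:
2.1 Convert the source language sequence, which the dataset reader produces as the dictionary index of each word, 1 The input is a sequence of words represented as a sequence of integers, where each element is the index of a word in the dictionary. We therefore define the data type of the data layer as `integer_value_sequence`, with each element in the range `[0, source_dict_dim)`.
into the source language sequence $\mathbf{w}$ represented as one-hot vectors, with type integer_value_sequence.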
```python ```python
src_word_id = paddle.layer.data( src_word_id = paddle.layer.data(
name='source_language_word', name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim)) type=paddle.data_type.integer_value_sequence(source_dict_dim))
``` ```
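2.2 Map the encoding above to word vectors $\mathbf{s}$ in a low-dimensional language space. 1. Map the encoding above to word vectors $\mathbf{s}$ in a low-dimensional language space.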
```python ```python
src_embedding = paddle.layer.embedding( src_embedding = paddle.layer.embedding(
...@@ -316,7 +270,7 @@ wmt14_reader = paddle.batch( ...@@ -316,7 +270,7 @@ wmt14_reader = paddle.batch(
size=word_vector_dim, size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding')) param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
``` ```
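2.3 Encode the source language sequence with a bidirectional GRU, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$. 1. Encode the source language sequence with a bidirectional GRU, and concatenate the outputs of the two GRUs to obtain $\mathbf{h}$.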
```python ```python
src_forward = paddle.networks.simple_gru( src_forward = paddle.networks.simple_gru(
...@@ -326,16 +280,17 @@ wmt14_reader = paddle.batch( ...@@ -326,16 +280,17 @@ wmt14_reader = paddle.batch(
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward]) encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
``` ```
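3. Then, define the attention-based decoder. This takes three steps: 1. Then, define the attention-based decoder. This takes three steps:
3.1 Pass the encoded source language sequence (see 2.3) through a feed-forward neural network to obtain its projection. 1. Pass the encoded source language sequence (see 2.3) through a feed-forward neural network to obtain its projection.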
```python ```python
with paddle.layer.mixed(size=decoder_size) as encoded_proj: with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += paddle.layer.full_matrix_projection( encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector) input=encoded_vector)
``` ```
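3.2 Construct the initial state of the decoder RNN. The decoder predicts the target sequence step by step but has no initial value at time 0, so we initialize it. Here the last state of the backward encoding of the source language sequence goes through a non-linear mapping and serves as the initial value, i.e., $c_0=h_T$.
1. Construct the initial state of the decoder RNN. The decoder predicts the target sequence step by step but has no initial value at time 0, so we initialize it. Here the last state of the backward encoding of the source language sequence goes through a non-linear mapping and serves as the initial value, i.e., $c_0=h_T$.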
```python ```python
backward_first = paddle.layer.first_seq(input=src_backward) backward_first = paddle.layer.first_seq(input=src_backward)
...@@ -344,15 +299,14 @@ wmt14_reader = paddle.batch( ...@@ -344,15 +299,14 @@ wmt14_reader = paddle.batch(
decoder_boot += paddle.layer.full_matrix_projection( decoder_boot += paddle.layer.full_matrix_projection(
input=backward_first) input=backward_first)
``` ```
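3.3 Define the RNN behavior at each time step of decoding, i.e., predict the probability $p_{i+1}$ of the $(i+1)$-th word from the source language context vector $c_i$ at the current step, the decoder hidden state $z_i$, and the $i$-th word $u_i$ in the target language.
1. Define the RNN behavior at each time step of decoding, i.e., predict the probability $p_{i+1}$ of the $(i+1)$-th word from the source language context vector $c_i$ at the current step, the decoder hidden state $z_i$, and the $i$-th word $u_i$ in the target language.
- decoder_mem records the hidden state $z_i$ from the previous time step; its initial state is decoder_boot. - decoder_mem records the hidden state $z_i$ from the previous time step; its initial state is decoder_boot.
- context calls `simple_attention` to implement $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (see 3.1); the computation of the weights $a_{ij}$ is encapsulated in `simple_attention`. - context calls `simple_attention` to implement $c_i=\sum_{j=1}^{T}a_{ij}h_j$, where enc_vec is $h_j$ and enc_proj is the projection of $h_j$ (see 3.1); the computation of the weights $a_{ij}$ is encapsulated in `simple_attention`.
- decoder_inputs combines the representations of $c_i$ and the current target word current_word (i.e., $u_i$). - decoder_inputs combines the representations of $c_i$ and the current target word current_word (i.e., $u_i$).
- gru_step calls `gru_step_layer` to apply the activation to decoder_inputs and decoder_mem, implementing $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$. - gru_step calls `gru_step_layer` to apply the activation to decoder_inputs and decoder_mem, implementing $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$.
- Finally, softmax normalization computes the word probabilities and out is returned, implementing $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$. - Finally, softmax normalization computes the word probabilities and out is returned, implementing $p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$.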
```python ```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word): def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
...@@ -382,24 +336,24 @@ wmt14_reader = paddle.batch( ...@@ -382,24 +336,24 @@ wmt14_reader = paddle.batch(
out += paddle.layer.full_matrix_projection(input=gru_step) out += paddle.layer.full_matrix_projection(input=gru_step)
return out return out
``` ```
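4. Differences in how the decoder is invoked in training mode and generation mode.
4.1 Define the name of the decoder and the first two inputs of `gru_decoder_with_attention`. Note: these two inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details. 1. Define the name of the decoder and the first two inputs of `gru_decoder_with_attention`. Note: these two inputs use `StaticInput`; see the [StaticInput documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入) for details.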
```python ```python
decoder_group_name = "decoder_group" decoder_group_name = "decoder_group"
group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True) group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True) group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2] group_inputs = [group_input1, group_input2]
``` ```
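4.2 Decoder invocation in training mode:
- First, the word embedding of the target language sequence, trg_embedding, is passed directly to `gru_decoder_with_attention` as current_word in training mode. 1. Decoder invocation in training mode:
- Second, `recurrent_group` calls `gru_decoder_with_attention` recurrently.
- Next, the sequence of next words in the target language serves as the label layer lbl, i.e., the target words to predict.
- Finally, the multi-class cross-entropy loss `classification_cost` computes the cost.
```python - First, the word embedding of the target language sequence, trg_embedding, is passed directly to `gru_decoder_with_attention` as current_word in training mode.
- Second, `recurrent_group` calls `gru_decoder_with_attention` recurrently.
- Next, the sequence of next words in the target language serves as the label layer lbl, i.e., the target words to predict.
- Finally, the multi-class cross-entropy loss `classification_cost` computes the cost.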
```python
trg_embedding = paddle.layer.embedding( trg_embedding = paddle.layer.embedding(
input=paddle.layer.data( input=paddle.layer.data(
name='target_language_word', name='target_language_word',
...@@ -422,7 +376,8 @@ wmt14_reader = paddle.batch( ...@@ -422,7 +376,8 @@ wmt14_reader = paddle.batch(
name='target_language_next_word', name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim)) type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl) cost = paddle.layer.classification_cost(input=decoder, label=lbl)
``` ```
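Note: the configuration we provide simplifies the one in Bahdanau's paper \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133). Note: the configuration we provide simplifies the one in Bahdanau's paper \[[4](#参考文献)\]; see [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133).
### Parameter Definition ### Parameter Definition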
...@@ -430,7 +385,6 @@ wmt14_reader = paddle.batch( ...@@ -430,7 +385,6 @@ wmt14_reader = paddle.batch(
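First, define the model parameters according to the `cost` of the model configuration. First, define the model parameters according to the `cost` of the model configuration.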
```python ```python
# create parameters
parameters = paddle.parameters.create(cost) parameters = paddle.parameters.create(cost)
``` ```
...@@ -442,28 +396,36 @@ for param in parameters.keys(): ...@@ -442,28 +396,36 @@ for param in parameters.keys():
``` ```
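### Training the Model ### Training the Model
1. Construct the trainer 1. Construct the trainer
The trainer is constructed from the optimization target cost, the network topology, and the model parameters. An optimization method must also be specified at construction time; here we use the most basic SGD method. The trainer is constructed from the optimization target cost, the network topology, and the model parameters. An optimization method must also be specified at construction time; here we use the most basic SGD method.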
```python ```python
optimizer = paddle.optimizer.Adam(learning_rate=1e-4) optimizer = paddle.optimizer.Adam(
learning_rate=5e-5,
regularization=paddle.optimizer.L2Regularization(rate=1e-3))
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters, parameters=parameters,
update_equation=optimizer) update_equation=optimizer)
``` ```
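2. Construct the event_handler 1. Construct the event_handler
A custom callback function can be used to monitor various states during training, such as the error rate. The code below uses event.batch_id % 10 == 0 to print a log, including the cost and other information, every 10 batches. A custom callback function can be used to monitor various states during training, such as the error rate. The code below uses event.batch_id % 10 == 0 to print a log, including the cost and other information, every 10 batches.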
```python ```python
def event_handler(event): def event_handler(event):
if isinstance(event, paddle.event.EndIteration): if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0: if event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % ( print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics) event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
``` ```
3. Start training:
1. Start training:
```python ```python
trainer.train( trainer.train(
...@@ -472,30 +434,29 @@ for param in parameters.keys(): ...@@ -472,30 +434,29 @@ for param in parameters.keys():
num_passes=10000, num_passes=10000,
feeding=feeding) feeding=feeding)
``` ```
Once training starts, the event_handler prints logs like the following:
```text Once training starts, the event_handler prints logs like the following:
Pass 0, Batch 0, Cost 247.408008, {'classification_error_evaluator': 1.0}
Pass 0, Batch 10, Cost 212.058789, {'classification_error_evaluator': 0.8737863898277283} ```text
... Pass 0, Batch 0, Cost 148.444983, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 335.896802, {'classification_error_evaluator': 0.9325153231620789}
.........
``` ```
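Training is considered successful when the value of `classification_error_evaluator` drops below 0.35.
## Model Usage ## Model Usage
### Download Pre-trained Model ### Download Pre-trained Model
Training an NMT model is very time-consuming: on a cluster of 50 physical nodes (each with two 6-core CPUs), we spent 5 days training 16 passes, with each pass taking 7 hours. We therefore provide a pre-trained model (pass-00012) for direct download. The model is 205MB in size and has the highest [BLEU score](#BLEU评估), 26.92, among all 16 models. The commands to download and extract the model are as follows: Training an NMT model is very time-consuming: on a cluster of 50 physical nodes (each with two 6-core CPUs), we spent 5 days training 16 passes, with each pass taking 7 hours. We therefore provide a pre-trained model (pass-00012) for direct download. The model is 205MB in size and has the highest [BLEU score](#BLEU评估), 26.92, among all 16 models. The commands to download and extract the model are as follows: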
```bash ```bash
cd pretrained cd pretrained
./wmt14_model.sh ./wmt14_model.sh
``` ```
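### Usage and Results
The new API does not yet support the translation (generation) process for machine translation; stay tuned.
For translation results, see [Illustrative Results](#效果展示).
### BLEU Evaluation ### BLEU Evaluation
BLEU (Bilingual Evaluation Understudy) is a widely used automatic metric for machine translation, proposed by IBM's Watson Research Center in 2002 \[[5](#参考文献)\]. Its basic premise is that the closer a machine translation is to a professional human translation, the better the translation system. The closeness between the machine translation and the human reference is measured with sentence-level precision: the n-grams of the two are compared, and the more n-grams match, the better the BLEU score. BLEU (Bilingual Evaluation Understudy) is a widely used automatic metric for machine translation, proposed by IBM's Watson Research Center in 2002 \[[5](#参考文献)\]. Its basic premise is that the closer a machine translation is to a professional human translation, the better the translation system. The closeness between the machine translation and the human reference is measured with sentence-level precision: the n-grams of the two are compared, and the more n-grams match, the better the BLEU score.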
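The n-gram matching behind BLEU can be sketched in a few lines of plain Python. This is only the clipped (modified) n-gram precision for a single reference; a full BLEU score also combines several n-gram orders and applies a brevity penalty, which this sketch leaves out. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(
        min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / float(total) if total else 0.0


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
```

Here five of the six candidate unigrams and three of the five candidate bigrams also occur in the reference.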
......
...@@ -110,8 +110,7 @@ group_inputs = [group_input1, group_input2] ...@@ -110,8 +110,7 @@ group_inputs = [group_input1, group_input2]
if not is_generating: if not is_generating:
trg_embedding = embedding_layer( trg_embedding = embedding_layer(
input=data_layer( input=data_layer(name='target_language_word', size=target_dict_dim),
name='target_language_word', size=target_dict_dim),
size=word_vector_dim, size=word_vector_dim,
param_attr=ParamAttr(name='_target_language_embedding')) param_attr=ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding) group_inputs.append(trg_embedding)
...@@ -156,8 +155,7 @@ else: ...@@ -156,8 +155,7 @@ else:
seqtext_printer_evaluator( seqtext_printer_evaluator(
input=beam_gen, input=beam_gen,
id_input=data_layer( id_input=data_layer(name="sent_id", size=1),
name="sent_id", size=1),
dict_file=trg_lang_dict, dict_file=trg_lang_dict,
result_file=gen_trans_file) result_file=gen_trans_file)
outputs(beam_gen) outputs(beam_gen)
...@@ -32,15 +32,15 @@ In a simple softmax regression model, the input is fed to fully connected layers ...@@ -32,15 +32,15 @@ In a simple softmax regression model, the input is fed to fully connected layers
Input $X$ is multiplied with weights $W$, and bias $b$ is added to generate activations. Input $X$ is multiplied with weights $W$, and bias $b$ is added to generate activations.
$$ y_i = softmax(\sum_j W_{i,j}x_j + b_i) $$ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
where $ softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
For an $N$ class classification problem with $N$ output nodes, an $N$ dimensional vector is normalized to $N$ real values in the range [0, 1], each representing the probability of the sample to belong to the class. Here $y_i$ is the prediction probability that an image is digit $i$. For an $N$ class classification problem with $N$ output nodes, an $N$ dimensional vector is normalized to $N$ real values in the range [0, 1], each representing the probability of the sample to belong to the class. Here $y_i$ is the prediction probability that an image is digit $i$.
In such a classification problem, we usually use the cross entropy loss function: In such a classification problem, we usually use the cross entropy loss function:
$$ crossentropy(label, y) = -\sum_i label_ilog(y_i) $$ $$ \text{crossentropy}(label, y) = -\sum_i label_ilog(y_i) $$
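The softmax and cross-entropy formulas above can be checked numerically with a short sketch in plain Python. The logits and the one-hot label below are hypothetical values chosen only for illustration.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(label_one_hot, probs):
    """Compute -sum_i label_i * log(y_i), skipping zero-label terms."""
    return -sum(
        l * math.log(p) for l, p in zip(label_one_hot, probs) if l > 0)


# Hypothetical scores for a three-class problem whose true class is 0.
probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy([1, 0, 0], probs)
```

With a one-hot label, the loss reduces to the negative log-probability assigned to the true class.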
Fig. 2 shows a softmax regression network, with weights in blue, and bias in red. +1 indicates bias is 1. Fig. 2 shows a softmax regression network, with weights in blue, and bias in red. +1 indicates bias is 1.
...@@ -55,7 +55,7 @@ The Softmax regression model described above uses the simplest two-layer neural ...@@ -55,7 +55,7 @@ The Softmax regression model described above uses the simplest two-layer neural
1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU. 1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU.
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $. 2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, after output layer, we get $Y=softmax(W_3H_2 + b_3)$, the final classification result vector. 3. Finally, after output layer, we get $Y=\text{softmax}(W_3H_2 + b_3)$, the final classification result vector.
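The three steps above can be sketched as a forward pass in plain Python. This is a minimal illustration that assumes ReLU as the activation $\phi$; the weights, biases, and input are hypothetical toy values, and `affine`, `relu`, and `mlp_forward` are not PaddlePaddle functions.

```python
import math


def affine(weights, inputs, bias):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Map raw scores to probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, layers):
    """Two hidden layers followed by a softmax output layer."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = relu(affine(w1, x, b1))        # H_1 = phi(W_1 X + b_1)
    h2 = relu(affine(w2, h1, b2))       # H_2 = phi(W_2 H_1 + b_2)
    return softmax(affine(w3, h2, b3))  # Y = softmax(W_3 H_2 + b_3)


# Hypothetical toy parameters: 2 inputs, 2 + 2 hidden units, 3 classes.
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
]
y = mlp_forward([1.0, 2.0], toy_layers)
```

The output `y` is a probability vector over the three classes, and its largest entry marks the predicted class.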
Fig. 3. is Multilayer Perceptron network, with weights in blue, and bias in red. +1 indicates bias is 1. Fig. 3. is Multilayer Perceptron network, with weights in blue, and bias in red. +1 indicates bias is 1.
...@@ -70,7 +70,7 @@ Fig. 3. Multilayer Perceptron network architecture<br/> ...@@ -70,7 +70,7 @@ Fig. 3. Multilayer Perceptron network architecture<br/>
#### Convolutional Layer #### Convolutional Layer
<p align="center"> <p align="center">
<img src="image/conv_layer_en.png" width=500><br/> <img src="image/conv_layer.png" width='750'><br/>
Fig. 4. Convolutional layer<br/> Fig. 4. Convolutional layer<br/>
</p> </p>
...@@ -240,7 +240,7 @@ def event_handler(event): ...@@ -240,7 +240,7 @@ def event_handler(event):
print "Pass %d, Batch %d, Cost %f, %s" % ( print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics) event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=paddle.reader.batched( result = trainer.test(reader=paddle.batch(
paddle.dataset.mnist.test(), batch_size=128)) paddle.dataset.mnist.test(), batch_size=128))
print "Test with Pass %d, Cost %f, %s\n" % ( print "Test with Pass %d, Cost %f, %s\n" % (
event.pass_id, result.cost, result.metrics) event.pass_id, result.cost, result.metrics)
...@@ -248,7 +248,7 @@ def event_handler(event): ...@@ -248,7 +248,7 @@ def event_handler(event):
result.metrics['classification_error_evaluator'])) result.metrics['classification_error_evaluator']))
trainer.train( trainer.train(
reader=paddle.reader.batched( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=8192), paddle.dataset.mnist.train(), buf_size=8192),
batch_size=128), batch_size=128),
...@@ -293,7 +293,7 @@ This tutorial describes a few basic Deep Learning models viz. Softmax regression ...@@ -293,7 +293,7 @@ This tutorial describes a few basic Deep Learning models viz. Softmax regression
7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010. 7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010.
8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009. 8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009.
9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386. 9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386.
10. Bishop, Christopher M. ["Pattern recognition."](http://s3.amazonaws.com/academia.edu.documents/30428242/bg0137.pdf?AWSAccessKeyId=AKIAJ56TQJRTWSMTNPEA&Expires=1484816640&Signature=85Ad6%2Fca8T82pmHzxaSXermovIA%3D&response-content-disposition=inline%3B%20filename%3DPattern_recognition_and_machine_learning.pdf) Machine Learning 128 (2006): 1-58. 10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This book</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and uses <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Shared knowledge signature - non commercial use-Sharing 4.0 International Licensing Protocal</a>. This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
...@@ -32,15 +32,15 @@ Yann LeCun早先在手写字符识别上做了很多研究,并在研究过程 ...@@ -32,15 +32,15 @@ Yann LeCun早先在手写字符识别上做了很多研究,并在研究过程
输入层的数据$X$传到输出层,在激活操作之前,会乘以相应的权重 $W$ ,并加上偏置变量 $b$ ,具体如下: 输入层的数据$X$传到输出层,在激活操作之前,会乘以相应的权重 $W$ ,并加上偏置变量 $b$ ,具体如下:
$$ y_i = softmax(\sum_j W_{i,j}x_j + b_i) $$ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
其中 $ softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ 其中 $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
对于有 $N$ 个类别的多分类问题,指定 $N$ 个输出节点,$N$ 维输入特征经过softmax将归一化为 $N$ 个[0,1]范围内的实数值,分别表示该样本属于这 $N$ 个类别的概率。此处的 $y_i$ 即对应该图片为数字 $i$ 的预测概率。 对于有 $N$ 个类别的多分类问题,指定 $N$ 个输出节点,$N$ 维输入特征经过softmax将归一化为 $N$ 个[0,1]范围内的实数值,分别表示该样本属于这 $N$ 个类别的概率。此处的 $y_i$ 即对应该图片为数字 $i$ 的预测概率。
在分类问题中,我们一般采用交叉熵代价损失函数(cross entropy),公式如下: 在分类问题中,我们一般采用交叉熵代价损失函数(cross entropy),公式如下:
$$ crossentropy(label, y) = -\sum_i label_ilog(y_i) $$ $$ \text{crossentropy}(label, y) = -\sum_i label_ilog(y_i) $$
图2为softmax回归的网络图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。 图2为softmax回归的网络图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。
...@@ -55,7 +55,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -55,7 +55,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
1. 经过第一个隐藏层,可以得到 $ H_1 = \phi(W_1X + b_1) $,其中$\phi$代表激活函数,常见的有sigmoid、tanh或ReLU等函数。 1. 经过第一个隐藏层,可以得到 $ H_1 = \phi(W_1X + b_1) $,其中$\phi$代表激活函数,常见的有sigmoid、tanh或ReLU等函数。
2. 经过第二个隐藏层,可以得到 $ H_2 = \phi(W_2H_1 + b_2) $。 2. 经过第二个隐藏层,可以得到 $ H_2 = \phi(W_2H_1 + b_2) $。
3. 最后,再经过输出层,得到的$Y=softmax(W_3H_2 + b_3)$,即为最后的分类结果向量。 3. 最后,再经过输出层,得到的$Y=\text{softmax}(W_3H_2 + b_3)$,即为最后的分类结果向量。
图3为多层感知器的网络结构图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。 图3为多层感知器的网络结构图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。
...@@ -67,11 +67,11 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -67,11 +67,11 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
### 卷积神经网络(Convolutional Neural Network, CNN) ### 卷积神经网络(Convolutional Neural Network, CNN)
在多层感知器模型中,将图像展开成一维向量输入到网络中,忽略了图像的位置和结构信息,而卷积神经网络能够更好的利用图像的结构信息。[LeNet-5](http://yann.lecun.com/exdb/lenet/)是一个较简单的卷积神经网络。图6显示了其结构:输入的二维图像,先经过两次卷积层到池化层,再经过全连接层,最后使用softmax分类作为输出层。下面我们主要介绍卷积层和池化层。 在多层感知器模型中,将图像展开成一维向量输入到网络中,忽略了图像的位置和结构信息,而卷积神经网络能够更好的利用图像的结构信息。[LeNet-5](http://yann.lecun.com/exdb/lenet/)是一个较简单的卷积神经网络。图4显示了其结构:输入的二维图像,先经过两次卷积层到池化层,再经过全连接层,最后使用softmax分类作为输出层。下面我们主要介绍卷积层和池化层。
<p align="center"> <p align="center">
<img src="image/cnn.png"><br/> <img src="image/cnn.png"><br/>
6. LeNet-5卷积神经网络结构<br/> 4. LeNet-5卷积神经网络结构<br/>
</p> </p>
#### 卷积层 #### 卷积层
...@@ -79,17 +79,11 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -79,17 +79,11 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
卷积层是卷积神经网络的核心基石。在图像识别里我们提到的卷积是二维卷积,即离散二维滤波器(也称作卷积核)与二维图像做卷积操作,简单的讲是二维滤波器滑动到二维图像上所有位置,并在每个位置上与该像素点及其领域像素点做内积。卷积操作被广泛应用与图像处理领域,不同卷积核可以提取不同的特征,例如边沿、线性、角等特征。在深层卷积神经网络中,通过卷积操作可以提取出图像低级到复杂的特征。 卷积层是卷积神经网络的核心基石。在图像识别里我们提到的卷积是二维卷积,即离散二维滤波器(也称作卷积核)与二维图像做卷积操作,简单的讲是二维滤波器滑动到二维图像上所有位置,并在每个位置上与该像素点及其领域像素点做内积。卷积操作被广泛应用与图像处理领域,不同卷积核可以提取不同的特征,例如边沿、线性、角等特征。在深层卷积神经网络中,通过卷积操作可以提取出图像低级到复杂的特征。
<p align="center"> <p align="center">
<img src="image/conv_layer.png"><br/> <img src="image/conv_layer.png" width='750'><br/>
4. 卷积层图片<br/> 5. 卷积层图片<br/>
</p> </p>
图4给出一个卷积计算过程的示例图,输入图像大小为$H=5,W=5,D=3$,即$5 \times 5$大小的3通道(RGB,也称作深度)彩色图像。这个示例图中包含两(用$K$表示)组卷积核,即图中滤波器$W_0$和$W_1$。在卷积计算中,通常对不同的输入通道采用不同的卷积核,如图示例中每组卷积核包含($D=3)$个$3 \times 3$(用$F \times F$表示)大小的卷积核。另外,这个示例中卷积核在图像的水平方向($W$方向)和垂直方向($H$方向)的滑动步长为2(用$S$表示);对输入图像周围各填充1(用$P$表示)个0,即图中输入层原始数据为蓝色部分,灰色部分是进行了大小为1的扩展,用0来进行扩展。经过卷积操作得到输出为$3 \times 3 \times 2$(用$H_{o} \times W_{o} \times K$表示)大小的特征图,即$3 \times 3$大小的2通道特征图,其中$H_o$计算公式为:$H_o = (H - F + 2 \times P)/S + 1$,$W_o$同理。 而输出特征图中的每个像素,是每组滤波器与输入图像每个特征图的内积再求和,再加上偏置$b_o$,偏置通常对于每个输出特征图是共享的。例如图中输出特征图$o[:,:,0]$中的第一个$2$计算如下: 图5给出一个卷积计算过程的示例图,输入图像大小为$H=5,W=5,D=3$,即$5 \times 5$大小的3通道(RGB,也称作深度)彩色图像。这个示例图中包含两(用$K$表示)组卷积核,即图中滤波器$W_0$和$W_1$。在卷积计算中,通常对不同的输入通道采用不同的卷积核,如图示例中每组卷积核包含($D=3)$个$3 \times 3$(用$F \times F$表示)大小的卷积核。另外,这个示例中卷积核在图像的水平方向($W$方向)和垂直方向($H$方向)的滑动步长为2(用$S$表示);对输入图像周围各填充1(用$P$表示)个0,即图中输入层原始数据为蓝色部分,灰色部分是进行了大小为1的扩展,用0来进行扩展。经过卷积操作得到输出为$3 \times 3 \times 2$(用$H_{o} \times W_{o} \times K$表示)大小的特征图,即$3 \times 3$大小的2通道特征图,其中$H_o$计算公式为:$H_o = (H - F + 2 \times P)/S + 1$,$W_o$同理。 而输出特征图中的每个像素,是每组滤波器与输入图像每个特征图的内积再求和,再加上偏置$b_o$,偏置通常对于每个输出特征图是共享的。输出特征图$o[:,:,0]$中的最后一个$-2$计算如图5右下角公式所示。
$$ o[0,0,0] = \sum x[0:3,0:3,0] * w_{0}[:,:,0]] + \sum x[0:3,0:3,1] * w_{0}[:,:,1]] + \sum x[0:3,0:3,2] * w_{0}[:,:,2]] + b_0 = 2 $$
$$ \sum x[0:3,0:3,0] * w_{0}[:,:,0]] = 0*1 + 0*1 + 0*1 + 0*1 + 1*1 + 2*(-1) + 0*(-1) + 0*1 + 0*(-1) = -1 $$
$$ \sum x[0:3,0:3,1] * w_{0}[:,:,1]] = 0*0 + 0*1 + 0*1 + 0*(-1) + 0*0 + 1*1 + 0*1 + 2*0 + 1*1 = 2 $$
$$ \sum x[0:3,0:3,2] * w_{0}[:,:,2]] = 0*(-1) + 0*1 + 0*(-1) + 0*0 + 1*1 + 1*0 + 0*(-1) + 1*0 + 1*(-1) = 0 $$
$$ b_0 = 1 $$
在卷积操作中卷积核是可学习的参数,经过上面示例介绍,每层卷积的参数大小为$D \times F \times F \times K$。在多层感知器模型中,神经元通常是全部连接,参数较多。而卷积层的参数较少,这也是由卷积层的主要特性即局部连接和共享权重所决定。 在卷积操作中卷积核是可学习的参数,经过上面示例介绍,每层卷积的参数大小为$D \times F \times F \times K$。在多层感知器模型中,神经元通常是全部连接,参数较多。而卷积层的参数较少,这也是由卷积层的主要特性即局部连接和共享权重所决定。
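The output-size formula $H_o = (H - F + 2 \times P)/S + 1$ and the parameter count $D \times F \times F \times K$ can be checked against the values from the figure ($H=W=5$, $D=3$, $F=3$, $K=2$, $S=2$, $P=1$). The sketch below is only an illustration; the helper names `conv_output_size` and `conv_weight_count` are hypothetical and are not part of PaddlePaddle.

```python
def conv_output_size(size, kernel, padding, stride):
    """Apply H_o = (H - F + 2P) / S + 1 along one spatial dimension."""
    span = size - kernel + 2 * padding
    if span % stride != 0:
        # Assumption for this sketch: reject settings where the kernel
        # does not step evenly across the padded input.
        raise ValueError("kernel does not tile the padded input evenly")
    return span // stride + 1


def conv_weight_count(depth, kernel, num_filters):
    """Count the kernel weights D * F * F * K (biases are not included)."""
    return depth * kernel * kernel * num_filters


# Values taken from the convolution example in the text.
h_out = conv_output_size(size=5, kernel=3, padding=1, stride=2)
w_out = conv_output_size(size=5, kernel=3, padding=1, stride=2)
n_weights = conv_weight_count(depth=3, kernel=3, num_filters=2)
```

With these inputs `h_out` and `w_out` are both 3, matching the $3 \times 3 \times 2$ feature map described above, and `n_weights` is 54.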
...@@ -103,10 +97,10 @@ $$ b_0 = 1 $$ ...@@ -103,10 +97,10 @@ $$ b_0 = 1 $$
<p align="center"> <p align="center">
<img src="image/max_pooling.png" width="400px"><br/> <img src="image/max_pooling.png" width="400px"><br/>
5. 池化层图片<br/> 6. 池化层图片<br/>
</p> </p>
池化是非线性下采样的一种形式,主要作用是通过减少网络的参数来减小计算量,并且能够在一定程度上控制过拟合。通常在卷积层的后面会加上一个池化层。池化包括最大池化、平均池化等。其中最大池化是用不重叠的矩形框将输入层分成不同的区域,对于每个矩形框的数取最大值作为输出层,如图5所示。 池化是非线性下采样的一种形式,主要作用是通过减少网络的参数来减小计算量,并且能够在一定程度上控制过拟合。通常在卷积层的后面会加上一个池化层。池化包括最大池化、平均池化等。其中最大池化是用不重叠的矩形框将输入层分成不同的区域,对于每个矩形框的数取最大值作为输出层,如图6所示。
更详细的关于卷积神经网络的具体知识可以参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/ )[图像分类](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md)教程。 更详细的关于卷积神经网络的具体知识可以参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/ )[图像分类](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md)教程。
...@@ -251,7 +245,7 @@ def event_handler(event): ...@@ -251,7 +245,7 @@ def event_handler(event):
print "Pass %d, Batch %d, Cost %f, %s" % ( print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics) event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=paddle.reader.batched( result = trainer.test(reader=paddle.batch(
paddle.dataset.mnist.test(), batch_size=128)) paddle.dataset.mnist.test(), batch_size=128))
print "Test with Pass %d, Cost %f, %s\n" % ( print "Test with Pass %d, Cost %f, %s\n" % (
event.pass_id, result.cost, result.metrics) event.pass_id, result.cost, result.metrics)
...@@ -259,7 +253,7 @@ def event_handler(event): ...@@ -259,7 +253,7 @@ def event_handler(event):
result.metrics['classification_error_evaluator'])) result.metrics['classification_error_evaluator']))
trainer.train( trainer.train(
reader=paddle.reader.batched( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=8192), paddle.dataset.mnist.train(), buf_size=8192),
batch_size=128), batch_size=128),
...@@ -295,7 +289,7 @@ trainer.train( ...@@ -295,7 +289,7 @@ trainer.train(
7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010. 7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010.
8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009. 8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009.
9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386. 9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386.
10. Bishop, Christopher M. ["Pattern recognition."](http://s3.amazonaws.com/academia.edu.documents/30428242/bg0137.pdf?AWSAccessKeyId=AKIAJ56TQJRTWSMTNPEA&Expires=1484816640&Signature=85Ad6%2Fca8T82pmHzxaSXermovIA%3D&response-content-disposition=inline%3B%20filename%3DPattern_recognition_and_machine_learning.pdf) Machine Learning 128 (2006): 1-58. 10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
recognize_digits/image/conv_layer.png (binary image replaced; 248.3 KB before, 571.4 KB after)
...@@ -74,15 +74,15 @@ In a simple softmax regression model, the input is fed to fully connected layers ...@@ -74,15 +74,15 @@ In a simple softmax regression model, the input is fed to fully connected layers
Input $X$ is multiplied with weights $W$, and bias $b$ is added to generate activations. Input $X$ is multiplied with weights $W$, and bias $b$ is added to generate activations.
$$ y_i = softmax(\sum_j W_{i,j}x_j + b_i) $$ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
where $ softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
For an $N$ class classification problem with $N$ output nodes, an $N$ dimensional vector is normalized to $N$ real values in the range [0, 1], each representing the probability of the sample to belong to the class. Here $y_i$ is the prediction probability that an image is digit $i$. For an $N$ class classification problem with $N$ output nodes, an $N$ dimensional vector is normalized to $N$ real values in the range [0, 1], each representing the probability of the sample to belong to the class. Here $y_i$ is the prediction probability that an image is digit $i$.
In such a classification problem, we usually use the cross entropy loss function: In such a classification problem, we usually use the cross entropy loss function:
$$ crossentropy(label, y) = -\sum_i label_ilog(y_i) $$ $$ \text{crossentropy}(label, y) = -\sum_i label_ilog(y_i) $$
Fig. 2 shows a softmax regression network, with weights in blue and biases in red. The +1 node indicates that the bias term has a coefficient of 1. Fig. 2 shows a softmax regression network, with weights in blue and biases in red. The +1 node indicates that the bias term has a coefficient of 1.
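The softmax and cross-entropy formulas in this hunk can be sketched in a few lines of plain Python. This is a minimal illustration, not the PaddlePaddle implementation; the function names and the example scores are assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Normalize a list of real scores into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids
    # overflow in exp for large scores.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(label, y):
    """Compute -sum_i label_i * log(y_i) for a label vector and probabilities y."""
    # Terms with label_i == 0 contribute nothing, so they are skipped.
    return -sum(l * math.log(p) for l, p in zip(label, y) if l > 0)


# Hypothetical scores for a 3-class problem (illustration only).
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
one_hot = [1.0, 0.0, 0.0]
loss = cross_entropy(one_hot, probs)
```

With a one-hot label the loss reduces to the negative log of the probability assigned to the correct class, so it shrinks as that probability approaches 1.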
...@@ -97,7 +97,7 @@ The Softmax regression model described above uses the simplest two-layer neural ...@@ -97,7 +97,7 @@ The Softmax regression model described above uses the simplest two-layer neural
1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU. 1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU.
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $. 2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, after output layer, we get $Y=softmax(W_3H_2 + b_3)$, the final classification result vector. 3. Finally, after output layer, we get $Y=\text{softmax}(W_3H_2 + b_3)$, the final classification result vector.
Fig. 3 shows the multilayer perceptron network, with weights in blue and biases in red. The +1 node indicates that the bias term has a coefficient of 1. Fig. 3 shows the multilayer perceptron network, with weights in blue and biases in red. The +1 node indicates that the bias term has a coefficient of 1.
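The three steps above ($H_1$, $H_2$, then the softmax output $Y$) can be traced with a small forward pass in plain Python. This is a sketch under stated assumptions: ReLU is picked as $\phi$, and the layer sizes and weights in `toy_params` are made-up values for illustration, not the tutorial's trained parameters.

```python
import math


def dense(weights, bias, x):
    """Return W x + b, with the weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    """Element-wise ReLU, used here as the activation phi."""
    return [max(0.0, a) for a in v]


def softmax(v):
    """Normalize scores into probabilities that sum to 1."""
    shift = max(v)
    exps = [math.exp(a - shift) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, params):
    """Compute Y = softmax(W3 * phi(W2 * phi(W1 x + b1) + b2) + b3)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(w1, b1, x))
    h2 = relu(dense(w2, b2, h1))
    return softmax(dense(w3, b3, h2))


# Hypothetical 2-2-2-2 network (illustration only).
toy_params = [
    ([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
y = mlp_forward([1.0, 2.0], toy_params)
```

The result `y` is a probability vector with one entry per class; in the digit task the output layer would have 10 entries instead of 2.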
...@@ -112,7 +112,7 @@ Fig. 3. Multilayer Perceptron network architecture<br/> ...@@ -112,7 +112,7 @@ Fig. 3. Multilayer Perceptron network architecture<br/>
#### Convolutional Layer #### Convolutional Layer
<p align="center"> <p align="center">
<img src="image/conv_layer_en.png" width=500><br/> <img src="image/conv_layer.png" width='750'><br/>
Fig. 4. Convolutional layer<br/> Fig. 4. Convolutional layer<br/>
</p> </p>
...@@ -282,7 +282,7 @@ def event_handler(event): ...@@ -282,7 +282,7 @@ def event_handler(event):
print "Pass %d, Batch %d, Cost %f, %s" % ( print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics) event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=paddle.reader.batched( result = trainer.test(reader=paddle.batch(
paddle.dataset.mnist.test(), batch_size=128)) paddle.dataset.mnist.test(), batch_size=128))
print "Test with Pass %d, Cost %f, %s\n" % ( print "Test with Pass %d, Cost %f, %s\n" % (
event.pass_id, result.cost, result.metrics) event.pass_id, result.cost, result.metrics)
...@@ -290,7 +290,7 @@ def event_handler(event): ...@@ -290,7 +290,7 @@ def event_handler(event):
result.metrics['classification_error_evaluator'])) result.metrics['classification_error_evaluator']))
trainer.train( trainer.train(
reader=paddle.reader.batched( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=8192), paddle.dataset.mnist.train(), buf_size=8192),
batch_size=128), batch_size=128),
...@@ -335,10 +335,10 @@ This tutorial describes a few basic Deep Learning models viz. Softmax regression ...@@ -335,10 +335,10 @@ This tutorial describes a few basic Deep Learning models viz. Softmax regression
7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010. 7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010.
8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009. 8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009.
9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386. 9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386.
10. Bishop, Christopher M. ["Pattern recognition."](http://s3.amazonaws.com/academia.edu.documents/30428242/bg0137.pdf?AWSAccessKeyId=AKIAJ56TQJRTWSMTNPEA&Expires=1484816640&Signature=85Ad6%2Fca8T82pmHzxaSXermovIA%3D&response-content-disposition=inline%3B%20filename%3DPattern_recognition_and_machine_learning.pdf) Machine Learning 128 (2006): 1-58. 10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This book</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and uses <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Shared knowledge signature - non commercial use-Sharing 4.0 International Licensing Protocal</a>. This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -74,15 +74,15 @@ Yann LeCun早先在手写字符识别上做了很多研究,并在研究过程 ...@@ -74,15 +74,15 @@ Yann LeCun早先在手写字符识别上做了很多研究,并在研究过程
输入层的数据$X$传到输出层,在激活操作之前,会乘以相应的权重 $W$ ,并加上偏置变量 $b$ ,具体如下: 输入层的数据$X$传到输出层,在激活操作之前,会乘以相应的权重 $W$ ,并加上偏置变量 $b$ ,具体如下:
$$ y_i = softmax(\sum_j W_{i,j}x_j + b_i) $$ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
其中 $ softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ 其中 $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
对于有 $N$ 个类别的多分类问题,指定 $N$ 个输出节点,$N$ 维输入特征经过softmax将归一化为 $N$ 个[0,1]范围内的实数值,分别表示该样本属于这 $N$ 个类别的概率。此处的 $y_i$ 即对应该图片为数字 $i$ 的预测概率。 对于有 $N$ 个类别的多分类问题,指定 $N$ 个输出节点,$N$ 维输入特征经过softmax将归一化为 $N$ 个[0,1]范围内的实数值,分别表示该样本属于这 $N$ 个类别的概率。此处的 $y_i$ 即对应该图片为数字 $i$ 的预测概率。
在分类问题中,我们一般采用交叉熵代价损失函数(cross entropy),公式如下: 在分类问题中,我们一般采用交叉熵代价损失函数(cross entropy),公式如下:
$$ crossentropy(label, y) = -\sum_i label_ilog(y_i) $$ $$ \text{crossentropy}(label, y) = -\sum_i label_ilog(y_i) $$
图2为softmax回归的网络图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。 图2为softmax回归的网络图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。
...@@ -97,7 +97,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -97,7 +97,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
1. 经过第一个隐藏层,可以得到 $ H_1 = \phi(W_1X + b_1) $,其中$\phi$代表激活函数,常见的有sigmoid、tanh或ReLU等函数。 1. 经过第一个隐藏层,可以得到 $ H_1 = \phi(W_1X + b_1) $,其中$\phi$代表激活函数,常见的有sigmoid、tanh或ReLU等函数。
2. 经过第二个隐藏层,可以得到 $ H_2 = \phi(W_2H_1 + b_2) $。 2. 经过第二个隐藏层,可以得到 $ H_2 = \phi(W_2H_1 + b_2) $。
3. 最后,再经过输出层,得到的$Y=softmax(W_3H_2 + b_3)$,即为最后的分类结果向量。 3. 最后,再经过输出层,得到的$Y=\text{softmax}(W_3H_2 + b_3)$,即为最后的分类结果向量。
图3为多层感知器的网络结构图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。 图3为多层感知器的网络结构图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。
...@@ -109,11 +109,11 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -109,11 +109,11 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
### 卷积神经网络(Convolutional Neural Network, CNN) ### 卷积神经网络(Convolutional Neural Network, CNN)
在多层感知器模型中,将图像展开成一维向量输入到网络中,忽略了图像的位置和结构信息,而卷积神经网络能够更好的利用图像的结构信息。[LeNet-5](http://yann.lecun.com/exdb/lenet/)是一个较简单的卷积神经网络。图6显示了其结构:输入的二维图像,先经过两次卷积层到池化层,再经过全连接层,最后使用softmax分类作为输出层。下面我们主要介绍卷积层和池化层。 在多层感知器模型中,将图像展开成一维向量输入到网络中,忽略了图像的位置和结构信息,而卷积神经网络能够更好的利用图像的结构信息。[LeNet-5](http://yann.lecun.com/exdb/lenet/)是一个较简单的卷积神经网络。图4显示了其结构:输入的二维图像,先经过两次卷积层到池化层,再经过全连接层,最后使用softmax分类作为输出层。下面我们主要介绍卷积层和池化层。
<p align="center"> <p align="center">
<img src="image/cnn.png"><br/> <img src="image/cnn.png"><br/>
6. LeNet-5卷积神经网络结构<br/> 4. LeNet-5卷积神经网络结构<br/>
</p> </p>
#### 卷积层 #### 卷积层
...@@ -121,17 +121,11 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -121,17 +121,11 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
卷积层是卷积神经网络的核心基石。在图像识别里我们提到的卷积是二维卷积,即离散二维滤波器(也称作卷积核)与二维图像做卷积操作,简单的讲是二维滤波器滑动到二维图像上所有位置,并在每个位置上与该像素点及其领域像素点做内积。卷积操作被广泛应用与图像处理领域,不同卷积核可以提取不同的特征,例如边沿、线性、角等特征。在深层卷积神经网络中,通过卷积操作可以提取出图像低级到复杂的特征。 卷积层是卷积神经网络的核心基石。在图像识别里我们提到的卷积是二维卷积,即离散二维滤波器(也称作卷积核)与二维图像做卷积操作,简单的讲是二维滤波器滑动到二维图像上所有位置,并在每个位置上与该像素点及其领域像素点做内积。卷积操作被广泛应用与图像处理领域,不同卷积核可以提取不同的特征,例如边沿、线性、角等特征。在深层卷积神经网络中,通过卷积操作可以提取出图像低级到复杂的特征。
<p align="center"> <p align="center">
<img src="image/conv_layer.png"><br/> <img src="image/conv_layer.png" width='750'><br/>
4. 卷积层图片<br/> 5. 卷积层图片<br/>
</p> </p>
图4给出一个卷积计算过程的示例图,输入图像大小为$H=5,W=5,D=3$,即$5 \times 5$大小的3通道(RGB,也称作深度)彩色图像。这个示例图中包含两(用$K$表示)组卷积核,即图中滤波器$W_0$和$W_1$。在卷积计算中,通常对不同的输入通道采用不同的卷积核,如图示例中每组卷积核包含($D=3)$个$3 \times 3$(用$F \times F$表示)大小的卷积核。另外,这个示例中卷积核在图像的水平方向($W$方向)和垂直方向($H$方向)的滑动步长为2(用$S$表示);对输入图像周围各填充1(用$P$表示)个0,即图中输入层原始数据为蓝色部分,灰色部分是进行了大小为1的扩展,用0来进行扩展。经过卷积操作得到输出为$3 \times 3 \times 2$(用$H_{o} \times W_{o} \times K$表示)大小的特征图,即$3 \times 3$大小的2通道特征图,其中$H_o$计算公式为:$H_o = (H - F + 2 \times P)/S + 1$,$W_o$同理。 而输出特征图中的每个像素,是每组滤波器与输入图像每个特征图的内积再求和,再加上偏置$b_o$,偏置通常对于每个输出特征图是共享的。例如图中输出特征图$o[:,:,0]$中的第一个$2$计算如下: 图5给出一个卷积计算过程的示例图,输入图像大小为$H=5,W=5,D=3$,即$5 \times 5$大小的3通道(RGB,也称作深度)彩色图像。这个示例图中包含两(用$K$表示)组卷积核,即图中滤波器$W_0$和$W_1$。在卷积计算中,通常对不同的输入通道采用不同的卷积核,如图示例中每组卷积核包含($D=3)$个$3 \times 3$(用$F \times F$表示)大小的卷积核。另外,这个示例中卷积核在图像的水平方向($W$方向)和垂直方向($H$方向)的滑动步长为2(用$S$表示);对输入图像周围各填充1(用$P$表示)个0,即图中输入层原始数据为蓝色部分,灰色部分是进行了大小为1的扩展,用0来进行扩展。经过卷积操作得到输出为$3 \times 3 \times 2$(用$H_{o} \times W_{o} \times K$表示)大小的特征图,即$3 \times 3$大小的2通道特征图,其中$H_o$计算公式为:$H_o = (H - F + 2 \times P)/S + 1$,$W_o$同理。 而输出特征图中的每个像素,是每组滤波器与输入图像每个特征图的内积再求和,再加上偏置$b_o$,偏置通常对于每个输出特征图是共享的。输出特征图$o[:,:,0]$中的最后一个$-2$计算如图5右下角公式所示。
$$ o[0,0,0] = \sum x[0:3,0:3,0] * w_{0}[:,:,0]] + \sum x[0:3,0:3,1] * w_{0}[:,:,1]] + \sum x[0:3,0:3,2] * w_{0}[:,:,2]] + b_0 = 2 $$
$$ \sum x[0:3,0:3,0] * w_{0}[:,:,0]] = 0*1 + 0*1 + 0*1 + 0*1 + 1*1 + 2*(-1) + 0*(-1) + 0*1 + 0*(-1) = -1 $$
$$ \sum x[0:3,0:3,1] * w_{0}[:,:,1]] = 0*0 + 0*1 + 0*1 + 0*(-1) + 0*0 + 1*1 + 0*1 + 2*0 + 1*1 = 2 $$
$$ \sum x[0:3,0:3,2] * w_{0}[:,:,2]] = 0*(-1) + 0*1 + 0*(-1) + 0*0 + 1*1 + 1*0 + 0*(-1) + 1*0 + 1*(-1) = 0 $$
$$ b_0 = 1 $$
在卷积操作中卷积核是可学习的参数,经过上面示例介绍,每层卷积的参数大小为$D \times F \times F \times K$。在多层感知器模型中,神经元通常是全部连接,参数较多。而卷积层的参数较少,这也是由卷积层的主要特性即局部连接和共享权重所决定。 在卷积操作中卷积核是可学习的参数,经过上面示例介绍,每层卷积的参数大小为$D \times F \times F \times K$。在多层感知器模型中,神经元通常是全部连接,参数较多。而卷积层的参数较少,这也是由卷积层的主要特性即局部连接和共享权重所决定。
...@@ -145,10 +139,10 @@ $$ b_0 = 1 $$ ...@@ -145,10 +139,10 @@ $$ b_0 = 1 $$
<p align="center"> <p align="center">
<img src="image/max_pooling.png" width="400px"><br/> <img src="image/max_pooling.png" width="400px"><br/>
5. 池化层图片<br/> 6. 池化层图片<br/>
</p> </p>
池化是非线性下采样的一种形式,主要作用是通过减少网络的参数来减小计算量,并且能够在一定程度上控制过拟合。通常在卷积层的后面会加上一个池化层。池化包括最大池化、平均池化等。其中最大池化是用不重叠的矩形框将输入层分成不同的区域,对于每个矩形框的数取最大值作为输出层,如图5所示。 池化是非线性下采样的一种形式,主要作用是通过减少网络的参数来减小计算量,并且能够在一定程度上控制过拟合。通常在卷积层的后面会加上一个池化层。池化包括最大池化、平均池化等。其中最大池化是用不重叠的矩形框将输入层分成不同的区域,对于每个矩形框的数取最大值作为输出层,如图6所示。
更详细的关于卷积神经网络的具体知识可以参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/ )和[图像分类](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md)教程。 更详细的关于卷积神经网络的具体知识可以参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/ )和[图像分类](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md)教程。
...@@ -293,7 +287,7 @@ def event_handler(event): ...@@ -293,7 +287,7 @@ def event_handler(event):
print "Pass %d, Batch %d, Cost %f, %s" % ( print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics) event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=paddle.reader.batched( result = trainer.test(reader=paddle.batch(
paddle.dataset.mnist.test(), batch_size=128)) paddle.dataset.mnist.test(), batch_size=128))
print "Test with Pass %d, Cost %f, %s\n" % ( print "Test with Pass %d, Cost %f, %s\n" % (
event.pass_id, result.cost, result.metrics) event.pass_id, result.cost, result.metrics)
...@@ -301,7 +295,7 @@ def event_handler(event): ...@@ -301,7 +295,7 @@ def event_handler(event):
result.metrics['classification_error_evaluator'])) result.metrics['classification_error_evaluator']))
trainer.train( trainer.train(
reader=paddle.reader.batched( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=8192), paddle.dataset.mnist.train(), buf_size=8192),
batch_size=128), batch_size=128),
...@@ -337,7 +331,7 @@ trainer.train( ...@@ -337,7 +331,7 @@ trainer.train(
7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010. 7. Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. ["Binary coding of speech spectrograms using a deep auto-encoder."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf) In Interspeech, pp. 1692-1695. 2010.
8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009. 8. Kégl, Balázs, and Róbert Busa-Fekete. ["Boosting products of base classifiers."](http://dl.acm.org/citation.cfm?id=1553439) In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009.
9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386. 9. Rosenblatt, Frank. ["The perceptron: A probabilistic model for information storage and organization in the brain."](http://psycnet.apa.org/journals/rev/65/6/386/) Psychological review 65, no. 6 (1958): 386.
10. Bishop, Christopher M. ["Pattern recognition."](http://s3.amazonaws.com/academia.edu.documents/30428242/bg0137.pdf?AWSAccessKeyId=AKIAJ56TQJRTWSMTNPEA&Expires=1484816640&Signature=85Ad6%2Fca8T82pmHzxaSXermovIA%3D&response-content-disposition=inline%3B%20filename%3DPattern_recognition_and_machine_learning.pdf) Machine Learning 128 (2006): 1-58. 10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
......
...@@ -2,9 +2,8 @@ import paddle.v2 as paddle ...@@ -2,9 +2,8 @@ import paddle.v2 as paddle
def softmax_regression(img): def softmax_regression(img):
predict = paddle.layer.fc(input=img, predict = paddle.layer.fc(
size=10, input=img, size=10, act=paddle.activation.Softmax())
act=paddle.activation.Softmax())
return predict return predict
...@@ -12,14 +11,12 @@ def multilayer_perceptron(img): ...@@ -12,14 +11,12 @@ def multilayer_perceptron(img):
# The first fully-connected layer # The first fully-connected layer
hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu()) hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu())
# The second fully-connected layer and the corresponding activation function # The second fully-connected layer and the corresponding activation function
hidden2 = paddle.layer.fc(input=hidden1, hidden2 = paddle.layer.fc(
size=64, input=hidden1, size=64, act=paddle.activation.Relu())
act=paddle.activation.Relu())
# The third fully-connected layer, note that the hidden size should be 10, # The third fully-connected layer, note that the hidden size should be 10,
# which is the number of unique digits # which is the number of unique digits
predict = paddle.layer.fc(input=hidden2, predict = paddle.layer.fc(
size=10, input=hidden2, size=10, act=paddle.activation.Softmax())
act=paddle.activation.Softmax())
return predict return predict
...@@ -43,14 +40,12 @@ def convolutional_neural_network(img): ...@@ -43,14 +40,12 @@ def convolutional_neural_network(img):
pool_stride=2, pool_stride=2,
act=paddle.activation.Tanh()) act=paddle.activation.Tanh())
# The first fully-connected layer # The first fully-connected layer
fc1 = paddle.layer.fc(input=conv_pool_2, fc1 = paddle.layer.fc(
size=128, input=conv_pool_2, size=128, act=paddle.activation.Tanh())
act=paddle.activation.Tanh())
# The softmax layer, note that the hidden size should be 10, # The softmax layer, note that the hidden size should be 10,
# which is the number of unique digits # which is the number of unique digits
predict = paddle.layer.fc(input=fc1, predict = paddle.layer.fc(
size=10, input=fc1, size=10, act=paddle.activation.Softmax())
act=paddle.activation.Softmax())
return predict return predict
...@@ -76,9 +71,8 @@ optimizer = paddle.optimizer.Momentum( ...@@ -76,9 +71,8 @@ optimizer = paddle.optimizer.Momentum(
momentum=0.9, momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128)) regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(
parameters=parameters, cost=cost, parameters=parameters, update_equation=optimizer)
update_equation=optimizer)
lists = [] lists = []
...@@ -89,7 +83,7 @@ def event_handler(event): ...@@ -89,7 +83,7 @@ def event_handler(event):
print "Pass %d, Batch %d, Cost %f, %s" % ( print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics) event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=paddle.reader.batched( result = trainer.test(reader=paddle.batch(
paddle.dataset.mnist.test(), batch_size=128)) paddle.dataset.mnist.test(), batch_size=128))
print "Test with Pass %d, Cost %f, %s\n" % (event.pass_id, result.cost, print "Test with Pass %d, Cost %f, %s\n" % (event.pass_id, result.cost,
result.metrics) result.metrics)
...@@ -98,9 +92,8 @@ def event_handler(event): ...@@ -98,9 +92,8 @@ def event_handler(event):
trainer.train( trainer.train(
reader=paddle.reader.batched( reader=paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(paddle.dataset.mnist.train(), buf_size=8192),
paddle.dataset.mnist.train(), buf_size=8192),
batch_size=128), batch_size=128),
event_handler=event_handler, event_handler=event_handler,
num_passes=100) num_passes=100)
......
...@@ -76,22 +76,287 @@ Figure 3. A hybrid recommendation model. ...@@ -76,22 +76,287 @@ Figure 3. A hybrid recommendation model.
## Dataset
We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) dataset to train our model. It contains about 1,000,000 ratings of roughly 4,000 movies from roughly 6,000 users. Each rating is an integer from 1 to 5. Thanks to GroupLens Research for collecting, processing, and publishing the dataset.
The `paddle.v2.dataset` package wraps several public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, and `wmt14`, so there is no need to download and preprocess the `MovieLens` dataset manually.
```python
# Run this block to show dataset's documentation
help(paddle.v2.dataset.movielens)
```
The raw `MovieLens` data contains movie ratings along with features of both movies and users.
For instance, the features of one movie look like this:
```python
movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0]
```
```text
<MovieInfo id(1), title(Toy Story), categories(['Animation', "Children's", 'Comedy'])>
```
The features of one user look like this:
```python
user_info = paddle.dataset.movielens.user_info()
print user_info.values()[0]
```
```text
<UserInfo id(1), gender(F), age(1), job(10)>
```
In this dataset, a user's age is recorded as one of the following codes, each standing for an age range:
```text
1: "Under 18"
18: "18-24"
25: "25-34"
35: "35-44"
45: "45-49"
50: "50-55"
56: "56+"
```
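The model configuration later in this tutorial sizes the age input with `len(paddle.dataset.movielens.age_table)`, which suggests that each raw age code is turned into a small contiguous index before it reaches the embedding layer. The following is a minimal sketch of such a mapping. The names `AGE_CODES`, `AGE_TO_INDEX`, and `age_index` are hypothetical, and the exact encoding used inside PaddlePaddle's dataset module is an assumption here.

```python
# Age codes as listed in the table above.
AGE_CODES = [1, 18, 25, 35, 45, 50, 56]

# Assumed encoding: map each raw age code to a contiguous index 0..6,
# so that it can be used as a row index into an embedding table.
AGE_TO_INDEX = {code: index for index, code in enumerate(AGE_CODES)}


def age_index(age_code):
    """Return the contiguous index for a raw MovieLens age code."""
    return AGE_TO_INDEX[age_code]


print(age_index(1))   # first age group
print(age_index(25))  # third age group
```

With seven age groups, an embedding table for this feature needs only seven rows.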
The user's occupation is one of the following options:
```text
0: "other" or not specified
1: "academic/educator"
2: "artist"
3: "clerical/admin"
4: "college/grad student"
5: "customer service"
6: "doctor/health care"
7: "executive/managerial"
8: "farmer"
9: "homemaker"
10: "K-12 student"
11: "lawyer"
12: "programmer"
13: "retired"
14: "sales/marketing"
15: "scientist"
16: "self-employed"
17: "technician/engineer"
18: "tradesman/craftsman"
19: "unemployed"
20: "writer"
```
Each record consists of three main components: user features, movie features, and the movie rating.
As a simple example, consider the first training record:
```python
train_set_creator = paddle.dataset.movielens.train()
train_sample = next(train_set_creator())
uid = train_sample[0]
mov_id = train_sample[len(user_info[uid].value())]
print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1])
```
```text
User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest), categories(['Drama'])> with Score [5.0]
```
The output shows that user 1 gave movie `1193` a rating of 5.
Running `python train.py` starts training immediately. The following sections walk through how it works.
## Model Architecture
### Initialize PaddlePaddle
First, we must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).
```python
%matplotlib inline
import matplotlib.pyplot as plt
from IPython import display
import cPickle
import paddle.v2 as paddle
paddle.init(use_gpu=False)
```
### Model Configuration
```python
uid = paddle.layer.data(
name='user_id',
type=paddle.data_type.integer_value(
paddle.dataset.movielens.max_user_id() + 1))
usr_emb = paddle.layer.embedding(input=uid, size=32)
usr_gender_id = paddle.layer.data(
name='gender_id', type=paddle.data_type.integer_value(2))
usr_gender_emb = paddle.layer.embedding(input=usr_gender_id, size=16)
usr_age_id = paddle.layer.data(
name='age_id',
type=paddle.data_type.integer_value(
len(paddle.dataset.movielens.age_table)))
usr_age_emb = paddle.layer.embedding(input=usr_age_id, size=16)
usr_job_id = paddle.layer.data(
name='job_id',
type=paddle.data_type.integer_value(paddle.dataset.movielens.max_job_id(
) + 1))
usr_job_emb = paddle.layer.embedding(input=usr_job_id, size=16)
```
As shown in the code above, each user is described by four integer inputs: `user_id`, `gender_id`, `age_id`, and `job_id`. To handle these discrete features conveniently, we borrow the embedding technique used in NLP language models and transform each value into an embedding vector: `usr_emb`, `usr_gender_emb`, `usr_age_emb`, and `usr_job_emb`.
```python
usr_combined_features = paddle.layer.fc(
input=[usr_emb, usr_gender_emb, usr_age_emb, usr_job_emb],
size=200,
act=paddle.activation.Tanh())
```
Then, the four user embeddings are fed into a fully-connected layer, which combines them into a single 200-dimensional user feature vector.
We apply a similar transformation to each movie feature. The model configuration is:
```python
mov_id = paddle.layer.data(
name='movie_id',
type=paddle.data_type.integer_value(
paddle.dataset.movielens.max_movie_id() + 1))
mov_emb = paddle.layer.embedding(input=mov_id, size=32)
mov_categories = paddle.layer.data(
name='category_id',
type=paddle.data_type.sparse_binary_vector(
len(paddle.dataset.movielens.movie_categories())))
mov_categories_hidden = paddle.layer.fc(input=mov_categories, size=32)
movie_title_dict = paddle.dataset.movielens.get_movie_title_dict()
mov_title_id = paddle.layer.data(
name='movie_title',
type=paddle.data_type.integer_value_sequence(len(movie_title_dict)))
mov_title_emb = paddle.layer.embedding(input=mov_title_id, size=32)
mov_title_conv = paddle.networks.sequence_conv_pool(
input=mov_title_emb, hidden_size=32, context_len=3)
mov_combined_features = paddle.layer.fc(
input=[mov_emb, mov_categories_hidden, mov_title_conv],
size=200,
act=paddle.activation.Tanh())
```
The movie title, a sequence of words represented as a sequence of integer word indices, is fed into a `sequence_conv_pool` layer, which applies convolution and pooling along the time dimension. Because pooling is done along the time dimension, the output is a fixed-length vector regardless of the length of the input sequence.
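To see why pooling over time yields a fixed-length output, here is a simplified sketch in plain Python. It is not PaddlePaddle's implementation: the function name `sequence_conv_pool_sketch`, the zero padding at the sequence borders, and the omission of bias and activation are all assumptions made for illustration.

```python
def sequence_conv_pool_sketch(sequence, weights, context_len=3):
    """Convolve over time with zero padding, then max-pool over time.

    sequence: list of T embedding vectors, each of length D.
    weights:  list of H filter rows, each of length context_len * D.
    Returns a vector of length H, independent of T.
    """
    dim = len(sequence[0])
    zero = [0.0] * dim
    half = context_len // 2
    outputs = []
    for t in range(len(sequence)):
        # Concatenate the embeddings in the window centered at position t.
        window = []
        for offset in range(-half, context_len - half):
            pos = t + offset
            window.extend(sequence[pos] if 0 <= pos < len(sequence) else zero)
        # One output value per filter (row of weights).
        outputs.append(
            [sum(w * x for w, x in zip(row, window)) for row in weights])
    # Taking the maximum over time gives one value per filter.
    return [max(step[h] for step in outputs) for h in range(len(weights))]


# Two filters over windows of 3 embeddings of dimension 2.
filters = [[1.0] * 6, [0.5] * 6]
short_title = [[1.0, 2.0]]
long_title = [[1.0, 2.0]] * 7
print(len(sequence_conv_pool_sketch(short_title, filters)))
print(len(sequence_conv_pool_sketch(long_title, filters)))
```

Both calls return a vector with one entry per filter, so a one-word title and a seven-word title produce outputs of the same length.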
Finally, we can use cosine similarity to calculate the similarity between user characteristics and movie features.
```python
inference = paddle.layer.cos_sim(a=usr_combined_features, b=mov_combined_features, size=1, scale=5)
cost = paddle.layer.regression_cost(
input=inference,
label=paddle.layer.data(
name='score', type=paddle.data_type.dense_vector(1)))
```
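As a rough illustration of what the `scale=5` argument does, the sketch below computes a cosine similarity in plain Python and multiplies it by a scale factor so that the largest possible prediction lines up with the top rating of 5. The helper `scaled_cos_sim` is hypothetical; how PaddlePaddle's `cos_sim` layer handles details such as zero-length vectors is not covered here.

```python
import math


def scaled_cos_sim(a, b, scale=5.0):
    """Cosine similarity of two equal-length vectors, multiplied by scale."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return scale * dot / (norm_a * norm_b)


user_vec = [0.5, -0.2, 0.1]
movie_vec = [0.4, -0.1, 0.3]
print(scaled_cos_sim(user_vec, movie_vec))
```

Vectors pointing in the same direction score `scale`, orthogonal vectors score 0, and opposite vectors score `-scale`. The regression cost then penalizes the gap between this value and the observed rating.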
## Model Training
### Define Parameters
First, we define the model parameters according to the previous model configuration `cost`.
```python
# Create parameters
parameters = paddle.parameters.create(cost)
```
### Create Trainer
Before creating the trainer, we also need to choose an optimization algorithm. Here we specify the Adam optimizer via `paddle.optimizer`.
```python
trainer = paddle.trainer.SGD(cost=cost, parameters=parameters,
update_equation=paddle.optimizer.Adam(learning_rate=1e-4))
```
```text
[INFO 2017-03-06 17:12:13,378 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]
[INFO 2017-03-06 17:12:13,379 networks.py:1478] The output order is [__regression_cost_0__]
```
### Training
`paddle.dataset.movielens.train` yields records during each pass. After shuffling, the records are grouped into batches for training.
```python
reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.movielens.train(), buf_size=8192),
    batch_size=256)
```
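The reader pipeline above is built from decorators: functions that take a reader (a callable returning an iterator of records) and return a new reader. The sketch below mimics buffered shuffling and batching in plain Python. `shuffle_sketch`, `batch_sketch`, and `toy_reader` are hypothetical names, and details such as keeping the final partial batch are assumptions rather than PaddlePaddle's documented behavior.

```python
import random


def shuffle_sketch(reader, buf_size):
    """Return a reader that shuffles records within buffers of buf_size."""
    def shuffled_reader():
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the last, partially filled buffer.
        random.shuffle(buf)
        for item in buf:
            yield item
    return shuffled_reader


def batch_sketch(reader, batch_size):
    """Return a reader that groups records into lists of batch_size."""
    def batch_reader():
        current = []
        for record in reader():
            current.append(record)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # assumption: keep the final partial batch
            yield current
    return batch_reader


def toy_reader():
    """Stand-in for a dataset reader; yields ten one-column records."""
    for i in range(10):
        yield (i,)


batches = list(batch_sketch(shuffle_sketch(toy_reader, buf_size=4), batch_size=3)())
print([len(b) for b in batches])
```

A larger `buf_size` mixes records more thoroughly at the cost of memory, since only records inside the same buffer can change places.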
`feeding` specifies the correspondence between the columns of each yielded record and the `paddle.layer.data` inputs. For instance, the first column of the data generated by `movielens.train` corresponds to the `user_id` feature.
```python
feeding = {
'user_id': 0,
'gender_id': 1,
'age_id': 2,
'job_id': 3,
'movie_id': 4,
'category_id': 5,
'movie_title': 6,
'score': 7
}
```
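One way to picture the `feeding` dictionary is as a lookup from input names to column positions. The sketch below applies it to a single record. The helper `name_columns` is hypothetical and the record values are made up for illustration; only the column order is taken from the dictionary above.

```python
feeding = {
    'user_id': 0,
    'gender_id': 1,
    'age_id': 2,
    'job_id': 3,
    'movie_id': 4,
    'category_id': 5,
    'movie_title': 6,
    'score': 7
}


def name_columns(record, feeding):
    """Pair each data-layer name with its column in the record."""
    return {name: record[index] for name, index in feeding.items()}


# Hypothetical record in the column order above; the values are made up.
record = (1, 0, 0, 10, 1193, [7], [10, 20, 30], [5.0])
named = name_columns(record, feeding)
print(named['movie_id'])
print(named['score'])
```

Because the mapping is by name, the data layers can be declared in any order as long as `feeding` matches the column order of the reader.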
The callback function `event_handler` is called during training whenever a predefined event occurs.
```python
step = 0
train_costs = [], []
test_costs = [], []

def event_handler(event):
    global step
    global train_costs
    global test_costs
    if isinstance(event, paddle.event.EndIteration):
        need_plot = False
        if step % 10 == 0:  # every 10 batches, record a train cost
            train_costs[0].append(step)
            train_costs[1].append(event.cost)
        if step % 1000 == 0:  # every 1000 batches, record a test cost
            result = trainer.test(reader=paddle.batch(
                paddle.dataset.movielens.test(), batch_size=256))
            test_costs[0].append(step)
            test_costs[1].append(result.cost)
        if step % 100 == 0:  # every 100 batches, update cost plot
            plt.plot(*train_costs)
            plt.plot(*test_costs)
            plt.legend(['Train Cost', 'Test Cost'], loc='upper left')
            display.clear_output(wait=True)
            display.display(plt.gcf())
            plt.gcf().clear()
        step += 1
```
Finally, we can invoke `trainer.train` to start training:
```python
trainer.train(
reader=reader,
event_handler=event_handler,
feeding=feeding,
num_passes=200)
```
## Conclusion ## Conclusion
...@@ -99,13 +364,13 @@ This tutorial goes over traditional approaches in recommender system and a deep ...@@ -99,13 +364,13 @@ This tutorial goes over traditional approaches in recommender system and a deep
## Reference ## Reference
1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325. 1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2. 2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186. 3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.
4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001. 4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001.
5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA 5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA
6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016). 6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198. 7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
...@@ -208,8 +208,8 @@ class EmbeddingFieldParser(object): ...@@ -208,8 +208,8 @@ class EmbeddingFieldParser(object):
elif config['dict']['type'] == 'split': elif config['dict']['type'] == 'split':
self.dict = SplitEmbeddingDict(config['dict'].get('delimiter', ',')) self.dict = SplitEmbeddingDict(config['dict'].get('delimiter', ','))
elif config['dict']['type'] == 'whole_content': elif config['dict']['type'] == 'whole_content':
self.dict = EmbeddingFieldParser.WholeContentDict(config['dict'][ self.dict = EmbeddingFieldParser.WholeContentDict(
'sort']) config['dict']['sort'])
else: else:
print config print config
assert False assert False
......
...@@ -118,22 +118,287 @@ Figure 3. A hybrid recommendation model. ...@@ -118,22 +118,287 @@ Figure 3. A hybrid recommendation model.
## Dataset
We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) dataset to train our model. It contains about 1,000,000 ratings of roughly 4,000 movies from roughly 6,000 users. Each rating is an integer from 1 to 5. Thanks to GroupLens Research for collecting, processing, and publishing the dataset.
The `paddle.v2.dataset` package wraps several public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, and `wmt14`, so there is no need to download and preprocess the `MovieLens` dataset manually.
```python
# Run this block to show dataset's documentation
help(paddle.v2.dataset.movielens)
```
The raw `MovieLens` data contains movie ratings along with features of both movies and users.
For instance, the features of one movie look like this:
```python
movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0]
```
```text
<MovieInfo id(1), title(Toy Story), categories(['Animation', "Children's", 'Comedy'])>
```
The features of one user look like this:
```python
user_info = paddle.dataset.movielens.user_info()
print user_info.values()[0]
```
```text
<UserInfo id(1), gender(F), age(1), job(10)>
```
In this dataset, a user's age is recorded as one of the following codes, each standing for an age range:
```text
1: "Under 18"
18: "18-24"
25: "25-34"
35: "35-44"
45: "45-49"
50: "50-55"
56: "56+"
```
The user's occupation is one of the following options:
```text
0: "other" or not specified
1: "academic/educator"
2: "artist"
3: "clerical/admin"
4: "college/grad student"
5: "customer service"
6: "doctor/health care"
7: "executive/managerial"
8: "farmer"
9: "homemaker"
10: "K-12 student"
11: "lawyer"
12: "programmer"
13: "retired"
14: "sales/marketing"
15: "scientist"
16: "self-employed"
17: "technician/engineer"
18: "tradesman/craftsman"
19: "unemployed"
20: "writer"
```
Each record consists of three main components: user features, movie features, and the movie rating.
As a simple example, consider the first training record:
```python
train_set_creator = paddle.dataset.movielens.train()
train_sample = next(train_set_creator())
uid = train_sample[0]
mov_id = train_sample[len(user_info[uid].value())]
print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1])
```
```text
User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest), categories(['Drama'])> with Score [5.0]
```
The output shows that user 1 gave movie `1193` a rating of 5.
Running `python train.py` starts training immediately. The following sections walk through how it works.
## Model Architecture
### Initialize PaddlePaddle
First, we must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).
```python
%matplotlib inline
import matplotlib.pyplot as plt
from IPython import display
import cPickle
import paddle.v2 as paddle
paddle.init(use_gpu=False)
```
### Model Configuration
```python
uid = paddle.layer.data(
name='user_id',
type=paddle.data_type.integer_value(
paddle.dataset.movielens.max_user_id() + 1))
usr_emb = paddle.layer.embedding(input=uid, size=32)
usr_gender_id = paddle.layer.data(
name='gender_id', type=paddle.data_type.integer_value(2))
usr_gender_emb = paddle.layer.embedding(input=usr_gender_id, size=16)
usr_age_id = paddle.layer.data(
name='age_id',
type=paddle.data_type.integer_value(
len(paddle.dataset.movielens.age_table)))
usr_age_emb = paddle.layer.embedding(input=usr_age_id, size=16)
usr_job_id = paddle.layer.data(
name='job_id',
type=paddle.data_type.integer_value(paddle.dataset.movielens.max_job_id(
) + 1))
usr_job_emb = paddle.layer.embedding(input=usr_job_id, size=16)
```
As shown in the code above, each user is described by four integer inputs: `user_id`, `gender_id`, `age_id`, and `job_id`. To handle these discrete features conveniently, we borrow the embedding technique used in NLP language models and transform each value into an embedding vector: `usr_emb`, `usr_gender_emb`, `usr_age_emb`, and `usr_job_emb`.
```python
usr_combined_features = paddle.layer.fc(
input=[usr_emb, usr_gender_emb, usr_age_emb, usr_job_emb],
size=200,
act=paddle.activation.Tanh())
```
Then, the four user embeddings are fed into a fully-connected layer, which combines them into a single 200-dimensional user feature vector.
We apply a similar transformation to each movie feature. The model configuration is:
```python
mov_id = paddle.layer.data(
name='movie_id',
type=paddle.data_type.integer_value(
paddle.dataset.movielens.max_movie_id() + 1))
mov_emb = paddle.layer.embedding(input=mov_id, size=32)
mov_categories = paddle.layer.data(
name='category_id',
type=paddle.data_type.sparse_binary_vector(
len(paddle.dataset.movielens.movie_categories())))
mov_categories_hidden = paddle.layer.fc(input=mov_categories, size=32)
movie_title_dict = paddle.dataset.movielens.get_movie_title_dict()
mov_title_id = paddle.layer.data(
name='movie_title',
type=paddle.data_type.integer_value_sequence(len(movie_title_dict)))
mov_title_emb = paddle.layer.embedding(input=mov_title_id, size=32)
mov_title_conv = paddle.networks.sequence_conv_pool(
input=mov_title_emb, hidden_size=32, context_len=3)
mov_combined_features = paddle.layer.fc(
input=[mov_emb, mov_categories_hidden, mov_title_conv],
size=200,
act=paddle.activation.Tanh())
```
The movie title, a sequence of words represented as a sequence of integer word indices, is fed into a `sequence_conv_pool` layer, which applies convolution and pooling along the time dimension. Because pooling is done along the time dimension, the output is a fixed-length vector regardless of the length of the input sequence.
Finally, we can use cosine similarity to calculate the similarity between user characteristics and movie features.
```python
inference = paddle.layer.cos_sim(a=usr_combined_features, b=mov_combined_features, size=1, scale=5)
cost = paddle.layer.regression_cost(
input=inference,
label=paddle.layer.data(
name='score', type=paddle.data_type.dense_vector(1)))
```
## Model Training
### Define Parameters
First, we define the model parameters according to the previous model configuration `cost`.
```python
# Create parameters
parameters = paddle.parameters.create(cost)
```
### Create Trainer
Before creating the trainer, we also need to choose an optimization algorithm. Here we specify the Adam optimizer via `paddle.optimizer`.
```python
trainer = paddle.trainer.SGD(cost=cost, parameters=parameters,
update_equation=paddle.optimizer.Adam(learning_rate=1e-4))
```
```text
[INFO 2017-03-06 17:12:13,378 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]
[INFO 2017-03-06 17:12:13,379 networks.py:1478] The output order is [__regression_cost_0__]
```
### Training
`paddle.dataset.movielens.train` yields records during each pass. After shuffling, the records are grouped into batches for training.
```python
reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.movielens.train(), buf_size=8192),
    batch_size=256)
```
`feeding` specifies the correspondence between the columns of each yielded record and the `paddle.layer.data` inputs. For instance, the first column of the data generated by `movielens.train` corresponds to the `user_id` feature.
```python
feeding = {
'user_id': 0,
'gender_id': 1,
'age_id': 2,
'job_id': 3,
'movie_id': 4,
'category_id': 5,
'movie_title': 6,
'score': 7
}
```
The callback function `event_handler` is called during training whenever a predefined event occurs.
```python
step = 0
train_costs = [], []
test_costs = [], []

def event_handler(event):
    global step
    global train_costs
    global test_costs
    if isinstance(event, paddle.event.EndIteration):
        need_plot = False
        if step % 10 == 0:  # every 10 batches, record a train cost
            train_costs[0].append(step)
            train_costs[1].append(event.cost)
        if step % 1000 == 0:  # every 1000 batches, record a test cost
            result = trainer.test(reader=paddle.batch(
                paddle.dataset.movielens.test(), batch_size=256))
            test_costs[0].append(step)
            test_costs[1].append(result.cost)
        if step % 100 == 0:  # every 100 batches, update cost plot
            plt.plot(*train_costs)
            plt.plot(*test_costs)
            plt.legend(['Train Cost', 'Test Cost'], loc='upper left')
            display.clear_output(wait=True)
            display.display(plt.gcf())
            plt.gcf().clear()
        step += 1
```
Finally, we can invoke `trainer.train` to start training:
```python
trainer.train(
reader=reader,
event_handler=event_handler,
feeding=feeding,
num_passes=200)
```
## Conclusion ## Conclusion
...@@ -141,16 +406,16 @@ This tutorial goes over traditional approaches in recommender system and a deep ...@@ -141,16 +406,16 @@ This tutorial goes over traditional approaches in recommender system and a deep
## Reference ## Reference
1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325. 1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2. 2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186. 3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.
4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001. 4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001.
5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA 5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA
6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016). 6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198. 7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -2,7 +2,8 @@ ...@@ -2,7 +2,8 @@
The source code for this section is located at [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment). First-time users may refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
## Background
In natural language processing, sentiment analysis refers to identifying the emotional state expressed in a text. The text may be a sentence, a paragraph, or a document. Sentiment classification can be a binary problem (positive/negative or happy/sad) or a three-class problem (positive/neutral/negative). Sentiment analysis is widely applicable, for example to online shopping (Amazon, Taobao), travel, and movie websites, where it can be used to learn from reviews how customers feel about a product. Table 1 is an example of sentiment analysis in movie reviews:
| Movie Review | Category |
...@@ -22,10 +23,12 @@ For a piece of text, BOW model ignores its word order, grammar and syntax, and r ...@@ -22,10 +23,12 @@ For a piece of text, BOW model ignores its word order, grammar and syntax, and r
In this chapter, we introduce a deep learning model that addresses these shortcomings of BOW. The model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and achieves a large performance improvement over traditional methods \[[1](#Reference)\].
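To make the word-order limitation of BOW concrete, here is a minimal sketch. The `bag_of_words` helper and the two example sentences are illustrative assumptions and not part of the tutorial code; the point is only that two reviews with opposite sentiment can map to the same bag of words.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a lower-cased, whitespace-tokenized text."""
    return Counter(text.lower().split())


# Two hypothetical reviews with opposite sentiment but identical word counts.
review_a = "the movie was good , not bad"
review_b = "the movie was bad , not good"

# BOW discards word order, so both reviews get the same representation.
same_bag = bag_of_words(review_a) == bag_of_words(review_b)
print(same_bag)
```

A model that sees only these counts cannot tell the two reviews apart, which is the motivation for order-aware models such as the CNN and RNN described next.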
## Model Overview
The models used in this chapter are a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network), each with some specific extensions.
### Convolutional Neural Networks for Texts (CNN)
Convolutional Neural Networks are commonly applied to data with a grid-like topology, such as 2-D images and 1-D texts. A CNN can combine multiple extracted local features to produce higher-level abstract semantics. Empirically, CNNs are very effective for image and text modeling.
A CNN mainly consists of convolution and pooling operations, with various extensions. We briefly describe CNN here with an example \[[1](#Reference)\], as shown in Figure 1:
...@@ -55,7 +58,8 @@ Finally, the CNN features are concatenated together to produce a fixed-length re ...@@ -55,7 +58,8 @@ Finally, the CNN features are concatenated together to produce a fixed-length re
For short texts, the above CNN model can achieve high accuracy \[[1](#Reference)\]. To extract more abstract representations, we may apply a deeper CNN model \[[2](#Reference),[3](#Reference)\].
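The convolution, max-over-time pooling, and concatenation steps can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the `tanh` nonlinearity, the random weights, the embedding size, and the helper names `conv_max_pool` and `random_kernel` are all hypothetical and are not taken from the tutorial or from PaddlePaddle.

```python
import math
import random


def conv_max_pool(embeddings, kernel, bias=0.0):
    """Slide one filter over the word sequence, then max-pool over time."""
    window = len(kernel)
    emb_dim = len(embeddings[0])
    features = []
    for start in range(len(embeddings) - window + 1):
        total = bias
        for k in range(window):
            for d in range(emb_dim):
                total += kernel[k][d] * embeddings[start + k][d]
        # Assumption: tanh is used as the nonlinearity.
        features.append(math.tanh(total))
    # Max-over-time pooling keeps one value per filter, whatever the length.
    return max(features)


def random_kernel(rng, window, emb_dim):
    """Hypothetical helper: a filter covering `window` consecutive words."""
    return [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(window)]


rng = random.Random(0)
emb_dim = 4
# A toy "sentence" of 7 word embeddings.
sentence = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(7)]
# Two filters of width 3 and two of width 4.
kernels = [random_kernel(rng, w, emb_dim) for w in (3, 3, 4, 4)]
# Concatenating the pooled features gives a fixed-length sentence vector.
sentence_vector = [conv_max_pool(sentence, k) for k in kernels]
print(len(sentence_vector))
```

Because each filter contributes exactly one pooled value, the length of `sentence_vector` depends only on the number of filters, not on the number of words in the sentence.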
### Recurrent Neural Network (RNN)
An RNN is an effective model for sequential data. Theoretically, the computational ability of an RNN is Turing-complete \[[4](#Reference)\]. Natural language is a classic example of sequential data, and RNNs (especially the LSTM variant \[[5](#Reference)\]) achieve state-of-the-art performance on various NLP tasks, such as language modeling, syntactic parsing, POS tagging, image captioning, dialog, machine translation, and so forth.
<p align="center"> <p align="center">
...@@ -70,8 +74,9 @@ where $W_{xh}$ is the weight matrix from input to latent; $W_{hh}$ is the latent ...@@ -70,8 +74,9 @@ where $W_{xh}$ is the weight matrix from input to latent; $W_{hh}$ is the latent
In NLP, words are first represented as one-hot vectors and then mapped to embeddings. The embedded feature is fed into the RNN as the input $x_t$ at each time step. We can also add other layers on top of the RNN, for example to build a deep or stacked RNN. Finally, the last latent state can be used as a feature for sentence classification.
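The pipeline described above (word id, embedding lookup, recurrent update, last latent state as the sentence feature) can be sketched as below. This is a minimal sketch under assumptions: it uses the common form $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$, and the `tanh` activation, the bias `b_h`, the toy weights, and the function name `rnn_last_state` are illustrative choices rather than the tutorial's definitions.

```python
import math


def rnn_last_state(word_ids, embedding, w_xh, w_hh, b_h):
    """Run a simple RNN over a word-id sequence and return the last state."""
    hid = len(b_h)
    h = [0.0] * hid  # initial latent state
    for wid in word_ids:
        x = embedding[wid]  # embedding lookup replaces the one-hot product
        # The comprehension reads the previous h; h is rebound afterwards.
        h = [
            math.tanh(
                sum(w_xh[j][i] * x[i] for i in range(len(x)))
                + sum(w_hh[j][k] * h[k] for k in range(hid))
                + b_h[j])
            for j in range(hid)
        ]
    return h


# Toy parameters (assumed values): 3 words, embedding size 2, hidden size 3.
embedding = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]
w_xh = [[0.3, -0.1], [0.2, 0.4], [-0.5, 0.1]]
w_hh = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.0], [0.3, 0.0, 0.1]]
b_h = [0.0, 0.1, -0.1]

feature = rnn_last_state([0, 2, 1], embedding, w_xh, w_hh, b_h)
print(len(feature))
```

The returned `feature` has a fixed size (the hidden size) regardless of sentence length, so it can be passed to a classifier.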
### Long-Short Term Memory (LSTM)
For long sequences, training an RNN sometimes suffers from vanishing and exploding gradients \[[6](#Reference)\]. To solve this problem, Hochreiter and Schmidhuber (1997) proposed the LSTM (long short-term memory) \[[5](#Reference)\].
Compared with a simple RNN, the structure of an LSTM adds a memory cell $c$, an input gate $i$, a forget gate $f$, and an output gate $o$. These gates and the memory cell greatly improve the ability to handle long sequences. We can formulate the LSTM-RNN as a function $F$:
...@@ -99,6 +104,7 @@ $$ h_t=Recurrent(x_t,h_{t-1})$$ ...@@ -99,6 +104,7 @@ $$ h_t=Recurrent(x_t,h_{t-1})$$
where $Recurrent$ is a simple RNN, GRU or LSTM.
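As a rough illustration of how the gates and the memory cell interact, here is a scalar LSTM step. It is a sketch under assumptions: it follows the common LSTM formulation without peephole connections, which may differ in detail from the equations used in this chapter, and all weight names (`w_xi`, `b_f`, and so on) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM; p maps hypothetical weight names to floats."""
    i = sigmoid(p["w_xi"] * x + p["w_hi"] * h_prev + p["b_i"])  # input gate
    f = sigmoid(p["w_xf"] * x + p["w_hf"] * h_prev + p["b_f"])  # forget gate
    o = sigmoid(p["w_xo"] * x + p["w_ho"] * h_prev + p["b_o"])  # output gate
    c_tilde = math.tanh(p["w_xc"] * x + p["w_hc"] * h_prev + p["b_c"])
    c = f * c_prev + i * c_tilde  # memory cell mixes old and new content
    h = o * math.tanh(c)          # latent state exposed to the next layer
    return h, c


# Assumed toy weights: all input/recurrent weights 0.5, all biases 0.
params = {name: 0.5 for name in
          ("w_xi", "w_hi", "w_xf", "w_hf", "w_xo", "w_ho", "w_xc", "w_hc")}
params.update({"b_i": 0.0, "b_f": 0.0, "b_o": 0.0, "b_c": 0.0})

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)

# With the forget gate saturated open and the input gate shut,
# the memory cell carries its previous value forward almost unchanged.
keep = dict(params, b_f=20.0, b_i=-20.0)
_, c_kept = lstm_step(0.0, 0.0, 0.7, keep)
```

The last two lines show why the memory cell helps with long sequences: when the forget gate stays open, information in `c` survives a step without being squashed by a nonlinearity.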
### Stacked Bidirectional LSTM
For a vanilla LSTM, $h_t$ contains input information from the preceding context at time steps $1, \dots, t-1$. We can also apply an RNN in the reverse direction to take the succeeding context $t+1, \dots, n$ into consideration. Combining this with deep RNNs (deeper RNNs can capture more abstract and higher-level semantics), we can design deep stacked bidirectional LSTMs to model sequential data \[[9](#Reference)\].
As shown in Figure 4 (a 3-layer RNN), odd-numbered layers are forward LSTMs and even-numbered layers are reverse LSTMs. Each higher LSTM layer takes the layer below as input, and the top-layer LSTM produces a fixed-length vector by max-pooling over time (this representation considers context from both preceding and succeeding words for higher-level abstraction). Finally, we feed this output to a softmax layer for classification.
...@@ -108,377 +114,247 @@ As shown in Figure 4 (3-layer RNN), odd/even layers are forward/reverse LSTM. Hi ...@@ -108,377 +114,247 @@ As shown in Figure 4 (3-layer RNN), odd/even layers are forward/reverse LSTM. Hi
Figure 4. Stacked Bidirectional LSTM for NLP modeling.
</p> </p>
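The layer layout and the pooling step described for Figure 4 can be sketched as follows. The helper names `layer_directions` and `max_pool_over_time` are hypothetical, and the requirement that the number of layers be odd is taken from the `stacked_num % 2 == 1` assertion in the model code of this chapter.

```python
def layer_directions(stacked_num):
    """Direction of each LSTM layer, numbered from 1 (bottom) to stacked_num."""
    if stacked_num % 2 != 1:
        raise ValueError("stacked_num must be odd so the top layer is forward")
    return ["reverse" if i % 2 == 0 else "forward"
            for i in range(1, stacked_num + 1)]


def max_pool_over_time(states):
    """Element-wise max over a list of per-time-step state vectors."""
    return [max(column) for column in zip(*states)]


print(layer_directions(3))
# Three time steps, hidden size 2: pooling yields one fixed-length vector.
print(max_pool_over_time([[1, 5], [3, 2], [0, 4]]))
```

Max-pooling over time is what turns a variable-length sequence of top-layer states into the fixed-length vector passed to the softmax layer.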
## Dataset

We use the [IMDB](http://ai.stanford.edu/%7Eamaas/data/sentiment/) dataset for sentiment analysis in this tutorial. It consists of 50,000 movie reviews split evenly into 25k train and 25k test sets. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10.

The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, and `wmt14`. There is no need to manually download and preprocess IMDB.

Running `python train.py` starts training immediately. The following sections unpack the details to show how it works.
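The labeling rule for IMDB reviews stated above (negative for scores of 4 or less, positive for scores of 7 or more) can be written as a small function. The name `imdb_sentiment` and the string labels are illustrative assumptions; the integer label ids actually produced by the dataset reader are not specified here.

```python
def imdb_sentiment(score):
    """Map a 1-10 review score to a sentiment class, or None if excluded."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    # Scores of 5 and 6 fall into neither class under this rule.
    return None


print([imdb_sentiment(s) for s in (1, 4, 5, 7, 10)])
```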
## Model Structure

### Initialize PaddlePaddle

We must first import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc.).

```python
import sys
import paddle.v2 as paddle

# PaddlePaddle init
paddle.init(use_gpu=False, trainer_count=1)
```

As alluded to in section [Model Overview](#model-overview), here we provide implementations of both the Text CNN and the stacked bidirectional LSTM models.
### Text Convolution Neural Network (Text CNN)

We create a neural network `convolution_net` as shown in the following snippet.

Note: `paddle.networks.sequence_conv_pool` includes both convolution and pooling layer operations.

```python
def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
    # Network input: a sequence of word ids; the dictionary size is input_dim
    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    # Map each word id to its embedding
    emb = paddle.layer.embedding(input=data, size=emb_dim)
    # Convolution and max-pooling with a context window of 3 words
    conv_3 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=3, hidden_size=hid_dim)
    # Convolution and max-pooling with a context window of 4 words
    conv_4 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=4, hidden_size=hid_dim)
    # Feed conv_3 and conv_4 together into a softmax classifier
    output = paddle.layer.fc(input=[conv_3, conv_4],
                             size=class_dim,
                             act=paddle.activation.Softmax())
    # Network input: class label
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
    cost = paddle.layer.classification_cost(input=output, label=lbl)
    return cost
```

1. Define input data and its dimension

    Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories. In `convolution_net`, the input to the network is defined in `paddle.layer.data`.

1. Define Classifier

    The above Text CNN network extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` function is then used to calculate the probability of the sentence belonging to each category.

1. Define Loss Function

    In the context of supervised learning, labels of the training set are defined in `paddle.layer.data`, too. During training, cross-entropy is used as the loss function in `paddle.layer.classification_cost` and serves as the output of the network. During testing, the outputs are the probabilities calculated by the classifier.
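For intuition about the classifier and loss steps, here is a standalone sketch of softmax followed by cross-entropy for a single example. It is plain Python rather than PaddlePaddle code, the helper names are hypothetical, and it is only meant to show the arithmetic that a softmax classifier with a classification cost is generally understood to perform.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


probs = softmax([2.0, -1.0])   # scores for two categories
loss = cross_entropy(probs, 0)  # true class is category 0
print(probs, loss)
```

The loss is small when the probability of the true class is close to 1 and grows without bound as that probability approaches 0.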
### Stacked Bidirectional LSTM

We create a neural network `stacked_lstm_net` as below.

```python
def stacked_lstm_net(input_dim,
                     class_dim=2,
                     emb_dim=128,
                     hid_dim=512,
                     stacked_num=3):
    """
    A wrapper for the sentiment classification task.
    This network uses a bidirectional recurrent network
    consisting of three LSTM layers. The configuration follows
    the paper at the URL below, but uses fewer layers.
        http://www.aclweb.org/anthology/P15-1109
    input_dim: word dictionary dimension.
    class_dim: number of categories.
    emb_dim: dimension of word embedding.
    hid_dim: dimension of hidden layer.
    stacked_num: number of stacked LSTM hidden layers.
    """
    # stacked_num must be odd so that the top-layer LSTM is a forward LSTM
    assert stacked_num % 2 == 1

    # Layer and parameter attributes
    layer_attr = paddle.attr.Extra(drop_rate=0.5)
    fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
    lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
    para_attr = [fc_para_attr, lstm_para_attr]
    bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
    # Activation functions
    relu = paddle.activation.Relu()
    linear = paddle.activation.Linear()

    # Network input: a sequence of word ids; the dictionary size is input_dim
    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    # Map each word id to its embedding
    emb = paddle.layer.embedding(input=data, size=emb_dim)

    fc1 = paddle.layer.fc(input=emb,
                          size=hid_dim,
                          act=linear,
                          bias_attr=bias_attr)
    lstm1 = paddle.layer.lstmemory(
        input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr)

    # Stack fc + lstmemory pairs up to a depth of stacked_num
    inputs = [fc1, lstm1]
    for i in range(2, stacked_num + 1):
        fc = paddle.layer.fc(input=inputs,
                             size=hid_dim,
                             act=linear,
                             param_attr=para_attr,
                             bias_attr=bias_attr)
        lstm = paddle.layer.lstmemory(
            input=fc,
            # Odd-numbered layers run forward, even-numbered layers in reverse
            reverse=(i % 2) == 0,
            act=relu,
            bias_attr=bias_attr,
            layer_attr=layer_attr)
        inputs = [fc, lstm]

    # Max-pool over time to obtain fixed-length vectors
    fc_last = paddle.layer.pooling(
        input=inputs[0], pooling_type=paddle.pooling.Max())
    lstm_last = paddle.layer.pooling(
        input=inputs[1], pooling_type=paddle.pooling.Max())
    # Feed fc_last and lstm_last together into a softmax classifier
    output = paddle.layer.fc(input=[fc_last, lstm_last],
                             size=class_dim,
                             act=paddle.activation.Softmax(),
                             bias_attr=bias_attr,
                             param_attr=para_attr)

    lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
    cost = paddle.layer.classification_cost(input=output, label=lbl)
    return cost
```

1. Define input data and its dimension

    Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories. In `stacked_lstm_net`, the input to the network is defined in `paddle.layer.data`.

1. Define Classifier

    The above stacked bidirectional LSTM network extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` function is then used to calculate the probability of the sentence belonging to each category.

1. Define Loss Function

    In the context of supervised learning, labels of the training set are defined in `paddle.layer.data`, too. During training, cross-entropy is used as the loss function in `paddle.layer.classification_cost` and serves as the output of the network. During testing, the outputs are the probabilities calculated by the classifier.

To reiterate, we can invoke either `convolution_net` or `stacked_lstm_net`.

```python
word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict)
class_dim = 2

# option 1
cost = convolution_net(dict_dim, class_dim=class_dim)
# option 2
# cost = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3)
```
## Model Training

### Define Parameters

First, we create the model parameters according to the previous model configuration `cost`.

```python
# create parameters
parameters = paddle.parameters.create(cost)
```
### Create Trainer

Before creating the trainer, we also need to configure the optimization algorithm. Here we specify the `Adam` optimization algorithm via `paddle.optimizer`.

```python
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
    learning_rate=2e-3,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4),
    model_average=paddle.optimizer.ModelAverage(average_window=0.5))

# create trainer
trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=adam_optimizer)
```
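For readers unfamiliar with Adam, the following sketch shows one update for a single scalar parameter, using the learning rate (2e-3) and L2 rate (8e-4) configured in this chapter. It is an assumption-laden illustration, not PaddlePaddle's implementation: the `beta1`, `beta2`, and `eps` values are the commonly used defaults, and folding L2 regularization into the gradient is one conventional way to apply it.

```python
import math


def adam_step(param, grad, m, v, t,
              lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8, l2_rate=8e-4):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    # Assumption: L2 regularization is added to the gradient as l2_rate * param.
    grad = grad + l2_rate * param
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, t=1)
print(w)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate toward the minimum regardless of the gradient's magnitude.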
### Training

`paddle.dataset.imdb.train()` yields records during each pass. After shuffling, the records are grouped into batches for training.

```python
train_reader = paddle.batch(
    paddle.reader.shuffle(
        lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
    batch_size=100)

test_reader = paddle.batch(
    lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
```

`feeding` specifies the correspondence between each yielded record and `paddle.layer.data`. For instance, the first column of the data generated by `paddle.dataset.imdb.train()` corresponds to the `word` feature.

```python
feeding = {'word': 0, 'label': 1}
```

The callback function `event_handler` is invoked to track training progress when a pre-defined event happens.

```python
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print("\nPass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics))
        else:
            sys.stdout.write('.')
            sys.stdout.flush()
    if isinstance(event, paddle.event.EndPass):
        # Evaluate on the test set at the end of every pass
        result = trainer.test(reader=test_reader, feeding=feeding)
        print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
```

Finally, we invoke `trainer.train` to start training:

```python
trainer.train(
    reader=train_reader,
    event_handler=event_handler,
    feeding=feeding,
    num_passes=10)
```
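The reader pipeline used for training (a record generator, wrapped by a buffered shuffle, wrapped by batching, with `feeding` naming the columns) can be mimicked in plain Python to see what each stage does. This is a sketch of the general pattern only: `buffered_shuffle`, `make_batches`, and `toy_reader` are hypothetical stand-ins and do not reproduce the actual behavior of `paddle.reader.shuffle` or `paddle.batch`.

```python
import random


def buffered_shuffle(reader, buf_size, seed=None):
    """Wrap a reader so records are shuffled within buffers of buf_size."""
    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        rng.shuffle(buf)  # flush the final, possibly partial, buffer
        for item in buf:
            yield item
    return shuffled_reader


def make_batches(reader, batch_size):
    """Wrap a reader so it yields lists of up to batch_size records."""
    def batch_reader():
        batch = []
        for record in reader():
            batch.append(record)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    return batch_reader


def toy_reader():
    """Hypothetical reader: yields (word id sequence, label) records."""
    for i in range(10):
        yield ([i, i + 1], i % 2)


feeding = {'word': 0, 'label': 1}  # column index of each input in a record
batches = list(make_batches(
    buffered_shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)())
first_record = batches[0][0]
print(first_record[feeding['word']], first_record[feeding['label']])
```

Each stage is a function that takes a reader and returns a new reader, which is why the stages can be nested in any combination.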
## Conclusion

In this chapter, we used sentiment analysis as an example to introduce applying deep learning models to end-to-end short text classification, and showed how to implement them with PaddlePaddle. We also briefly introduced two models for text processing: CNN and RNN. In the following chapters, we will see how these models can be applied to other tasks.
## Reference
1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
3. Yann N. Dauphin, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J]. arXiv preprint arXiv:1612.08083, 2016.
...@@ -490,4 +366,4 @@ In this chapter, we use sentiment analysis as an example to introduce applying d ...@@ -490,4 +366,4 @@ In this chapter, we use sentiment analysis as an example to introduce applying d
9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
...@@ -108,16 +108,14 @@ aclImdb ...@@ -108,16 +108,14 @@ aclImdb
``` ```
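Paddle implements automatic downloading and reading of the IMDB dataset in `dataset/imdb.py`, and provides APIs for reading the dictionary, the training data, and the test data. Paddle implements automatic downloading and reading of the IMDB dataset in `dataset/imdb.py`, and provides APIs for reading the dictionary, the training data, and the test data.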
``` ```python
import sys import sys
import paddle.trainer_config_helpers.attrs as attrs
from paddle.trainer_config_helpers.poolings import MaxPooling
import paddle.v2 as paddle import paddle.v2 as paddle
``` ```
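## Model Configuration ## Model Configuration
In this example we implement two text classification algorithms, based on the [text convolutional neural network](#文本卷积神经网络(CNN)) and the [stacked bidirectional LSTM](#栈式双向LSTM(Stacked Bidirectional LSTM)) described above. In this example we implement two text classification algorithms, based on the [text convolutional neural network](#文本卷积神经网络(CNN)) and the [stacked bidirectional LSTM](#栈式双向LSTM(Stacked Bidirectional LSTM)) described above.
### Text Convolutional Neural Network ### Text Convolutional Neural Network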
``` ```python
def convolution_net(input_dim, def convolution_net(input_dim,
class_dim=2, class_dim=2,
emb_dim=128, emb_dim=128,
...@@ -138,7 +136,7 @@ def convolution_net(input_dim, ...@@ -138,7 +136,7 @@ def convolution_net(input_dim,
``` ```
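The network input `input_dim` is the size of the dictionary, and `class_dim` is the number of classes. Here we use the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API to implement the convolution and pooling operations. The network input `input_dim` is the size of the dictionary, and `class_dim` is the number of classes. Here we use the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API to implement the convolution and pooling operations.
### Stacked Bidirectional LSTM ### Stacked Bidirectional LSTM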
``` ```python
def stacked_lstm_net(input_dim, def stacked_lstm_net(input_dim,
class_dim=2, class_dim=2,
emb_dim=128, emb_dim=128,
...@@ -159,11 +157,11 @@ def stacked_lstm_net(input_dim, ...@@ -159,11 +157,11 @@ def stacked_lstm_net(input_dim,
""" """
assert stacked_num % 2 == 1 assert stacked_num % 2 == 1
layer_attr = attrs.ExtraLayerAttribute(drop_rate=0.5) layer_attr = paddle.attr.Extra(drop_rate=0.5)
fc_para_attr = attrs.ParameterAttribute(learning_rate=1e-3) fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
lstm_para_attr = attrs.ParameterAttribute(initial_std=0., learning_rate=1.) lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
para_attr = [fc_para_attr, lstm_para_attr] para_attr = [fc_para_attr, lstm_para_attr]
bias_attr = attrs.ParameterAttribute(initial_std=0., l2_rate=0.) bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
relu = paddle.activation.Relu() relu = paddle.activation.Relu()
linear = paddle.activation.Linear() linear = paddle.activation.Linear()
...@@ -193,8 +191,8 @@ def stacked_lstm_net(input_dim, ...@@ -193,8 +191,8 @@ def stacked_lstm_net(input_dim,
layer_attr=layer_attr) layer_attr=layer_attr)
inputs = [fc, lstm] inputs = [fc, lstm]
fc_last = paddle.layer.pooling(input=inputs[0], pooling_type=MaxPooling()) fc_last = paddle.layer.pooling(input=inputs[0], pooling_type=paddle.pooling.Max())
lstm_last = paddle.layer.pooling(input=inputs[1], pooling_type=MaxPooling()) lstm_last = paddle.layer.pooling(input=inputs[1], pooling_type=paddle.pooling.Max())
output = paddle.layer.fc(input=[fc_last, lstm_last], output = paddle.layer.fc(input=[fc_last, lstm_last],
size=class_dim, size=class_dim,
act=paddle.activation.Softmax(), act=paddle.activation.Softmax(),
...@@ -207,7 +205,7 @@ def stacked_lstm_net(input_dim, ...@@ -207,7 +205,7 @@ def stacked_lstm_net(input_dim,
``` ```
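The network input `stacked_num` is the number of LSTM layers; it must be odd so that the topmost LSTM layer runs in the forward direction. In Paddle, an LSTM-based recurrent network is implemented by combining an fc layer with an lstmemory layer. The network input `stacked_num` is the number of LSTM layers; it must be odd so that the topmost LSTM layer runs in the forward direction. In Paddle, an LSTM-based recurrent network is implemented by combining an fc layer with an lstmemory layer.
## Model Training ## Model Training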
``` ```python
if __name__ == '__main__': if __name__ == '__main__':
# init # init
paddle.init(use_gpu=False) paddle.init(use_gpu=False)
...@@ -215,14 +213,14 @@ if __name__ == '__main__': ...@@ -215,14 +213,14 @@ if __name__ == '__main__':
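This starts the Paddle program. `use_gpu=False` trains on the CPU; if the system has a GPU, change it to `True` to train on the GPU. This starts the Paddle program. `use_gpu=False` trains on the CPU; if the system has a GPU, change it to `True` to train on the GPU.
### Training Data ### Training Data
Use the APIs in Paddle's `dataset.imdb` module to read the training data. Use the APIs in Paddle's `dataset.imdb` module to read the training data.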
``` ```python
print 'load dictionary...' print 'load dictionary...'
word_dict = paddle.dataset.imdb.word_dict() word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict) dict_dim = len(word_dict)
class_dim = 2 class_dim = 2
``` ```
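Load the data dictionary; the `word_dict()` API builds it directly. `class_dim` is the number of sample classes, and in this example there are only two: positive and negative. Load the data dictionary; the `word_dict()` API builds it directly. `class_dim` is the number of sample classes, and in this example there are only two: positive and negative.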
``` ```python
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000), lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
...@@ -232,12 +230,12 @@ if __name__ == '__main__': ...@@ -232,12 +230,12 @@ if __name__ == '__main__':
batch_size=100) batch_size=100)
``` ```
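Here, `dataset.imdb.train()` and `dataset.imdb.test()` are the training-data and test-data APIs of `dataset.imdb`. `train_reader` is used during training: it shuffles the training data it reads and groups it into batches. Likewise, `test_reader` is used during testing and groups the test data into batches. Here, `dataset.imdb.train()` and `dataset.imdb.test()` are the training-data and test-data APIs of `dataset.imdb`. `train_reader` is used during training: it shuffles the training data it reads and groups it into batches. Likewise, `test_reader` is used during testing and groups the test data into batches.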
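To make the shuffle-then-batch composition concrete, here is a minimal pure-Python sketch of the same pattern. It does not use Paddle: `shuffle`, `batch`, and `toy_reader` below are hypothetical stand-ins written for illustration, and details such as keeping the last, smaller batch are assumptions of the sketch rather than a statement about `paddle.reader.shuffle` or `paddle.batch`.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Return a reader that yields samples of `reader` in buffered random order."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the final, partially filled buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item

    return shuffled_reader


def batch(reader, batch_size):
    """Return a reader that groups samples of `reader` into lists."""

    def batch_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # assumption: keep the last, smaller batch
            yield current

    return batch_reader


def toy_reader():
    # Hypothetical stand-in for a dataset reader:
    # each sample is (list of word ids, label).
    for i in range(10):
        yield [i, i + 1], i % 2


train_reader = batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)
batches = list(train_reader())
```

A reader here is simply a zero-argument callable that returns an iterator, which is why the decorators can be nested freely.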
```python
feeding={'word': 0, 'label': 1}
``` ```
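reader_dict={'word': 0, 'label': 1} `feeding` specifies how the data returned by `train_reader` and `test_reader` maps to the data layers in the model configuration. Here, column 0 of the data returned by the reader corresponds to the `word` layer, and column 1 to the `label` layer.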
```
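`reader_dict` specifies how the data returned by `train_reader` and `test_reader` maps to the data layers in the model configuration. Here, column 0 of the data returned by the reader corresponds to the `word` layer, and column 1 to the `label` layer.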
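The column-to-layer mapping can be illustrated without Paddle. The sketch below is only an illustration of the idea: `apply_feeding` is a hypothetical helper, not a Paddle function, and the sample values are made up.

```python
def apply_feeding(sample, feeding):
    """Map a positional reader sample to a {layer_name: value} dict."""
    return {name: sample[index] for name, index in feeding.items()}


feeding = {'word': 0, 'label': 1}
sample = ([12, 7, 301], 1)  # hypothetical (word ids, label) pair
fed = apply_feeding(sample, feeding)
```

With this mapping, the word-id sequence in column 0 reaches the `word` data layer and the integer in column 1 reaches the `label` data layer.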
### Model Construction ### Model Construction
``` ```python
# Please choose the way to build the network # Please choose the way to build the network
# by uncommenting the corresponding line. # by uncommenting the corresponding line.
cost = convolution_net(dict_dim, class_dim=class_dim) cost = convolution_net(dict_dim, class_dim=class_dim)
...@@ -245,13 +243,13 @@ if __name__ == '__main__': ...@@ -245,13 +243,13 @@ if __name__ == '__main__':
``` ```
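This example uses the `convolution_net` network by default; to use `stacked_lstm_net` instead, switch by commenting out and uncommenting the corresponding lines. `cost` is the optimization objective of the network, and it also carries the topology of the whole network. This example uses the `convolution_net` network by default; to use `stacked_lstm_net` instead, switch by commenting out and uncommenting the corresponding lines. `cost` is the optimization objective of the network, and it also carries the topology of the whole network.
### Network Parameters ### Network Parameters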
``` ```python
# create parameters # create parameters
parameters = paddle.parameters.create(cost) parameters = paddle.parameters.create(cost)
``` ```
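Create the network parameters from the network topology. Here `parameters` is the parameter set of the whole network. Create the network parameters from the network topology. Here `parameters` is the parameter set of the whole network.
### Optimization Algorithm ### Optimization Algorithm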
``` ```python
# create optimizer # create optimizer
adam_optimizer = paddle.optimizer.Adam( adam_optimizer = paddle.optimizer.Adam(
learning_rate=2e-3, learning_rate=2e-3,
...@@ -261,7 +259,7 @@ if __name__ == '__main__': ...@@ -261,7 +259,7 @@ if __name__ == '__main__':
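Paddle provides APIs for a range of optimization algorithms; here we use Adam. Paddle provides APIs for a range of optimization algorithms; here we use Adam.
### Training ### Training
Use `paddle.trainer.SGD` to construct an SGD trainer, then call `trainer.train` to train the model. Use `paddle.trainer.SGD` to construct an SGD trainer, then call `trainer.train` to train the model.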
``` ```python
# End batch and end pass event handler # End batch and end pass event handler
def event_handler(event): def event_handler(event):
if isinstance(event, paddle.event.EndIteration): if isinstance(event, paddle.event.EndIteration):
...@@ -272,11 +270,11 @@ Paddle中提供了一系列优化算法的API,这里使用Adam优化算法。 ...@@ -272,11 +270,11 @@ Paddle中提供了一系列优化算法的API,这里使用Adam优化算法。
sys.stdout.write('.') sys.stdout.write('.')
sys.stdout.flush() sys.stdout.flush()
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, reader_dict=reader_dict) result = trainer.test(reader=test_reader, feeding=feeding)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
``` ```
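Pass an `event_handler` to the train function to observe the state at the end of each batch and each pass. For example, the `event_handler` shown here prints the cost and error after every 100 batches, and at the end of each pass it calls `trainer.test` to evaluate the test set and obtain the current model's test error. Pass an `event_handler` to the train function to observe the state at the end of each batch and each pass. For example, the `event_handler` shown here prints the cost and error after every 100 batches, and at the end of each pass it calls `trainer.test` to evaluate the test set and obtain the current model's test error.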
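The callback pattern itself can be tried without Paddle. The following is a minimal sketch under stated assumptions: `EndIteration`, `EndPass`, and `toy_train` are hypothetical stand-ins that only mimic the shape of the events, and the cost values are synthetic.

```python
class EndIteration(object):
    """Hypothetical event emitted after every batch."""

    def __init__(self, pass_id, batch_id, cost):
        self.pass_id = pass_id
        self.batch_id = batch_id
        self.cost = cost


class EndPass(object):
    """Hypothetical event emitted after every pass over the data."""

    def __init__(self, pass_id):
        self.pass_id = pass_id


def toy_train(num_passes, batches_per_pass, event_handler):
    # Stand-in for a training loop: no learning happens, the cost is synthetic.
    for pass_id in range(num_passes):
        for batch_id in range(batches_per_pass):
            cost = 1.0 / (1 + pass_id * batches_per_pass + batch_id)
            event_handler(EndIteration(pass_id, batch_id, cost))
        event_handler(EndPass(pass_id))


log = []


def event_handler(event):
    if isinstance(event, EndIteration):
        if event.batch_id % 100 == 0:  # report every 100 batches
            log.append("Pass %d, Batch %d, Cost %f" %
                       (event.pass_id, event.batch_id, event.cost))
    elif isinstance(event, EndPass):
        log.append("End of pass %d" % event.pass_id)


toy_train(num_passes=2, batches_per_pass=250, event_handler=event_handler)
```

The training loop owns the control flow and the handler only reacts to events, so reporting, checkpointing, or test-set evaluation can be changed without touching the loop.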
``` ```python
# create trainer # create trainer
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters, parameters=parameters,
...@@ -285,11 +283,11 @@ Paddle中提供了一系列优化算法的API,这里使用Adam优化算法。 ...@@ -285,11 +283,11 @@ Paddle中提供了一系列优化算法的API,这里使用Adam优化算法。
trainer.train( trainer.train(
reader=train_reader, reader=train_reader,
event_handler=event_handler, event_handler=event_handler,
reader_dict=reader_dict, feeding=feeding,
num_passes=2) num_passes=2)
``` ```
The program produces the following output when it runs. The program produces the following output when it runs.
``` ```text
Pass 0, Batch 0, Cost 0.693721, {'classification_error_evaluator': 0.5546875} Pass 0, Batch 0, Cost 0.693721, {'classification_error_evaluator': 0.5546875}
................................................................................................... ...................................................................................................
Pass 0, Batch 100, Cost 0.294321, {'classification_error_evaluator': 0.1015625} Pass 0, Batch 100, Cost 0.294321, {'classification_error_evaluator': 0.1015625}
......
...@@ -44,7 +44,8 @@ ...@@ -44,7 +44,8 @@
The source code for this section is available at [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment). First-time users may refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html). The source code for this section is available at [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment). First-time users may refer to the PaddlePaddle [installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).
## Background Introduction ## Background
In natural language processing, sentiment analysis refers to determining the emotional state expressed in a text. The text may be a sentence, a paragraph, or a document. The emotional state can be framed as a binary classification problem (positive/negative or happy/sad) or a three-class problem (positive/neutral/negative). Sentiment analysis is widely applicable, for example to online shopping (Amazon, Taobao), travel, and movie websites, where it can be used to gauge from reviews how customers feel about a product. Table 1 shows an example of sentiment analysis on movie reviews: In natural language processing, sentiment analysis refers to determining the emotional state expressed in a text. The text may be a sentence, a paragraph, or a document. The emotional state can be framed as a binary classification problem (positive/negative or happy/sad) or a three-class problem (positive/neutral/negative). Sentiment analysis is widely applicable, for example to online shopping (Amazon, Taobao), travel, and movie websites, where it can be used to gauge from reviews how customers feel about a product. Table 1 shows an example of sentiment analysis on movie reviews:
| Movie Review | Category | | Movie Review | Category |
...@@ -64,10 +65,12 @@ For a piece of text, BOW model ignores its word order, grammar and syntax, and r ...@@ -64,10 +65,12 @@ For a piece of text, BOW model ignores its word order, grammar and syntax, and r
In this chapter, we introduce a deep learning model that addresses these issues with BOW. The model embeds text into a low-dimensional space and takes word order into account. It is an end-to-end framework and improves substantially on traditional methods \[[1](#Reference)\]. In this chapter, we introduce a deep learning model that addresses these issues with BOW. The model embeds text into a low-dimensional space and takes word order into account. It is an end-to-end framework and improves substantially on traditional methods \[[1](#Reference)\].
## Model Overview ## Model Overview
The models used in this chapter are a CNN (convolutional neural network) and an RNN (recurrent neural network), each with some specific extensions. The models used in this chapter are a CNN (convolutional neural network) and an RNN (recurrent neural network), each with some specific extensions.
### Convolutional Neural Networks for Texts (CNN) ### Convolutional Neural Networks for Texts (CNN)
Convolutional neural networks are typically applied to data with a grid-like topology, such as 2-D images and 1-D text. A CNN combines multiple extracted local features to produce higher-level abstract semantics. Experimentally, CNNs are very effective for image and text modeling. Convolutional neural networks are typically applied to data with a grid-like topology, such as 2-D images and 1-D text. A CNN combines multiple extracted local features to produce higher-level abstract semantics. Experimentally, CNNs are very effective for image and text modeling.
A CNN mainly consists of convolution and pooling operations, with various extensions. We briefly describe the CNN here with an example \[[1](#Reference)\], as shown in Figure 1: A CNN mainly consists of convolution and pooling operations, with various extensions. We briefly describe the CNN here with an example \[[1](#Reference)\], as shown in Figure 1:
...@@ -97,7 +100,8 @@ Finally, the CNN features are concatenated together to produce a fixed-length re ...@@ -97,7 +100,8 @@ Finally, the CNN features are concatenated together to produce a fixed-length re
For short texts, the CNN model above can achieve high accuracy \[[1](#Reference)\]. To extract more abstract representations, we can apply a deeper CNN model \[[2](#Reference),[3](#Reference)\]. For short texts, the CNN model above can achieve high accuracy \[[1](#Reference)\]. To extract more abstract representations, we can apply a deeper CNN model \[[2](#Reference),[3](#Reference)\].
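As a rough illustration of convolution followed by max-pooling over time, here is a pure-Python sketch. It is not the tutorial's implementation: the function name, the toy embeddings and kernels, and the choice of ReLU as the non-linearity are all assumptions made for this example.

```python
def conv_max_over_time(embeddings, kernel, bias=0.0):
    """Slide one convolution kernel over a sentence and max-pool the result.

    embeddings: list of word vectors (one list of floats per word).
    kernel: list of h weight vectors, one per position in the window.
    """
    h = len(kernel)
    if len(embeddings) < h:
        raise ValueError("sentence is shorter than the convolution window")
    features = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        total = bias
        for word_vec, weight_vec in zip(window, kernel):
            total += sum(x * w for x, w in zip(word_vec, weight_vec))
        features.append(max(0.0, total))  # ReLU, assumed for this sketch
    # Max-pooling over time keeps the strongest response of this kernel.
    return max(features)


# A toy 4-word sentence with 2-dimensional embeddings (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel_h2 = [[1.0, -1.0], [0.5, 0.5]]               # window of 2 words
kernel_h3 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # window of 3 words

# One pooled feature per kernel, concatenated into a fixed-length vector.
sentence_vector = [conv_max_over_time(sentence, kernel_h2),
                   conv_max_over_time(sentence, kernel_h3)]
```

Each kernel contributes exactly one number regardless of sentence length, which is what makes the concatenated vector fixed-length.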
### Recurrent Neural Network(RNN) ### Recurrent Neural Network (RNN)
The RNN is an effective model for sequential data. In theory, RNNs are Turing-complete \[[4](#Reference)\]. Natural language is a classic example of sequential data, and RNNs (especially the LSTM variant \[[5](#Reference)\]) achieve state-of-the-art performance on many NLP tasks, such as language modeling, syntactic parsing, POS tagging, image captioning, dialog, and machine translation. The RNN is an effective model for sequential data. In theory, RNNs are Turing-complete \[[4](#Reference)\]. Natural language is a classic example of sequential data, and RNNs (especially the LSTM variant \[[5](#Reference)\]) achieve state-of-the-art performance on many NLP tasks, such as language modeling, syntactic parsing, POS tagging, image captioning, dialog, and machine translation.
<p align="center"> <p align="center">
...@@ -112,8 +116,9 @@ where $W_{xh}$ is the weight matrix from input to latent; $W_{hh}$ is the latent ...@@ -112,8 +116,9 @@ where $W_{xh}$ is the weight matrix from input to latent; $W_{hh}$ is the latent
In NLP, words are first represented as one-hot vectors and then mapped to embeddings. The embedded feature is fed to the RNN as the input $x_t$ at every time step. Moreover, we can add other layers on top of the RNN, e.g., to form a deep or stacked RNN. The last latent state can also be used as a feature for sentence classification. In NLP, words are first represented as one-hot vectors and then mapped to embeddings. The embedded feature is fed to the RNN as the input $x_t$ at every time step. Moreover, we can add other layers on top of the RNN, e.g., to form a deep or stacked RNN. The last latent state can also be used as a feature for sentence classification.
### Long-Short Term Memory ### Long-Short Term Memory (LSTM)
For data of long sequence, training RNN sometimes has gradient vanishing and explosion problem \[[6](#)\]. To solve this problem Hochreiter S, Schmidhuber J. (1997) proposed the LSTM(long short term memory\[[5](#Refernce)\]).
For long sequences, training an RNN can suffer from vanishing and exploding gradients \[[6](#)\]. To address this, Hochreiter and Schmidhuber (1997) proposed the LSTM (long short-term memory) \[[5](#Reference)\].
Compared with a simple RNN, the structure of the LSTM adds a memory cell $c$, an input gate $i$, a forget gate $f$, and an output gate $o$. These gates and the memory cell greatly improve its ability to handle long sequences. We can formulate the LSTM-RNN as a function $F$: Compared with a simple RNN, the structure of the LSTM adds a memory cell $c$, an input gate $i$, a forget gate $f$, and an output gate $o$. These gates and the memory cell greatly improve its ability to handle long sequences. We can formulate the LSTM-RNN as a function $F$:
...@@ -141,6 +146,7 @@ $$ h_t=Recrurent(x_t,h_{t-1})$$ ...@@ -141,6 +146,7 @@ $$ h_t=Recrurent(x_t,h_{t-1})$$
where $Recrurent$ is a simple RNN, GRU or LSTM. where $Recrurent$ is a simple RNN, GRU or LSTM.
### Stacked Bidirectional LSTM ### Stacked Bidirectional LSTM
For a vanilla LSTM, $h_t$ summarizes the input from the preceding time steps $1..t-1$. We can also apply an RNN in the reverse direction to take the following context $t+1…n$ into account. Combining this with a deep RNN (deeper RNNs can capture more abstract, higher-level semantics), we can design a deep stacked bidirectional LSTM to model sequential data\[[9](#Reference)\]. For a vanilla LSTM, $h_t$ summarizes the input from the preceding time steps $1..t-1$. We can also apply an RNN in the reverse direction to take the following context $t+1…n$ into account. Combining this with a deep RNN (deeper RNNs can capture more abstract, higher-level semantics), we can design a deep stacked bidirectional LSTM to model sequential data\[[9](#Reference)\].
As shown in Figure 4 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification. As shown in Figure 4 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification.
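Two details of this design are easy to check in isolation: which direction each stacked layer runs in, and how max-pooling over time turns a variable-length sequence of hidden states into a fixed-length vector. The sketch below is pure Python with hypothetical helper names and made-up hidden states; it mirrors the odd/even rule described above but is not the tutorial's network code.

```python
def layer_directions(stacked_num):
    """Direction of each stacked LSTM layer, numbered from 1.

    Odd layers run forward and even layers run in reverse, so an odd
    `stacked_num` guarantees that the top layer is a forward LSTM.
    """
    assert stacked_num % 2 == 1
    return ['reverse' if i % 2 == 0 else 'forward'
            for i in range(1, stacked_num + 1)]


def max_pool_over_time(hidden_states):
    """Element-wise maximum over the time steps of a sequence.

    hidden_states: list of equally sized vectors, one per time step.
    """
    return [max(column) for column in zip(*hidden_states)]


directions = layer_directions(3)
# Made-up hidden states for a 3-step sequence with hidden size 2.
pooled = max_pool_over_time([[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]])
```

The pooled vector has the hidden size as its length no matter how many time steps the sentence has, so it can be fed directly to a softmax layer.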
...@@ -150,377 +156,247 @@ As shown in Figure 4 (3-layer RNN), odd/even layers are forward/reverse LSTM. Hi ...@@ -150,377 +156,247 @@ As shown in Figure 4 (3-layer RNN), odd/even layers are forward/reverse LSTM. Hi
Figure 4. Stacked Bidirectional LSTM for NLP modeling. Figure 4. Stacked Bidirectional LSTM for NLP modeling.
</p> </p>
## Data Preparation ## Dataset
### Data introduction and Download
We take the [IMDB sentiment analysis dataset](http://ai.stanford.edu/%7Eamaas/data/sentiment/) as an example. The IMDB dataset contains a training set and a testing set, each with 25,000 movie reviews. Reviews are scored from 1 to 10: negative reviews have a score <= 4, and positive reviews have a score >= 7. You may use the following script to download the IMDB dataset and the [Moses](http://www.statmt.org/moses/) toolbox:
We use the [IMDB](http://ai.stanford.edu/%7Eamaas/data/sentiment/) dataset for sentiment analysis in this tutorial. It consists of 50,000 movie reviews split evenly into 25k training and 25k test reviews. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10.
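The labeling rule can be written down as a small function. This is only a sketch of the rule as stated here; `label_from_score` is a hypothetical helper, and treating scores of 5 and 6 as unlabeled is an inference from the two thresholds rather than something the dataset API exposes.

```python
def label_from_score(score):
    """Return the sentiment label implied by a 1-10 review score."""
    if score <= 4:
        return 'negative'
    if score >= 7:
        return 'positive'
    return None  # scores 5 and 6 fall outside both labeled classes


labels = [label_from_score(s) for s in (1, 4, 5, 6, 7, 10)]
```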
```bash The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, and `wmt14`. There is no need to download and preprocess IMDB manually.
./data/get_imdb.sh
```
If successful, you should see the directory ```data``` with following files:
``` After issuing the command `python train.py`, training starts immediately. The following sections unpack the details to show how it works.
aclImdb get_imdb.sh imdb mosesdecoder-master
```
* aclImdb: original data downloaded from the website;
* imdb: containing only training and testing data
* mosesdecoder-master: Moses tool
### Data Preprocessing ## Model Structure
We use the script `preprocess.py` to preprocess the data. It will call `tokenizer.perl` in the Moses toolbox to split words and punctuations, randomly shuffle training set and construct the dictionary. Notice: we only use labeled training and testing set. Executing following commands will preprocess the data:
``` ### Initialize PaddlePaddle
data_dir="./data/imdb"
python preprocess.py -i $data_dir
```
If it runs successfully, `./data/pre-imdb` will contain: We must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).
``` ```python
dict.txt labels.list test.list test_part_000 train.list train_part_000 import sys
``` import paddle.v2 as paddle
* test\_part\_000 and train\_part\_000: all labeled training and testing set, and the training set is shuffled. # PaddlePaddle init
* train.list and test.list: training and testing file-list (containing list of file names). paddle.init(use_gpu=False, trainer_count=1)
* dict.txt: dictionary generated from training set. ```
* labels.list: class label, 0 stands for negative while 1 for positive.
### Data Provider for PaddlePaddle As alluded to in section [Model Overview](#model-overview), here we provide the implementations of both Text CNN and Stacked-bidirectional LSTM models.
PaddlePaddle can read Python-style script for configuration. The following `dataprovider.py` provides a detailed example, consisting of two parts:
* hook: define text information and class Id. Texts are defined as `integer_value_sequence` while class Ids are defined as `integer_value`. ### Text Convolution Neural Network (Text CNN)
* process: read line by line for ID and text information split by `’\t\t’`, and yield the data as a generator.
```python We create a neural network `convolution_net` as shown in the following code snippet.
from paddle.trainer.PyDataProvider2 import *
def hook(settings, dictionary, **kwargs): Note: `paddle.networks.sequence_conv_pool` includes both convolution and pooling layer operations.
settings.word_dict = dictionary
settings.input_types = {
'word': integer_value_sequence(len(settings.word_dict)),
'label': integer_value(2)
}
settings.logger.info('dict len : %d' % (len(settings.word_dict)))
@provider(init_hook=hook)
def process(settings, file_name):
with open(file_name, 'r') as fdata:
for line_count, line in enumerate(fdata):
label, comment = line.strip().split('\t\t')
label = int(label)
words = comment.split()
word_slot = [
settings.word_dict[w] for w in words if w in settings.word_dict
]
yield {
'word': word_slot,
'label': label
}
```
## Model Setup
`trainer_config.py` is an example of a setup file.
### Data Definition
```python ```python
from os.path import join as join_path def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
from paddle.trainer_config_helpers import * data = paddle.layer.data("word",
# if it is “test” mode paddle.data_type.integer_value_sequence(input_dim))
is_test = get_config_arg('is_test', bool, False) emb = paddle.layer.embedding(input=data, size=emb_dim)
# if it is “predict” mode conv_3 = paddle.networks.sequence_conv_pool(
is_predict = get_config_arg('is_predict', bool, False) input=emb, context_len=3, hidden_size=hid_dim)
conv_4 = paddle.networks.sequence_conv_pool(
# Data path input=emb, context_len=4, hidden_size=hid_dim)
data_dir = "./data/pre-imdb" output = paddle.layer.fc(input=[conv_3, conv_4],
# File names size=class_dim,
train_list = "train.list" act=paddle.activation.Softmax())
test_list = "test.list" lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
dict_file = "dict.txt" cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
# Dictionary size
dict_dim = len(open(join_path(data_dir, "dict.txt")).readlines())
# class number
class_dim = len(open(join_path(data_dir, 'labels.list')).readlines())
if not is_predict:
train_list = join_path(data_dir, train_list)
test_list = join_path(data_dir, test_list)
dict_file = join_path(data_dir, dict_file)
train_list = train_list if not is_test else None
# construct the dictionary
word_dict = dict()
with open(dict_file, 'r') as f:
for i, line in enumerate(open(dict_file, 'r')):
word_dict[line.split('\t')[0]] = i
# Call the function “define_py_data_sources2” in the file dataprovider.py to extract features
define_py_data_sources2(
train_list,
test_list,
module="dataprovider",
obj="process", # function to generate data
args={'dictionary': word_dict}) # extra parameters, here refers to dictionary
``` ```
### Algorithm Setup 1. Define input data and its dimension
```python Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories. In `convolution_net`, the input to the network is defined in `paddle.layer.data`.
settings(
batch_size=128,
learning_rate=2e-3,
learning_method=AdamOptimizer(),
regularization=L2Regularization(8e-4),
gradient_clipping_threshold=25)
```
* Batch size set as 128; 1. Define Classifier
* Set global learning rate;
* Apply ADAM algorithm for optimization;
* Set up L2 regularization;
* Set up gradient clipping threshold;
### Model Structure The above Text CNN network extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` function (the classifier) is then used to calculate the probability that the sentence belongs to each category.
We use PaddlePaddle to implement two classification algorithms, based on the above-mentioned models [Text-CNN](#Text-CNN(CNN)) and [Stacked-bidirectional LSTM](#Stacked-bidirectional LSTM(Stacked Bidirectional LSTM)).
#### Implementation of Text CNN 1. Define Loss Function
```python
def convolution_net(input_dim,
class_dim=2,
emb_dim=128,
hid_dim=128,
is_predict=False):
# network input: id denotes word order, dictionary size as input_dim
data = data_layer("word", input_dim)
# Embed one-hot id to embedding subspace
emb = embedding_layer(input=data, size=emb_dim)
# Convolution and max-pooling operation, convolution kernel size set as 3
conv_3 = sequence_conv_pool(input=emb, context_len=3, hidden_size=hid_dim)
# Convolution and max-pooling, convolution kernel size set as 4
conv_4 = sequence_conv_pool(input=emb, context_len=4, hidden_size=hid_dim)
# Concatenate conv_3 and conv_4 as input for softmax classification, class number as class_dim
output = fc_layer(
input=[conv_3, conv_4], size=class_dim, act=SoftmaxActivation())
if not is_predict:
lbl = data_layer("label", 1) #network input: class label
outputs(classification_cost(input=output, label=lbl))
else:
outputs(output)
```
In our implementation, we can use just a single layer [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) to do convolution and pooling operation, convolution kernel size set as hidden_size parameters. In the context of supervised learning, the labels of the training set are also defined in `paddle.layer.data`. During training, cross-entropy is used as the loss function in `paddle.layer.classification_cost` and serves as the output of the network; during testing, the outputs are the probabilities calculated by the classifier.
#### Implementation of Stacked bidirectional LSTM #### Stacked bidirectional LSTM
We create a neural network `stacked_lstm_net` as below.
```python ```python
def stacked_lstm_net(input_dim, def stacked_lstm_net(input_dim,
class_dim=2, class_dim=2,
emb_dim=128, emb_dim=128,
hid_dim=512, hid_dim=512,
stacked_num=3, stacked_num=3):
is_predict=False): """
A Wrapper for sentiment classification task.
# layer number of LSTM “stacked_num” is an odd number to confirm the top-layer LSTM is forward This network uses bi-directional recurrent network,
assert stacked_num % 2 == 1 consisting three LSTM layers. This configure is referred to
# network attributes setup the paper as following url, but use fewer layers.
layer_attr = ExtraLayerAttribute(drop_rate=0.5) http://www.aclweb.org/anthology/P15-1109
# parameter attributes setup input_dim: here is word dictionary dimension.
fc_para_attr = ParameterAttribute(learning_rate=1e-3) class_dim: number of categories.
lstm_para_attr = ParameterAttribute(initial_std=0., learning_rate=1.) emb_dim: dimension of word embedding.
para_attr = [fc_para_attr, lstm_para_attr] hid_dim: dimension of hidden layer.
bias_attr = ParameterAttribute(initial_std=0., l2_rate=0.) stacked_num: number of stacked lstm-hidden layer.
# Activation functions """
relu = ReluActivation() assert stacked_num % 2 == 1
linear = LinearActivation()
layer_attr = paddle.attr.Extra(drop_rate=0.5)
fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
# Network input: id as word order, dictionary size is set as input_dim lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
data = data_layer("word", input_dim) para_attr = [fc_para_attr, lstm_para_attr]
# Mapping id from word to the embedding subspace bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
emb = embedding_layer(input=data, size=emb_dim) relu = paddle.activation.Relu()
linear = paddle.activation.Linear()
fc1 = fc_layer(input=emb, size=hid_dim, act=linear, bias_attr=bias_attr)
# LSTM-based RNN data = paddle.layer.data("word",
lstm1 = lstmemory( paddle.data_type.integer_value_sequence(input_dim))
input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr) emb = paddle.layer.embedding(input=data, size=emb_dim)
# Construct stacked bidirectional LSTM with fc_layer and lstmemory with layer depth as stacked_num: fc1 = paddle.layer.fc(input=emb,
inputs = [fc1, lstm1] size=hid_dim,
for i in range(2, stacked_num + 1): act=linear,
fc = fc_layer( bias_attr=bias_attr)
input=inputs, lstm1 = paddle.layer.lstmemory(
size=hid_dim, input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr)
act=linear,
param_attr=para_attr, inputs = [fc1, lstm1]
bias_attr=bias_attr) for i in range(2, stacked_num + 1):
lstm = lstmemory( fc = paddle.layer.fc(input=inputs,
input=fc, size=hid_dim,
# Odd number-th layer: forward, Even number-th reverse. act=linear,
reverse=(i % 2) == 0, param_attr=para_attr,
act=relu, bias_attr=bias_attr)
bias_attr=bias_attr, lstm = paddle.layer.lstmemory(
layer_attr=layer_attr) input=fc,
inputs = [fc, lstm] reverse=(i % 2) == 0,
act=relu,
# Apply max-pooling along the temporal dimension on the last fc_layer to produce a fixed length vector bias_attr=bias_attr,
fc_last = pooling_layer(input=inputs[0], pooling_type=MaxPooling()) layer_attr=layer_attr)
# Apply max-pooling along tempoeral dim of lstmemory to obtain fixed length feature vector inputs = [fc, lstm]
lstm_last = pooling_layer(input=inputs[1], pooling_type=MaxPooling())
# concatenate fc_last and lstm_last as input for a softmax classification layer, with class number equals class_dim fc_last = paddle.layer.pooling(
output = fc_layer( input=inputs[0], pooling_type=paddle.pooling.Max())
input=[fc_last, lstm_last], lstm_last = paddle.layer.pooling(
size=class_dim, input=inputs[1], pooling_type=paddle.pooling.Max())
act=SoftmaxActivation(), output = paddle.layer.fc(input=[fc_last, lstm_last],
bias_attr=bias_attr, size=class_dim,
param_attr=para_attr) act=paddle.activation.Softmax(),
bias_attr=bias_attr,
if is_predict: param_attr=para_attr)
outputs(output)
else: lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
outputs(classification_cost(input=output, label=data_layer('label', 1))) cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
``` ```
Our model defined in `trainer_config.py` uses the `stacked_lstm_net` structure as default. If you want to use `convolution_net`, you can comment related lines. 1. Define input data and its dimension
Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories. In `stacked_lstm_net`, the input to the network is defined in `paddle.layer.data`.
1. Define Classifier
The above stacked bidirectional LSTM network extracts high-level features and maps them to a vector whose size equals the number of categories. The `paddle.activation.Softmax` function (the classifier) is then used to calculate the probability that the sentence belongs to each category.
1. Define Loss Function
In the context of supervised learning, the labels of the training set are also defined in `paddle.layer.data`. During training, cross-entropy is used as the loss function in `paddle.layer.classification_cost` and serves as the output of the network; during testing, the outputs are the probabilities calculated by the classifier.
To reiterate, we can invoke either `convolution_net` or `stacked_lstm_net`.
```python ```python
stacked_lstm_net( word_dict = paddle.dataset.imdb.word_dict()
dict_dim, class_dim=class_dim, stacked_num=3, is_predict=is_predict) dict_dim = len(word_dict)
# convolution_net(dict_dim, class_dim=class_dim, is_predict=is_predict) class_dim = 2
# option 1
cost = convolution_net(dict_dim, class_dim=class_dim)
# option 2
# cost = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3)
``` ```
## Model Training ## Model Training
Use `train.sh` script to run local training:
``` ### Define Parameters
./train.sh
```
train.sh is as following: First, we create the model parameters according to the previous model configuration `cost`.
```bash ```python
paddle train --config=trainer_config.py \ # create parameters
--save_dir=./model_output \ parameters = paddle.parameters.create(cost)
--job=train \
--use_gpu=false \
--trainer_count=4 \
--num_passes=10 \
--log_period=20 \
--dot_period=20 \
--show_parameter_stats_period=100 \
--test_all_data_in_one_period=1 \
2>&1 | tee 'train.log'
``` ```
* \--config=trainer_config.py: set up model configuration. ### Create Trainer
* \--save\_dir=./model_output: set up output folder to save model parameters.
* \--job=train: set job mode as training.
* \--use\_gpu=false: Use CPU for training. If you have installed GPU-version PaddlePaddle and want to try GPU training, you may set this term as true.
* \--trainer\_count=4: set the number of threads (or GPUs).
* \--num\_passes=10: set the number of passes. In PaddlePaddle, a pass means a training epoch over all samples.
* \--log\_period=20: print log every 20 batches.
* \--show\_parameter\_stats\_period=100: Print statistics to screen every 100 batch.
* \--test\_all_data\_in\_one\_period=1: Predict all testing data every time.
If it runs successfully, the output log will be saved to `train.log` and the model parameters will be saved in the directory `model_output/`. The output log looks as follows: Before creating the trainer, we also need to configure the optimization algorithm.
Here we specify the `Adam` optimization algorithm via `paddle.optimizer`.
```python
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=2e-3,
regularization=paddle.optimizer.L2Regularization(rate=8e-4),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
# create trainer
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=adam_optimizer)
``` ```
Batch=20 samples=2560 AvgCost=0.681644 CurrentCost=0.681644 Eval: classification_error_evaluator=0.36875 CurrentEval: classification_error_evaluator=0.36875
...
Pass=0 Batch=196 samples=25000 AvgCost=0.418964 Eval: classification_error_evaluator=0.1922
Test samples=24999 cost=0.39297 Eval: classification_error_evaluator=0.149406
```
* Batch=xx: xx batches have been trained so far.
* samples=xx: xx samples have been processed during training.
* AvgCost=xx: average loss from the 0-th batch to the current batch.
* CurrentCost=xx: loss over the latest `log_period` batches.
* Eval: classification\_error\_evaluator=xx: average classification error from the 0-th batch to the current batch.
* CurrentEval: classification\_error\_evaluator=xx: classification error over the latest `log_period` batches.
* Pass=0: one run over all data in the training set is called a pass; pass 0 denotes the first round.
### Training
## Application models `paddle.dataset.imdb.train()` yields records during each pass; after shuffling, the records are grouped into batches for training.
### Testing
Testing refers to use trained model to evaluate labeled dataset. ```python
train_reader = paddle.batch(
paddle.reader.shuffle(
lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
batch_size=100)
test_reader = paddle.batch(
lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
``` ```
./test.sh
```
Scripts for testing `test.sh` is as following, where the function `get_best_pass` ranks classification accuracy to obtain the best model:
```bash `feeding` specifies the correspondence between the columns of each yielded record and the `paddle.layer.data` layers. For instance, the first column of data generated by `paddle.dataset.imdb.train()` corresponds to the `word` feature.
function get_best_pass() {
cat $1 | grep -Pzo 'Test .*\n.*pass-.*' | \
sed -r 'N;s/Test.* error=([0-9]+\.[0-9]+).*\n.*pass-([0-9]+)/\1 \2/g' | \
sort | head -n 1
}
log=train.log ```python
LOG=`get_best_pass $log` feeding = {'word': 0, 'label': 1}
LOG=(${LOG})
evaluate_pass="model_output/pass-${LOG[1]}"
echo 'evaluating from pass '$evaluate_pass
model_list=./model.list
touch $model_list | echo $evaluate_pass > $model_list
net_conf=trainer_config.py
paddle train --config=$net_conf \
--model_list=$model_list \
--job=test \
--use_gpu=false \
--trainer_count=4 \
--config_args=is_test=1 \
2>&1 | tee 'test.log'
``` ```
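To make the role of `feeding` concrete, here is a minimal plain-Python sketch of how such a mapping could route the columns of one record to named data layers. It is not PaddlePaddle's implementation: the helper `route_record` and the sample record are hypothetical and exist only for illustration.

```python
# Hypothetical helper illustrating the idea behind `feeding`;
# it is not part of the PaddlePaddle API.
def route_record(record, feeding):
    """Return {layer_name: value} by picking record[column] for each layer."""
    return {name: record[column] for name, column in feeding.items()}


feeding = {'word': 0, 'label': 1}
# A made-up record: a sequence of word ids followed by a class label.
record = ([12, 7, 431], 1)
routed = route_record(record, feeding)
print(routed['word'], routed['label'])
```

Keeping the mapping in a dictionary means the reader and the network only need to agree on column positions, not on argument order.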
Different from training, testing requires specifying `--job=test` and the model path `--model_list=$model_list`. If successful, the log will be saved to `test.log`. In our test, the best model is `model_output/pass-00002`, with a classification error rate of 0.115645: The callback function `event_handler` is invoked to track training progress whenever a pre-defined event happens.
``` ```python
Pass=0 samples=24999 AvgCost=0.280471 Eval: classification_error_evaluator=0.115645 def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "\nPass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
        else:
            sys.stdout.write('.')
            sys.stdout.flush()
    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(reader=test_reader, feeding=feeding)
        print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
``` ```
### Prediction Finally, we can invoke `trainer.train` to start training:
`predict.py` script provides an API. Predicting IMDB data without labels as following:
```python
trainer.train(
    reader=train_reader,
    event_handler=event_handler,
    feeding=feeding,
    num_passes=10)
``` ```
./predict.sh
```
predict.sh is as following(default model path `model_output/pass-00002` may exist or modified to others):
```bash
model=model_output/pass-00002/
config=trainer_config.py
label=data/pre-imdb/labels.list
cat ./data/aclImdb/test/pos/10007_10.txt | python predict.py \
--tconf=$config \
--model=$model \
--label=$label \
--dict=./data/pre-imdb/dict.txt \
--batch_size=1
```
* `cat ./data/aclImdb/test/pos/10007_10.txt` : Input prediction samples.
* `predict.py` : Prediction script.
* `--tconf=$config` : Network set up.
* `--model=$model` : Model path set up.
* `--label=$label` : set up the label dictionary, mapping integer IDs to string labels.
* `--dict=data/pre-imdb/dict.txt` : set up the dictionary file.
* `--batch_size=1` : batch size during prediction.
Prediction result of our example: ## Conclusion
``` In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters, we will see how these models can be applied in other tasks.
Loading parameters from model_output/pass-00002/
predicting label is pos
```
`10007_10.txt` is in the folder `./data/aclImdb/test/pos`, and the predicted label is also pos, so the prediction is correct.
## Summary
In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters we will see how these models can be applied in other tasks.
## Reference ## Reference
1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014. 1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014. 2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
3. Yann N. Dauphin, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J] arXiv preprint arXiv:1612.08083, 2016. 3. Yann N. Dauphin, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J] arXiv preprint arXiv:1612.08083, 2016.
...@@ -532,7 +408,7 @@ In this chapter, we use sentiment analysis as an example to introduce applying d ...@@ -532,7 +408,7 @@ In this chapter, we use sentiment analysis as an example to introduce applying d
9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -150,16 +150,14 @@ aclImdb ...@@ -150,16 +150,14 @@ aclImdb
``` ```
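Paddle implements automatic downloading and reading of the imdb dataset in `dataset/imdb.py`, and provides APIs for reading the dictionary, training data, and test data. Paddle implements automatic downloading and reading of the imdb dataset in `dataset/imdb.py`, and provides APIs for reading the dictionary, training data, and test data.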
``` ```python
import sys import sys
import paddle.trainer_config_helpers.attrs as attrs
from paddle.trainer_config_helpers.poolings import MaxPooling
import paddle.v2 as paddle import paddle.v2 as paddle
``` ```
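## Model Configuration ## Model Configuration
In this example, we implement two text classification algorithms, based on the [text convolutional neural network](#文本卷积神经网络(CNN)) and the [stacked bidirectional LSTM](#栈式双向LSTM(Stacked Bidirectional LSTM)) described above. In this example, we implement two text classification algorithms, based on the [text convolutional neural network](#文本卷积神经网络(CNN)) and the [stacked bidirectional LSTM](#栈式双向LSTM(Stacked Bidirectional LSTM)) described above.
### Text Convolutional Neural Network ### Text Convolutional Neural Network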
``` ```python
def convolution_net(input_dim, def convolution_net(input_dim,
class_dim=2, class_dim=2,
emb_dim=128, emb_dim=128,
...@@ -180,7 +178,7 @@ def convolution_net(input_dim, ...@@ -180,7 +178,7 @@ def convolution_net(input_dim,
``` ```
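The network input `input_dim` is the dictionary size, and `class_dim` is the number of classes. Here we use the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API to implement the convolution and pooling operations. The network input `input_dim` is the dictionary size, and `class_dim` is the number of classes. Here we use the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API to implement the convolution and pooling operations.
### Stacked Bidirectional LSTM ### Stacked Bidirectional LSTM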
``` ```python
def stacked_lstm_net(input_dim, def stacked_lstm_net(input_dim,
class_dim=2, class_dim=2,
emb_dim=128, emb_dim=128,
...@@ -201,11 +199,11 @@ def stacked_lstm_net(input_dim, ...@@ -201,11 +199,11 @@ def stacked_lstm_net(input_dim,
""" """
assert stacked_num % 2 == 1 assert stacked_num % 2 == 1
layer_attr = attrs.ExtraLayerAttribute(drop_rate=0.5) layer_attr = paddle.attr.Extra(drop_rate=0.5)
fc_para_attr = attrs.ParameterAttribute(learning_rate=1e-3) fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
lstm_para_attr = attrs.ParameterAttribute(initial_std=0., learning_rate=1.) lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
para_attr = [fc_para_attr, lstm_para_attr] para_attr = [fc_para_attr, lstm_para_attr]
bias_attr = attrs.ParameterAttribute(initial_std=0., l2_rate=0.) bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
relu = paddle.activation.Relu() relu = paddle.activation.Relu()
linear = paddle.activation.Linear() linear = paddle.activation.Linear()
...@@ -235,8 +233,8 @@ def stacked_lstm_net(input_dim, ...@@ -235,8 +233,8 @@ def stacked_lstm_net(input_dim,
layer_attr=layer_attr) layer_attr=layer_attr)
inputs = [fc, lstm] inputs = [fc, lstm]
fc_last = paddle.layer.pooling(input=inputs[0], pooling_type=MaxPooling()) fc_last = paddle.layer.pooling(input=inputs[0], pooling_type=paddle.pooling.Max())
lstm_last = paddle.layer.pooling(input=inputs[1], pooling_type=MaxPooling()) lstm_last = paddle.layer.pooling(input=inputs[1], pooling_type=paddle.pooling.Max())
output = paddle.layer.fc(input=[fc_last, lstm_last], output = paddle.layer.fc(input=[fc_last, lstm_last],
size=class_dim, size=class_dim,
act=paddle.activation.Softmax(), act=paddle.activation.Softmax(),
...@@ -249,7 +247,7 @@ def stacked_lstm_net(input_dim, ...@@ -249,7 +247,7 @@ def stacked_lstm_net(input_dim,
``` ```
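The network input `stacked_num` is the number of LSTM layers; it must be odd so that the topmost LSTM runs in the forward direction. In Paddle, an LSTM-based recurrent neural network is implemented with one fc layer and one lstmemory layer. The network input `stacked_num` is the number of LSTM layers; it must be odd so that the topmost LSTM runs in the forward direction. In Paddle, an LSTM-based recurrent neural network is implemented with one fc layer and one lstmemory layer.
## Model Training ## Model Training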
``` ```python
if __name__ == '__main__': if __name__ == '__main__':
# init # init
paddle.init(use_gpu=False) paddle.init(use_gpu=False)
...@@ -257,14 +255,14 @@ if __name__ == '__main__': ...@@ -257,14 +255,14 @@ if __name__ == '__main__':
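Start the paddle program. `use_gpu=False` means training on the CPU; if the system supports GPU, you can change it to True to train on the GPU. Start the paddle program. `use_gpu=False` means training on the CPU; if the system supports GPU, you can change it to True to train on the GPU.
### Training Data ### Training Data
Use the API in the `dataset.imdb` dataset module provided by Paddle to read the training data. Use the API in the `dataset.imdb` dataset module provided by Paddle to read the training data.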
``` ```python
print 'load dictionary...' print 'load dictionary...'
word_dict = paddle.dataset.imdb.word_dict() word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict) dict_dim = len(word_dict)
class_dim = 2 class_dim = 2
``` ```
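Load the data dictionary; here the `word_dict()` API builds the dictionary directly. `class_dim` is the number of sample classes; in this example there are only two classes, positive and negative. Load the data dictionary; here the `word_dict()` API builds the dictionary directly. `class_dim` is the number of sample classes; in this example there are only two classes, positive and negative.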
``` ```python
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000), lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
...@@ -274,12 +272,12 @@ if __name__ == '__main__': ...@@ -274,12 +272,12 @@ if __name__ == '__main__':
batch_size=100) batch_size=100)
``` ```
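Here, `dataset.imdb.train()` and `dataset.imdb.test()` are the training-data and test-data APIs of `dataset.imdb`. `train_reader` is used during training: it shuffles the training data that is read and groups it into batches. Likewise, `test_reader` is used during testing and groups the test data into batches. Here, `dataset.imdb.train()` and `dataset.imdb.test()` are the training-data and test-data APIs of `dataset.imdb`. `train_reader` is used during training: it shuffles the training data that is read and groups it into batches. Likewise, `test_reader` is used during testing and groups the test data into batches.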
```python
feeding={'word': 0, 'label': 1}
``` ```
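reader_dict={'word': 0, 'label': 1} `feeding` specifies the correspondence between the data returned by `train_reader` and `test_reader` and the data_layer in the model configuration. Here, column 0 of the data returned by the reader corresponds to the `word` layer, and column 1 corresponds to the `label` layer.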
```
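`reader_dict` specifies the correspondence between the data returned by `train_reader` and `test_reader` and the data_layer in the model configuration. Here, column 0 of the data returned by the reader corresponds to the `word` layer, and column 1 corresponds to the `label` layer.
### Building the Model ### Building the Model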
``` ```python
# Please choose the way to build the network # Please choose the way to build the network
# by uncommenting the corresponding line. # by uncommenting the corresponding line.
cost = convolution_net(dict_dim, class_dim=class_dim) cost = convolution_net(dict_dim, class_dim=class_dim)
...@@ -287,13 +285,13 @@ if __name__ == '__main__': ...@@ -287,13 +285,13 @@ if __name__ == '__main__':
``` ```
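This example uses the `convolution_net` network by default; to use the `stacked_lstm_net` network instead, comment or uncomment the corresponding lines. Here `cost` is the optimization objective of the network, and it also contains the topology of the entire network. This example uses the `convolution_net` network by default; to use the `stacked_lstm_net` network instead, comment or uncomment the corresponding lines. Here `cost` is the optimization objective of the network, and it also contains the topology of the entire network.
### Network Parameters ### Network Parameters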
``` ```python
# create parameters # create parameters
parameters = paddle.parameters.create(cost) parameters = paddle.parameters.create(cost)
``` ```
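Construct the network parameters from the network topology. Here `parameters` is the parameter set of the entire network. Construct the network parameters from the network topology. Here `parameters` is the parameter set of the entire network.
### Optimization Algorithm ### Optimization Algorithm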
``` ```python
# create optimizer # create optimizer
adam_optimizer = paddle.optimizer.Adam( adam_optimizer = paddle.optimizer.Adam(
learning_rate=2e-3, learning_rate=2e-3,
...@@ -303,7 +301,7 @@ if __name__ == '__main__': ...@@ -303,7 +301,7 @@ if __name__ == '__main__':
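Paddle provides a series of optimization-algorithm APIs; here we use the Adam optimization algorithm. Paddle provides a series of optimization-algorithm APIs; here we use the Adam optimization algorithm.
### Training ### Training
You can construct an SGD trainer with `paddle.trainer.SGD` and call `trainer.train` to train the model. You can construct an SGD trainer with `paddle.trainer.SGD` and call `trainer.train` to train the model.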
``` ```python
# End batch and end pass event handler # End batch and end pass event handler
def event_handler(event): def event_handler(event):
if isinstance(event, paddle.event.EndIteration): if isinstance(event, paddle.event.EndIteration):
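...@@ -314,11 +312,11 @@ Paddle provides a series of optimization-algorithm APIs; here we use the Adam optimization algorithm. ...@@ -314,11 +312,11 @@ Paddle provides a series of optimization-algorithm APIs; here we use the Adam optimization algorithm.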
sys.stdout.write('.') sys.stdout.write('.')
sys.stdout.flush() sys.stdout.flush()
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, reader_dict=reader_dict) result = trainer.test(reader=test_reader, feeding=feeding)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
``` ```
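By passing an `event_handler` to the train function, you can obtain the state at the end of each batch and each pass. For example, the `event_handler` shown here outputs the cost and error after every 100 batches; at the end of each pass it calls `trainer.test` to run over the test set and obtain the current model's error on the test set. By passing an `event_handler` to the train function, you can obtain the state at the end of each batch and each pass. For example, the `event_handler` shown here outputs the cost and error after every 100 batches; at the end of each pass it calls `trainer.test` to run over the test set and obtain the current model's error on the test set.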
``` ```python
# create trainer # create trainer
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters, parameters=parameters,
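...@@ -327,11 +325,11 @@ Paddle provides a series of optimization-algorithm APIs; here we use the Adam optimization algorithm. ...@@ -327,11 +325,11 @@ Paddle provides a series of optimization-algorithm APIs; here we use the Adam optimization algorithm.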
trainer.train( trainer.train(
reader=train_reader, reader=train_reader,
event_handler=event_handler, event_handler=event_handler,
reader_dict=reader_dict, feeding=feeding,
num_passes=2) num_passes=2)
``` ```
The output after running the program is as follows. The output after running the program is as follows.
``` ```text
Pass 0, Batch 0, Cost 0.693721, {'classification_error_evaluator': 0.5546875} Pass 0, Batch 0, Cost 0.693721, {'classification_error_evaluator': 0.5546875}
................................................................................................... ...................................................................................................
Pass 0, Batch 100, Cost 0.294321, {'classification_error_evaluator': 0.1015625} Pass 0, Batch 100, Cost 0.294321, {'classification_error_evaluator': 0.1015625}
......
...@@ -13,8 +13,6 @@ ...@@ -13,8 +13,6 @@
# limitations under the License. # limitations under the License.
import sys import sys
import paddle.trainer_config_helpers.attrs as attrs
from paddle.trainer_config_helpers.poolings import MaxPooling
import paddle.v2 as paddle import paddle.v2 as paddle
...@@ -26,9 +24,8 @@ def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128): ...@@ -26,9 +24,8 @@ def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
input=emb, context_len=3, hidden_size=hid_dim) input=emb, context_len=3, hidden_size=hid_dim)
conv_4 = paddle.networks.sequence_conv_pool( conv_4 = paddle.networks.sequence_conv_pool(
input=emb, context_len=4, hidden_size=hid_dim) input=emb, context_len=4, hidden_size=hid_dim)
output = paddle.layer.fc(input=[conv_3, conv_4], output = paddle.layer.fc(
size=class_dim, input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())
act=paddle.activation.Softmax())
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2)) lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl) cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost return cost
...@@ -54,11 +51,11 @@ def stacked_lstm_net(input_dim, ...@@ -54,11 +51,11 @@ def stacked_lstm_net(input_dim,
""" """
assert stacked_num % 2 == 1 assert stacked_num % 2 == 1
layer_attr = attrs.ExtraLayerAttribute(drop_rate=0.5) layer_attr = paddle.attr.Extra(drop_rate=0.5)
fc_para_attr = attrs.ParameterAttribute(learning_rate=1e-3) fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
lstm_para_attr = attrs.ParameterAttribute(initial_std=0., learning_rate=1.) lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
para_attr = [fc_para_attr, lstm_para_attr] para_attr = [fc_para_attr, lstm_para_attr]
bias_attr = attrs.ParameterAttribute(initial_std=0., l2_rate=0.) bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
relu = paddle.activation.Relu() relu = paddle.activation.Relu()
linear = paddle.activation.Linear() linear = paddle.activation.Linear()
...@@ -66,20 +63,19 @@ def stacked_lstm_net(input_dim, ...@@ -66,20 +63,19 @@ def stacked_lstm_net(input_dim,
paddle.data_type.integer_value_sequence(input_dim)) paddle.data_type.integer_value_sequence(input_dim))
emb = paddle.layer.embedding(input=data, size=emb_dim) emb = paddle.layer.embedding(input=data, size=emb_dim)
fc1 = paddle.layer.fc(input=emb, fc1 = paddle.layer.fc(
size=hid_dim, input=emb, size=hid_dim, act=linear, bias_attr=bias_attr)
act=linear,
bias_attr=bias_attr)
lstm1 = paddle.layer.lstmemory( lstm1 = paddle.layer.lstmemory(
input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr) input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr)
inputs = [fc1, lstm1] inputs = [fc1, lstm1]
for i in range(2, stacked_num + 1): for i in range(2, stacked_num + 1):
fc = paddle.layer.fc(input=inputs, fc = paddle.layer.fc(
size=hid_dim, input=inputs,
act=linear, size=hid_dim,
param_attr=para_attr, act=linear,
bias_attr=bias_attr) param_attr=para_attr,
bias_attr=bias_attr)
lstm = paddle.layer.lstmemory( lstm = paddle.layer.lstmemory(
input=fc, input=fc,
reverse=(i % 2) == 0, reverse=(i % 2) == 0,
...@@ -88,13 +84,16 @@ def stacked_lstm_net(input_dim, ...@@ -88,13 +84,16 @@ def stacked_lstm_net(input_dim,
layer_attr=layer_attr) layer_attr=layer_attr)
inputs = [fc, lstm] inputs = [fc, lstm]
fc_last = paddle.layer.pooling(input=inputs[0], pooling_type=MaxPooling()) fc_last = paddle.layer.pooling(
lstm_last = paddle.layer.pooling(input=inputs[1], pooling_type=MaxPooling()) input=inputs[0], pooling_type=paddle.pooling.Max())
output = paddle.layer.fc(input=[fc_last, lstm_last], lstm_last = paddle.layer.pooling(
size=class_dim, input=inputs[1], pooling_type=paddle.pooling.Max())
act=paddle.activation.Softmax(), output = paddle.layer.fc(
bias_attr=bias_attr, input=[fc_last, lstm_last],
param_attr=para_attr) size=class_dim,
act=paddle.activation.Softmax(),
bias_attr=bias_attr,
param_attr=para_attr)
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2)) lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl) cost = paddle.layer.classification_cost(input=output, label=lbl)
...@@ -117,7 +116,7 @@ if __name__ == '__main__': ...@@ -117,7 +116,7 @@ if __name__ == '__main__':
test_reader = paddle.batch( test_reader = paddle.batch(
lambda: paddle.dataset.imdb.test(word_dict), batch_size=100) lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
reader_dict = {'word': 0, 'label': 1} feeding = {'word': 0, 'label': 1}
# network config # network config
# Please choose the way to build the network # Please choose the way to build the network
...@@ -144,16 +143,15 @@ if __name__ == '__main__': ...@@ -144,16 +143,15 @@ if __name__ == '__main__':
sys.stdout.write('.') sys.stdout.write('.')
sys.stdout.flush() sys.stdout.flush()
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, reader_dict=reader_dict) result = trainer.test(reader=test_reader, feeding=feeding)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
# create trainer # create trainer
trainer = paddle.trainer.SGD(cost=cost, trainer = paddle.trainer.SGD(
parameters=parameters, cost=cost, parameters=parameters, update_equation=adam_optimizer)
update_equation=adam_optimizer)
trainer.train( trainer.train(
reader=train_reader, reader=train_reader,
event_handler=event_handler, event_handler=event_handler,
reader_dict=reader_dict, feeding=feeding,
num_passes=2) num_passes=2)
...@@ -149,19 +149,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex ...@@ -149,19 +149,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex
As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax. As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax.
## Data Preparation ## Dataset
We will use the Penn Treebank (PTB) dataset (Tomas Mikolov's pre-processed version). PTB is a small dataset used in the Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows:
<p align="center">
<table>
<tr>
<td>training set</td>
<td>validation set</td>
<td>test set</td>
</tr>
<tr>
<td>ptb.train.txt</td>
<td>ptb.valid.txt</td>
<td>ptb.test.txt</td>
</tr>
<tr>
<td>42068 lines</td>
<td>3370 lines</td>
<td>3761 lines</td>
</tr>
</table>
</p>
### Python Dataset Module
We encapsulate the PTB dataset in our Python module `paddle.dataset.imikolov`. This module can
1. download the dataset to `~/.cache/paddle/dataset/imikolov`, if it has not been downloaded yet, and
2. [preprocess](#preprocessing) the dataset.
### Preprocessing
We will train a 5-gram model: given a window of five words, we predict the fifth word from the first four.
The beginning and end of a sentence have a special meaning, so we add a begin token `<s>` at the front of the sentence and an end token `<e>` at the end. Data instances are generated by sliding the five-word window along the sentence.
For example, the sentence "I have a dream that one day" generates five data instances:
```text
<s> I have a dream
I have a dream that
have a dream that one
a dream that one day
dream that one day <e>
```
Finally, each data instance is converted into an integer sequence according to its words' indices in the dictionary.
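The preprocessing described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code of `paddle.dataset.imikolov`: the helper names `ngram_instances` and `to_ids`, the toy dictionary, and the `<unk>` fallback for unknown words are all hypothetical.

```python
def ngram_instances(sentence, n=5):
    """Slide an n-word window over a sentence padded with <s> and <e>."""
    tokens = ['<s>'] + sentence.split() + ['<e>']
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def to_ids(instance, word_dict, unk='<unk>'):
    """Map each word to its dictionary index; unknown words map to `unk`."""
    return [word_dict.get(word, word_dict[unk]) for word in instance]


sentence = "I have a dream that one day"
instances = ngram_instances(sentence)

# Toy dictionary built from this single sentence (assumption for illustration).
vocab = ['<s>', '<e>', '<unk>'] + sentence.split()
word_dict = {word: idx for idx, word in enumerate(vocab)}

for instance in instances:
    print(' '.join(instance), to_ids(instance, word_dict))
```

With the two padding tokens the sentence has nine tokens, so a window of five yields the five instances listed above.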
## Training
The neural network that we will be using is illustrated in the graph below:
## Model Configuration
<p align="center"> <p align="center">
<img src="image/ngram.en.png" width=400><br/> <img src="image/ngram.en.png" width=400><br/>
Figure 5. N-gram neural network model in model configuration Figure 5. N-gram neural network model in model configuration
</p> </p>
`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:
- Import packages.
```python
import math
import paddle.v2 as paddle
```
- Configure parameter.
```python
embsize = 32 # word vector dimension
hiddensize = 256 # hidden layer dimension
N = 5 # train 5-gram
```
- Map the $n-1$ words $w_{t-n+1},...w_{t-1}$ before $w_t$ to D-dimensional vectors through a matrix of dimension $|V|\times D$ (D=32 in this example).
```python
def wordemb(inlayer):
    wordemb = paddle.layer.table_projection(
        input=inlayer,
        size=embsize,
        param_attr=paddle.attr.Param(
            name="_proj",
            initial_std=0.001,
            learning_rate=1,
            l2_rate=0, ))
    return wordemb
```
- Define name and type for input to data layer.
```python
paddle.init(use_gpu=False, trainer_count=3)
word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict)
# Every layer takes integer value of range [0, dict_size)
firstword = paddle.layer.data(
name="firstw", type=paddle.data_type.integer_value(dict_size))
secondword = paddle.layer.data(
name="secondw", type=paddle.data_type.integer_value(dict_size))
thirdword = paddle.layer.data(
name="thirdw", type=paddle.data_type.integer_value(dict_size))
fourthword = paddle.layer.data(
name="fourthw", type=paddle.data_type.integer_value(dict_size))
nextword = paddle.layer.data(
name="fifthw", type=paddle.data_type.integer_value(dict_size))
Efirst = wordemb(firstword)
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)
```
- Concatenate n-1 word embedding vectors into a single feature vector.
```python
contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
```
- Feature vector will go through a fully connected layer which outputs a hidden feature vector.
```python
hidden1 = paddle.layer.fc(input=contextemb,
size=hiddensize,
act=paddle.activation.Sigmoid(),
layer_attr=paddle.attr.Extra(drop_rate=0.5),
bias_attr=paddle.attr.Param(learning_rate=2),
param_attr=paddle.attr.Param(
initial_std=1. / math.sqrt(embsize * 8),
learning_rate=1))
```
- The hidden feature vector goes through another fully connected layer, which turns it into a $|V|$-dimensional vector. Softmax is applied at the same time to obtain the probability of each word being generated.
```python
predictword = paddle.layer.fc(input=hidden1,
size=dict_size,
bias_attr=paddle.attr.Param(learning_rate=2),
act=paddle.activation.Softmax())
```
- We will use cross-entropy cost function.
```python
cost = paddle.layer.classification_cost(input=predictword, label=nextword)
```
- Create parameter, optimizer and trainer.
```python
parameters = paddle.parameters.create(cost)
adam_optimizer = paddle.optimizer.Adam(
learning_rate=3e-3,
regularization=paddle.optimizer.L2Regularization(8e-4))
trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
```
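The computation configured in the steps above (embedding lookup, concatenation, sigmoid hidden layer, softmax output, cross-entropy cost) can be sketched numerically in plain Python. This is a toy illustration rather than what PaddlePaddle executes: the sizes `V`, `D`, `H`, the random weights, and all helper names are assumptions, and biases and dropout are left out for brevity.

```python
import math
import random

random.seed(0)
V, D, H = 10, 4, 6  # toy vocabulary, embedding and hidden sizes (assumed)


def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


def matvec(vec, mat):
    """Row vector times matrix; returns a list with len(mat[0]) entries."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


emb_table = rand_matrix(V, D)      # shared |V| x D embedding matrix
w_hidden = rand_matrix(4 * D, H)   # hidden layer weights
w_out = rand_matrix(H, V)          # output layer weights


def forward(context_ids, next_id):
    # Embedding lookup selects rows of the table; the rows are concatenated.
    context = [x for i in context_ids for x in emb_table[i]]
    hidden = [1.0 / (1.0 + math.exp(-z)) for z in matvec(context, w_hidden)]
    probs = softmax(matvec(hidden, w_out))
    loss = -math.log(probs[next_id])  # cross-entropy for the true next word
    return probs, loss


probs, loss = forward([1, 2, 3, 4], next_id=5)
print(sum(probs), loss)
```

Because the four context words share one embedding table, the same row is used wherever a word appears in the window, which mirrors the shared parameter name `_proj` in the configuration.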
Next, we begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` are our training set and test set. Both functions return a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator, which outputs a single data instance at a time.
`paddle.batch` takes a reader as input and outputs a **batched reader**: whereas a reader outputs a single data instance at a time, a batched reader outputs a minibatch of data instances.
```python
import gzip

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(
            paddle.batch(
                paddle.dataset.imikolov.test(word_dict, N), 32))
        print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
        with gzip.open("model_%d.tar.gz"%event.pass_id, 'w') as f:
            parameters.to_tar(f)

trainer.train(
    paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
    num_passes=100,
    event_handler=event_handler)
```
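The reader and batched-reader concepts used above can be illustrated with a small plain-Python sketch. The functions `make_reader` and `batch` below are hypothetical stand-ins written for this example; they are not the implementation of `paddle.batch`, and keeping the final partial batch is a choice made for this sketch only.

```python
def make_reader(data):
    """A reader is a function that returns an iterator over single instances."""
    def reader():
        for instance in data:
            yield instance
    return reader


def batch(reader, batch_size):
    """Turn a reader into a batched reader that yields lists of instances."""
    def batched_reader():
        buf = []
        for instance in reader():
            buf.append(instance)
            if len(buf) == batch_size:
                yield buf
                buf = []
        if buf:  # choice made in this sketch: keep the final partial batch
            yield buf
    return batched_reader


reader = make_reader(range(7))
batches = list(batch(reader, 3)())
print(batches)
```

Because a reader is a function rather than an iterator, it can be called again at the start of every pass to iterate over the data from the beginning.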
`trainer.train` starts training; the output of `event_handler` will be similar to the following:
```text
Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
...
```
After 30 passes, the average error rate is around 0.735611.
## Model Training
## Model Application ## Model Application
After the model is trained, we can load the saved model parameters and use them in other models. We can also use the parameters in applications.
### Viewing Word Vector
Parameters trained by PaddlePaddle can be retrieved with `parameters.get()`. For example, we can check the word vector for the word `apple`.
```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
print embeddings[word_dict['apple']]
```
```text
[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
0.19072419 -0.24286366]
```
### Modifying Word Vector
The word vectors (`embeddings`) that we get are stored in a numpy array. We can modify this array and set it back into `parameters`.
```python
def modify_embedding(emb):
    # Add your modification here.
    pass

modify_embedding(embeddings)
parameters.set("_proj", embeddings)
```
### Calculating Cosine Similarity
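Cosine similarity is one way of quantifying the similarity between two vectors. Its value lies in $[-1, 1]$, and the larger the value, the more similar the two vectors are. Note that `spatial.distance.cosine`, used below, returns the cosine distance, i.e. one minus the cosine similarity, so a smaller printed value means more similar vectors: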
```python
from scipy import spatial
emb_1 = embeddings[word_dict['world']]
emb_2 = embeddings[word_dict['would']]
print spatial.distance.cosine(emb_1, emb_2)
```
```text
0.99375076448
```
## Conclusion ## Conclusion
This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding. This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.
...@@ -177,4 +415,4 @@ In information retrieval, the relevance between the query and document keyword c ...@@ -177,4 +415,4 @@ In information retrieval, the relevance between the query and document keyword c
5. https://en.wikipedia.org/wiki/Singular_value_decomposition 5. https://en.wikipedia.org/wiki/Singular_value_decomposition
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
...@@ -144,7 +144,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去 ...@@ -144,7 +144,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
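### Data Introduction ### Data Introduction
This tutorial uses the Penn Tree Bank (PTB) dataset. PTB is small and fast to train on, and is used in Mikolov's public language model training toolkit\[[2](#参考文献)\]. Its statistics are as follows: This tutorial uses the Penn Treebank (PTB) dataset (the version pre-processed by Tomas Mikolov). PTB is small and fast to train on, and is used in Mikolov's public language model training toolkit\[[2](#参考文献)\]. Its statistics are as follows: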
<p align="center"> <p align="center">
<table> <table>
...@@ -183,6 +183,7 @@ a dream that one day ...@@ -183,6 +183,7 @@ a dream that one day
dream that one day <e> dream that one day <e>
``` ```
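Finally, each input is converted into a sequence of integer indices according to the position of its words in the dictionary, and serves as the input to PaddlePaddle.
## Implementation ## Implementation
The model structure of this configuration is shown in the figure below: The model structure of this configuration is shown in the figure below: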
...@@ -245,7 +246,6 @@ Efirst = wordemb(firstword) ...@@ -245,7 +246,6 @@ Efirst = wordemb(firstword)
Esecond = wordemb(secondword) Esecond = wordemb(secondword)
Ethird = wordemb(thirdword) Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword) Efourth = wordemb(fourthword)
``` ```
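- Concatenate these n-1 word vectors through concat_layer into one large vector that serves as the history text feature. - Concatenate these n-1 word vectors through concat_layer into one large vector that serves as the history text feature.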
...@@ -323,11 +323,12 @@ trainer.train( ...@@ -323,11 +323,12 @@ trainer.train(
event_handler=event_handler) event_handler=event_handler)
``` ```
... ```text
Pass 0, Batch 25000, Cost 4.251861, {'classification_error_evaluator': 0.84375} Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
Pass 0, Batch 25100, Cost 4.847692, {'classification_error_evaluator': 0.8125} Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
Pass 0, Testing metrics {'classification_error_evaluator': 0.7417652606964111} Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
...
```
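The training process is fully automatic; the log printed in event_handler looks similar to the above: The training process is fully automatic; the log printed in event_handler looks similar to the above: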
...@@ -340,22 +341,23 @@ trainer.train( ...@@ -340,22 +341,23 @@ trainer.train(
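### Viewing Word Vectors ### Viewing Word Vectors
Parameters trained by PaddlePaddle can be retrieved directly with `parameters.get()`. For example, to view the word vector of the word `word`: Parameters trained by PaddlePaddle can be retrieved directly with `parameters.get()`. For example, to view the word vector of the word `apple`: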
```python ```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize) embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
print embeddings[word_dict['word']] print embeddings[word_dict['apple']]
``` ```
[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435 ```text
-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083 [-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236 -0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992 0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323 -0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
0.19072419 -0.24286366] 0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
0.19072419 -0.24286366]
```
### Modifying Word Vectors ### Modifying Word Vectors
...@@ -387,8 +389,9 @@ emb_2 = embeddings[word_dict['would']] ...@@ -387,8 +389,9 @@ emb_2 = embeddings[word_dict['would']]
print spatial.distance.cosine(emb_1, emb_2) print spatial.distance.cosine(emb_1, emb_2)
``` ```
0.99375076448 ```text
0.99375076448
```
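## Conclusion ## Conclusion
In this chapter, we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training a neural network model. In information retrieval, we can judge the relevance between a query and document keywords by the cosine of the angle between their vectors. In syntactic and semantic analysis, trained word vectors can be used to initialize models for better results. In document classification, word vectors make it possible to group synonyms in documents by clustering. We hope that after this chapter you will be able to apply word vectors to research in related fields on your own. In this chapter, we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training a neural network model. In information retrieval, we can judge the relevance between a query and document keywords by the cosine of the angle between their vectors. In syntactic and semantic analysis, trained word vectors can be used to initialize models for better results. In document classification, word vectors make it possible to group synonyms in documents by clustering. We hope that after this chapter you will be able to apply word vectors to research in related fields on your own.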
......
...@@ -191,19 +191,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex ...@@ -191,19 +191,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex
As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax. As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax.
## Data Preparation ## Dataset
We will use the Penn Treebank (PTB) dataset (Tomas Mikolov's pre-processed version). PTB is a small dataset, used in the Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows:
<p align="center">
<table>
<tr>
<td>training set</td>
<td>validation set</td>
<td>test set</td>
</tr>
<tr>
<td>ptb.train.txt</td>
<td>ptb.valid.txt</td>
<td>ptb.test.txt</td>
</tr>
<tr>
<td>42068 lines</td>
<td>3370 lines</td>
<td>3761 lines</td>
</tr>
</table>
</p>
### Python Dataset Module
We encapsulated the PTB dataset in our Python module `paddle.dataset.imikolov`. This module
1. downloads the dataset to `~/.cache/paddle/dataset/imikolov` if it is not there yet, and
2. [preprocesses](#preprocessing) the dataset.
### Preprocessing
We will train a 5-gram model: given a window of five words, we predict the fifth word from the first four.
The beginning and end of a sentence have a special meaning, so we add a begin token `<s>` at the front of each sentence and an end token `<e>` at its end. Data instances are generated by sliding the five-word window over the sentence.
For example, the sentence "I have a dream that one day" generates five data instances:
```text
<s> I have a dream
I have a dream that
have a dream that one
a dream that one day
dream that one day <e>
```
Finally, each data instance is converted into an integer sequence according to the indices of its words in the dictionary.
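The windowing and index conversion described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names `ngram_windows` and `to_indices` and the toy dictionary are hypothetical, and this is not the actual implementation inside `paddle.dataset.imikolov`.

```python
def ngram_windows(sentence, n=5):
    """Slide an n-word window over a sentence padded with <s> and <e>."""
    words = ['<s>'] + sentence.split() + ['<e>']
    return [words[i:i + n] for i in range(len(words) - n + 1)]


def to_indices(window, word_dict):
    """Replace each word by its integer index in the dictionary."""
    return [word_dict[w] for w in window]


windows = ngram_windows("I have a dream that one day")

# Toy dictionary built from this single sentence (hypothetical; the
# tutorial builds the real one with paddle.dataset.imikolov.build_dict()).
toy_dict = {}
for window in windows:
    for w in window:
        toy_dict.setdefault(w, len(toy_dict))

for window in windows:
    print(" ".join(window) + " -> " + str(to_indices(window, toy_dict)))
```

The first four integers of each instance play the role of the context words, and the fifth is the word to predict.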
## Training
The neural network that we will be using is illustrated in the graph below:
## Model Configuration
<p align="center"> <p align="center">
<img src="image/ngram.en.png" width=400><br/> <img src="image/ngram.en.png" width=400><br/>
Figure 5. N-gram neural network model in model configuration Figure 5. N-gram neural network model in model configuration
</p> </p>
`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:
- Import packages.
```python
import math
import paddle.v2 as paddle
```
- Configure parameters.
```python
embsize = 32 # word vector dimension
hiddensize = 256 # hidden layer dimension
N = 5 # train 5-gram
```
- Map the $n-1$ words $w_{t-n+1},...w_{t-1}$ before $w_t$ to D-dimensional vectors through a matrix of dimension $|V|\times D$ (D=32 in this example).
```python
def wordemb(inlayer):
    wordemb = paddle.layer.table_projection(
        input=inlayer,
        size=embsize,
        param_attr=paddle.attr.Param(
            name="_proj",
            initial_std=0.001,
            learning_rate=1,
            l2_rate=0, ))
    return wordemb
```
- Define the name and type of each input to the data layers.
```python
paddle.init(use_gpu=False, trainer_count=3)
word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict)
# Every layer takes integer value of range [0, dict_size)
firstword = paddle.layer.data(
name="firstw", type=paddle.data_type.integer_value(dict_size))
secondword = paddle.layer.data(
name="secondw", type=paddle.data_type.integer_value(dict_size))
thirdword = paddle.layer.data(
name="thirdw", type=paddle.data_type.integer_value(dict_size))
fourthword = paddle.layer.data(
name="fourthw", type=paddle.data_type.integer_value(dict_size))
nextword = paddle.layer.data(
name="fifthw", type=paddle.data_type.integer_value(dict_size))
Efirst = wordemb(firstword)
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)
```
- Concatenate n-1 word embedding vectors into a single feature vector.
```python
contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
```
- The feature vector goes through a fully connected layer, which outputs a hidden feature vector.
```python
hidden1 = paddle.layer.fc(input=contextemb,
size=hiddensize,
act=paddle.activation.Sigmoid(),
layer_attr=paddle.attr.Extra(drop_rate=0.5),
bias_attr=paddle.attr.Param(learning_rate=2),
param_attr=paddle.attr.Param(
initial_std=1. / math.sqrt(embsize * 8),
learning_rate=1))
```
- The hidden feature vector goes through another fully connected layer, which turns it into a $|V|$-dimensional vector. Softmax is applied at the same time to get the probability of each word being generated.
```python
predictword = paddle.layer.fc(input=hidden1,
size=dict_size,
bias_attr=paddle.attr.Param(learning_rate=2),
act=paddle.activation.Softmax())
```
- We will use the cross-entropy cost function.
```python
cost = paddle.layer.classification_cost(input=predictword, label=nextword)
```
- Create the parameters, optimizer, and trainer.
```python
parameters = paddle.parameters.create(cost)
adam_optimizer = paddle.optimizer.Adam(
learning_rate=3e-3,
regularization=paddle.optimizer.L2Regularization(8e-4))
trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
```
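To make the sizes and the cost in the steps above more concrete, here is a small plain-Python sketch. It assumes the configured values `embsize = 32`, `hiddensize = 256` and `N = 5`. The helpers `softmax` and `cross_entropy` and the toy scores are hypothetical illustrations for a single instance, not PaddlePaddle's implementation.

```python
import math

embsize = 32      # word vector dimension, as configured above
hiddensize = 256  # hidden layer dimension, as configured above
N = 5             # 5-gram

# The four context embeddings are concatenated into one feature vector.
context_dim = (N - 1) * embsize
# Standard deviation used above to initialize the hidden layer weights.
initial_std = 1. / math.sqrt(embsize * 8)


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cost of one instance: negative log-probability of the true word."""
    return -math.log(probs[label])


# Toy vocabulary of 4 words with equal scores (hypothetical values).
probs = softmax([0.0, 0.0, 0.0, 0.0])
toy_cost = cross_entropy(probs, label=2)
print("context_dim=%d, initial_std=%f, toy_cost=%f" %
      (context_dim, initial_std, toy_cost))
```

With equal scores the model is maximally uncertain, so the cost equals the logarithm of the vocabulary size; training lowers the cost by moving probability mass toward the true next word.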
Next, we will begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide our training set and test set, respectively. Both functions return a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator, which yields a single data instance at a time.
`paddle.batch` takes a reader as input and outputs a **batched reader**: whereas a reader yields a single data instance at a time, a batched reader yields a minibatch of data instances.
```python
import gzip

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)

    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(
            paddle.batch(
                paddle.dataset.imikolov.test(word_dict, N), 32))
        print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
        with gzip.open("model_%d.tar.gz"%event.pass_id, 'w') as f:
            parameters.to_tar(f)

trainer.train(
    paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
    num_passes=100,
    event_handler=event_handler)
```
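The reader and batched-reader idea can be illustrated without PaddlePaddle. The sketch below is plain Python under assumptions: `make_reader` and `batch` are hypothetical stand-ins written for this example, and how `paddle.batch` treats a final, incomplete minibatch is not assumed here.

```python
def make_reader(instances):
    """Return a reader: a function that returns an iterator over instances."""
    def reader():
        for instance in instances:
            yield instance
    return reader


def batch(reader, batch_size):
    """Return a batched reader yielding lists of up to batch_size instances."""
    def batched_reader():
        minibatch = []
        for instance in reader():
            minibatch.append(instance)
            if len(minibatch) == batch_size:
                yield minibatch
                minibatch = []
        if minibatch:  # this sketch keeps the last, smaller minibatch
            yield minibatch
    return batched_reader


# Three toy 5-gram instances, already converted to integer indices.
toy_reader = make_reader([(0, 1, 2, 3, 4), (1, 2, 3, 4, 5), (2, 3, 4, 5, 6)])
batches = list(batch(toy_reader, 2)())
print(batches)
```

Because a reader is a function rather than an iterator, it can be called again at the start of every pass to iterate over the data from the beginning.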
`trainer.train` starts the training; the output of `event_handler` will be similar to the following:
```text
Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
...
```
After 30 passes, the average error rate is around 0.735611.
## Model Training
## Model Application ## Model Application
After the model is trained, we can load the saved model parameters and use them in other models. We can also use the parameters in applications.
### Viewing Word Vector
Parameters trained by PaddlePaddle can be retrieved with `parameters.get()`. For example, we can check the word vector for the word `apple`.
```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
print embeddings[word_dict['apple']]
```
```text
[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
0.19072419 -0.24286366]
```
### Modifying Word Vector
The word vectors (`embeddings`) that we get are stored in a numpy array. We can modify this array and set it back into `parameters`.
```python
def modify_embedding(emb):
    # Add your modification here.
    pass

modify_embedding(embeddings)
parameters.set("_proj", embeddings)
```
### Calculating Cosine Similarity
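Cosine similarity is one way of quantifying the similarity between two vectors. Its value lies in $[-1, 1]$, and the larger the value, the more similar the two vectors are. Note that `spatial.distance.cosine`, used below, returns the cosine distance, i.e. one minus the cosine similarity, so a smaller printed value means more similar vectors: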
```python
from scipy import spatial
emb_1 = embeddings[word_dict['world']]
emb_2 = embeddings[word_dict['would']]
print spatial.distance.cosine(emb_1, emb_2)
```
```text
0.99375076448
```
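SciPy defines `spatial.distance.cosine` as the cosine distance, that is, one minus the cosine similarity. The relationship between the two quantities can be checked with a plain-Python sketch. The function names and the toy vectors below are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """One minus the cosine similarity, in [0, 2]."""
    return 1.0 - cosine_similarity(a, b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print("similarities: %f %f %f" % (same_direction, orthogonal, opposite))
print("distance of orthogonal vectors: %f" %
      cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Under this definition, a printed distance close to 1 corresponds to a similarity close to 0, meaning the two vectors are nearly orthogonal.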
## Conclusion ## Conclusion
This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding. This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.
...@@ -219,7 +457,7 @@ In information retrieval, the relevance between the query and document keyword c ...@@ -219,7 +457,7 @@ In information retrieval, the relevance between the query and document keyword c
5. https://en.wikipedia.org/wiki/Singular_value_decomposition 5. https://en.wikipedia.org/wiki/Singular_value_decomposition
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。 This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -186,7 +186,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去 ...@@ -186,7 +186,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
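### Data Introduction ### Data Introduction
This tutorial uses the Penn Tree Bank (PTB) dataset. PTB is small and fast to train on, and is used in Mikolov's public language model training toolkit\[[2](#参考文献)\]. Its statistics are as follows: This tutorial uses the Penn Treebank (PTB) dataset (the version pre-processed by Tomas Mikolov). PTB is small and fast to train on, and is used in Mikolov's public language model training toolkit\[[2](#参考文献)\]. Its statistics are as follows: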
<p align="center"> <p align="center">
<table> <table>
...@@ -225,6 +225,7 @@ a dream that one day ...@@ -225,6 +225,7 @@ a dream that one day
dream that one day <e> dream that one day <e>
``` ```
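Finally, each input is converted into a sequence of integer indices according to the position of its words in the dictionary, and serves as the input to PaddlePaddle.
## Implementation ## Implementation
The model structure of this configuration is shown in the figure below: The model structure of this configuration is shown in the figure below: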
...@@ -287,7 +288,6 @@ Efirst = wordemb(firstword) ...@@ -287,7 +288,6 @@ Efirst = wordemb(firstword)
Esecond = wordemb(secondword) Esecond = wordemb(secondword)
Ethird = wordemb(thirdword) Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword) Efourth = wordemb(fourthword)
``` ```
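- Concatenate these n-1 word vectors through concat_layer into one large vector that serves as the history text feature. - Concatenate these n-1 word vectors through concat_layer into one large vector that serves as the history text feature.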
...@@ -365,11 +365,12 @@ trainer.train( ...@@ -365,11 +365,12 @@ trainer.train(
event_handler=event_handler) event_handler=event_handler)
``` ```
... ```text
Pass 0, Batch 25000, Cost 4.251861, {'classification_error_evaluator': 0.84375} Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
Pass 0, Batch 25100, Cost 4.847692, {'classification_error_evaluator': 0.8125} Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
Pass 0, Testing metrics {'classification_error_evaluator': 0.7417652606964111} Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
...
```
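The training process is fully automatic; the log printed in event_handler looks similar to the above: The training process is fully automatic; the log printed in event_handler looks similar to the above: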
...@@ -382,22 +383,23 @@ trainer.train( ...@@ -382,22 +383,23 @@ trainer.train(
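### Viewing Word Vectors ### Viewing Word Vectors
Parameters trained by PaddlePaddle can be retrieved directly with `parameters.get()`. For example, to view the word vector of the word `word`: Parameters trained by PaddlePaddle can be retrieved directly with `parameters.get()`. For example, to view the word vector of the word `apple`: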
```python ```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize) embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
print embeddings[word_dict['word']] print embeddings[word_dict['apple']]
``` ```
[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435 ```text
-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083 [-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236 -0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992 0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323 -0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
0.19072419 -0.24286366] 0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
0.19072419 -0.24286366]
```
### Modifying Word Vectors ### Modifying Word Vectors
...@@ -429,8 +431,9 @@ emb_2 = embeddings[word_dict['would']] ...@@ -429,8 +431,9 @@ emb_2 = embeddings[word_dict['would']]
print spatial.distance.cosine(emb_1, emb_2) print spatial.distance.cosine(emb_1, emb_2)
``` ```
0.99375076448 ```text
0.99375076448
```
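## Conclusion ## Conclusion
In this chapter, we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training a neural network model. In information retrieval, we can judge the relevance between a query and document keywords by the cosine of the angle between their vectors. In syntactic and semantic analysis, trained word vectors can be used to initialize models for better results. In document classification, word vectors make it possible to group synonyms in documents by clustering. We hope that after this chapter you will be able to apply word vectors to research in related fields on your own. In this chapter, we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training a neural network model. In information retrieval, we can judge the relevance between a query and document keywords by the cosine of the angle between their vectors. In syntactic and semantic analysis, trained word vectors can be used to initialize models for better results. In document classification, word vectors make it possible to group synonyms in documents by clustering. We hope that after this chapter you will be able to apply word vectors to research in related fields on your own.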
......
...@@ -40,18 +40,19 @@ def main(): ...@@ -40,18 +40,19 @@ def main():
Efourth = wordemb(fourthword) Efourth = wordemb(fourthword)
contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth]) contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
hidden1 = paddle.layer.fc(input=contextemb, hidden1 = paddle.layer.fc(
size=hiddensize, input=contextemb,
act=paddle.activation.Sigmoid(), size=hiddensize,
layer_attr=paddle.attr.Extra(drop_rate=0.5), act=paddle.activation.Sigmoid(),
bias_attr=paddle.attr.Param(learning_rate=2), layer_attr=paddle.attr.Extra(drop_rate=0.5),
param_attr=paddle.attr.Param( bias_attr=paddle.attr.Param(learning_rate=2),
initial_std=1. / math.sqrt(embsize * 8), param_attr=paddle.attr.Param(
learning_rate=1)) initial_std=1. / math.sqrt(embsize * 8), learning_rate=1))
predictword = paddle.layer.fc(input=hidden1, predictword = paddle.layer.fc(
size=dict_size, input=hidden1,
bias_attr=paddle.attr.Param(learning_rate=2), size=dict_size,
act=paddle.activation.Softmax()) bias_attr=paddle.attr.Param(learning_rate=2),
act=paddle.activation.Softmax())
def event_handler(event): def event_handler(event):
if isinstance(event, paddle.event.EndIteration): if isinstance(event, paddle.event.EndIteration):
......