Commit e61a44c2, authored by ceci3

Merge branch 'develop' of https://github.com/PaddlePaddle/book into develop

#!/bin/bash
#This file is only used for continuous evaluation.
python train.py --enable_ce | python _ce.py
...@@ -3,10 +3,21 @@ ...@@ -3,10 +3,21 @@
The source code for this tutorial is in [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). If you are new to the book, please see the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书). The source code for this tutorial is in [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). If you are new to the book, please see the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
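### Notes:
1. Hardware requirements:
This tutorial can run on either CPU or GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, note that the default image provides a GPU environment with CUDA 8/cuDNN 5. For GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in this document and the script:
Please note: to make this document easier to read and use, we have split up and rearranged the code of train.py and placed it here. The code here produces the same results as train.py, which you can verify by running [train.py](https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/train.py) directly.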
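## Background ## Background
Given a dataset of size $n$, ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$, where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes of the $i$-th sample and $y_i$ is the target to be predicted for that sample, the linear regression model assumes that the target $y_i$ can be described by a linear combination of the attributes, that is, Given a dataset of size $n$, ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$, where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes of the $i$-th sample and $y_i$ is the target to be predicted for that sample, the linear regression model assumes that the target $y_i$ can be described by a linear combination of the attributes, that is,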
$$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldots,n$$
<p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/formula_fit_a_line_1.png?raw=true" width=550><br/>
</p>
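For example, in the house price prediction problem we are going to model, $x_{ij}$ are the attributes describing house $i$ (such as the number of rooms, the number of nearby schools and hospitals, traffic conditions, and so on), and $y_i$ is the price of the house. For example, in the house price prediction problem we are going to model, $x_{ij}$ are the attributes describing house $i$ (such as the number of rooms, the number of nearby schools and hospitals, traffic conditions, and so on), and $y_i$ is the price of the house.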
...@@ -25,7 +36,9 @@ $$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldo ...@@ -25,7 +36,9 @@ $$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldo
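In the Boston housing dataset, each record has 14 values: the first 13 describe various properties of the house, i.e. $x_i$ in the model; the last is the median price of that type of house, which we want to predict, i.e. $y_i$ in the model. Our model can therefore be written as: In the Boston housing dataset, each record has 14 values: the first 13 describe various properties of the house, i.e. $x_i$ in the model; the last is the median price of that type of house, which we want to predict, i.e. $y_i$ in the model. Our model can therefore be written as: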
$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/formula_fit_a_line_2.png?raw=true" width=350><br/>
</p>
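$\hat{Y}$ denotes the model's prediction, to distinguish it from the true value $Y$. The parameters the model learns are $\omega_1, \ldots, \omega_{13}, b$. $\hat{Y}$ denotes the model's prediction, to distinguish it from the true value $Y$. The parameters the model learns are $\omega_1, \ldots, \omega_{13}, b$.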
...@@ -33,13 +46,17 @@ $\hat{Y}$ 表示模型的预测结果,用来和真实值$Y$区分。模型要 ...@@ -33,13 +46,17 @@ $\hat{Y}$ 表示模型的预测结果,用来和真实值$Y$区分。模型要
For linear regression, the most common loss function is the Mean Squared Error ([MSE](https://en.wikipedia.org/wiki/Mean_squared_error)), which has the form: For linear regression, the most common loss function is the Mean Squared Error ([MSE](https://en.wikipedia.org/wiki/Mean_squared_error)), which has the form:
$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/formula_fit_a_line_3.png?raw=true" width=200><br/>
</p>
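That is, for a test set of size $n$, $MSE$ is the mean of the squared prediction errors over the $n$ samples. That is, for a test set of size $n$, $MSE$ is the mean of the squared prediction errors over the $n$ samples.
Gradient descent is generally used to optimize the loss function. Gradient descent is a first-order optimization algorithm. If $f(x)$ is defined and differentiable at a point $x_n$, then $f(x)$ decreases fastest at $x_n$ in the direction of the negative gradient $-\nabla f(x_n)$. We repeatedly adjust $x$ so that $f(x)$ approaches a global or local minimum, using the update: Gradient descent is generally used to optimize the loss function. Gradient descent is a first-order optimization algorithm. If $f(x)$ is defined and differentiable at a point $x_n$, then $f(x)$ decreases fastest at $x_n$ in the direction of the negative gradient $-\nabla f(x_n)$. We repeatedly adjust $x$ so that $f(x)$ approaches a global or local minimum, using the update: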
$$x_{n+1}=x_n-\lambda\nabla f(x_n), \quad n\geq 0$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/formula_fit_a_line_4.png?raw=true" width=250><br/>
</p>
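where $\lambda$ is the learning rate. This update procedure is called gradient descent. where $\lambda$ is the learning rate. This update procedure is called gradient descent.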
...@@ -101,17 +118,17 @@ $$x_n+1=x_n-λ▽f(x), n≧0$$ ...@@ -101,17 +118,17 @@ $$x_n+1=x_n-λ▽f(x), n≧0$$
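## Training ## Training
`fit_a_line/trainer.py` demonstrates the overall training process. `fit_a_line/train.py` demonstrates the overall training process.
### Configuring the Data Feeder (Datafeeder) ### Configuring the Data Feeder (Datafeeder)
First, we import the necessary libraries: First, we import the necessary libraries: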
```python ```python
from __future__ import print_function
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import numpy import numpy
import math import math
import sys import sys
from __future__ import print_function
``` ```
We import the [UCI Housing Data Set](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) through the uci_housing module. We import the [UCI Housing Data Set](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) through the uci_housing module.
...@@ -119,7 +136,7 @@ from __future__ import print_function ...@@ -119,7 +136,7 @@ from __future__ import print_function
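The uci_housing module encapsulates: The uci_housing module encapsulates:
1. The data download process. The downloaded data is saved to ~/.cache/paddle/dataset/uci_housing/housing.data. 1. The data download process. The downloaded data is saved to ~/.cache/paddle/dataset/uci_housing/housing.data.
2. The [data preprocessing](#数据预处理) process. 2. The data preprocessing process.
Next, we define the data feeder used for training. The feeder reads one batch of `BATCH_SIZE` samples at a time. To add some randomness, the user can define both a batch size and a buffer size; the feeder then reads a batch-sized number of samples at random from the buffer each time. Next, we define the data feeder used for training. The feeder reads one batch of `BATCH_SIZE` samples at a time. To add some randomness, the user can define both a batch size and a buffer size; the feeder then reads a batch-sized number of samples at random from the buffer each time.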
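The buffering and batching described above can be sketched in plain Python. This is only an illustration of the idea, not PaddlePaddle's implementation: `shuffle_reader`, `batch_reader` and `toy_reader` are hypothetical names, and the buffer size, batch size and seed are arbitrary values chosen for the example.

```python
import random


def shuffle_reader(reader, buf_size, seed=None):
    """Wrap a reader so that samples are shuffled within a buffer of buf_size items."""
    def shuffled():
        rng = random.Random(seed)
        buf = []
        for item in reader():
            buf.append(item)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for buffered in buf:
                    yield buffered
                buf = []
        # Flush whatever is left once the source reader is exhausted.
        rng.shuffle(buf)
        for buffered in buf:
            yield buffered
    return shuffled


def batch_reader(reader, batch_size):
    """Wrap a reader so that it yields lists of batch_size samples."""
    def batched():
        batch = []
        for item in reader():
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # the last batch may be smaller
            yield batch
    return batched


def toy_reader():
    """Hypothetical data source that yields ten integer samples."""
    for i in range(10):
        yield i


BATCH_SIZE = 3
train_batches = list(
    batch_reader(shuffle_reader(toy_reader, buf_size=4, seed=0), BATCH_SIZE)())
print(train_batches)
```

Note that a small buffer only shuffles samples locally: each sample can move at most within its own buffer, so a larger buffer gives a more thorough shuffle at the cost of memory.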
...@@ -163,14 +180,18 @@ train_data = data[:offset] ...@@ -163,14 +180,18 @@ train_data = data[:offset]
test_data = data[offset:] test_data = data[offset:]
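def reader(data):
    # Return a reader creator: paddle.reader.shuffle expects a callable that yields samples
    def _reader():
        for d in data:
            yield d[:-1], d[-1:]  # the 13 features, then the label
    return _reader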
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
train_data, buf_size=500), reader(train_data), buf_size=500),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
test_reader = paddle.batch( test_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
test_data, buf_size=500), reader(test_data), buf_size=500),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
### Configuring the Training Program ### Configuring the Training Program
...@@ -196,13 +217,14 @@ avg_loss = fluid.layers.mean(cost) # 对方差求均值,得到平均损失 ...@@ -196,13 +217,14 @@ avg_loss = fluid.layers.mean(cost) # 对方差求均值,得到平均损失
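In the `SGD optimizer` below, `learning_rate` is the learning rate, which affects how quickly training converges. In the `SGD optimizer` below, `learning_rate` is the learning rate, which affects how quickly training converges.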
```python ```python
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_loss)
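# Clone main_program to obtain test_program # Clone main_program to obtain test_program
# Some operators behave differently in training and testing, e.g. batch_norm; the for_test argument distinguishes whether the program is used for training or for testing # Some operators behave differently in training and testing, e.g. batch_norm; the for_test argument distinguishes whether the program is used for training or for testing
# This API does not remove any operators; call it before backward and optimization # This API does not remove any operators; call it before backward and optimization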
test_program = main_program.clone(for_test=True) test_program = main_program.clone(for_test=True)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_loss)
``` ```
### Defining the Execution Place ### Defining the Execution Place
...@@ -220,7 +242,7 @@ exe = fluid.Executor(place) ...@@ -220,7 +242,7 @@ exe = fluid.Executor(place)
[fluid.executor](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#permalink-15-executor) [fluid.executor](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#permalink-15-executor)
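### Creating the Training Process ### Creating the Training Process
Training requires a training program and some necessary arguments, and we also build a function that reports the test error during training. The necessary arguments are executor, program, reader, feeder and fetch_list: executor is the executor created earlier; program is the program the executor runs, i.e. the program created earlier (if it is not given, defalut_main_program is used by default); reader is the data that is read; feeder is the variables fed to the forward pass; fetch_list is the variables or named results the user wants to fetch. Training requires a training program and some necessary arguments, and we also build a function that reports the test error during training. The necessary arguments are executor, program, reader, feeder and fetch_list: executor is the executor created earlier; program is the program the executor runs, i.e. the program created earlier (if it is not given, default_main_program is used by default); reader is the data that is read; feeder is the variables fed to the forward pass; fetch_list is the variables or named results the user wants to fetch.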
```python ```python
num_epochs = 100 num_epochs = 100
...@@ -236,24 +258,6 @@ def train_test(executor, program, reader, feeder, fetch_list): ...@@ -236,24 +258,6 @@ def train_test(executor, program, reader, feeder, fetch_list):
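count += 1 # accumulate the number of samples in the test set count += 1 # accumulate the number of samples in the test set
return [x_d / count for x_d in accumulated] # compute the average loss return [x_d / count for x_d in accumulated] # compute the average loss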
```
You can print the loss values directly to observe the `training progress`:
```python
train_prompt = "train cost"
test_prompt = "test cost"
print("%s', out %f" % (train_prompt, out))
print("%s', out %f" % (test_prompt, out))
```
You can also plot the losses to show the `training progress`:
```python
from paddle.utils.plot import ploter
plot_prompt = ploter(train_prompt, test_prompt)
``` ```
### Main Training Loop ### Main Training Loop
...@@ -264,8 +268,11 @@ plot_prompt = ploter(train_prompt, test_prompt) ...@@ -264,8 +268,11 @@ plot_prompt = ploter(train_prompt, test_prompt)
%matplotlib inline %matplotlib inline
params_dirname = "fit_a_line.inference.model" params_dirname = "fit_a_line.inference.model"
feeder = fluid.DataFeeder(place=place, feed_list=[x, y]) feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
naive_exe = fluid.Executor(place) exe.run(startup_program)
naive_exe.run(startup_program) train_prompt = "train cost"
test_prompt = "test cost"
from paddle.utils.plot import Ploter
plot_prompt = Ploter(train_prompt, test_prompt)
step = 0 step = 0
exe_test = fluid.Executor(place) exe_test = fluid.Executor(place)
...@@ -280,10 +287,12 @@ for pass_id in range(num_epochs): ...@@ -280,10 +287,12 @@ for pass_id in range(num_epochs):
avg_loss_value, = exe.run(main_program, avg_loss_value, = exe.run(main_program,
feed=feeder.feed(data_train), feed=feeder.feed(data_train),
fetch_list=[avg_loss]) fetch_list=[avg_loss])
if step % 10 == 0: # record the training loss every 10 batches if step % 10 == 0: # record and print the training loss every 10 batches
plot_prompt.append(train_prompt, step, avg_loss_value[0]) plot_prompt.append(train_prompt, step, avg_loss_value[0])
plot_prompt.plot() plot_prompt.plot()
if step % 100 == 0: # record the test loss every 100 batches print("%s, Step %d, Cost %f" %
(train_prompt, step, avg_loss_value[0]))
if step % 100 == 0: # record and print the test loss every 100 batches
test_metics = train_test(executor=exe_test, test_metics = train_test(executor=exe_test,
program=test_program, program=test_program,
reader=test_reader, reader=test_reader,
...@@ -291,6 +300,8 @@ for pass_id in range(num_epochs): ...@@ -291,6 +300,8 @@ for pass_id in range(num_epochs):
feeder=feeder) feeder=feeder)
plot_prompt.append(test_prompt, step, test_metics[0]) plot_prompt.append(test_prompt, step, test_metics[0])
plot_prompt.plot() plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(test_prompt, step, test_metics[0]))
if test_metics[0] < 10.0: # stop training once the test loss meets the requirement if test_metics[0] < 10.0: # stop training once the test loss meets the requirement
break break
...@@ -316,6 +327,24 @@ inference_scope = fluid.core.Scope() ...@@ -316,6 +327,24 @@ inference_scope = fluid.core.Scope()
``` ```
### Inference ### Inference
Save the figure
```python
def save_result(points1, points2):
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x1 = [idx for idx in range(len(points1))]
y1 = points1
y2 = points2
l1 = plt.plot(x1, y1, 'r--', label='predictions')
l2 = plt.plot(x1, y2, 'g--', label='GT')
plt.plot(x1, y1, 'ro-', x1, y2, 'g+-')
plt.title('predictions VS GT')
plt.legend()
plt.savefig('./image/prediction_gt.png')
```
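With fluid.io.load_inference_model, the predictor loads the trained model from `params_dirname` and uses it to make predictions on data it has never seen. With fluid.io.load_inference_model, the predictor loads the trained model from `params_dirname` and uses it to make predictions on data it has never seen.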
```python ```python
...@@ -337,27 +366,7 @@ with fluid.scope_guard(inference_scope): ...@@ -337,27 +366,7 @@ with fluid.scope_guard(inference_scope):
results = infer_exe.run(inference_program, results = infer_exe.run(inference_program,
feed={feed_target_names[0]: numpy.array(infer_feat)}, feed={feed_target_names[0]: numpy.array(infer_feat)},
fetch_list=fetch_targets) # run inference fetch_list=fetch_targets) # run inference
``` # Print the predictions and labels, and visualize the results
Save the figure
```python
def save_result(points1, points2):
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x1 = [idx for idx in range(len(points1))]
y1 = points1
y2 = points2
l1 = plt.plot(x1, y1, 'r--', label='predictions')
l2 = plt.plot(x1, y2, 'g--', label='GT')
plt.plot(x1, y1, 'ro-', x1, y2, 'g+-')
plt.title('predictions VS GT')
plt.legend()
plt.savefig('./image/prediction_gt.png')
```
Print the predictions and labels, and visualize the results
```python
print("infer results: (House Price)") print("infer results: (House Price)")
for idx, val in enumerate(results[0]): for idx, val in enumerate(results[0]):
print("%d: %.2f" % (idx, val)) # print the prediction print("%d: %.2f" % (idx, val)) # print the prediction
...@@ -366,8 +375,10 @@ def save_result(points1, points2): ...@@ -366,8 +375,10 @@ def save_result(points1, points2):
for idx, val in enumerate(infer_label): for idx, val in enumerate(infer_label):
print("%d: %.2f" % (idx, val)) # print the label print("%d: %.2f" % (idx, val)) # print the label
save_result(results[0], infer_label) # save the figure save_result(results[0], infer_label) # save the figure
``` ```
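Because a random minibatch is selected as the training data for each iteration, the predictions will differ somewhat from run to run.
## Summary ## Summary
In this chapter, we used the Boston housing dataset to introduce the basic concepts of the linear regression model and showed how to implement training and testing with PaddlePaddle. Many models and techniques evolved from simple linear regression, so it is important to understand the principles and limitations of linear models. In this chapter, we used the Boston housing dataset to introduce the basic concepts of the linear regression model and showed how to implement training and testing with PaddlePaddle. Many models and techniques evolved from simple linear regression, so it is important to understand the principles and limitations of linear models.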
...@@ -380,4 +391,4 @@ save_result(results[0], infer_label) # 保存图片 ...@@ -380,4 +391,4 @@ save_result(results[0], infer_label) # 保存图片
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128. 4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://www.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
# Linear Regression # Linear Regression
Let us begin the tutorial with a classical problem called Linear Regression \[[1](#References)\]. In this chapter, we will train a model from a realistic dataset to predict home prices. Some important concepts in Machine Learning will be covered through this example. Let's start this tutorial with the classic Linear Regression \[[1](#References)\] model.
The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). For instructions on getting started with this book, see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). In this chapter, you will build a model that predicts house prices from a real dataset and learn several important machine learning concepts.
## Problem Setup The source code of this tutorial is in [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). New users should refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).
Suppose we have a dataset of $n$ real estate properties. Each real estate property will be referred to as **homes** in this chapter for clarity.
Each home is associated with $d$ attributes. The attributes describe characteristics such as the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic condition nearby.
In our problem setup, the attribute $x_{i,j}$ denotes the $j$th characteristic of the $i$th home. In addition, $y_i$ denotes the price of the $i$th home. Our task is to predict $y_i$ given a set of attributes $\{x_{i,1}, ..., x_{i,d}\}$. We assume that the price of a home is a linear combination of all of its attributes, namely, ## Background
Given a dataset of size $n$, ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$, where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes of the $i$-th sample and $y_i$ is the target to be predicted for that sample.
$$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, i=1,\ldots,n$$ The linear regression model assumes that the target $y_i$ can be described by a linear combination of the attributes, i.e.
where $\vec{\omega}$ and $b$ are the model parameters we want to estimate. Once they are learned, we will be able to predict the price of a home, given the attributes associated with it. We call this model **Linear Regression**. In other words, we want to regress a value against several values linearly. In practice, a linear model is often too simplistic to capture the real relationships between the variables. Yet, because Linear Regression is easy to train and analyze, it has been applied to a large number of real problems. As a result, it is an important topic in many classic Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\]. $$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldots,n$$
## Results Demonstration For example, in the house price prediction problem we are going to explore, $x_{ij}$ are the attributes describing house $i$ (such as the number of rooms, the number of nearby schools and hospitals, traffic conditions, etc.), and $y_i$ is the price of the house.
We first show the result of our model. The dataset [UCI Housing Data Set](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) is used to train a linear model to predict the home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of similar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. The dotted line represents points where $X=Y$. When reading the diagram, the closer the point is to the dotted line, better the model's prediction.
At first glance, this assumption seems too simple, and the true relationship among the variables is unlikely to be linear. However, because the linear regression model has a simple form and is easy to fit and analyze, it has been widely applied to practical problems. Many classic statistical learning and machine learning books \[[2,3,4](#references)\] devote a chapter to linear models.
## Result Demo
We used the Boston house price dataset obtained from [UCI Housing dataset](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) to train the model and make predictions. The scatter plot below shows the model's price predictions for some of the houses. The x-axis shows the median actual price of houses of the same type, and the y-axis shows the price predicted by the linear regression model from the features. When the two values are exactly equal, the point falls on the dotted line, so the more accurate the model's prediction, the closer the point is to the dotted line.
<p align="center"> <p align="center">
<img src = "image/predictions_en.png" width=400><br/> <img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/predictions.png?raw=true" width=400><br/>
Figure 1. Predicted Value V.S. Actual Value Figure 1. Predicted value vs. ground-truth value
</p> </p>
## Model Overview ## Model Overview
### Model Definition ### Model Definition
In the UCI Housing Data Set, there are 13 home attributes $\{x_{i,j}\}$ that are related to the median home price $y_i$, which we aim to predict. Thus, our model can be written as: In the Boston house price dataset, there are 14 values associated with each home: the first 13 describe various properties of the house, that is, $x_i$ in the model; the last value is the median price of the house, which we want to predict, that is, $y_i$ in the model.
Therefore, our model can be expressed as:
$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$ $$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$
where $\hat{Y}$ is the predicted value used to differentiate from actual value $Y$. The model learns parameters $\omega_1, \ldots, \omega_{13}, b$, where the entries of $\vec{\omega}$ are **weights** and $b$ is **bias**. $\hat{Y}$ represents the predicted result of the model and is used to distinguish it from the real value $Y$. The parameters to be learned by the model are: $\omega_1, \ldots, \omega_{13}, b$.
Now we need an objective to optimize, so that the learned parameters can make $\hat{Y}$ as close to $Y$ as possible. Let's refer to the concept of [Loss Function (Cost Function)](https://en.wikipedia.org/wiki/Loss_function). A loss function must output a non-negative value, given any pair of the actual value $y_i$ and the predicted value $\hat{y_i}$. This value reflects the magnitude of the model error. After building the model, we need to give it an optimization goal so that the learned parameters make the predicted value $\hat{Y}$ as close as possible to the true value $Y$. Here we introduce the concept of a loss function ([Loss Function](https://en.wikipedia.org/wiki/Loss_function), or Cost Function). Given the target value $y_{i}$ of any data sample and the predicted value $\hat{y_{i}}$ given by the model, the loss function outputs a non-negative real number, which is usually used to represent the model error.
For Linear Regression, the most common loss function is [Mean Square Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) which has the following form: For linear regression models, the most common loss function is the Mean Squared Error ([MSE](https://en.wikipedia.org/wiki/Mean_squared_error)), which is:
$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
That is, for a dataset of size $n$, MSE is the average value of the squared prediction errors. That is, for a test set of size $n$, $MSE$ is the mean of the squared prediction errors over the $n$ samples.
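As a small worked example of the MSE definition above, the sketch below computes it for three samples in plain Python. The helper name `mean_squared_error` and the prediction and target values are made up for illustration; they do not come from the housing dataset.

```python
def mean_squared_error(y_pred, y_true):
    """Return the mean of the squared differences between two equal-length lists."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
    return sum(squared_errors) / len(squared_errors)


predictions = [2.0, 4.0, 6.0]  # hypothetical model outputs
targets = [1.0, 4.0, 8.0]      # hypothetical ground-truth values

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4, and the mean is 5/3.
mse = mean_squared_error(predictions, targets)
print("MSE = %.4f" % mse)
```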
Gradient descent is generally used to optimize the loss function. Gradient descent is a first-order optimization algorithm. If $f(x)$ is defined and differentiable at a point $x_n$, then $f(x)$ decreases fastest at $x_n$ in the direction of the negative gradient $-\nabla f(x_n)$. We adjust $x$ repeatedly so that $f(x)$ approaches a local or global minimum. The adjustment is as follows:
$$x_{n+1}=x_n-\lambda\nabla f(x_n), \quad n\geq 0$$
where $\lambda$ is the learning rate. This method of adjustment is called gradient descent.
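The update rule can be tried out on a one-dimensional function. The following is a minimal sketch, assuming the toy objective $f(x) = (x - 3)^2$, whose gradient is $2(x - 3)$ and whose minimum is at $x = 3$; the function names, learning rate and step count are illustrative choices, not part of the tutorial's code.

```python
def gradient_descent(grad, x0, learning_rate=0.1, num_steps=100):
    """Repeatedly apply x <- x - learning_rate * grad(x), starting from x0."""
    x = x0
    for _ in range(num_steps):
        x = x - learning_rate * grad(x)
    return x


def grad_f(x):
    """Gradient of the toy objective f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_f, x0=0.0)
print("x after descent: %.6f" % x_min)
```

With this learning rate, each step shrinks the distance to the minimum by a factor of 0.8, so the iterate converges to 3. A learning rate that is too large (here, above 1.0) would make the iterates diverge instead.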
### Training Process ### Training Process
After setting up our model, there are several major steps to go through to train it: After defining the model structure, we will train the model through the following steps.
1. Initialize the parameters including the weights $\vec{\omega}$ and the bias $b$. For example, we can set their mean values as $0$s, and their standard deviations as $1$s.
2. Feedforward. Evaluate the network output and compute the corresponding loss. 1. Initialize the parameters, including the weights $\omega_i$ and the bias $b$ (e.g. with mean 0 and variance 1).
3. [Backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors. 2. Forward propagation: compute the network output and the loss function.
4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of epochs is reached. 3. [Backpropagation](https://en.wikipedia.org/wiki/Backpropagation): propagate the error backward from the output layer according to the loss function and update the parameters in the network.
4. Repeat steps 2~3 until the training error reaches the specified level or the number of training epochs reaches the set value.
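The four steps above can be sketched for a linear model in plain Python, without PaddlePaddle. This is only an illustration under stated assumptions: the function `train_linear_regression`, the toy data generated from $y = 2x_1 - x_2 + 0.5$, and the learning rate, epoch limit and loss threshold are all hypothetical choices, and the gradients are written out by hand for the MSE loss rather than obtained by automatic differentiation.

```python
import random


def train_linear_regression(features, targets, learning_rate=0.1,
                            max_epochs=5000, loss_threshold=1e-6, seed=0):
    rng = random.Random(seed)
    n, d = len(features), len(features[0])
    # Step 1: initialize the weights from a standard normal and the bias at 0.
    weights = [rng.gauss(0.0, 1.0) for _ in range(d)]
    bias = 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        # Step 2: forward pass, then the MSE loss.
        preds = [sum(w * x for w, x in zip(weights, row)) + bias
                 for row in features]
        errors = [p - t for p, t in zip(preds, targets)]
        loss = sum(e * e for e in errors) / n
        # Step 4: stop once the loss is below the threshold.
        if loss < loss_threshold:
            break
        # Step 3: gradients of the MSE with respect to each parameter,
        # followed by a gradient descent update.
        grad_w = [2.0 / n * sum(e * row[j] for e, row in zip(errors, features))
                  for j in range(d)]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias, loss


# Toy data generated from y = 2*x1 - 1*x2 + 0.5 (made up for illustration).
toy_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
         [1.0, 1.0], [0.5, -0.5], [-1.0, 0.5]]
toy_y = [2.0 * a - 1.0 * b + 0.5 for a, b in toy_x]

weights, bias, final_loss = train_linear_regression(toy_x, toy_y)
print(weights, bias, final_loss)
```

Because the toy targets are exactly linear in the inputs, the learned weights and bias approach the generating values; on real data such as the housing dataset the loss would level off above zero.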
## Dataset
### An Introduction of the Dataset ## Dataset
The UCI housing dataset has 506 instances. Each instance describes the attributes of a house in suburban Boston. The attributes are explained below: ### Dataset Introduction
The dataset consists of 506 lines, each containing information about a type of house in a suburb of Boston and the median price of that type of house. The meaning of each attribute is as follows:
| Attribute Name | Characteristic | Data Type | | Property Name | Explanation | Type |
| ------| ------ | ------ | | ------| ------ | ------ |
| CRIM | per capita crime rate by town | Continuous| CRIM | Per capita crime rate in the town | Continuous value |
| ZN | proportion of residential land zoned for lots over 25,000 sq.ft. | Continuous | | ZN | Proportion of residential land with an area of over 25,000 square feet | Continuous value |
| INDUS | proportion of non-retail business acres per town | Continuous | | INDUS | Proportion of non-retail commercial land | Continuous value |
| CHAS | Charles River dummy variable | Discrete, 1 if tract bounds river; 0 otherwise| CHAS | Whether it is adjacent to Charles River | Discrete value, 1=proximity; 0=not adjacent |
| NOX | nitric oxides concentration (parts per 10 million) | Continuous | NOX | Nitric Oxide Concentration | Continuous value |
| RM | average number of rooms per dwelling | Continuous | | RM | Average number of rooms per house | Continuous value |
| AGE | proportion of owner-occupied units built prior to 1940 | Continuous | | AGE | Proportion of self-use units built before 1940 | Continuous value |
| DIS | weighted distances to five Boston employment centres | Continuous | | DIS | Weighted Distance to 5 Job Centers in Boston | Continuous value |
| RAD | index of accessibility to radial highways | Continuous | | RAD | Accessibility Index to Radial Highway | Continuous value |
| TAX | full-value property-tax rate per \$10,000 | Continuous | | TAX | Tax Rate of Full-value Property | Continuous value |
| PTRATIO | pupil-teacher ratio by town | Continuous | | PTRATIO | Proportion of Student and Teacher | Continuous value |
| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | Continuous | | B | 1000(BK - 0.63)^2, where BK is black ratio | Continuous value |
| LSTAT | % lower status of the population | Continuous | LSTAT | Low-income population ratio | Continuous value |
| MEDV | Median value of owner-occupied homes in $1000's | Continuous | | MEDV | Median price of a similar home | Continuous value |
### Data Pre-processing
The last entry is the median home price. #### Continuous value and discrete value
Analyzing the data, we first find that of the 13 attributes, 12 are continuous and 1 is discrete (CHAS). Discrete values are often written as numbers such as 0, 1, and 2, but they mean something different from continuous values, because the difference between two discrete values carries no meaning. For example, if we use 0, 1, and 2 to represent red, green, and blue, we cannot infer that blue is farther from red than green is. So for a discrete attribute with $d$ possible values, we usually convert it into $d$ binary attributes that take the value 0 or 1, or map each possible value to a multidimensional vector. CHAS does not have this problem, however, since it is already a binary attribute.
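The conversion of a discrete attribute with $d$ possible values into $d$ binary attributes can be sketched as follows. The helper `one_hot` and the color categories are illustrative only; the housing dataset itself needs no such step because CHAS is already binary.

```python
def one_hot(value, categories):
    """Encode value as a list of 0/1 flags, one flag per known category."""
    if value not in categories:
        raise ValueError("unknown category: %r" % (value,))
    return [1 if value == category else 0 for category in categories]


colors = ["red", "green", "blue"]  # a discrete attribute with d = 3 values
encoded = [one_hot(color, colors) for color in ["blue", "red"]]
print(encoded)
```

In this encoding every pair of distinct colors differs in exactly two positions, so no color is artificially closer to another.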
### Preprocessing #### Normalization of attributes
Another easily observed fact is that the value ranges of the attributes differ greatly (as shown in Figure 2). For example, attribute B ranges over [0.32, 396.90], while attribute NOX ranges over [0.3850, 0.8170]. This calls for a common operation: normalization. The goal of normalization is to scale the values of each attribute into a similar range, such as [-0.5, 0.5]. Here we use a very common method: subtract the mean and divide by the range of values.
#### Continuous and Discrete Data There are at least three reasons for implementing normalization (or [Feature scaling](https://en.wikipedia.org/wiki/Feature_scaling)):
We define a feature vector of length 13 for each home, where each entry corresponds to an attribute. Our first observation is that, among the 13 dimensions, there are 12 continuous dimensions and 1 discrete dimension. - A value range that is too large or too small can cause floating-point overflow or underflow during computation.
Note that although a discrete value is also written as numeric values such as 0, 1, or 2, its meaning differs from a continuous value drastically. The linear difference between two discrete values has no meaning. For example, suppose $0$, $1$, and $2$ are used to represent colors *Red*, *Green*, and *Blue* respectively. Judging from the numeric representation of these colors, *Red* differs more from *Blue* than it does from *Green*. Yet in actuality, it is not true that extent to which the color *Blue* is different from *Red* is greater than the extent to which *Green* is different from *Red*. Therefore, when handling a discrete feature that has $d$ possible values, we usually convert it to $d$ new features where each feature takes a binary value, $0$ or $1$, indicating whether the original value is absent or present. Alternatively, the discrete features can be mapped onto a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary discrete value, we do not need to do any preprocessing. - Different value ranges give the attributes different importance to the model (at least in the early stage of training), and this implicit assumption is often unreasonable. This can make optimization difficult and training much slower.
#### Feature Normalization - Many machine learning techniques and models (such as L1/L2 regularization and the Vector Space Model) assume that all attributes have roughly zero mean and similar value ranges.
We also observe a huge difference among the value ranges of the 13 features (Figure 2). For instance, the values of feature *B* fall in $[0.32, 396.90]$, whereas those of feature *NOX* has a range of $[0.3850, 0.8170]$. An effective optimization would require data normalization. The goal of data normalization is to scale the values of each feature into roughly the same range, perhaps $[-0.5, 0.5]$. Here, we adopt a popular normalization technique where we subtract the mean value from the feature value and divide the result by the width of the original range.
There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling):
- A value range that is too large or too small might cause floating number overflow or underflow during computation.
- Different value ranges might result in varying *importances* of different features to the model (at least in the beginning of the training process). This assumption about the data is often unreasonable, making the optimization difficult, which in turn results in increased training time.
- Many machine learning techniques or models (e.g., *L1/L2 regularization* and *Vector Space Model*) assumes that all the features have roughly zero means and their value ranges are similar.
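The normalization described above (subtract the mean, then divide by the width of the value range) can be sketched column by column in plain Python. The helper `normalize_columns` is a hypothetical name, and the three sample rows are made up: they reuse the minimum and maximum of B and NOX quoted above, plus a midpoint row added for illustration.

```python
def normalize_columns(rows):
    """Scale each column to (value - column mean) / (column max - column min)."""
    num_cols = len(rows[0])
    normalized = [list(row) for row in rows]
    for j in range(num_cols):
        column = [row[j] for row in rows]
        mean = sum(column) / len(column)
        value_range = max(column) - min(column)
        for i in range(len(rows)):
            # A constant column has zero range; map it to 0.0 to avoid dividing by zero.
            normalized[i][j] = ((rows[i][j] - mean) / value_range
                                if value_range else 0.0)
    return normalized


# Columns: B, NOX. The third row is an invented midpoint sample.
sample = [[0.32, 0.385], [396.90, 0.817], [198.61, 0.601]]
scaled = normalize_columns(sample)
for row in scaled:
    print(["%.3f" % v for v in row])
```

After this transformation every column spans a range of width 1 and has zero mean, so B and NOX end up on the same scale despite their very different raw ranges.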
<p align="center"> <p align="center">
<img src = "image/ranges_en.png" width=550><br/> <img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/ranges.png?raw=true" width=550><br/>
Figure 2. The value ranges of the features Figure 2. Value range of attributes for all dimensions
</p> </p>
#### Prepare Training and Test Sets #### Organizing training set and testing set
We split the dataset in two, one for adjusting the model parameters, namely, for training the model, and the other for testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. We split the dataset into two parts: one is used to adjust the parameters of the model, that is, to train the model, and the error of the model on this part is called the **training error**; the other is used for testing, and the error of the model on this part is called the **test error**. The goal of training is to predict unseen data by finding regularities in the training data, so the test error is a better indicator of the model's performance. When choosing the split ratio, we should weigh two factors: more training data reduces the variance of the estimated parameters, giving a more reliable model, while more test data reduces the variance of the test error, giving a more credible test error. The split ratio in our example is $8:2$.
When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process, such as the number of layers in the network. Because hyperparameters are not part of the model parameters, they cannot be trained using the same loss function. Thus we will try several sets of hyperparameters to train several models and cross-validate them on the validation set to pick the best one; finally, the selected trained model is tested on the test set. Because our model is relatively simple, we will omit this validation process.
When training more complex models, we often need one more dataset: the validation set. Complex models usually have hyperparameters ([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization)) that need to be tuned, so we train several models with different hyperparameter combinations, compare their performance on the validation set to select the best combination, and finally evaluate the model trained with those hyperparameters on the test set. Since the model in this chapter is relatively simple, we skip this process for now.
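The split described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `split_dataset`, the shuffling step, the fixed seed, and the placeholder data are assumptions and are not part of the tutorial's `train.py`, which slices the data in its original order.

```python
import numpy as np

def split_dataset(data, train_ratio=0.8, val_ratio=0.0, seed=0):
    """Shuffle the rows of `data` and split them into train/validation/test."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(data))
    n_train = int(len(data) * train_ratio)
    n_val = int(len(data) * val_ratio)
    train = data[order[:n_train]]
    val = data[order[n_train:n_train + n_val]]
    test = data[order[n_train + n_val:]]
    return train, val, test

# Placeholder data: 100 samples, each with 13 features plus 1 target.
data = np.arange(100 * 14, dtype="float32").reshape(100, 14)

# The 8:2 train/test split used in this chapter (no validation set).
train, val, test = split_dataset(data, train_ratio=0.8)
# A three-way split, as one might use when tuning hyperparameters.
train3, val3, test3 = split_dataset(data, train_ratio=0.6, val_ratio=0.2)
print(train.shape, val.shape, test.shape)
print(train3.shape, val3.shape, test3.shape)
```

With a validation set, each candidate hyperparameter setting would be trained on `train3`, compared on `val3`, and only the selected model evaluated once on `test3`.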
## Training ## Training
`fit_a_line/trainer.py` demonstrates the training using [PaddlePaddle](http://paddlepaddle.org). `fit_a_line/train.py` demonstrates the overall process of training.
### Datafeeder Configuration ### Configuring the Data feeder
Our program starts with importing necessary packages: First we import the libraries:
```python ```python
from __future__ import print_function
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import numpy import numpy
from __future__ import print_function import math
try: import sys
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
``` ```
We encapsulated the [UCI Housing Data Set](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) in our Python module `uci_housing`. This module can We load the [UCI Housing dataset](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) through the `uci_housing` module.
1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if you haven't yet, and The `uci_housing` module encapsulates:
2. [preprocess](#preprocessing) the dataset.
1. Data download. The downloaded data is saved in `~/.cache/paddle/dataset/uci_housing/housing.data`.
2. [Data preprocessing](#preprocessing).
We define data feeders for test and train. The feeder reads a `BATCH_SIZE` of data each time and feeds them to the training/testing process. If the user wants some randomness on the data order, she can define both a `BATCH_SIZE` and a `buf_size`. That way the datafeeder will yield the first `BATCH_SIZE` data out of a shuffle of the first `buf_size` data. Next we define the data feeder for training. The feeder reads one batch of `BATCH_SIZE` samples at a time. If you want the order to be random, you can define both a batch size and a buffer size; the feeder then draws a batch-size number of samples at random from the buffer each time.
```python ```python
BATCH_SIZE = 20 BATCH_SIZE = 20
...@@ -149,169 +156,237 @@ test_reader = paddle.batch( ...@@ -149,169 +156,237 @@ test_reader = paddle.batch(
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
``` ```
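To make the buffer-then-batch behavior concrete, here is a minimal pure-Python sketch. It is not PaddlePaddle's implementation: `buffered_shuffle`, `make_batches`, and `toy_reader` are hypothetical names chosen for illustration, and only the general idea (shuffle within a buffer of `buf_size` samples, then group into batches) is taken from the text.

```python
import random

def buffered_shuffle(reader, buf_size, seed=None):
    """Return a reader that shuffles samples within chunks of buf_size."""
    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for s in buf:
                    yield s
                buf = []
        rng.shuffle(buf)  # flush whatever is left in the buffer
        for s in buf:
            yield s
    return shuffled_reader

def make_batches(reader, batch_size):
    """Return a reader that groups samples into lists of batch_size."""
    def batched_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # last, possibly smaller, batch
            yield current
    return batched_reader

def toy_reader():
    for i in range(10):
        yield i

batches = list(make_batches(buffered_shuffle(toy_reader, buf_size=4, seed=0),
                            batch_size=3)())
print([len(b) for b in batches])
```

A larger buffer gives a more thorough shuffle at the cost of memory; a buffer at least as large as the dataset shuffles all samples together.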
### Train Program Configuration If you want to read data directly from a \*.txt file, you can use the following approach.
`train_program` sets up the network structure of this current training model. For linear regression, it is simply a fully connected layer from the input to the output. More complex structures like CNN and RNN will be introduced in later chapters. The `train_program` must return an avg_loss as its first returned parameter because it is needed in backpropagation. feature_names = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'convert'
]
```python feature_num = len(feature_names)
def train_program():
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
# feature vector of length 13 data = numpy.fromfile(filename, sep=' ') # Read primary data from file
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
loss = fluid.layers.square_error_cost(input=y_predict, label=y) data = data.reshape(data.shape[0] // feature_num, feature_num)
avg_loss = fluid.layers.mean(loss)
return avg_loss maximums, minimums, avgs = data.max(axis=0), data.min(axis=0), data.sum(axis=0)/data.shape[0]
```
### Optimizer Function Configuration for i in six.moves.range(feature_num-1):
data[:, i] = (data[:, i] - avgs[i]) / (maximums[i] - minimums[i]) # six.moves works on both Python 2 and Python 3
In the following `SGD` optimizer, `learning_rate` specifies the learning rate in the optimization procedure. ratio = 0.8 # fraction of the data used for training; the rest is used for testing
```python offset = int(data.shape[0]\*ratio)
def optimizer_program():
return fluid.optimizer.SGD(learning_rate=0.001)
```
### Specify Place train_data = data[:offset]
Specify your training environment, you should specify if the training is on CPU or GPU. test_data = data[offset:]
```python train_reader = paddle.batch(
use_cuda = False paddle.reader.shuffle(
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() train_data, buf_size=500),
``` batch_size=BATCH_SIZE)
### Create Trainer test_reader = paddle.batch(
paddle.reader.shuffle(
test_data, buf_size=500),
batch_size=BATCH_SIZE)
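The normalization step in the snippet above centers each feature on its mean and divides by its range. A small self-contained sketch of that step follows; the function name `normalize_features`, the toy array, and the guard against constant columns are illustrative assumptions rather than part of the original script.

```python
import numpy as np

def normalize_features(data, num_features):
    """Mean-center the first num_features columns and scale each by (max - min)."""
    data = data.astype("float64").copy()
    maximums = data.max(axis=0)
    minimums = data.min(axis=0)
    avgs = data.mean(axis=0)
    for i in range(num_features):
        span = maximums[i] - minimums[i]
        if span > 0:  # added safeguard: leave constant columns unchanged
            data[:, i] = (data[:, i] - avgs[i]) / span
    return data

# Toy data: two feature columns followed by one target column.
raw = np.array([[1.0, 10.0, 5.0],
                [2.0, 30.0, 6.0],
                [3.0, 20.0, 7.0]])
scaled = normalize_features(raw, num_features=2)
print(scaled)
```

The target column is left untouched, which mirrors the loop over `feature_num-1` columns in the snippet above.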
The trainer will take the `train_program` as input. ### Configure Program for Training
The training program defines the network structure of the model being trained. For linear regression, this is a single fully connected layer from input to output. More complex structures, such as convolutional and recurrent neural networks, are introduced in later chapters. The training program must produce the mean error (`avg_loss`) as its first result, because the mean error is what backpropagation uses.
```python ```python
trainer = Trainer( x = fluid.layers.data(name='x', shape=[13], dtype='float32') # define shape and data type of input
train_func=train_program, y = fluid.layers.data(name='y', shape=[1], dtype='float32') # define shape and data type of output
place=place, y_predict = fluid.layers.fc(input=x, size=1, act=None) # fully connected layer connecting input and output
optimizer_func=optimizer_program)
main_program = fluid.default_main_program() # get the default (global) main program
startup_program = fluid.default_startup_program() # get the default (global) startup program
cost = fluid.layers.square_error_cost(input=y_predict, label=y) # squared error between the prediction and the label
avg_loss = fluid.layers.mean(cost) # average the squared error to get the mean loss
``` ```
For details, please refer to:
[fluid.default_main_program](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#default-main-program)
[fluid.default_startup_program](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#default-startup-program)
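For readers who want to see what this network computes, the following NumPy sketch reproduces the arithmetic of a size-1 fully connected layer without activation, followed by the mean of the squared errors. The helper names and the toy numbers are assumptions for illustration; this is not how PaddlePaddle implements these layers internally.

```python
import numpy as np

def predict(x, w, b):
    """Linear layer with one output and no activation: y_hat = x . w + b."""
    return x.dot(w) + b

def mean_squared_error(y_pred, y_true):
    """Mean of the element-wise squared errors."""
    return float(np.mean((y_pred - y_true) ** 2))

# Toy batch: 2 samples with 2 features each (the tutorial uses 13 features).
x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
w = np.array([[0.5],
              [-1.0]])
b = 1.0
y_true = np.array([[0.0],
                   [-2.0]])

y_pred = predict(x, w, b)              # [[-0.5], [-1.5]]
loss = mean_squared_error(y_pred, y_true)
print(loss)                            # 0.25
```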
### Feeding Data ### Optimizer Function Configuration
PaddlePaddle provides the In the `SGD optimizer` below, `learning_rate` is the learning rate, which affects how quickly network training converges.
[reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
for loading the training data. A reader may return multiple columns, and we need a Python dictionary to specify the mapping from column index to data layers.
```python ```python
feed_order=['x', 'y'] #Clone main_program to get test_program
# Some operators behave differently in training and testing. For example, batch_norm uses the for_test argument to tell whether the program is for training or for testing.
# This API does not delete any operators; call it before backward and optimization.
test_program = main_program.clone(for_test=True)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_loss)
``` ```
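As a rough picture of what one optimizer step does for this model, the sketch below applies the gradient descent update to the weights and bias of a linear model under mean squared error. The function names, the synthetic data, and the learning rate of 0.1 are assumptions chosen so the toy problem converges quickly; the tutorial itself uses `learning_rate=0.001` with PaddlePaddle's own optimizer.

```python
import numpy as np

def mse(w, b, x, y):
    return float(np.mean((x.dot(w) + b - y) ** 2))

def sgd_step(w, b, x, y, learning_rate):
    """One gradient step on the mean squared error of y_hat = x . w + b."""
    n = x.shape[0]
    err = x.dot(w) + b - y                    # shape (n, 1)
    grad_w = 2.0 / n * x.T.dot(err)           # d(MSE)/dw
    grad_b = 2.0 / n * float(err.sum())       # d(MSE)/db
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Synthetic data generated from y = 3x + 1 (hypothetical, noise-free).
rng = np.random.RandomState(0)
x = rng.rand(50, 1)
y = 3.0 * x + 1.0

w, b = np.zeros((1, 1)), 0.0
loss_before = mse(w, b, x, y)
for _ in range(500):
    w, b = sgd_step(w, b, x, y, learning_rate=0.1)
loss_after = mse(w, b, x, y)
print(loss_before, loss_after)
```

In mini-batch training, `x` and `y` would be one batch from the reader rather than the full dataset, so each step follows a noisy estimate of the gradient.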
Moreover, an event handler is provided to print the training progress: ### Define Training Place
```python We can define whether an operation runs on the CPU or on the GPU.
# Specify the directory to save the parameters
params_dirname = "fit_a_line.inference.model"
```python
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() # device on which the executor runs
train_title = "Train cost" # The executor takes a program, adds feed and fetch operators to it based on the feed map and fetch list, and runs it when run(...) is called. Call close() to shut the executor down.
test_title = "Test cost" exe = fluid.Executor(place)
step = 0 ```
For details, please refer to:
[fluid.executor](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#permalink-15-executor)
# event_handler prints training and testing info ### Create Training Process
def event_handler(event): Training needs a training program and a few required parameters. We also define a function that reports the test error during training. The required parameters are executor, program, reader, feeder and fetch_list: executor is the executor created earlier; program is the program run by the executor, which was created earlier and defaults to default_main_program if not given; reader is the data reader; feeder supplies the forward input variables; and fetch_list lists the variables (or their names) that the user wants to fetch.
global step
if isinstance(event, EndStepEvent):
if step % 10 == 0: # record a train cost every 10 batches
print("%s, Step %d, Cost %f" % (train_title, step, event.metrics[0]))
if step % 100 == 0: # record a test cost every 100 batches
test_metrics = trainer.test(
reader=test_reader, feed_order=feed_order)
print("%s, Step %d, Cost %f" % (test_title, step, test_metrics[0]))
if test_metrics[0] < 10.0:
# If the accuracy is good enough, we can stop the training.
print('loss is less than 10.0, stop')
trainer.stop()
step += 1
if isinstance(event, EndEpochEvent): ```python
if event.epoch % 10 == 0: num_epochs = 100
# We can save the trained parameters for the inferences later
if params_dirname is not None: def train_test(executor, program, reader, feeder, fetch_list):
trainer.save_params(params_dirname) accumulated = 1 * [0]
count = 0
for data_test in reader():
outs = executor.run(program=program,
feed=feeder.feed(data_test),
fetch_list=fetch_list)
accumulated = [x_c[0] + x_c[1][0] for x_c in zip(accumulated, outs)] # accumulate the loss over the test batches
count += 1 # count the number of test batches
return [x_d / count for x_d in accumulated] # compute the mean loss per batch
``` ```
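One detail worth noting about averaging per-batch losses: dividing the summed batch losses by the number of batches equals the per-sample mean only when all batches have the same size. The sketch below contrasts the two; the function names and the numbers are made up for illustration.

```python
def mean_of_batch_means(batch_losses):
    """Average the per-batch mean losses, one vote per batch."""
    return sum(batch_losses) / len(batch_losses)

def sample_weighted_mean(batch_losses, batch_sizes):
    """Average the per-batch mean losses, weighted by batch size."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# Hypothetical per-batch mean losses; the last batch is smaller.
batch_losses = [10.0, 20.0, 30.0]
batch_sizes = [20, 20, 10]

print(mean_of_batch_means(batch_losses))                # 20.0
print(sample_weighted_mean(batch_losses, batch_sizes))  # 18.0
```

For monitoring training progress the difference is usually small, but it matters when a reported test error must be exact.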
### Start Training ### Train Main Loop
We now can start training by calling `trainer.train()`. Specify the directory in which to save the model and initialize the executor.
```python ```python
%matplotlib inline %matplotlib inline
params_dirname = "fit_a_line.inference.model"
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe.run(startup_program)
train_prompt = "train cost"
test_prompt = "test cost"
from paddle.utils.plot import Ploter
plot_prompt = Ploter(train_prompt, test_prompt)
step = 0
# The training could take up to a few minutes. exe_test = fluid.Executor(place)
trainer.train(
reader=train_reader,
num_epochs=100,
event_handler=event_handler,
feed_order=feed_order)
``` ```
PaddlePaddle provides a reader mechanism for reading training data. A reader yields multiple columns of data at a time, so we need a Python list to specify the order in which they are fed. We build a loop that trains until the result is good enough or the maximum number of epochs is reached.
If the current iteration matches the interval set for saving parameters, the trained parameters can be saved to `params_dirname`.
Set up the main training loop.
```python
for pass_id in range(num_epochs):
for data_train in train_reader():
avg_loss_value, = exe.run(main_program,
feed=feeder.feed(data_train),
fetch_list=[avg_loss])
if step % 10 == 0: # record and output train loss for every 10 batches.
plot_prompt.append(train_prompt, step, avg_loss_value[0])
plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(train_prompt, step, avg_loss_value[0]))
if step % 100 == 0: # record and output test loss for every 100 batches.
test_metics = train_test(executor=exe_test,
program=test_program,
reader=test_reader,
fetch_list=[avg_loss.name],
feeder=feeder)
plot_prompt.append(test_prompt, step, test_metics[0])
plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(test_prompt, step, test_metics[0]))
if test_metics[0] < 10.0: # stop training once the test loss is low enough
break
![png](./image/train_and_test.png) step += 1
## Inference if math.isnan(float(avg_loss_value[0])):
sys.exit("got NaN loss, training failed.")
Initialize the Inferencer with the inference_program and the params_dirname, which is where we saved our params # save the trained model to the directory specified earlier
if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, ['x'], [y_predict], exe)
```
### Setup the Inference Program ## Predict
Prediction requires the trained parameters, which are stored in `params_dirname`.
Similar to the trainer.train, the Inferencer needs to take an inference_program to do inference. ### Prepare Environment for Prediction
Prune the train_program to only have the y_predict. Similar to training, the predictor needs a program for prediction. We can slightly modify our training program to include the prediction value.
```python ```python
def inference_program(): infer_exe = fluid.Executor(place)
x = fluid.layers.data(name='x', shape=[13], dtype='float32') inference_scope = fluid.core.Scope()
y_predict = fluid.layers.fc(input=x, size=1, act=None)
return y_predict
``` ```
### Infer ### Predict
Inferencer will load the trained model from `params_dirname` and use it to infer the unseen data.
Save the plot
```python ```python
inferencer = Inferencer( def save_result(points1, points2):
infer_func=inference_program, param_path=params_dirname, place=place) import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x1 = [idx for idx in range(len(points1))]
y1 = points1
y2 = points2
l1 = plt.plot(x1, y1, 'r--', label='predictions')
l2 = plt.plot(x1, y2, 'g--', label='GT')
plt.plot(x1, y1, 'ro-', x1, y2, 'g+-')
plt.title('predictions VS GT')
plt.legend()
plt.savefig('./image/prediction_gt.png')
```
batch_size = 10 With fluid.io.load_inference_model, the predictor loads the trained model from `params_dirname` to make predictions on unseen data.
test_reader = paddle.batch(paddle.dataset.uci_housing.test(),batch_size=batch_size)
test_data = next(test_reader())
test_x = numpy.array([data[0] for data in test_data]).astype("float32")
test_y = numpy.array([data[1] for data in test_data]).astype("float32")
results = inferencer.infer({'x': test_x}) ```python
with fluid.scope_guard(inference_scope):
[inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(params_dirname, infer_exe) # load the trained inference model
batch_size = 10
infer_reader = paddle.batch(
paddle.dataset.uci_housing.test(), batch_size=batch_size) # prepare test dataset
infer_data = next(infer_reader())
infer_feat = numpy.array(
[data[0] for data in infer_data]).astype("float32") # extract data in test dataset
infer_label = numpy.array(
[data[1] for data in infer_data]).astype("float32") # extract label in test dataset
assert feed_target_names[0] == 'x'
results = infer_exe.run(inference_program,
feed={feed_target_names[0]: numpy.array(infer_feat)},
fetch_list=fetch_targets) # predict
#print predict result and label and visualize the result
print("infer results: (House Price)")
for idx, val in enumerate(results[0]):
print("%d: %.2f" % (idx, val)) # print predict result
print("\nground truth:")
for idx, val in enumerate(infer_label):
print("%d: %.2f" % (idx, val)) # print label
save_result(results[0], infer_label) # save picture
```
print("infer results: (House Price)")
for idx, val in enumerate(results[0]):
print("%d: %.2f" % (idx, val))
print("\nground truth:")
for idx, val in enumerate(test_y):
print("%d: %.2f" % (idx, val))
```
## Summary ## Summary
In this chapter, we used the Boston house price dataset to introduce the basic concepts of the linear regression model and how to implement training and testing with PaddlePaddle. Many models and techniques are derived from linear regression, so it is important to understand its principles and limitations.
This chapter introduces *Linear Regression* and how to train and test this model with PaddlePaddle, using the UCI Housing Data Set. Because a large number of more complex models and techniques are derived from linear regression, it is important to understand its underlying theory and limitation. <a name="References"></a>
## References ## References
1. https://en.wikipedia.org/wiki/Linear_regression 1. https://en.wikipedia.org/wiki/Linear_regression
2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. Springer, Berlin: Springer series in statistics, 2001. 2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. Springer, Berlin: Springer series in statistics, 2001.
3. Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012. 3. Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012.
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128. 4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
### This file is only used for continuous evaluation test!
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi
train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True, desc='train cost')
test_cost_kpi = CostKpi('test_cost', 0.02, 0, actived=True, desc='test cost')
tracking_kpis = [train_cost_kpi, test_cost_kpi]
def parse_log(log):
for line in log.split('\n'):
fs = line.strip().split('\t')
print(fs)
if len(fs) == 3 and fs[0] == 'kpis':
print("-----%s" % fs)
kpi_name = fs[1]
kpi_value = float(fs[2])
yield kpi_name, kpi_value
def log_to_ce(log):
kpi_tracker = {}
for kpi in tracking_kpis:
kpi_tracker[kpi.name] = kpi
for (kpi_name, kpi_value) in parse_log(log):
print(kpi_name, kpi_value)
kpi_tracker[kpi_name].add_record(kpi_value)
kpi_tracker[kpi_name].persist()
if __name__ == '__main__':
log = sys.stdin.read()
log_to_ce(log)
01.fit_a_line/image/ranges.png (binary image replaced: 8.6 KB → 6.6 KB)
...@@ -45,10 +45,21 @@ ...@@ -45,10 +45,21 @@
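The source code for this tutorial is in [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). First-time users should refer to the [Book documentation instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书). The source code for this tutorial is in [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). First-time users should refer to the [Book documentation instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
### Notes:
1. Hardware requirements:
This tutorial can run on either CPU or GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, note that the GPU environment of the default image provided here is CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in the document and in the script:
Note: to make this tutorial easier to read and use, we split and adjusted the code of train.py and placed it in this document. The code here produces the same results as train.py; you can run [train.py](https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/train.py) directly to verify this.
## Background ## Background
Given a dataset of size $n$, ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$, where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes of the $i$-th sample and $y_i$ is the target to be predicted for that sample, the linear regression model assumes that the target $y_i$ can be described by a linear combination of the attributes, that is, Given a dataset of size $n$, ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$, where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes of the $i$-th sample and $y_i$ is the target to be predicted for that sample, the linear regression model assumes that the target $y_i$ can be described by a linear combination of the attributes, that is,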
$$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldots,n$$
<p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/formula_fit_a_line_1.png?raw=true" width=550><br/>
</p>
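For example, in the house price prediction problem we are about to model, $x_{ij}$ are the attributes describing house $i$ (such as the number of rooms, the number of nearby schools and hospitals, and traffic conditions), and $y_i$ is the price of the house. For example, in the house price prediction problem we are about to model, $x_{ij}$ are the attributes describing house $i$ (such as the number of rooms, the number of nearby schools and hospitals, and traffic conditions), and $y_i$ is the price of the house.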
...@@ -67,7 +78,9 @@ $$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldo ...@@ -67,7 +78,9 @@ $$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldo
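In the Boston house price dataset, 14 values are associated with each house: the first 13 describe various properties of the house, i.e. $x_i$ in the model; the last is the median price of that type of house, which we want to predict, i.e. $y_i$ in the model. Our model can therefore be written as: In the Boston house price dataset, 14 values are associated with each house: the first 13 describe various properties of the house, i.e. $x_i$ in the model; the last is the median price of that type of house, which we want to predict, i.e. $y_i$ in the model. Our model can therefore be written as: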
$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/formula_fit_a_line_2.png?raw=true" width=350><br/>
</p>
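$\hat{Y}$ denotes the model's prediction, to distinguish it from the true value $Y$. The parameters the model has to learn are $\omega_1, \ldots, \omega_{13}, b$. $\hat{Y}$ denotes the model's prediction, to distinguish it from the true value $Y$. The parameters the model has to learn are $\omega_1, \ldots, \omega_{13}, b$.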
...@@ -75,13 +88,17 @@ $\hat{Y}$ 表示模型的预测结果,用来和真实值$Y$区分。模型要 ...@@ -75,13 +88,17 @@ $\hat{Y}$ 表示模型的预测结果,用来和真实值$Y$区分。模型要
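For linear regression, the most common loss function is the mean squared error (Mean Squared Error, [MSE](https://en.wikipedia.org/wiki/Mean_squared_error)), which has the form: For linear regression, the most common loss function is the mean squared error (Mean Squared Error, [MSE](https://en.wikipedia.org/wiki/Mean_squared_error)), which has the form: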
$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/formula_fit_a_line_3.png?raw=true" width=200><br/>
</p>
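That is, for a test set of size $n$, $MSE$ is the mean of the squared prediction errors over the $n$ samples. That is, for a test set of size $n$, $MSE$ is the mean of the squared prediction errors over the $n$ samples.
The loss function is usually optimized with gradient descent, a first-order optimization algorithm. If $f(x)$ is defined and differentiable at the point $x_n$, then $f(x)$ decreases fastest at $x_n$ along the negative gradient direction $-\nabla f(x_n)$. We repeatedly adjust $x$ so that $f(x)$ approaches a global or local minimum, using the update: The loss function is usually optimized with gradient descent, a first-order optimization algorithm. If $f(x)$ is defined and differentiable at the point $x_n$, then $f(x)$ decreases fastest at $x_n$ along the negative gradient direction $-\nabla f(x_n)$. We repeatedly adjust $x$ so that $f(x)$ approaches a global or local minimum, using the update:
$$x_{n+1}=x_n-\lambda\nabla f(x_n), n \geq 0$$ <p align="center">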
<img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/formula_fit_a_line_4.png?raw=true" width=250><br/>
</p>
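where λ is the learning rate. This way of adjusting $x$ is called gradient descent. where λ is the learning rate. This way of adjusting $x$ is called gradient descent.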
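As a worked illustration of the update rule, the sketch below minimizes the one-dimensional function $f(x) = (x-3)^2$, whose gradient is $2(x-3)$. The function, starting point, learning rate, and step count are all arbitrary choices made for this example.

```python
def gradient_descent(grad, x0, learning_rate, num_steps):
    """Repeatedly apply x <- x - learning_rate * grad(x)."""
    x = x0
    for _ in range(num_steps):
        x = x - learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0),
                         x0=0.0, learning_rate=0.1, num_steps=50)
print(x_min)
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; a learning rate above 1.0 would make the iterates diverge for this function.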
...@@ -143,17 +160,17 @@ $$x_n+1=x_n-λ▽f(x), n≧0$$ ...@@ -143,17 +160,17 @@ $$x_n+1=x_n-λ▽f(x), n≧0$$
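## Training ## Training
`fit_a_line/trainer.py` demonstrates the overall training process. `fit_a_line/train.py` demonstrates the overall training process.
### Configuring the Data Feeder (Datafeeder) ### Configuring the Data Feeder (Datafeeder)
First we import the necessary libraries: First we import the necessary libraries: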
```python ```python
from __future__ import print_function
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import numpy import numpy
import math import math
import sys import sys
from __future__ import print_function
``` ```
We load the [UCI Housing Data Set](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) through the uci_housing module. We load the [UCI Housing Data Set](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) through the uci_housing module.
...@@ -161,7 +178,7 @@ from __future__ import print_function ...@@ -161,7 +178,7 @@ from __future__ import print_function
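The uci_housing module encapsulates: The uci_housing module encapsulates:
1. Data download. The downloaded data is saved in ~/.cache/paddle/dataset/uci_housing/housing.data. 1. Data download. The downloaded data is saved in ~/.cache/paddle/dataset/uci_housing/housing.data.
2. [Data preprocessing](#数据预处理). 2. Data preprocessing.
Next we define the data feeder used for training. The feeder reads one batch of `BATCH_SIZE` samples at a time. If you want some randomness, you can define both a batch size and a buffer size; the feeder then draws a batch-size number of samples at random from the buffer each time. Next we define the data feeder used for training. The feeder reads one batch of `BATCH_SIZE` samples at a time. If you want some randomness, you can define both a batch size and a buffer size; the feeder then draws a batch-size number of samples at random from the buffer each time.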
...@@ -205,14 +222,18 @@ train_data = data[:offset] ...@@ -205,14 +222,18 @@ train_data = data[:offset]
test_data = data[offset:] test_data = data[offset:]
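def reader(data):
    # paddle.reader.shuffle expects a callable that returns an iterator
    def _reader():
        for d in data:
            yield d[:-1], d[-1:]  # features (all but the last column), label (last column)
    return _reader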
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
train_data, buf_size=500), reader(train_data), buf_size=500),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
test_reader = paddle.batch( test_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
test_data, buf_size=500), reader(test_data), buf_size=500),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
### Configuring the Training Program ### Configuring the Training Program
...@@ -238,13 +259,14 @@ avg_loss = fluid.layers.mean(cost) # 对方差求均值,得到平均损失 ...@@ -238,13 +259,14 @@ avg_loss = fluid.layers.mean(cost) # 对方差求均值,得到平均损失
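In the `SGD optimizer` below, `learning_rate` is the learning rate, which affects how quickly network training converges. In the `SGD optimizer` below, `learning_rate` is the learning rate, which affects how quickly network training converges.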
```python ```python
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_loss)
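# Clone main_program to get test_program # Clone main_program to get test_program
# Some operators behave differently in training and testing, e.g. batch_norm; the for_test argument tells whether the program is for training or for testing # Some operators behave differently in training and testing, e.g. batch_norm; the for_test argument tells whether the program is for training or for testing
# This API does not delete any operators; call it before backward and optimization # This API does not delete any operators; call it before backward and optimization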
test_program = main_program.clone(for_test=True) test_program = main_program.clone(for_test=True)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_loss)
``` ```
### Defining the Execution Place ### Defining the Execution Place
...@@ -262,7 +284,7 @@ exe = fluid.Executor(place) ...@@ -262,7 +284,7 @@ exe = fluid.Executor(place)
[fluid.executor](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#permalink-15-executor) [fluid.executor](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#permalink-15-executor)
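### Creating the Training Process ### Creating the Training Process
Training needs a training program and a few required parameters; we also define a function that reports the test error during training. The required parameters are executor, program, reader, feeder and fetch_list: executor is the executor created earlier; program is the program run by the executor, which was created earlier and defaults to defalut_main_program if not given; reader is the data reader; feeder supplies the forward input variables; fetch_list lists the variables (or their names) that the user wants to fetch. Training needs a training program and a few required parameters; we also define a function that reports the test error during training. The required parameters are executor, program, reader, feeder and fetch_list: executor is the executor created earlier; program is the program run by the executor, which was created earlier and defaults to default_main_program if not given; reader is the data reader; feeder supplies the forward input variables; fetch_list lists the variables (or their names) that the user wants to fetch.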
```python ```python
num_epochs = 100 num_epochs = 100
...@@ -278,24 +300,6 @@ def train_test(executor, program, reader, feeder, fetch_list): ...@@ -278,24 +300,6 @@ def train_test(executor, program, reader, feeder, fetch_list):
count += 1 # count the number of test batches count += 1 # count the number of test batches
return [x_d / count for x_d in accumulated] # compute the mean loss return [x_d / count for x_d in accumulated] # compute the mean loss
```
You can print the loss values directly to monitor the `training progress`:
```python
train_prompt = "train cost"
test_prompt = "test cost"
print("%s', out %f" % (train_prompt, out))
print("%s', out %f" % (test_prompt, out))
```
In addition, you can plot the `training progress`:
```python
from paddle.utils.plot import ploter
plot_prompt = ploter(train_prompt, test_prompt)
``` ```
### Main Training Loop ### Main Training Loop
...@@ -306,8 +310,11 @@ plot_prompt = ploter(train_prompt, test_prompt) ...@@ -306,8 +310,11 @@ plot_prompt = ploter(train_prompt, test_prompt)
%matplotlib inline %matplotlib inline
params_dirname = "fit_a_line.inference.model" params_dirname = "fit_a_line.inference.model"
feeder = fluid.DataFeeder(place=place, feed_list=[x, y]) feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
naive_exe = fluid.Executor(place) exe.run(startup_program)
naive_exe.run(startup_program) train_prompt = "train cost"
test_prompt = "test cost"
from paddle.utils.plot import Ploter
plot_prompt = Ploter(train_prompt, test_prompt)
step = 0 step = 0
exe_test = fluid.Executor(place) exe_test = fluid.Executor(place)
...@@ -322,10 +329,12 @@ for pass_id in range(num_epochs): ...@@ -322,10 +329,12 @@ for pass_id in range(num_epochs):
avg_loss_value, = exe.run(main_program, avg_loss_value, = exe.run(main_program,
feed=feeder.feed(data_train), feed=feeder.feed(data_train),
fetch_list=[avg_loss]) fetch_list=[avg_loss])
if step % 10 == 0: # record the training loss every 10 batches if step % 10 == 0: # record and print the training loss every 10 batches
plot_prompt.append(train_prompt, step, avg_loss_value[0]) plot_prompt.append(train_prompt, step, avg_loss_value[0])
plot_prompt.plot() plot_prompt.plot()
if step % 100 == 0: # record the test loss every 100 batches print("%s, Step %d, Cost %f" %
(train_prompt, step, avg_loss_value[0]))
if step % 100 == 0: # record and print the test loss every 100 batches
test_metics = train_test(executor=exe_test, test_metics = train_test(executor=exe_test,
program=test_program, program=test_program,
reader=test_reader, reader=test_reader,
...@@ -333,6 +342,8 @@ for pass_id in range(num_epochs): ...@@ -333,6 +342,8 @@ for pass_id in range(num_epochs):
feeder=feeder) feeder=feeder)
plot_prompt.append(test_prompt, step, test_metics[0]) plot_prompt.append(test_prompt, step, test_metics[0])
plot_prompt.plot() plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(test_prompt, step, test_metics[0]))
if test_metics[0] < 10.0: # stop training once the test loss meets the requirement if test_metics[0] < 10.0: # stop training once the test loss meets the requirement
break break
...@@ -358,6 +369,24 @@ inference_scope = fluid.core.Scope() ...@@ -358,6 +369,24 @@ inference_scope = fluid.core.Scope()
``` ```
### Prediction ### Prediction
Save the figure
```python
def save_result(points1, points2):
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x1 = [idx for idx in range(len(points1))]
y1 = points1
y2 = points2
l1 = plt.plot(x1, y1, 'r--', label='predictions')
l2 = plt.plot(x1, y2, 'g--', label='GT')
plt.plot(x1, y1, 'ro-', x1, y2, 'g+-')
plt.title('predictions VS GT')
plt.legend()
plt.savefig('./image/prediction_gt.png')
```
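With fluid.io.load_inference_model, the predictor loads the trained model from `params_dirname` to make predictions on previously unseen data. With fluid.io.load_inference_model, the predictor loads the trained model from `params_dirname` to make predictions on previously unseen data.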
```python ```python
...@@ -379,27 +408,7 @@ with fluid.scope_guard(inference_scope): ...@@ -379,27 +408,7 @@ with fluid.scope_guard(inference_scope):
results = infer_exe.run(inference_program, results = infer_exe.run(inference_program,
feed={feed_target_names[0]: numpy.array(infer_feat)}, feed={feed_target_names[0]: numpy.array(infer_feat)},
fetch_list=fetch_targets) # run prediction fetch_list=fetch_targets) # run prediction
``` # print the predictions and labels, and visualize the results
Save the figure
```python
def save_result(points1, points2):
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x1 = [idx for idx in range(len(points1))]
y1 = points1
y2 = points2
l1 = plt.plot(x1, y1, 'r--', label='predictions')
l2 = plt.plot(x1, y2, 'g--', label='GT')
plt.plot(x1, y1, 'ro-', x1, y2, 'g+-')
plt.title('predictions VS GT')
plt.legend()
plt.savefig('./image/prediction_gt.png')
```
Print the predictions and labels, and visualize the results
```python
print("infer results: (House Price)") print("infer results: (House Price)")
for idx, val in enumerate(results[0]): for idx, val in enumerate(results[0]):
print("%d: %.2f" % (idx, val)) # print the prediction print("%d: %.2f" % (idx, val)) # print the prediction
...@@ -408,8 +417,10 @@ def save_result(points1, points2): ...@@ -408,8 +417,10 @@ def save_result(points1, points2):
for idx, val in enumerate(infer_label): for idx, val in enumerate(infer_label):
print("%d: %.2f" % (idx, val)) # print the label value print("%d: %.2f" % (idx, val)) # print the label value
save_result(results[0], infer_label) # save the figure save_result(results[0], infer_label) # save the figure
``` ```
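Because each iteration trains on a randomly selected minibatch, the predictions differ from run to run.
## Summary ## Summary
In this chapter, we used the Boston house price dataset to introduce the basic concepts of the linear regression model and showed how to train and test it with PaddlePaddle. Many models and techniques evolved from the simple linear regression model, so it is important to understand the principles and limitations of linear models. In this chapter, we used the Boston house price dataset to introduce the basic concepts of the linear regression model and showed how to train and test it with PaddlePaddle. Many models and techniques evolved from the simple linear regression model, so it is important to understand the principles and limitations of linear models.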
...@@ -422,7 +433,7 @@ save_result(results[0], infer_label) # save the figure ...@@ -422,7 +433,7 @@ save_result(results[0], infer_label) # save the figure
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128. 4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://www.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -42,140 +42,147 @@ ...@@ -42,140 +42,147 @@
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Linear Regression # Linear Regression
Let us begin the tutorial with a classical problem called Linear Regression \[[1](#References)\]. In this chapter, we will train a model from a realistic dataset to predict home prices. Some important concepts in Machine Learning will be covered through this example. Let's start this tutorial with the classic Linear Regression model \[[1](#References)\].
The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). For instructions on getting started with this book, see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). In this chapter, you will build a model that predicts house prices from a real dataset and learn several important machine learning concepts.
## Problem Setup The source code of this tutorial is in [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). New users, please refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).
Suppose we have a dataset of $n$ real estate properties. Each real estate property will be referred to as **homes** in this chapter for clarity.
Each home is associated with $d$ attributes. The attributes describe characteristics such as the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic condition nearby.
In our problem setup, the attribute $x_{i,j}$ denotes the $j$th characteristic of the $i$th home. In addition, $y_i$ denotes the price of the $i$th home. Our task is to predict $y_i$ given a set of attributes $\{x_{i,1}, ..., x_{i,d}\}$. We assume that the price of a home is a linear combination of all of its attributes, namely, ## Background
Given a dataset of size $n$, ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$, where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes of the $i$-th sample and $y_i$ is the target to be predicted for that sample.
$$y_i = \omega_1x_{i,1} + \omega_2x_{i,2} + \ldots + \omega_dx_{i,d} + b, i=1,\ldots,n$$ The linear regression model assumes that the target $y_i$ can be described by a linear combination of the attributes, i.e.
where $\vec{\omega}$ and $b$ are the model parameters we want to estimate. Once they are learned, we will be able to predict the price of a home, given the attributes associated with it. We call this model **Linear Regression**. In other words, we want to regress a value against several values linearly. In practice, a linear model is often too simplistic to capture the real relationships between the variables. Yet, because Linear Regression is easy to train and analyze, it has been applied to a large number of real problems. As a result, it is an important topic in many classic Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\]. $$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldots,n$$
## Results Demonstration For example, in the house price prediction problem we are going to model, $x_{ij}$ describes the attributes of house $i$ (such as the number of rooms, the number of nearby schools and hospitals, traffic conditions, etc.), and $y_i$ is the price of the house.
We first show the result of our model. The dataset [UCI Housing Data Set](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) is used to train a linear model to predict the home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of similar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. The dotted line represents points where $X=Y$. When reading the diagram, the closer the point is to the dotted line, the better the model's prediction.
At first glance, this assumption is too simple, and the true relationship among variables is unlikely to be linear. However, because the linear regression model is simple in form and easy to build and analyze, it has been widely applied to practical problems. Many classic statistical learning and machine learning books \[[2,3,4](#references)\] also devote a chapter to linear models.
## Result Demo
We used the Boston house price dataset obtained from [UCI Housing dataset](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) to train the model and make predictions. The scatter plot below shows the model's price predictions for some of the houses. The x-axis value of each point is the median real price of houses of the same type, and the y-axis value is the price the linear regression model predicts from the features. When the two values are exactly equal, the point falls on the dotted line, so the more accurate the model's prediction, the closer the point is to the dotted line.
<p align="center"> <p align="center">
<img src = "image/predictions_en.png" width=400><br/> <img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/predictions.png?raw=true" width=400><br/>
Figure 1. Predicted Value V.S. Actual Value Figure 1. Predicted value vs. ground-truth value
</p> </p>
## Model Overview ## Model Overview
### Model Definition ### Model Definition
In the UCI Housing Data Set, there are 13 home attributes $\{x_{i,j}\}$ that are related to the median home price $y_i$, which we aim to predict. Thus, our model can be written as: In the Boston house price dataset, there are 14 values associated with each house: the first 13 describe various attributes of the house, that is, $x_i$ in the model; the last value is the median price of this type of house that we want to predict, which is $y_i$ in the model.
Therefore, our model can be expressed as:
$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$ $$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$
where $\hat{Y}$ is the predicted value used to differentiate from actual value $Y$. The model learns parameters $\omega_1, \ldots, \omega_{13}, b$, where the entries of $\vec{\omega}$ are **weights** and $b$ is **bias**. $\hat{Y}$ represents the predicted result of the model and is used to distinguish it from the real value $Y$. The parameters to be learned by the model are: $\omega_1, \ldots, \omega_{13}, b$.
Now we need an objective to optimize, so that the learned parameters can make $\hat{Y}$ as close to $Y$ as possible. Let's refer to the concept of [Loss Function (Cost Function)](https://en.wikipedia.org/wiki/Loss_function). A loss function must output a non-negative value, given any pair of the actual value $y_i$ and the predicted value $\hat{y_i}$. This value reflects the magnitude of the model error. After building the model, we need to give it an optimization goal so that the learned parameters make the predicted value $\hat{Y}$ as close as possible to the true value $Y$. Here we introduce the concept of a [Loss Function](https://en.wikipedia.org/wiki/Loss_function) (or Cost Function). Given the target value $y_{i}$ of any data sample and the value $\hat{y_{i}}$ predicted by the model, the loss function outputs a non-negative real number, which is usually used to represent the model error.
For Linear Regression, the most common loss function is [Mean Square Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) which has the following form: For linear regression models, the most common loss function is the Mean Squared Error ([MSE](https://en.wikipedia.org/wiki/Mean_squared_error)), which is:
$$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
That is, for a dataset of size $n$, MSE is the average value of the prediction square errors. That is, for a test set of size $n$, $MSE$ is the mean of the squared errors of the $n$ predictions.
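The MSE formula can be checked by hand on a few numbers. The sketch below is dependency-free Python; the function name `mean_squared_error` and the sample values are illustrative assumptions and do not come from `train.py`.

```python
def mean_squared_error(predictions, targets):
    """Return the mean of the squared differences between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one sample is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical values chosen only for illustration (not taken from the dataset).
y_hat = [2.0, 4.0, 6.0]
y_true = [1.0, 4.0, 8.0]
mse = mean_squared_error(y_hat, y_true)  # squared errors are 1, 0 and 4
```

In the tutorial itself the same quantity is computed by `fluid.layers.square_error_cost` followed by `fluid.layers.mean`.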
The loss function is generally optimized with gradient descent, a first-order optimization algorithm. If $f(x)$ is defined and differentiable at point $x_n$, then $f(x)$ decreases fastest at $x_n$ in the direction of the negative gradient $-\nabla f(x_n)$. We adjust $x$ repeatedly so that $f(x)$ approaches a local or global minimum. The adjustment is as follows:
$$x_{n+1}=x_n-\lambda\nabla f(x_n), n\geq 0$$
where $\lambda$ represents the learning rate. This method of adjustment is called the gradient descent method.
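As a minimal sketch of this update rule, the snippet below minimizes a one-dimensional quadratic. The helper `gradient_descent`, the objective, and the hyperparameter values are assumptions made for illustration only; they are not part of PaddlePaddle.

```python
def gradient_descent(grad, x0, learning_rate=0.1, num_steps=100):
    """Repeatedly apply x_{n+1} = x_n - learning_rate * grad(x_n)."""
    x = x0
    for _ in range(num_steps):
        x = x - learning_rate * grad(x)
    return x


# Illustrative objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
# and whose minimum is at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With this learning rate the distance to the minimum shrinks by a constant factor at every step. A learning rate that is too large would make the iterates diverge instead.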
### Training Process ### Training Process
After setting up our model, there are several major steps to go through to train it: After defining the model structure, we will train the model through the following steps.
1. Initialize the parameters including the weights $\vec{\omega}$ and the bias $b$. For example, we can set their mean values as $0$s, and their standard deviations as $1$s.
2. Feedforward. Evaluate the network output and compute the corresponding loss. 1. Initialize the parameters, including the weights $\omega_i$ and the bias $b$ (e.g., with mean 0 and variance 1).
3. [Backpropagate](https://en.wikipedia.org/wiki/Backpropagation) the errors. The errors will be propagated from the output layer back to the input layer, during which the model parameters will be updated with the corresponding errors. 2. Forward propagation: compute the network output and the loss function.
4. Repeat steps 2~3, until the loss is below a predefined threshold or the maximum number of epochs is reached. 3. [Backpropagation](https://en.wikipedia.org/wiki/Backpropagation): propagate the error backward from the output layer according to the loss function and update the parameters in the network.
4. Repeat steps 2~3 until the training error reaches the specified level or the number of training rounds reaches the set value.
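The four steps can be sketched without any framework for a model with a single feature. This is a simplified stand-in rather than the tutorial's PaddlePaddle code: the function `train_linear_model`, the zero initialization (instead of random initialization), and the synthetic data are all assumptions made for illustration.

```python
def train_linear_model(xs, ys, learning_rate=0.05, max_epochs=5000, loss_threshold=1e-6):
    """Fit y = w * x + b by full-batch gradient descent on the MSE loss."""
    n = len(xs)
    w, b = 0.0, 0.0  # step 1: initialize the parameters (zeros, for simplicity)
    loss = float("inf")
    for _ in range(max_epochs):  # step 4: repeat until a stopping rule fires
        preds = [w * x + b for x in xs]  # step 2: forward pass
        errors = [p - y for p, y in zip(preds, ys)]
        loss = sum(e * e for e in errors) / n  # MSE loss
        if loss < loss_threshold:
            break
        # step 3: gradients of the MSE with respect to w and b, then the update
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss


# Synthetic data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b, final_loss = train_linear_model(xs, ys)
```

For a single linear layer the gradients can be written in closed form as above. In the tutorial, the framework derives them automatically when the optimizer's `minimize` is called on the loss.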
## Dataset
### An Introduction of the Dataset ## Dataset
The UCI housing dataset has 506 instances. Each instance describes the attributes of a house in suburban Boston. The attributes are explained below: ### Dataset Introduction
The dataset consists of 506 lines, each containing information about a type of house in a suburb of Boston and the median price of that type of house. The meaning of each attribute is as follows:
| Attribute Name | Characteristic | Data Type | | Property Name | Explanation | Type |
| ------| ------ | ------ | | ------| ------ | ------ |
| CRIM | per capita crime rate by town | Continuous| CRIM | Per capita crime rate in the town | Continuous value |
| ZN | proportion of residential land zoned for lots over 25,000 sq.ft. | Continuous | | ZN | Proportion of residential land with an area of over 25,000 square feet | Continuous value |
| INDUS | proportion of non-retail business acres per town | Continuous | | INDUS | Proportion of non-retail commercial land | Continuous value |
| CHAS | Charles River dummy variable | Discrete, 1 if tract bounds river; 0 otherwise| CHAS | Whether it is adjacent to Charles River | Discrete value, 1=proximity; 0=not adjacent |
| NOX | nitric oxides concentration (parts per 10 million) | Continuous | NOX | Nitric Oxide Concentration | Continuous value |
| RM | average number of rooms per dwelling | Continuous | | RM | Average number of rooms per house | Continuous value |
| AGE | proportion of owner-occupied units built prior to 1940 | Continuous | | AGE | Proportion of self-use units built before 1940 | Continuous value |
| DIS | weighted distances to five Boston employment centres | Continuous | | DIS | Weighted Distance to 5 Job Centers in Boston | Continuous value |
| RAD | index of accessibility to radial highways | Continuous | | RAD | Accessibility Index to Radial Highway | Continuous value |
| TAX | full-value property-tax rate per \$10,000 | Continuous | | TAX | Tax Rate of Full-value Property | Continuous value |
| PTRATIO | pupil-teacher ratio by town | Continuous | | PTRATIO | Proportion of Student and Teacher | Continuous value |
| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | Continuous | | B | 1000(BK - 0.63)^2, where BK is black ratio | Continuous value |
| LSTAT | % lower status of the population | Continuous | LSTAT | Low-income population ratio | Continuous value |
| MEDV | Median value of owner-occupied homes in $1000's | Continuous | | MEDV | Median price of a similar home | Continuous value |
### Data Pre-processing
The last entry is the median home price. #### Continuous value and discrete value
Analyzing the data, we first find that of the 13 attributes, 12 are continuous and 1 is discrete (CHAS). Discrete values are often represented by numbers such as 0, 1, and 2, but they differ from continuous values because the difference between two discrete values has no meaning. For example, if we use 0, 1, and 2 to represent red, green, and blue, we cannot infer that the distance between blue and red is longer than that between green and red. So for a discrete attribute with $d$ possible values, we usually convert it to $d$ binary attributes with a value of 0 or 1, or map each possible value to a multidimensional vector. However, CHAS does not have this problem, since it is already a binary attribute.
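The conversion of a discrete attribute with $d$ possible values into $d$ binary attributes can be sketched as follows. The helper `one_hot` and the color attribute are hypothetical; the housing dataset needs no such step because CHAS is already binary.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` in `categories` and 0 elsewhere."""
    if value not in categories:
        raise ValueError("unknown category: %r" % (value,))
    return [1 if value == c else 0 for c in categories]


colors = ["red", "green", "blue"]  # hypothetical discrete attribute with d = 3 values
encoded = [one_hot(c, colors) for c in ["blue", "red"]]
```

After encoding, every pair of distinct colors differs in exactly two positions, so no color is artificially closer to another.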
### Preprocessing #### Normalization of attributes
Another easily observed fact is that the value ranges of the attributes differ greatly (as shown in Figure 2). For example, the value range of attribute B is [0.32, 396.90], while the value range of attribute NOX is [0.3850, 0.8170]. This calls for a common operation: normalization. The goal of normalization is to scale the values of each attribute to a similar range, such as [-0.5, 0.5]. Here we use a very common method: subtract the mean and divide by the range of values.
#### Continuous and Discrete Data There are at least three reasons for implementing normalization (or [Feature scaling](https://en.wikipedia.org/wiki/Feature_scaling)):
We define a feature vector of length 13 for each home, where each entry corresponds to an attribute. Our first observation is that, among the 13 dimensions, there are 12 continuous dimensions and 1 discrete dimension. - A range of values that are too large or too small can cause floating value overflow or underflow during calculation.
Note that although a discrete value is also written as numeric values such as 0, 1, or 2, its meaning differs from a continuous value drastically. The linear difference between two discrete values has no meaning. For example, suppose $0$, $1$, and $2$ are used to represent colors *Red*, *Green*, and *Blue* respectively. Judging from the numeric representation of these colors, *Red* differs more from *Blue* than it does from *Green*. Yet in actuality, it is not true that the extent to which the color *Blue* is different from *Red* is greater than the extent to which *Green* is different from *Red*. Therefore, when handling a discrete feature that has $d$ possible values, we usually convert it to $d$ new features where each feature takes a binary value, $0$ or $1$, indicating whether the original value is absent or present. Alternatively, the discrete features can be mapped onto a continuous multi-dimensional vector through an embedding table. For our problem here, because CHAS itself is a binary discrete value, we do not need to do any preprocessing. - Different value ranges cause attributes to differ in importance to the model (at least in the initial period of training), and this implicit assumption is often unreasonable. This can make the optimization process difficult and greatly lengthen the training time.
#### Feature Normalization - Many machine learning techniques/models (such as L1 and L2 regularization terms, Vector Space Model) are based on the assumption that all attributes have a mean of approximately zero and similar value ranges.
We also observe a huge difference among the value ranges of the 13 features (Figure 2). For instance, the values of feature *B* fall in $[0.32, 396.90]$, whereas those of feature *NOX* has a range of $[0.3850, 0.8170]$. An effective optimization would require data normalization. The goal of data normalization is to scale the values of each feature into roughly the same range, perhaps $[-0.5, 0.5]$. Here, we adopt a popular normalization technique where we subtract the mean value from the feature value and divide the result by the width of the original range.
There are at least three reasons for [Feature Normalization](https://en.wikipedia.org/wiki/Feature_scaling) (Feature Scaling):
- A value range that is too large or too small might cause floating number overflow or underflow during computation.
- Different value ranges might result in varying *importances* of different features to the model (at least in the beginning of the training process). This assumption about the data is often unreasonable, making the optimization difficult, which in turn results in increased training time.
- Many machine learning techniques or models (e.g., *L1/L2 regularization* and *Vector Space Model*) assumes that all the features have roughly zero means and their value ranges are similar.
<p align="center"> <p align="center">
<img src = "image/ranges_en.png" width=550><br/> <img src = "https://github.com/PaddlePaddle/book/blob/develop/01.fit_a_line/image/ranges.png?raw=true" width=550><br/>
Figure 2. The value ranges of the features Figure 2. Value range of attributes for all dimensions
</p> </p>
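The normalization described above (subtract the mean, then divide by the range of values) can be sketched per column in plain Python. The helper `normalize_columns` and the two made-up attributes are assumptions for illustration; they are not the `uci_housing` module's own implementation.

```python
def normalize_columns(rows):
    """Scale every column to (value - mean) / (max - min)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return [
        # A constant column has zero range; map it to 0.0 instead of dividing by zero.
        [(v - m) / r if r != 0 else 0.0 for v, m, r in zip(row, means, ranges)]
        for row in rows
    ]


# Two made-up attributes with very different scales.
samples = [[0.4, 100.0], [0.6, 300.0], [0.8, 200.0]]
normalized = normalize_columns(samples)
```

After this step every column has zero mean and spans an interval of width 1, so no attribute dominates simply because of its units.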
#### Prepare Training and Test Sets #### Organizing training set and testing set
We split the dataset in two, one for adjusting the model parameters, namely, for training the model, and the other for testing. The model error on the former is called the **training error**, and the error on the latter is called the **test error**. Our goal in training a model is to find the statistical dependency between the outputs and the inputs, so that we can predict outputs given new inputs. As a result, the test error reflects the performance of the model better than the training error does. We consider two things when deciding the ratio of the training set to the test set: 1) More training data will decrease the variance of the parameter estimation, yielding more reliable models; 2) More test data will decrease the variance of the test error, yielding more reliable test errors. One standard split ratio is $8:2$. We split the dataset into two parts: one is used to adjust the parameters of the model, that is, to train the model, and the error of the model on this dataset is called the **training error**; the other is used for testing, and the error of the model on this dataset is called the **test error**. The goal of training is to predict unknown new data by finding patterns in the training data, so the test error is a better indicator of the model's performance. When choosing the split ratio, we should take two factors into account: more training data reduces the variance of the estimated parameters, resulting in a more reliable model; and more test data reduces the variance of the test error, resulting in a more credible test error. The split ratio in our example is $8:2$.
When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process, such as the number of layers in the network. Because hyperparameters are not part of the model parameters, they cannot be trained using the same loss function. Thus we will try several sets of hyperparameters to train several models and cross-validate them on the validation set to pick the best one; finally, the selected trained model is tested on the test set. Because our model is relatively simple, we will omit this validation process.
In a more complex model training process, we often need one more dataset: the validation set. Because complex models often have [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be tuned, we try several hyperparameter combinations to train several models, compare their performance on the validation set to select the best set of hyperparameters, and finally use the model trained with that set to evaluate the test error on the test set. Since the model trained in this chapter is relatively simple, we skip this process for now.
## Training ## Training
`fit_a_line/trainer.py` demonstrates the training using [PaddlePaddle](http://paddlepaddle.org). `fit_a_line/train.py` demonstrates the overall process of training.
### Datafeeder Configuration ### Configuring the Data feeder
Our program starts with importing necessary packages: First we import the libraries:
```python ```python
from __future__ import print_function
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import numpy import numpy
from __future__ import print_function import math
try: import sys
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
``` ```
We encapsulated the [UCI Housing Data Set](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) in our Python module `uci_housing`. This module can We load the [UCI Housing dataset](http://paddlemodels.bj.bcebos.com/uci_housing/housing.data) through the `uci_housing` module.
1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if you haven't yet, and The module encapsulates:
2. [preprocess](#preprocessing) the dataset.
1. The process of data download. The download data is saved in ~/.cache/paddle/dataset/uci_housing/housing.data.
2. The process of [data preprocessing](#data preprocessing).
We define data feeders for test and train. The feeder reads a `BATCH_SIZE` of data each time and feeds them to the training/testing process. If the user wants some randomness on the data order, she can define both a `BATCH_SIZE` and a `buf_size`. That way the datafeeder will yield the first `BATCH_SIZE` data out of a shuffle of the first `buf_size` data. Next we define the data feeder for training. The data feeder reads a batch of `BATCH_SIZE` samples each time. If the user wants the data order to be random, they can define both a batch size and a buffer size. In this case, each time the data feeder randomly reads a batch-sized set of samples from the buffer.
```python ```python
BATCH_SIZE = 20 BATCH_SIZE = 20
...@@ -191,172 +198,240 @@ test_reader = paddle.batch( ...@@ -191,172 +198,240 @@ test_reader = paddle.batch(
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
``` ```
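To make the interplay of `buf_size` and `BATCH_SIZE` concrete, here is a framework-free sketch of a buffered shuffle followed by batching. The functions `shuffle_reader` and `batch_reader` are simplified stand-ins written for this illustration; they are assumptions, not the implementations behind `paddle.reader.shuffle` and `paddle.batch`.

```python
import random


def shuffle_reader(samples, buf_size, rng):
    """Yield samples after shuffling them one buffer of `buf_size` items at a time."""
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buf_size:
            rng.shuffle(buffer)
            for item in buffer:
                yield item
            buffer = []
    rng.shuffle(buffer)  # flush the final, partially filled buffer
    for item in buffer:
        yield item


def batch_reader(samples, batch_size):
    """Group consecutive samples into lists of at most `batch_size` items."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # keep the last, smaller batch
        yield batch


rng = random.Random(0)  # fixed seed so the sketch is reproducible
batches = list(batch_reader(shuffle_reader(range(10), buf_size=4, rng=rng), batch_size=3))
```

In this sketch a sample never moves outside the buffer it was read into, so a buffer at least as large as the dataset (such as `buf_size=500` for roughly 400 training samples) amounts to a full shuffle.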
### Train Program Configuration If you want to read data directly from a \*.txt file, you can use the following method.
`train_program` sets up the network structure of this current training model. For linear regression, it is simply a fully connected layer from the input to the output. More complex structures like CNN and RNN will be introduced in later chapters. The `train_program` must return an avg_loss as its first returned parameter because it is needed in backpropagation. feature_names = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'convert'
]
```python feature_num = len(feature_names)
def train_program():
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
# feature vector of length 13 data = numpy.fromfile(filename, sep=' ') # Read primary data from file
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
loss = fluid.layers.square_error_cost(input=y_predict, label=y) data = data.reshape(data.shape[0] // feature_num, feature_num)
avg_loss = fluid.layers.mean(loss)
return avg_loss maximums, minimums, avgs = data.max(axis=0), data.min(axis=0), data.sum(axis=0)/data.shape[0]
```
### Optimizer Function Configuration for i in six.moves.range(feature_num-1):
data[:, i] = (data[:, i] - avgs[i]) / (maximums[i] - minimums[i]) # six.moves is compatible to python2 and python3
In the following `SGD` optimizer, `learning_rate` specifies the learning rate in the optimization procedure. ratio = 0.8 # distribution ratio of train dataset and verification dataset
```python offset = int(data.shape[0]\*ratio)
def optimizer_program():
return fluid.optimizer.SGD(learning_rate=0.001)
```
### Specify Place train_data = data[:offset]
Specify your training environment, you should specify if the training is on CPU or GPU. test_data = data[offset:]
```python train_reader = paddle.batch(
use_cuda = False paddle.reader.shuffle(
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() train_data, buf_size=500),
``` batch_size=BATCH_SIZE)
### Create Trainer test_reader = paddle.batch(
paddle.reader.shuffle(
test_data, buf_size=500),
batch_size=BATCH_SIZE)
The trainer will take the `train_program` as input. ### Configure Program for Training
The training program defines the network structure of the model to be trained. For linear regression, it is a simple fully connected layer from input to output. More complex structures, such as Convolutional Neural Networks and Recurrent Neural Networks, will be introduced in later chapters. The training program must return the `mean error` as its first return value, because the `mean error` is used for backpropagation.
```python ```python
trainer = Trainer( x = fluid.layers.data(name='x', shape=[13], dtype='float32') # define shape and data type of input
train_func=train_program, y = fluid.layers.data(name='y', shape=[1], dtype='float32') # define shape and data type of output
place=place, y_predict = fluid.layers.fc(input=x, size=1, act=None) # fully connected layer connecting input and output
optimizer_func=optimizer_program)
main_program = fluid.default_main_program() # get the default (global) main program
startup_program = fluid.default_startup_program() # get the default (global) startup program
cost = fluid.layers.square_error_cost(input=y_predict, label=y) # squared error between the prediction and the label
avg_loss = fluid.layers.mean(cost) # average the squared error to get the mean loss
``` ```
For details, please refer to:
[fluid.default_main_program](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#default-main-program)
[fluid.default_startup_program](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#default-startup-program)
### Feeding Data ### Optimizer Function Configuration
PaddlePaddle provides the `SGD` optimizer; the `learning_rate` argument below sets the learning rate, which affects how quickly training converges.
[reader mechanism](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
for loading the training data. A reader may return multiple columns, and we need a Python dictionary to specify the mapping from column index to data layers.
```python ```python
feed_order=['x', 'y'] #Clone main_program to get test_program
# Some operators behave differently during training and testing. For example, batch_norm uses the for_test parameter to decide whether the program is used for training or for testing.
# clone() does not delete any operators, so call it before the backward pass and the optimizer are added.
test_program = main_program.clone(for_test=True)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_loss)
``` ```
Moreover, an event handler is provided to print the training progress: ### Define Training Place
```python We can define whether an operation runs on the CPU or on the GPU.
# Specify the directory to save the parameters
params_dirname = "fit_a_line.inference.model"
```python
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() # define the execution space of executor
train_title = "Train cost" # An executor takes a program and adds feed and fetch operators to it according to the feed map and fetch list. Call run(...) to run the program and close() to close the executor.
test_title = "Test cost" exe = fluid.Executor(place)
step = 0 ```
For details, please refer to:
[fluid.executor](http://www.paddlepaddle.org/documentation/docs/zh/develop/api_cn/fluid_cn.html#permalink-15-executor)
# event_handler prints training and testing info ### Create Training Process
def event_handler(event): Training needs a training program and a few parameters. We also define a function that computes the test error during training. Its parameters are `executor`, `program`, `reader`, `feeder`, and `fetch_list`: `executor` is the executor created above; `program` is the program the executor runs (`default_main_program` is used if it is not specified); `reader` supplies the data; `feeder` feeds the data to the input variables; and `fetch_list` lists the variables (or their names) that the user wants to fetch.
global step
if isinstance(event, EndStepEvent):
if step % 10 == 0: # record a train cost every 10 batches
print("%s, Step %d, Cost %f" % (train_title, step, event.metrics[0]))
if step % 100 == 0: # record a test cost every 100 batches
test_metrics = trainer.test(
reader=test_reader, feed_order=feed_order)
print("%s, Step %d, Cost %f" % (test_title, step, test_metrics[0]))
if test_metrics[0] < 10.0:
# If the accuracy is good enough, we can stop the training.
print('loss is less than 10.0, stop')
trainer.stop()
step += 1
if isinstance(event, EndEpochEvent): ```python
if event.epoch % 10 == 0: num_epochs = 100
# We can save the trained parameters for the inferences later
if params_dirname is not None: def train_test(executor, program, reader, feeder, fetch_list):
trainer.save_params(params_dirname) accumulated = 1 * [0]
count = 0
for data_test in reader():
outs = executor.run(program=program,
feed=feeder.feed(data_test),
fetch_list=fetch_list)
accumulated = [x_c[0] + x_c[1][0] for x_c in zip(accumulated, outs)] # accumulate loss value in the process of test
 count += 1 # count the test batches processed so far
return [x_d / count for x_d in accumulated] # compute mean loss
``` ```
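The averaging performed by `train_test` can be tried without PaddlePaddle. The block below is a minimal sketch, assuming each call to `executor.run` returns a list with one array per fetched variable; the name `average_fetches` and the fake batch results are hypothetical and exist only for illustration.

```python
import numpy as np


def average_fetches(batches_of_outs, num_fetches=1):
    """Average the first element of each fetched array over all batches."""
    accumulated = num_fetches * [0]
    count = 0
    for outs in batches_of_outs:
        # Same accumulation expression as in train_test above.
        accumulated = [x_c[0] + x_c[1][0] for x_c in zip(accumulated, outs)]
        count += 1  # one increment per batch, not per sample
    return [x_d / count for x_d in accumulated]


# Hypothetical per-batch results: one fetched variable (the mean loss) per batch.
fake_outs = [[np.array([12.0])], [np.array([8.0])], [np.array([10.0])]]
mean_losses = average_fetches(fake_outs)
print(mean_losses[0])
```

Because `count` grows once per batch, the result is the mean of the per-batch losses. It equals the per-sample mean only when every batch has the same size.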
### Start Training ### Train Main Loop
We now can start training by calling `trainer.train()`.    Specify the directory in which to save the parameters and initialize an executor.
```python ```python
%matplotlib inline %matplotlib inline
params_dirname = "fit_a_line.inference.model"
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe.run(startup_program)
train_prompt = "train cost"
test_prompt = "test cost"
from paddle.utils.plot import Ploter
plot_prompt = Ploter(train_prompt, test_prompt)
step = 0
# The training could take up to a few minutes. exe_test = fluid.Executor(place)
trainer.train(
reader=train_reader,
num_epochs=100,
event_handler=event_handler,
feed_order=feed_order)
``` ```
 PaddlePaddle provides a reader mechanism for reading training data. A reader yields multiple columns of data at a time, so we need a Python list to specify the order in which they are read. We build a loop that trains until the result is good enough or the maximum number of iterations is reached.
 When the current iteration is one at which parameters are to be saved, the trained parameters are written to `params_dirname`.
 Set up the main training loop.
```python
for pass_id in range(num_epochs):
for data_train in train_reader():
avg_loss_value, = exe.run(main_program,
feed=feeder.feed(data_train),
fetch_list=[avg_loss])
if step % 10 == 0: # record and output train loss for every 10 batches.
plot_prompt.append(train_prompt, step, avg_loss_value[0])
plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(train_prompt, step, avg_loss_value[0]))
if step % 100 == 0: # record and output test loss for every 100 batches.
test_metics = train_test(executor=exe_test,
program=test_program,
reader=test_reader,
fetch_list=[avg_loss.name],
feeder=feeder)
plot_prompt.append(test_prompt, step, test_metics[0])
plot_prompt.plot()
print("%s, Step %d, Cost %f" %
(test_prompt, step, test_metics[0]))
if test_metics[0] < 10.0: # If the accuracy is up to the requirement, the train can be stopped.
break
![png](./image/train_and_test.png) step += 1
## Inference if math.isnan(float(avg_loss_value[0])):
sys.exit("got NaN loss, training failed.")
Initialize the Inferencer with the inference_program and the params_dirname, which is where we saved our params #save train parameters into the path given before
if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, ['x'], [y_predict], exe)
```
### Setup the Inference Program ## Predict
 Prediction requires the trained parameters, which are stored in `params_dirname`.
Similar to the trainer.train, the Inferencer needs to take an inference_program to do inference. ### Prepare Environment for Prediction
Prune the train_program to only have the y_predict. Similar to the process of training, predictor needs a program for prediction. We can slightly modify our training program to include the prediction value.
```python ```python
def inference_program(): infer_exe = fluid.Executor(place)
x = fluid.layers.data(name='x', shape=[13], dtype='float32') inference_scope = fluid.core.Scope()
y_predict = fluid.layers.fc(input=x, size=1, act=None)
return y_predict
``` ```
### Infer ### Predict
Inferencer will load the trained model from `params_dirname` and use it to infer the unseen data.
 Define a helper that saves a plot of the predictions against the ground truth:
```python ```python
inferencer = Inferencer( def save_result(points1, points2):
infer_func=inference_program, param_path=params_dirname, place=place) import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x1 = [idx for idx in range(len(points1))]
y1 = points1
y2 = points2
l1 = plt.plot(x1, y1, 'r--', label='predictions')
l2 = plt.plot(x1, y2, 'g--', label='GT')
plt.plot(x1, y1, 'ro-', x1, y2, 'g+-')
plt.title('predictions VS GT')
plt.legend()
plt.savefig('./image/prediction_gt.png')
```
batch_size = 10    Using fluid.io.load_inference_model, the predictor loads the trained model from `params_dirname` and uses it to predict on unseen data.
test_reader = paddle.batch(paddle.dataset.uci_housing.test(),batch_size=batch_size)
test_data = next(test_reader())
test_x = numpy.array([data[0] for data in test_data]).astype("float32")
test_y = numpy.array([data[1] for data in test_data]).astype("float32")
results = inferencer.infer({'x': test_x}) ```python
with fluid.scope_guard(inference_scope):
[inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(params_dirname, infer_exe) # load pre-predict model
batch_size = 10
infer_reader = paddle.batch(
paddle.dataset.uci_housing.test(), batch_size=batch_size) # prepare test dataset
infer_data = next(infer_reader())
infer_feat = numpy.array(
[data[0] for data in infer_data]).astype("float32") # extract data in test dataset
infer_label = numpy.array(
[data[1] for data in infer_data]).astype("float32") # extract label in test dataset
assert feed_target_names[0] == 'x'
results = infer_exe.run(inference_program,
feed={feed_target_names[0]: numpy.array(infer_feat)},
fetch_list=fetch_targets) # predict
#print predict result and label and visualize the result
print("infer results: (House Price)")
for idx, val in enumerate(results[0]):
print("%d: %.2f" % (idx, val)) # print predict result
print("\nground truth:")
for idx, val in enumerate(infer_label):
print("%d: %.2f" % (idx, val)) # print label
save_result(results[0], infer_label) # save picture
```
print("infer results: (House Price)")
for idx, val in enumerate(results[0]):
print("%d: %.2f" % (idx, val))
print("\nground truth:")
for idx, val in enumerate(test_y):
print("%d: %.2f" % (idx, val))
```
## Summary ## Summary
 In this chapter, we used the Boston house price dataset to introduce the basic concepts of the linear regression model and showed how to train and test it with PaddlePaddle. Many models and techniques are derived from linear regression, so it is important to understand its principles and limitations.
This chapter introduces *Linear Regression* and how to train and test this model with PaddlePaddle, using the UCI Housing Data Set. Because a large number of more complex models and techniques are derived from linear regression, it is important to understand its underlying theory and limitation. <a name="References"></a>
## References ## References
1. https://en.wikipedia.org/wiki/Linear_regression 1. https://en.wikipedia.org/wiki/Linear_regression
2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. Springer, Berlin: Springer series in statistics, 2001. 2. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning[M]. Springer, Berlin: Springer series in statistics, 2001.
3. Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012. 3. Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012.
4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128. 4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -14,11 +14,31 @@ ...@@ -14,11 +14,31 @@
from __future__ import print_function from __future__ import print_function
import sys
import argparse
import math
import numpy
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import numpy
import math
import sys def parse_args():
parser = argparse.ArgumentParser("fit_a_line")
parser.add_argument(
'--enable_ce',
action='store_true',
help="If set, run the task with continuous evaluation logs.")
parser.add_argument(
'--use_gpu',
type=bool,
default=False,
help="Whether to use GPU or not.")
parser.add_argument(
'--num_epochs', type=int, default=100, help="number of epochs.")
args = parser.parse_args()
return args
# For training test cost # For training test cost
...@@ -50,37 +70,50 @@ def save_result(points1, points2): ...@@ -50,37 +70,50 @@ def save_result(points1, points2):
def main(): def main():
batch_size = 20 batch_size = 20
if args.enable_ce:
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.uci_housing.train(), buf_size=500), paddle.dataset.uci_housing.train(), batch_size=batch_size)
test_reader = paddle.batch(
paddle.dataset.uci_housing.test(), batch_size=batch_size)
else:
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.uci_housing.train(), buf_size=500),
batch_size=batch_size) batch_size=batch_size)
test_reader = paddle.batch( test_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.uci_housing.test(), buf_size=500), paddle.reader.shuffle(
paddle.dataset.uci_housing.test(), buf_size=500),
batch_size=batch_size) batch_size=batch_size)
# feature vector of length 13 # feature vector of length 13
x = fluid.layers.data(name='x', shape=[13], dtype='float32') x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32') y = fluid.layers.data(name='y', shape=[1], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
main_program = fluid.default_main_program() main_program = fluid.default_main_program()
startup_program = fluid.default_startup_program() startup_program = fluid.default_startup_program()
if args.enable_ce:
main_program.random_seed = 90
startup_program.random_seed = 90
y_predict = fluid.layers.fc(input=x, size=1, act=None)
cost = fluid.layers.square_error_cost(input=y_predict, label=y) cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_loss = fluid.layers.mean(cost) avg_loss = fluid.layers.mean(cost)
test_program = main_program.clone(for_test=True)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001) sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_loss) sgd_optimizer.minimize(avg_loss)
test_program = main_program.clone(for_test=True)
# can use CPU or GPU # can use CPU or GPU
use_cuda = False use_cuda = args.use_gpu
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place) exe = fluid.Executor(place)
# Specify the directory to save the parameters # Specify the directory to save the parameters
params_dirname = "fit_a_line.inference.model" params_dirname = "fit_a_line.inference.model"
num_epochs = 100 num_epochs = args.num_epochs
# main train loop. # main train loop.
feeder = fluid.DataFeeder(place=place, feed_list=[x, y]) feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
...@@ -124,6 +157,10 @@ def main(): ...@@ -124,6 +157,10 @@ def main():
fluid.io.save_inference_model(params_dirname, ['x'], [y_predict], fluid.io.save_inference_model(params_dirname, ['x'], [y_predict],
exe) exe)
if args.enable_ce and pass_id == args.num_epochs - 1:
print("kpis\ttrain_cost\t%f" % avg_loss_value[0])
print("kpis\ttest_cost\t%f" % test_metics[0])
infer_exe = fluid.Executor(place) infer_exe = fluid.Executor(place)
inference_scope = fluid.core.Scope() inference_scope = fluid.core.Scope()
...@@ -160,4 +197,5 @@ def main(): ...@@ -160,4 +197,5 @@ def main():
if __name__ == '__main__': if __name__ == '__main__':
args = parse_args()
main() main()
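A note on the `--use_gpu` flag declared in `parse_args` above: with `type=bool`, argparse calls `bool()` on the raw string, and every non-empty string (including `"False"`) is truthy. The sketch below demonstrates this and shows one possible alternative; the converter `str2bool` is a hypothetical helper and is not part of train.py.

```python
import argparse


def str2bool(value):
    """Hypothetical converter: map common spellings to a real boolean."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


# Same declaration style as in parse_args: bool("False") evaluates to True.
naive = argparse.ArgumentParser()
naive.add_argument('--use_gpu', type=bool, default=False)
naive_args = naive.parse_args(['--use_gpu', 'False'])

# With an explicit converter, the string is interpreted as intended.
strict = argparse.ArgumentParser()
strict.add_argument('--use_gpu', type=str2bool, default=False)
strict_args = strict.parse_args(['--use_gpu', 'False'])

print(naive_args.use_gpu, strict_args.use_gpu)
```

Declaring the flag with `action='store_true'`, as `--enable_ce` already does, is another common way to avoid the problem.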
#!/bin/bash
#This file is only used for continuous evaluation.
python train.py --enable_ce | python _ce.py
...@@ -2,6 +2,14 @@ ...@@ -2,6 +2,14 @@
本教程源代码目录在[book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits),初次使用请您参考[Book文档使用说明](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书) 本教程源代码目录在[book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits),初次使用请您参考[Book文档使用说明](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)
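 ### Notes: ###
 1. Hardware requirements:
 This tutorial can run on either CPU or GPU.
 2. CUDA/cuDNN versions supported by the Docker image:
 If you run the Book with Docker, note that the GPU environment of the default image provided here is CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
 3. Consistency between the code in the document and in the script:
 Note: to make this tutorial easier to read and use, we split and adjusted the code of train.py and placed it in this document. The code in this document produces the same results as train.py; you can run [train.py](https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/train.py) directly to verify.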
## 背景介绍 ## 背景介绍
当我们学习编程的时候,编写的第一个程序一般是实现打印"Hello World"。而机器学习(或深度学习)的入门教程,一般都是 [MNIST](http://yann.lecun.com/exdb/mnist/) 数据库上的手写识别问题。原因是手写识别属于典型的图像分类问题,比较简单,同时MNIST数据集也很完备。MNIST数据集作为一个简单的计算机视觉数据集,包含一系列如图1所示的手写数字图片和对应的标签。图片是28x28的像素矩阵,标签则对应着0~9的10个数字。每张图片都经过了大小归一化和居中处理。 当我们学习编程的时候,编写的第一个程序一般是实现打印"Hello World"。而机器学习(或深度学习)的入门教程,一般都是 [MNIST](http://yann.lecun.com/exdb/mnist/) 数据库上的手写识别问题。原因是手写识别属于典型的图像分类问题,比较简单,同时MNIST数据集也很完备。MNIST数据集作为一个简单的计算机视觉数据集,包含一系列如图1所示的手写数字图片和对应的标签。图片是28x28的像素矩阵,标签则对应着0~9的10个数字。每张图片都经过了大小归一化和居中处理。
...@@ -29,7 +37,7 @@ MNIST吸引了大量的科学家基于此数据集训练模型,1998年,LeCun ...@@ -29,7 +37,7 @@ MNIST吸引了大量的科学家基于此数据集训练模型,1998年,LeCun
- $Y$是输出:分类器的输出是10类数字(0-9),即$Y=\left ( y_0, y_1, \dots, y_9 \right )$,每一维$y_i$代表图片分类为第$i$类数字的概率。 - $Y$是输出:分类器的输出是10类数字(0-9),即$Y=\left ( y_0, y_1, \dots, y_9 \right )$,每一维$y_i$代表图片分类为第$i$类数字的概率。
- $Label$是图片的真实标签:$Label=\left ( l_0, l_1, \dots, l_9 \right )$也是10维,但只有一维为1,其他都为0。例如某张图片上的数字为2,则它的标签为$(0,0,1,0, \dot, 0)$ - $Label$是图片的真实标签:$Label=\left ( l_0, l_1, \dots, l_9 \right )$也是10维,但只有一维为1,其他都为0。例如某张图片上的数字为2,则它的标签为$(0,0,1,0, \dots, 0)$
### Softmax回归(Softmax Regression) ### Softmax回归(Softmax Regression)
...@@ -37,14 +45,19 @@ MNIST吸引了大量的科学家基于此数据集训练模型,1998年,LeCun ...@@ -37,14 +45,19 @@ MNIST吸引了大量的科学家基于此数据集训练模型,1998年,LeCun
输入层的数据$X$传到输出层,在激活操作之前,会乘以相应的权重 $W$ ,并加上偏置变量 $b$ ,具体如下: 输入层的数据$X$传到输出层,在激活操作之前,会乘以相应的权重 $W$ ,并加上偏置变量 $b$ ,具体如下:
$$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/01.gif?raw=true"><br/>
</p>
其中 $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ 其中
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/02.gif?raw=true"><br/>
</p>
图2为softmax回归的网络图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。 图2为softmax回归的网络图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。
<p align="center"> <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/softmax_regression.png?raw=true" width=400><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/softmax_regression.png?raw=true" width=200><br/>
图2. softmax回归网络结构图<br/> 图2. softmax回归网络结构图<br/>
</p> </p>
...@@ -52,7 +65,9 @@ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$ ...@@ -52,7 +65,9 @@ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
在分类问题中,我们一般采用交叉熵代价损失函数(cross entropy loss),公式如下: 在分类问题中,我们一般采用交叉熵代价损失函数(cross entropy loss),公式如下:
$$ L_{cross-entropy}(label, y) = -\sum_i label_ilog(y_i) $$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/03.gif?raw=true"><br/>
</p>
...@@ -98,7 +113,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -98,7 +113,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
- 局部连接:每个神经元仅与输入神经元的一块区域连接,这块局部区域称作感受野(receptive field)。在图像卷积操作中,即神经元在空间维度(spatial dimension,即上图示例H和W所在的平面)是局部连接,但在深度上是全部连接。对于二维图像本身而言,也是局部像素关联较强。这种局部连接保证了学习后的过滤器能够对于局部的输入特征有最强的响应。局部连接的思想,也是受启发于生物学里面的视觉系统结构,视觉皮层的神经元就是局部接受信息的。 - 局部连接:每个神经元仅与输入神经元的一块区域连接,这块局部区域称作感受野(receptive field)。在图像卷积操作中,即神经元在空间维度(spatial dimension,即上图示例H和W所在的平面)是局部连接,但在深度上是全部连接。对于二维图像本身而言,也是局部像素关联较强。这种局部连接保证了学习后的过滤器能够对于局部的输入特征有最强的响应。局部连接的思想,也是受启发于生物学里面的视觉系统结构,视觉皮层的神经元就是局部接受信息的。
- 权重共享:计算同一个深度切片的神经元时采用的滤波器是共享的。例如图4中计算$o[:,:,0]$的每个每个神经元的滤波器均相同,都为$W_0$,这样可以很大程度上减少参数。共享权重在一定程度上讲是有意义的,例如图片的底层边缘特征与特征在图中的具体位置无关。但是在一些场景中是无意的,比如输入的图片是人脸,眼睛和头发位于不同的位置,希望在不同的位置学到不同的特征 (参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/))。请注意权重只是对于同一深度切片的神经元是共享的,在卷积层,通常采用多组卷积核提取不同特征,即对应不同深度切片的特征,不同深度切片的神经元权重是不共享。另外,偏重对同一深度切片的所有神经元都是共享的。 - 权重共享:计算同一个深度切片的神经元时采用的滤波器是共享的。例如图5中计算$o[:,:,0]$的每个每个神经元的滤波器均相同,都为$W_0$,这样可以很大程度上减少参数。共享权重在一定程度上讲是有意义的,例如图片的底层边缘特征与特征在图中的具体位置无关。但是在一些场景中是无意的,比如输入的图片是人脸,眼睛和头发位于不同的位置,希望在不同的位置学到不同的特征 (参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/))。请注意权重只是对于同一深度切片的神经元是共享的,在卷积层,通常采用多组卷积核提取不同特征,即对应不同深度切片的特征,不同深度切片的神经元权重是不共享。另外,偏重对同一深度切片的所有神经元都是共享的。
通过介绍卷积计算过程及其特性,可以看出卷积是线性操作,并具有平移不变性(shift-invariant),平移不变性即在图像每个位置执行相同的操作。卷积层的局部连接和权重共享使得需要学习的参数大大减小,这样也有利于训练较大卷积神经网络。 通过介绍卷积计算过程及其特性,可以看出卷积是线性操作,并具有平移不变性(shift-invariant),平移不变性即在图像每个位置执行相同的操作。卷积层的局部连接和权重共享使得需要学习的参数大大减小,这样也有利于训练较大卷积神经网络。
...@@ -117,9 +132,17 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -117,9 +132,17 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
<a name="常见激活函数介绍"></a> <a name="常见激活函数介绍"></a>
### 常见激活函数介绍 ### 常见激活函数介绍
- sigmoid激活函数: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $ - sigmoid激活函数:
- tanh激活函数: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/04.gif?raw=true"><br/>
</p>
- tanh激活函数:
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/05.gif?raw=true"><br/>
</p>
实际上,tanh函数只是规模变化的sigmoid函数,将sigmoid函数值放大2倍之后再向下平移1个单位:tanh(x) = 2sigmoid(2x) - 1 。 实际上,tanh函数只是规模变化的sigmoid函数,将sigmoid函数值放大2倍之后再向下平移1个单位:tanh(x) = 2sigmoid(2x) - 1 。
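The identity tanh(x) = 2sigmoid(2x) - 1 stated above can be checked numerically. This is a minimal NumPy sketch; the helper `sigmoid` and the sample grid `xs` are illustrative choices, not part of the tutorial code.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


xs = np.linspace(-5.0, 5.0, 101)   # sample points on [-5, 5]
lhs = np.tanh(xs)                  # tanh(x)
rhs = 2.0 * sigmoid(2.0 * xs) - 1.0  # 2 * sigmoid(2x) - 1
max_gap = np.max(np.abs(lhs - rhs))
print(max_gap)  # differs from zero only by floating-point rounding
```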
...@@ -160,13 +183,13 @@ PaddlePaddle在API中提供了自动加载[MNIST](http://yann.lecun.com/exdb/mni ...@@ -160,13 +183,13 @@ PaddlePaddle在API中提供了自动加载[MNIST](http://yann.lecun.com/exdb/mni
加载 PaddlePaddle 的 Fluid API 包。 加载 PaddlePaddle 的 Fluid API 包。
```python ```python
from __future__ import print_function # 将python3中的print特性导入当前版本
import os import os
from PIL import Image # 导入图像处理模块 from PIL import Image # 导入图像处理模块
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
import numpy import numpy
import paddle # 导入paddle模块 import paddle # 导入paddle模块
import paddle.fluid as fluid import paddle.fluid as fluid
from __future__ import print_function # 将python3中的print特性导入当前版本
``` ```
### Program Functions 配置 ### Program Functions 配置
...@@ -240,7 +263,7 @@ def convolutional_neural_network(): ...@@ -240,7 +263,7 @@ def convolutional_neural_network():
act="relu") act="relu")
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1) conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
# 第二个卷积-池化层 # 第二个卷积-池化层
# 使用20个5*5的滤波器,池化大小为2,池化步长为2,激活函数为Relu # 使用50个5*5的滤波器,池化大小为2,池化步长为2,激活函数为Relu
conv_pool_2 = fluid.nets.simple_img_conv_pool( conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1, input=conv_pool_1,
filter_size=5, filter_size=5,
...@@ -373,15 +396,13 @@ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() ...@@ -373,15 +396,13 @@ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
# 调用train_program 获取预测值,损失值, # 调用train_program 获取预测值,损失值,
prediction, [avg_loss, acc] = train_program() prediction, [avg_loss, acc] = train_program()
# 输入的原始图像数据,大小为28*28*1 # 输入的原始图像数据,名称为img,大小为28*28*1
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
# 标签层,名称为label,对应输入图片的类别标签 # 标签层,名称为label,对应输入图片的类别标签
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
# 告知网络传入的数据分为两部分,第一部分是img值,第二部分是label值 # 告知网络传入的数据分为两部分,第一部分是img值,第二部分是label值
feeder = fluid.DataFeeder(feed_list=[img, label], place=place) feeder = fluid.DataFeeder(feed_list=['img', 'label'], place=place)
# 选择Adam优化器 # 选择Adam优化器
optimizer = fluid.optimizer.Adam(learning_rate=0.001) optimizer = optimizer_program()
optimizer.minimize(avg_loss) optimizer.minimize(avg_loss)
``` ```
...@@ -492,7 +513,7 @@ Pass 900, Batch 0, Cost 0.239809 ...@@ -492,7 +513,7 @@ Pass 900, Batch 0, Cost 0.239809
Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338 Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338
``` ```
训练之后,检查模型的预测准确度。用 MNIST 训练的时候,一般 softmax回归模型的分类准确率为约为 92.34%,多层感知器为97.66%,卷积神经网络可以达到 99.20%。 训练之后,检查模型的预测准确度。用 MNIST 训练的时候,一般 softmax回归模型的分类准确率约为 92.34%,多层感知器为97.66%,卷积神经网络可以达到 99.20%。
## 应用模型 ## 应用模型
...@@ -505,9 +526,13 @@ Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338 ...@@ -505,9 +526,13 @@ Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338
```python ```python
def load_image(file): def load_image(file):
 # read the image file and convert it to grayscale
im = Image.open(file).convert('L') im = Image.open(file).convert('L')
 # resize the input image to 28*28 with high-quality (antialiased) resampling
im = im.resize((28, 28), Image.ANTIALIAS) im = im.resize((28, 28), Image.ANTIALIAS)
 # convert the image to a numpy array
im = numpy.array(im).reshape(1, 1, 28, 28).astype(numpy.float32) im = numpy.array(im).reshape(1, 1, 28, 28).astype(numpy.float32)
 # normalize the pixel values from [0, 255] to [-1, 1]
im = im / 255.0 * 2.0 - 1.0 im = im / 255.0 * 2.0 - 1.0
return im return im
...@@ -565,5 +590,4 @@ with fluid.scope_guard(inference_scope): ...@@ -565,5 +590,4 @@ with fluid.scope_guard(inference_scope):
10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58. 10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
# Recognize Digits # Recognize Digits
The source code for this tutorial is here: [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). For instructions on getting started with this book,see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code of this tutorial is in [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). For new users, please refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book) .
## Introduction ## Background
When one learns to program, the first task is usually to write a program that prints "Hello World!".    When we learn to program, the first program we write usually prints “Hello World”. The equivalent entry-level tutorial in machine learning or deep learning is usually handwritten digit recognition on the [MNIST](http://yann.lecun.com/exdb/mnist/) database, because handwriting recognition is a typical, relatively simple image classification problem and the MNIST dataset is complete. As a simple computer vision dataset, MNIST contains a series of images of handwritten digits and their corresponding labels, as shown in Figure 1. Each image is a 28x28 pixel matrix, and each label is one of the 10 digits from 0 to 9. Every image has been size-normalized and centered.
In Machine Learning or Deep Learning, an equivalent task is to train a model to recognize hand-written digits using the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset.
Handwriting recognition is a classic image classification problem. The problem is relatively easy and MNIST is a complete dataset.
As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1).
The input image is a $28\times28$ matrix, and the label is one of the digits from $0$ to $9$. All images are normalized, meaning that they are both rescaled and centered.
<p align="center"> <p align="center">
<img src="image/mnist_example_image.png" width="400"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/mnist_example_image.png?raw=true" width="400"><br/>
Fig. 1. Examples of MNIST images Figure 1. Example of a MNIST picture
</p> </p>
The MNIST dataset is from the [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and the Special Database 1 (SD-1).    The MNIST dataset is created from [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and Special Database 1 (SD-1). Because SD-3 was labeled by staff of the US Census Bureau and SD-1 by US high school students, SD-3 is cleaner and easier to recognize than SD-1. Yann LeCun et al. built both the training set (60,000 samples) and the test set (10,000 samples) by taking half of the samples from SD-1 and half from SD-3. 250 annotators labeled the training set, which guaranteed that the annotators of the training set and the test set did not completely overlap.
The SD-3 is labeled by the staff of the U.S. Census Bureau, while SD-1 is labeled by high school students. Therefore the SD-3 is cleaner and easier to recognize than the SD-1 dataset.
Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set of 60,000 samples and test set of 10,000 samples.
250 annotators labeled the training set, thus guaranteed that there wasn't a complete overlap of annotators of training set and test set.
The MNIST dataset has been used for evaluating many image recognition algorithms such as a single layer linear classifier,    MNIST has attracted many researchers to train models on it. In 1998, LeCun ran experiments with a single-layer linear classifier, a Multilayer Perceptron, and the multilayer convolutional neural network LeNet, steadily reducing the error on the test set (from 12% to 0.7%)\[[1](#References)\]. In the course of this research, LeCun, a pioneer of deep learning, proposed the Convolutional Neural Network, which greatly improved handwriting recognition performance. Since then, researchers have conducted many experiments based on the K-Nearest Neighbors algorithm\[[2](#References)\], SVM\[[3](#References)\], neural networks\[[4-7](#References)\], Boosting\[[8](#References)\], and other methods, combined with various preprocessing techniques (such as distortion removal, noise removal, and blurring) to improve recognition accuracy.
Multilayer Perceptron (MLP) and Multilayer CNN LeNet\[[1](#references)\], K-Nearest Neighbors (k-NN) \[[2](#references)\], Support Vector Machine (SVM) \[[3](#references)\],
Neural Networks \[[4-7](#references)\], Boosting \[[8](#references)\] and preprocessing methods like distortion removal, noise removal, and blurring.
Among these algorithms, the *Convolutional Neural Network* (CNN) has achieved a series of impressive results in Image Classification tasks, including VGGNet, GoogLeNet,
and ResNet (See [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification) tutorial).
In this tutorial, we start with a simple **softmax** regression model and go on with MLP and CNN. Readers will see how these methods improve the recognition accuracy step-by-step.    Convolutional Neural Networks now play an important role in deep learning. From the simple LeNet proposed by Yann LeCun in the early days to models such as VGGNet, GoogleNet, and ResNet in the ImageNet competition (please refer to the [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification) tutorial), convolutional neural networks have produced a series of great achievements in image classification.
## Model Overview
Before introducing classification algorithms and training procedure, we define the following symbols:    In this tutorial, we start from a simple Softmax regression model to introduce handwriting recognition, then show how to improve the model step by step with MLP and CNN to obtain better recognition results.
- $X$ is the input: Input is a $28\times 28$ MNIST image. It is flattened to a $784$ dimensional vector. $X=\left (x_0, x_1, \dots, x_{783} \right )$.
- $Y$ is the output: Output of the classifier is 1 of the 10 classes (digits from 0 to 9). $Y=\left (y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents the probability that the input image belongs to class $i$.
- $L$ is the ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$. It is also 10 dimensional, but only one entry is $1$ and all others are $0$s. ## Exploration of Models
To train a classifier based on MNIST dataset, before the introduction of three basic image classification networks used in this tutorial, we first give some definitions:
 - $X$ is the input: an MNIST image is a two-dimensional $28\times28$ image. For ease of computation, we flatten it into a $784$-dimensional vector, i.e. $X=\left ( x_0, x_1, \dots, x_{783} \right )$.
 - $Y$ is the output: the classifier outputs a score for each of the 10 digits (0-9), i.e. $Y=\left ( y_0, y_1, \dots, y_9 \right )$, where each dimension $y_i$ is the probability that the image is classified as digit $i$.
 - $Label$ is the true label of the image: $Label=\left ( l_0, l_1, \dots, l_9 \right ) $ is also 10-dimensional, but only one entry is 1 and all the others are 0. For example, if the digit in an image is 2, its label is $(0,0,1,0, \dots, 0)$
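The definitions above can be made concrete with a few lines of NumPy. This is a minimal sketch under stated assumptions: the all-zero `image` stands in for a real MNIST sample, and the variable names are illustrative only.

```python
import numpy as np

# Placeholder for one 28x28 grayscale MNIST image (assumption: all zeros).
image = np.zeros((28, 28), dtype=np.float32)

# X: flatten the 28x28 matrix into a 784-dimensional vector.
x = image.reshape(-1)

# Label: one-hot vector of length 10 for an image showing the digit 2.
digit = 2
label = np.zeros(10, dtype=np.float32)
label[digit] = 1.0

print(x.shape, label)
```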
### Softmax Regression ### Softmax Regression
In a simple softmax regression model, the input is first fed to fully connected layers. Then, a softmax function is applied to output probabilities of multiple output classes\[[9](#references)\].    The simplest Softmax regression model passes the input through a single fully connected layer to obtain features and then computes and outputs the probabilities of the classes directly with the Softmax function \[[9](#references)\].
The input $X$ is multiplied by weights $W$ and then added to the bias $b$ to generate activations. The data of the input layer $X$ is passed to the output layer. The input $X$ is multiplied by weights $W$ and then added to the bias $b$ to generate activations:
$$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
For an $N$-class classification problem with $N$ output nodes, Softmax normalizes the resulting $N$ dimensional vector so that each of its entries falls in the range $[0,1]$, representing the probability that the sample belongs to a certain class. Here $y_i$ denotes the predicted probability that an image is of digit $i$. Figure 2 shows a softmax regression network, in which weights are represented by blue lines, biases are represented by red lines, and +1 indicates that the bias is $1$.
In such a classification problem, we usually use the cross entropy loss function: <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/softmax_regression.png?raw=true" width=400><br/>
Figure 2. Softmax regression network structure <br/>
</p>
$$ L_{\text{cross-entropy}}(label, y) = -\sum_i label_i \log(y_i) $$ For an $N$-class classification problem with $N$ output nodes, Softmax normalizes the resulting $N$ dimensional vector so that each of its entries falls in the range $[0,1]$, representing the probability that the sample belongs to a certain category. Here $y_i$ denotes the predicted probability that an image is of number $i$.
In classification problems, we usually use the cross-entropy loss, defined as follows:
$$ L_{\text{cross-entropy}}(label, y) = -\sum_i label_i \log(y_i) $$
Fig. 2 illustrates a softmax regression network, with the weights in blue, and the bias in red. `+1` indicates that the bias is $1$.
<p align="center">
<img src="image/softmax_regression_en.png" width=400><br/>
Fig. 2. Softmax regression network architecture<br/>
</p>
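The softmax and cross-entropy formulas in this section can be checked numerically with a minimal sketch in plain Python. It is not the PaddlePaddle implementation; the three-class scores are made up for illustration, and subtracting the maximum score is a common stability trick that is assumed here rather than taken from the text.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(label, y):
    """Compute -sum_i label_i * log(y_i) for a one-hot label vector."""
    return -sum(l * math.log(p) for l, p in zip(label, y) if l > 0)


scores = [1.0, 2.0, 3.0]   # hypothetical values of W x + b for three classes
probs = softmax(scores)
label = [0, 0, 1]          # the true class is the last one
loss = cross_entropy(label, probs)

print(probs)
print(loss)
```

Because the label is one-hot, the loss reduces to the negative log of the probability assigned to the true class, so a confident correct prediction gives a loss close to 0.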
### Multilayer Perceptron ### Multilayer Perceptron
The softmax regression model described above uses the simplest two-layer neural network. That is, it only contains an input layer and an output layer, with limited regression capability. To achieve better recognition results, consider adding several hidden layers\[[10](#references)\] between the input layer and the output layer. The Softmax regression model uses the simplest two-layer neural network, which contains only the input layer and the output layer, so its performance is limited. In order to achieve better recognition, we consider adding several hidden layers \[[10](#references)\] between the input and output layer.
1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ denotes the activation function. Some [common ones](#list-of-common-activation-functions) are sigmoid, tanh and ReLU. 1. In the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ represents the activation function. Common choices are [sigmoid, tanh or ReLU](#common activation functions).
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $. 2. In the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, the output layer outputs $Y=\text{softmax}(W_3H_2 + b_3)$, the vector denoting our classification result. 3. Finally, in the output layer, we get $Y=\text{softmax}(W_3H_2 + b_3)$, which is the final classification result vector.
Fig. 3 shows a Multilayer Perceptron network, with the weights in blue, and the bias in red. +1 indicates that the bias is $1$.
<p align="center"> Figure 3 shows the network structure of a multilayer perceptron, in which weights are represented by blue lines, biases are represented by red lines, and +1 indicates that the bias is $1$.
<img src="image/mlp_en.png" width=500><br/>
Fig. 3. Multilayer Perceptron network architecture<br/>
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/mlp.png?raw=true" width=500><br/>
Figure 3. Multilayer perceptron network structure <br/>
</p> </p>
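The three steps of the multilayer perceptron can be traced by hand with a small forward pass in plain Python. This is only a sketch: the layer sizes and weights below are hypothetical toy values (2 inputs, two hidden layers of 2 units, 3 classes), and ReLU is chosen as $\phi$.

```python
import math


def dense(weights, bias, inputs):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Apply max(0, x) element-wise."""
    return [max(0.0, v) for v in values]


def softmax(scores):
    """Normalize scores into probabilities."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, params):
    """H1 = phi(W1 X + b1), H2 = phi(W2 H1 + b2), Y = softmax(W3 H2 + b3)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(w1, b1, x))
    h2 = relu(dense(w2, b2, h1))
    return softmax(dense(w3, b3, h2))


# Hypothetical toy parameters, chosen only so the result is easy to verify.
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
]
y = mlp_forward([2.0, 1.0], params)
print(y)
```

With these values, $H_1 = H_2 = (1, 1.5)$ and the scores before softmax are $(1, 1.5, 2.5)$, so the third class receives the highest probability.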
### Convolutional Neural Network ### Convolutional Neural Network
#### Convolutional Layer In the multilayer perceptron model, an image is flattened into a one-dimensional vector before being fed into the network, which ignores its position and structure information. A convolutional neural network can make better use of the structure information of the image. [LeNet-5](http://yann.lecun.com/exdb/lenet/) is a relatively simple convolutional neural network. Figure 4 shows its structure: the input two-dimensional image first passes through two convolutional layers, each followed by a pooling layer, then through a fully connected layer, and finally through a softmax output layer. Below we mainly introduce the convolutional layer and the pooling layer.
<p align="center"> <p align="center">
<img src="image/conv_layer.png" width='750'><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/cnn.png?raw=true" width="600"><br/>
Fig. 4. Convolutional layer<br/> Figure 4. LeNet-5 convolutional neural network structure<br/>
</p> </p>
The **convolutional layer** is the core of a Convolutional Neural Network. The parameters in this layer are composed of a set of filters, also called kernels. We could visualize the convolution step in the following fashion: Each kernel slides horizontally and vertically till it covers the whole image. At every window, we compute the dot product of the kernel and the input. Then, we add the bias and apply an activation function. The result is a two-dimensional activation map. For example, some kernel may recognize corners, and some may recognize circles. These convolution kernels may respond strongly to the corresponding features. #### Convolutional Layer
Fig. 4 illustrates the computation of a convolutional layer step by step, where depths are flattened for simplicity. The input is $W_1=5$, $H_1=5$, $D_1=3$. In fact, this is a common representation for colored images. $W_1$ and $H_1$ correspond to the width and height in a colored image. $D_1$ corresponds to the three color channels for RGB. The parameters of the convolutional layer are $K=2$, $F=3$, $S=2$, $P=1$. $K$ denotes the number of kernels; specifically, $Filter$ $W_0$ and $Filter$ $W_1$ are the kernels. $F$ is the kernel size; $W_0$ and $W_1$ are both $F\times F = 3\times3$ matrices at every depth. $S$ is the stride, which is the step of the sliding window; here, kernels move rightwards or downwards by two units each time. $P$ is the width of the padding, which denotes an extension of the input; here, the gray area shows zero padding with size 1.
#### Pooling Layer The convolutional layer is the core of a convolutional neural network. The convolution used in image recognition is a two-dimensional convolution, that is, a discrete two-dimensional filter (also called a convolutional kernel) convolved with a two-dimensional image. In short, the two-dimensional filter slides over all positions of the two-dimensional image, and at each position a dot product is taken between the filter and the pixel together with its neighboring pixels. Convolution operations are widely used in the field of image processing. Different convolutional kernels can extract different features, such as edges, lines, and corners. In deep convolutional neural networks, image features ranging from low-level to complex can be extracted by convolution operations.
<p align="center"> <p align="center">
<img src="image/max_pooling_en.png" width="400px"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/conv_layer.png?raw=true" width='750'><br/>
Fig. 5 Pooling layer using max-pooling<br/> Figure 5. Convolutional Layer Picture <br/>
</p> </p>
A **pooling layer** performs downsampling. The main functionality of this layer is to reduce computation by reducing the network parameters. It also prevents over-fitting to some extent. Usually, a pooling layer is added after a convolutional layer. A pooling layer can use various techniques, such as max pooling and average pooling. As shown in Fig.5, max pooling uses rectangles to segment the input layer into several parts and computes the maximum value in each part as the output. Figure 5 shows an example of computing a convolution on an input image of size $H=5, W=5, D=3$, i.e. a $5 \times 5$ color image with 3 channels (RGB, also known as depth).
This example contains two (denoted by $K$) groups of convolutional kernels, i.e. $Filter W_0$ and $Filter W_1$ in the figure. In a convolution, different convolutional kernels are usually used for different input channels. In the example, each group contains $D=3$ convolutional kernels of size $3\times 3$ (denoted by $F \times F$). In addition, the stride of the convolutional kernel in the horizontal and vertical directions of the image is 2 (denoted by $S$), and the input image is zero-padded by 1 (denoted by $P$) on all four sides; that is, the raw input data in the figure is the blue part, and the gray part is the padding of zeros with size 1. The convolution operation yields a feature map of size $3 \times 3 \times 2$ (denoted by $H_{o} \times W_{o} \times K$), i.e. a 2-channel feature map of size $3 \times 3$, where $H_o$ is calculated as $H_o = (H - F + 2 \times P)/S + 1$, and $W_o$ is calculated in the same way. Each pixel in the output feature map is the sum of the inner products between each filter in a group and the corresponding channel of the input image, plus the bias $b_o$; the bias is usually shared across each output feature map. The last $-2$ in the output feature map $o[:,:,0]$ is calculated as shown in the lower right corner of Figure 5.
The convolutional kernel is a learnable parameter of the convolution operation. As shown in the example above, the number of kernel weights in each convolutional layer is $D \times F \times F \times K$. In the multilayer perceptron model, neurons are usually fully connected, which leads to a large number of parameters. A convolutional layer has far fewer parameters, which follows from its two main characteristics: local connections and shared weights.
- Local connection: Each neuron is connected to only one region of the input, which is called the receptive field. In the image convolution operation, neurons are locally connected in the spatial dimensions (the plane spanned by H and W in the example above), but are fully connected in depth. In a two-dimensional image, nearby pixels are strongly correlated. This local connection ensures that the learned filter responds most strongly to local input features. The idea of local connection is also inspired by the structure of the biological visual system, in which neurons in the visual cortex receive information locally.
- Weight sharing: The filters used to compute the neurons in the same depth slice are shared. For example, in Figure 5, every neuron in $o[:,:,0]$ is computed with the same filter, $W_0$, which greatly reduces the number of parameters. Sharing weights is meaningful to a certain extent; for example, a low-level edge feature of an image is independent of where it appears in the image. However, sharing does not make sense in some cases. For example, when the input picture is a face, eyes and hair appear in different positions, and we may want to learn different features at different positions (please refer to [Stanford University Open Class](http://cs231n.Github.io/convolutional-networks/)). Note that the weights are only shared among the neurons of the same depth slice. In a convolutional layer, multiple sets of convolutional kernels are usually used to extract different features, that is, weights are not shared between neurons in different depth slices. In addition, the bias is shared by all neurons in the same depth slice.
From the computation process of convolution and its characteristics described above, convolution can be seen as a shift-invariant linear operation, i.e. the same operation is performed at each position of the image. The local connection and weight sharing of the convolutional layer greatly reduce the number of parameters that need to be learned, which helps with training larger convolutional neural networks.
#### LeNet-5 Network For more information about convolution, please refer to [Reference Reading](http://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolution#Convolutions)
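The output-size formula $H_o = (H - F + 2 \times P)/S + 1$ and the weight count $D \times F \times F \times K$ can be verified for the Figure 5 example with a short sketch. The function names `conv_output_size` and `conv_weight_count` are hypothetical helpers, not PaddlePaddle APIs.

```python
def conv_output_size(in_size, filter_size, padding, stride):
    """Return (in_size - filter_size + 2 * padding) / stride + 1."""
    span = in_size - filter_size + 2 * padding
    if span % stride != 0:
        # The filter would not tile the padded input evenly.
        raise ValueError("stride does not evenly divide the padded input")
    return span // stride + 1


def conv_weight_count(depth, filter_size, num_filters):
    """Return D * F * F * K, the number of kernel weights (biases excluded)."""
    return depth * filter_size * filter_size * num_filters


# Values from the Figure 5 example: H = W = 5, D = 3, K = 2, F = 3, S = 2, P = 1.
h_o = conv_output_size(in_size=5, filter_size=3, padding=1, stride=2)
w_o = conv_output_size(in_size=5, filter_size=3, padding=1, stride=2)
weights = conv_weight_count(depth=3, filter_size=3, num_filters=2)

print(h_o, w_o)   # 3 3
print(weights)    # 54
```

For comparison, a fully connected layer mapping the same $5 \times 5 \times 3$ input to a $3 \times 3 \times 2$ output would need $75 \times 18 = 1350$ weights.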
#### Pooling Layer
<p align="center"> <p align="center">
<img src="image/cnn_en.png"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/max_pooling.png?raw=true" width="400px"><br/>
Fig. 6. LeNet-5 Convolutional Neural Network architecture<br/> Figure 6. Pooling layer illustration<br/>
</p> </p>
[**LeNet-5**](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6 shows its architecture: A 2-dimensional input image is fed into two sets of convolutional layers and pooling layers. This output is then fed to a fully connected layer and a softmax classifier. Compared to multilayer, fully connected perceptrons, the LeNet-5 can recognize images better. This is due to the following three properties of the convolution: Pooling is a form of nonlinear downsampling. Its main function is to reduce computation by reducing the network parameters and to control overfitting to some extent. Normally a pooling layer is added after the convolutional layer. Pooling includes max pooling, average pooling and so on. Max pooling divides the input layer into different areas using non-overlapping rectangular boxes, and takes the maximum value of each box as the output, as shown in Figure 6.
- The 3D nature of the neurons: a convolutional layer is organized by width, height, and depth. Neurons in each layer are connected to only a small region in the previous layer. This region is called the receptive field. For details about convolutional neural networks, please refer to the tutorials of the [Stanford Online Course]( http://cs231n.github.io/convolutional-networks/ ), [Ufldl](http://ufldl.stanford.edu/wiki/index.php/Pooling) and [Image Classification]( https://github.com/PaddlePaddle/book/tree/develop/03.image_classification ).
- Local connectivity: A CNN utilizes the local space correlation by connecting local neurons. This design guarantees that the learned filter has a strong response to local input features. Stacking many such layers generates a non-linear filter that is more global. This enables the network to first obtain good representation for small parts of input and then combine them to represent a larger region.
- Weight sharing: In a CNN, computation is iterated on shared parameters (weights and bias) to form a feature map. This means that all the neurons in the same depth of the output response to the same feature. This allows the network to detect a feature regardless of its position in the input.
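The max pooling operation described in the pooling layer section can be sketched in plain Python as follows. This is an illustration only, assuming a single-channel feature map stored as a list of rows and non-overlapping $2 \times 2$ windows; the input values are made up.

```python
def max_pool_2d(feature_map, pool_size=2, stride=2):
    """Take the maximum of each pool_size x pool_size window."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    output = []
    for top in range(0, rows - pool_size + 1, stride):
        out_row = []
        for left in range(0, cols - pool_size + 1, stride):
            window = [feature_map[top + i][left + j]
                      for i in range(pool_size)
                      for j in range(pool_size)]
            out_row.append(max(window))
        output.append(out_row)
    return output


# A made-up 4x4 feature map.
feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

Each output value summarizes one $2 \times 2$ region, so the $4 \times 4$ input shrinks to $2 \times 2$ without introducing any learnable parameters.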
For more details on Convolutional Neural Networks, please refer to the tutorial on [Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) and the [relevant lecture](http://cs231n.github.io/convolutional-networks/) from a Stanford course. <a name="common activation functions"></a>
### Common activation functions
### List of Common Activation Functions
- Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $ - Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $
- Tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $ - Tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $
In fact, tanh function is just a rescaled version of the sigmoid function. It is obtained by magnifying the value of the sigmoid function and moving it downwards by 1. In fact, the tanh function is only a sigmoid function with change of scale. The value of the sigmoid function is doubled and then shifted down by 1 unit: tanh(x) = 2sigmoid(2x) - 1 .
- ReLU activation function: $ f(x) = max(0, x) $ - ReLU activation function: $ f(x) = max(0, x) $
For more information, please refer to [Activation functions on Wikipedia](https://en.wikipedia.org/wiki/Activation_function). For details, please refer to [activation function in Wikipedia](https://en.wikipedia.org/wiki/Activation_function).
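The three activation functions, and the relation $tanh(x) = 2 \cdot sigmoid(2x) - 1$ stated above, can be checked numerically with a short sketch using only the Python standard library. The sample points are arbitrary.

```python
import math


def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)


# Compare tanh(x) with 2 * sigmoid(2x) - 1 at a few arbitrary points.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
max_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))
              for x in samples)

print(max_gap)                  # a value at floating-point rounding level
print(relu(-3.0), relu(2.5))    # 0.0 2.5
```

Note that the identity rescales the input as well as the output: the sigmoid is evaluated at $2x$, then doubled and shifted down by 1.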
## Data Preparation
PaddlePaddle provides a Python module, `paddle.dataset.mnist`, which downloads and caches the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). The cache is under `/home/username/.cache/paddle/dataset/mnist`: ## Dataset Preparation
PaddlePaddle provides a module `paddle.dataset.mnist` that automatically loads the [MNIST](http://yann.lecun.com/exdb/mnist/) data through its API. The loaded data is located under `/home/username/.cache/paddle/dataset/mnist`:
| File name | Description | Size |
|----------------------|--------------|-----------|
|train-images-idx3-ubyte| Training images | 60,000 |
|train-labels-idx1-ubyte| Training labels | 60,000 |
|t10k-images-idx3-ubyte | Evaluation images | 10,000 |
|t10k-labels-idx1-ubyte | Evaluation labels | 10,000 |
| filename | note |
|----------------------|-------------------------|
|train-images-idx3-ubyte| train data picture, 60,000 data |
|train-labels-idx1-ubyte| train data label, 60,000 data |
|t10k-images-idx3-ubyte | test data picture, 10,000 data |
|t10k-labels-idx1-ubyte | test data label, 10,000 data |
## Fluid API Overview ## Fluid API Overview
The demo will be using the latest paddle fluid API. Fluid API is the latest Paddle API. It simplifies the model configurations without sacrificing the performance. The demo will use the latest [Fluid API](http://paddlepaddle.org/documentation/docs/en/1.2/api_cn/index_cn.html). Fluid API is the latest PaddlePaddle API. It simplifies model configuration without sacrificing performance.
We recommend using Fluid API as it is much easier to pick up. We recommend using the Fluid API, which is easy to learn and use to help you complete your machine learning tasks quickly.
Here is a quick overview of the major Fluid API components.
Here is an overview of several important concepts in the Fluid API:
1. `inference_program`: A function that specifies how to get the prediction from the data input. 1. `inference_program`: a function that specifies how to get the prediction from the data input.
This is where you specify the network flow. This is where the network flow is defined.
1. `train_program`: A function that specifies how to get avg_cost from `inference_program` and labels.
This is where you specify the loss calculations.
1. `optimizer_func`: A function that specifies the configuration of the optimizer. The optimizer is responsible for minimizing the loss and driving the training. Paddle supports many different optimizers.
1. `Trainer`: Fluid trainer manages the training process specified by the `train_program` and `optimizer`. Users can monitor the training
progress through the `event_handler` callback function.
1. `Inferencer`: Fluid inferencer loads the `inference_program` and the parameters trained by the Trainer.
It can then run inference on the data and return the prediction.
We will go through all of them and dig more into the configurations in this demo. 2. `train_program`: a function that specifies how to get the `loss` from `inference_program` and the label.
This is where the loss calculation is specified.
## Model Configuration 3. `optimizer_func`: a function that specifies the optimizer configuration. The optimizer is responsible for reducing the loss and driving training. Paddle supports a number of different optimizers.
A PaddlePaddle program starts by importing the API package: In the code examples below, we'll take a closer look at them.
## Configuration Instructions
Load the Fluid API package for PaddlePaddle.
```python ```python
import paddle from __future__ import print_function #load print of python3 into current version
import os
from PIL import Image # load module of image processing
import matplotlib.pyplot as plt
import numpy
import paddle # load paddle module
import paddle.fluid as fluid import paddle.fluid as fluid
from __future__ import print_function
try:
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
``` ```
### Program Functions Configuration ### Program Functions Configuration
First, we need to set up the `inference_program` function. We want to use this program to demonstrate three different classifiers, each defined as a Python function. We need to configure the `inference_program` function. We want to use this program to show three different classifiers, each of which is defined as a Python function.
We need to feed image data to the classifier. PaddlePaddle provides a special layer `layer.data` for reading data. We need to input the image data into the classifier. Paddle provides a special layer `layer.data` for reading data.
Let us create a data layer for reading images and connect it to the classification network. Let's create a data layer to read the image and connect it to the classification network.
- Softmax regression: the network has a fully-connected layer with softmax activation: - Softmax regression: the classification result is obtained through a single fully connected layer with softmax as the activation function.
```python ```python
def softmax_regression(): def softmax_regression():
"""
Define softmax classifier:
A fully connected layer with activation function softmax
Return:
predict_image -- result of classification
"""
# input original image data in size of 28*28*1
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32') img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
# With softmax as the fully connected layer of the activation function, the size of the output layer must be 10
predict = fluid.layers.fc( predict = fluid.layers.fc(
input=img, size=10, act='softmax') input=img, size=10, act='softmax')
return predict return predict
``` ```
- Multi-Layer Perceptron: this network has two hidden fully-connected layers, both using ReLU as the activation function. The output layer uses softmax activation: - Multilayer Perceptron: The following code implements a multilayer perceptron with two hidden layers (that is, fully connected layers). Both hidden layers use ReLU as the activation function, and the output layer uses Softmax.
```python ```python
def multilayer_perceptron(): def multilayer_perceptron():
"""
Define multilayer perceptron classifier:
Multilayer perceptron with two hidden layers (fully connected layers)
The activation function of the first two hidden layers uses ReLU, and the activation function of the output layer uses Softmax.
Return:
predict_image -- result of classification
"""
# input raw image data in size of 28*28*1
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32') img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
# first fully-connected layer, using ReLu as its activation function # the first fully connected layer, whose activation function is ReLU
hidden = fluid.layers.fc(input=img, size=200, act='relu') hidden = fluid.layers.fc(input=img, size=200, act='relu')
# second fully-connected layer, using ReLu as its activation function # the second fully connected layer, whose activation function is ReLU
hidden = fluid.layers.fc(input=hidden, size=200, act='relu') hidden = fluid.layers.fc(input=hidden, size=200, act='relu')
# With softmax as the fully connected output layer of the activation function, the size of the output layer must be 10
prediction = fluid.layers.fc(input=hidden, size=10, act='softmax') prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
return prediction return prediction
``` ```
- Convolution network LeNet-5: the input image is fed through two convolution-pooling layers, a fully-connected layer, and the softmax output layer: - Convolutional neural network LeNet-5: The input two-dimensional image first passes through two convolutional layers, each followed by a pooling layer, then through a fully connected layer, and finally a fully connected layer with softmax as the activation function is used as the output layer.
```python ```python
def convolutional_neural_network(): def convolutional_neural_network():
"""
Define convolutional neural network classifier:
The input 2D image passes through two convolution-pooling layers, using the fully connected layer with softmax as the output layer
Return:
predict -- result of classification
"""
# input raw image data in size of 28*28*1
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32') img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
# first conv pool # the first convolution-pooling layer
# Use 20 5*5 filters, the pooling size is 2, the pooling step is 2, and the activation function is Relu.
conv_pool_1 = fluid.nets.simple_img_conv_pool( conv_pool_1 = fluid.nets.simple_img_conv_pool(
input=img, input=img,
filter_size=5, filter_size=5,
...@@ -215,7 +241,8 @@ def convolutional_neural_network(): ...@@ -215,7 +241,8 @@ def convolutional_neural_network():
pool_stride=2, pool_stride=2,
act="relu") act="relu")
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1) conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
# second conv pool # the second convolution-pooling layer
# Use 20 5*5 filters, the pooling size is 2, the pooling step is 2, and the activation function is Relu.
conv_pool_2 = fluid.nets.simple_img_conv_pool( conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1, input=conv_pool_1,
filter_size=5, filter_size=5,
...@@ -223,121 +250,239 @@ def convolutional_neural_network(): ...@@ -223,121 +250,239 @@ def convolutional_neural_network():
pool_size=2, pool_size=2,
pool_stride=2, pool_stride=2,
act="relu") act="relu")
# output layer with softmax activation function. size = 10 since there are only 10 possible digits. # With softmax as the fully connected output layer of the activation function, the size of the output layer must be 10
prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax') prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax')
return prediction return prediction
``` ```
#### Train Program Configuration #### Train Program Configuration
Then we need to set up the `train_program`. It takes the prediction from the classifier first. Then we need to set up the train program `train_program`. It first obtains the prediction from the classifier.
During the training, it will calculate the `avg_cost` from the prediction. During the training, it will compute `avg_cost`.
**NOTE:** A train program should return an array and the first return argument has to be `avg_cost`. **Note:** the train program should return an array. The first item returned must be `avg_cost`. The trainer uses it to compute the gradient.
The trainer always implicitly uses it to calculate the gradient.
Please feel free to modify the code to test different results between `softmax regression`, `mlp`, and `convolutional neural network` classifier. Feel free to modify the code and compare the results of the `softmax_regression`, `MLP` and `convolutional neural network` classifiers.
```python ```python
def train_program(): def train_program():
"""
Configure train_program
Return:
predict -- result of classification
avg_cost -- mean loss
acc -- accuracy of classification
"""
# label layer, called label, correspondent with label category of input picture
label = fluid.layers.data(name='label', shape=[1], dtype='int64') label = fluid.layers.data(name='label', shape=[1], dtype='int64')
# predict = softmax_regression() # uncomment for Softmax # predict = softmax_regression() # cancel note and run Softmax regression
# predict = multilayer_perceptron() # uncomment for MLP # predict = multilayer_perceptron() # cancel note and run multiple perceptron
predict = convolutional_neural_network() # uncomment for LeNet5 predict = convolutional_neural_network() # cancel note and run LeNet5 convolutional neural network
# Calculate the cost from the prediction and label. # use class cross-entropy function to compute loss function between predict and label
cost = fluid.layers.cross_entropy(input=predict, label=label) cost = fluid.layers.cross_entropy(input=predict, label=label)
# compute mean loss
avg_cost = fluid.layers.mean(cost) avg_cost = fluid.layers.mean(cost)
# compute accuracy of classification
acc = fluid.layers.accuracy(input=predict, label=label) acc = fluid.layers.accuracy(input=predict, label=label)
return predict, [avg_cost, acc]
# The first item needs to be avg_cost.
return [avg_cost, acc]
``` ```
#### Optimizer Function Configuration #### Optimizer Function Configuration
In the following `Adam` optimizer, `learning_rate` specifies the learning rate in the optimization procedure. In the `Adam` optimizer below, `learning_rate` is the learning rate; its value affects how fast network training converges.
```python ```python
def optimizer_program(): def optimizer_program():
return fluid.optimizer.Adam(learning_rate=0.001) return fluid.optimizer.Adam(learning_rate=0.001)
``` ```
### Data Feeders Configuration ### Data Feeders for dataset Configuration
Then we specify the training data `paddle.dataset.mnist.train()` and testing data `paddle.dataset.mnist.test()`. These two methods are *reader creators*. Once called, a reader creator returns a *reader*. A reader is a Python method, which, once called, returns a Python generator, which yields instances of data. Next we start the training process. `paddle.dataset.mnist.train()` and `paddle.dataset.mnist.test()` provide the train dataset and test dataset respectively. Each of these two functions returns a reader. A reader in PaddlePaddle is a Python function that returns a Python generator when called.
`shuffle` is a reader decorator. It takes a reader A as input and returns a new reader B. Under the hood, B calls A to read data in the following fashion: it copies in `buf_size` instances at a time into a buffer, shuffles the data, and yields the shuffled instances one at a time. A large buffer size would yield very shuffled data. `shuffle` below is a reader decorator, which receives a reader A and returns another reader B. Reader B reads `buf_size` training samples into a buffer, shuffles them randomly, and then outputs them one by one.
`batch` is a special decorator, which takes a reader and outputs a *batch reader*, which doesn't yield an instance, but a minibatch at a time. `batch` is a special decorator. Its input is a reader and its output is a batched reader. In PaddlePaddle, a reader yields one sample at a time, while a batched reader yields a minibatch at a time.
```python ```python
# there are 64 data in a minibatch
BATCH_SIZE = 64
# read 500 data in train dataset, randomly disorder them and then transfer it into batched reader which yield 64 data each time.
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=500), paddle.dataset.mnist.train(), buf_size=500),
batch_size=64) batch_size=BATCH_SIZE)
# read data in test dataset and yield 64 data every time
test_reader = paddle.batch( test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=64) paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
``` ```
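To make the reader, `shuffle` and `batch` concepts concrete, here is a plain-Python sketch of how such decorators could work. It is not the PaddlePaddle source code: the function bodies, the `seed` argument, the choice to keep the final partial batch, and `toy_reader` are all assumptions made for illustration.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Return a reader that shuffles samples within a buffer of buf_size."""
    rng = random.Random(seed)

    def shuffled_reader():
        buffer = []
        for sample in reader():
            buffer.append(sample)
            if len(buffer) >= buf_size:
                rng.shuffle(buffer)
                for item in buffer:
                    yield item
                buffer = []
        # Flush whatever is left when the underlying reader is exhausted.
        if buffer:
            rng.shuffle(buffer)
            for item in buffer:
                yield item

    return shuffled_reader


def batch(reader, batch_size):
    """Return a reader that yields lists of up to batch_size samples."""

    def batch_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        # Assumption: the last, possibly smaller, batch is kept.
        if current:
            yield current

    return batch_reader


def toy_reader():
    """A hypothetical reader that yields the integers 0..9 as samples."""
    for i in range(10):
        yield i


train_reader = batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)
batches = list(train_reader())
print(batches)
```

Because shuffling happens only inside each buffer, the first four samples yielded are always a permutation of 0 to 3 in this sketch; a larger buffer mixes the data more thoroughly.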
### Trainer Configuration ### create training process
Now we need to create a training process. We will use the `train_program`, `place` and `optimizer` defined earlier, report the test loss during training iterations and validation, and save the model parameters for prediction.
Now, we need to set up the trainer. The trainer needs to take in `train_program`, `place`, and `optimizer`.
#### Event Handler Configuration
We can call a handler function to monitor the training process during training.
We show two `event_handler` functions here. Feel free to modify them in the Jupyter Notebook and observe the changes.
`event_handler` is used to print the training results during training:
```python
def event_handler(pass_id, batch_id, cost):
    # print intermediate training results:
    # the pass (epoch) number, the batch number, and the loss value
    print("Pass %d, Batch %d, Cost %f" % (pass_id, batch_id, cost))
```
```python ```python
use_cuda = False # set to True if training with GPU from paddle.utils.plot import Ploter
train_prompt = "Train cost"
test_prompt = "Test cost"
cost_ploter = Ploter(train_prompt, test_prompt)
# visualize the training process
def event_handler_plot(ploter_title, step, cost):
    cost_ploter.append(ploter_title, step, cost)
    cost_ploter.plot()
```
The plot produced by `event_handler_plot` looks like this:
![png](./image/train_and_test.png)
### Start training
After adding the `event_handler` and the data reader configured above, we can start training the model.
Set the run parameters and configure the data description.
`feed_order` maps the data dictionary to `train_program`.
Create a `train_test` function that reports the loss on the test set during training.
Define network structure:
```python
# the model is run on single CPU
use_cuda = False # If you want to use GPU, please set it True
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
trainer = Trainer( # call train_program to get prediction value and loss value,
train_func=train_program, place=place, optimizer_func=optimizer_program) prediction, [avg_loss, acc] = train_program()
```
# the original input image data, of size 28*28*1
# label layer named label, holding the class label of the input image
#### Event Handler # It is informed that data in network consists of two parts. One is img value, the other is label value.
feeder = fluid.DataFeeder(feed_list=['img', 'label'], place=place)
Fluid API provides a hook to the callback function during training. Users are able to monitor training progress through mechanism. # choose Adam optimizer
We will demonstrate two event handlers here. Please feel free to modify on the Jupyter notebook to see the differences. optimizer = optimizer_program()
optimizer.minimize(avg_loss)
```
`event_handler` is used to print some text data during training. Configure the hyperparameters for training:
```python ```python
# Save the parameter into a directory. The Inferencer can load the parameters from it to do infer
params_dirname = "recognize_digits_network.inference.model" PASS_NUM = 5 #train 5 iterations
lists = [] epochs = [epoch_id for epoch_id in range(PASS_NUM)]
def event_handler(event):
if isinstance(event, EndStepEvent): # save parameters of model into save_dirname file
if event.step % 100 == 0: save_dirname = "recognize_digits.inference.model"
# event.metrics maps with train program return arguments. ```
# event.metrics[0] will yield avg_cost and event.metrics[1] will yield acc in this example.
print("Pass %d, Batch %d, Cost %f" % (
event.step, event.epoch, event.metrics[0])) ```python
def train_test(train_test_program,
if isinstance(event, EndEpochEvent): train_test_feed, train_test_reader):
avg_cost, acc = trainer.test(
reader=test_reader, feed_order=['img', 'label']) # save classification accuracy into acc_set
acc_set = []
print("Test with Epoch %d, avg_cost: %s, acc: %s" % (event.epoch, avg_cost, acc)) # save mean loss in avg_loss_set
avg_loss_set = []
# save parameters # feed each batch yielded by the test reader into the network for evaluation
trainer.save_params(params_dirname) for test_data in train_test_reader():
lists.append((event.epoch, avg_cost, acc)) acc_np, avg_loss_np = exe.run(
program=train_test_program,
feed=train_test_feed.feed(test_data),
fetch_list=[acc, avg_loss])
acc_set.append(float(acc_np))
avg_loss_set.append(float(avg_loss_np))
# get accuracy and loss value on the test data
acc_val_mean = numpy.array(acc_set).mean()
avg_loss_val_mean = numpy.array(avg_loss_set).mean()
# return mean loss value and mean accuracy
return avg_loss_val_mean, acc_val_mean
```
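`train_test` averages the per-batch accuracy and loss values. Assuming the 10,000-image MNIST test set and `batch_size=64`, the last batch holds only 16 images, so the mean of batch means can differ slightly from the per-sample mean. The sketch below illustrates the effect; the accuracy values are hypothetical and chosen only for illustration.

```python
import numpy

# 10000 test images with batch_size=64: 156 full batches plus one batch of 16
batch_sizes = [64] * 156 + [16]
# hypothetical per-batch accuracies (not measured results)
batch_accs = [0.98] * 156 + [1.0]

# what train_test computes: every batch counts equally
mean_of_batch_means = numpy.array(batch_accs).mean()
# per-sample mean: every image counts equally
sample_weighted_mean = numpy.average(batch_accs, weights=batch_sizes)

print(mean_of_batch_means, sample_weighted_mean)
```

The gap is small for MNIST, but it grows when the last batch is much smaller than the others or behaves very differently from them.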
Create executor
```python
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
```
Set up main_program and test_program:
```python
main_program = fluid.default_main_program()
test_program = fluid.default_main_program().clone(for_test=True)
``` ```
Start training:
#### Start training
Now that we setup the event_handler and the reader, we can start training the model. `feed_order` is used to map the data dict to the train_program
```python ```python
# Train the model now lists = []
trainer.train( step = 0
num_epochs=5, for epoch_id in epochs:
event_handler=event_handler, for step_id, data in enumerate(train_reader()):
reader=train_reader, metrics = exe.run(main_program,
feed_order=['img', 'label']) feed=feeder.feed(data),
fetch_list=[avg_loss, acc])
if step % 100 == 0: # print a log every 100 training steps
print("Pass %d, Batch %d, Cost %f" % (step, epoch_id, metrics[0]))
event_handler_plot(train_prompt, step, metrics[0])
step += 1
# test classification result of each epoch
avg_loss_val, acc_val = train_test(train_test_program=test_program,
train_test_reader=test_reader,
train_test_feed=feeder)
print("Test with Epoch %d, avg_cost: %s, acc: %s" %(epoch_id, avg_loss_val, acc_val))
event_handler_plot(test_prompt, step, avg_loss_val)
lists.append((epoch_id, avg_loss_val, acc_val))
# save parameters of trained model for prediction
if save_dirname is not None:
fluid.io.save_inference_model(save_dirname,
["img"], [prediction], exe,
model_filename=None,
params_filename=None)
# Choose the best pass
best = sorted(lists, key=lambda item: float(item[1]))[0]
print('Best pass is %s, testing Avgcost is %s' % (best[0], best[1]))
print('The classification accuracy is %.2f%%' % (float(best[2]) * 100))
``` ```
During training, `trainer.train` invokes `event_handler` for certain events. This gives us a chance to print the training progress.
The training process is completely automatic. The log printed by `event_handler` looks like the following.
In the log, `Pass` and `Batch` label the first two values passed to the print call, which in the training loop above are the global step counter `step` and the epoch index `epoch_id`; `Cost` is the loss of the current minibatch.
After each epoch, the mean loss and classification accuracy on the test set are computed.
``` ```
Pass 0, Batch 0, Cost 0.125650 Pass 0, Batch 0, Cost 0.125650
...@@ -353,79 +498,80 @@ Pass 900, Batch 0, Cost 0.239809 ...@@ -353,79 +498,80 @@ Pass 900, Batch 0, Cost 0.239809
Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338 Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338
``` ```
After the training, we can check the model's prediction accuracy. Check the prediction accuracy of the model after training. When trained on MNIST, the softmax regression model typically reaches a classification accuracy of about 92.34%, the multilayer perceptron about 97.66%, and the convolutional neural network about 99.20%.
```python
# find the best pass
best = sorted(lists, key=lambda list: float(list[1]))[0]
print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
print 'The classification accuracy is %.2f%%' % (float(best[2]) * 100)
```
Usually, with MNIST data, the softmax regression model achieves an accuracy around 92.34%, the MLP 97.66%, and the convolution network around 99.20%. Convolution layers have been widely considered a great invention for image processing.
## Application
After training, users can use the trained model to classify images. The following code shows how to run inference on MNIST images through `fluid.contrib.inferencer.Inferencer`. ## Deploy the Model
### Create Inferencer You can use the trained model to classify images of handwritten digits. The program below shows how to use the trained model for prediction.
The `Inferencer` takes an `infer_func` and a `param_path` to set up the network and the trained parameters. ### Generate input data to be inferred
We can simply plug in the classifier defined earlier here.
```python
inferencer = Inferencer(
# infer_func=softmax_regression, # uncomment for softmax regression
# infer_func=multilayer_perceptron, # uncomment for MLP
infer_func=convolutional_neural_network, # uncomment for LeNet5
param_path=params_dirname,
place=place)
```
#### Generate input data for inferring `infer_3.png` is an example image of the digit 3. Convert it into a numpy array to match the feed data format.
`infer_3.png` is an example image of the digit `3`. Turn it into a numpy array to match the data feeder format.
```python ```python
# Prepare the test image
import os
import numpy as np
from PIL import Image
def load_image(file): def load_image(file):
# open the image file and convert it to grayscale
im = Image.open(file).convert('L') im = Image.open(file).convert('L')
# resize the input image to 28*28 using high-quality (antialiased) resampling
im = im.resize((28, 28), Image.ANTIALIAS) im = im.resize((28, 28), Image.ANTIALIAS)
im = np.array(im).reshape(1, 1, 28, 28).astype(np.float32) # convert the image to a numpy array
im = numpy.array(im).reshape(1, 1, 28, 28).astype(numpy.float32)
# normalize pixel values from [0, 255] to [-1, 1]
im = im / 255.0 * 2.0 - 1.0 im = im / 255.0 * 2.0 - 1.0
return im return im
cur_dir = os.getcwd() cur_dir = os.getcwd()
img = load_image(cur_dir + '/image/infer_3.png') tensor_img = load_image(cur_dir + '/image/infer_3.png')
``` ```
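The normalization step in `load_image` maps 8-bit pixel values from [0, 255] to [-1, 1], and the reshape produces a batch of one single-channel 28x28 image. A minimal sketch of both steps, using made-up pixel values rather than a real image file:

```python
import numpy

# three sample pixel intensities: black, mid-gray, white
pixels = numpy.array([0, 127.5, 255], dtype=numpy.float32)
normalized = pixels / 255.0 * 2.0 - 1.0
print(normalized)

# a blank 28x28 grayscale "image" reshaped to (batch, channel, height, width)
blank = numpy.zeros((28, 28), dtype=numpy.uint8)
tensor = blank.reshape(1, 1, 28, 28).astype(numpy.float32)
tensor = tensor / 255.0 * 2.0 - 1.0
print(tensor.shape)
```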
### Inference ### Inference
Now we are ready to do inference. By loading the network configuration and the trained parameters via `load_inference_model`, we can reuse the classifier defined earlier.
```python ```python
results = inferencer.infer({'img': img}) inference_scope = fluid.core.Scope()
lab = np.argsort(results) # probs and lab are the results of one batch data with fluid.scope_guard(inference_scope):
print("Inference result of image/infer_3.png is: %d" % lab[0][0][-1]) # use fluid.io.load_inference_model to get inference program desc,
# feed_target_names holds the names of the variables that must be fed into the network
# fetch_targets holds the variables to fetch from the network
[inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(
save_dirname, exe, None, None)
# Make feed a dictionary {feed_target_name: feed_target_data}
# The result will contain a data list corresponding to fetch_targets
results = exe.run(inference_program,
feed={feed_target_names[0]: tensor_img},
fetch_list=fetch_targets)
lab = numpy.argsort(results)
# Print prediction result of infer_3.png
img=Image.open('image/infer_3.png')
plt.imshow(img)
print("Inference result of image/infer_3.png is: %d" % lab[0][0][-1])
``` ```
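The expression `lab[0][0][-1]` works because `numpy.argsort` orders class indices by ascending probability, so the last index is the most likely digit. The sketch below uses a hypothetical `results` value shaped like a one-element fetch list holding a `(1, 10)` probability array; the numbers are invented for illustration.

```python
import numpy

# hypothetical network output: one array of shape (1, 10), one probability per digit
results = [numpy.array(
    [[0.01, 0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]],
    dtype=numpy.float32)]

# argsort sorts along the last axis; the result has shape (1, 1, 10)
lab = numpy.argsort(results)
predicted_digit = lab[0][0][-1]
print("Predicted digit: %d" % predicted_digit)
```

If only the top class is needed, `numpy.argmax(results[0][0])` gives the same index without sorting.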
## Conclusion
This tutorial describes a few common deep learning models using **Softmax regression**, **Multilayer Perceptron Network**, and **Convolutional Neural Network**. Understanding these models is crucial for future learning; the subsequent tutorials derive more sophisticated networks by building on top of them.
When our model evolves from a simple softmax regression to a slightly complex Convolutional Neural Network, the recognition accuracy on the MNIST dataset achieves a large improvement. This is due to the Convolutional layers' local connections and parameter sharing. While learning new models in the future, we encourage the readers to understand the key ideas that lead a new model to improve the results of an old one. ### Result
If successful, the inference result is printed as follows:
`Inference result of image/infer_3.png is: 3`, which indicates that our network successfully recognizes the image!
Moreover, this tutorial introduces the basic flow of PaddlePaddle model design, which starts with a *data provider*, a model layer construction, and finally training and prediction. Motivated readers can leverage the flow used in this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice. ## Summary
Softmax regression, the multilayer perceptron, and the convolutional neural network are the most basic deep learning models. More complex neural networks are derived from them, so understanding these models helps with later learning. We also saw that moving from simple softmax regression to the slightly more complex convolutional neural network greatly improves recognition accuracy on the MNIST dataset, thanks to the local connectivity and weight sharing of convolutional layers. When you study new models later, we hope you will develop a deep understanding of the key improvements each new model makes over earlier ones. In addition, this tutorial covered the basic steps of building a PaddlePaddle model, from writing the data provider and building the network to training and prediction. Once familiar with this workflow, you can use your own data, define your own network, and complete your own training and prediction tasks.
<a name="References"></a>
## References ## References
1. LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. ["Gradient-based learning applied to document recognition."](http://ieeexplore.ieee.org/abstract/document/726791/) Proceedings of the IEEE 86, no. 11 (1998): 2278-2324. 1. LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. ["Gradient-based learning applied to document recognition."](http://ieeexplore.ieee.org/abstract/document/726791/) Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
2. Wejéus, Samuel. ["A Neural Network Approach to Arbitrary SymbolRecognition on Modern Smartphones."](http://www.diva-portal.org/smash/record.jsf?pid=diva2:753279&dswid=-434) (2014). 2. Wejéus, Samuel. ["A Neural Network Approach to Arbitrary SymbolRecognition on Modern Smartphones."](http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A753279&dswid=-434) (2014).
3. Decoste, Dennis, and Bernhard Schölkopf. ["Training invariant support vector machines."](http://link.springer.com/article/10.1023/A:1012454411458) Machine learning 46, no. 1-3 (2002): 161-190. 3. Decoste, Dennis, and Bernhard Schölkopf. ["Training invariant support vector machines."](http://link.springer.com/article/10.1023/A:1012454411458) Machine learning 46, no. 1-3 (2002): 161-190.
4. Simard, Patrice Y., David Steinkraus, and John C. Platt. ["Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.8494&rep=rep1&type=pdf) In ICDAR, vol. 3, pp. 958-962. 2003. 4. Simard, Patrice Y., David Steinkraus, and John C. Platt. ["Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.8494&rep=rep1&type=pdf) In ICDAR, vol. 3, pp. 958-962. 2003.
5. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. ["Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure."](http://www.jmlr.org/proceedings/papers/v2/salakhutdinov07a/salakhutdinov07a.pdf) In AISTATS, vol. 11. 2007. 5. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. ["Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure."](http://www.jmlr.org/proceedings/papers/v2/salakhutdinov07a/salakhutdinov07a.pdf) In AISTATS, vol. 11. 2007.
...@@ -436,4 +582,4 @@ Moreover, this tutorial introduces the basic flow of PaddlePaddle model design, ...@@ -436,4 +582,4 @@ Moreover, this tutorial introduces the basic flow of PaddlePaddle model design,
10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58. 10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
### This file is only used for continuous evaluation test!
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi
from kpi import AccKpi
train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True, desc='train cost')
test_cost_kpi = CostKpi('test_cost', 0.02, 0, actived=True, desc='test cost')
test_acc_kpi = AccKpi('test_acc', 0.02, 0, actived=True, desc='test acc')
tracking_kpis = [train_cost_kpi, test_cost_kpi, test_acc_kpi]
def parse_log(log):
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        print(fs)
        if len(fs) == 3 and fs[0] == 'kpis':
            kpi_name = fs[1]
            kpi_value = float(fs[2])
            yield kpi_name, kpi_value
def log_to_ce(log):
    kpi_tracker = {}
    for kpi in tracking_kpis:
        kpi_tracker[kpi.name] = kpi
    for (kpi_name, kpi_value) in parse_log(log):
        print(kpi_name, kpi_value)
        kpi_tracker[kpi_name].add_record(kpi_value)
        kpi_tracker[kpi_name].persist()
if __name__ == '__main__':
    log = sys.stdin.read()
    log_to_ce(log)
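`parse_log` only picks up lines with exactly three tab-separated fields whose first field is `kpis`; everything else in the training output is ignored. The following standalone sketch mirrors that filter without the `kpi` module. The function name `parse_kpi_lines` and the sample log text are hypothetical.

```python
def parse_kpi_lines(log):
    """Yield (name, value) pairs from lines shaped like 'kpis<TAB>name<TAB>value'."""
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        if len(fs) == 3 and fs[0] == 'kpis':
            yield fs[1], float(fs[2])


# hypothetical training output: one ordinary log line and two KPI lines
sample_log = (
    "Pass 0, Batch 0, Cost 0.125650\n"
    "kpis\ttrain_cost\t0.125650\n"
    "kpis\ttest_acc\t0.98\n"
)
parsed = list(parse_kpi_lines(sample_log))
print(parsed)
```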
...@@ -44,6 +44,14 @@ ...@@ -44,6 +44,14 @@
The source code for this tutorial is in [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). For first-time use, please refer to the [Book documentation instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
### Notes: ###
1. Hardware requirements:
This tutorial can run on both CPU and GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, please note that the GPU environment of the default image provided here is CUDA 8/cuDNN 5. For GPUs that require CUDA 9, such as the NVIDIA Tesla V100, using this image may fail.
3. Consistency between the code in the document and in the script:
Please note: to make this tutorial easier to read and use, we split and adjusted the code of train.py and placed it in this document. The code in this document produces the same results as train.py, which you can verify by running [train.py](https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/train.py) directly.
## Background
When we learn to program, the first program we write usually prints "Hello World". The introductory tutorial for machine learning (or deep learning) is usually handwriting recognition on the [MNIST](http://yann.lecun.com/exdb/mnist/) database, because handwriting recognition is a typical image classification problem that is relatively simple, and the MNIST dataset is complete. As a simple computer vision dataset, MNIST contains a series of handwritten digit images, as shown in Figure 1, together with their labels. Each image is a 28x28 pixel matrix, and the label corresponds to one of the 10 digits from 0 to 9. Every image has been size-normalized and centered.
...@@ -71,7 +79,7 @@ MNIST吸引了大量的科学家基于此数据集训练模型,1998年,LeCun ...@@ -71,7 +79,7 @@ MNIST吸引了大量的科学家基于此数据集训练模型,1998年,LeCun
- $Y$ is the output: the classifier outputs 10 classes of digits (0-9), i.e. $Y=\left ( y_0, y_1, \dots, y_9 \right )$, where each dimension $y_i$ is the probability that the image is classified as digit $i$.
- $Label$ is the true label of the image: $Label=\left ( l_0, l_1, \dots, l_9 \right )$ is also 10-dimensional, but only one dimension is 1 and the rest are 0. For example, if the digit in an image is 2, its label is $(0,0,1,0, \dot, 0)$ - $Label$ is the true label of the image: $Label=\left ( l_0, l_1, \dots, l_9 \right )$ is also 10-dimensional, but only one dimension is 1 and the rest are 0. For example, if the digit in an image is 2, its label is $(0,0,1,0, \dots, 0)$
### Softmax Regression
...@@ -79,14 +87,19 @@ MNIST吸引了大量的科学家基于此数据集训练模型,1998年,LeCun ...@@ -79,14 +87,19 @@ MNIST吸引了大量的科学家基于此数据集训练模型,1998年,LeCun
The data $X$ from the input layer is passed to the output layer. Before the activation, it is multiplied by the corresponding weights $W$ and the bias $b$ is added, as follows:
$$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/01.gif?raw=true"><br/>
</p>
where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ where
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/02.gif?raw=true"><br/>
</p>
Figure 2 shows the network of softmax regression, in which weights are drawn as blue lines and biases as red lines; +1 indicates that the coefficient of the bias parameter is 1.
<p align="center"> <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/softmax_regression.png?raw=true" width=400><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/softmax_regression.png?raw=true" width=200><br/>
Figure 2. Network structure of softmax regression<br/>
</p> </p>
...@@ -94,7 +107,9 @@ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$ ...@@ -94,7 +107,9 @@ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
In classification problems, we generally use the cross-entropy loss, defined as follows:
$$ L_{cross-entropy}(label, y) = -\sum_i label_ilog(y_i) $$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/03.gif?raw=true"><br/>
</p>
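The softmax and cross-entropy formulas above can be checked numerically. The following is a minimal NumPy sketch, not PaddlePaddle code. The function names and input values are hypothetical, and subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import numpy


def softmax(z):
    # shift by the maximum for numerical stability; the ratios are unchanged
    e = numpy.exp(z - numpy.max(z))
    return e / e.sum()


def cross_entropy(label, y):
    # label is a one-hot vector, y is a probability vector
    return -numpy.sum(label * numpy.log(y))


z = numpy.array([1.0, 2.0, 3.0])      # hypothetical pre-activation values
y = softmax(z)                        # probabilities that sum to 1
label = numpy.array([0.0, 0.0, 1.0])  # the true class is index 2
loss = cross_entropy(label, y)
print(y, loss)
```

With a one-hot label, the sum collapses to a single term, so the loss equals the negative log of the probability assigned to the true class.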
...@@ -140,7 +155,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -140,7 +155,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
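- Local connectivity: each neuron is connected only to a region of the input neurons, called the receptive field. In image convolution, neurons are locally connected in the spatial dimensions (the plane of H and W in the figure above) but fully connected along the depth. In a two-dimensional image, nearby pixels are also strongly correlated. This local connectivity ensures that the learned filters respond most strongly to local input features. The idea is also inspired by the biological visual system, in which neurons in the visual cortex receive information locally.
- Weight sharing: the filter used to compute the neurons in the same depth slice is shared. For example, in Figure 4 the filter for every neuron of $o[:,:,0]$ is the same, namely $W_0$, which greatly reduces the number of parameters. Sharing weights makes sense to some extent; for example, low-level edge features of an image are independent of their position in the image. In some scenarios it is not appropriate: when the input image is a face, the eyes and hair are in different positions, and we want to learn different features at different positions (see the [Stanford open course]( http://cs231n.github.io/convolutional-networks/)). Note that weights are shared only among neurons in the same depth slice. A convolutional layer usually uses multiple groups of kernels to extract different features, which correspond to different depth slices, and neurons in different depth slices do not share weights. In addition, the bias is shared by all neurons in the same depth slice. - Weight sharing: the filter used to compute the neurons in the same depth slice is shared. For example, in Figure 5 the filter for every neuron of $o[:,:,0]$ is the same, namely $W_0$, which greatly reduces the number of parameters. Sharing weights makes sense to some extent; for example, low-level edge features of an image are independent of their position in the image. In some scenarios it is not appropriate: when the input image is a face, the eyes and hair are in different positions, and we want to learn different features at different positions (see the [Stanford open course]( http://cs231n.github.io/convolutional-networks/)). Note that weights are shared only among neurons in the same depth slice. A convolutional layer usually uses multiple groups of kernels to extract different features, which correspond to different depth slices, and neurons in different depth slices do not share weights. In addition, the bias is shared by all neurons in the same depth slice.
From the convolution computation and its properties described above, we can see that convolution is a linear operation and is shift-invariant, meaning that the same operation is performed at every position of the image. Local connectivity and weight sharing in the convolutional layer greatly reduce the number of parameters to learn, which also helps in training larger convolutional neural networks.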
...@@ -159,9 +174,17 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层 ...@@ -159,9 +174,17 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
<a name="常见激活函数介绍"></a> <a name="常见激活函数介绍"></a>
### Common Activation Functions
- sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $ - sigmoid activation function:
- tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/04.gif?raw=true"><br/>
</p>
- tanh activation function:
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/05.gif?raw=true"><br/>
</p>
In fact, the tanh function is just a rescaled sigmoid function: the sigmoid value is scaled by 2 and then shifted down by 1 unit, i.e. tanh(x) = 2sigmoid(2x) - 1.
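The identity tanh(x) = 2sigmoid(2x) - 1 is easy to confirm numerically. A short NumPy sketch, where the sample points are an arbitrary choice:

```python
import numpy


def sigmoid(x):
    return 1.0 / (1.0 + numpy.exp(-x))


x = numpy.linspace(-5.0, 5.0, 11)   # arbitrary sample points
lhs = numpy.tanh(x)
rhs = 2.0 * sigmoid(2.0 * x) - 1.0
max_abs_diff = numpy.max(numpy.abs(lhs - rhs))
print(max_abs_diff)
```

The two sides agree up to floating-point rounding error.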
...@@ -202,13 +225,13 @@ PaddlePaddle在API中提供了自动加载[MNIST](http://yann.lecun.com/exdb/mni ...@@ -202,13 +225,13 @@ PaddlePaddle在API中提供了自动加载[MNIST](http://yann.lecun.com/exdb/mni
Load the Fluid API package of PaddlePaddle.
```python ```python
from __future__ import print_function # import the Python 3 print function into the current version
import os import os
from PIL import Image # import the image processing module from PIL import Image # import the image processing module
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
import numpy import numpy
import paddle # import the paddle module import paddle # import the paddle module
import paddle.fluid as fluid import paddle.fluid as fluid
from __future__ import print_function # import the Python 3 print function into the current version
``` ```
### Program Functions Configuration
...@@ -282,7 +305,7 @@ def convolutional_neural_network(): ...@@ -282,7 +305,7 @@ def convolutional_neural_network():
act="relu") act="relu")
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1) conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
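# second convolution-pooling layer # second convolution-pooling layer
# use 20 5*5 filters, pooling size 2, pooling stride 2, activation function ReLU # use 50 5*5 filters, pooling size 2, pooling stride 2, activation function ReLU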
conv_pool_2 = fluid.nets.simple_img_conv_pool( conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1, input=conv_pool_1,
filter_size=5, filter_size=5,
...@@ -415,15 +438,13 @@ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() ...@@ -415,15 +438,13 @@ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
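# call train_program to get the prediction and the loss, # call train_program to get the prediction and the loss,
prediction, [avg_loss, acc] = train_program() prediction, [avg_loss, acc] = train_program()
# the original input image data, of size 28*28*1 # the original input image data, named img, of size 28*28*1
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
# label layer named label, holding the class label of the input image # label layer named label, holding the class label of the input image
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
# tell the network that the input has two parts: the img values first, then the label values # tell the network that the input has two parts: the img values first, then the label values
feeder = fluid.DataFeeder(feed_list=[img, label], place=place) feeder = fluid.DataFeeder(feed_list=['img', 'label'], place=place)
# choose the Adam optimizer # choose the Adam optimizer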
optimizer = fluid.optimizer.Adam(learning_rate=0.001) optimizer = optimizer_program()
optimizer.minimize(avg_loss) optimizer.minimize(avg_loss)
``` ```
...@@ -534,7 +555,7 @@ Pass 900, Batch 0, Cost 0.239809 ...@@ -534,7 +555,7 @@ Pass 900, Batch 0, Cost 0.239809
Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338 Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338
``` ```
After training, check the model's prediction accuracy. When trained on MNIST, the softmax regression model generally reaches a classification accuracy of about 92.34%, the multilayer perceptron 97.66%, and the convolutional neural network can reach 99.20%.
## Deploy the Model
...@@ -547,9 +568,13 @@ Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338 ...@@ -547,9 +568,13 @@ Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338
```python ```python
def load_image(file): def load_image(file):
# read the image file and convert it to grayscale
im = Image.open(file).convert('L') im = Image.open(file).convert('L')
# resize the input image to 28*28 with high-quality resampling
im = im.resize((28, 28), Image.ANTIALIAS) im = im.resize((28, 28), Image.ANTIALIAS)
# convert the image to a numpy array
im = numpy.array(im).reshape(1, 1, 28, 28).astype(numpy.float32) im = numpy.array(im).reshape(1, 1, 28, 28).astype(numpy.float32)
# normalize the data
im = im / 255.0 * 2.0 - 1.0 im = im / 255.0 * 2.0 - 1.0
return im return im
...@@ -607,8 +632,7 @@ with fluid.scope_guard(inference_scope): ...@@ -607,8 +632,7 @@ with fluid.scope_guard(inference_scope):
10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58. 10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -42,213 +42,239 @@ ...@@ -42,213 +42,239 @@
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Recognize Digits # Recognize Digits
The source code for this tutorial is here: [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). For instructions on getting started with this book,see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code of this tutorial is in [book/recognize_digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits). For new users, please refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book) .
## Introduction ## Background
When one learns to program, the first task is usually to write a program that prints "Hello World!". When we learn programming, the first program we write is usually one that prints "Hello World". The beginner's tutorial for machine learning or deep learning, however, is usually handwriting recognition on the [MNIST](http://yann.lecun.com/exdb/mnist/) database, because handwriting recognition is a typical classification problem that is relatively simple, and the MNIST dataset is complete. As a simple computer vision dataset, MNIST contains a series of images of handwritten digits and their corresponding labels. Each image is a 28x28 pixel matrix, and the label corresponds to one of the 10 digits from 0 to 9. Each image has been size-normalized and centered.
In Machine Learning or Deep Learning, an equivalent task is to train a model to recognize hand-written digits using the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset.
Handwriting recognition is a classic image classification problem. The problem is relatively easy and MNIST is a complete dataset.
As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1).
The input image is a $28\times28$ matrix, and the label is one of the digits from $0$ to $9$. All images are normalized, meaning that they are both rescaled and centered.
<p align="center"> <p align="center">
<img src="image/mnist_example_image.png" width="400"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/mnist_example_image.png?raw=true" width="400"><br/>
Fig. 1. Examples of MNIST images Figure 1. Example of an MNIST image
</p> </p>
The MNIST dataset is from the [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and the Special Database 1 (SD-1). The MNIST dataset is created from [NIST](https://www.nist.gov/srd/nist-special-database-19) Special Database 3 (SD-3) and Special Database 1 (SD-1). Because SD-3 was labeled by staff of the US Census Bureau and SD-1 by US high school students, SD-3 is cleaner and easier to recognize than SD-1. Yann LeCun et al. took half of SD-1 and half of SD-3 for the training set (60,000 samples) and for the test set (10,000 samples). 250 annotators labeled the training set, which guaranteed that the annotators of the training set and the test set did not completely overlap.
The SD-3 is labeled by the staff of the U.S. Census Bureau, while SD-1 is labeled by high school students. Therefore the SD-3 is cleaner and easier to recognize than the SD-1 dataset.
Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set of 60,000 samples and test set of 10,000 samples.
250 annotators labeled the training set, and the annotators of the training set and the test set were guaranteed not to overlap completely.
The MNIST dataset has been used for evaluating many image recognition algorithms such as a single layer linear classifier, MNIST has attracted many researchers to train models on it. In 1998, LeCun ran experiments with a single-layer linear classifier, a Multilayer Perceptron, and the multilayer convolutional neural network LeNet, steadily reducing the test error from 12% to 0.7%\[[1](#References)\]. In the course of this research, LeCun, a pioneer of deep learning, proposed the Convolutional Neural Network, which greatly improved handwriting recognition performance. Since then, researchers have run many experiments based on the K-Nearest Neighbors algorithm\[[2](#References)\], SVM\[[3](#References)\], Neural Networks\[[4-7](#References)\], Boosting\[[8](#References)\], and other methods, combined with various preprocessing techniques (such as distortion removal, noise removal, and blurring) to improve recognition accuracy.
Multilayer Perceptron (MLP) and Multilayer CNN LeNet\[[1](#references)\], K-Nearest Neighbors (k-NN) \[[2](#references)\], Support Vector Machine (SVM) \[[3](#references)\],
Neural Networks \[[4-7](#references)\], Boosting \[[8](#references)\] and preprocessing methods like distortion removal, noise removal, and blurring.
Among these algorithms, the *Convolutional Neural Network* (CNN) has achieved a series of impressive results in Image Classification tasks, including VGGNet, GoogLeNet,
and ResNet (See [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification) tutorial).
In this tutorial, we start with a simple **softmax** regression model and go on with MLP and CNN. Readers will see how these methods improve the recognition accuracy step-by-step. Convolutional Neural Networks now play an important role in deep learning. From the simple LeNet proposed by Yann LeCun in the early days to VGGNet, GoogLeNet, ResNet and other models from the ImageNet competition (please refer to the [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification) tutorial), convolutional neural networks have achieved a series of great results in image classification.
## Model Overview
Before introducing classification algorithms and training procedure, we define the following symbols: In this tutorial, starting from a simple Softmax regression model, we introduce handwriting recognition and show how to improve the model with MLP and CNN to get better recognition results.
- $X$ is the input: Input is a $28\times 28$ MNIST image. It is flattened to a $784$ dimensional vector. $X=\left (x_0, x_1, \dots, x_{783} \right )$.
- $Y$ is the output: Output of the classifier is 1 of the 10 classes (digits from 0 to 9). $Y=\left (y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents the probability that the input image belongs to class $i$.
- $L$ is the ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$. It is also 10 dimensional, but only one entry is $1$ and all others are $0$s. ## Exploration of Models
To train a classifier on the MNIST dataset, we first give some definitions before introducing the three basic image classification networks used in this tutorial:
- $X$ is the input: an MNIST image is a two-dimensional image of $28\times28$ pixels. For ease of calculation, we flatten it into a $784$-dimensional vector, i.e. $X=\left ( x_0, x_1, \dots, x_{783} \right )$.
- $Y$ is the output: the classifier outputs one value for each of the 10 digits (0-9), i.e. $Y=\left ( y_0, y_1, \dots, y_9 \right )$, where each dimension $y_i$ is the probability that the image shows digit $i$.
- $Label$ is the actual label of the picture: $Label=\left ( l_0, l_1, \dots, l_9 \right ) $ also has 10 dimensions, but only one entry is 1 and the rest are 0. For example, if the digit in an image is 2, its label is $(0,0,1,0, \dots, 0)$.
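The definitions above can be illustrated with a minimal sketch in plain Python. The helper names `flatten` and `one_hot` are hypothetical and are not part of PaddlePaddle; a blank image stands in for a real MNIST sample.

```python
def flatten(image):
    """Flatten a 2-D image (list of rows) into a 1-D list, row by row."""
    return [pixel for row in image for pixel in row]


def one_hot(digit, num_classes=10):
    """Return the label vector with a 1 at index `digit` and 0 elsewhere."""
    label = [0] * num_classes
    label[digit] = 1
    return label


# A blank 28x28 image stands in for a real MNIST sample.
image = [[0.0] * 28 for _ in range(28)]
x = flatten(image)
print(len(x))      # 784
print(one_hot(2))  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```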
### Softmax Regression ### Softmax Regression
In a simple softmax regression model, the input is first fed to fully connected layers. Then, a softmax function is applied to output probabilities of multiple output classes\[[9](#references)\]. The simplest Softmax regression model passes the input through a single fully connected layer to obtain features, then computes and outputs the probabilities of the classes directly via the Softmax function \[[9](#references)\].
The input $X$ is multiplied by weights $W$ and then added to the bias $b$ to generate activations. The data of the input layer $X$ is passed to the output layer. The input $X$ is multiplied by weights $W$ and then added to the bias $b$ to generate activations:
$$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$ $$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $
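As a rough illustration of the softmax definition above, here is a minimal sketch in plain Python. Subtracting the maximum score before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

A larger score always receives a larger probability, and equal scores receive equal probabilities.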
For an $N$-class classification problem with $N$ output nodes, Softmax normalizes the resulting $N$ dimensional vector so that each of its entries falls in the range $[0,1]\subset \mathbb{R}$, representing the probability that the sample belongs to a certain class. Here $y_i$ denotes the predicted probability that an image is of digit $i$. Figure 2 shows a softmax regression network, in which weights are drawn as blue lines, biases as red lines, and +1 indicates that the bias is $1$.
In such a classification problem, we usually use the cross entropy loss function: <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/softmax_regression.png?raw=true" width=400><br/>
Figure 2. Softmax regression network structure <br/>
</p>
$$ L_{cross-entropy}(label, y) = -\sum_i label_i\log(y_i) $$ For an $N$-class classification problem with $N$ output nodes, Softmax normalizes the resulting $N$ dimensional vector so that each of its entries falls in the range $[0,1]\subset \mathbb{R}$, representing the probability that the sample belongs to a certain category. Here $y_i$ denotes the predicted probability that an image is of digit $i$.
In the classification problem, we usually use cross-entropy loss, the formula is as follows:
$$ L_{cross-entropy}(label, y) = -\sum_i label_i\log(y_i) $$
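The cross-entropy formula can be sketched in a few lines of plain Python. The probability vectors below are made-up illustrative values, not model outputs.

```python
import math


def cross_entropy(label, y):
    """Cross-entropy between a one-hot label and predicted probabilities y."""
    # Terms with label_i == 0 contribute nothing, so they are skipped.
    return -sum(l * math.log(p) for l, p in zip(label, y) if l > 0)


label = [0, 0, 1]
confident = [0.05, 0.05, 0.90]
unsure = [0.40, 0.40, 0.20]
print(cross_entropy(label, confident))  # smaller loss
print(cross_entropy(label, unsure))     # larger loss
```

Because the label is one-hot, the loss reduces to the negative log of the probability assigned to the correct class.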
Fig. 2 illustrates a softmax regression network, with the weights in blue, and the bias in red. `+1` indicates that the bias is $1$.
<p align="center">
<img src="image/softmax_regression_en.png" width=400><br/>
Fig. 2. Softmax regression network architecture<br/>
</p>
### Multilayer Perceptron ### Multilayer Perceptron
The softmax regression model described above uses the simplest two-layer neural network. That is, it only contains an input layer and an output layer, with limited regression capability. To achieve better recognition results, consider adding several hidden layers\[[10](#references)\] between the input layer and the output layer. The Softmax regression model uses the simplest two-layer neural network, which contains only the input layer and the output layer, so its performance is limited. In order to achieve better recognition, we consider adding several hidden layers \[[10](#references)\] between the input and output layer.
1. After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ denotes the activation function. Some [common ones](###list-of-common-activation-functions) are sigmoid, tanh and ReLU. 1. In the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ represents the activation function. Common choices are [sigmoid, tanh or ReLU](#common activation functions).
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $. 2. In the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, the output layer outputs $Y=\text{softmax}(W_3H_2 + b_3)$, the vector denoting our classification result. 3. Finally, in the output layer, we get $Y=\text{softmax}(W_3H_2 + b_3)$, which is the final classification result vector.
Fig. 3. shows a Multilayer Perceptron network, with the weights in blue, and the bias in red. +1 indicates that the bias is $1$.
<p align="center"> Figure 3 shows the network structure of a multilayer perceptron, in which weights are drawn as blue lines, biases as red lines, and +1 indicates that the bias is $1$.
<img src="image/mlp_en.png" width=500><br/>
Fig. 3. Multilayer Perceptron network architecture<br/>
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/mlp.png?raw=true" width=500><br/>
Figure 3. Multilayer perceptron network structure <br/>
</p> </p>
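The three steps of the multilayer perceptron above can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: ReLU is chosen for $\phi$, the layer sizes are toy values, and the weights are hand-picked rather than trained. The helper names are hypothetical.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    peak = max(v)
    exps = [math.exp(a - peak) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, x, b):
    """Compute W x + b for a weight matrix W (list of rows), input x and bias b."""
    return [sum(w * a for w, a in zip(row, x)) + bias for row, bias in zip(W, b)]


def mlp_forward(x, params):
    """params is [(W1, b1), (W2, b2), (W3, b3)], matching the three steps above."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(affine(W1, x, b1))        # H_1 = phi(W_1 X + b_1)
    h2 = relu(affine(W2, h1, b2))       # H_2 = phi(W_2 H_1 + b_2)
    return softmax(affine(W3, h2, b3))  # Y = softmax(W_3 H_2 + b_3)


# Toy sizes (2 inputs, 2 hidden units per layer, 3 classes) with hand-picked weights.
params = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
]
y = mlp_forward([2.0, 1.0], params)
print(y)
```

With these weights the pre-softmax scores are $(1, 0, 0)$, so the first class receives the highest probability and the other two are tied.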
### Convolutional Neural Network ### Convolutional Neural Network
#### Convolutional Layer In the multi-layer perceptron model, an image is expanded into a one-dimensional vector before it is fed into the network, which ignores its position and structure information. A convolutional neural network can make better use of the structure information of the image. [LeNet-5](http://yann.lecun.com/exdb/lenet/) is a relatively simple convolutional neural network. Figure 4 shows its structure: the input two-dimensional image first passes through two convolution-pooling layers, then through a fully connected layer, and finally softmax is used as the output layer. Below we mainly introduce the convolutional layer and the pooling layer.
<p align="center"> <p align="center">
<img src="image/conv_layer.png" width='750'><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/cnn.png?raw=true" width="600"><br/>
Fig. 4. Convolutional layer<br/> Figure 4. LeNet-5 convolutional neural network structure<br/>
</p> </p>
The **convolutional layer** is the core of a Convolutional Neural Network. The parameters in this layer are composed of a set of filters, also called kernels. We could visualize the convolution step in the following fashion: Each kernel slides horizontally and vertically till it covers the whole image. At every window, we compute the dot product of the kernel and the input. Then, we add the bias and apply an activation function. The result is a two-dimensional activation map. For example, some kernel may recognize corners, and some may recognize circles. These convolution kernels may respond strongly to the corresponding features. #### Convolutional Layer
Fig. 4 illustrates the step-by-step computation of a convolutional layer, where depths are flattened for simplicity. The input is $W_1=5$, $H_1=5$, $D_1=3$. In fact, this is a common representation for colored images. $W_1$ and $H_1$ correspond to the width and height in a colored image. $D_1$ corresponds to the three color channels for RGB. The parameters of the convolutional layer are $K=2$, $F=3$, $S=2$, $P=1$. $K$ denotes the number of kernels; specifically, $Filter$ $W_0$ and $Filter$ $W_1$ are the kernels. $F$ is the kernel size; $W_0$ and $W_1$ are both $F\times F = 3\times3$ matrices at every depth. $S$ is the stride, the number of units the sliding window moves at each step; here, kernels move rightwards or downwards by two units each time. $P$ is the width of the padding, which denotes an extension of the input; here, the gray area shows zero padding with size 1.
#### Pooling Layer The convolutional layer is the core of a convolutional neural network. The convolution used in image recognition is a two-dimensional convolution, that is, a discrete two-dimensional filter (also called a convolutional kernel) convolved with a two-dimensional image. In short, the two-dimensional filter slides over all positions of the two-dimensional image, and at each position the dot product is taken between the filter and the pixel together with its neighboring pixels. Convolution operations are widely used in image processing. Different convolutional kernels can extract different features, such as edges, lines, and corners. In deep convolutional neural networks, convolution operations can extract image features ranging from low-level to complex.
<p align="center"> <p align="center">
<img src="image/max_pooling_en.png" width="400px"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/conv_layer.png?raw=true" width='750'><br/>
Fig. 5 Pooling layer using max-pooling<br/> Figure 5. Convolutional Layer Picture <br/>
</p> </p>
A **pooling layer** performs downsampling. The main functionality of this layer is to reduce computation by reducing the network parameters. It also prevents over-fitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layer can use various techniques, such as max pooling and average pooling. As shown in Fig.5, max pooling uses rectangles to segment the input layer into several parts and computes the maximum value in each part as the output. Figure 5 shows an example of the process of computing convolution with input image in size of $H=5, W=5, D=3$, ie $5 \times 5$ size of 3 channel (RGB, also known as depth) color image.
This example contains two (denoted by $K$) groups of convolutional kernels, i.e. $Filter W_0$ and $Filter W_1$ in the figure. In convolution calculation, different convolutional kernels are usually used for different input channels. In the example, each group contains $D=3$ convolutional kernels of size $3\times 3$ (denoted by $F \times F$). In addition, the stride of the convolutional kernel in the horizontal and vertical directions of the image is 2 (denoted by $S$), and the input image is padded with 1 (denoted by $P$) row or column of zeros on each of its four sides; that is, the raw input data in the figure is the blue part, and the gray part is the zero padding of size 1. The convolution operation yields a feature map of size $3 \times 3 \times 2$ (denoted by $H_{o} \times W_{o} \times K$), i.e. a 2-channel feature map of size $3 \times 3$, where $H_o$ is calculated as $H_o = (H - F + 2 \times P)/S + 1$, and $W_o$ is calculated in the same way. Each pixel in the output feature map is the sum of the inner products of each kernel in a group with the corresponding input channel, plus the bias $b_o$; the bias is usually shared across each output feature map. The calculation of the last value, $-2$, in the output feature map $o[:,:,0]$ is shown in the lower right corner of Figure 5.
The convolutional kernel is a learnable parameter in the convolution operation. As shown in the example above, the number of parameters in each convolutional layer is $D \times F \times F \times K$. In the multi-layer perceptron model, neurons are usually fully connected and therefore have a large number of parameters. The convolutional layer has fewer parameters, which follows from its two main properties, namely local connections and shared weights.
- Local connection: Each neuron is connected to only one region of the input neurons, which is called the receptive field. In the image convolution operation, neurons are locally connected in the spatial dimensions (the plane spanned by H and W in the example above) but fully connected in depth. In a two-dimensional image, nearby pixels are strongly correlated. This local connection ensures that the learned filter responds most strongly to local input features. The idea of local connection is also inspired by the structure of the biological visual system, in which neurons in the visual cortex receive information locally.
- Weight sharing: The filter used to compute the neurons in the same depth slice is shared. For example, in Figure 5, every neuron in $o[:,:,0]$ is computed with the same filter, $W_0$, which greatly reduces the number of parameters. Weight sharing makes sense to a certain extent; for example, a low-level edge feature of an image is independent of its specific location in the image. However, it does not make sense in some cases. For example, when the input picture is a face, the eyes and the hair are in different positions, and we may want to learn different features at different positions (please refer to [Stanford University Open Class](http://cs231n.Github.io/convolutional-networks/)). Note that weights are only shared among neurons of the same depth slice. In a convolutional layer, multiple groups of convolutional kernels are usually used to extract different features, that is, features in different depth slices, and the weights of neurons in different depth slices are not shared. In addition, the bias is shared by all neurons in the same depth slice.
From the calculation process of convolution and its properties introduced above, convolution can be seen as a shift-invariant linear operation, that is, the same operation is performed at each position of the image. The local connections and weight sharing of the convolutional layer greatly reduce the number of parameters that need to be learned, which helps with training larger convolutional neural networks.
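The output size formula and the kernel weight count from the example can be checked with a short sketch in plain Python. The function names are hypothetical; the final line is an added comparison with a fully connected layer of the same input and output sizes, included only to illustrate the parameter savings.

```python
def conv_output_size(size, kernel, padding, stride):
    """H_o = (H - F + 2P) / S + 1, using integer division."""
    return (size - kernel + 2 * padding) // stride + 1


def conv_weight_count(depth, kernel, num_kernels):
    """Number of kernel weights, D * F * F * K (biases not included)."""
    return depth * kernel * kernel * num_kernels


# Values from the example: H = W = 5, D = 3, K = 2, F = 3, S = 2, P = 1.
h_out = conv_output_size(5, kernel=3, padding=1, stride=2)
print(h_out)                       # 3
print(conv_weight_count(3, 3, 2))  # 54

# For comparison: a fully connected layer mapping the same 5x5x3 input to 3x3x2 outputs.
print((5 * 5 * 3) * (3 * 3 * 2))   # 1350
```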
#### LeNet-5 Network For more information about convolution, please refer to [Reference Reading](http://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolution#Convolutions).
### Pooling Layer
<p align="center"> <p align="center">
<img src="image/cnn_en.png"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/image/max_pooling.png?raw=true" width="400px"><br/>
Fig. 6. LeNet-5 Convolutional Neural Network architecture<br/> Figure 6. Pooling layer<br/>
</p> </p>
[**LeNet-5**](http://yann.lecun.com/exdb/lenet/) is one of the simplest Convolutional Neural Networks. Fig. 6. shows its architecture: A 2-dimensional input image is fed into two sets of convolutional layers and pooling layers. This output is then fed to a fully connected layer and a softmax classifier. Compared to multilayer, fully connected perceptrons, the LeNet-5 can recognize images better. This is due to the following three properties of the convolution: Pooling is a form of nonlinear downsampling. The main function of this layer is to reduce computation by reducing the network parameters and to control overfitting to some extent. Normally a pooling layer is added after the convolutional layer. Pooling includes max pooling, average pooling and so on. Max pooling divides the input layer into regions with non-overlapping rectangular boxes and takes the maximum value of each box as the output, as shown in Figure 6.
- The 3D nature of the neurons: a convolutional layer is organized by width, height, and depth. Neurons in each layer are connected to only a small region in the previous layer. This region is called the receptive field. For details about convolutional neural networks, please refer to the [Stanford Online Course](http://cs231n.github.io/convolutional-networks/), [Ufldl](http://ufldl.stanford.edu/wiki/index.php/Pooling) and the [Image Classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification) tutorial.
- Local connectivity: A CNN utilizes the local space correlation by connecting local neurons. This design guarantees that the learned filter has a strong response to local input features. Stacking many such layers generates a non-linear filter that is more global. This enables the network to first obtain good representation for small parts of input and then combine them to represent a larger region.
- Weight sharing: In a CNN, computation is iterated on shared parameters (weights and bias) to form a feature map. This means that all the neurons in the same depth of the output response to the same feature. This allows the network to detect a feature regardless of its position in the input.
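Max pooling over non-overlapping rectangles, as described in the pooling discussion above, can be sketched in plain Python. The function name `max_pool` and the sample feature map are illustrative assumptions.

```python
def max_pool(feature_map, size=2, stride=2):
    """Max pooling over a 2-D feature map given as a list of rows."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Collect the values inside the current window and keep the largest.
            window = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
print(max_pool(fmap))  # [[6, 4], [8, 9]]
```

With a window size equal to the stride, the windows do not overlap and each side of the feature map is halved.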
For more details on Convolutional Neural Networks, please refer to the tutorial on [Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) and the [relevant lecture](http://cs231n.github.io/convolutional-networks/) from a Stanford course. <a name="common activation functions"></a>
### Common activation functions
### List of Common Activation Functions
- Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $ - Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $
- Tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $ - Tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $
In fact, tanh function is just a rescaled version of the sigmoid function. It is obtained by magnifying the value of the sigmoid function and moving it downwards by 1. In fact, the tanh function is only a sigmoid function with change of scale. The value of the sigmoid function is doubled and then shifted down by 1 unit: tanh(x) = 2sigmoid(2x) - 1 .
- ReLU activation function: $ f(x) = max(0, x) $ - ReLU activation function: $ f(x) = max(0, x) $
For more information, please refer to [Activation functions on Wikipedia](https://en.wikipedia.org/wiki/Activation_function). For details, please refer to [activation function in Wikipedia](https://en.wikipedia.org/wiki/Activation_function).
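The three activation functions listed above, and the relation $tanh(x) = 2\,sigmoid(2x) - 1$, can be checked numerically with a minimal sketch in plain Python:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    return max(0.0, x)


for x in (-2.0, 0.0, 0.5, 3.0):
    # tanh(x) = 2 * sigmoid(2x) - 1
    assert abs(tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12

print(sigmoid(0.0), tanh(0.0), relu(-1.5), relu(2.0))  # 0.5 0.0 0.0 2.0
```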
## Data Preparation
PaddlePaddle provides a Python module, `paddle.dataset.mnist`, which downloads and caches the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). The cache is under `/home/username/.cache/paddle/dataset/mnist`: ## Dataset Preparation
PaddlePaddle provides a module `paddle.dataset.mnist` that automatically downloads and loads the [MNIST](http://yann.lecun.com/exdb/mnist/) data. The downloaded data is located under `/home/username/.cache/paddle/dataset/mnist`:
| File name | Description | Size |
|----------------------|--------------|-----------|
|train-images-idx3-ubyte| Training images | 60,000 |
|train-labels-idx1-ubyte| Training labels | 60,000 |
|t10k-images-idx3-ubyte | Evaluation images | 10,000 |
|t10k-labels-idx1-ubyte | Evaluation labels | 10,000 |
| filename | note |
|----------------------|-------------------------|
|train-images-idx3-ubyte| train data picture, 60,000 data |
|train-labels-idx1-ubyte| train data label, 60,000 data |
|t10k-images-idx3-ubyte | test data picture, 10,000 data |
|t10k-labels-idx1-ubyte | test data label, 10,000 data |
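For readers curious about what these files contain, the following sketch parses the IDX layout in plain Python. It assumes the files follow the standard IDX format described on the MNIST page (big-endian header, magic number 2051 for images and 2049 for labels); it does not use the PaddlePaddle API, the function names are hypothetical, and the bytes below are synthetic stand-ins. Compressed downloads would need to be decompressed first.

```python
import struct


def parse_idx_images(data):
    """Parse IDX3 image bytes into (count, rows, cols, list of flat pixel lists)."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError("not an IDX3 image file")
    size = rows * cols
    images = [list(data[16 + i * size: 16 + (i + 1) * size]) for i in range(count)]
    return count, rows, cols, images


def parse_idx_labels(data):
    """Parse IDX1 label bytes into a list of integer labels."""
    magic, count = struct.unpack(">II", data[:8])
    if magic != 2049:
        raise ValueError("not an IDX1 label file")
    return list(data[8:8 + count])


# Synthetic stand-ins for the real files: two 2x2 images and two labels.
image_bytes = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
label_bytes = struct.pack(">II", 2049, 2) + bytes([7, 3])
print(parse_idx_images(image_bytes))
print(parse_idx_labels(label_bytes))  # [7, 3]
```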
## Fluid API Overview ## Fluid API Overview
The demo will be using the latest paddle fluid API. Fluid API is the latest Paddle API. It simplifies the model configurations without sacrificing the performance. The demo will use the latest [Fluid API](http://paddlepaddle.org/documentation/docs/en/1.2/api_cn/index_cn.html). Fluid API is the latest PaddlePaddle API. It simplifies model configuration without sacrificing performance.
We recommend using Fluid API as it is much easier to pick up. We recommend using the Fluid API, which is easy to learn and use to help you complete your machine learning tasks quickly.
Here is a quick overview of the major Fluid API components.
Here is an overview of several important concepts in the Fluid API:
1. `inference_program`: A function that specifies how to get the prediction from the data input. 1. `inference_program`: specifies how to get the prediction from the data input.
This is where you specify the network flow. This is where the network flow is defined.
1. `train_program`: A function that specifies how to get avg_cost from `inference_program` and labels.
This is where you specify the loss calculations.
1. `optimizer_func`: A function that specifies the configuration of the optimizer. The optimizer is responsible for minimizing the loss and driving the training. Paddle supports many different optimizers.
1. `Trainer`: Fluid trainer manages the training process specified by the `train_program` and `optimizer`. Users can monitor the training
progress through the `event_handler` callback function.
1. `Inferencer`: Fluid inferencer loads the `inference_program` and the parameters trained by the Trainer.
It then can infer the data and return prediction
We will go through all of them and dig more into the configurations in this demo. 2. `train_program`: specifies how to get the `loss` from `inference_program` and the labels.
This is where the loss calculation is specified.
## Model Configuration 3. `optimizer_func`: A function that specifies the optimizer configuration. The optimizer is responsible for reducing the loss and driving training. Paddle supports a number of different optimizers.
A PaddlePaddle program starts from importing the API package: In the code examples below, we'll take a closer look at them.
## Configuration Instructions
Load the Fluid API package for PaddlePaddle.
```python ```python
import paddle from __future__ import print_function #load print of python3 into current version
import os
from PIL import Image # load module of image processing
import matplotlib.pyplot as plt
import numpy
import paddle # load paddle module
import paddle.fluid as fluid import paddle.fluid as fluid
from __future__ import print_function
try:
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
``` ```
### Program Functions Configuration ### Program Functions Configuration
First, We need to setup the `inference_program` function. We want to use this program to demonstrate three different classifiers, each defined as a Python function. We need to configure `inference_program` function. We want to use this program to show three different classifiers, each of which is defined as a Python function.
We need to feed image data to the classifier. PaddlePaddle provides a special layer `layers.data` for reading data. We need to input the image data into the classifier. Paddle provides a special layer `layers.data` for reading data.
Let us create a data layer for reading images and connect it to the classification network. Let's create a data layer to read the image and connect it to the network of classification.
- Softmax regression: the network has a fully-connected layer with softmax activation: - Softmax regression: The classification result is obtained through a single fully connected layer with softmax as the activation function.
```python ```python
def softmax_regression(): def softmax_regression():
"""
Define softmax classifier:
A fully connected layer with activation function softmax
Return:
predict_image -- result of classification
"""
# input original image data in size of 28*28*1
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32') img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
# fully connected layer with softmax as the activation function; the size of the output layer must be 10
predict = fluid.layers.fc( predict = fluid.layers.fc(
input=img, size=10, act='softmax') input=img, size=10, act='softmax')
return predict return predict
``` ```
- Multi-Layer Perceptron: this network has two hidden fully-connected layers, both using ReLU as activation function. The output layer uses softmax activation: - Multilayer Perceptron: The following code implements a multilayer perceptron with two hidden layers (that is, fully connected layers). Both hidden layers use ReLU as the activation function, and the output layer uses Softmax.
```python ```python
def multilayer_perceptron(): def multilayer_perceptron():
"""
Define multilayer perceptron classifier:
Multilayer perceptron with two hidden layers (fully connected layers)
The activation function of the first two hidden layers uses ReLU, and the activation function of the output layer uses Softmax.
Return:
predict_image -- result of classification
"""
# input raw image data in size of 28*28*1
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32') img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
# first fully-connected layer, using ReLu as its activation function # the first fully connected layer, whose activation function is ReLU
hidden = fluid.layers.fc(input=img, size=200, act='relu') hidden = fluid.layers.fc(input=img, size=200, act='relu')
# second fully-connected layer, using ReLu as its activation function # the second fully connected layer, whose activation function is ReLU
hidden = fluid.layers.fc(input=hidden, size=200, act='relu') hidden = fluid.layers.fc(input=hidden, size=200, act='relu')
# fully connected output layer with softmax as the activation function; the size of the output layer must be 10
prediction = fluid.layers.fc(input=hidden, size=10, act='softmax') prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
return prediction return prediction
``` ```
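To get a feel for the size of the two fully connected classifiers above, their parameters can be counted with a short sketch in plain Python. It assumes that each `fc` layer flattens its input and has one bias per output unit; the helper `fc_params` is hypothetical and not part of the Fluid API.

```python
def fc_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


inputs = 1 * 28 * 28  # the img layer has shape [1, 28, 28]

softmax_regression_params = fc_params(inputs, 10)
mlp_params = fc_params(inputs, 200) + fc_params(200, 200) + fc_params(200, 10)

print(softmax_regression_params)  # 7850
print(mlp_params)                 # 199210
```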
- Convolution network LeNet-5: the input image is fed through two convolution-pooling layers, a fully-connected layer, and the softmax output layer: - Convolutional neural network LeNet-5: The input two-dimensional image first passes through two convolution-pooling layers, and then a fully connected layer with softmax as the activation function is used as the output layer.
```python ```python
def convolutional_neural_network(): def convolutional_neural_network():
"""
Define convolutional neural network classifier:
The input 2D image passes through two convolution-pooling layers, using the fully connected layer with softmax as the output layer
Return:
predict -- result of classification
"""
# input raw image data in size of 28*28*1
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32') img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
# first conv pool # the first convolution-pooling layer
# Use 20 5*5 filters, the pooling size is 2, the pooling step is 2, and the activation function is Relu.
conv_pool_1 = fluid.nets.simple_img_conv_pool( conv_pool_1 = fluid.nets.simple_img_conv_pool(
input=img, input=img,
filter_size=5, filter_size=5,
...@@ -257,7 +283,8 @@ def convolutional_neural_network(): ...@@ -257,7 +283,8 @@ def convolutional_neural_network():
pool_stride=2, pool_stride=2,
act="relu") act="relu")
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1) conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
# second conv pool # the second convolution-pooling layer
# Use 20 5*5 filters, the pooling size is 2, the pooling step is 2, and the activation function is Relu.
conv_pool_2 = fluid.nets.simple_img_conv_pool( conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1, input=conv_pool_1,
filter_size=5, filter_size=5,
...@@ -265,121 +292,239 @@ def convolutional_neural_network(): ...@@ -265,121 +292,239 @@ def convolutional_neural_network():
pool_size=2, pool_size=2,
pool_stride=2, pool_stride=2,
act="relu") act="relu")
# output layer with softmax activation function. size = 10 since there are only 10 possible digits. # fully connected output layer with softmax as the activation function; the size of the output layer must be 10
prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax') prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax')
return prediction return prediction
``` ```
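The spatial size of the feature maps in the network above can be traced with a small sketch in plain Python. It assumes that each convolution uses stride 1 and no padding, which is not stated in the code shown here; batch normalization does not change the shape. The helper names are hypothetical.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output size of a convolution along one spatial dimension."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_out(size, pool, stride):
    """Output size of a pooling layer along one spatial dimension."""
    return (size - pool) // stride + 1


size = 28
size = conv_out(size, kernel=5)          # first convolution  -> 24
size = pool_out(size, pool=2, stride=2)  # first pooling      -> 12
size = conv_out(size, kernel=5)          # second convolution -> 8
size = pool_out(size, pool=2, stride=2)  # second pooling     -> 4
print(size)  # 4
```

Under these assumptions, the final fully connected layer receives feature maps of size $4 \times 4$ per channel.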
#### Train Program Configuration #### Train Program Configuration
Then we need to set up the `train_program`. It takes the prediction from the classifier first. Then we need to set up the train program `train_program`. It first gets the prediction from the classifier.
During the training, it will calculate the `avg_cost` from the prediction. During the training, it will compute `avg_cost`.
**NOTE:** A train program should return an array and the first return argument has to be `avg_cost`. **Note:** The train program should return an array. The first item returned must be `avg_cost`. The trainer uses it to compute the gradient.
The trainer always implicitly uses it to calculate the gradient.
Please feel free to modify the code to test different results between `softmax regression`, `mlp`, and `convolutional neural network` classifier. Feel free to modify the code and compare the results of the `softmax_regression`, `MLP` and `convolutional neural network` classifiers.
```python
def train_program():
    """
    Configure the train program.

    Returns:
        predict -- classification result
        avg_cost -- mean loss
        acc -- classification accuracy
    """
    # Label layer: the ground-truth digit class of each input image.
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    # predict = softmax_regression()  # uncomment for softmax regression
    # predict = multilayer_perceptron()  # uncomment for the multilayer perceptron
    predict = convolutional_neural_network()  # LeNet-5 convolutional neural network

    # Cross-entropy loss between the prediction and the label.
    cost = fluid.layers.cross_entropy(input=predict, label=label)
    # Mean loss over the minibatch.
    avg_cost = fluid.layers.mean(cost)
    # Classification accuracy.
    acc = fluid.layers.accuracy(input=predict, label=label)
    # avg_cost must be the first item of the returned list.
    return predict, [avg_cost, acc]
```
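As a rough illustration of what the three quantities mean, the sketch below recomputes cross-entropy, mean loss and accuracy for a toy batch in plain Python. It is not Paddle's implementation; the helper names and the sample probabilities are made up for the example.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def batch_metrics(batch_probs, labels):
    """Mean cross-entropy and accuracy over one minibatch."""
    costs = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    avg_cost = sum(costs) / float(len(costs))
    correct = sum(1 for p, y in zip(batch_probs, labels)
                  if p.index(max(p)) == y)
    acc = correct / float(len(labels))
    return avg_cost, acc


# Hypothetical softmax outputs for two samples over three classes.
toy_probs = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]
toy_labels = [1, 2]
avg_cost, acc = batch_metrics(toy_probs, toy_labels)
print("avg_cost=%f acc=%f" % (avg_cost, acc))
```

The first sample is classified correctly and contributes a small loss; the second is wrong and dominates the mean, which is the signal the optimizer uses.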
#### Optimizer Function Configuration
In the `Adam` optimizer below, `learning_rate` sets the learning rate; its magnitude affects how quickly training converges.
```python
def optimizer_program():
    return fluid.optimizer.Adam(learning_rate=0.001)
```
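To make the effect of `learning_rate` concrete, here is a minimal single-parameter sketch of the Adam update rule applied to a toy quadratic. It is an illustration under stated assumptions, not Paddle's optimizer: the function `adam_minimize`, the default `beta1`, `beta2` and `eps` values, and the toy objective are all chosen for the example.

```python
import math


def adam_minimize(grad_fn, x0, learning_rate=0.001, beta1=0.9,
                  beta2=0.999, eps=1e-8, steps=200):
    """Run a fixed number of Adam steps on a scalar parameter."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x


def toy_grad(x):
    """Gradient of the toy objective (x - 3)^2, minimized at x = 3."""
    return 2.0 * (x - 3.0)


slow = adam_minimize(toy_grad, 0.0, learning_rate=0.001)
fast = adam_minimize(toy_grad, 0.0, learning_rate=0.1)
print("lr=0.001 -> %f, lr=0.1 -> %f" % (slow, fast))
```

With the same number of steps, the larger learning rate ends much closer to the minimum on this toy problem; on a real network a rate that is too large can also make training unstable.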
### Data Feeders Configuration
Next we specify the training data `paddle.dataset.mnist.train()` and the test data `paddle.dataset.mnist.test()`. These two functions are *reader creators*: each returns a *reader*, a Python function that returns a generator yielding data instances when called.
`shuffle` is a reader decorator. It takes a reader A and returns a new reader B. B reads `buf_size` instances from A into a buffer, shuffles them, and yields the shuffled instances one at a time. A larger buffer gives a more thorough shuffle.
`batch` is a special decorator that takes a reader and returns a *batch reader*, which yields a minibatch at a time rather than a single instance.
```python
# Each minibatch holds 64 samples.
BATCH_SIZE = 64
# Shuffle the training set with a 500-sample buffer, then group it into minibatches of 64.
train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.mnist.train(), buf_size=500),
    batch_size=BATCH_SIZE)
# Read the test set in minibatches of 64.
test_reader = paddle.batch(
    paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
```
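The reader, shuffle and batch pattern can be sketched in plain Python. The code below only mimics the behaviour described above and is not Paddle's implementation; `make_reader`, `shuffle` and `batch` are stand-in names, and keeping the final partial batch is an assumption of this sketch.

```python
import random


def make_reader(samples):
    """Reader creator: returns a reader, i.e. a function returning a generator."""
    def reader():
        for sample in samples:
            yield sample
    return reader


def shuffle(reader, buf_size, seed=None):
    """Decorator: fill a buffer of buf_size samples, shuffle it, yield one by one."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        if buf:  # flush whatever is left in the buffer
            rng.shuffle(buf)
            for item in buf:
                yield item
    return shuffled_reader


def batch(reader, batch_size):
    """Decorator: group single samples into lists of batch_size samples."""
    def batch_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # assumption: keep the last, smaller batch
            yield current
    return batch_reader


toy_reader = batch(shuffle(make_reader(range(10)), buf_size=4, seed=0),
                   batch_size=3)
batches = list(toy_reader())
print([len(b) for b in batches])  # [3, 3, 3, 1]
```

Because shuffling happens one buffer at a time, the first four yielded samples are always a permutation of the first four inputs, which is why a small buffer mixes the data less thoroughly.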
### Create the training process
Now we create the training process. We use the `train_program`, `place` and `optimizer` defined earlier, evaluate the test loss after each training epoch, and save the model parameters for prediction.
#### Event Handler Configuration
A handler function lets us monitor progress while training runs.
We show two event handlers here. Feel free to modify them in the Jupyter Notebook and observe the differences.
`event_handler` prints the training results during training:
```python
def event_handler(pass_id, batch_id, cost):
    # Print intermediate training results: the pass, the batch number, and the loss.
    print("Pass %d, Batch %d, Cost %f" % (pass_id, batch_id, cost))
```
```python
from paddle.utils.plot import Ploter

train_prompt = "Train cost"
test_prompt = "Test cost"
cost_ploter = Ploter(train_prompt, test_prompt)

# Visualize the training process.
def event_handler_plot(ploter_title, step, cost):
    cost_ploter.append(ploter_title, step, cost)
    cost_ploter.plot()
```
`event_handler_plot` produces a plot like the following:
![png](./image/train_and_test.png)
### Start training
With the `event_handler` and the data readers configured, we can start training the model.
First we set the run parameters and describe the input data: `feed_order` maps the data fields to the inputs of `train_program`, and a `train_test` function reports the loss during training.
Define the network structure:
```python
# Run the model on a single CPU.
use_cuda = False  # set to True to train on a GPU
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

# Call train_program to get the prediction and the loss.
prediction, [avg_loss, acc] = train_program()

# The network takes two inputs: 'img', the original 28*28*1 image data,
# and 'label', the class of the input image.
feeder = fluid.DataFeeder(feed_list=['img', 'label'], place=place)

# Choose the Adam optimizer.
optimizer = optimizer_program()
optimizer.minimize(avg_loss)
```
Configure the training hyperparameters:
```python
PASS_NUM = 5  # train for 5 epochs
epochs = [epoch_id for epoch_id in range(PASS_NUM)]

# Directory where the model parameters are saved.
save_dirname = "recognize_digits.inference.model"
```
```python
def train_test(train_test_program, train_test_feed, train_test_reader):
    # Classification accuracy of each test batch.
    acc_set = []
    # Mean loss of each test batch.
    avg_loss_set = []
    # Feed every batch yielded by the test reader through the network.
    for test_data in train_test_reader():
        acc_np, avg_loss_np = exe.run(
            program=train_test_program,
            feed=train_test_feed.feed(test_data),
            fetch_list=[acc, avg_loss])
        acc_set.append(float(acc_np))
        avg_loss_set.append(float(avg_loss_np))
    # Average the accuracy and the loss over the test data.
    acc_val_mean = numpy.array(acc_set).mean()
    avg_loss_val_mean = numpy.array(avg_loss_set).mean()
    # Return the mean loss and the mean accuracy.
    return avg_loss_val_mean, acc_val_mean
```
Create the executor:
```python
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
```
Set up `main_program` and `test_program`:
```python
main_program = fluid.default_main_program()
test_program = fluid.default_main_program().clone(for_test=True)
```
Start training:
```python
lists = []
step = 0
for epoch_id in epochs:
    for step_id, data in enumerate(train_reader()):
        metrics = exe.run(main_program,
                          feed=feeder.feed(data),
                          fetch_list=[avg_loss, acc])
        if step % 100 == 0:  # log once every 100 training steps
            # The first number printed is the global step, the second the epoch.
            print("Pass %d, Batch %d, Cost %f" % (step, epoch_id, metrics[0]))
            event_handler_plot(train_prompt, step, metrics[0])
        step += 1

    # Evaluate on the test set after each epoch.
    avg_loss_val, acc_val = train_test(train_test_program=test_program,
                                       train_test_reader=test_reader,
                                       train_test_feed=feeder)
    print("Test with Epoch %d, avg_cost: %s, acc: %s" % (epoch_id, avg_loss_val, acc_val))
    # Plot the test loss, not the last training-batch loss.
    event_handler_plot(test_prompt, step, avg_loss_val)
    lists.append((epoch_id, avg_loss_val, acc_val))

    # Save the trained model parameters for prediction.
    if save_dirname is not None:
        fluid.io.save_inference_model(save_dirname,
                                      ["img"], [prediction], exe,
                                      model_filename=None,
                                      params_filename=None)

# Choose the best pass: the epoch with the lowest test loss.
best = sorted(lists, key=lambda item: float(item[1]))[0]
print('Best pass is %s, testing Avgcost is %s' % (best[0], best[1]))
print('The classification accuracy is %.2f%%' % (float(best[2]) * 100))
```
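The "best pass" selection at the end of the loop picks the epoch with the lowest test loss. A small standalone sketch of that step follows; the records are toy values invented for the illustration, not results from a real run, and `best_pass` is a hypothetical helper.

```python
def best_pass(records):
    """Return the (epoch, avg_loss, acc) record with the lowest avg_loss."""
    return min(records, key=lambda item: float(item[1]))


# Toy (epoch_id, avg_loss, acc) records, for illustration only.
toy_records = [(0, 0.30, 0.90), (1, 0.10, 0.97), (2, 0.20, 0.95)]
best_record = best_pass(toy_records)
print('Best pass is %s, testing Avgcost is %s' % (best_record[0], best_record[1]))
print('The classification accuracy is %.2f%%' % (float(best_record[2]) * 100))
```

Using `min` gives the same record as sorting and taking the first element, without building a sorted copy of the list.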
Training runs automatically, and the log printed during training looks like the following.
In each `Pass ..., Batch ..., Cost ...` line, the first number is the training step, the second is the epoch, and `Cost` is the loss of the current minibatch.
After each epoch, the mean loss and the classification accuracy on the test set are computed.
```
Pass 0, Batch 0, Cost 0.125650
...
Pass 900, Batch 0, Cost 0.239809
Test with Epoch 0, avg_cost: 0.053097883707459624, acc: 0.9822850318471338
```
After training, check the model's prediction accuracy. On MNIST, softmax regression typically reaches a classification accuracy of about 92.34%, the multilayer perceptron about 97.66%, and the convolutional neural network about 99.20%.
## Deploy the Model
You can use the trained model to classify images of handwritten digits. The program below shows how to run prediction with the trained model.
### Generate input data to be inferred
`infer_3.png` is an example image of the digit 3. Convert it into a numpy array that matches the feed data format.
```python
# Prepare the test image.
import os
import numpy
from PIL import Image

def load_image(file):
    # Open the image file and convert it to grayscale.
    im = Image.open(file).convert('L')
    # Resize the image to 28*28 with antialiasing.
    # Note: Image.ANTIALIAS was removed in Pillow 10; newer versions use Image.LANCZOS.
    im = im.resize((28, 28), Image.ANTIALIAS)
    # Convert the image to a numpy array of shape (1, 1, 28, 28).
    im = numpy.array(im).reshape(1, 1, 28, 28).astype(numpy.float32)
    # Normalize pixel values from [0, 255] to [-1, 1].
    im = im / 255.0 * 2.0 - 1.0
    return im

cur_dir = os.getcwd()
tensor_img = load_image(cur_dir + '/image/infer_3.png')
```
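The normalization step in `load_image` maps 8-bit pixel values onto the range [-1, 1]. The following sketch checks the mapping on scalar values; `normalize_pixel` and `denormalize_pixel` are hypothetical helpers written for this illustration.

```python
def normalize_pixel(value):
    """Map a pixel value in [0, 255] to [-1, 1], as load_image does."""
    return value / 255.0 * 2.0 - 1.0


def denormalize_pixel(value):
    """Inverse mapping from [-1, 1] back to [0, 255]."""
    return (value + 1.0) / 2.0 * 255.0


for pixel in (0, 127.5, 255):
    print(pixel, normalize_pixel(pixel))
```

Black (0) maps to -1, white (255) maps to 1, and the mid-gray value 127.5 maps to 0, so the inputs are centered around zero.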
### Inference
Now we are ready to run inference. `load_inference_model` restores the network configuration and the trained parameters, so the classifier defined earlier can be used directly.
```python
import matplotlib.pyplot as plt

inference_scope = fluid.core.Scope()
with fluid.scope_guard(inference_scope):
    # fluid.io.load_inference_model returns the inference program,
    # feed_target_names (names of the variables to feed into the network) and
    # fetch_targets (the variables to fetch from the network).
    [inference_program, feed_target_names,
     fetch_targets] = fluid.io.load_inference_model(
         save_dirname, exe, None, None)

    # Build the feed as a dictionary {feed_target_name: feed_target_data}.
    # The results contain a list of data corresponding to fetch_targets.
    results = exe.run(inference_program,
                      feed={feed_target_names[0]: tensor_img},
                      fetch_list=fetch_targets)
    lab = numpy.argsort(results)

    # Show the image and print the prediction for infer_3.png.
    img = Image.open('image/infer_3.png')
    plt.imshow(img)
    print("Inference result of image/infer_3.png is: %d" % lab[0][0][-1])
```
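The expression `lab[0][0][-1]` works because `argsort` orders class indices from the lowest to the highest probability, so the last index is the predicted digit. A plain-Python sketch of the same indexing follows; the probabilities are toy values made up for the illustration, and the nested lists stand in for the fetched result of one image.

```python
def argsort(values):
    """Indices that would sort values in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


# One fetch target, one image in the batch, ten class probabilities (toy values).
toy_results = [[[0.01, 0.02, 0.03, 0.80, 0.02, 0.03, 0.02, 0.03, 0.02, 0.02]]]
toy_lab = [[argsort(row) for row in target] for target in toy_results]
predicted = toy_lab[0][0][-1]
print("Predicted digit: %d" % predicted)  # Predicted digit: 3
```

The first index selects the fetch target, the second selects the image within the batch, and `-1` takes the class with the highest probability.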
### Result
If everything works, the inference output is as follows:
`Inference result of image/infer_3.png is: 3`, which shows that our network recognized the image correctly!
## Summary
Softmax regression, the multilayer perceptron and the convolutional neural network covered in this tutorial are the most basic deep learning models. More complex networks in later chapters are derived from them, so understanding them will help with later study. We also saw that moving from simple softmax regression to the slightly more complex convolutional neural network greatly improves recognition accuracy on MNIST, thanks to the local connections and weight sharing of convolutional layers. When you study new models later, we encourage you to understand the key ideas that let a new model improve on an older one. This tutorial also covered the basic steps of building a PaddlePaddle model: writing the data provider, constructing the network, training, and prediction. Once you are familiar with this workflow, you can use your own data, define your own network, and complete your own training and prediction tasks.
<a name="References"></a>
## References
1. LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. ["Gradient-based learning applied to document recognition."](http://ieeexplore.ieee.org/abstract/document/726791/) Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
2. Wejéus, Samuel. ["A Neural Network Approach to Arbitrary Symbol Recognition on Modern Smartphones."](http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A753279&dswid=-434) (2014).
3. Decoste, Dennis, and Bernhard Schölkopf. ["Training invariant support vector machines."](http://link.springer.com/article/10.1023/A:1012454411458) Machine learning 46, no. 1-3 (2002): 161-190.
4. Simard, Patrice Y., David Steinkraus, and John C. Platt. ["Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis."](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.8494&rep=rep1&type=pdf) In ICDAR, vol. 3, pp. 958-962. 2003.
5. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. ["Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure."](http://www.jmlr.org/proceedings/papers/v2/salakhutdinov07a/salakhutdinov07a.pdf) In AISTATS, vol. 11. 2007.
...@@ -478,7 +624,7 @@ Moreover, this tutorial introduces the basic flow of PaddlePaddle model design, ...@@ -478,7 +624,7 @@ Moreover, this tutorial introduces the basic flow of PaddlePaddle model design,
10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58. 10. Bishop, Christopher M. ["Pattern recognition."](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf) Machine Learning 128 (2006): 1-58.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -15,13 +15,28 @@ ...@@ -15,13 +15,28 @@
from __future__ import print_function from __future__ import print_function
import os import os
import argparse
from PIL import Image from PIL import Image
import numpy import numpy
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
BATCH_SIZE = 64
PASS_NUM = 5 def parse_args():
parser = argparse.ArgumentParser("mnist")
parser.add_argument(
'--enable_ce',
action='store_true',
help="If set, run the task with continuous evaluation logs.")
parser.add_argument(
'--use_gpu',
type=bool,
default=False,
help="Whether to use GPU or not.")
parser.add_argument(
'--num_epochs', type=int, default=5, help="number of epochs.")
args = parser.parse_args()
return args
def loss_net(hidden, label): def loss_net(hidden, label):
...@@ -69,6 +84,23 @@ def train(nn_type, ...@@ -69,6 +84,23 @@ def train(nn_type,
if use_cuda and not fluid.core.is_compiled_with_cuda(): if use_cuda and not fluid.core.is_compiled_with_cuda():
return return
startup_program = fluid.default_startup_program()
main_program = fluid.default_main_program()
if args.enable_ce:
train_reader = paddle.batch(
paddle.dataset.mnist.train(), batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
startup_program.random_seed = 90
main_program.random_seed = 90
else:
train_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.mnist.train(), buf_size=500),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32') img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64') label = fluid.layers.data(name='label', shape=[1], dtype='int64')
...@@ -81,8 +113,7 @@ def train(nn_type, ...@@ -81,8 +113,7 @@ def train(nn_type,
prediction, avg_loss, acc = net_conf(img, label) prediction, avg_loss, acc = net_conf(img, label)
test_program = fluid.default_main_program().clone(for_test=True) test_program = main_program.clone(for_test=True)
optimizer = fluid.optimizer.Adam(learning_rate=0.001) optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(avg_loss) optimizer.minimize(avg_loss)
...@@ -105,15 +136,8 @@ def train(nn_type, ...@@ -105,15 +136,8 @@ def train(nn_type,
exe = fluid.Executor(place) exe = fluid.Executor(place)
train_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.mnist.train(), buf_size=500),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
feeder = fluid.DataFeeder(feed_list=[img, label], place=place) feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
exe.run(startup_program)
exe.run(fluid.default_startup_program())
main_program = fluid.default_main_program()
epochs = [epoch_id for epoch_id in range(PASS_NUM)] epochs = [epoch_id for epoch_id in range(PASS_NUM)]
lists = [] lists = []
...@@ -125,7 +149,7 @@ def train(nn_type, ...@@ -125,7 +149,7 @@ def train(nn_type,
feed=feeder.feed(data), feed=feeder.feed(data),
fetch_list=[avg_loss, acc]) fetch_list=[avg_loss, acc])
if step % 100 == 0: if step % 100 == 0:
print("Pass %d, Batch %d, Cost %f" % (step, epoch_id, print("Pass %d, Epoch %d, Cost %f" % (step, epoch_id,
metrics[0])) metrics[0]))
step += 1 step += 1
# test for epoch # test for epoch
...@@ -144,6 +168,11 @@ def train(nn_type, ...@@ -144,6 +168,11 @@ def train(nn_type,
model_filename=model_filename, model_filename=model_filename,
params_filename=params_filename) params_filename=params_filename)
if args.enable_ce:
print("kpis\ttrain_cost\t%f" % metrics[0])
print("kpis\ttest_cost\t%s" % avg_loss_val)
print("kpis\ttest_acc\t%s" % acc_val)
# find the best pass # find the best pass
best = sorted(lists, key=lambda list: float(list[1]))[0] best = sorted(lists, key=lambda list: float(list[1]))[0]
print('Best pass is %s, testing Avgcost is %s' % (best[0], best[1])) print('Best pass is %s, testing Avgcost is %s' % (best[0], best[1]))
...@@ -210,7 +239,10 @@ def main(use_cuda, nn_type): ...@@ -210,7 +239,10 @@ def main(use_cuda, nn_type):
if __name__ == '__main__': if __name__ == '__main__':
use_cuda = False args = parse_args()
BATCH_SIZE = 64
PASS_NUM = args.num_epochs
use_cuda = args.use_gpu
# predict = 'softmax_regression' # uncomment for Softmax # predict = 'softmax_regression' # uncomment for Softmax
# predict = 'multilayer_perceptron' # uncomment for MLP # predict = 'multilayer_perceptron' # uncomment for MLP
predict = 'convolutional_neural_network' # uncomment for LeNet5 predict = 'convolutional_neural_network' # uncomment for LeNet5
......
#!/bin/bash
#This file is only used for continuous evaluation.
export FLAGS_cudnn_deterministic=true
export CUDA_VISIBLE_DEVICES=0
python train.py --num_epochs 1 --use_gpu 1 --enable_ce | python _ce.py
...@@ -3,6 +3,18 @@ ...@@ -3,6 +3,18 @@
The source code for this tutorial is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification). If you are new to the book, please see the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
### Notes:
1. Hardware requirements:
This tutorial can run on both CPU and GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the book with Docker, note that the default image provides a GPU environment with CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in the document and in the script:
To make this tutorial easier to read and use, we split and adjusted the code of train.py and placed it in the text. The code here produces the same results as train.py; you can run [train.py](https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/train.py) directly to verify.
## Background
Compared with text, images convey information that is more vivid, easier to understand, and more artistic, and they are an important medium for passing and exchanging information. This tutorial focuses on an important problem in image recognition: image classification.
...@@ -52,7 +64,7 @@ ...@@ -52,7 +64,7 @@
2). **Feature encoding**: Low-level features contain a lot of redundancy and noise. To make the feature representation more robust, a feature transformation algorithm is applied to encode the low-level features; this is called feature encoding. Common methods include vector quantization coding \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector coding \[[7](#参考文献)\].
3). **Spatial feature constraints**: Feature encoding is usually followed by spatial feature constraints, also called **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which yields a feature representation with a degree of invariance. Pyramid feature matching is a common feature pooling method: it divides the image evenly into blocks and pools features within each block.
4). **Classification with a classifier**: After the previous steps, an image is described by a fixed-dimensional vector, which is then passed to a classifier. Common classifiers include SVM (Support Vector Machine) and random forests. Kernel SVMs are the most widely used and perform well on traditional image classification tasks.
...@@ -82,7 +94,7 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得 ...@@ -82,7 +94,7 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
In addition, because the parameters of every layer keep changing during training, the distribution of each layer's input shifts from one step to the next, so hyperparameters have to be chosen carefully. The Batch Normalization (BN) algorithm \[[14](#参考文献)\], proposed by Sergey Ioffe and Christian Szegedy in 2015, normalizes the features of every layer over each batch so that the per-layer distributions stay relatively stable. BN not only acts as a regularizer but also reduces the sensitivity to some hyperparameters. Experiments show that BN speeds up model convergence, and it has been widely used in later, deeper models.
Next we introduce the VGG, GoogLeNet and ResNet architectures.
### VGG
...@@ -93,9 +105,9 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得 ...@@ -93,9 +105,9 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
Figure 6. VGG16 model for ImageNet
</p>
### GoogLeNet
GoogLeNet \[[12](#参考文献)\] won the ILSVRC in 2014. Before introducing the model, let us look at the NIN (Network in Network) model \[[13](#参考文献)\] and the Inception module, because GoogLeNet is built from multiple groups of Inception modules and its design borrows ideas from NIN.
The NIN model has two main characteristics:
...@@ -110,17 +122,17 @@ Inception模块如下图7所示,图(a)是最简单的设计,输出是3个卷 ...@@ -110,17 +122,17 @@ Inception模块如下图7所示,图(a)是最简单的设计,输出是3个卷
Figure 7. Inception module
</p>
GoogLeNet is a stack of multiple groups of Inception modules. At the end of the network it does not use the traditional multiple fully connected layers; like NIN it uses an average pooling layer, but unlike NIN it adds a fully connected layer after the pooling layer to map to the number of classes. Besides these two characteristics, because features in the intermediate layers are also highly discriminative, GoogLeNet adds two auxiliary classifiers at intermediate layers, which strengthen the gradient during backpropagation and add regularization; the loss of the whole network is the weighted sum of the losses of the three classifiers.
The overall structure of GoogLeNet is shown in Figure 8. It has 22 layers in total: it starts with 3 ordinary convolution layers, followed by three groups of sub-networks (the first with 2 Inception modules, the second with 5, and the third with 2), and then an average pooling layer and a fully connected layer.
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/googlenet.jpeg?raw=true" ><br/>
Figure 8. GoogLeNet[12]
</p>
The model described above is the first version of GoogLeNet (GoogLeNet-v1). GoogLeNet-v2 \[[14](#参考文献)\] introduces BN layers; GoogLeNet-v3 \[[16](#参考文献)\] factorizes some convolution layers, further increasing the network's nonlinearity and depth; GoogLeNet-v4 \[[17](#参考文献)\] adopts the ResNet design ideas described below. Each version from v1 to v4 improves accuracy; for brevity, the structures of v2 to v4 are not covered in detail here.
### ResNet
...@@ -164,11 +176,12 @@ Paddle API提供了自动加载cifar数据集模块 `paddle.dataset.cifar`。 ...@@ -164,11 +176,12 @@ Paddle API提供了自动加载cifar数据集模块 `paddle.dataset.cifar`。
Let us start by importing the Paddle Fluid API and helper modules.
```python
from __future__ import print_function

import paddle
import paddle.fluid as fluid
import numpy
import sys
```
...@@ -209,7 +222,7 @@ def vgg_bn_drop(input): ...@@ -209,7 +222,7 @@ def vgg_bn_drop(input):
```

1. First, a group of convolutional layers, conv_block, is defined. The kernel size is 3x3, the pooling window is 2x2 with a stride of 2, groups sets how many consecutive convolutions each VGG block performs, and dropouts gives the Dropout probabilities. The `img_conv_group` used here is a predefined module in `paddle.nets`; it consists of several Conv->BN->ReLU->Dropout groups followed by one Pooling layer.

2. Five groups of convolutions follow, i.e. five conv_blocks. The first and second groups use two consecutive convolutions; the third, fourth and fifth groups use three. The Dropout probability after the last convolution of each group is 0, i.e. no Dropout is applied there.
...@@ -276,7 +289,7 @@ def layer_warp(block_func, input, ch_in, ch_out, count, stride): ...@@ -276,7 +289,7 @@ def layer_warp(block_func, input, ch_in, ch_out, count, stride):
3. Finally, apply average pooling to the network and return that layer.

Note: apart from the first convolutional layer and the last fully-connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6, i.e. the depth of `resnet_cifar10` must satisfy (depth - 2) % 6 == 0.
```python
def resnet_cifar10(ipt, depth=32):
...@@ -294,7 +307,7 @@ def resnet_cifar10(ipt, depth=32): ...@@ -294,7 +307,7 @@ def resnet_cifar10(ipt, depth=32):
    return predict
```
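The depth constraint in the note above can be checked with a small helper. This is only a sketch: `blocks_per_stage` is a hypothetical name, not part of the tutorial code, and it assumes each basic block contributes two parameterized layers, which is why three groups need a multiple of 6.

```python
def blocks_per_stage(depth):
    """Return how many basic blocks each of the three groups would get.

    Assumed layout: one stem convolution, three groups of
    two-convolution basic blocks, and one final fully-connected layer.
    """
    if (depth - 2) % 6 != 0:
        raise ValueError(
            "depth must satisfy (depth - 2) % 6 == 0, got %d" % depth)
    return (depth - 2) // 6


# A few depths that satisfy the constraint.
for depth in (20, 32, 56):
    print(depth, blocks_per_stage(depth))
```

With the default `depth=32`, each group would contain 5 blocks under this assumption; a value such as 30 is rejected.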
## Inference Program Configuration

The network input is defined as `data_layer`, which in image classification holds the image pixel data. CIFAR10 images are 32x32 RGB images with 3 channels, so the input size is 3072 (3x32x32).
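The arithmetic behind the input size can be spelled out as follows. This is an illustrative sketch only: `to_chw` is a hypothetical helper, and the channel-major ordering of the flat vector (all of channel 0, then channel 1, then channel 2) is an assumption rather than something stated in the text.

```python
# CIFAR10 image layout as stated in the text: 3 channels, 32 x 32 pixels.
channels, height, width = 3, 32, 32
flat_size = channels * height * width  # 3 * 32 * 32


def to_chw(flat):
    """Reshape a flat list of pixel values into [channel][row][col] lists.

    Assumes the flat vector is stored channel by channel.
    """
    if len(flat) != flat_size:
        raise ValueError("expected %d values, got %d" % (flat_size, len(flat)))
    plane = height * width
    return [[flat[c * plane + r * width:c * plane + (r + 1) * width]
             for r in range(height)]
            for c in range(channels)]


image = to_chw(list(range(flat_size)))
print(flat_size, len(image), len(image[0]), len(image[0][0]))
```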
...@@ -358,7 +371,7 @@ test_reader = paddle.batch( ...@@ -358,7 +371,7 @@ test_reader = paddle.batch(
```

### Implementing the Trainer Program

We need to define a main_program for training and, likewise, a test_program for testing. We also define the training `place` and use the previously defined optimizer `optimizer_program`.

```python
...@@ -525,11 +538,7 @@ with fluid.scope_guard(inference_scope): ...@@ -525,11 +538,7 @@ with fluid.scope_guard(inference_scope):
    [inference_program, feed_target_names,
     fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)
# The input's dimension of conv should be 4-D or 5-D.
# Use inference_transpiler to speedup
inference_transpiler_program = inference_program.clone()
t = fluid.transpiler.InferenceTranspiler()
t.transpile(inference_transpiler_program, place)
    # Construct feed as a dictionary of {feed_target_name: feed_target_data}
    # and results will contain a list of data corresponding to fetch_targets.
...@@ -537,14 +546,6 @@ with fluid.scope_guard(inference_scope): ...@@ -537,14 +546,6 @@ with fluid.scope_guard(inference_scope):
                      feed={feed_target_names[0]: img},
                      fetch_list=fetch_targets)
transpiler_results = exe.run(inference_transpiler_program,
feed={feed_target_names[0]: img},
fetch_list=fetch_targets)
assert len(results[0]) == len(transpiler_results[0])
for i in range(len(results[0])):
numpy.testing.assert_almost_equal(
results[0][i], transpiler_results[0][i], decimal=5)
    # infer label
    label_list = [
...@@ -557,7 +558,7 @@ with fluid.scope_guard(inference_scope): ...@@ -557,7 +558,7 @@ with fluid.scope_guard(inference_scope):
## Summary

Traditional image classification methods consist of multiple stages and a fairly complex framework, whereas end-to-end CNN models do everything in one step and greatly improve classification accuracy. In this chapter, we first introduced three classic models: VGG, GoogLeNet and ResNet. Then, based on the CIFAR10 dataset, we showed how to configure and train CNN models with PaddlePaddle, in particular VGG and ResNet. Finally, we showed how to use the PaddlePaddle API to run prediction and feature extraction on images. For other datasets such as ImageNet, the configuration and training workflow is the same, and readers can experiment on their own.

<a name="参考文献"></a>
## References
...@@ -596,7 +597,7 @@ with fluid.scope_guard(inference_scope): ...@@ -596,7 +597,7 @@ with fluid.scope_guard(inference_scope):
[17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016).

[18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective](http://link.springer.com/article/10.1007/s11263-014-0733-5). International Journal of Computer Vision, 111(1), 98-136, 2015.

[19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015.
...@@ -607,4 +608,4 @@ with fluid.scope_guard(inference_scope): ...@@ -607,4 +608,4 @@ with fluid.scope_guard(inference_scope):
[22] http://cs231n.github.io/classification/

<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
Image Classification
=======================

The source code for this chapter is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification). If you are new to the book, see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).

## Background

Compared with words, images provide information in a much more vivid, artistic and easy-to-understand manner. They are an important medium for people to express and exchange ideas. In this chapter, we focus on one of the essential problems in image recognition -- image classification.

Image classification is the task of distinguishing images in different categories based on their semantic meaning. It is a core problem in computer vision and is also the foundation of other higher-level computer vision tasks such as object detection, image segmentation, object tracking and action recognition. Image classification has applications in many areas such as face recognition, intelligent video analysis in security systems, traffic scene recognition in transportation systems, content-based image retrieval and automatic photo indexing in Internet services, and image classification in the medical industry.

To classify an image, we first encode the entire image using handcrafted or learned features and then determine the category using a classifier. Thus, feature extraction plays an important role in image classification. Prior to deep learning, the BoW (Bag of Words) model was the most widely used method for classifying an image. The BoW technique was introduced in Natural Language Processing, where a training sentence is represented as a bag of words. In the context of image classification, the BoW model requires constructing a dictionary. The simplest BoW framework can be designed in three steps: **feature extraction**, **feature encoding** and **classifier design**.

With deep learning, image classification can be framed as a supervised or unsupervised learning problem that learns hierarchical features automatically, without any need for manually crafted features. In recent years, Convolutional Neural Networks (CNNs) have made significant progress in image classification. CNNs use raw image pixels as input, extract low-level and high-level abstract features through convolution operations, and directly output the classification results from the model. This style of end-to-end learning has led to not only higher performance but also wider adoption in various applications.

In this chapter, we introduce deep-learning-based image classification methods and explain how to train a CNN model using PaddlePaddle.
## Result Demo

Image classification can be divided into general image classification and fine-grained image classification.

Figure 1 shows the results of general image classification -- the trained model can correctly recognize the main objects in the images.

<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/dog_cat.png?raw=true" width="350" ><br/>
Figure 1. General image classification
</p>

Figure 2 shows the results of a fine-grained image classifier. This flower recognition task requires correctly recognizing the flower's category.

<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/flowers.png?raw=true" width="400" ><br/>
Figure 2. Fine-grained image classification
</p>

A qualified model should recognize objects of different categories correctly. Its results should remain accurate under changes in viewpoint, illumination conditions, object distortion or occlusion (we refer to these conditions as image disturbance).
Figure 3 shows some images with various disturbances. A good model should classify these images correctly, as humans do.
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/variations.png?raw=true" width="550" ><br/>
Figure 3. Disturbed images <a href="#References">[22]</a>
</p>

## Exploration of Models

A large amount of research in image classification is built upon public datasets such as [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/), [ImageNet](http://image-net.org/) etc. Image classification algorithms are usually evaluated and compared on these datasets. PASCAL VOC is a computer vision competition started in 2005, and ImageNet is the dataset for the Large Scale Visual Recognition Challenge (ILSVRC), started in 2010. In this chapter, we introduce some image classification models from the submissions to these competitions.

Before 2012, traditional image classification was accomplished with the three steps described in the background section. A complete model construction usually involves the following stages: low-level feature extraction, feature encoding, spatial constraint or feature clustering, classifier design, and model ensemble.
1). **Low-level feature extraction**: This step extracts large amounts of local features according to fixed strides and scales. Popular local features include Scale-Invariant Feature Transform (SIFT) \[[1](#References)\], Histogram of Oriented Gradient (HOG) \[[2](#References)\], Local Binary Pattern (LBP) \[[3](#References)\], etc. A common practice is to employ multiple feature descriptors in order to avoid missing a lot of information.

2). **Feature encoding**: Low-level features contain a large amount of redundancy and noise. In order to improve the robustness of features, it is necessary to employ a feature transformation to encode low-level features. This is called feature encoding. Common feature encoding methods include vector quantization \[[4](#References)\], sparse coding \[[5](#References)\], locality-constrained linear coding \[[6](#References)\], Fisher vector encoding \[[7](#References)\], etc.

3). **Spatial constraint**: Spatial constraint or feature clustering is usually adopted after feature encoding for extracting the maximum or average of each dimension in the spatial domain. Pyramid feature matching--a popular feature clustering method--divides an image uniformly into patches and performs feature clustering in each patch.

4). **Classification**: After the above steps, an image can be described by a vector of fixed dimension. Then a classifier can be used to classify the image into categories. Common classifiers include Support Vector Machine (SVM), random forest etc. Kernel SVM is the most popular classifier and has achieved very good performance in traditional image classification tasks.
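To make the encoding and pooling stages above concrete, here is a toy sketch in plain Python. It assumes the local descriptors and the dictionary (codebook) are already available; the function names and the tiny 2-D data are hypothetical and chosen only for illustration. It uses hard vector quantization followed by averaging over the whole image, which is one simple choice among the encoders listed above.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda k: sq_dist(descriptor, codebook[k]))


def encode_image(descriptors, codebook):
    """Vector-quantize local descriptors and average them into a histogram."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_codeword(d, codebook)] += 1.0
    total = len(descriptors)
    return [h / total for h in hist] if total else hist


# Toy 3-word dictionary and four toy local features (2-D for readability).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
descriptors = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.7, 0.2]]

image_vector = encode_image(descriptors, codebook)
print(image_vector)
```

The resulting fixed-length vector is what a classifier such as an SVM would receive in stage 4.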
This classic method was widely used as the image classification algorithm in PASCAL VOC \[[18](#References)\]. [NEC Labs](http://www.nec-labs.com/) won the ILSVRC 2010 championship by employing SIFT and LBP features, two non-linear encoders and SVM \[[8](#References)\].

The CNN model AlexNet, proposed by Alex Krizhevsky et al. \[[9](#References)\], made a breakthrough in ILSVRC 2012. It dramatically outperformed classical methods and won the ILSVRC championship in 2012. This was also the first time that a deep learning method was adopted for large-scale image classification. Since AlexNet, a series of CNN models have been proposed that have steadily advanced the state of the art on ImageNet, as shown in Figure 4. With deeper and more sophisticated architectures, the Top-5 error rate keeps getting lower (to around 3.5%). The error rate of human raters on the same ImageNet dataset is 5.1%, which means that the image classification capability of a deep learning model has surpassed human raters.

<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/ilsvrc.png?raw=true" width="500" ><br/>
Figure 4. Top-5 error rates on ILSVRC image classification
</p>
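For readers unfamiliar with the metric in Figure 4, a Top-k error counts a prediction as wrong only when the true label is not among the k highest-scoring classes. The sketch below is a plain-Python illustration with made-up scores; `top_k_error` is a hypothetical helper name, and the toy example uses small k because it has only three classes.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k best scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / float(len(labels))


# Three samples, three classes (toy numbers).
scores = [[0.1, 0.6, 0.3],
          [0.5, 0.2, 0.3],
          [0.2, 0.3, 0.5]]
labels = [2, 1, 0]

print(top_k_error(scores, labels, k=1))
print(top_k_error(scores, labels, k=2))
```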
### CNN

Traditional CNNs consist of convolutional and fully-connected layers and use the softmax multi-category classifier with the cross-entropy loss function. Figure 5 shows a typical CNN. We first take a look at the common components of a CNN.

<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/lenet.png?raw=true"><br/>
Figure 5. A CNN example <a href="#References">[20]</a>
</p>
- convolutional layer: this layer uses the convolution operation to extract (low-level and high-level) features and to discover local correlation and spatial invariance.

- pooling layer: this layer down-samples feature maps by extracting the local max (max-pooling) or average (avg-pooling) value of each patch in the feature map. Down-sampling is a common operation in image processing and is used to filter out trivial high-frequency information.

- fully-connected layer: this layer fully connects neurons between two adjacent layers.

- non-linear activation: Convolutional and fully-connected layers are usually followed by some non-linear activation layers. Non-linearities enhance the expression capability of the network. Some examples of non-linear activation functions are Sigmoid, Tanh and ReLU. ReLU is the most commonly used activation function in CNN.

- Dropout \[[10](#References)\]: At each training stage, individual nodes are dropped out of the network with a certain probability. This improves the network's ability to generalize and avoids overfitting.
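As a small illustration of the pooling layer described in the list above, the following plain-Python sketch down-samples a single feature map with a 2x2 window and a stride of 2. The function name `pool2d` and the toy values are hypothetical; a real framework would operate on batched multi-channel tensors.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Down-sample a 2-D feature map with max- or avg-pooling."""
    rows, cols = len(feature_map), len(feature_map[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect the values covered by the current window.
            patch = [feature_map[r + i][c + j]
                     for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(patch))
            else:
                out_row.append(sum(patch) / float(len(patch)))
        out.append(out_row)
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 1, 7, 8]]

print(pool2d(fmap, mode="max"))  # 4x4 map reduced to 2x2
print(pool2d(fmap, mode="avg"))
```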
Parameter updates at each layer during training cause the distribution of each layer's inputs to change, which in turn requires hyper-parameters to be carefully tuned. In 2015, Sergey Ioffe and Christian Szegedy proposed the Batch Normalization (BN) algorithm \[[14](#References)\], which normalizes the features of each batch in a layer and keeps the distribution in each layer relatively stable. Not only does the BN algorithm act as a regularizer, it also reduces the need for meticulous hyper-parameter design. Experiments demonstrate that BN accelerates training convergence, and it has been widely used in later, deeper models.
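The normalization step can be sketched in a few lines of plain Python. This is a simplified training-time illustration under stated assumptions: `batch_norm` is a hypothetical helper, the scale `gamma` and shift `beta` are single scalars here (in practice they are learned per feature), and running statistics for inference are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize every feature column of `batch` over the batch dimension.

    `batch` is a list of samples, each a list of feature values.
    """
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        col = [sample[d] for sample in batch]
        mean = sum(col) / float(n)
        var = sum((v - mean) ** 2 for v in col) / float(n)
        for i in range(n):
            out[i][d] = gamma * (col[i] - mean) / math.sqrt(var + eps) + beta
    return out


# Two features on very different scales end up with comparable ranges.
batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
normalized = batch_norm(batch)
print(normalized)
```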
In the following sections, we will take a tour through the following network architectures: VGG, GoogLeNet and ResNet.

### VGG

The Oxford Visual Geometry Group (VGG) proposed the VGG network in ILSVRC 2014 \[[11](#References)\]. This model is deeper and wider than previous neural architectures. Its major part consists of five main groups of convolution operations. Adjacent convolution groups are connected via max-pooling layers to perform dimensionality reduction. Each group contains a series of 3x3 convolutional layers (i.e. kernels). The number of convolution kernels stays the same within a single group and increases from 64 in the first group to 512 in the last one. Two FC layers and a classifier layer follow. The total number of learnable layers could be 11, 13, 16, or 19 depending on the number of convolutional layers in each group. Figure 6 illustrates a 16-layer VGG. The architecture of VGG is relatively simple and has been adopted by many papers such as the first one that surpassed human-level performance on ImageNet \[[19](#References)\].

<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/vgg16.png?raw=true" width="750" ><br/>
Figure 6. VGG16 model for ImageNet
</p>
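The layer counts 11, 13, 16 and 19 follow from simple arithmetic: the convolutions in the five groups plus the three layers at the end. In the sketch below, the per-group convolution counts are taken from the commonly cited VGG configurations and should be read as an assumption, since the text above does not list them; the names are hypothetical.

```python
# Convolutional layers in each of the five groups (assumed configurations).
VGG_CONVS_PER_GROUP = {
    11: [1, 1, 2, 2, 2],
    13: [2, 2, 2, 2, 2],
    16: [2, 2, 3, 3, 3],
    19: [2, 2, 4, 4, 4],
}

# Two FC layers plus the classifier layer, as described in the text.
NUM_FC_LAYERS = 3


def learnable_layers(convs_per_group):
    """Total number of layers with weights: convolutions plus FC layers."""
    return sum(convs_per_group) + NUM_FC_LAYERS


for name, groups in sorted(VGG_CONVS_PER_GROUP.items()):
    print("VGG%d" % name, groups, learnable_layers(groups))
```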
### GoogLeNet

GoogLeNet \[[12](#References)\] won the ILSVRC championship in 2014. GoogLeNet borrowed some ideas from the Network in Network (NIN) model \[[13](#References)\] and is built on Inception blocks. Let us first familiarize ourselves with these concepts.

The two main characteristics of the NIN model are:

1) A single-layer convolutional network is replaced with a Multi-Layer Perceptron Convolution (MLPconv). MLPconv is a tiny multi-layer convolutional network. It enhances non-linearity by adding several 1x1 convolutional layers after linear ones.

2) In traditional CNNs, the last few layers are usually fully-connected with a large number of parameters. In contrast, the last convolutional layer of NIN outputs as many feature maps as there are categories, and NIN replaces the fully-connected layers with global average pooling, which yields a vector whose length equals the number of categories for classification. This replacement of fully-connected layers significantly reduces the number of parameters.
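Global average pooling, the second characteristic above, can be illustrated with a short plain-Python sketch. The helper name and the toy 2x2 maps are hypothetical; the point is only that each feature map collapses to one number, so the output length equals the number of categories without any learned weights.

```python
def global_average_pool(feature_maps):
    """Collapse each [rows][cols] feature map to its mean value."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / float(len(values)))
    return pooled


# Toy case with 3 categories: the last convolution emits 3 maps of 2x2.
maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[-1.0, 1.0], [-1.0, 1.0]],
]

class_scores = global_average_pool(maps)
print(class_scores)  # one score per category
```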
Figure 7 depicts two Inception blocks. Figure 7(a) is the simplest design. The output is a concatenation of features from three convolutional layers and one pooling layer. The disadvantage of this design is that the pooling layer does not change the number of filters and leads to an increase in the number of outputs. After several of such blocks, the number of outputs and parameters become larger and larger and lead to higher computation complexity. To overcome this drawback, the Inception block in Figure 7(b) employs three 1x1 convolutional layers. These reduce dimensions or the number of channels but improve the non-linearity of the network. Figure 7 depicts two Inception blocks. Figure 7(a) is the simplest design. The output is a concatenation of features from three convolutional layers and one pooling layer. The disadvantage of this design is that the pooling layer does not change the number of channels and leads to an increased channel number of features after concatenation. After several such blocks, the number of channels and parameters become larger and larger and lead to higher computation complexity. To overcome this drawback, the Inception block in Figure 7(b) employs three 1x1 convolutional layers to perform dimensionality reduction, which, to put it simply, is to reduce the number of channels and simultaneously improve the non-linearity of the network.
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/inception.png?raw=true" width="800" ><br/>
Figure 7. Inception block
</p>
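The channel growth described above can be checked with a little arithmetic. The sketch below compares the two designs in Figure 7 for a single block. The channel widths are assumptions chosen for illustration (they resemble the widths reported for an early GoogLeNet block), the helper names `conv_weights`, `naive_block` and `reduced_block` are hypothetical, and biases are ignored.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


def naive_block(c_in, c1, c3, c5):
    """Figure 7(a): 1x1, 3x3 and 5x5 convolutions plus a pooling branch."""
    out_channels = c1 + c3 + c5 + c_in  # pooling keeps all c_in channels
    weights = (conv_weights(c_in, c1, 1) + conv_weights(c_in, c3, 3) +
               conv_weights(c_in, c5, 5))
    return out_channels, weights


def reduced_block(c_in, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
    """Figure 7(b): 1x1 convolutions shrink channels before 3x3/5x5 and after pooling."""
    out_channels = c1 + c3 + c5 + pool_proj
    weights = (conv_weights(c_in, c1, 1) +
               conv_weights(c_in, c3_reduce, 1) + conv_weights(c3_reduce, c3, 3) +
               conv_weights(c_in, c5_reduce, 1) + conv_weights(c5_reduce, c5, 5) +
               conv_weights(c_in, pool_proj, 1))
    return out_channels, weights


# Assumed widths: 192 input channels, 64 / 128 / 32 output channels per branch.
naive_out, naive_w = naive_block(192, 64, 128, 32)
reduced_out, reduced_w = reduced_block(192, 64, 96, 128, 16, 32, 32)
print(naive_out, naive_w)
print(reduced_out, reduced_w)
```

With these assumed widths, the naive block outputs 416 channels, more than twice its 192 input channels, while the reduced block outputs 256 channels and needs fewer than half the weights.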
GoogLeNet comprises multiple stacked Inception blocks followed by an average pooling layer, as in NIN, instead of traditional fully-connected layers. The difference between GoogLeNet and NIN is that GoogLeNet adds a fully-connected layer after the average pooling layer to output a vector of category size. Besides these two characteristics, the features from the middle layers of GoogLeNet are also very discriminative. Therefore, GoogLeNet inserts two auxiliary classifiers into the model to strengthen the gradient and add regularization during back-propagation. The loss function of the whole network is the weighted sum of the losses of these three classifiers.

Figure 8 illustrates the architecture of GoogLeNet, which consists of 22 layers: it starts with three regular convolutional layers followed by three groups of sub-networks. The first group contains two Inception blocks, the second group has five, and the third group has two. It ends with an average pooling layer and a fully-connected layer.
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/googlenet.jpeg?raw=true" ><br/>
Figure 8. GoogLeNet <a href="#References">[12]</a>
</p>
The model above is the first version of GoogLeNet, also called GoogLeNet-v1. GoogLeNet-v2 \[[14](#References)\] introduced the BN layer; GoogLeNet-v3 \[[16](#References)\] further split some convolutional layers, which increases non-linearity and network depth; GoogLeNet-v4 \[[17](#References)\] is inspired by the design of ResNet, which is introduced in the next section. Accuracy improved consistently from v1 to v4. For brevity, we do not go into the architectures of v2 to v4.
### ResNet
Residual Network (ResNet) \[[15](#References)\] won the 2015 championship in three ImageNet competitions: image classification, object localization, and object detection. The main challenge in training deeper networks is that accuracy degrades as depth increases. The authors of ResNet proposed a residual learning approach to ease the training of deeper networks. Building on the design ideas of BN, small convolutional kernels, and fully convolutional networks, ResNet reformulates the layers as residual blocks. Each block contains two branches: one connects the input directly to the output, and the other performs two to three convolutions and computes the residual function with reference to the block's input. The output features of the two branches are then added up.

Figure 9 illustrates the building blocks of ResNet. On the left is the basic building block, which consists of two 3x3 convolutional layers with the same number of output channels. On the right is a bottleneck block. Its first 1x1 convolutional layer reduces the dimension (from 256 to 64 here), and the 1x1 convolutional layer after the 3x3 convolution restores it from 64 to 256. Thus, the number of input and output channels of the middle 3x3 convolutional layer is relatively small (64->64 in this example).
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/resnet_block.jpg?raw=true" width="400"><br/>
Figure 9. Residual block
</p>
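To see why the bottleneck keeps the block cheap, the sketch below counts weights for the 256 -> 64 -> 64 -> 256 bottleneck described above and, for comparison, for two plain 3x3 convolutions operating on 256 channels. It is only an illustration: `conv_weights` is a hypothetical helper, and biases and BN parameters are ignored.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


# Bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck = (conv_weights(256, 64, 1) + conv_weights(64, 64, 3) +
              conv_weights(64, 256, 1))

# Comparison: two 3x3 convolutions that stay at 256 channels
plain = 2 * conv_weights(256, 256, 3)

print(bottleneck, plain)
```

Under these assumptions the bottleneck uses roughly one seventeenth of the weights of the plain 256-channel pair.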
Figure 10 illustrates ResNets with 50, 101, and 152 layers, respectively. All three networks use bottleneck blocks and differ only in how many times the residual blocks are repeated. ResNet converges very fast and can be trained with hundreds or even thousands of layers.
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/resnet.png?raw=true"><br/>
Figure 10. ResNet model for ImageNet
</p>
## Get Data Ready
Common public benchmark datasets for image classification are [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html), [ImageNet](http://image-net.org/), [COCO](http://mscoco.org/), etc. Those used for fine-grained image classification are [CUB-200-2011](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), [Stanford Dog](http://vision.stanford.edu/aditya86/ImageNetDogs/), [Oxford-flowers](http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among these, the ImageNet dataset is the largest. Most research results are reported on ImageNet, as mentioned in the "Exploration of Models" section. Since 2010, the ImageNet dataset has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories. There are 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images with 50 images per category on average.

Since ImageNet is too large to be downloaded and trained efficiently, we use [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Figure 11 shows all the classes in CIFAR-10 as well as 10 images randomly sampled from each category.
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/cifar.png?raw=true" width="350"><br/>
Figure 11. CIFAR10 dataset <a href="#References">[21]</a>
</p>
The `paddle.dataset` package wraps several public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, and `wmt14`. There is no need to download and preprocess CIFAR-10 manually: the `paddle.dataset.cifar` module loads the CIFAR dataset automatically.

After running the command `python train.py`, training will start immediately. The following sections explain `train.py` in detail.
## Model Configuration
Let's start with importing the Paddle Fluid API package and the helper modules.
```python
# The __future__ import must come before any other import.
from __future__ import print_function

import paddle
import paddle.fluid as fluid
import numpy
import sys
```
Now we are going to walk you through the implementations of VGG and ResNet.
### VGG
Let's start with the VGG model. Since CIFAR10 has smaller images and fewer samples than ImageNet, we tailor the model to fit the CIFAR10 dataset. The convolution groups incorporate BN and dropout operations.

The input to the VGG core module is the data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail:
```python
def vgg_bn_drop(input):
...@@ -220,24 +216,26 @@ def vgg_bn_drop(input):
    return predict
```
1. First, it defines a convolution block, `conv_block`. The default convolution kernel is 3x3, and the default pooling size is 2x2 with stride 2. The number of groups determines how many consecutive convolutions each VGG block performs, and the dropout value specifies the probability used in the dropout operation. The function `img_conv_group`, predefined in `paddle.fluid.nets`, consists of a series of `Conv->BN->ReLU->Dropout` followed by a `Pooling` layer.
2. Five groups of convolutions follow. The first two groups perform two consecutive convolutions each, while the last three groups perform three. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for that layer.
3. The last two layers are fully-connected layers with 512 dimensions each.
4. The VGG network first extracts high-level features and then maps them to a vector whose size equals the number of categories. Finally, the softmax function computes the probability of the image belonging to each category.
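As a quick sanity check of the structure listed above, the following sketch counts the weighted layers and tracks the feature-map size through the five groups. It assumes 32x32 inputs, convolutions that preserve the spatial size, and one final fully-connected classifier after the two 512-dimensional layers; the variable names are illustrative and are not taken from `train.py`.

```python
convs_per_group = [2, 2, 3, 3, 3]   # five convolution groups, as listed above

size = 32                            # CIFAR10 images are 32x32
sizes = []
for _ in convs_per_group:
    size //= 2                       # each group ends with 2x2 pooling, stride 2
    sizes.append(size)

num_conv = sum(convs_per_group)
num_fc = 2 + 1                       # two 512-d layers plus the assumed classifier
total_layers = num_conv + num_fc

print(sizes)
print(total_layers)
```

Under these assumptions the count is 13 convolutions plus 3 fully-connected layers, which matches the 16 layers stated for `vgg_bn_drop`, and the feature map shrinks to 1x1 after the fifth group.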
### ResNet
Steps 1, 3, and 4 of the ResNet configuration are identical to their counterparts in VGG and are skipped here. We explain step 2, the core module of ResNet on CIFAR10, in detail.

To start with, here are some basic functions used in `resnet_cifar10`; how they are connected into a network is illustrated afterwards:

- `conv_bn_layer` : a convolutional layer followed by BN.
- `shortcut` : the shortcut connection in a residual block. There are two kinds of shortcuts: a 1x1 convolution is used to change the dimensionality when the input and output of the residual block have different numbers of channels; a direct connection is used otherwise.
- `basicblock` : a basic residual module as shown on the left of Figure 9. It consists of two sequential 3x3 convolutions and one "shortcut" branch.
- `layer_warp` : a group of residual modules consisting of several stacked blocks. In each group, the stride (sliding window step) of the first residual block may differ from that of the rest, in order to reduce the size of the feature maps along the horizontal and vertical directions.
```python
def conv_bn_layer(input,
...@@ -279,13 +277,14 @@ def layer_warp(block_func, input, ch_in, ch_out, count, stride):
    return tmp
```
The following are the components of `resnet_cifar10`:

1. The lowest level is `conv_bn_layer`, i.e. the convolutional layer with BN.
2. The next level is composed of three groups of residual blocks, namely three `layer_warp` calls, each of which uses the left residual block in Figure 9.
3. The last level is the average pooling layer.

Note: Apart from the first convolutional layer and the last fully-connected layer, the total number of layers with parameters in the three `layer_warp` groups must be divisible by 6. In other words, the depth of `resnet_cifar10` should satisfy `(depth - 2) % 6 == 0`.
```python
def resnet_cifar10(ipt, depth=32):
...@@ -303,9 +302,10 @@ def resnet_cifar10(ipt, depth=32):
    return predict
```
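The depth constraint follows from the structure: three groups, each made of basic blocks with two convolutions, plus the first convolution and the final fully-connected layer. A minimal sketch of that arithmetic is given below; `blocks_per_group` is a hypothetical helper written for this illustration and is not part of `train.py`.

```python
def blocks_per_group(depth):
    """Return how many basic blocks each of the three groups needs."""
    if (depth - 2) % 6 != 0:
        raise ValueError("depth must satisfy (depth - 2) % 6 == 0")
    # 3 groups x n blocks x 2 convolutions + 2 extra layers == depth
    return (depth - 2) // 6


n = blocks_per_group(32)     # 32 is the default depth of resnet_cifar10
print(n, 3 * n * 2 + 2)
```

For the default `depth=32`, each group holds five basic blocks; a depth such as 30 is rejected because it cannot be split evenly across the three groups.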
## Inference Program Configuration
The input to the network is defined with `fluid.layers.data`, which corresponds to image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images with three channels. Therefore, the size of the input data is 3072 (3x32x32).
```python
def inference_program():
...@@ -318,14 +318,14 @@ def inference_program():
    return predict
```
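To make the 3072 = 3x32x32 layout concrete, the sketch below maps a (channel, row, column) position to its offset in a flattened image, assuming the channel-first (CHW) order that this tutorial uses when preparing images for inference. `flat_index` is a hypothetical helper used only for illustration.

```python
CHANNELS, HEIGHT, WIDTH = 3, 32, 32


def flat_index(c, h, w):
    """Offset of pixel (c, h, w) in a flattened CHW image."""
    return (c * HEIGHT + h) * WIDTH + w


input_size = CHANNELS * HEIGHT * WIDTH
print(input_size)
print(flat_index(0, 0, 0), flat_index(1, 0, 0), flat_index(2, 31, 31))
```

Each channel occupies a contiguous run of 1024 values, so the second channel starts at offset 1024 and the last pixel of the third channel sits at offset 3071.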
## Training Program Configuration
Then we need to set up the `train_program`. It first takes the prediction from `inference_program`. During training, it calculates `avg_cost` from the prediction.

In the context of supervised learning, the labels of the training images are defined with `fluid.layers.data` as well. During training, the multi-class cross-entropy is used as the loss function and becomes the output of the network. During testing, the outputs are the probabilities calculated by the classifier.

**NOTE:** A training program should return an array, and the first returned value has to be `avg_cost`. The trainer always uses it to calculate the gradients.
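For intuition, the multi-class cross-entropy and the accuracy mentioned above can be written in a few lines of plain Python. This is only an illustrative sketch on made-up probabilities, not the Fluid implementation; the function names are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Loss for one sample: negative log-probability of the true class."""
    return -math.log(probs[label])


def batch_metrics(batch_probs, labels):
    """Average cross-entropy and accuracy over a batch."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    correct = [p.index(max(p)) == y for p, y in zip(batch_probs, labels)]
    return sum(losses) / len(losses), sum(correct) / float(len(correct))


# Two made-up softmax outputs over three classes, with true labels 0 and 1.
avg_cost, acc = batch_metrics([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]], [0, 1])
print(avg_cost, acc)
```

The first sample is classified correctly and the second is not, so the accuracy is 0.5, while the loss penalizes the second sample more heavily because its true class received a probability of only 0.3.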
```python
def train_program():
...@@ -340,7 +340,7 @@ def train_program():
```
## Optimizer Function Configuration
In the following `Adam` optimizer, `learning_rate` specifies the learning rate of the optimization procedure, which influences the convergence speed.
```python
def optimizer_program():
...@@ -349,23 +349,10 @@ def optimizer_program():
```
## Model Training
### Data Feeders Configuration
`cifar.train10()` yields one sample at a time. After shuffling and batching, the samples are fed to training as mini-batches.
```python
# Each batch will yield 128 images
...@@ -381,54 +368,109 @@ test_reader = paddle.batch(
    paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
```
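The readers above are built by wrapping a sample generator with shuffling and batching. The sketch below imitates that pattern in plain Python to show the data flow; `shuffle_reader`, `batch_reader` and `toy_samples` are hypothetical stand-ins written for this example and are not PaddlePaddle's own implementations.

```python
import random


def shuffle_reader(reader, buf_size, seed=0):
    """Shuffle samples within a buffer of at most buf_size items."""
    def wrapped():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        rng.shuffle(buf)
        for item in buf:
            yield item
    return wrapped


def batch_reader(reader, batch_size):
    """Group consecutive samples into lists of batch_size (the last may be smaller)."""
    def wrapped():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    return wrapped


def toy_samples():
    # Stand-in for a dataset reader: yields (features, label) pairs.
    for i in range(10):
        yield ([float(i)], i % 2)


train_batches = batch_reader(shuffle_reader(toy_samples, buf_size=4), batch_size=4)
batch_sizes = [len(b) for b in train_batches()]
print(batch_sizes)
```

Because each wrapper takes a reader and returns a new reader, the two steps compose freely, and the final batch simply carries whatever samples are left over.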
### Implementation of the trainer program
We need to set up a `main_program` for the training process and, similarly, a `test_program` for testing. It is also necessary to define the `place` on which training runs and to use the optimizer `optimizer_program` defined previously.
```python
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

feed_order = ['pixel', 'label']

main_program = fluid.default_main_program()
star_program = fluid.default_startup_program()

avg_cost, acc = train_program()

# Test program
test_program = main_program.clone(for_test=True)

optimizer = optimizer_program()
optimizer.minimize(avg_cost)

exe = fluid.Executor(place)

EPOCH_NUM = 2

# For training test cost
def train_test(program, reader):
    count = 0
    feed_var_list = [
        program.global_block().var(var_name) for var_name in feed_order
    ]
    feeder_test = fluid.DataFeeder(
        feed_list=feed_var_list, place=place)
    test_exe = fluid.Executor(place)
    # One running sum per fetched metric (cost and accuracy).
    accumulated = len([avg_cost, acc]) * [0]
    for tid, test_data in enumerate(reader()):
        avg_cost_np = test_exe.run(program=program,
                                   feed=feeder_test.feed(test_data),
                                   fetch_list=[avg_cost, acc])
        accumulated = [x[0] + x[1][0] for x in zip(accumulated, avg_cost_np)]
        count += 1
    # Average the per-batch values over the number of batches.
    return [x / count for x in accumulated]
```
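Note that `train_test` averages the per-batch values, giving every batch the same weight. With 10000 test images and batches of 128, the last batch holds only 16 samples, so this average can differ slightly from the mean over all samples. The sketch below illustrates the difference with made-up losses; the helper names are hypothetical.

```python
def mean_of_batch_means(batch_means):
    """What train_test computes: every batch counts equally."""
    return sum(batch_means) / float(len(batch_means))


def sample_weighted_mean(batch_means, batch_sizes):
    """Mean over all samples: each batch is weighted by its size."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / float(sum(batch_sizes))


# Made-up per-batch losses; the last batch holds only 16 samples.
losses = [1.0, 1.0, 2.0]
sizes = [128, 128, 16]
unweighted = mean_of_batch_means(losses)
weighted = sample_weighted_mean(losses, sizes)
print(unweighted, weighted)
```

For monitoring training progress the simpler unweighted average is usually adequate; weighting by batch size matters mainly when exact test metrics are reported.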
### The main loop of training and the outputs along the process
In the main training loop below, we use the outputs to observe the training process and to run tests.
You can also use `Ploter` to draw the cost curves from the collected data:
```python
params_dirname = "image_classification_resnet.inference.model"

from paddle.utils.plot import Ploter

train_prompt = "Train cost"
test_prompt = "Test cost"
plot_cost = Ploter(test_prompt, train_prompt)

# main train loop.
def train_loop():
    feed_var_list_loop = [
        main_program.global_block().var(var_name) for var_name in feed_order
    ]
    feeder = fluid.DataFeeder(
        feed_list=feed_var_list_loop, place=place)
    exe.run(star_program)

    step = 0
    for pass_id in range(EPOCH_NUM):
        for step_id, data_train in enumerate(train_reader()):
            avg_loss_value = exe.run(main_program,
                                     feed=feeder.feed(data_train),
                                     fetch_list=[avg_cost, acc])
            if step % 1 == 0:
                plot_cost.append(train_prompt, step, avg_loss_value[0])
                plot_cost.plot()
            step += 1

        # Evaluate on the test set at the end of each pass.
        avg_cost_test, accuracy_test = train_test(test_program,
                                                  reader=test_reader)
        plot_cost.append(test_prompt, step, avg_cost_test)

        # save parameters
        # NOTE: `predict` must refer to the prediction variable of the network;
        # it is not defined in the snippets shown in this section.
        if params_dirname is not None:
            fluid.io.save_inference_model(params_dirname, ["pixel"],
                                          [predict], exe)
```
### Training
Training is started by calling the `train_loop` function. Here we run only 2 epochs; in practice, more than a hundred epochs are usually needed.

**Note:** On CPU, each epoch takes approximately 15 to 20 minutes, so this part may take a while. Feel free to modify the code to run on GPU to speed up training.
```python
train_loop()
```
An example training log for one epoch is shown below. After 1 pass, the average accuracy is 0.59 on the training set and 0.6 on the test set.
```text
Pass 0, Batch 0, Cost 3.869598, Acc 0.164062
...@@ -442,32 +484,31 @@ Pass 300, Batch 0, Cost 1.223424, Acc 0.593750
Test with Pass 0, Loss 1.1, Acc 0.6
```
Figure 13 shows the classification error rate during training. After 200 passes the model has almost converged, and the final classification error rate on the test set is 8.54%.
<p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/plot.png?raw=true" width="400" ><br/>
Figure 13. Classification error rate of VGG model on the CIFAR10 data set
</p>
## Model Application
You can use a trained model to classify your images. The following program shows how to load a trained network and its optimized parameters for inference.
### Generate Input Data to Infer
`dog.png` is a picture of a puppy. We convert it to a `numpy` array to match the `feeder` format.
```python
# Prepare testing data.
from PIL import Image
import os

def load_image(file):
    im = Image.open(file)
    # Image.ANTIALIAS was removed in Pillow 10; use Image.LANCZOS there.
    im = im.resize((32, 32), Image.ANTIALIAS)
    im = numpy.array(im).astype(numpy.float32)
    # The loaded array is stored in H(height), W(width),
    # C(channel) order. PaddlePaddle requires
    # the CHW order, so transpose them.
...@@ -484,26 +525,48 @@ img = load_image(cur_dir + '/image/dog.png')
```
### Inferencer Configuration and Inference
Similar to the training process, an inferencer needs to build the corresponding process. We load the trained network and parameters from `params_dirname`.
We can simply plug in the inference program defined previously.
Now let's run the inference.
```python
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
inference_scope = fluid.core.Scope()

with fluid.scope_guard(inference_scope):

    # Load the network structure and parameters saved during training.
    [inference_program, feed_target_names,
     fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)

    # Construct feed as a dictionary of {feed_target_name: feed_target_data}
    # and results will contain a list of data corresponding to fetch_targets.
    results = exe.run(inference_program,
                      feed={feed_target_names[0]: img},
                      fetch_list=fetch_targets)

    # infer label
    label_list = [
        "airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse",
        "ship", "truck"
    ]

    print("infer results: %s" % label_list[numpy.argmax(results[0])])
```
## Summary
Traditional image classification methods consist of multiple stages and rely on relatively complex frameworks. In contrast, an end-to-end CNN model can be trained in one step, and classification accuracy is greatly improved. In this article, we first introduced three classic models: VGG, GoogLeNet, and ResNet. We then showed how to use PaddlePaddle to configure and train CNN models on the CIFAR10 dataset, in particular VGG and ResNet. Finally, we showed how to use PaddlePaddle's API to classify images and extract features. For other datasets such as ImageNet, the configuration and training process is the same, so you are welcome to try it on your own.
<a name="References"></a>
## References

[1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004.

...@@ -540,7 +603,7 @@ Traditional image classification methods involve multiple stages of processing,

[17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016).

[18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective](http://link.springer.com/article/10.1007/s11263-014-0733-5). International Journal of Computer Vision, 111(1), 98-136, 2015.

[19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015.

...@@ -550,5 +613,7 @@ Traditional image classification methods involve multiple stages of processing,

[22] http://cs231n.github.io/classification/

<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
### This file is only used for continuous evaluation test!
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi
from kpi import AccKpi
train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True, desc='train cost')
train_acc_kpi = AccKpi('train_acc', 0.02, 0, actived=True, desc='train acc')
test_cost_kpi = CostKpi('test_cost', 0.02, 0, actived=True, desc='test cost')
test_acc_kpi = AccKpi('test_acc', 0.02, 0, actived=True, desc='test acc')
tracking_kpis = [train_cost_kpi, train_acc_kpi, test_cost_kpi, test_acc_kpi]
def parse_log(log):
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        print(fs)
        if len(fs) == 3 and fs[0] == 'kpis':
            kpi_name = fs[1]
            kpi_value = float(fs[2])
            yield kpi_name, kpi_value


def log_to_ce(log):
    kpi_tracker = {}
    for kpi in tracking_kpis:
        kpi_tracker[kpi.name] = kpi
    for (kpi_name, kpi_value) in parse_log(log):
        print(kpi_name, kpi_value)
        kpi_tracker[kpi_name].add_record(kpi_value)
        kpi_tracker[kpi_name].persist()


if __name__ == '__main__':
    log = sys.stdin.read()
    log_to_ce(log)
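The parser above expects each KPI record on stdin as a tab-separated line of the form `kpis<TAB>name<TAB>value` and ignores everything else. A minimal, self-contained sketch of that parsing step follows; it leaves out the CE-specific `kpi` module, the function name `parse_kpi_lines` is hypothetical, and the sample log is made up for illustration.

```python
def parse_kpi_lines(log):
    # Yield (name, value) for every tab-separated line whose first field is 'kpis'.
    for line in log.split('\n'):
        fields = line.strip().split('\t')
        if len(fields) == 3 and fields[0] == 'kpis':
            yield fields[1], float(fields[2])


# Hypothetical training output; only the two 'kpis' lines are picked up.
sample_log = (
    "Pass 0, Batch 0, Cost 2.3\n"
    "kpis\ttrain_cost\t1.25\n"
    "kpis\ttrain_acc\t0.5\n"
)
records = list(parse_kpi_lines(sample_log))
print(records)
```

Lines that do not have exactly three tab-separated fields starting with `kpis` are skipped, so ordinary training output can be mixed into the same stream.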
...@@ -45,6 +45,18 @@ ...@@ -45,6 +45,18 @@
The source code for this tutorial is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification). If you are new to the book, please see the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
### Notes: ###
1. Hardware requirements:
This tutorial can run on either CPU or GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, note that the default image ships a GPU environment with CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in this document and the script:
Note that, to make this tutorial easier to read and use, we split and adjusted the code of train.py before placing it here. The code in this tutorial produces the same results as train.py; you can run [train.py](https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/train.py) directly to verify.
## Background
Compared with text, images convey information that is more vivid, easier to understand, and more artistic, and they are an important medium through which people pass on and exchange information. In this tutorial, we focus on an important problem in image recognition: image classification.
...@@ -94,7 +106,7 @@ ...@@ -94,7 +106,7 @@
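2). **Feature encoding**: Low-level features contain a large amount of redundancy and noise. To make the feature representation more robust, a feature transformation is used to encode the low-level features; this is called feature encoding. Common feature encoding methods include vector quantization \[[4](#参考文献)\], sparse coding \[[5](#参考文献)\], locality-constrained linear coding \[[6](#参考文献)\], and Fisher vector encoding \[[7](#参考文献)\].
3). **Spatial feature constraint**: Feature encoding is usually followed by a spatial feature constraint, also called **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which yields a representation with a degree of feature invariance. Pyramid feature matching is a commonly used feature pooling (特征聚会) method; it divides the image uniformly into blocks and pools features within each block. 3). **Spatial feature constraint**: Feature encoding is usually followed by a spatial feature constraint, also called **feature pooling**. Feature pooling takes the maximum or the average of each feature dimension within a spatial region, which yields a representation with a degree of feature invariance. Pyramid feature matching is a commonly used feature pooling (特征汇聚) method; it divides the image uniformly into blocks and pools features within each block.
4). **Classification with a classifier**: After the preceding steps, an image can be described by a fixed-dimension vector, which is then fed to a classifier. Commonly used classifiers include SVM (Support Vector Machine) and random forests. SVMs with kernel methods are the most widely used and perform well on traditional image classification tasks.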
...@@ -124,7 +136,7 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得 ...@@ -124,7 +136,7 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
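In addition, because the parameters of every layer keep being updated during training, the input distribution of the following layer changes, so training requires carefully designed hyperparameters. In 2015, Sergey Ioffe and Christian Szegedy proposed the Batch Normalization (BN) algorithm \[[14](#参考文献)\], which normalizes the features of each layer of the network for every batch so that the distribution at each layer stays relatively stable. BN not only acts as a regularizer to some extent but also reduces the need to tune some hyperparameters. Experiments show that BN accelerates model convergence, and it has been widely used in later, deeper models.
Next, we mainly introduce the VGG, GoogleNet, and ResNet network architectures. Next, we mainly introduce the VGG, GoogLeNet, and ResNet network architectures.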
### VGG ### VGG
...@@ -135,9 +147,9 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得 ...@@ -135,9 +147,9 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
Figure 6. VGG16 model for ImageNet
</p> </p>
### GoogleNet ### GoogLeNet
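GoogleNet \[[12](#参考文献)\] won the 2014 ILSVRC. Before introducing the model, we first look at the NIN (Network in Network) model \[[13](#参考文献)\] and the Inception module, because GoogleNet is built from multiple groups of Inception modules and its design borrows some ideas from NIN. GoogLeNet \[[12](#参考文献)\] won the 2014 ILSVRC. Before introducing the model, we first look at the NIN (Network in Network) model \[[13](#参考文献)\] and the Inception module, because GoogLeNet is built from multiple groups of Inception modules and its design borrows some ideas from NIN.
The NIN model has two main characteristics: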
...@@ -152,17 +164,17 @@ Inception模块如下图7所示,图(a)是最简单的设计,输出是3个卷 ...@@ -152,17 +164,17 @@ Inception模块如下图7所示,图(a)是最简单的设计,输出是3个卷
Figure 7. Inception module
</p> </p>
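GoogleNet is built by stacking multiple groups of Inception modules. In addition, the end of the network does not use the traditional stack of fully-connected layers; like NIN, it uses an average pooling layer. Unlike NIN, however, GoogleNet adds a fully-connected layer after the pooling layer to map to the number of classes. Besides these two characteristics, because features from the intermediate layers are also highly discriminative, GoogleNet adds two auxiliary classifiers at intermediate layers, which strengthen the gradient during backpropagation and add regularization. The loss function of the whole network is the weighted sum of the losses of these three classifiers. GoogLeNet is built by stacking multiple groups of Inception modules. In addition, the end of the network does not use the traditional stack of fully-connected layers; like NIN, it uses an average pooling layer. Unlike NIN, however, GoogLeNet adds a fully-connected layer after the pooling layer to map to the number of classes. Besides these two characteristics, because features from the intermediate layers are also highly discriminative, GoogLeNet adds two auxiliary classifiers at intermediate layers, which strengthen the gradient during backpropagation and add regularization. The loss function of the whole network is the weighted sum of the losses of these three classifiers.
The overall structure of GoogleNet is shown in Figure 8. It has 22 layers in total: it starts with 3 ordinary convolutional layers, followed by three groups of sub-networks, where the first group contains 2 Inception modules, the second contains 5, and the third contains 2; these are followed by an average pooling layer and a fully-connected layer. The overall structure of GoogLeNet is shown in Figure 8. It has 22 layers in total: it starts with 3 ordinary convolutional layers, followed by three groups of sub-networks, where the first group contains 2 Inception modules, the second contains 5, and the third contains 2; these are followed by an average pooling layer and a fully-connected layer.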
<p align="center"> <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/googlenet.jpeg?raw=true" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/googlenet.jpeg?raw=true" ><br/>
Figure 8. GoogleNet[12] Figure 8. GoogLeNet[12]
</p> </p>
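The model described above is the first version of GoogleNet (called GoogleNet-v1). GoogleNet-v2 \[[14](#参考文献)\] introduces BN layers; GoogleNet-v3 \[[16](#参考文献)\] factorizes some convolutional layers, further increasing the network's non-linearity and depth; GoogleNet-v4 \[[17](#参考文献)\] adopts the ResNet design ideas described below. Each version from v1 to v4 improves accuracy. For reasons of space, the structures of v2 to v4 are not described in detail here. The model described above is the first version of GoogLeNet (called GoogLeNet-v1). GoogLeNet-v2 \[[14](#参考文献)\] introduces BN layers; GoogLeNet-v3 \[[16](#参考文献)\] factorizes some convolutional layers, further increasing the network's non-linearity and depth; GoogLeNet-v4 \[[17](#参考文献)\] adopts the ResNet design ideas described below. Each version from v1 to v4 improves accuracy. For reasons of space, the structures of v2 to v4 are not described in detail here.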
### ResNet ### ResNet
...@@ -206,11 +218,12 @@ Paddle API提供了自动加载cifar数据集模块 `paddle.dataset.cifar`。 ...@@ -206,11 +218,12 @@ Paddle API提供了自动加载cifar数据集模块 `paddle.dataset.cifar`。
Let's start by importing the Paddle Fluid API and helper modules.
```python ```python
from __future__ import print_function
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import numpy import numpy
import sys import sys
from __future__ import print_function
``` ```
...@@ -251,7 +264,7 @@ def vgg_bn_drop(input): ...@@ -251,7 +264,7 @@ def vgg_bn_drop(input):
``` ```
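1. First, a group of convolutional layers, conv_block, is defined. The convolution kernel size is 3x3, the pooling window size is 2x2 with a stride of 2, groups determines how many consecutive convolutions each VGG block contains, and dropouts specifies the Dropout probabilities. The `img_conv_group` used here is a module predefined in `paddle.networks`, composed of several Conv->BN->ReLu->Dropout groups and one Pooling group. 1. First, a group of convolutional layers, conv_block, is defined. The convolution kernel size is 3x3, the pooling window size is 2x2 with a stride of 2, groups determines how many consecutive convolutions each VGG block contains, and dropouts specifies the Dropout probabilities. The `img_conv_group` used here is a module predefined in `paddle.nets`, composed of several Conv->BN->ReLu->Dropout groups and one Pooling group.
2. Five groups of convolutions, that is, 5 conv_blocks. The first and second groups use two consecutive convolutions; the third, fourth, and fifth groups use three. In each group, the Dropout probability after the last convolution is 0, meaning Dropout is not applied.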
...@@ -318,7 +331,7 @@ def layer_warp(block_func, input, ch_in, ch_out, count, stride): ...@@ -318,7 +331,7 @@ def layer_warp(block_func, input, ch_in, ch_out, count, stride):
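3. Finally, average pooling is applied to the network and that layer is returned.
Note: apart from the first convolutional layer and the last fully-connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6; that is, the depth of `resnet_cifar10` must satisfy $(depth - 2) % 6 = 0$ Note: apart from the first convolutional layer and the last fully-connected layer, the total number of parameterized layers in the three `layer_warp` groups must be divisible by 6; that is, the depth of `resnet_cifar10` must satisfy (depth-2)%6=0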
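As a quick check of this constraint, the sketch below derives the number of residual blocks per group from the depth. It assumes the common CIFAR-10 ResNet layout of three groups, each with n basic blocks of two convolutions, so that depth = 6n + 2; the helper name `blocks_per_group` is hypothetical and is not part of train.py.

```python
def blocks_per_group(depth):
    # Hypothetical helper: 3 groups x n blocks x 2 conv layers, plus the first
    # conv layer and the final fully-connected layer, gives depth = 6n + 2.
    if (depth - 2) % 6 != 0:
        raise ValueError("depth must satisfy (depth - 2) % 6 == 0")
    return (depth - 2) // 6


for depth in (20, 32, 56):
    print(depth, blocks_per_group(depth))
```

Under this assumption, the default `depth=32` corresponds to 5 blocks per group, and a value such as 30 is rejected.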
```python ```python
def resnet_cifar10(ipt, depth=32): def resnet_cifar10(ipt, depth=32):
...@@ -336,7 +349,7 @@ def resnet_cifar10(ipt, depth=32): ...@@ -336,7 +349,7 @@ def resnet_cifar10(ipt, depth=32):
return predict return predict
``` ```
## Infererence Program Configuration ## Inference Program Configuration
The network input is defined as `data_layer` (the data layer), which in image classification holds the image pixel information. CIFAR10 images are 32x32 RGB color images with 3 channels, so the input data size is 3072 (3x32x32).
...@@ -400,7 +413,7 @@ test_reader = paddle.batch( ...@@ -400,7 +413,7 @@ test_reader = paddle.batch(
``` ```
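### Implementing the Trainer Program
We need to specify a main_program for training and, likewise, configure a test_program for testing. We define the training `place` and use the previously defined optimizer `optimizer_func`. We need to specify a main_program for training and, likewise, configure a test_program for testing. We define the training `place` and use the previously defined optimizer `optimizer_program`.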
```python ```python
...@@ -567,11 +580,7 @@ with fluid.scope_guard(inference_scope): ...@@ -567,11 +580,7 @@ with fluid.scope_guard(inference_scope):
[inference_program, feed_target_names, [inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(params_dirname, exe) fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)
# The input's dimension of conv should be 4-D or 5-D.
# Use inference_transpiler to speedup
inference_transpiler_program = inference_program.clone()
t = fluid.transpiler.InferenceTranspiler()
t.transpile(inference_transpiler_program, place)
# Construct feed as a dictionary of {feed_target_name: feed_target_data} # Construct feed as a dictionary of {feed_target_name: feed_target_data}
# and results will contain a list of data corresponding to fetch_targets. # and results will contain a list of data corresponding to fetch_targets.
...@@ -579,14 +588,6 @@ with fluid.scope_guard(inference_scope): ...@@ -579,14 +588,6 @@ with fluid.scope_guard(inference_scope):
feed={feed_target_names[0]: img}, feed={feed_target_names[0]: img},
fetch_list=fetch_targets) fetch_list=fetch_targets)
transpiler_results = exe.run(inference_transpiler_program,
feed={feed_target_names[0]: img},
fetch_list=fetch_targets)
assert len(results[0]) == len(transpiler_results[0])
for i in range(len(results[0])):
numpy.testing.assert_almost_equal(
results[0][i], transpiler_results[0][i], decimal=5)
# infer label # infer label
label_list = [ label_list = [
...@@ -599,7 +600,7 @@ with fluid.scope_guard(inference_scope): ...@@ -599,7 +600,7 @@ with fluid.scope_guard(inference_scope):
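## Summary
Traditional image classification methods consist of multiple stages and their framework is fairly complex, whereas an end-to-end CNN model does the job in one step and greatly improves classification accuracy. In this tutorial, we first introduced three classic models: VGG, GoogleNet, and ResNet. Then, based on the CIFAR10 dataset, we showed how to configure and train CNN models with PaddlePaddle, in particular VGG and ResNet. Finally, we showed how to use the PaddlePaddle API to predict images and extract features. For other datasets such as ImageNet, the configuration and training process is the same, and you are welcome to experiment on your own. Traditional image classification methods consist of multiple stages and their framework is fairly complex, whereas an end-to-end CNN model does the job in one step and greatly improves classification accuracy. In this tutorial, we first introduced three classic models: VGG, GoogLeNet, and ResNet. Then, based on the CIFAR10 dataset, we showed how to configure and train CNN models with PaddlePaddle, in particular VGG and ResNet. Finally, we showed how to use the PaddlePaddle API to predict images and extract features. For other datasets such as ImageNet, the configuration and training process is the same, and you are welcome to experiment on your own.
<a name="参考文献"></a>
## References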
...@@ -638,7 +639,7 @@ with fluid.scope_guard(inference_scope): ...@@ -638,7 +639,7 @@ with fluid.scope_guard(inference_scope):
[17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016). [17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016).
[18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective]((http://link.springer.com/article/10.1007/s11263-014-0733-5)). International Journal of Computer Vision, 111(1), 98-136, 2015. [18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective](http://link.springer.com/article/10.1007/s11263-014-0733-5). International Journal of Computer Vision, 111(1), 98-136, 2015.
[19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015. [19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015.
...@@ -649,7 +650,7 @@ with fluid.scope_guard(inference_scope): ...@@ -649,7 +650,7 @@ with fluid.scope_guard(inference_scope):
[22] http://cs231n.github.io/classification/ [22] http://cs231n.github.io/classification/
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -40,198 +40,194 @@ ...@@ -40,198 +40,194 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
Image Classification Image Classification
======================= =======================
The source code for this chapter is at [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification). For instructions on getting started with this book,see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code for this chapter is in [book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification). If you are new to the book, see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).
## Background ## Background
Compared to words, images provide much more vivid and easier to understand information with an artistic sense. They are an important source for people to express and exchange ideas. In this chapter, we focus on one of the essential problems in image recognition -- image classification. Compared with words, images provide information in a much more vivid, artistic, easy-to-understand manner. They are an important source for people to express and exchange ideas. In this chapter, we focus on one of the essential problems in image recognition -- image classification.
Image classification is the task of distinguishing images in different categories based on their semantic meaning. It is a core problem in computer vision and is also the foundation of other higher level computer vision tasks such as object detection, image segmentation, object tracking, action recognition, etc. Image classification has applications in many areas such as face recognition, intelligent video analysis in security systems, traffic scene recognition in transportation systems, content-based image retrieval and automatic photo indexing in web services, image classification in medicine, etc. Image classification is the task of distinguishing images in different categories based on their semantic meaning. It is a core problem in computer vision and is also the foundation of other higher level computer vision tasks such as object detection, image segmentation, object tracking and action recognition. Image classification has applications in many areas such as face recognition, intelligent video analysis in security systems, traffic scene recognition in transportation systems, content-based image retrieval and automatic photo indexing in Internet services, and image classification in the medical field.
To classify an image we firstly encode the entire image using handcrafted or learned features and then determine the category using a classifier. Thus, feature extraction plays an important role in image classification. Prior to deep learning the BoW(Bag of Words) model was the most widely used method for classifying an image as well as an object. The BoW technique was introduced in Natural Language Processing where a training sentence is represented as a bag of words. In the context of image classification, the BoW model requires constructing a dictionary. The simplest BoW framework can be designed with three steps: **feature extraction**, **feature encoding** and **classifier design**. To classify an image we firstly encode the entire image using manual or learned features and then determine the category using a classifier. Thus, feature extraction plays an important role in image classification. Prior to deep learning the BoW(Bag of Words) model was the most widely used method for classifying an image. The BoW technique was introduced in Natural Language Processing where a training sentence is represented as a bag of words. In the context of image classification, the BoW model requires constructing a dictionary. The simplest BoW framework can be designed in three steps: **feature extraction**, **feature encoding** and **classifier design**.
Using Deep learning, image classification can be framed as a supervised or unsupervised learning problem that uses hierarchical features automatically without any need for manually crafted features from the image. In recent years, Convolutional Neural Networks (CNNs) have made significant progress in image classification. CNNs use raw image pixels as input, extract low-level and high-level abstract features through convolution operations, and directly output the classification results from the model. This style of end-to-end learning has lead to not only increased performance but also wider adoption various applications. With deep learning, image classification can be framed as a supervised or unsupervised learning problem that uses hierarchical features automatically without any need for manually crafted features from the image. In recent years, Convolutional Neural Networks (CNNs) have made significant progress in image classification. CNNs use raw image pixels as input, extract low-level and high-level abstract features through convolution operations, and directly output the classification results from the model. This style of end-to-end learning has led to not only higher performance but also wider adoption in various applications.
In this chapter, we introduce deep-learning-based image classification methods and explain how to train a CNN model using PaddlePaddle. In this chapter, we introduce deep-learning-based image classification methods and explain how to train a CNN model using PaddlePaddle.
## Demonstration ## Result Demo
An image can be classified by a general as well as fine-grained image classifier. Image Classification can be divided into general image classification and fine-grained image classification.
Figure 1 shows the results of a general image classifier -- the trained model can correctly recognize the main objects in the images. Figure 1 shows the results of general image classification -- the trained model can correctly recognize the main objects in the images.
<p align="center"> <p align="center">
<img src="image/dog_cat.png " width="350" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/dog_cat.png?raw=true" width="350" ><br/>
Figure 1. General image classification Figure 1. General image classification
</p> </p>
Figure 2 shows the results of a fine-grained image classifier. This task of flower recognition requires correctly recognizing of the flower's categories. Figure 2 shows the results of a fine-grained image classifier. The flower recognition task requires correctly recognizing each flower's species.
<p align="center"> <p align="center">
<img src="image/flowers.png" width="400" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/flowers.png?raw=true" width="400" ><br/>
Figure 2. Fine-grained image classification Figure 2. Fine-grained image classification
</p> </p>
A good model should recognize objects of different categories correctly. The results of such a model should not vary due to viewpoint variation, illumination conditions, object distortion or occlusion. A qualified model should recognize objects of different categories correctly. The results of such a model should remain accurate in different perspectives, illumination conditions, object distortion or occlusion (we refer to these conditions as Image Disturbance).
Figure 3 shows some images with various disturbances. A good model should classify these images correctly like humans. Figure 3 shows some images with various disturbances. A good model should classify these images correctly like humans.
<p align="center"> <p align="center">
<img src="image/variations_en.png" width="550" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/variations.png?raw=true" width="550" ><br/>
Figure 3. Disturbed images [22] Figure 3. Disturbed images <a href="#References">[22]</a>
</p> </p>
## Model Overview ## Exploration of Models
A large amount of researches in image classification are built upon benchmark datasets such as [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/), [ImageNet](http://image-net.org/) etc. Many image classification algorithms are usually evaluated and compared based on these datasets. PASCAL VOC is a computer vision competition started in 2005, and ImageNet is a dataset holding Large Scale Visual Recognition Challenge (ILSVRC) started in 2010. In this chapter, we introduce some image classification models from the submissions to these competitions.
A large amount of research in image classification is built upon public datasets such as [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/), [ImageNet](http://image-net.org/) etc. Many image classification algorithms are usually evaluated and compared on top of these datasets. PASCAL VOC is a computer vision competition started in 2005, and ImageNet is a dataset for Large Scale Visual Recognition Challenge (ILSVRC) started in 2010. In this chapter, we introduce some image classification models from the submissions to these competitions.
Before 2012, traditional image classification was accomplished with the three steps described in the background section. A complete model construction usually involves the following stages: low-level feature extraction, feature encoding, spatial constraint or feature clustering, classifier design, model ensemble. Before 2012, traditional image classification was accomplished with the three steps described in the background section. A complete model construction usually involves the following stages: low-level feature extraction, feature encoding, spatial constraint or feature clustering, classifier design, model ensemble.
1). **Low-level feature extraction**: This step extracts large amounts of local features according to fixed strides and scales. Popular local features include Scale-Invariant Feature Transform (SIFT)[1], Histogram of Oriented Gradient(HOG)[2], Local Binary Pattern(LBP)[3], etc. A common practice is to employ multiple feature descriptors in order to avoid missing a lot of information. 1). **Low-level feature extraction**: This step extracts large amounts of local features according to fixed strides and scales. Popular local features include Scale-Invariant Feature Transform (SIFT) \[[1](#References)\], Histogram of Oriented Gradient(HOG) \[[2](#References)\], Local Binary Pattern(LBP) \[[3](#References)\], etc. A common practice is to employ multiple feature descriptors in order to avoid missing a lot of information.
2). **Feature encoding**: Low-level features contain a large amount of redundancy and noise. In order to improve the robustness of features, it is necessary to employ a feature transformation to encode low-level features. This is called feature encoding. Common feature encoding methods include vector quantization [4], sparse coding [5], locality-constrained linear coding [6], Fisher vector encoding [7], etc. 2). **Feature encoding**: Low-level features contain a large amount of redundancy and noise. In order to improve the robustness of features, it is necessary to employ a feature transformation to encode low-level features. This is called feature encoding. Common feature encoding methods include vector quantization \[[4](#References)\], sparse coding \[[5](#References)\], locality-constrained linear coding \[[6](#References)\], Fisher vector encoding \[[7](#References)\], etc.
3). **Spatial constraint**: Spatial constraint or feature clustering is usually adopted after feature encoding for extracting the maximum or average of each dimension in the spatial domain. Pyramid feature matching--a popular feature clustering method--divides an image uniformly into patches and performs feature clustering in each patch. 3). **Spatial constraint**: Spatial constraint or feature clustering is usually adopted after feature encoding for extracting the maximum or average of each dimension in the spatial domain. Pyramid feature matching--a popular feature clustering method--divides an image uniformly into patches and performs feature clustering in each patch.
4). **Classification**: In the above steps an image can be described by a vector of fixed dimension. Then a classifier can be used to classify the image into categories. Common classifiers include Support Vector Machine(SVM), random forest etc. Kernel SVM is the most popular classifier and has achieved very good performance in traditional image classification tasks. 4). **Classification**: In the above steps an image can be described by a vector of fixed dimension. Then a classifier can be used to classify the image into categories. Common classifiers include Support Vector Machine(SVM), random forest etc. Kernel SVM is the most popular classifier and has achieved very good performance in traditional image classification tasks.
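The spatial constraint stage in step 3 can be sketched in a few lines. This is an illustrative sketch only, not code from the tutorial: it assumes each local feature has already been encoded into a non-negative vector and comes with a normalized (x, y) position in [0, 1], and the function name `grid_max_pool` is hypothetical. It max-pools each feature dimension inside every cell of a uniform grid and concatenates the cells into one fixed-length vector for the classifier in step 4.

```python
def grid_max_pool(features, positions, grid=2):
    # features: list of encoded feature vectors (assumed non-negative)
    # positions: list of (x, y) pairs normalized to [0, 1]
    dim = len(features[0])
    cells = [[0.0] * dim for _ in range(grid * grid)]
    for feat, (x, y) in zip(features, positions):
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        cell = cells[row * grid + col]
        for d in range(dim):
            cell[d] = max(cell[d], feat[d])
    # Concatenate the per-cell maxima into one fixed-length descriptor.
    return [value for cell in cells for value in cell]


features = [[0.2, 0.9], [0.7, 0.1], [0.4, 0.4]]
positions = [(0.1, 0.1), (0.3, 0.2), (0.8, 0.9)]
descriptor = grid_max_pool(features, positions)
print(descriptor)
```

The descriptor length is always `grid * grid * dim`, regardless of how many local features the image produced, which is what lets a fixed-input classifier such as an SVM consume it.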
This method has been used widely as image classification algorithm in PASCAL VOC [18]. NEC Labs(http://www.nec-labs.com/) won the championship by employing SIFT and LBP features, two non-linear encoders and SVM in ILSVRC 2010 [8]. This classic method has been used widely as image classification algorithm in PASCAL VOC \[[18](#References)\]. [NEC Labs](http://www.nec-labs.com/) won the championship by employing SIFT and LBP features, two non-linear encoders and SVM in ILSVRC 2010 \[[8](#References)\].
The CNN model--AlexNet proposed by Alex Krizhevsky et al.[9], made a breakthrough in ILSVRC 2012. It dramatically outperformed traditional methods and won the ILSVRC championship in 2012. This was also the first time that a deep learning method was used for large-scale image classification. Since AlexNet, a series of CNN models have been proposed that have advanced the state of the art steadily on Imagenet as shown in Figure 4. With deeper and more sophisticated architectures, Top-5 error rate is getting lower and lower (to around 3.5%). The error rate of human raters on the same Imagenet dataset is 5.1%, which means that the image classification capability of a deep learning model has surpassed human raters. The CNN model--AlexNet proposed by Alex Krizhevsky et al. \[[9](#References)\], made a breakthrough in ILSVRC 2012. It dramatically outperformed classical methods and won the ILSVRC championship in 2012. This was also the first time that a deep learning method was adopted for large-scale image classification. Since AlexNet, a series of CNN models have been proposed that have advanced the state of the art steadily on Imagenet as shown in Figure 4. With deeper and more sophisticated architectures, Top-5 error rate is getting lower and lower (to around 3.5%). The error rate of human raters on the same Imagenet dataset is 5.1%, which means that the image classification capability of a deep learning model has surpassed human raters.
<p align="center"> <p align="center">
<img src="image/ilsvrc.png" width="500" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/ilsvrc.png?raw=true" width="500" ><br/>
Figure 4. Top-5 error rates on ILSVRC image classification Figure 4. Top-5 error rates on ILSVRC image classification
</p> </p>
### CNN ### CNN
Traditional CNNs consist of convolutional and fully-connected layers and use the softmax multi-category classifier with the cross-entropy loss function. Figure 5 shows a typical CNN. We first introduce the common components of a CNN. Traditional CNNs consist of convolutional and fully-connected layers and use the softmax multi-category classifier with the cross-entropy loss function. Figure 5 shows a typical CNN. We first take a look at the common components of a CNN.
<p align="center"> <p align="center">
<img src="image/lenet_en.png"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/lenet.png?raw=true"><br/>
Figure 5. A CNN example [20] Figure 5. A CNN example <a href="#References">[20]</a>
</p> </p>
- convolutional layer: this layer uses the convolution operation to extract (low-level and high-level) features and to discover local correlation and spatial invariance. - convolutional layer: this layer uses the convolution operation to extract (low-level and high-level) features and to discover local correlation and spatial invariance.
- pooling layer: this layer down samples feature maps by extracting local max (max-pooling) or average (avg-pooling) value of each patch in the feature map. Down-sampling is a common operation in image processing and is used to filter out high-frequency information. - pooling layer: this layer down-samples feature maps by extracting local max (max-pooling) or average (avg-pooling) value of each patch in the feature map. Down-sampling is a common operation in image processing and is used to filter out trivial high-frequency information.
- fully-connected layer: this layer fully connects neurons between two adjacent layers. - fully-connected layer: this layer fully connects neurons between two adjacent layers.
- non-linear activation: Convolutional and fully-connected layers are usually followed by some non-linear activation layers. Non-linearities enhance the expression capability of the network. Some examples of non-linear activation functions are Sigmoid, Tanh and ReLU. ReLU is the most commonly used activation function in CNN. - non-linear activation: Convolutional and fully-connected layers are usually followed by some non-linear activation layers. Non-linearities enhance the expression capability of the network. Some examples of non-linear activation functions are Sigmoid, Tanh and ReLU. ReLU is the most commonly used activation function in CNN.
- Dropout [10]: At each training stage, individual nodes are dropped out of the network with a certain probability. This improves the network's ability to generalize and avoids overfitting. - Dropout \[[10](#References)\]: At each training stage, individual nodes are dropped out of the network with a certain random probability. This improves the network's ability to generalize and avoids overfitting.
Parameter updates at each layer during training cause input layer distributions to change and in turn require hyper-parameters to be carefully tuned. In 2015, Sergey Ioffe and Christian Szegedy proposed a Batch Normalization (BN) algorithm [14], which normalizes the features of each batch in a layer, and enables relatively stable distribution in each layer. Not only does BN algorithm act as a regularizer, but also reduces the need for careful hyper-parameter design. Experiments demonstrate that BN algorithm accelerates the training convergence and has been widely used in later deeper models. Parameter updates at each layer during training cause input layer distributions to change and in turn require hyper-parameters to be carefully tuned. In 2015, Sergey Ioffe and Christian Szegedy proposed a Batch Normalization (BN) algorithm \[[14](#References)\], which normalizes the features of each batch in a layer, and enables relatively stable distribution in each layer. Not only does BN algorithm act as a regularizer, but also reduces the need for meticulous hyper-parameter design. Experiments demonstrate that BN algorithm accelerates the training convergence and has been widely used in later, deeper models.
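The normalization step that BN applies can be sketched in a few lines of plain Python. This is only an illustration for a batch of scalar activations: the helper name `batch_norm` and the default `eps`, `gamma` and `beta` values are assumptions made for this example, not the Paddle implementation.

```python
import math

def batch_norm(batch, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalize a batch of scalar activations (illustrative helper, not a Paddle API)."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased variance over the batch, as used during training.
    var = sum((x - mean) ** 2 for x in batch) / n
    # Shift to zero mean, scale to (almost) unit variance, then apply the affine transform.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

With `gamma=1` and `beta=0` the output has a mean of about zero and a variance close to one, whatever the scale of the input batch.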
In the following sections, we will introduce the following network architectures - VGG, GoogleNet and ResNets. In the following sections, we will take a tour through the following network architectures - VGG, GoogLeNet and ResNets.
### VGG ### VGG
The Oxford Visual Geometry Group (VGG) proposed the VGG network in ILSVRC 2014 [11]. This model is deeper and wider than previous neural architectures. It consists of five main groups of convolution operations. Adjacent convolution groups are connected via max-pooling layers. Each group contains a series of 3x3 convolutional layers (i.e. kernels). The number of convolution kernels stays the same within the group and increases from 64 in the first group to 512 in the last one. The total number of learnable layers could be 11, 13, 16, or 19 depending on the number of convolutional layers in each group. Figure 6 illustrates a 16-layer VGG. The neural architecture of VGG is relatively simple and has been adopted by many papers such as the first one that surpassed human-level performance on ImageNet [19]. The Oxford Visual Geometry Group (VGG) proposed the VGG network in ILSVRC 2014 \[[11](#References)\]. This model is deeper and wider than previous neural architectures. Its major part is the five main groups of convolution operations. Adjacent convolution groups are connected via max-pooling layers to perform dimensionality reduction. Each group contains a series of 3x3 convolutional layers (i.e. kernels). The number of convolution kernels stays the same within the single group and increases from 64 in the first group to 512 in the last one. Double FC layers and a classifier layer will follow afterwards. The total number of learnable layers could be 11, 13, 16, or 19 depending on the number of convolutional layers in each group. Figure 6 illustrates a 16-layer VGG. The architecture of VGG is relatively simple and has been adopted by many papers such as the first one that surpassed human-level performance on ImageNet \[[19](#References)\].
<p align="center"> <p align="center">
<img src="image/vgg16.png" width="750" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/vgg16.png?raw=true" width="750" ><br/>
Figure 6. VGG16 model for ImageNet Figure 6. VGG16 model for ImageNet
</p> </p>
### GoogleNet ### GoogLeNet
GoogleNet [12] won the ILSVRC championship in 2014. GoogleNet borrowed some ideas from the Network in Network(NIN) model [13] and is built on the Inception blocks. Let us first familiarize ourselves with these. GoogLeNet \[[12](#References)\] won the ILSVRC championship in 2014. GoogLeNet borrowed some ideas from the Network in Network(NIN) model \[[13](#References)\] and is built on the Inception blocks. Let us first familiarize ourselves with these concepts.
The two main characteristics of the NIN model are: The two main characteristics of the NIN model are:
1) A single-layer convolutional network is replaced with a Multi-Layer Perceptron Convolution (MLPconv). MLPconv is a tiny multi-layer convolutional network. It enhances non-linearity by adding several 1x1 convolutional layers after linear ones. 1) A single-layer convolutional network is replaced with a Multi-Layer Perceptron Convolution (MLPconv). MLPconv is a tiny multi-layer convolutional network. It enhances non-linearity by adding several 1x1 convolutional layers after linear ones.
2) In traditional CNNs, the last few layers are usually fully-connected with a large number of parameters. In contrast, NIN replaces all fully-connected layers with convolutional layers with feature maps of the same size as the category dimension and a global average pooling. This replacement of fully-connected layers significantly reduces the number of parameters. 2) In traditional CNNs, the last few layers are usually fully-connected with a large number of parameters. In contrast, the last convolution layer of NIN contains feature maps of the same size as the category dimension, and NIN replaces fully-connected layers with global average pooling to fetch a vector of the same size as category dimension and classify them. This replacement of fully-connected layers significantly reduces the number of parameters.
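Global average pooling, as described in the second point, can be sketched as follows. The function name `global_average_pooling` and the toy 2x2 feature maps are assumptions for illustration only; each feature map stands for one category and is reduced to a single score without any learnable parameters.

```python
def global_average_pooling(feature_maps):
    """Reduce each 2-D feature map (one per category) to its mean value."""
    pooled = []
    for fmap in feature_maps:
        total = sum(sum(row) for row in fmap)
        count = sum(len(row) for row in fmap)
        pooled.append(total / count)
    return pooled

# Three hypothetical 2x2 feature maps, i.e. a 3-category problem.
maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
scores = global_average_pooling(maps)
print(scores)
```

The resulting vector has one entry per category and can be passed directly to softmax.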
Figure 7 depicts two Inception blocks. Figure 7(a) is the simplest design. The output is a concatenation of features from three convolutional layers and one pooling layer. The disadvantage of this design is that the pooling layer does not change the number of filters and leads to an increase in the number of outputs. After several of such blocks, the number of outputs and parameters become larger and larger and lead to higher computation complexity. To overcome this drawback, the Inception block in Figure 7(b) employs three 1x1 convolutional layers. These reduce dimensions or the number of channels but improve the non-linearity of the network. Figure 7 depicts two Inception blocks. Figure 7(a) is the simplest design. The output is a concatenation of features from three convolutional layers and one pooling layer. The disadvantage of this design is that the pooling layer does not change the number of channels and leads to an increased channel number of features after concatenation. After several such blocks, the number of channels and parameters become larger and larger and lead to higher computation complexity. To overcome this drawback, the Inception block in Figure 7(b) employs three 1x1 convolutional layers to perform dimensionality reduction, which, to put it simply, is to reduce the number of channels and simultaneously improve the non-linearity of the network.
<p align="center"> <p align="center">
<img src="image/inception_en.png" width="800" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/inception.png?raw=true" width="800" ><br/>
Figure 7. Inception block Figure 7. Inception block
</p> </p>
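The channel growth described for Figure 7(a) is plain arithmetic over the concatenated branches. The sketch below uses hypothetical branch widths (64, 128 and 32 output channels for the convolution branches, 192 input channels, and a 32-channel 1x1 reduction on the pooling branch); these numbers are assumptions chosen for illustration, not values taken from this tutorial.

```python
def concat_channels(branch_channels):
    """Channel count after concatenating branch outputs along the channel axis."""
    return sum(branch_channels)

in_channels = 192  # hypothetical number of input channels

# Figure 7(a): the pooling branch keeps all input channels.
naive_out = concat_channels([64, 128, 32, in_channels])

# Figure 7(b): a 1x1 convolution reduces the pooling branch to 32 channels.
reduced_out = concat_channels([64, 128, 32, 32])

print(naive_out, reduced_out)
```

In the simple design the output is always wider than the input, so stacking blocks keeps increasing the channel count; the 1x1 reduction breaks that dependency.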
GoogleNet consists of multiple stacked Inception blocks followed by an avg-pooling layer as in NIN instead of traditional fully connected layers. The difference between GoogleNet and NIN is that GoogleNet adds a fully connected layer after avg-pooling layer to output a vector of category size. Besides these two characteristics, the features from middle layers of a GoogleNet are also very discriminative. Therefore, GoogleNet inserts two auxiliary classifiers in the model for enhancing gradient and regularization when doing backpropagation. The loss function of the whole network is the weighted sum of these three classifiers. GoogLeNet comprises multiple stacked Inception blocks followed by an avg-pooling layer as in NIN instead of traditional fully connected layers. The difference between GoogLeNet and NIN is that GoogLeNet adds a fully connected layer after avg-pooling layer to output a vector of category size. Besides these two characteristics, the features from middle layers of a GoogLeNet are also very discriminative. Therefore, GoogLeNet inserts two auxiliary classifiers in the model for enhancing gradient and regularization when doing back-propagation. The loss function of the whole network is the weighted sum of these three classifiers.
Figure 8 illustrates the neural architecture of a GoogleNet which consists of 22 layers: it starts with three regular convolutional layers followed by three groups of sub-networks -- the first group contains two Inception blocks, the second group has five, and the third group has two. It ends with an average pooling and a fully-connected layer. Figure 8 illustrates the neural architecture of a GoogLeNet which consists of 22 layers: it starts with three regular convolutional layers followed by three groups of sub-networks -- the first group contains two Inception blocks, the second group has five, and the third group has two again. Finally, it ends with an average pooling and a fully-connected layer.
<p align="center"> <p align="center">
<img src="image/googlenet.jpeg" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/googlenet.jpeg?raw=true" ><br/>
Figure 8. GoogleNet[12] Figure 8. GoogLeNet <a href="#References">[12]</a>
</p> </p>
The above model is the first version of GoogleNet or GoogleNet-v1. GoogleNet-v2 [14] introduced BN layer; GoogleNet-v3 [16] further split some convolutional layers, which increases non-linearity and network depth; GoogleNet-v4 [17] leads to the design idea of ResNet which will be introduced in the next section. The evolution from v1 to v4 improved the accuracy rate consistently. We will not go into details of the neural architectures of v2 to v4. The model above is the first version of GoogLeNet or the so-called GoogLeNet-v1. GoogLeNet-v2 \[[14](#References)\] introduced BN layer; GoogLeNet-v3 \[[16](#References)\] further split some convolutional layers, which increases non-linearity and network depth; GoogLeNet-v4 \[[17](#References)\] is inspired by the design idea of ResNet which will be introduced in the next section. The evolution from v1 to v4 improved the accuracy rate consistently. The length of this article being limited, we will not scrutinize the neural architectures of v2 to v4.
### ResNet ### ResNet
Residual Network(ResNet)[15] won the 2015 championship on three ImageNet competitions -- image classification, object localization, and object detection. The main challenge in training deeper networks is that accuracy degrades with network depth. The authors of ResNet proposed a residual learning approach to ease the difficulty of training deeper networks. Based on the design ideas of BN, small convolutional kernels, full convolutional network, ResNets reformulate the layers as residual blocks, with each block containing two branches, one directly connecting input to the output, the other performing two to three convolutions and calculating the residual function with reference to the layer's inputs. The outputs of these two branches are then added up. Residual Network(ResNet) \[[15](#References)\] won the 2015 championship on three ImageNet competitions -- image classification, object localization, and object detection. The main challenge in training deeper networks is that accuracy degrades with network depth. The authors of ResNet proposed a residual learning approach to ease the training of deeper networks. Based on the design ideas of BN, small convolutional kernels, full convolutional network, ResNets reformulate the layers as residual blocks, with each block containing two branches, one directly connecting input to the output, the other performing two to three convolutions and calculating the residual function with reference to the layer's inputs. The output features of these two branches are then added up.
Figure 9 illustrates the ResNet architecture. To the left is the basic building block, it consists of two 3x3 convolutional layers of the same channels. To the right is a Bottleneck block. The bottleneck is a 1x1 convolutional layer used to reduce dimension from 256 to 64. The other 1x1 convolutional layer is used to increase dimension from 64 to 256. Thus, the number of input and output channels of the middle 3x3 convolutional layer is 64, which is relatively small. Figure 9 illustrates the ResNet architecture. To the left is the basic building block, it consists of two 3x3 convolutional layers with the same size of output channels. To the right is a Bottleneck block. The bottleneck is a 1x1 convolutional layer used to reduce dimension (from 256 to 64 here). The following 1x1 convolutional layer is used to increase dimension from 64 to 256. Thus, the number of input and output channels of the middle 3x3 convolutional layer is relatively small (64->64 in this example).
<p align="center"> <p align="center">
<img src="image/resnet_block.jpg" width="400"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/resnet_block.jpg?raw=true" width="400"><br/>
Figure 9. Residual block Figure 9. Residual block
</p> </p>
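To see why the middle 3x3 convolution of the bottleneck block is cheap, one can count convolution weights. The sketch below ignores biases and BN parameters, and the helper `conv_weights` is a name introduced only for this example. It compares the 256->64->64->256 bottleneck described above with two plain 3x3 convolutions that keep 256 channels throughout.

```python
def conv_weights(kernel, ch_in, ch_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * ch_in * ch_out

# Bottleneck: 1x1 reduce (256 -> 64), 3x3 (64 -> 64), 1x1 restore (64 -> 256).
bottleneck_params = (conv_weights(1, 256, 64)
                     + conv_weights(3, 64, 64)
                     + conv_weights(1, 64, 256))

# Two 3x3 convolutions operating directly on 256 channels.
plain_params = 2 * conv_weights(3, 256, 256)

print(bottleneck_params, plain_params)
```

Under these assumptions the bottleneck needs roughly 17 times fewer convolution weights than the plain pair of 3x3 layers at the same input and output width.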
Figure 10 illustrates ResNets with 50, 101, 152 layers, respectively. All three networks use bottleneck blocks of different numbers of repetitions. ResNet converges very fast and can be trained with hundreds or thousands of layers. Figure 10 illustrates ResNets with 50, 101, 152 layers, respectively. All three networks use bottleneck blocks and their difference lies in the repetition time of residual blocks. ResNet converges very fast and can be trained with hundreds or thousands of layers.
<p align="center"> <p align="center">
<img src="image/resnet.png"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/resnet.png?raw=true"><br/>
Figure 10. ResNet model for ImageNet Figure 10. ResNet model for ImageNet
</p> </p>
## Dataset ## Get Data Ready
Commonly used public datasets for image classification are [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html), [ImageNet](http://image-net.org/), [COCO](http://mscoco.org/), etc. Those used for fine-grained image classification are [CUB-200-2011](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), [Stanford Dog](http://vision.stanford.edu/aditya86/ImageNetDogs/), [Oxford-flowers](http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among these, the ImageNet dataset is the largest. Most research results are reported on ImageNet as mentioned in the Model Overview section. Since 2010, the ImageNet dataset has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories. There are 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images with 50 images per category on average. Common public benchmark datasets for image classification are [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html), [ImageNet](http://image-net.org/), [COCO](http://mscoco.org/), etc. Those used for fine-grained image classification are [CUB-200-2011](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html), [Stanford Dog](http://vision.stanford.edu/aditya86/ImageNetDogs/), [Oxford-flowers](http://www.robots.ox.ac.uk/~vgg/data/flowers/), etc. Among these, the ImageNet dataset is the largest. Most research results are reported on ImageNet as mentioned in the "Exploration of Models" section. Since 2010, the ImageNet dataset has gone through some changes. The commonly used ImageNet-2012 dataset contains 1000 categories. There are 1,281,167 training images, ranging from 732 to 1200 images per category, and 50,000 validation images with 50 images per category on average.
Since ImageNet is too large to be downloaded and trained efficiently, we use [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Figure 11 shows all the classes in CIFAR-10 as well as 10 images randomly sampled from each category. Since ImageNet is too large to be downloaded and trained efficiently, we use [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) in this tutorial. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Figure 11 shows all the classes in CIFAR-10 as well as 10 images randomly sampled from each category.
<p align="center"> <p align="center">
<img src="image/cifar.png" width="350"><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/cifar.png?raw=true" width="350"><br/>
Figure 11. CIFAR10 dataset[21] Figure 11. CIFAR10 dataset <a href="#References">[21]</a>
</p> </p>
The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens` and `wmt14`, etc. There's no need to manually download and preprocess CIFAR-10: the `paddle.dataset.cifar` module loads the CIFAR dataset automatically.
After running the command `python train.py`, training will start immediately. The following sections will describe in details. After running the command `python train.py`, training will start immediately. The following sections will explain `train.py` inside and out.
## Model Configuration ## Model Configuration
#### Initialize Paddle
Let's start with importing the Paddle Fluid API package and the helper modules. Let's start with importing the Paddle Fluid API package and the helper modules.
```python ```python
from __future__ import print_function
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import numpy import numpy
import sys import sys
from __future__ import print_function
try:
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
``` ```
Now we are going to walk you through the implementations of the VGG and ResNet. Now we are going to walk you through the implementations of the VGG and ResNet.
### VGG ### VGG
Let's start with the VGG model. Since the image size and amount of CIFAR10 are relatively small comparing to ImageNet, we use a small version of VGG network for CIFAR10. Convolution groups incorporate BN and dropout operations. Let's start with the VGG model. Since the image size and amount of CIFAR10 are smaller than ImageNet, we tailor our model to fit CIFAR10 dataset. Convolution groups incorporate BN and dropout operations.
The input to VGG main module is from the data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail: The input to VGG core module is the data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail:
```python ```python
def vgg_bn_drop(input): def vgg_bn_drop(input):
...@@ -262,24 +258,26 @@ def vgg_bn_drop(input): ...@@ -262,24 +258,26 @@ def vgg_bn_drop(input):
return predict return predict
``` ```
1. Firstly, it defines a convolution block or conv_block. The default convolution kernel is 3x3, and the default pooling size is 2x2 with stride 2. Dropout specifies the probability in dropout operation. Function `img_conv_group` is defined in `paddle.networks` consisting of a series of `Conv->BN->ReLu->Dropout` and a `Pooling`.
2. Five groups of convolutions. The first two groups perform two convolutions, while the last three groups perform three convolutions. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for this layer. 1. Firstly, it defines a convolution block or conv_block. The default convolution kernel is 3x3, and the default pooling size is 2x2 with stride 2. Groups decide the number of consecutive convolution operations in each VGG block. Dropout specifies the probability to perform dropout operation. Function `img_conv_group` is predefined in `paddle.nets` consisting of a series of `Conv->BN->ReLu->Dropout` and a group of `Pooling` .
2. Five groups of convolutions. The first two groups perform two consecutive convolutions, while the last three groups perform three convolutions in sequence. The dropout rate of the last convolution in each group is set to 0, which means there is no dropout for this layer.
3. The last two layers are fully-connected layers of dimension 512. 3. The last two layers are fully-connected layers of 512 dimensions.
4. The above VGG network extracts high-level features and maps them to a vector of the same size as the categories. Softmax function or classifier is then used for calculating the probability of the image belonging to each category. 4. The VGG network begins with extracting high-level features and then maps them to a vector of the same size as the category dimension. Finally, Softmax function is used for calculating the probability of classifying the image to each category.
### ResNet ### ResNet
Here are some basic functions used in `resnet_cifar10`: The 1st, 3rd, and 4th steps are identical to their counterparts in VGG and are skipped here.
We will explain the 2nd step at length, namely the core module of ResNet on CIFAR10.
- `conv_bn_layer` : convolutional layer followed by BN. To start with, here are some basic functions used in `resnet_cifar10`, and the network connection procedure is illustrated afterwards:
- `shortcut` : the shortcut branch in a residual block. There are two kinds of shortcuts: 1x1 convolution used when the number of channels between input and output is different; direct connection used otherwise.
- `conv_bn_layer` : convolutional layer with BN.
- `shortcut` : the shortcut connection in a residual block. There are two kinds of shortcuts: 1x1 convolutions are used to increase dimensionality when in the residual block the number of channels in input feature and that in output feature are different; direct connection used otherwise.
- `basicblock` : a basic residual module as shown in the left of Figure 9, it consists of two sequential 3x3 convolutions and one "shortcut" branch. - `basicblock` : a basic residual module as shown in the left of Figure 9, it consists of two sequential 3x3 convolutions and one "shortcut" branch.
- `bottleneck` : a bottleneck module as shown in the right of Figure 9, it consists of two 1x1 convolutions with one 3x3 convolution in between branch and a "shortcut" branch. - `layer_warp` : a group of residual modules consisting of several stacked blocks. In each group, the sliding window size of the first residual block could be different from the rest, in order to reduce the size of feature maps along horizontal and vertical directions.
- `layer_warp` : a group of residual modules consisting of several stacking blocks. In each group, the sliding window size of the first residual block could be different from the rest of blocks, in order to reduce the size of feature maps along horizontal and vertical directions.
```python ```python
def conv_bn_layer(input, def conv_bn_layer(input,
...@@ -321,13 +319,14 @@ def layer_warp(block_func, input, ch_in, ch_out, count, stride): ...@@ -321,13 +319,14 @@ def layer_warp(block_func, input, ch_in, ch_out, count, stride):
return tmp return tmp
``` ```
The following are the components of `resnet_cifar10`: The following are the components of `resnet_cifar10`:
1. The lowest level is `conv_bn_layer`. 1. The lowest level is `conv_bn_layer`, i.e. the convolution layer with BN.
2. The middle level consists of three `layer_warp`, each of which uses the left residual block in Figure 9. 2. The next level is composed of three residual blocks, namely three `layer_warp`, each of which uses the left residual block in Figure 9.
3. The last level is average pooling layer. 3. The last level is average pooling layer.
Note: besides the first convolutional layer and the last fully-connected layer, the total number of layers in three `layer_warp` should be divisible by 6, that is the depth of `resnet_cifar10` should satisfy $(depth - 2) % 6 == 0$. Note: Except for the first convolutional layer and the last fully-connected layer, the total number of layers with parameters in three `layer_warp` should be divisible by 6. In other words, the depth of `resnet_cifar10` should satisfy (depth-2)%6=0.
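The depth constraint follows from the structure: three `layer_warp` groups, each made of basic blocks with two convolutions, plus the first convolution and the final fully-connected layer. A minimal sketch of that bookkeeping, assuming every group stacks the same number of blocks (the helper name `blocks_per_group` is hypothetical and not part of `train.py`):

```python
def blocks_per_group(depth):
    """Number of basic blocks in each of the three groups for a given depth."""
    # 2 layers (first conv + final fc) lie outside the groups;
    # each basic block contributes 2 convolutions, and there are 3 groups.
    if (depth - 2) % 6 != 0:
        raise ValueError("depth must satisfy (depth - 2) % 6 == 0")
    return (depth - 2) // 6

n = blocks_per_group(32)
print(n, 3 * n * 2 + 2)
```

For the default `depth=32` this gives 5 blocks per group, and 3 groups x 5 blocks x 2 convolutions + 2 recovers the total of 32 layers.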
```python ```python
def resnet_cifar10(ipt, depth=32): def resnet_cifar10(ipt, depth=32):
...@@ -345,9 +344,10 @@ def resnet_cifar10(ipt, depth=32): ...@@ -345,9 +344,10 @@ def resnet_cifar10(ipt, depth=32):
return predict return predict
``` ```
## Infererence Program Configuration
The input to the network is defined as `fluid.layers.data`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32). ## Inference Program Configuration
The input to the network is defined as `fluid.layers.data` , corresponding to image pixels in the context of image classification. The images in CIFAR10 are 32x32 coloured images with three channels. Therefore, the size of the input data is 3072 (3x32x32).
```python ```python
def inference_program(): def inference_program():
...@@ -360,14 +360,14 @@ def inference_program(): ...@@ -360,14 +360,14 @@ def inference_program():
return predict return predict
``` ```
## Train Program Configuration ## Training Program Configuration
Then we need to set up the `train_program`. It takes the prediction from the inference_program first. Then we need to set up the `train_program`. It takes the prediction from the inference_program first.
During the training, it will calculate the `avg_loss` from the prediction. During the training, it will calculate the `avg_loss` from the prediction.
In the context of supervised learning, labels of training images are defined in `fluid.layers.data` as well. During training, the cross-entropy loss function is used and the loss is the output of the network. During testing, the outputs are the probabilities calculated in the classifier. In the context of supervised learning, labels of training images are defined in `fluid.layers.data` as well. During training, the multi-class cross-entropy is used as the loss function and becomes the output of the network. During testing, the outputs are the probabilities calculated in the classifier.
**NOTE:** A train program should return an array and the first returned argument has to be `avg_cost`. **NOTE:** A training program should return an array and the first returned argument has to be `avg_cost` .
The trainer always implicitly use it to calculate the gradient. The trainer always uses it to calculate the gradients.
```python ```python
def train_program(): def train_program():
...@@ -382,7 +382,7 @@ def train_program(): ...@@ -382,7 +382,7 @@ def train_program():
## Optimizer Function Configuration ## Optimizer Function Configuration
In the following `Adam` optimizer, `learning_rate` specifies the learning rate in the optimization procedure. In the following `Adam` optimizer, `learning_rate` specifies the learning rate in the optimization procedure. It influences the convergence speed.
```python ```python
def optimizer_program(): def optimizer_program():
...@@ -391,23 +391,10 @@ def optimizer_program(): ...@@ -391,23 +391,10 @@ def optimizer_program():
## Model Training ## Model Training
### Create Trainer
Before creating a training module, it is necessary to set the algorithm.
Here we specify `Adam` optimization algorithm via `fluid.optimizer`.
```python
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
trainer = Trainer(
train_func=train_program,
optimizer_func=optimizer_program,
place=place)
```
### Data Feeders Configuration ### Data Feeders Configuration
`cifar.train10()` will yield records during each pass, after shuffling, a batch input is generated for training. `cifar.train10()` generates one sample at a time as the input for training after completing shuffle and batch.
```python ```python
# Each batch will yield 128 images # Each batch will yield 128 images
...@@ -423,54 +410,109 @@ test_reader = paddle.batch( ...@@ -423,54 +410,109 @@ test_reader = paddle.batch(
paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE) paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
``` ```
### Event Handler
Callback function `event_handler` will be called during training when a pre-defined event happens. ### Implementation of the trainer program
We need to build a main_program for the training process. Similarly, we need to configure a test_program for testing. It's also necessary to define the `place` where training runs and to use the `optimizer_program` defined previously.
`event_handler` is used to plot some text data when training.
```python ```python
params_dirname = "image_classification_resnet.inference.model" use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
# event handler to track training and testing process feed_order = ['pixel', 'label']
def event_handler(event):
if isinstance(event, EndStepEvent): main_program = fluid.default_main_program()
if event.step % 100 == 0: star_program = fluid.default_startup_program()
print("\nPass %d, Batch %d, Cost %f, Acc %f" %
(event.step, event.epoch, event.metrics[0], avg_cost, acc = train_program()
event.metrics[1]))
else: # Test program
sys.stdout.write('.') test_program = main_program.clone(for_test=True)
sys.stdout.flush()
optimizer = optimizer_program()
optimizer.minimize(avg_cost)
if isinstance(event, EndEpochEvent): exe = fluid.Executor(place)
# Test against with the test dataset to get accuracy.
avg_cost, accuracy = trainer.test(
reader=test_reader, feed_order=['pixel', 'label'])
print('\nTest with Pass {0}, Loss {1:2.2}, Acc {2:2.2}'.format(event.epoch, avg_cost, accuracy)) EPOCH_NUM = 2
# For training test cost
def train_test(program, reader):
count = 0
feed_var_list = [
program.global_block().var(var_name) for var_name in feed_order
]
feeder_test = fluid.DataFeeder(
feed_list=feed_var_list, place=place)
test_exe = fluid.Executor(place)
accumulated = len([avg_cost, acc]) * [0]
for tid, test_data in enumerate(reader()):
avg_cost_np = test_exe.run(program=program,
feed=feeder_test.feed(test_data),
fetch_list=[avg_cost, acc])
accumulated = [x[0] + x[1][0] for x in zip(accumulated, avg_cost_np)]
count += 1
return [x / count for x in accumulated]
```
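The `train_test` function above sums the fetched values batch by batch and then divides by the number of batches. The same bookkeeping can be sketched without any framework; `average_metrics` and the sample values below are hypothetical and only illustrate the accumulation step.

```python
def average_metrics(batch_results):
    """Average per-batch metric tuples, e.g. (cost, accuracy), over all batches."""
    count = 0
    accumulated = None
    for values in batch_results:
        if accumulated is None:
            accumulated = [0.0] * len(values)
        accumulated = [a + v for a, v in zip(accumulated, values)]
        count += 1
    if count == 0:
        raise ValueError("no batches were provided")
    return [a / count for a in accumulated]


# Made-up (cost, accuracy) pairs standing in for three test batches.
fake_results = [(1.2, 0.50), (1.0, 0.60), (0.8, 0.70)]
avg_cost_value, avg_acc_value = average_metrics(fake_results)
```

Note that this is an unweighted mean over batches, so a smaller final batch counts as much as a full one.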
### The main loop of training and the outputs along the process
In the main training loop below, we use the fetched outputs to monitor the training process and to run tests.
You can also plot the process with `Ploter`, using the values collected during training:
```python
params_dirname = "image_classification_resnet.inference.model"
from paddle.utils.plot import Ploter
train_prompt = "Train cost"
test_prompt = "Test cost"
plot_cost = Ploter(test_prompt,train_prompt)
# main train loop.
def train_loop():
feed_var_list_loop = [
main_program.global_block().var(var_name) for var_name in feed_order
]
feeder = fluid.DataFeeder(
feed_list=feed_var_list_loop, place=place)
exe.run(star_program)
step = 0
for pass_id in range(EPOCH_NUM):
for step_id, data_train in enumerate(train_reader()):
avg_loss_value = exe.run(main_program,
feed=feeder.feed(data_train),
fetch_list=[avg_cost, acc])
if step % 1 == 0:
plot_cost.append(train_prompt, step, avg_loss_value[0])
plot_cost.plot()
step += 1
avg_cost_test, accuracy_test = train_test(test_program,
reader=test_reader)
plot_cost.append(test_prompt, step, avg_cost_test)
# save parameters # save parameters
if params_dirname is not None: if params_dirname is not None:
trainer.save_params(params_dirname) fluid.io.save_inference_model(params_dirname, ["pixel"],
[predict], exe)
``` ```
### Training ### Training
Training runs through the `train_loop` function. Here we run only 2 epochs; in practice, more than a hundred epochs are generally needed.

**Note:** On CPU, each epoch takes approximately 15 to 20 minutes, so this part may take a while. Feel free to modify the code to run on GPU and speed up training.
```python ```python
trainer.train( train_loop()
reader=train_reader,
num_epochs=2,
event_handler=event_handler,
feed_order=['pixel', 'label'])
``` ```
An example training log for one epoch is shown below. After 1 pass, the average accuracy is 0.59 on the training set and 0.6 on the test set.
```text ```text
Pass 0, Batch 0, Cost 3.869598, Acc 0.164062 Pass 0, Batch 0, Cost 3.869598, Acc 0.164062
...@@ -484,32 +526,31 @@ Pass 300, Batch 0, Cost 1.223424, Acc 0.593750 ...@@ -484,32 +526,31 @@ Pass 300, Batch 0, Cost 1.223424, Acc 0.593750
Test with Pass 0, Loss 1.1, Acc 0.6 Test with Pass 0, Loss 1.1, Acc 0.6
``` ```
Figure 13 shows the classification error rate curve during training. After about 200 passes, training has almost converged, and the final classification error rate on the test set is 8.54%.
<p align="center"> <p align="center">
<img src="image/plot_en.png" width="400" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/03.image_classification/image/plot.png?raw=true" width="400" ><br/>
Figure 12. The error rate of VGG model on CIFAR10 Figure 13. Classification error rate of VGG model on the CIFAR10 data set
</p> </p>
## Model Application
You can use a trained model to classify your images. The following program shows how to load a trained network and its optimized parameters for inference.
### Generate input data for inferring ### Generate Input Data to infer
`dog.png` is a picture of a puppy. We convert it to a `numpy` array to match the `feeder` format.
```python ```python
# Prepare testing data. # Prepare testing data.
from PIL import Image from PIL import Image
import numpy as np
import os import os
def load_image(file): def load_image(file):
im = Image.open(file) im = Image.open(file)
im = im.resize((32, 32), Image.ANTIALIAS) im = im.resize((32, 32), Image.ANTIALIAS)
im = np.array(im).astype(np.float32) im = numpy.array(im).astype(numpy.float32)
# The loaded image is stored in H(height), W(width),
# C(channel) order. PaddlePaddle requires
# the CHW order, so transpose them.
...@@ -526,26 +567,48 @@ img = load_image(cur_dir + '/image/dog.png') ...@@ -526,26 +567,48 @@ img = load_image(cur_dir + '/image/dog.png')
### Inferencer Configuration and Inference ### Inferencer Configuration and Inference
Similar to the training process, an inferencer needs to build the corresponding process. We load the trained network and parameters from `params_dirname`.
We can simply plug in the inference program defined earlier.
Now we are ready to do inference.
```python ```python
inferencer = Inferencer( place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
infer_func=inference_program, param_path=params_dirname, place=place) exe = fluid.Executor(place)
inference_scope = fluid.core.Scope()
with fluid.scope_guard(inference_scope):
[inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)
label_list = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"] # Construct feed as a dictionary of {feed_target_name: feed_target_data}
# inference # and results will contain a list of data corresponding to fetch_targets.
results = inferencer.infer({'pixel': img}) results = exe.run(inference_program,
print("infer results: %s" % label_list[np.argmax(results[0])]) feed={feed_target_names[0]: img},
fetch_list=fetch_targets)
# infer label
label_list = [
"airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse",
"ship", "truck"
]
print("infer results: %s" % label_list[numpy.argmax(results[0])])
``` ```
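The last step above maps the highest-scoring output to a class name. A minimal sketch of that step in plain Python follows; `top_label` is a hypothetical helper and `fake_probs` is made-up data, not real model output.

```python
label_list = [
    "airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse",
    "ship", "truck"
]


def top_label(probabilities, labels):
    """Return the label with the highest probability and that probability."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index], probabilities[best_index]


# Made-up softmax output for a single image (one value per class).
fake_probs = [0.01, 0.02, 0.05, 0.10, 0.02, 0.70, 0.03, 0.03, 0.02, 0.02]
name, confidence = top_label(fake_probs, label_list)
print("infer results: %s (%.2f)" % (name, confidence))
```

Returning the probability alongside the label makes it easy to reject low-confidence predictions later.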
## Conclusion
Traditional image classification methods involve multiple stages of processing, which has to utilize complex frameworks. Contrarily, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduced three models -- VGG, GoogleNet, ResNet and provided PaddlePaddle config files for training VGG and ResNet on CIFAR10. We also explained how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedure for config and training are the same and you are welcome to give it a try. ## Summary
The traditional image classification method consists of multiple stages, and its framework is fairly complex. In contrast, an end-to-end CNN model can be trained in one step, and classification accuracy is greatly improved. In this article, we first introduced three classic models: VGG, GoogLeNet and ResNet. We then showed how to use PaddlePaddle to configure and train CNN models, VGG and ResNet in particular, on the CIFAR10 dataset. Finally, we showed how to use PaddlePaddle's API to classify images and extract features. For other datasets such as ImageNet, the configuration and training process is the same, so you are welcome to try it on your own.
<a name="References"></a>
## References ## References
[1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004. [1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004.
...@@ -582,7 +645,7 @@ Traditional image classification methods involve multiple stages of processing, ...@@ -582,7 +645,7 @@ Traditional image classification methods involve multiple stages of processing,
[17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016). [17] Szegedy, C., Ioffe, S., Vanhoucke, V. [Inception-v4, inception-resnet and the impact of residual connections on learning](https://arxiv.org/abs/1602.07261). arXiv:1602.07261 (2016).
[18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective]((http://link.springer.com/article/10.1007/s11263-014-0733-5)). International Journal of Computer Vision, 111(1), 98-136, 2015. [18] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. [The Pascal Visual Object Classes Challenge: A Retrospective](http://link.springer.com/article/10.1007/s11263-014-0733-5). International Journal of Computer Vision, 111(1), 98-136, 2015.
[19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015. [19] He, K., Zhang, X., Ren, S., and Sun, J. [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852). ArXiv e-prints, February 2015.
...@@ -592,8 +655,10 @@ Traditional image classification methods involve multiple stages of processing, ...@@ -592,8 +655,10 @@ Traditional image classification methods involve multiple stages of processing,
[22] http://cs231n.github.io/classification/ [22] http://cs231n.github.io/classification/
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -15,6 +15,7 @@ ...@@ -15,6 +15,7 @@
from __future__ import print_function from __future__ import print_function
import os import os
import argparse
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import numpy import numpy
...@@ -23,6 +24,20 @@ from vgg import vgg_bn_drop ...@@ -23,6 +24,20 @@ from vgg import vgg_bn_drop
from resnet import resnet_cifar10 from resnet import resnet_cifar10
def parse_args():
parser = argparse.ArgumentParser("image_classification")
parser.add_argument(
'--enable_ce',
action='store_true',
help='If set, run the task with continuous evaluation logs.')
parser.add_argument(
'--use_gpu', type=bool, default=0, help='whether to use gpu')
parser.add_argument(
'--num_epochs', type=int, default=1, help='number of epoch')
args = parser.parse_args()
return args
def inference_network(): def inference_network():
# The image is 32 * 32 with RGB representation. # The image is 32 * 32 with RGB representation.
data_shape = [3, 32, 32] data_shape = [3, 32, 32]
...@@ -48,31 +63,40 @@ def optimizer_program(): ...@@ -48,31 +63,40 @@ def optimizer_program():
def train(use_cuda, params_dirname): def train(use_cuda, params_dirname):
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
BATCH_SIZE = 128 BATCH_SIZE = 128
if args.enable_ce:
train_reader = paddle.batch(
paddle.dataset.cifar.train10(), batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
else:
test_reader = paddle.batch(
paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.cifar.train10(), buf_size=128 * 100), paddle.dataset.cifar.train10(), buf_size=128 * 100),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.cifar.test10(), batch_size=BATCH_SIZE)
feed_order = ['pixel', 'label'] feed_order = ['pixel', 'label']
main_program = fluid.default_main_program() main_program = fluid.default_main_program()
star_program = fluid.default_startup_program() start_program = fluid.default_startup_program()
if args.enable_ce:
main_program.random_seed = 90
start_program.random_seed = 90
predict = inference_network() predict = inference_network()
avg_cost, acc = train_network(predict) avg_cost, acc = train_network(predict)
# Test program # Test program
test_program = main_program.clone(for_test=True) test_program = main_program.clone(for_test=True)
optimizer = optimizer_program() optimizer = optimizer_program()
optimizer.minimize(avg_cost) optimizer.minimize(avg_cost)
exe = fluid.Executor(place) exe = fluid.Executor(place)
EPOCH_NUM = 1 EPOCH_NUM = args.num_epochs
# For training test cost # For training test cost
def train_test(program, reader): def train_test(program, reader):
...@@ -100,7 +124,7 @@ def train(use_cuda, params_dirname): ...@@ -100,7 +124,7 @@ def train(use_cuda, params_dirname):
main_program.global_block().var(var_name) for var_name in feed_order main_program.global_block().var(var_name) for var_name in feed_order
] ]
feeder = fluid.DataFeeder(feed_list=feed_var_list_loop, place=place) feeder = fluid.DataFeeder(feed_list=feed_var_list_loop, place=place)
exe.run(star_program) exe.run(start_program)
step = 0 step = 0
for pass_id in range(EPOCH_NUM): for pass_id in range(EPOCH_NUM):
...@@ -126,6 +150,12 @@ def train(use_cuda, params_dirname): ...@@ -126,6 +150,12 @@ def train(use_cuda, params_dirname):
fluid.io.save_inference_model(params_dirname, ["pixel"], fluid.io.save_inference_model(params_dirname, ["pixel"],
[predict], exe) [predict], exe)
if args.enable_ce and pass_id == EPOCH_NUM - 1:
print("kpis\ttrain_cost\t%f" % avg_loss_value[0])
print("kpis\ttrain_acc\t%f" % avg_loss_value[1])
print("kpis\ttest_cost\t%f" % avg_cost_test)
print("kpis\ttest_acc\t%f" % accuracy_test)
train_loop() train_loop()
...@@ -161,12 +191,6 @@ def infer(use_cuda, params_dirname=None): ...@@ -161,12 +191,6 @@ def infer(use_cuda, params_dirname=None):
[inference_program, feed_target_names, [inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(params_dirname, exe) fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)
# The input's dimension of conv should be 4-D or 5-D.
# Use inference_transpiler to speedup
inference_transpiler_program = inference_program.clone()
t = fluid.transpiler.InferenceTranspiler()
t.transpile(inference_transpiler_program, place)
# Construct feed as a dictionary of {feed_target_name: feed_target_data} # Construct feed as a dictionary of {feed_target_name: feed_target_data}
# and results will contain a list of data corresponding to fetch_targets. # and results will contain a list of data corresponding to fetch_targets.
results = exe.run( results = exe.run(
...@@ -174,16 +198,6 @@ def infer(use_cuda, params_dirname=None): ...@@ -174,16 +198,6 @@ def infer(use_cuda, params_dirname=None):
feed={feed_target_names[0]: img}, feed={feed_target_names[0]: img},
fetch_list=fetch_targets) fetch_list=fetch_targets)
transpiler_results = exe.run(
inference_transpiler_program,
feed={feed_target_names[0]: img},
fetch_list=fetch_targets)
assert len(results[0]) == len(transpiler_results[0])
for i in range(len(results[0])):
numpy.testing.assert_almost_equal(
results[0][i], transpiler_results[0][i], decimal=5)
# infer label # infer label
label_list = [ label_list = [
"airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "airplane", "automobile", "bird", "cat", "deer", "dog", "frog",
...@@ -206,4 +220,6 @@ def main(use_cuda): ...@@ -206,4 +220,6 @@ def main(use_cuda):
if __name__ == '__main__': if __name__ == '__main__':
# For demo purpose, the training runs on CPU # For demo purpose, the training runs on CPU
# Please change accordingly. # Please change accordingly.
main(use_cuda=False) args = parse_args()
use_cuda = args.use_gpu
main(use_cuda)
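One detail of the `--use_gpu` option above is worth a note: with `type=bool`, `argparse` calls `bool()` on the raw string, so any non-empty value, including `False`, is parsed as `True`. The sketch below demonstrates the behavior and one possible alternative; `str2bool` is a hypothetical helper and is not part of train.py.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: parse common spellings of true/false."""
    text = str(value).strip().lower()
    if text in ("1", "true", "yes", "y"):
        return True
    if text in ("0", "false", "no", "n"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean value, got %r" % value)


# With type=bool, the string 'False' is non-empty and therefore truthy.
naive = argparse.ArgumentParser("naive")
naive.add_argument('--use_gpu', type=bool, default=False)
naive_args = naive.parse_args(['--use_gpu', 'False'])

# With an explicit converter, 'False' is parsed as intended.
strict = argparse.ArgumentParser("strict")
strict.add_argument('--use_gpu', type=str2bool, default=False)
strict_args = strict.parse_args(['--use_gpu', 'False'])
```

A simpler design, if a value-less switch is acceptable, is `action='store_true'`, which `--enable_ce` already uses.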
#!/bin/bash
#This file is only used for continuous evaluation.
python train.py --enable_ce | python _ce.py
...@@ -3,6 +3,18 @@ ...@@ -3,6 +3,18 @@
The source code of this tutorial is in [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec). If you are new to the book, please refer to the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
### Notes
1. This tutorial can run in both CPU and GPU environments.
2. CUDA/cuDNN versions supported by the Docker image
If you run the Book with Docker, note that the default image provides a GPU environment with CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in the document and in the script
Note: to make this document easier to read and use, we split and adjusted the code of [train.py](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/train.py) before including it here. The code in this document produces the same results as train.py, which you can run directly to verify.
## Background
This chapter introduces the vector representation of words, also known as word embedding. Computing word vectors is a common operation in natural language processing and a common foundational technology behind Internet services such as search engines, advertising systems, and recommendation systems.
...@@ -16,7 +28,9 @@ One-hot vector虽然自然,但是用处有限。比如,在互联网广告系 ...@@ -16,7 +28,9 @@ One-hot vector虽然自然,但是用处有限。比如,在互联网广告系
The word vector model can be a probabilistic model, a co-occurrence matrix model, or a neural network model. Before neural networks were used to compute word vectors, the traditional approach was to build a co-occurrence matrix $X$ of words. $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ is the number of times the $i$-th and $j$-th words of the vocabulary $V$ occur together across the whole corpus, and $|V|$ is the vocabulary size. Applying a matrix decomposition to $X$ (for example, Singular Value Decomposition \[[5](#参考文献)\]) yields a matrix $U$ that is taken as the word vectors of all words:
$$X = USV^T$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn1.gif?raw=true"><br/>
</p>
However, this traditional approach has many problems:
...@@ -65,11 +79,17 @@ similarity: -0.0997506977351 ...@@ -65,11 +79,17 @@ similarity: -0.0997506977351
For the target probability $P(w_1, ..., w_T)$ of a language model, if we assume that the words in a text are mutually independent, the joint probability of the whole sentence can be written as the product of the probabilities of its individual words:
$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$
<p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn2.gif?raw=true"><br/>
</p>
However, the probability of each word in a sentence is closely related to the words before it, so in practice the language model is usually expressed with conditional probabilities:
$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn3.gif?raw=true"><br/>
</p>
...@@ -81,11 +101,16 @@ Yoshua Bengio等科学家就于2003年在著名论文 Neural Probabilistic Langu ...@@ -81,11 +101,16 @@ Yoshua Bengio等科学家就于2003年在著名论文 Neural Probabilistic Langu
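As described above, a language model can be built with conditional probabilities, that is, the probability of the $t$-th word in a sentence depends on the preceding $t-1$ words. In practice, the farther away a word is, the less it affects the current word. If we consider an n-gram model in which each word depends only on the `n-1` words before it, we have: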
$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn4.gif?raw=true"><br/>
</p>
Given a real corpus made up of meaningful sentences, the optimization goal of the n-gram model is to maximize the objective function:
$$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn5.gif?raw=true"><br/>
</p>
where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ is the conditional probability of the current word $w_t$ given the previous n-1 words, and $R(\theta)$ is the parameter regularization term.
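As a small worked example of the n-gram factorization above, the sketch below estimates bigram (n=2) conditional probabilities by simple counting and multiplies them along a sentence. This is a count-based illustration rather than the neural model discussed in this chapter; the toy corpus and the helper names `bigram_model` and `sentence_probability` are assumptions made for the example.

```python
from collections import Counter

# Toy corpus: each sentence is a list of words.
corpus = [
    ["i", "have", "a", "dream"],
    ["i", "have", "a", "plan"],
]


def bigram_model(sentences):
    """Return prob(cur, prev), an estimate of P(cur | prev) from counts."""
    context_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        for prev, cur in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1

    def prob(cur, prev):
        if context_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, cur)] / float(context_counts[prev])

    return prob


def sentence_probability(words, prob):
    """Multiply P(w_t | w_{t-1}) for t = 2..T, following the n-gram formula."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= prob(cur, prev)
    return p


prob = bigram_model(corpus)
p_dream = sentence_probability(["i", "have", "a", "dream"], prob)
```

Word pairs that never occur in the corpus get probability zero here, which is one reason practical models smooth their estimates or learn them with a neural network.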
...@@ -101,20 +126,25 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$ ...@@ -101,20 +126,25 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
- The word vectors of all these words are then concatenated into one large vector, which passes through a non-linear mapping to produce the hidden representation of the history words:
$$g=Utanh(\theta^Tx + b_1) + Wx + b_2$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn6.gif?raw=true"><br/>
</p>
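where $x$ is the large vector formed by concatenating all the word vectors and represents the text history features; $\theta$, $U$, $b_1$, $b_2$ and $W$ are the parameters of the connections from the word vector layer to the hidden layer. $g$ holds the unnormalized output probabilities of all words, and $g_i$ is the unnormalized output probability of the $i$-th word in the dictionary.
- According to the definition of softmax, normalizing $g_i$ gives the probability of generating the target word $w_t$: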
$$P(w_t | w_1, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn7.gif?raw=true"><br/>
</p>
- The loss (cost) of the whole network is the multi-class cross entropy, expressed as
$$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}log(softmax(g_k^i))$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn8.gif?raw=true"><br/>
</p>
where $y_k^i$ is the true label (0 or 1) of the $k$-th class for the $i$-th sample, and $\text{softmax}(g_k^i)$ is the softmax output probability of the $k$-th class for the $i$-th sample.
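The softmax normalization and the cross-entropy loss described above can be sketched for a single sample in a few lines. The scores in `g` are made-up numbers, and because the label is one-hot, the inner sum over $k$ reduces to the negative log-probability of the true class.

```python
import math


def softmax(scores):
    """Normalize unnormalized scores g_i into probabilities."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target_index):
    """Cross entropy for one sample whose true class is target_index."""
    probs = softmax(scores)
    return -math.log(probs[target_index])


g = [2.0, 1.0, 0.1]  # made-up unnormalized scores for a 3-word vocabulary
probs = softmax(g)
loss = cross_entropy(g, target_index=0)
```

Summing this per-sample loss over all $N$ samples gives $J(\theta)$.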
### Continuous Bag-of-Words model(CBOW) ### Continuous Bag-of-Words model(CBOW)
...@@ -128,7 +158,10 @@ CBOW模型通过一个词的上下文(各N个词)预测当前词。当N=2时 ...@@ -128,7 +158,10 @@ CBOW模型通过一个词的上下文(各N个词)预测当前词。当N=2时
Specifically, ignoring the input order of the context words, CBOW uses the mean of the context words' vectors to predict the current word, that is:
$$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
<p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn9.gif?raw=true"><br/>
</p>
where $x_t$ is the word vector of the $t$-th word, the classification score vector is $z=U*context$, the final classification $y$ uses softmax, and the loss function is the multi-class cross entropy.
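The CBOW averaging step and the score computation $z=U*context$ can be illustrated with tiny made-up vectors. The two-dimensional word vectors, the matrix `U` and the helper names below are assumptions chosen so the arithmetic is easy to follow.

```python
def cbow_context(vectors):
    """Element-wise mean of the context word vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / float(n) for d in range(dim)]


def class_scores(U, context):
    """Compute z = U * context, one score per vocabulary word."""
    return [sum(w * c for w, c in zip(row, context)) for row in U]


# Made-up 2-dimensional vectors for x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}.
x_tm2, x_tm1, x_tp1, x_tp2 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]
context = cbow_context([x_tm1, x_tm2, x_tp1, x_tp2])

# Made-up score matrix U for a 3-word vocabulary.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = class_scores(U, context)
```

Since the mean does not depend on order, swapping any two context vectors leaves `context` and `z` unchanged, which matches the "bag-of-words" assumption.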
...@@ -203,14 +236,14 @@ dream that one day <e> ...@@ -203,14 +236,14 @@ dream that one day <e>
```python ```python
import paddle as paddle from __future__ import print_function
import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import six import six
import numpy import numpy
import math import math
from __future__ import print_function
``` ```
Then, define the parameters:
...@@ -411,11 +444,11 @@ def infer(use_cuda, params_dirname=None): ...@@ -411,11 +444,11 @@ def infer(use_cuda, params_dirname=None):
# It is used to look up the embedding table for the corresponding word vector, so its shape is [1].
# recursive_sequence_lengths sets the length-based LoD, so it should be [[1]] for each word.
# Note that recursive_sequence_lengths is a list of lists.
data1 = [[211]] # 'among' data1 = numpy.asarray([[211]], dtype=numpy.int64) # 'among'
data2 = [[6]] # 'a' data2 = numpy.asarray([[6]], dtype=numpy.int64) # 'a'
data3 = [[96]] # 'group' data3 = numpy.asarray([[96]], dtype=numpy.int64) # 'group'
data4 = [[4]] # 'of' data4 = numpy.asarray([[4]], dtype=numpy.int64) # 'of'
lod = [[1]] lod = numpy.asarray([[1]], dtype=numpy.int64)
first_word = fluid.create_lod_tensor(data1, lod, place) first_word = fluid.create_lod_tensor(data1, lod, place)
second_word = fluid.create_lod_tensor(data2, lod, place) second_word = fluid.create_lod_tensor(data2, lod, place)
...@@ -491,4 +524,4 @@ main(use_cuda=use_cuda, is_sparse=True) ...@@ -491,4 +524,4 @@ main(use_cuda=use_cuda, is_sparse=True)
5. https://en.wikipedia.org/wiki/Singular_value_decomposition 5. https://en.wikipedia.org/wiki/Singular_value_decomposition
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
# Word Vector

The source code of this tutorial is in [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec). For new users, please refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).

## Background

In this chapter, we introduce the vector representation of words, also known as word embedding. Computing word vectors is a common operation in natural language processing and a common technology underlying Internet services such as search engines, advertising systems, and recommendation systems.

In these Internet services, we often need to compare how related two words or two pieces of text are. To make such comparisons, we have to express words in a form suitable for computer processing. The most natural way is probably the vector space model. In this model, each word is represented as a one-hot vector whose length is the dictionary size and whose dimensions each correspond to one word in the dictionary: the dimension corresponding to the word is 1, and all other elements are 0.
One-hot vectors are natural but of limited use. For example, in an Internet advertising system, suppose the query entered by the user is "Mother's Day" and the keyword of an advertisement is "Carnation". Common sense tells us these two words are connected, since people often give their mothers a bunch of carnations on Mother's Day. Yet measured on their one-hot vectors, whether by Euclidean distance or cosine similarity, the two words are considered unrelated because the vectors are orthogonal. This counterintuitive conclusion arises because each word on its own carries too little information, so two words alone are not enough to judge accurately whether they are related. To compute relatedness accurately, we need more information: knowledge learned from large amounts of data through machine learning.

In machine learning, all kinds of "knowledge" are represented by models, and the word embedding model is one of them. A word embedding model maps a one-hot vector to a lower-dimensional embedding vector, such as $embedding (Mother's day) = [0.3, 4.2, -1.5, ...], embedding (carnation) = [0.2, 5.6, -2.3, ...]$. In this embedding space, we want the vectors of words that are similar in meaning (or usage) to be closer to each other, so that the cosine similarity between the vectors for "Mother's Day" and "Carnation" is no longer zero.
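The contrast between one-hot vectors and embedding vectors can be checked numerically. The sketch below computes cosine similarity for two one-hot vectors and for the first three components of the example embeddings quoted above; truncating the embeddings to three components and the three-word vocabulary size are assumptions made only for illustration.

```python
import math


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two different words in a hypothetical 3-word dictionary: orthogonal vectors.
one_hot_sim = cosine_similarity(one_hot(0, 3), one_hot(1, 3))

# First three components of the example embeddings from the text.
embedding_sim = cosine_similarity([0.3, 4.2, -1.5], [0.2, 5.6, -2.3])
```

Any two distinct one-hot vectors give a similarity of exactly zero, while the example embedding vectors point in nearly the same direction.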
For each word, its vector representation has the corresponding entry in the vector as 1, and all other entries as 0. The lengths of one-hot vectors match the size of the dictionary. Each entry of a vector corresponds to the presence (or absence) of a word in the dictionary. The word embedding model can be a probability model, a co-occurrence matrix model, or a neural network model. Before implementing neural networks to calculate the embedding vector, the traditional method is to count the co-occurrence matrix $X$ of a word. $X$ is a matrix of $|V| \times |V|$ size, $X_{ij}$ means that in all corpora, The number of words appearing simultaneously with the i-th word and the j-th word in the vocabulary $V$(vocabulary), $|V|$ is the size of the vocabulary. Do matrix decomposition for $X$ (such as singular value decomposition, Singular Value Decomposition \[[5](#references)\]), and the result $U$ is treated as the embedding vector for all words:
One-hot vectors are intuitive, yet they have limited usefulness. Take the example of an Internet advertising system: Suppose a customer enters the query "Mother's Day", while an ad bids for the keyword "carnations". Because the one-hot vectors of these two words are perpendicular, the metric distance (either Euclidean or cosine similarity) between them would indicate little relevance. However, *we* know that these two queries are connected semantically, since people often gift their mothers bundles of carnation flowers on Mother's Day. This discrepancy is due to the low information capacity in each vector. That is, comparing the vector representations of two words does not assess their relevance sufficiently. To calculate their similarity accurately, we need more information, which could be learned from large amounts of data through machine learning methods. $$X = USV^T$$
Like many machine learning models, word embeddings can represent knowledge in various ways. Another model may project a one-hot vector to an embedding vector of lower dimension e.g. $embedding(mother's day) = [0.3, 4.2, -1.5, ...], embedding(carnations) = [0.2, 5.6, -2.3, ...]$. Mapping one-hot vectors onto an embedded vector space has the potential to bring the embedding vectors of similar words (either semantically or usage-wise) closer to each other, so that the cosine similarity between the corresponding vectors for words like "Mother's Day" and "carnations" is no longer zero. But this traditional method has many problems:
A word embedding model could be a probabilistic model, a co-occurrence matrix model, or a neural network. Before people started using neural networks to generate word embedding, the traditional method was to calculate a co-occurrence matrix $X$ of words. Here, $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ represents the co-occurrence times of the $i$th and $j$th words in the vocabulary `V` across the whole corpus, and $|V|$ is the size of the vocabulary. By performing matrix decomposition on $X$ e.g. Singular Value Decomposition \[[5](#references)\] 1) Since many pairs of words never co-occur, the matrix is extremely sparse, so additional processing of the word frequency is needed to achieve a good matrix decomposition effect;
$$X = USV^T$$ 2) The matrix is very large and the dimensions are too high (usually up to $10^6 \times 10^6$);
the resulting $U$ can be seen as the word embedding of all the words. 3) You need to manually remove the stop words (such as although, a, ...), otherwise these frequently occurring words will also affect the effect of matrix decomposition.
However, this method suffers from many drawbacks: The neural-network-based model does not need to calculate and store a large table that is statistically generated on the whole corpus, but obtains the word vector by learning the semantic information, so the problem above can be well solved. In this chapter, we will show the details of training word vectors based on neural networks and how to train a word embedding model with PaddlePaddle.
1) Since many pairs of words don't co-occur, the co-occurrence matrix is sparse. To achieve good performance of matrix factorization, further treatment on word frequency is needed;
2) The matrix is large, frequently on the order of $10^6*10^6$;
3) We need to manually filter out stop words (like "although", "a", ...), otherwise these frequent words will affect the performance of matrix factorization.
The neural network based model does not require storing huge hash tables of statistics on all of the corpus. It obtains the word embedding by learning from semantic information, hence could avoid the aforementioned problems in the traditional method. In this chapter, we will introduce the details of neural network word embedding model and how to train such model in PaddlePaddle.
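To make the traditional approach above concrete, here is a minimal sketch of how a co-occurrence matrix $X$ might be counted. The function name `cooccurrence_matrix`, the `window` parameter, and the two-sentence toy corpus are assumptions made for illustration; they do not come from the tutorial's code.

```python
from collections import Counter


def cooccurrence_matrix(sentences, window=1):
    """Count how often each pair of words appears within `window` positions.

    Hypothetical helper for illustration only; returns the sorted vocabulary
    and a dense |V| x |V| list-of-lists matrix X.
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = Counter()
    for s in sentences:
        for i, w in enumerate(s):
            lo = max(0, i - window)
            hi = min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(index[w], index[s[j]])] += 1
    size = len(vocab)
    # Counter returns 0 for pairs that never co-occur.
    X = [[counts[(r, c)] for c in range(size)] for r in range(size)]
    return vocab, X


# Toy corpus (an assumption, not the PTB data).
corpus = [["i", "like", "flowers"], ["i", "like", "carnations"]]
vocab, X = cooccurrence_matrix(corpus)
zero_entries = sum(row.count(0) for row in X)
```

Even in this tiny corpus most entries of `X` are zero, which illustrates the sparsity problem listed above. A decomposition such as SVD (for example `numpy.linalg.svd`) would then be applied to `X` to obtain $U$.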
## Results Demonstration ## Result Demo
In this section, we use the t-SNE\[[4](#references)\] data visualization algorithm to draw the word embedding vectors after projecting them onto a two-dimensional space (see figure below). From the figure we can see that the semantically relevant words -- *a*, *the*, and *these* or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business* or *decision* and *japan* -- are far from each other. In this chapter, after the embedding vectors are trained, we can use the data visualization algorithm t-SNE\[[4](#references)\] to draw the projection of the word features in two dimensions (as shown below). As can be seen from the figure, semantically related words (such as a, the, these; big, huge) are very close in the projection, while semantically unrelated words (such as say, business; decision, japan) are far apart.
<p align="center"> <p align="center">
<img src = "image/2d_similarity.png" width=400><br/> <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/2d_similarity.png?raw=true" width=400><br/>
Figure 1. Two dimension projection of word embeddings Figure 1. Two-dimensional projection of a word vector
</p> </p>
### Cosine Similarity On the other hand, we know that the cosine of the angle between two vectors lies in the interval $[-1,1]$: it is 1 for two identical vectors, 0 for two mutually perpendicular vectors, and -1 for vectors pointing in opposite directions. That is, the larger the cosine, the more closely related the two vectors are. So we can also calculate the cosine similarity of two word vectors:
On the other hand, we know that the cosine similarity between two vectors falls in the interval $[-1,1]$. Specifically, the cosine similarity is 1 when the vectors are identical, 0 when the vectors are perpendicular, and -1 when they point in opposite directions. That is, the cosine similarity between two vectors scales with their relevance. So we can calculate the cosine similarity of two word embedding vectors to represent their relevance:
``` ```
please input two words: big huge please input two words: big huge
similarity: 0.899180685161 Similarity: 0.899180685161
please input two words: from company please input two words: from company
similarity: -0.0997506977351 Similarity: -0.0997506977351
```
The above results could be obtained by running `calculate_dis.py`, which loads the words in the dictionary and their corresponding trained word embeddings. For detailed instruction, see section [Model Application](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application).
```
## Model Overview The results above can be obtained by running `calculate_dis.py`, which loads the words in the dictionary and their trained embedding vectors. We describe its usage in detail in [model application](#model application).
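As a rough illustration of the cosine similarity discussed above, the sketch below compares two one-hot vectors with two embedding vectors. It is not the content of `calculate_dis.py`. The helper name `cosine_similarity` is an assumption, and the embedding values are only the first three components quoted earlier in this chapter (the remaining components are elided there), so the numbers are purely illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors of two different words are perpendicular.
one_hot_a = [1, 0, 0]
one_hot_b = [0, 1, 0]

# Truncated example embeddings (assumption: only the three visible components).
emb_a = [0.3, 4.2, -1.5]
emb_b = [0.2, 5.6, -2.3]

sim_one_hot = cosine_similarity(one_hot_a, one_hot_b)
sim_embedding = cosine_similarity(emb_a, emb_b)
```

Here `sim_one_hot` is zero because the vectors share no non-zero entry, while `sim_embedding` is close to 1 because the two embedding vectors point in nearly the same direction.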
In this section, we will introduce three word embedding models: N-gram model, CBOW, and Skip-gram, which all output the probability of each word given its immediate context.
For N-gram model, we will first introduce the concept of language model, and implement it using PaddlePaddle in section [Training](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application). ## Overview of Models
The latter two models, which became popular recently, are neural word embedding models developed by Tomas Mikolov at Google \[[3](#references)\]. Despite their apparent simplicity, these models train very well. Here we introduce three models for training word vectors: the N-gram model, the CBOW model, and the Skip-gram model. Their central idea is to obtain the probability of a word from its context. For the N-gram model, we will first introduce the concept of the language model. In the section [training model](#training model), we will walk you through implementing it with PaddlePaddle. The latter two models are the best-known neural word vector models of recent years, developed by Tomas Mikolov at Google \[[3](#references)\]. Although they are very simple, they train very well.
### Language Model ### Language Model
Before diving into word embedding models, we will first introduce the concept of **language model**. Language models build the joint probability function $P(w_1, ..., w_T)$ of a sentence, where $w_i$ is the i-th word in the sentence. The goal is to give higher probabilities to meaningful sentences, and lower probabilities to meaningless constructions. Before introducing the word embedding model, let us introduce a concept: the language model.
The language model is intended to model the joint probability function $P(w_1, ..., w_T)$ of a sentence, where $w_i$ represents the $i$-th word in the sentence. The goal of the language model is to assign a high probability to meaningful sentences and a low probability to meaningless ones. Such models can be applied to many fields, such as machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition, all of which need the probability of a continuous sequence. Take information retrieval as an example: when you search for "how long is a football bame" (bame is a medical term), the search engine will ask whether you meant "how long is a football game". This is because the language model assigns a very low probability to "how long is a football bame", and among the words similar to "bame" that could have been mistyped, "game" maximizes the probability of the resulting sentence.
In general, models that generate the probability of a sequence can be applied to many fields, like machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition. Take information retrieval, for example. If you were to search for "how long is a football bame" (where bame is a medical noun), the search engine would have asked if you had meant "how long is a football game" instead. This is because the probability of "how long is a football bame" is very low according to the language model; in addition, among all of the words easily confused with "bame", "game" would build the most probable sentence. For the target probability of the language model $P(w_1, ..., w_T)$, if each word in the text is assumed to be independent, the joint probability of the whole sentence can be expressed as the product of the probabilities of the individual words, that is:
#### Target Probability
For language model's target probability $P(w_1, ..., w_T)$, if the words in the sentence were to be independent, the joint probability of the whole sentence would be the product of each word's probability:
$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$
However, the frequency of words in a sentence typically relates to the words before them, so canonical language models are constructed using conditional probability in its target probability: However, we know that the probability of each word in the statement is closely related to the word in front of it, so in fact, the language model is usually represented by conditional probability:
$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
### N-gram neural model ### N-gram neural model
In computational linguistics, n-gram is an important method to represent text. An n-gram represents a contiguous sequence of n consecutive items given a text. Based on the desired application scenario, each item could be a letter, a syllable or a word. The N-gram model is also an important method in statistical language modeling. When training language models with n-grams, the first (n-1) words of an n-gram are used to predict the *n*th word. In computational linguistics, the n-gram is an important text representation method that denotes n consecutive items in a text. Each item can be a letter, a word, or a syllable, depending on the application scenario. The n-gram model is also an important method in statistical language modeling. When n-grams are used to train a language model, the first n-1 words of each n-gram are generally used to predict the nth word.
Yoshua Bengio and other scientists describe how to train a word embedding model using neural network in the famous paper of Neural Probabilistic Language Models \[[1](#references)\] published in 2003. The Neural Network Language Model (NNLM) described in the paper learns the language model and word embedding simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on large amounts of corpus, the model learns the word embedding; then, it computes the probability of the whole sentence, using the embedding. This type of language model can overcome the **curse of dimensionality** i.e. model inaccuracy caused by the difference in dimensionality between training and testing data. Note that the term *neural network language model* is ill-defined, so we will not use the name NNLM but only refer to it as *N-gram neural model* in this section. Yoshua Bengio and his co-authors showed how to learn a neural-network word vector model in the famous 2003 paper Neural Probabilistic Language Models \[[1](#references)\]. The Neural Network Language Model (NNLM) in this paper combines a linear model with a nonlinear hidden layer. It learns the language model and the word vectors simultaneously: it learns vector representations of words from a large corpus, and uses these vectors to compute the probability of an entire sentence. Since every word is represented by a low-dimensional vector, learning the language model this way can overcome the curse of dimensionality.
Note: Because the term "Neural Network Language Model" is rather general, we do not use the name NNLM here; in view of how the model actually works, we call it the N-gram neural model.
We have previously described language model using conditional probability, where the probability of the *t*-th word in a sentence depends on all $t-1$ words before it. Furthermore, since words further prior have less impact on a word, and every word within an n-gram is only affected by its previous n-1 words, we have: As mentioned above, a language model can be expressed with conditional probabilities: the probability of the $t$-th word in a sentence depends on the $t-1$ words before it. In practice, the farther away a word is, the less it influences the current word. If we consider an n-gram, each word is affected only by the preceding `n-1` words, so:
$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$ $$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1 })$$
Given some real corpus in which all sentences are meaningful, the n-gram model should maximize the following objective function: Given some real corpora made up of meaningful sentences, the optimization goal of the N-gram model is to maximize the objective function:
$$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ represents the conditional logarithmic probability of the current word $w_t$ given its previous $n-1$ words, and $R(\theta)$ represents parameter regularization term. Where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ represents the conditional log-probability of the current word $w_t$ given the preceding n-1 words, and $R(\theta )$ represents a parameter regularization term.
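The n-gram factorization above can be illustrated with a simple count-based estimate for $n=2$. Note that this is a plain maximum-likelihood bigram estimate, not the neural model this chapter trains. The function name `bigram_sentence_probability` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def bigram_sentence_probability(sentence, corpus):
    """Product of P(w_t | w_{t-1}) for t = 2..T, estimated from raw counts."""
    context_counts = Counter()
    pair_counts = Counter()
    for s in corpus:
        for prev, cur in zip(s, s[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    prob = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        if context_counts[prev] == 0:
            # Context never seen as a predecessor: no estimate, treat as 0.
            return 0.0
        prob *= pair_counts[(prev, cur)] / context_counts[prev]
    return prob


# Toy corpus (an assumption, not the PTB data).
corpus = [
    ["how", "long", "is", "a", "football", "game"],
    ["how", "long", "is", "a", "movie"],
]
p_game = bigram_sentence_probability(
    ["how", "long", "is", "a", "football", "game"], corpus)
p_bame = bigram_sentence_probability(
    ["how", "long", "is", "a", "football", "bame"], corpus)
```

With these counts the sentence ending in "game" receives a non-zero probability while the one ending in "bame" receives zero, mirroring the search-engine example above. Unseen pairs getting probability zero is one limitation of raw counts that learned word vectors help with.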
<p align="center"> <p align="center">
<img src="image/nnlm_en.png" width=500><br/>        <img src="https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/nnlm.png?raw=true" width=500><br/>
Figure 2. N-gram neural network model        Figure 2. N-gram neural network model
</p> </p>
Figure 2 shows the N-gram neural network model. From the bottom up, the model is divided into the following parts:
- For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs the probability distribution of the $t$-th word over the `|V|` words in the dictionary.
Figure 2 shows the N-gram neural network model. From the bottom up, the model has the following components: Each input word $w_{t-n+1},...w_{t-1}$ is first mapped to its word vector $C(w_{t-n+1}),...C(w_{t-1})$ by the mapping matrix.
- For each sample, the model gets input $w_{t-n+1},...w_{t-1}$, and outputs the probability that the t-th word is one of `|V|` in the dictionary.
Every input word $w_{t-n+1},...w_{t-1}$ first gets transformed into word embedding $C(w_{t-n+1}),...C(w_{t-1})$ through a transformation matrix.
- All the word embeddings are concatenated into a single vector, which is mapped (nonlinearly) into the $t$-th word hidden representation: - Then the word vectors of all words are concatenated into one large vector, and a hidden-layer representation of the preceding words is obtained through a non-linear mapping:
$$g=Utanh(\theta^Tx + b_1) + Wx + b_2$$ $$g=Utanh(\theta^Tx + b_1) + Wx + b_2$$
where $x$ is the large vector concatenated from all the word embeddings representing the context; $\theta$, $U$, $b_1$, $b_2$ and $W$ are parameters connecting word embedding layers to the hidden layers. $g$ represents the unnormalized probability of the output word, $g_i$ represents the unnormalized probability of the output word being the i-th word in the dictionary. Here, $x$ is the large vector concatenated from all the word vectors, representing the features of the preceding text; $\theta$, $U$, $b_1$, $b_2$, and $W$ are the parameters connecting the word vector layer to the hidden layer. $g$ holds the unnormalized scores of all output words, and $g_i$ is the unnormalized score of the $i$-th word in the dictionary.
- Based on the definition of softmax, using normalized $g_i$, the probability that the output word is $w_t$ is represented as: - According to the definition of softmax, by normalizing $g_i$, the probability of generating the target word $w_t$ is:
$$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$ $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$
- The cost of the entire network is a multi-class cross-entropy and can be described by the following loss function - The loss value of the entire network is the multi-class classification cross entropy, which is expressed as
$$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}log(softmax(g_k^i))$$
where $y_k^i$ represents the true label (0 or 1) of the $k$-th class for the $i$-th sample, and $softmax(g_k^i)$ represents the softmax output probability of the $k$-th class for the $i$-th sample.
$$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}log(softmax(g_k^i))$$
where $y_k^i$ represents the true label for the $k$-th class in the $i$-th sample ($0$ or $1$), $softmax(g_k^i)$ represents the softmax probability for the $k$-th class in the $i$-th sample.
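The components listed above can be traced in a small pure-Python sketch of one forward pass: the score $g=U\tanh(\theta^Tx + b_1) + Wx + b_2$, the softmax, and the cross-entropy for a single sample. All sizes and parameter values below are made-up toy numbers, and the helper names (`matvec`, `ngram_forward`, `cross_entropy`) are assumptions. This is a sketch of the formulas, not the PaddlePaddle implementation.

```python
import math


def matvec(M, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(g):
    """Numerically stable softmax over a list of scores."""
    m = max(g)
    exps = [math.exp(x - m) for x in g]
    total = sum(exps)
    return [e / total for e in exps]


def ngram_forward(x, theta_T, b1, U, W, b2):
    """g = U tanh(theta^T x + b1) + W x + b2, followed by softmax."""
    hidden = [math.tanh(h + b) for h, b in zip(matvec(theta_T, x), b1)]
    g = [u + w + b for u, w, b in zip(matvec(U, hidden), matvec(W, x), b2)]
    return softmax(g)


def cross_entropy(probs, target_index):
    """Multi-class cross-entropy for one sample with a one-hot label."""
    return -math.log(probs[target_index])


# Toy sizes (assumptions): concatenated context vector of length 4,
# hidden size 2, vocabulary size |V| = 3.
x = [0.1, -0.2, 0.3, 0.4]
theta_T = [[0.5, 0.1, -0.3, 0.2], [-0.4, 0.2, 0.1, 0.3]]
b1 = [0.0, 0.1]
U = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]
W = [[0.1, 0.0, 0.2, -0.1], [0.0, 0.3, -0.2, 0.1], [0.2, -0.1, 0.0, 0.4]]
b2 = [0.0, 0.0, 0.1]

probs = ngram_forward(x, theta_T, b1, U, W, b2)
loss = cross_entropy(probs, target_index=2)
```

Because the label vector is one-hot, the inner sum over the $|V|$ classes in $J(\theta)$ reduces to the negative log-probability of the target word, which is what `cross_entropy` computes.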
### Continuous Bag-of-Words model(CBOW) ### Continuous Bag-of-Words model(CBOW)
CBOW model predicts the current word based on the N words both before and after it. When $N=2$, the model is as the figure below: The CBOW model predicts the current word from its context (the N words on each side). When N=2, the model is shown below:
<p align="center"> <p align="center">
<img src="image/cbow_en.png" width=250><br/>     <img src="https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/cbow.png?raw=true" width=250><br/>
Figure 3. CBOW model     Figure 3. CBOW model
</p> </p>
Specifically, by ignoring the order of words in the sequence, CBOW uses the average value of the word embedding of the context to predict the current word: Specifically, regardless of the order of the context words, CBOW uses the mean of the word vectors of the context words to predict the current word, that is:
$$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$ $$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
where $x_t$ is the word embedding of the t-th word, classification score vector is $z=U*\text{context}$, the final classification $y$ uses softmax and the loss function uses multi-class cross-entropy. Where $x_t$ is the word vector of the $t$th word, the score vector (score) $z=U\*context$, the final classification $y$ uses softmax, and the loss function uses multi-class classification cross entropy.
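The CBOW computation described above (average the context vectors, score with $U$, apply softmax) can be sketched as follows. The function name `cbow_predict` and all numeric values are assumptions chosen so the arithmetic is easy to follow.

```python
import math


def cbow_predict(context_vectors, U):
    """Average the context word vectors, compute z = U * context, then softmax."""
    dim = len(context_vectors[0])
    count = len(context_vectors)
    context = [sum(v[d] for v in context_vectors) / count for d in range(dim)]
    z = [sum(u * c for u, c in zip(row, context)) for row in U]
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return context, [e / total for e in exps]


# Toy 2-dimensional vectors for x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2} (assumptions).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
# Toy score matrix U for a 3-word vocabulary (assumption).
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

context, probs = cbow_predict(vectors, U)
# Reversing the context order leaves the average, and so the prediction, unchanged.
context_reversed, probs_reversed = cbow_predict(list(reversed(vectors)), U)
```

The second call shows the "bag-of-words" property: the order of the context words does not affect the result, only which words are present.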
### Skip-gram model ### Skip-gram model
The advantage of CBOW is that it smooths over the word embeddings of the context and reduces noise, so it is very effective on small datasets. Skip-gram uses a word to predict its context and gets multiple contexts for the given word, so it can be used on larger datasets. The benefit of CBOW is that the distribution of contextual words is smoothed over the word vector, removing noise. Therefore it is very effective on small data sets. In the Skip-gram method, a word is used to predict its context, and many samples of the current word context are obtained, so it can be used for a larger data set.
<p align="center"> <p align="center">
<img src="image/skipgram_en.png" width=250><br/>     <img src="https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/skipgram.png?raw=true" width=250><br/>
Figure 4. Skip-gram model     Figure 4. Skip-gram model
</p> </p>
As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combines the classification loss of all those $2n$ words by softmax. As shown in the figure above, the Skip-gram model maps the word vector of a word to the word vectors of $2n$ words (the $n$ words before and the $n$ words after the input word), and then obtains the sum of the softmax classification losses over these $2n$ words.
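One way to picture the Skip-gram setup is to list the (input word, context word) training pairs it produces. The sketch below is a hypothetical helper, not code from this tutorial; `skipgram_pairs` and its `n` parameter are illustrative names.

```python
def skipgram_pairs(words, n):
    """Return (center, context) pairs using up to n words on each side."""
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - n)
        stop = min(len(words), i + n + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


pairs = skipgram_pairs(["a", "dream", "that", "one", "day"], n=2)
pairs_for_that = [p for p in pairs if p[0] == "that"]
```

A word in the middle of the sentence, such as "that", yields $2n = 4$ pairs, while words near the edges yield fewer because part of their window falls outside the sentence.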
## Dataset ## Data Preparation
We will use Penn Treebank (PTB) (Tomas Mikolov's pre-processed version) dataset. PTB is a small dataset, used in Recurrent Neural Network Language Modeling Toolkit\[[2](#references)\]. Its statistics are as follows: ### Data Introduction
This tutorial uses the Penn Treebank (PTB) (pre-processed version of Tomas Mikolov) dataset. The PTB data set is small and the training speed is fast. It is applied to Mikolov's open language model training tool \[[2](#references)\]. Its statistics are as follows:
<p align="center"> <p align="center">
<table> <table>
<tr>     <tr>
<td>training set</td>         <td>Training data</td>
<td>validation set</td>         <td>Validation data</td>
<td>test set</td>         <td>Test data</td>
</tr>     </tr>
<tr>     <tr>
<td>ptb.train.txt</td>         <td>ptb.train.txt</td>
<td>ptb.valid.txt</td>         <td>ptb.valid.txt</td>
<td>ptb.test.txt</td>         <td>ptb.test.txt</td>
</tr>     </tr>
<tr>     <tr>
<td>42068 lines</td>         <td>42068 sentences</td>
<td>3370 lines</td>         <td>3370 sentences</td>
<td>3761 lines</td>         <td>3761 sentences</td>
</tr>     </tr>
</table> </table>
</p> </p>
### Python Dataset Module
We encapsulated the PTB Data Set in our Python module `paddle.dataset.imikolov`. This module can
1. download the dataset to `~/.cache/paddle/dataset/imikolov`, if not yet, and ### Data Preprocessing
2. [preprocesses](#preprocessing) the dataset.
### Preprocessing This chapter trains the 5-gram model, which means that the first 4 words of each piece of data are used to predict the 5th word during PaddlePaddle training. PaddlePaddle provides the python package `paddle.dataset.imikolov` corresponding to the PTB dataset, which automatically downloads and preprocesses the data for your convenience.
We will be training a 5-gram model. Given five words in a window, we will predict the fifth word given the first four words. Preprocessing adds the start symbol `<s>` and the end symbol `<e>` to each sentence in the data set. Then, depending on the window size (5 in this tutorial), slide the window to the right each time from start to end and generate a piece of data.
The beginning and end of a sentence have a special meaning, so we add the begin token `<s>` at the front of the sentence and the end token `<e>` at the end. By moving the five word window in the sentence, data instances are generated. For example, "I have a dream that one day" provides 5 pieces of data:
For example, the sentence "I have a dream that one day" generates five data instances:
```text ```text
<s> I have a dream <s> I have a dream
I have a dream that I have a dream that
have a dream that one have a dream that one
a dream that one day a dream that one day
dream that one day <e> dream that one day <e>
``` ```
At last, each data instance will be converted into an integer sequence according to its words' indices inside the dictionary. Finally, based on the position of its word in the dictionary, each input is converted to an index sequence of integers as the input to PaddlePaddle.
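The preprocessing steps described above (add `<s>` and `<e>`, slide a window of 5, map words to integers) can be sketched in plain Python. This is not the code of `paddle.dataset.imikolov`; the helper `ngram_windows` and the toy dictionary are assumptions for illustration.

```python
def ngram_windows(sentence, n=5):
    """Add <s>/<e> markers and return every window of n consecutive tokens."""
    tokens = ["<s>"] + sentence.split() + ["<e>"]
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


windows = ngram_windows("I have a dream that one day")

# Toy dictionary built from this one sentence (an assumption; the real
# dictionary is built from the whole training set).
toy_dict = {w: i for i, w in enumerate(sorted({t for w in windows for t in w}))}
indexed = [[toy_dict[t] for t in w] for w in windows]
```

For this sentence the nine tokens (seven words plus the two markers) give 9 - 5 + 1 = 5 windows, matching the five data instances listed above.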
## Training <a name="training model"></a>
## Program the Model
The neural network that we will be using is illustrated in the graph below: The model structure of this configuration is shown below:
<p align="center"> <p align="center">
<img src="image/ngram.en.png" width=400><br/>     <img src="https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/ngram.png?raw=true" width=400><br/>
Figure 5. N-gram neural network model in model configuration     Figure 5. N-gram neural network model in model configuration
</p> </p>
`word2vec/train.py` demonstrates training word2vec using PaddlePaddle: First, load packages:
### Datafeeder Configuration ```python
Our program starts with importing necessary packages:
- Import packages. from __future__ import print_function
```python import paddle as paddle
import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import six
import numpy import numpy
from functools import partial
import math import math
import os
import six
import sys
from __future__ import print_function
try:
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
``` ```
- Configure parameters and build word dictionary. Then, define the parameters:
```python ```python
EMBED_SIZE = 32 # word vector dimension EMBED_SIZE = 32 # embedding dimensions
HIDDEN_SIZE = 256 # hidden layer dimension HIDDEN_SIZE = 256 # hidden layer size
N = 5 # train 5-gram N = 5 # ngram size, here fixed 5
BATCH_SIZE = 32 # batch size BATCH_SIZE = 100 # batch size
PASS_NUM = 100 # Training rounds
# can use CPU or GPU use_cuda = False # Set to True if trained with GPU
use_cuda = os.getenv('WITH_GPU', '0') != '0'
word_dict = paddle.dataset.imikolov.build_dict() word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict) dict_size = len(word_dict)
``` ```
Unlike the previous PaddlePaddle v2, in the new API (Fluid), we do not need to calculate word embedding ourselves. PaddlePaddle provides a built-in method `fluid.layers.embedding` and we can use it directly to build our N-gram neural network model. A larger `BATCH_SIZE` makes training converge faster, but also consumes more memory. Because training word vectors is computationally expensive, please train on a GPU if your environment allows, to get results faster.
Unlike the previous PaddlePaddle v2 version, in the new Fluid version, we don't have to manually calculate the word vector. PaddlePaddle provides a built-in method `fluid.layers.embedding`, which we can use directly to construct an N-gram neural network.
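Conceptually, an embedding layer is a table lookup: multiplying a one-hot vector by the embedding table selects one row. The sketch below illustrates that equivalence in plain Python. It is not how `fluid.layers.embedding` is implemented, and the names `embed_by_matmul`, `embed_by_lookup`, and the toy table are assumptions.

```python
def one_hot(index, size):
    """One-hot vector of the given size with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed_by_matmul(index, table):
    """Embedding as (one-hot vector) x (embedding table)."""
    oh = one_hot(index, len(table))
    dim = len(table[0])
    return [sum(oh[r] * table[r][d] for r in range(len(table)))
            for d in range(dim)]


def embed_by_lookup(index, table):
    """Embedding as a direct row lookup, which avoids the multiplication."""
    return list(table[index])


# Toy table: 3 words, embedding size 2 (assumption).
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
via_matmul = embed_by_matmul(1, table)
via_lookup = embed_by_lookup(1, table)
```

Both functions return the same vector, but the lookup touches a single row of the table, which is the intuition behind treating the embedding update as a sparse operation.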
- We define our N-gram neural network structure as below. This structure will be used both in `train` and in `infer`. We can specify `is_sparse = True` to accelerate sparse matrix update for word embedding. - Let's define our N-gram neural network structure. This structure is used in both training and prediction. Because the word vector is sparse, we pass the parameter `is_sparse=True` to speed up the update of the sparse matrix.
```python ```python
def inference_program(is_sparse): def inference_program(words, is_sparse):
first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
embed_first = fluid.layers.embedding( embed_first = fluid.layers.embedding(
input=first_word, input=words[0],
size=[dict_size, EMBED_SIZE], size=[dict_size, EMBED_SIZE],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse,
param_attr='shared_w') param_attr='shared_w')
embed_second = fluid.layers.embedding( embed_second = fluid.layers.embedding(
input=second_word, input=words[1],
size=[dict_size, EMBED_SIZE], size=[dict_size, EMBED_SIZE],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse,
param_attr='shared_w') param_attr='shared_w')
embed_third = fluid.layers.embedding( embed_third = fluid.layers.embedding(
input=third_word, input=words[2],
size=[dict_size, EMBED_SIZE], size=[dict_size, EMBED_SIZE],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse,
param_attr='shared_w') param_attr='shared_w')
embed_fourth = fluid.layers.embedding( embed_fourth = fluid.layers.embedding(
input=fourth_word, input=words[3],
size=[dict_size, EMBED_SIZE], size=[dict_size, EMBED_SIZE],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse,
...@@ -295,79 +268,116 @@ def inference_program(is_sparse): ...@@ -295,79 +268,116 @@ def inference_program(is_sparse):
return predict_word return predict_word
``` ```
- As we already defined the N-gram neural network structure in the above, we can use it in our `train` method. - Based on the neural network structure above, we can define our training method as follows:
```python ```python
def train_program(is_sparse): def train_program(predict_word):
# The declaration of 'next_word' must be after the invoking of inference_program, # The definition of 'next_word' must come after the call to inference_program.
# or the data input order of train program would be [next_word, firstw, secondw, # Otherwise the order of the train program's input data becomes [next_word, firstw, secondw,
# thirdw, fourthw], which is not correct. # thirdw, fourthw], which is not correct.
predict_word = inference_program(is_sparse)
next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64') next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64')
cost = fluid.layers.cross_entropy(input=predict_word, label=next_word) cost = fluid.layers.cross_entropy(input=predict_word, label=next_word)
avg_cost = fluid.layers.mean(cost) avg_cost = fluid.layers.mean(cost)
return avg_cost return avg_cost
```
- Now we will begin the training process. It is relatively simple compared to the previous version. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` are our training and test sets. Both functions return a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator, which outputs a single data instance at a time.
`paddle.batch` takes a reader as input and outputs a **batched reader**: in PaddlePaddle, a reader outputs a single data instance at a time, whereas a batched reader outputs a minibatch of data instances.
`event_handler` can be passed into `trainer.train` so that we can do some tasks after each step or epoch. These tasks include recording current metrics or terminating the current training process.
```python
def optimizer_func(): def optimizer_func():
return fluid.optimizer.AdagradOptimizer( return fluid.optimizer.AdagradOptimizer(
learning_rate=3e-3, learning_rate=3e-3,
regularization=fluid.regularizer.L2DecayRegularizer(8e-4)) regularization=fluid.regularizer.L2DecayRegularizer(8e-4))
```
- Now we can start training. This version is much simpler than before. We have ready-made training and test sets: `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()`. Both will return a reader. In PaddlePaddle, the reader is a Python function that reads the next piece of data when called each time . It is a Python generator.
`paddle.batch` will read in a reader and output a batched reader. We can also output the training of each step and batch during the training process.
```python
def train(if_use_cuda, params_dirname, is_sparse=True):
    place = fluid.CUDAPlace(0) if if_use_cuda else fluid.CPUPlace()

    train_reader = paddle.batch(
        paddle.dataset.imikolov.train(word_dict, N), BATCH_SIZE)
    test_reader = paddle.batch(
        paddle.dataset.imikolov.test(word_dict, N), BATCH_SIZE)

    first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
    second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
    third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
    forth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
    next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64')

    word_list = [first_word, second_word, third_word, forth_word, next_word]
    feed_order = ['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw']

    main_program = fluid.default_main_program()
    star_program = fluid.default_startup_program()

    predict_word = inference_program(word_list, is_sparse)
    avg_cost = train_program(predict_word)
    test_program = main_program.clone(for_test=True)

    sgd_optimizer = optimizer_func()
    sgd_optimizer.minimize(avg_cost)

    exe = fluid.Executor(place)

    def train_test(program, reader):
        # Average the cost over every batch of the given reader.
        count = 0
        feed_var_list = [
            program.global_block().var(var_name) for var_name in feed_order
        ]
        feeder_test = fluid.DataFeeder(feed_list=feed_var_list, place=place)
        test_exe = fluid.Executor(place)
        accumulated = len([avg_cost]) * [0]
        for test_data in reader():
            avg_cost_np = test_exe.run(
                program=program,
                feed=feeder_test.feed(test_data),
                fetch_list=[avg_cost])
            accumulated = [
                x[0] + x[1][0] for x in zip(accumulated, avg_cost_np)
            ]
            count += 1
        return [x / count for x in accumulated]

    def train_loop():
        step = 0
        feed_var_list_loop = [
            main_program.global_block().var(var_name) for var_name in feed_order
        ]
        feeder = fluid.DataFeeder(feed_list=feed_var_list_loop, place=place)
        exe.run(star_program)
        for pass_id in range(PASS_NUM):
            for data in train_reader():
                avg_cost_np = exe.run(
                    main_program, feed=feeder.feed(data), fetch_list=[avg_cost])

                # Evaluate on the test set every 10 steps.
                if step % 10 == 0:
                    outs = train_test(test_program, test_reader)

                    print("Step %d: Average Cost %f" % (step, outs[0]))

                    # The full training process takes several hours. If the
                    # average loss drops below 5.8, we consider the model good
                    # enough and stop training.
                    # Note that 5.8 is a relatively high value. To get a better
                    # model, set the threshold to 3.5, at the cost of a longer
                    # training time.
                    if outs[0] < 5.8:
                        if params_dirname is not None:
                            fluid.io.save_inference_model(params_dirname, [
                                'firstw', 'secondw', 'thirdw', 'fourthw'
                            ], [predict_word], exe)
                        return
                step += 1
                if math.isnan(float(avg_cost_np[0])):
                    sys.exit("got NaN loss, training failed.")

        raise AssertionError("Cost is too large {0:2.2}".format(avg_cost_np[0]))

    train_loop()
```
- `train_loop` starts the training. The training log looks similar to the following:

```text
Step 0: Average Cost 7.337213
Step 10: Average Cost 6.136128
Step 20: Average Cost 5.766995
...
```
<a name="model application"></a>
## Model Application

After the model is trained, we can load the saved model and use it for inference.

### Predict the next word

We can use the trained model to predict the next word given the preceding N-gram context (here, the previous four words).
```python
def infer(use_cuda, params_dirname=None):
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

    exe = fluid.Executor(place)

    inference_scope = fluid.core.Scope()
    with fluid.scope_guard(inference_scope):
        # Use fluid.io.load_inference_model to get the inference program,
        # the feed variable names (feed_target_names), and the fetch targets.
        [inferencer, feed_target_names,
         fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)

        # Set up the inputs: 4 LoDTensors representing 4 words. Each word is
        # an id used to look up the corresponding word vector in the embedding
        # table, so its shape is [1].
        # recursive_sequence_lengths is the length-based LoD, so it is set to
        # [[1]]: one LoD level containing one sequence of one word.
        # Note that recursive_sequence_lengths is a list of lists.
        data1 = numpy.asarray([[211]], dtype=numpy.int64)  # 'among'
        data2 = numpy.asarray([[6]], dtype=numpy.int64)  # 'a'
        data3 = numpy.asarray([[96]], dtype=numpy.int64)  # 'group'
        data4 = numpy.asarray([[4]], dtype=numpy.int64)  # 'of'
        lod = numpy.asarray([[1]], dtype=numpy.int64)

        first_word = fluid.create_lod_tensor(data1, lod, place)
        second_word = fluid.create_lod_tensor(data2, lod, place)
        third_word = fluid.create_lod_tensor(data3, lod, place)
        fourth_word = fluid.create_lod_tensor(data4, lod, place)

        assert feed_target_names[0] == 'firstw'
        assert feed_target_names[1] == 'secondw'
        assert feed_target_names[2] == 'thirdw'
        assert feed_target_names[3] == 'fourthw'

        # Construct the feed dictionary {feed_target_name: feed_target_data}.
        # The prediction results are returned in results.
        results = exe.run(
            inferencer,
            feed={
                feed_target_names[0]: first_word,
                feed_target_names[1]: second_word,
                feed_target_names[2]: third_word,
                feed_target_names[3]: fourth_word
            },
            fetch_list=fetch_targets,
            return_numpy=False)

        print(numpy.array(results[0]))
        most_possible_word_index = numpy.argmax(results[0])
        print(most_possible_word_index)
        print([
            key for key, value in six.iteritems(word_dict)
            # ... (lines elided in the source diff)
        ][0])
```
Because the word vector matrix is relatively sparse, training takes a long time to reach a given accuracy. To show results quickly, this tutorial trains for only a few rounds and ends with the result below. The model predicts that the next word after `among a group of` is `the`, which is grammatically plausible. With longer training, on the order of several hours, the predicted next word becomes `workers`. The predicted output has the following format:

```text
[[0.03768077 0.03463154 0.00018074 ... 0.00022283 0.00029888 0.02967956]]
0
the
```

The first line is the predicted probability distribution over the words in the dictionary, the second line is the id of the word with the highest probability, and the third line is that word.
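The three output lines can be reproduced on toy data. The following plain-Python sketch picks the most probable index from a probability vector and maps it back to a word through a word-to-id dictionary; `most_probable_word`, `toy_word_dict`, and `toy_probs` are hypothetical and are not real model output.

```python
def most_probable_word(probabilities, word_dict):
    """Return (index, word) for the highest-probability entry."""
    index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    # word_dict maps word -> id, so search for the key with the matching id.
    words = [word for word, idx in word_dict.items() if idx == index]
    return index, words[0]


# Hypothetical 4-word dictionary and probability vector.
toy_word_dict = {'the': 0, 'a': 1, 'workers': 2, 'group': 3}
toy_probs = [0.55, 0.25, 0.15, 0.05]

index, word = most_probable_word(toy_probs, toy_word_dict)
print(toy_probs)
print(index)
print(word)
```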
The entry point of the program is simple:
```python
def main(use_cuda, is_sparse):
    # ... (lines elided in the source diff)

    params_dirname = "word2vec.inference.model"

    train(
        if_use_cuda=use_cuda,
        params_dirname=params_dirname,
        is_sparse=is_sparse)

    infer(use_cuda=use_cuda, params_dirname=params_dirname)


main(use_cuda=use_cuda, is_sparse=True)
```
## Conclusion

In this chapter, we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training neural network models. In information retrieval, we can judge the relevance between query and document keywords based on the cosine similarity of their vectors. In syntactic and semantic analysis, trained word vectors can be used to initialize a model for better results. In document classification, word vectors can be clustered to group synonyms in a document, and an N-gram model can be used to predict the next word. We hope that readers can readily apply word vectors in related research after reading this chapter.

<a name="references"></a>
## References
1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. Journal of Machine Learning Research, 2003, 3(Feb): 1137-1155.
2. Mikolov T, Kombrink S, Deoras A, et al. [Rnnlm-recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201.
3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013.
4. Maaten L, Hinton G. [Visualizing data using t-SNE](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf)[J]. Journal of Machine Learning Research, 2008, 9(Nov): 2579-2605.
5. https://en.wikipedia.org/wiki/Singular_value_decomposition

<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
### This file is only used for continuous evaluation test!
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi
train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True, desc='train cost')
tracking_kpis = [train_cost_kpi]


def parse_log(log):
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        print(fs)
        if len(fs) == 3 and fs[0] == 'kpis':
            kpi_name = fs[1]
            kpi_value = float(fs[2])
            yield kpi_name, kpi_value


def log_to_ce(log):
    kpi_tracker = {}
    for kpi in tracking_kpis:
        kpi_tracker[kpi.name] = kpi
    for (kpi_name, kpi_value) in parse_log(log):
        print(kpi_name, kpi_value)
        kpi_tracker[kpi_name].add_record(kpi_value)
        kpi_tracker[kpi_name].persist()


if __name__ == '__main__':
    log = sys.stdin.read()
    log_to_ce(log)
...@@ -45,6 +45,18 @@ ...@@ -45,6 +45,18 @@
The source code of this tutorial is in [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec). For first-time users, please refer to the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).

### Notes

1. This tutorial can run in both CPU and GPU environments.
2. CUDA/cuDNN versions supported by the Docker image
If you run the Book with Docker, note that the default image provides a GPU environment with CUDA 8/cuDNN 5. For GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in this document and the script
To make this document easier to read and use, we split and adjusted the code of [train.py](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/train.py) and placed it in this document. The code here produces the same results as train.py, which you can run directly to verify.

## Background

In this chapter, we introduce the vector representation of words, also known as word embedding. Word vectors are widely used in natural language processing and are a common foundational technology behind Internet services such as search engines, advertising systems, and recommender systems.
...@@ -58,7 +70,9 @@ One-hot vector虽然自然,但是用处有限。比如,在互联网广告系 ...@@ -58,7 +70,9 @@ One-hot vector虽然自然,但是用处有限。比如,在互联网广告系
A word vector model can be a probabilistic model, a co-occurrence matrix model, or a neural network model. Before neural networks were used to compute word vectors, the traditional approach was to build a co-occurrence matrix $X$ of words. $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ is the number of times the $i$-th and $j$-th words of the vocabulary $V$ occur together across the whole corpus, and $|V|$ is the vocabulary size. Applying a matrix decomposition to $X$ (such as Singular Value Decomposition \[[5](#参考文献)\]) yields a matrix $U$ that is treated as the word vectors of all words:

$$X = USV^T$$

<p align="center">
    <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn1.gif?raw=true"><br/>
</p>

However, this traditional approach has many problems:
...@@ -107,11 +121,17 @@ similarity: -0.0997506977351 ...@@ -107,11 +121,17 @@ similarity: -0.0997506977351
For the target probability of a language model, $P(w_1, ..., w_T)$, if we assume that the words in a text are mutually independent, the joint probability of the whole sentence can be written as the product of the probabilities of its words:

$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$

<p align="center">
    <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn2.gif?raw=true"><br/>
</p>

However, the probability of each word in a sentence is closely related to the words before it, so in practice the language model is usually expressed with conditional probabilities:

$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$

<p align="center">
    <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn3.gif?raw=true"><br/>
</p>
...@@ -123,11 +143,16 @@ Yoshua Bengio等科学家就于2003年在著名论文 Neural Probabilistic Langu ...@@ -123,11 +143,16 @@ Yoshua Bengio等科学家就于2003年在著名论文 Neural Probabilistic Langu
As described above, a language model can be built with conditional probabilities: the probability of the $t$-th word in a sentence depends on the preceding $t-1$ words. In practice, the farther away a word is, the less it influences the current word. If we consider an n-gram, where each word is influenced only by the `n-1` words before it, we have:

$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$

<p align="center">
    <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn4.gif?raw=true"><br/>
</p>

Given a real corpus consisting of meaningful sentences, the optimization goal of the N-gram model is to maximize the objective function:

$$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$

<p align="center">
    <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn5.gif?raw=true"><br/>
</p>

Here $f(w_t, w_{t-1}, ..., w_{t-n+1})$ is the conditional probability of the current word $w_t$ given the previous n-1 words, and $R(\theta)$ is the parameter regularization term.
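To make the n-gram factorization concrete, the sketch below slides a window of `n` tokens over one sentence and splits each window into `n-1` context words and one target word. It is an illustration only: `ngram_windows` is a hypothetical helper, and padding the sentence with `<s>` and `<e>` markers is an assumption of this sketch.

```python
def ngram_windows(sentence, n=5):
    """Return every window of n consecutive tokens, padded with <s> and <e>."""
    tokens = ['<s>'] + sentence.split() + ['<e>']
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


windows = ngram_windows("I have a dream that one day")
for window in windows:
    # The first n-1 words are the history; the last word is the one to predict.
    context, target = window[:-1], window[-1]
    print(context, '->', target)
```

Each printed line corresponds to one factor $P(w_t|w_{t-1}, ..., w_{t-n+1})$ in the product above.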
...@@ -143,20 +168,25 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$ ...@@ -143,20 +168,25 @@ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
- Then the word vectors of all words are concatenated into one large vector, which passes through a nonlinear mapping to produce the hidden representation of the history words:

$$g = U\tanh(\theta^Tx + b_1) + Wx + b_2$$

<p align="center">
    <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn6.gif?raw=true"><br/>
</p>

Here $x$ is the large vector formed by concatenating the word vectors of all words and represents the text history features; $\theta$, $U$, $b_1$, $b_2$, and $W$ are the parameters connecting the word vector layer to the hidden layer. $g$ is the vector of unnormalized probabilities of all output words, and $g_i$ is the unnormalized output probability of the $i$-th word in the dictionary.

- By the definition of softmax, normalizing $g_i$ gives the probability of generating the target word $w_t$:

$$P(w_t | w_1, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$

<p align="center">
    <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn7.gif?raw=true"><br/>
</p>

- The loss (cost) of the whole network is the multi-class cross-entropy:

$$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log(\text{softmax}(g_k^i))$$

<p align="center">
    <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn8.gif?raw=true"><br/>
</p>

Here $y_k^i$ is the true label (0 or 1) of the $i$-th sample for class $k$, and $\text{softmax}(g_k^i)$ is the softmax output probability of the $i$-th sample for class $k$.
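The softmax normalization and the cross-entropy loss above can be sketched in plain Python for a single sample. The scores in `g` are hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability step that is not part of the formulas above.

```python
import math


def softmax(scores):
    """Normalize unnormalized scores g into probabilities."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target_index):
    """Cross-entropy of one sample whose true label is one-hot at target_index."""
    return -math.log(softmax(scores)[target_index])


g = [2.0, 1.0, 0.1]  # hypothetical scores over a 3-word vocabulary
probs = softmax(g)
loss = cross_entropy(g, target_index=0)
print(probs, loss)
```

Because the label $y_k^i$ is one-hot, the inner sum over $k$ keeps only the term for the true word, which is why `cross_entropy` reads a single entry of the softmax output.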
### Continuous Bag-of-Words model (CBOW)
...@@ -170,7 +200,10 @@ CBOW模型通过一个词的上下文(各N个词)预测当前词。当N=2时 ...@@ -170,7 +200,10 @@ CBOW模型通过一个词的上下文(各N个词)预测当前词。当N=2时
Specifically, ignoring the input order of the context words, CBOW uses the mean of the context words' vectors to predict the current word:

$$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$

<p align="center">
    <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/Eqn9.gif?raw=true"><br/>
</p>

Here $x_t$ is the word vector of the $t$-th word, the classification score vector is $z=U*context$, the final classification $y$ uses softmax, and the loss function is the multi-class cross-entropy.
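The CBOW averaging step and the score vector $z=U*context$ can be sketched with plain Python lists. All values are hypothetical toy numbers: four 2-dimensional context vectors and a matrix `U` with one row per vocabulary word.

```python
def cbow_scores(context_vectors, U):
    """Average the context word vectors, then compute z = U * context."""
    dim = len(context_vectors[0])
    context = [sum(v[d] for v in context_vectors) / len(context_vectors)
               for d in range(dim)]
    z = [sum(row[d] * context[d] for d in range(dim)) for row in U]
    return context, z


# Hypothetical vectors for x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}.
context_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
# Hypothetical U with one row per vocabulary word (|V| = 3).
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

context, z = cbow_scores(context_vectors, U)
print(context, z)
```

Since the context vectors are averaged, reordering them leaves `context` and `z` unchanged, which is the sense in which CBOW ignores word order.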
...@@ -245,14 +278,14 @@ dream that one day <e> ...@@ -245,14 +278,14 @@ dream that one day <e>
```python
from __future__ import print_function
import paddle
import paddle.fluid as fluid
import six
import numpy
import math
```

Then, define the parameters:
...@@ -453,11 +486,11 @@ def infer(use_cuda, params_dirname=None): ...@@ -453,11 +486,11 @@ def infer(use_cuda, params_dirname=None):
        # used to look up the corresponding word vector in the embedding table, so its shape is [1].
        # recursive_sequence_lengths sets the length-based LoD, so each should be set to [[1]]
        # Note that recursive_sequence_lengths is a list of lists
        data1 = numpy.asarray([[211]], dtype=numpy.int64)  # 'among'
        data2 = numpy.asarray([[6]], dtype=numpy.int64)  # 'a'
        data3 = numpy.asarray([[96]], dtype=numpy.int64)  # 'group'
        data4 = numpy.asarray([[4]], dtype=numpy.int64)  # 'of'
        lod = numpy.asarray([[1]], dtype=numpy.int64)

        first_word = fluid.create_lod_tensor(data1, lod, place)
        second_word = fluid.create_lod_tensor(data2, lod, place)
...@@ -533,7 +566,7 @@ main(use_cuda=use_cuda, is_sparse=True) ...@@ -533,7 +566,7 @@ main(use_cuda=use_cuda, is_sparse=True)
5. https://en.wikipedia.org/wiki/Singular_value_decomposition

<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is created by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -40,289 +40,262 @@ ...@@ -40,289 +40,262 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Word Vector

The source code of this tutorial is in [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec). For new users, please refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).

## Background

In this chapter, we introduce the vector representation of words, also known as word embedding. Word embedding is a common technique in natural language processing and underlies Internet services such as search engines, advertising systems, and recommendation systems.

In these Internet services, we often need to measure the relevance between two words or two passages of text. To make such comparisons, words must be represented in a form suitable for computation. The most natural approach is the vector space model, in which each word is represented as a one-hot vector whose length equals the dictionary size: each dimension corresponds to one word in the dictionary, the entry for the word itself is 1, and all other entries are 0.

One-hot vectors are intuitive but limited. Take an Internet advertising system as an example: suppose a user enters the query "Mother's Day" and an ad bids on the keyword "carnation". Common sense tells us the two are related, since people often give their mothers carnations on Mother's Day. Yet the one-hot vectors of the two words are orthogonal, so by either Euclidean distance or cosine similarity they appear unrelated. The root cause is that each word on its own carries too little information, so two words alone are not enough to judge whether they are related. To compute relevance accurately we need more information: knowledge learned from large amounts of data through machine learning.

In machine learning, all kinds of "knowledge" are represented by models, and the word embedding model is one of them. It maps a one-hot vector to a lower-dimensional embedding vector, for example $embedding(Mother's Day) = [0.3, 4.2, -1.5, ...], embedding(carnation) = [0.2, 5.6, -2.3, ...]$. In the embedding space, we want words that are similar in meaning or usage to have similar vectors, so that the cosine similarity between the vectors of "Mother's Day" and "carnation" is no longer zero.
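The contrast between one-hot vectors and dense embeddings can be checked with a few lines of plain Python. The dictionary size and word indices are hypothetical, and the dense vectors are the illustrative values from the paragraph above, truncated to three dimensions; they are not trained embeddings.

```python
import math


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 5-word dictionary; the indices are arbitrary.
sim_one_hot = cosine_similarity(one_hot(1, 5), one_hot(3, 5))

# Illustrative dense vectors for "Mother's Day" and "carnation".
sim_dense = cosine_similarity([0.3, 4.2, -1.5], [0.2, 5.6, -2.3])

print(sim_one_hot, sim_dense)
```

Any two distinct one-hot vectors have a dot product of zero, so their cosine similarity is always zero no matter how related the words are, while the dense vectors yield a similarity close to 1.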
A word embedding model can be a probabilistic model, a co-occurrence matrix model, or a neural network model. Before neural networks were used to compute word vectors, the traditional approach was to build a co-occurrence matrix $X$ of words. $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ is the number of times the $i$-th and $j$-th words of the vocabulary $V$ occur together across the whole corpus, and $|V|$ is the vocabulary size. Applying a matrix decomposition to $X$, such as Singular Value Decomposition \[[5](#references)\], yields a matrix $U$ that is treated as the word vectors of all words:

$$X = USV^T$$

However, this traditional approach has several problems:

1) Since many word pairs never co-occur, the matrix is extremely sparse, so additional processing of word frequencies is needed for the decomposition to work well;

2) The matrix is very large and high-dimensional (often on the order of $10^6 \times 10^6$);

3) Stop words (such as although, a, ...) must be removed manually; otherwise these frequent words will also degrade the matrix decomposition.

A neural-network-based model does not need to compute and store a large table of statistics over the whole corpus. Instead, it obtains word vectors by learning semantic information, which avoids the problems above. In this chapter, we present the details of neural-network-based word vector training and show how to train a word embedding model with PaddlePaddle.
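For comparison with the neural approach, the co-occurrence matrix $X$ from the traditional method can be built by simple counting. The sketch below is a plain-Python illustration under stated assumptions: two words "co-occur" when they appear within `window` positions of each other in the same sentence, and `cooccurrence_matrix` and the toy corpus are hypothetical. The decomposition step ($X = USV^T$) is not shown.

```python
def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in vocab]
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[index[w]][index[words[j]]] += 1
    return vocab, index, X


corpus = ["give mother carnations", "mother likes carnations"]
vocab, index, X = cooccurrence_matrix(corpus)
print(vocab)
for row in X:
    print(row)
```

Even in this tiny corpus some entries stay at zero ("give" and "likes" never share a sentence). With a realistic vocabulary most entries are zero, which is the sparsity problem listed above.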
## Results Demonstration ## Result Demo
In this section, we use the t-SNE\[[4](#references)\] data visualization algorithm to draw the word embedding vectors after projecting them onto a two-dimensional space (see figure below). From the figure we can see that the semantically relevant words -- *a*, *the*, and *these* or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business* or *decision* and *japan* -- are far from each other. In this chapter, after the embedding vectors are trained, we can use the data visualization algorithm t-SNE\[[4](#references)\] to draw the projection of the word features in two dimensions (as shown below). As can be seen from the figure, semantically related words (such as a, the, these; big, huge) are very close in the projection, and semantically unrelated words (such as say, business; decision, japan) are far apart in the projection.
<p align="center"> <p align="center">
<img src = "image/2d_similarity.png" width=400><br/> <img src = "https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/2d_similarity.png?raw=true" width=400><br/>
Figure 1. Two-dimensional projection of word embeddings Figure 1. Two-dimensional projection of word vectors
</p> </p>
### Cosine Similarity On the other hand, we know that the cosine of two vectors lies in the interval $[-1,1]$: it is 1 for two vectors pointing in the same direction, 0 for two mutually perpendicular vectors, and -1 for two vectors pointing in opposite directions. The larger the cosine, the more closely related the two vectors are. So we can also calculate the cosine similarity of two word vectors:
On the other hand, we know that the cosine similarity between two vectors falls within $[-1,1]$. Specifically, the cosine similarity is 1 when the vectors point in the same direction, 0 when the vectors are perpendicular, and -1 when they point in opposite directions. That is, the cosine similarity between two vectors scales with their relevance. So we can calculate the cosine similarity of two word embedding vectors to represent their relevance:
``` ```
please input two words: big huge please input two words: big huge
similarity: 0.899180685161 Similarity: 0.899180685161
please input two words: from company please input two words: from company
similarity: -0.0997506977351 Similarity: -0.0997506977351
```
The above results could be obtained by running `calculate_dis.py`, which loads the words in the dictionary and their corresponding trained word embeddings. For detailed instruction, see section [Model Application](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application).
```
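The three reference values of the cosine mentioned above (1, 0 and -1) can be checked with a small standalone function. This is a minimal sketch in plain Python; it is not the code of `calculate_dis.py`, and the helper name `cosine_similarity` is chosen only for this example.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
same = cosine_similarity(v, [2 * x for x in v])            # same direction
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # right angle
opposite = cosine_similarity(v, [-x for x in v])           # opposite direction
```

Up to floating-point rounding, `same` is 1, `perpendicular` is 0 and `opposite` is -1. Note that scaling a vector does not change the result, so only the direction of a word vector matters for this measure.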
## Model Overview The results above can be obtained by running `calculate_dis.py`, loading the words in the dictionary and the corresponding training feature results. We will describe the usage for details in [model application](#model application).
In this section, we will introduce three word embedding models: N-gram model, CBOW, and Skip-gram, which all model the probability of a word given its context.
For N-gram model, we will first introduce the concept of language model, and implement it using PaddlePaddle in section [Training](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application). ## Overview of Models
The latter two models, which became popular recently, are neural word embedding models developed by Tomas Mikolov at Google \[[3](#references)\]. Despite their apparent simplicity, these models train very well. Here we introduce three models for training word vectors: the N-gram model, the CBOW model and the Skip-gram model. Their central idea is to estimate the probability of a word from its context. For the N-gram model, we will first introduce the concept of the language model. In the section [training model](#training model), we will show how to implement it with PaddlePaddle. The latter two models are the best-known neural word vector models of recent years, developed by Tomas Mikolov at Google \[[3](#references)\]; although they are very simple, they train very well.
### Language Model ### Language Model
Before diving into word embedding models, we will first introduce the concept of **language model**. Language models build the joint probability function $P(w_1, ..., w_T)$ of a sentence, where $w_i$ is the i-th word in the sentence. The goal is to give higher probabilities to meaningful sentences, and lower probabilities to meaningless constructions. Before introducing the word embedding model, let us introduce a concept: the language model.
The language model is intended to model the joint probability function $P(w_1, ..., w_T)$ of a sentence, where $w_i$ represents the $i$-th word in the sentence. The goal of the language model is to give a high probability to meaningful sentences and a low probability to meaningless ones. Such models can be applied to many fields, such as machine translation, speech recognition, information retrieval, part-of-speech tagging and handwriting recognition, all of which need the probability of a word sequence. Take information retrieval as an example: when you search for "how long is a football bame" (bame is a medical term), the search engine will ask whether you meant "how long is a football game". This is because the language model assigns "how long is a football bame" a very low probability, and among the words easily confused with "bame", "game" yields the most probable sentence.
In general, models that generate the probability of a sequence can be applied to many fields, like machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition. Take information retrieval, for example. If you were to search for "how long is a football bame" (where bame is a medical noun), the search engine would ask whether you meant "how long is a football game" instead. This is because the probability of "how long is a football bame" is very low according to the language model; in addition, among all of the words easily confused with "bame", "game" would build the most probable sentence. For the target probability of the language model $P(w_1, ..., w_T)$, if the words in the text are assumed to be independent, the joint probability of the whole sentence can be expressed as the product of the probabilities of the individual words, that is:
#### Target Probability
For language model's target probability $P(w_1, ..., w_T)$, if the words in the sentence were to be independent, the joint probability of the whole sentence would be the product of each word's probability:
$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$
However, the probability of a word in a sentence typically depends on the words before it, so canonical language models express the target probability using conditional probabilities: However, we know that the probability of each word in a sentence is closely related to the words before it, so in practice the language model is usually expressed with conditional probabilities:
$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
### N-gram neural model ### N-gram neural model
In computational linguistics, n-gram is an important method to represent text. An n-gram is a contiguous sequence of n items in a text. Based on the desired application scenario, each item could be a letter, a syllable or a word. The N-gram model is also an important method in statistical language modeling. When training language models with n-grams, the first (n-1) words of an n-gram are used to predict the *n*th word. In computational linguistics, n-gram is an important text representation method that represents a contiguous sequence of n items in a text. Each item can be a letter, a word or a syllable, depending on the application scenario. The n-gram model is also an important method in statistical language modeling. When n-grams are used to train a language model, the nth word is generally predicted from the first n-1 words of each n-gram.
Yoshua Bengio and other scientists describe how to train a word embedding model using a neural network in the famous paper Neural Probabilistic Language Models \[[1](#references)\] published in 2003. The Neural Network Language Model (NNLM) described in the paper learns the language model and word embedding simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on a large corpus, the model learns the word embedding; then, it computes the probability of the whole sentence using the embedding. This type of language model can overcome the **curse of dimensionality**, i.e. the problem that word sequences seen at test time are likely to differ from every sequence seen during training. Note that the term *neural network language model* is ill-defined, so we will not use the name NNLM but only refer to it as *N-gram neural model* in this section. Yoshua Bengio and his co-authors showed how to learn word vectors with a neural network in the well-known 2003 paper Neural Probabilistic Language Models \[[1](#references)\]. The Neural Network Language Model (NNLM) in this paper combines a linear transformation with a nonlinear hidden layer. It learns the language model and the word vectors simultaneously, that is, it obtains vector representations of the words by training on a large corpus and then uses these vectors to compute the probability of an entire sentence. Since all words are represented by low-dimensional vectors, learning the language model in this way can overcome the curse of dimensionality.
Note: Because the term "Neural Network Language Model" is very general, we do not use the name NNLM here; given how the model actually works, we call it the N-gram neural model.
We have previously described the language model using conditional probability, where the probability of the *t*-th word in a sentence depends on all $t-1$ words before it. Since more distant words have less influence, an n-gram model assumes that each word is affected only by its previous n-1 words, which gives: We have already described the language model in terms of conditional probability, that is, the probability of the $t$-th word in a sentence depends on the first $t-1$ words of the sentence. In practice, the farther away a word is, the smaller its influence, so an n-gram model assumes that each word is affected only by the preceding `n-1` words:
$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$ $$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1 })$$
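To make the n-gram factorization above concrete, the sketch below estimates each factor $P(w_t|w_{t-1}, ..., w_{t-n+1})$ by simple counting and sums the logarithms. Note the assumptions: this is a count-based estimate without smoothing, not the neural model introduced in this section, and the function name `ngram_log_prob` and the toy corpus are made up for the example.

```python
from collections import Counter
import math

def ngram_log_prob(sentence, corpus, n=3):
    """Log of prod_{t=n}^{T} P(w_t | previous n-1 words), estimated by counts."""
    context_counts = Counter()
    ngram_counts = Counter()
    for sent in corpus:
        for t in range(n - 1, len(sent)):
            context = tuple(sent[t - n + 1:t])
            context_counts[context] += 1
            ngram_counts[context + (sent[t],)] += 1

    log_prob = 0.0
    for t in range(n - 1, len(sentence)):
        context = tuple(sentence[t - n + 1:t])
        count = ngram_counts[context + (sentence[t],)]
        if count == 0:
            return float("-inf")  # unseen n-gram; this sketch has no smoothing
        log_prob += math.log(count / context_counts[context])
    return log_prob

toy_corpus = [
    "i like green tea".split(),
    "i like black tea".split(),
    "i like green apples".split(),
]
seen = ngram_log_prob("i like green tea".split(), toy_corpus, n=3)
unseen = ngram_log_prob("i like red tea".split(), toy_corpus, n=3)
```

Here `seen` equals $\log(\frac{2}{3} \cdot \frac{1}{2})$, while `unseen` is negative infinity because the trigram "i like red" never occurs. Counting cannot generalize to unseen word combinations, which is one motivation for learning word vectors instead.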
Given some real corpus in which all sentences are meaningful, the n-gram model should maximize the following objective function: Given some real corpora, these corpora are meaningful sentences, and the optimization goal of the N-gram model is to maximize the objective function:
$$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$ $$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$
where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ represents the conditional logarithmic probability of the current word $w_t$ given its previous $n-1$ words, and $R(\theta)$ represents the parameter regularization term. Here $f(w_t, w_{t-1}, ..., w_{t-n+1})$ represents the conditional log-probability of the current word $w_t$ given the previous n-1 words, and $R(\theta)$ represents the parameter regularization term.
<p align="center"> <p align="center">
<img src="image/nnlm_en.png" width=500><br/>        <img src="https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/nnlm.png?raw=true" width=500><br/>
Figure 2. N-gram neural network model        Figure 2. N-gram neural network model
</p> </p>
Figure 2 shows the N-gram neural network model. From the bottom up, the model is divided into the following parts:
- For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs the probability distribution of the $t$-th word over the `|V|` words of the dictionary.
Figure 2 shows the N-gram neural network model. From the bottom up, the model has the following components: Each input word $w_{t-n+1},...w_{t-1}$ is first mapped to its word vector $C(w_{t-n+1}),...C(w_{t-1})$ by the mapping matrix.
- For each sample, the model gets input $w_{t-n+1},...w_{t-1}$, and outputs the probability that the t-th word is one of `|V|` in the dictionary.
Every input word $w_{t-n+1},...w_{t-1}$ first gets transformed into word embedding $C(w_{t-n+1}),...C(w_{t-1})$ through a transformation matrix.
- All the word embeddings concatenate into a single vector, which is mapped (nonlinearly) into the $t$-th word hidden representation: - Then the word vectors of all words are spliced into a large vector, and a hidden layer representation of the historical words is obtained through a non-linear mapping:
$$g=Utanh(\theta^Tx + b_1) + Wx + b_2$$ $$g=Utanh(\theta^Tx + b_1) + Wx + b_2$$
where $x$ is the large vector concatenated from all the word embeddings representing the context; $\theta$, $U$, $b_1$, $b_2$ and $W$ are parameters connecting word embedding layers to the hidden layers. $g$ represents the unnormalized probability of the output word, $g_i$ represents the unnormalized probability of the output word being the i-th word in the dictionary. Here $x$ is the large vector concatenated from all the word vectors and represents the text history; $\theta$, $U$, $b_1$, $b_2$, and $W$ are the parameters connecting the word vector layer to the hidden layer. $g$ holds the unnormalized scores of all output words, and $g_i$ is the unnormalized score of the $i$-th word in the dictionary.
- Based on the definition of softmax, using normalized $g_i$, the probability that the output word is $w_t$ is represented as: - According to the definition of softmax, by normalizing $g_i$, the probability of generating the target word $w_t$ is:
$$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$ $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_i^{|V|} e^{g_i}}$$
- The cost of the entire network is a multi-class cross-entropy and can be described by the following loss function - The loss value of the entire network is the multi-class classification cross entropy, which is expressed as
$$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}log(softmax(g_k^i))$$
where $y_k^i$ represents the true label (0 or 1) of the $k$-th class for the $i$-th sample, and $softmax(g_k^i)$ represents the softmax output probability of the $k$-th class for the $i$-th sample.
$$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}log(softmax(g_k^i))$$
where $y_k^i$ represents the true label for the $k$-th class in the $i$-th sample ($0$ or $1$), $softmax(g_k^i)$ represents the softmax probability for the $k$-th class in the $i$-th sample.
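The forward computation listed above (embedding lookup, concatenation, $g=Utanh(\theta^Tx + b_1) + Wx + b_2$, softmax, cross entropy) can be traced with NumPy. This is a rough sketch with randomly initialized parameters; the sizes `V`, `m`, `h` and the example word indices are assumptions for illustration and differ from the PaddlePaddle configuration used later.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, h, n = 10, 4, 8, 5  # toy vocabulary size, embedding dim, hidden dim, n-gram size

C = rng.normal(scale=0.1, size=(V, m))                # word embedding table C
theta = rng.normal(scale=0.1, size=((n - 1) * m, h))  # input-to-hidden weights
b1 = np.zeros(h)
U = rng.normal(scale=0.1, size=(V, h))                # hidden-to-output weights
W = rng.normal(scale=0.1, size=(V, (n - 1) * m))      # direct input-to-output weights
b2 = np.zeros(V)

def forward(context_ids):
    """Return the softmax distribution over the next word given n-1 context ids."""
    x = C[context_ids].reshape(-1)                  # concatenate the n-1 embeddings
    g = U @ np.tanh(theta.T @ x + b1) + W @ x + b2  # unnormalized scores g
    g = g - g.max()                                 # shift for numerical stability
    return np.exp(g) / np.exp(g).sum()

probs = forward([1, 2, 3, 4])  # four context words, given as dictionary indices
loss = -np.log(probs[5])       # cross entropy if the true next word has index 5
```

Because the label $y_k^i$ is one-hot, the double sum in $J(\theta)$ reduces, for a single sample, to the negative log-probability of the true word, which is what the last line computes.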
### Continuous Bag-of-Words model(CBOW) ### Continuous Bag-of-Words model(CBOW)
The CBOW model predicts the current word based on the N words before it and the N words after it. When $N=2$, the model is as shown in the figure below: The CBOW model predicts the current word from its context (N words on each side). When N=2, the model is shown below:
<p align="center"> <p align="center">
<img src="image/cbow_en.png" width=250><br/>     <img src="https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/cbow.png?raw=true" width=250><br/>
Figure 3. CBOW model     Figure 3. CBOW model
</p> </p>
Specifically, by ignoring the order of words in the sequence, CBOW uses the average value of the word embedding of the context to predict the current word: Specifically, regardless of the order of the context words, CBOW uses the mean of the word vectors of the context words to predict the current word, that is:
$$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$ $$context = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$
where $x_t$ is the word embedding of the t-th word, classification score vector is $z=U*\text{context}$, the final classification $y$ uses softmax and the loss function uses multi-class cross-entropy. Where $x_t$ is the word vector of the $t$th word, the score vector (score) $z=U\*context$, the final classification $y$ uses softmax, and the loss function uses multi-class classification cross entropy.
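As a small illustration of the CBOW computation above, the sketch below averages the context word vectors, scores every vocabulary word with $z=U*\text{context}$ and applies softmax. The parameter values are random and the sizes are toy assumptions; the names `E` and `cbow_probs` are introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m = 10, 4                 # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, m))  # word vectors x, one row per word
U = rng.normal(size=(V, m))  # scoring matrix U

def cbow_probs(context_ids):
    """Predict the center word from the mean of its context word vectors."""
    context = E[context_ids].mean(axis=0)  # average of the context embeddings
    z = U @ context                        # score vector z
    z = z - z.max()                        # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()     # softmax over the vocabulary

# Context x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2} given as dictionary indices.
p_forward = cbow_probs([1, 2, 4, 5])
p_shuffled = cbow_probs([5, 1, 4, 2])
```

The two calls return the same distribution up to rounding, since averaging discards the order of the context words. That is the "bag-of-words" property in the model's name.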
### Skip-gram model ### Skip-gram model
The advantage of CBOW is that it smooths over the word embeddings of the context and reduces noise, so it is very effective on small datasets. Skip-gram uses a word to predict its context and obtains multiple contexts for the given word, so it can be used on larger datasets. The benefit of CBOW is that averaging the word vectors smooths the distribution of the context words and removes noise, so it is very effective on small data sets. In the Skip-gram method, a word is used to predict its context, which yields many context samples for the current word, so it can be used for larger data sets.
<p align="center"> <p align="center">
<img src="image/skipgram_en.png" width=250><br/>     <img src="https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/skipgram.png?raw=true" width=250><br/>
Figure 4. Skip-gram model     Figure 4. Skip-gram model
</p> </p>
As illustrated in the figure above, the skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combines the classification losses of all those $2n$ words computed by softmax. As shown in the figure above, the Skip-gram model maps the word vector of a word to the word vectors of $2n$ words (the $n$ words before and the $n$ words after the input word), and then obtains the sum of the classification losses of the $2n$ words by softmax.
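To show which training targets Skip-gram derives from a sentence, the following sketch lists the (center word, context word) pairs for a window of $n$ words on each side. It only generates the pairs; it does not train a model, and the helper `skipgram_pairs` is a hypothetical name used for illustration.

```python
def skipgram_pairs(tokens, n=2):
    """Return (center, context) pairs using up to n words on each side."""
    pairs = []
    for t, center in enumerate(tokens):
        for offset in range(-n, n + 1):
            ctx = t + offset
            if offset != 0 and 0 <= ctx < len(tokens):
                pairs.append((center, tokens[ctx]))
    return pairs

pairs = skipgram_pairs("I have a dream that".split(), n=2)
center_a = [context for center, context in pairs if center == "a"]
```

The word "a" sits in the middle of the sentence, so it gets the full $2n = 4$ context words; words near the sentence boundary get fewer. Each pair contributes one softmax classification loss, and the losses for one center word are summed.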
## Dataset ## Data Preparation
We will use Penn Treebank (PTB) (Tomas Mikolov's pre-processed version) dataset. PTB is a small dataset, used in Recurrent Neural Network Language Modeling Toolkit\[[2](#references)\]. Its statistics are as follows: ### Data Introduction
This tutorial uses the Penn Treebank (PTB) dataset (Tomas Mikolov's pre-processed version). The PTB data set is small, so training is fast. It is used in Mikolov's open-source language model training toolkit \[[2](#references)\]. Its statistics are as follows:
<p align="center"> <p align="center">
<table> <table>
<tr>     <tr>
<td>training set</td>         <td>Training data</td>
<td>validation set</td>         <td>Validation data</td>
<td>test set</td>         <td>Test data</td>
</tr>     </tr>
<tr>     <tr>
<td>ptb.train.txt</td>         <td>ptb.train.txt</td>
<td>ptb.valid.txt</td>         <td>ptb.valid.txt</td>
<td>ptb.test.txt</td>         <td>ptb.test.txt</td>
</tr>     </tr>
<tr>     <tr>
<td>42068 lines</td>         <td>42068 sentences</td>
<td>3370 lines</td>         <td>3370 sentences</td>
<td>3761 lines</td>         <td>3761 sentences</td>
</tr>     </tr>
</table> </table>
</p> </p>
### Python Dataset Module
We encapsulated the PTB Data Set in our Python module `paddle.dataset.imikolov`. This module can
1. download the dataset to `~/.cache/paddle/dataset/imikolov`, if not yet, and ### Data Preprocessing
2. [preprocesses](#preprocessing) the dataset.
### Preprocessing This chapter trains the 5-gram model, which means that the first 4 words of each piece of data are used to predict the 5th word during PaddlePaddle training. PaddlePaddle provides the python package `paddle.dataset.imikolov` corresponding to the PTB dataset, which automatically downloads and preprocesses the data for your convenience.
We will be training a 5-gram model. Given five words in a window, we will predict the fifth word given the first four words. Preprocessing adds the start symbol `<s>` and the end symbol `<e>` to each sentence in the data set. Then, depending on the window size (5 in this tutorial), slide the window to the right each time from start to end and generate a piece of data.
The beginning and end of a sentence have a special meaning, so we add a begin token `<s>` at the front of the sentence and an end token `<e>` at the end of the sentence. Data instances are generated by moving a five-word window across the sentence. For example, "I have a dream that one day" provides 5 pieces of data:
For example, the sentence "I have a dream that one day" generates five data instances:
```text ```text
<s> I have a dream <s> I have a dream
I have a dream that I have a dream that
have a dream that one have a dream that one
a dream that one day a dream that one day
dream that one day <e> dream that one day <e>
``` ```
Finally, each data instance is converted into an integer sequence according to the indices of its words in the dictionary. Finally, based on the positions of its words in the dictionary, each input is converted to a sequence of integer indices that serves as the input to PaddlePaddle.
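The preprocessing steps described above (add `<s>` and `<e>`, slide a 5-word window, map words to indices) can be reproduced with plain Python. This is a sketch of the idea only, not the implementation inside `paddle.dataset.imikolov`; the helpers `make_ngrams` and `to_ids`, the toy dictionary and the `<unk>` fallback for unknown words are assumptions made for the example.

```python
def make_ngrams(sentence, n=5):
    """Wrap the sentence in <s>/<e> and return every window of n tokens."""
    tokens = ["<s>"] + sentence.split() + ["<e>"]
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def to_ids(ngram, word_to_id, unk="<unk>"):
    """Map each word to its dictionary index, falling back to the <unk> index."""
    return [word_to_id.get(word, word_to_id[unk]) for word in ngram]

windows = make_ngrams("I have a dream that one day")

# Toy dictionary (assumption); the tutorial builds the real one with build_dict().
toy_words = ["<s>", "<e>", "<unk>", "I", "have", "a", "dream", "that", "one", "day"]
toy_dict = {word: i for i, word in enumerate(toy_words)}
ids = [to_ids(window, toy_dict) for window in windows]
```

`windows` contains the five instances listed above, and each entry of `ids` is the matching list of five integers: the first four are the model input and the fifth is the word to predict.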
## Training <a name="training model"></a>
## Program the Model
The neural network that we will be using is illustrated in the graph below: The model structure of this configuration is shown below:
<p align="center"> <p align="center">
<img src="image/ngram.en.png" width=400><br/>     <img src="https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/image/ngram.png?raw=true" width=400><br/>
Figure 5. N-gram neural network model in model configuration     Figure 5. N-gram neural network model in model configuration
</p> </p>
`word2vec/train.py` demonstrates training word2vec using PaddlePaddle: First, load packages:
### Datafeeder Configuration ```python
Our program starts with importing necessary packages:
- Import packages. from __future__ import print_function
```python import paddle as paddle
import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import six
import numpy import numpy
from functools import partial
import math import math
import os
import six
import sys
from __future__ import print_function
try:
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
``` ```
- Configure parameters and build word dictionary. Then, define the parameters:
```python ```python
EMBED_SIZE = 32 # word vector dimension EMBED_SIZE = 32 # embedding dimensions
HIDDEN_SIZE = 256 # hidden layer dimension HIDDEN_SIZE = 256 # hidden layer size
N = 5 # train 5-gram N = 5 # ngram size, here fixed 5
BATCH_SIZE = 32 # batch size BATCH_SIZE = 100 # batch size
PASS_NUM = 100 # Training rounds
# can use CPU or GPU use_cuda = False # Set to True if trained with GPU
use_cuda = os.getenv('WITH_GPU', '0') != '0'
word_dict = paddle.dataset.imikolov.build_dict() word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict) dict_size = len(word_dict)
``` ```
Unlike the previous PaddlePaddle v2, in the new API (Fluid), we do not need to calculate word embedding ourselves. PaddlePaddle provides a built-in method `fluid.layers.embedding` and we can use it directly to build our N-gram neural network model. A larger `BATCH_SIZE` will make the training converge faster, but it will also consume more memory. Since computing word vectors is expensive, please train on a GPU if your environment allows it, so that results are available sooner.
Unlike the previous PaddlePaddle v2 version, in the new Fluid version, we don't have to manually calculate the word vector. PaddlePaddle provides a built-in method `fluid.layers.embedding`, which we can use directly to construct an N-gram neural network.
- We define our N-gram neural network structure as below. This structure will be used both in `train` and in `infer`. We can specify `is_sparse = True` to accelerate sparse matrix update for word embedding. - Let's define our N-gram neural network structure. This structure is used in both training and predicting. Because the word vector is sparse, we pass the parameter `is_sparse=True` to speed up the update of the sparse matrix.
```python ```python
def inference_program(is_sparse): def inference_program(words, is_sparse):
first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
fourth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
embed_first = fluid.layers.embedding( embed_first = fluid.layers.embedding(
input=first_word, input=words[0],
size=[dict_size, EMBED_SIZE], size=[dict_size, EMBED_SIZE],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse,
param_attr='shared_w') param_attr='shared_w')
embed_second = fluid.layers.embedding( embed_second = fluid.layers.embedding(
input=second_word, input=words[1],
size=[dict_size, EMBED_SIZE], size=[dict_size, EMBED_SIZE],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse,
param_attr='shared_w') param_attr='shared_w')
embed_third = fluid.layers.embedding( embed_third = fluid.layers.embedding(
input=third_word, input=words[2],
size=[dict_size, EMBED_SIZE], size=[dict_size, EMBED_SIZE],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse,
param_attr='shared_w') param_attr='shared_w')
embed_fourth = fluid.layers.embedding( embed_fourth = fluid.layers.embedding(
input=fourth_word, input=words[3],
size=[dict_size, EMBED_SIZE], size=[dict_size, EMBED_SIZE],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse,
...@@ -337,79 +310,116 @@ def inference_program(is_sparse): ...@@ -337,79 +310,116 @@ def inference_program(is_sparse):
return predict_word return predict_word
``` ```
- As we already defined the N-gram neural network structure in the above, we can use it in our `train` method. - Based on the neural network structure above, we can define our training method as follows:
```python ```python
def train_program(is_sparse): def train_program(predict_word):
# The declaration of 'next_word' must be after the invoking of inference_program, # The definition of'next_word' must be after the declaration of inference_program.
# or the data input order of train program would be [next_word, firstw, secondw, # Otherwise the sequence of the train program input data becomes [next_word, firstw, secondw,
# thirdw, fourthw], which is not correct. #thirdw, fourthw], This is not true.
predict_word = inference_program(is_sparse)
next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64') next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64')
cost = fluid.layers.cross_entropy(input=predict_word, label=next_word) cost = fluid.layers.cross_entropy(input=predict_word, label=next_word)
avg_cost = fluid.layers.mean(cost) avg_cost = fluid.layers.mean(cost)
return avg_cost return avg_cost
```
- Now we will begin the training process. It is relatively simple compared to the previous version. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` are our training and test set. Both of the functions will return a **reader**: In PaddlePaddle, a reader is a Python function that returns a Python iterator, which outputs a single data instance at a time.
`paddle.batch` takes reader as input, outputs a **batched reader**: In PaddlePaddle, a reader outputs a single data instance at a time but batched reader outputs a minibatch of data instances.
`event_handler` can be passed into `trainer.train` so that we can do some tasks after each step or epoch. These tasks include recording current metrics or terminate current training process.
```python
def optimizer_func(): def optimizer_func():
return fluid.optimizer.AdagradOptimizer( return fluid.optimizer.AdagradOptimizer(
learning_rate=3e-3, learning_rate=3e-3,
regularization=fluid.regularizer.L2DecayRegularizer(8e-4)) regularization=fluid.regularizer.L2DecayRegularizer(8e-4))
```
- Now we can start training. This version is much simpler than before. We have ready-made training and test sets: `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()`. Both will return a reader. In PaddlePaddle, a reader is a Python function that, each time it is called, returns a Python generator yielding one piece of data at a time.
`paddle.batch` will read in a reader and output a batched reader. We can also output the training of each step and batch during the training process.
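The reader and batched reader concepts described above can be mimicked in plain Python. The sketch below is not PaddlePaddle's implementation; `toy_reader` and `batch` are hypothetical stand-ins, and keeping the final, smaller batch is a choice made for this example.

```python
def toy_reader():
    """A reader: calling it returns a generator of single data instances."""
    for i in range(7):
        yield (i, i + 1, i + 2, i + 3, i + 4)  # four input ids and the next id

def batch(reader, batch_size):
    """Wrap a reader so that it yields lists of up to batch_size instances."""
    def batched_reader():
        buffer = []
        for sample in reader():
            buffer.append(sample)
            if len(buffer) == batch_size:
                yield buffer
                buffer = []
        if buffer:  # keep the last, possibly smaller, batch
            yield buffer
    return batched_reader

batches = list(batch(toy_reader, 3)())
batch_sizes = [len(b) for b in batches]
```

With seven instances and a batch size of 3, `batch_sizes` is `[3, 3, 1]`. The training loop consumes a batched reader in the same way, one minibatch per iteration.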
```python
def train(if_use_cuda, params_dirname, is_sparse=True):
place = fluid.CUDAPlace(0) if if_use_cuda else fluid.CPUPlace()
def train(use_cuda, train_program, params_dirname):
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.dataset.imikolov.train(word_dict, N), BATCH_SIZE) paddle.dataset.imikolov.train(word_dict, N), BATCH_SIZE)
test_reader = paddle.batch( test_reader = paddle.batch(
paddle.dataset.imikolov.test(word_dict, N), BATCH_SIZE) paddle.dataset.imikolov.test(word_dict, N), BATCH_SIZE)
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() first_word = fluid.layers.data(name='firstw', shape=[1], dtype='int64')
second_word = fluid.layers.data(name='secondw', shape=[1], dtype='int64')
third_word = fluid.layers.data(name='thirdw', shape=[1], dtype='int64')
forth_word = fluid.layers.data(name='fourthw', shape=[1], dtype='int64')
next_word = fluid.layers.data(name='nextw', shape=[1], dtype='int64')
def event_handler(event): word_list = [first_word, second_word, third_word, forth_word, next_word]
if isinstance(event, EndStepEvent): feed_order = ['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw']
outs = trainer.test(
reader=test_reader, main_program = fluid.default_main_program()
feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw']) star_program = fluid.default_startup_program()
avg_cost = outs[0]
predict_word = inference_program(word_list, is_sparse)
# We output cost every 10 steps. avg_cost = train_program(predict_word)
if event.step % 10 == 0: test_program = main_program.clone(for_test=True)
print("Step %d: Average Cost %f" % (event.step, avg_cost))
sgd_optimizer = optimizer_func()
# If average cost is lower than 5.8, we consider the model good enough to stop. sgd_optimizer.minimize(avg_cost)
# Note 5.8 is a relatively high value. In order to get a better model, one should
# aim for avg_cost lower than 3.5. But the training could take longer time. exe = fluid.Executor(place)
if avg_cost < 5.8:
trainer.save_params(params_dirname) def train_test(program, reader):
trainer.stop() count = 0
feed_var_list = [
if math.isnan(avg_cost): program.global_block().var(var_name) for var_name in feed_order
]
feeder_test = fluid.DataFeeder(feed_list=feed_var_list, place=place)
test_exe = fluid.Executor(place)
accumulated = len([avg_cost]) * [0]
for test_data in reader():
avg_cost_np = test_exe.run(
program=program,
feed=feeder_test.feed(test_data),
fetch_list=[avg_cost])
accumulated = [
x[0] + x[1][0] for x in zip(accumulated, avg_cost_np)
]
count += 1
return [x / count for x in accumulated]
def train_loop():
step = 0
feed_var_list_loop = [
main_program.global_block().var(var_name) for var_name in feed_order
]
feeder = fluid.DataFeeder(feed_list=feed_var_list_loop, place=place)
exe.run(star_program)
for pass_id in range(PASS_NUM):
for data in train_reader():
avg_cost_np = exe.run(
main_program, feed=feeder.feed(data), fetch_list=[avg_cost])
if step % 10 == 0:
outs = train_test(test_program, test_reader)
print("Step %d: Average Cost %f" % (step, outs[0]))
# The entire training process takes several hours if the average loss is less than 5.8,
# We think that the model has achieved good results and can stop training.
# Note 5.8 is a relatively high value, in order to get a better model, you can
# set the threshold here to be 3.5, but the training time will be longer.
if outs[0] < 5.8:
if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, [
'firstw', 'secondw', 'thirdw', 'fourthw'
], [predict_word], exe)
return
step += 1
if math.isnan(float(avg_cost_np[0])):
sys.exit("got NaN loss, training failed.") sys.exit("got NaN loss, training failed.")
trainer = Trainer( raise AssertionError("Cost is too large {0:2.2}".format(avg_cost_np[0]))
train_func=train_program,
# Note here we need to choose a more sophisticated optimizer train_loop()
# such as AdaGrad with a decay rate. The normal SGD converges
# very slowly.
# optimizer=fluid.optimizer.SGD(learning_rate=0.001),
optimizer_func=optimizer_func,
place=place)
trainer.train(
reader=train_reader,
num_epochs=1,
event_handler=event_handler,
feed_order=['firstw', 'secondw', 'thirdw', 'fourthw', 'nextw'])
``` ```
`trainer.train` will start training; the output of `event_handler` will be similar to the following: - `train_loop` will start training. The training log looks like this:
```text ```text
Step 0: Average Cost 7.337213 Step 0: Average Cost 7.337213
Step 10: Average Cost 6.136128 Step 10: Average Cost 6.136128
...@@ -417,51 +427,61 @@ Step 20: Average Cost 5.766995 ...@@ -417,51 +427,61 @@ Step 20: Average Cost 5.766995
... ...
``` ```
<a name="model application"></a>
## Model Application ## Model Application
After the model is trained, we can use it to make some predictions.
After the model is trained, we can load the saved model parameters and do some inference. ### Predict the next word
We can use our trained model to predict the next word given the preceding words of an N-gram.
### Predicting the next word
We can use our trained model to predict the next word given its previous N-gram. For example
```python ```python
def infer(use_cuda, inference_program, params_dirname=None): def infer(use_cuda, params_dirname=None):
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
inferencer = Inferencer(
infer_func=inference_program, param_path=params_dirname, place=place) exe = fluid.Executor(place)
# Setup inputs by creating 4 LoDTensors representing 4 words. Here each word inference_scope = fluid.core.Scope()
# is simply an index to look up for the corresponding word vector and hence with fluid.scope_guard(inference_scope):
# the shape of word (base_shape) should be [1]. The length-based level of #Get the inference program using fluid.io.load_inference_model,
# detail (lod) info of each LoDtensor should be [[1]] meaning there is only #feed variable name by feed_target_names and fetch fetch_targets from scope
# one lod_level and there is only one sequence of one word on this level. [inferencer, feed_target_names,
# Note that lod info should be a list of lists. fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)
data1 = [[211]] # 'among' # Set the input and use 4 LoDTensor to represent 4 words. Each word here is an id,
data2 = [[6]] # 'a' # Used to query the embedding table to get the corresponding word vector, so its shape size is [1].
data3 = [[96]] # 'group' # recursive_sequence_lengths sets the length based on LoD, so it should all be set to [[1]]
data4 = [[4]] # 'of' # Note that recursive_sequence_lengths is a list of lists
lod = [[1]] data1 = numpy.asarray([[211]], dtype=numpy.int64) # 'among'
data2 = numpy.asarray([[6]], dtype=numpy.int64) # 'a'
data3 = numpy.asarray([[96]], dtype=numpy.int64) # 'group'
data4 = numpy.asarray([[4]], dtype=numpy.int64) # 'of'
lod = numpy.asarray([[1]], dtype=numpy.int64)
first_word = fluid.create_lod_tensor(data1, lod, place) first_word = fluid.create_lod_tensor(data1, lod, place)
second_word = fluid.create_lod_tensor(data2, lod, place) second_word = fluid.create_lod_tensor(data2, lod, place)
third_word = fluid.create_lod_tensor(data3, lod, place) third_word = fluid.create_lod_tensor(data3, lod, place)
fourth_word = fluid.create_lod_tensor(data4, lod, place) fourth_word = fluid.create_lod_tensor(data4, lod, place)
result = inferencer.infer( assert feed_target_names[0] == 'firstw'
{ assert feed_target_names[1] == 'secondw'
'firstw': first_word, assert feed_target_names[2] == 'thirdw'
'secondw': second_word, assert feed_target_names[3] == 'fourthw'
'thirdw': third_word,
'fourthw': fourth_word # Construct the feed dictionary {feed_target_name: feed_target_data}
# Prediction results are included in results
results = exe.run(
inferencer,
feed={
feed_target_names[0]: first_word,
feed_target_names[1]: second_word,
feed_target_names[2]: third_word,
feed_target_names[3]: fourth_word
}, },
fetch_list=fetch_targets,
return_numpy=False) return_numpy=False)
print(numpy.array(result[0])) print(numpy.array(results[0]))
most_possible_word_index = numpy.argmax(result[0]) most_possible_word_index = numpy.argmax(results[0])
print(most_possible_word_index) print(most_possible_word_index)
print([ print([
key for key, value in six.iteritems(word_dict) key for key, value in six.iteritems(word_dict)
...@@ -469,15 +489,16 @@ def infer(use_cuda, inference_program, params_dirname=None): ...@@ -469,15 +489,16 @@ def infer(use_cuda, inference_program, params_dirname=None):
][0]) ][0])
``` ```
After about 3 minutes of training, the output looks like the following, which means the predicted next word for `among a group of` is `a`. With longer training, the model gives a more meaningful prediction such as `workers`. Because the word vector matrix is relatively sparse, training takes a long time to reach a given precision. To show the effect quickly, this tutorial trains for only a few rounds and ends with the following result. Our model predicts that the next word for `among a group of` is `the`, which is grammatically plausible. If we train for longer, for example several hours, the predicted next word becomes `workers`. The format of the predicted output is as follows:
```text ```text
[[0.00106646 0.0007907 0.00072041 ... 0.00049024 0.00041355 0.00084464]] [[0.03768077 0.03463154 0.00018074 ... 0.00022283 0.00029888 0.02967956]]
6 0
a the
``` ```
The first line represents the probability distribution of the predicted word in the dictionary, the second line represents the id corresponding to the word with the highest probability, and the third line represents the word with the highest probability.
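The three output lines described above can be reproduced with a small stand-alone sketch. It does not use PaddlePaddle: `toy_word_dict` and `probs` are made-up stand-ins for the tutorial's `word_dict` and the fetched prediction, and `decode_prediction` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical toy vocabulary (word -> id). The tutorial's real word_dict
# comes from paddle.dataset.imikolov.build_dict().
toy_word_dict = {"the": 0, "a": 1, "workers": 2, "of": 3}

# Hypothetical model output: one row of probabilities over the vocabulary.
probs = np.array([[0.50, 0.20, 0.25, 0.05]])


def decode_prediction(probs, word_dict):
    """Return (id, word) for the most probable entry of `probs`."""
    index = int(np.argmax(probs))
    # Invert the word -> id mapping so we can look a word up by its id.
    id_to_word = {idx: word for word, idx in word_dict.items()}
    return index, id_to_word[index]


index, word = decode_prediction(probs, toy_word_dict)
print(probs)  # first line: probability distribution over the dictionary
print(index)  # second line: id of the most probable word
print(word)   # third line: the most probable word itself
```

Building the inverse mapping once is a design choice for clarity; the tutorial code instead scans `word_dict` for the matching id.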
The main entry point of the program is fairly simple: The entry point of the entire program is simple:
```python ```python
def main(use_cuda, is_sparse): def main(use_cuda, is_sparse):
...@@ -487,35 +508,29 @@ def main(use_cuda, is_sparse): ...@@ -487,35 +508,29 @@ def main(use_cuda, is_sparse):
params_dirname = "word2vec.inference.model" params_dirname = "word2vec.inference.model"
train( train(
use_cuda=use_cuda, if_use_cuda=use_cuda,
train_program=partial(train_program, is_sparse), params_dirname=params_dirname,
params_dirname=params_dirname) is_sparse=is_sparse)
infer( infer(use_cuda=use_cuda, params_dirname=params_dirname)
use_cuda=use_cuda,
inference_program=partial(inference_program, is_sparse),
params_dirname=params_dirname)
main(use_cuda=use_cuda, is_sparse=True) main(use_cuda=use_cuda, is_sparse=True)
``` ```
## Conclusion ## Conclusion
In this chapter, we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training neural network models. In information retrieval, we can judge the relevance between query and document keywords based on the cosine similarity between their vectors. In syntactic and semantic analysis, pre-trained word vectors can be used to initialize a model for better results. In document classification, once word vectors are available, you can cluster them to group synonyms in a document, or use an N-gram model to predict the next word. We hope that after reading this chapter you can readily use word vectors in research in related fields.
This chapter introduces word embeddings, the relationship between language model and word embedding, and how to train neural networks to learn word embedding. <a name="references"></a>
In grammar analysis and semantic analysis, a previously trained word embedding can initialize models for better performance. We hope that readers can use word embedding models in their work after reading this chapter.
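As a minimal sketch of the cosine-based relevance mentioned in the conclusion, the snippet below compares made-up vectors with NumPy. The vectors and the helper name `cosine_similarity` are illustrative assumptions, not embeddings produced by the trained model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


# Made-up 3-dimensional "word vectors" for illustration only.
query = [1.0, 2.0, 0.0]
doc_related = [2.0, 4.0, 0.0]    # same direction as the query
doc_unrelated = [0.0, 0.0, 3.0]  # orthogonal to the query

sim_related = cosine_similarity(query, doc_related)
sim_unrelated = cosine_similarity(query, doc_unrelated)
print(sim_related, sim_unrelated)
```

A value close to 1 indicates vectors pointing in nearly the same direction, and a value close to 0 indicates little relation.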
## References ## References
1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. journal of machine learning research, 2003, 3(Feb): 1137-1155. 1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. journal of machine learning Research, 2003, 3(Feb): 1137-1155.
2. Mikolov T, Kombrink S, Deoras A, et al. [Rnnlm-recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201. 2. Mikolov T, Kombrink S, Deoras A, et al. [Rnnlm-recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C ]//Proc. of the 2011 ASRU Workshop. 2011: 196-201.
3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013. 3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013 .
4. Maaten L, Hinton G. [Visualizing data using t-SNE](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf)[J]. Journal of Machine Learning Research, 2008, 9(Nov): 2579-2605. 4. Maaten L, Hinton G. [Visualizing data using t-SNE](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf)[J]. Journal of Machine Learning Research, 2008, 9(Nov ): 2579-2605.
5. https://en.wikipedia.org/wiki/Singular_value_decomposition 5. https://en.wikipedia.org/wiki/Singular_value_decomposition
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -18,19 +18,31 @@ import six ...@@ -18,19 +18,31 @@ import six
import numpy import numpy
import sys import sys
import math import math
import argparse
EMBED_SIZE = 32 EMBED_SIZE = 32
HIDDEN_SIZE = 256 HIDDEN_SIZE = 256
N = 5 N = 5
BATCH_SIZE = 100 BATCH_SIZE = 100
PASS_NUM = 100
use_cuda = False # set to True if training with GPU
word_dict = paddle.dataset.imikolov.build_dict() word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict) dict_size = len(word_dict)
def parse_args():
parser = argparse.ArgumentParser("word2vec")
parser.add_argument(
'--enable_ce',
action='store_true',
help='If set, run the task with continuous evaluation logs.')
parser.add_argument(
'--use_gpu', type=int, default=0, help='whether to use gpu')
parser.add_argument(
'--num_epochs', type=int, default=100, help='number of epoch')
args = parser.parse_args()
return args
def inference_program(words, is_sparse): def inference_program(words, is_sparse):
embed_first = fluid.layers.embedding( embed_first = fluid.layers.embedding(
...@@ -102,6 +114,10 @@ def train(if_use_cuda, params_dirname, is_sparse=True): ...@@ -102,6 +114,10 @@ def train(if_use_cuda, params_dirname, is_sparse=True):
main_program = fluid.default_main_program() main_program = fluid.default_main_program()
star_program = fluid.default_startup_program() star_program = fluid.default_startup_program()
if args.enable_ce:
main_program.random_seed = 90
star_program.random_seed = 90
predict_word = inference_program(word_list, is_sparse) predict_word = inference_program(word_list, is_sparse)
avg_cost = train_program(predict_word) avg_cost = train_program(predict_word)
test_program = main_program.clone(for_test=True) test_program = main_program.clone(for_test=True)
...@@ -153,6 +169,9 @@ def train(if_use_cuda, params_dirname, is_sparse=True): ...@@ -153,6 +169,9 @@ def train(if_use_cuda, params_dirname, is_sparse=True):
# Note 5.8 is a relatively high value. In order to get a better model, one should # Note 5.8 is a relatively high value. In order to get a better model, one should
# aim for avg_cost lower than 3.5. But the training could take longer time. # aim for avg_cost lower than 3.5. But the training could take longer time.
if outs[0] < 5.8: if outs[0] < 5.8:
if args.enable_ce:
print("kpis\ttrain_cost\t%f" % outs[0])
if params_dirname is not None: if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, [ fluid.io.save_inference_model(params_dirname, [
'firstw', 'secondw', 'thirdw', 'fourthw' 'firstw', 'secondw', 'thirdw', 'fourthw'
...@@ -161,7 +180,6 @@ def train(if_use_cuda, params_dirname, is_sparse=True): ...@@ -161,7 +180,6 @@ def train(if_use_cuda, params_dirname, is_sparse=True):
step += 1 step += 1
if math.isnan(float(avg_cost_np[0])): if math.isnan(float(avg_cost_np[0])):
sys.exit("got NaN loss, training failed.") sys.exit("got NaN loss, training failed.")
raise AssertionError("Cost is too large {0:2.2}".format(avg_cost_np[0])) raise AssertionError("Cost is too large {0:2.2}".format(avg_cost_np[0]))
train_loop() train_loop()
...@@ -188,11 +206,11 @@ def infer(use_cuda, params_dirname=None): ...@@ -188,11 +206,11 @@ def infer(use_cuda, params_dirname=None):
# meaning there is only one level of detail and there is only one sequence of # meaning there is only one level of detail and there is only one sequence of
# one word on this level. # one word on this level.
# Note that recursive_sequence_lengths should be a list of lists. # Note that recursive_sequence_lengths should be a list of lists.
data1 = [[numpy.int64(211)]] # 'among' data1 = numpy.asarray([[211]], dtype=numpy.int64) # 'among'
data2 = [[numpy.int64(6)]] # 'a' data2 = numpy.asarray([[6]], dtype=numpy.int64) # 'a'
data3 = [[numpy.int64(96)]] # 'group' data3 = numpy.asarray([[96]], dtype=numpy.int64) # 'group'
data4 = [[numpy.int64(4)]] # 'of' data4 = numpy.asarray([[4]], dtype=numpy.int64) # 'of'
lod = [[numpy.int64(1)]] lod = numpy.asarray([[1]], dtype=numpy.int64)
first_word = fluid.create_lod_tensor(data1, lod, place) first_word = fluid.create_lod_tensor(data1, lod, place)
second_word = fluid.create_lod_tensor(data2, lod, place) second_word = fluid.create_lod_tensor(data2, lod, place)
...@@ -245,4 +263,7 @@ def main(use_cuda, is_sparse): ...@@ -245,4 +263,7 @@ def main(use_cuda, is_sparse):
if __name__ == '__main__': if __name__ == '__main__':
args = parse_args()
PASS_NUM = args.num_epochs
use_cuda = args.use_gpu # set to True if training with GPU
main(use_cuda=use_cuda, is_sparse=True) main(use_cuda=use_cuda, is_sparse=True)
#!/bin/bash
#This file is only used for continuous evaluation.
python train.py --enable_ce | python _ce.py
...@@ -2,6 +2,14 @@ ...@@ -2,6 +2,14 @@
The source code for this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). If this is your first time here, please refer to the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书) The source code for this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). If this is your first time here, please refer to the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)
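### Notes:
1. Hardware requirements:
This tutorial can run on either CPU or GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, note that the default image provides a GPU environment with CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in the document and in the script:
Note: to make this tutorial easier to read and use, we split and adjusted the code of train.py before placing it here. The code in this tutorial produces the same results as train.py; you can run [train.py](https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/train.py) directly to verify.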
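## Background ## Background
As network technology keeps developing and e-commerce keeps expanding, the number and variety of goods grow rapidly, and users must spend a great deal of time to find what they want to buy. This is the information overload problem. Personalized recommender systems emerged to solve it. As network technology keeps developing and e-commerce keeps expanding, the number and variety of goods grow rapidly, and users must spend a great deal of time to find what they want to buy. This is the information overload problem. Personalized recommender systems emerged to solve it.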
...@@ -54,7 +62,9 @@ YouTube是世界上最大的视频上传、分享和发现网站,YouTube个性 ...@@ -54,7 +62,9 @@ YouTube是世界上最大的视频上传、分享和发现网站,YouTube个性
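For a user $U$, the probability that the video $\omega$ the user wants to watch at this moment is video $i$ is given by: For a user $U$, the probability that the video $\omega$ the user wants to watch at this moment is video $i$ is given by: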
$$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/formula1.png?raw=true" width="20%" ><br/>
</p>
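where $u$ is the feature representation of user $U$, $V$ is the set of videos in the library, and $v_i$ is the feature representation of the $i$-th video in the library. $u$ and $v_i$ are vectors of equal length, and their dot product can be implemented with a fully connected layer. where $u$ is the feature representation of user $U$, $V$ is the set of videos in the library, and $v_i$ is the feature representation of the $i$-th video in the library. $u$ and $v_i$ are vectors of equal length, and their dot product can be implemented with a fully connected layer.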
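The probability $P(\omega=i|u)$ is a softmax over the dot products $v_j u$. A minimal NumPy sketch, assuming made-up two-dimensional features (the function name `watch_probabilities` and all values are hypothetical):

```python
import numpy as np


def watch_probabilities(u, video_vectors):
    """Softmax over the dot products v_j . u for every video j.

    u:             (d,) user feature vector
    video_vectors: (num_videos, d) matrix whose rows are the v_j
    """
    scores = video_vectors @ u
    # Subtracting the maximum does not change the result but avoids overflow.
    scores = scores - scores.max()
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


# Made-up features for one user and a library of three videos.
u = np.array([1.0, 0.0])
V = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

p = watch_probabilities(u, V)
print(p)
```

The matrix product `video_vectors @ u` is the same computation a bias-free fully connected layer would perform, which is why the dot products can be realized that way.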
...@@ -83,11 +93,15 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ ...@@ -83,11 +93,15 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
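Second, perform the convolution: apply a convolution kernel $w\in\mathbb{R}^{hk}$ to a window $x_{i:i+h-1}$ of $h$ words to obtain the feature $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a nonlinear activation function such as $sigmoid$. Applying the kernel to all word windows $\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$ in the sentence produces a feature map: Second, perform the convolution: apply a convolution kernel $w\in\mathbb{R}^{hk}$ to a window $x_{i:i+h-1}$ of $h$ words to obtain the feature $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a nonlinear activation function such as $sigmoid$. Applying the kernel to all word windows $\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$ in the sentence produces a feature map: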
$$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/formula2.png?raw=true" width="40%" ><br/>
</p>
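Next, apply max pooling over time to the feature map to obtain the feature $\hat c$ of the whole sentence for this kernel, which is the maximum of all elements in the feature map: Next, apply max pooling over time to the feature map to obtain the feature $\hat c$ of the whole sentence for this kernel, which is the maximum of all elements in the feature map: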
$$\hat c=max(c)$$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/formula3.png?raw=true" width="15%" ><br/>
</p>
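The convolution and max-pooling-over-time steps can be sketched in plain NumPy as follows. This is an illustration with one kernel and made-up inputs, not the PaddlePaddle implementation; `conv_max_over_time` is a hypothetical helper name.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv_max_over_time(x, w, b, h):
    """Apply one kernel to every window of h words, then max-pool.

    x: (n, k) matrix, one k-dimensional word vector per row
    w: (h * k,) convolution kernel
    b: scalar bias
    Returns the feature map c (length n - h + 1) and its maximum c_hat.
    """
    n, _ = x.shape
    feature_map = np.empty(n - h + 1)
    for i in range(n - h + 1):
        # Flatten the window x_{i:i+h-1} into a vector of length h * k.
        window = x[i:i + h].reshape(-1)
        feature_map[i] = sigmoid(np.dot(w, window) + b)
    return feature_map, feature_map.max()


# Made-up sentence of n = 4 words with k = 2 dimensional vectors.
x = np.arange(8, dtype=np.float64).reshape(4, 2) / 10.0
w = np.ones(4)  # kernel covering h = 2 words
c, c_hat = conv_max_over_time(x, w, b=0.0, h=2)
print(c, c_hat)
```

Because pooling keeps only the maximum, each kernel yields a single feature per sentence regardless of the sentence length $n$.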
#### Overview of the Fusion Recommendation Model #### Overview of the Fusion Recommendation Model
...@@ -120,9 +134,10 @@ Paddle在API中提供了自动加载数据的模块。数据模块为 `paddle.da ...@@ -120,9 +134,10 @@ Paddle在API中提供了自动加载数据的模块。数据模块为 `paddle.da
```python ```python
from __future__ import print_function
import paddle import paddle
movie_info = paddle.dataset.movielens.movie_info() movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0] print(list(movie_info.values())[0])
``` ```
...@@ -138,7 +153,7 @@ print movie_info.values()[0] ...@@ -138,7 +153,7 @@ print movie_info.values()[0]
```python ```python
movie_info = paddle.dataset.movielens.movie_info() movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0] print(list(movie_info.values())[0])
``` ```
<MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])> <MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>
...@@ -149,7 +164,7 @@ print movie_info.values()[0] ...@@ -149,7 +164,7 @@ print movie_info.values()[0]
```python ```python
user_info = paddle.dataset.movielens.user_info() user_info = paddle.dataset.movielens.user_info()
print user_info.values()[0] print(list(user_info.values())[0])
``` ```
<UserInfo id(1), gender(F), age(1), job(10)> <UserInfo id(1), gender(F), age(1), job(10)>
...@@ -202,7 +217,7 @@ train_set_creator = paddle.dataset.movielens.train() ...@@ -202,7 +217,7 @@ train_set_creator = paddle.dataset.movielens.train()
train_sample = next(train_set_creator()) train_sample = next(train_set_creator())
uid = train_sample[0] uid = train_sample[0]
mov_id = train_sample[len(user_info[uid].value())] mov_id = train_sample[len(user_info[uid].value())]
print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1]) print ("User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1]))
``` ```
User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0] User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]
...@@ -218,7 +233,6 @@ print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id] ...@@ -218,7 +233,6 @@ print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id]
```python ```python
from __future__ import print_function
import math import math
import sys import sys
import numpy as np import numpy as np
...@@ -519,13 +533,13 @@ train_loop() ...@@ -519,13 +533,13 @@ train_loop()
```python ```python
infer_movie_id = 783 infer_movie_id = 783
infer_movie_name = paddle.dataset.movielens.movie_info()[infer_movie_id].title infer_movie_name = paddle.dataset.movielens.movie_info()[infer_movie_id].title
user_id = fluid.create_lod_tensor([[1]], [[1]], place) user_id = fluid.create_lod_tensor([[np.int64(1)]], [[1]], place)
gender_id = fluid.create_lod_tensor([[1]], [[1]], place) gender_id = fluid.create_lod_tensor([[np.int64(1)]], [[1]], place)
age_id = fluid.create_lod_tensor([[0]], [[1]], place) age_id = fluid.create_lod_tensor([[np.int64(0)]], [[1]], place)
job_id = fluid.create_lod_tensor([[10]], [[1]], place) job_id = fluid.create_lod_tensor([[np.int64(10)]], [[1]], place)
movie_id = fluid.create_lod_tensor([[783]], [[1]], place) # Hunchback of Notre Dame movie_id = fluid.create_lod_tensor([[np.int64(783)]], [[1]], place) # Hunchback of Notre Dame
category_id = fluid.create_lod_tensor([[10, 8, 9]], [[3]], place) # Animation, Children's, Musical category_id = fluid.create_lod_tensor([np.array([10, 8, 9], dtype='int64')], [[3]], place) # Animation, Children's, Musical
movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]], movie_title = fluid.create_lod_tensor([np.array([1069, 4140, 2923, 710, 988], dtype='int64')], [[5]],
place) # 'hunchback','of','notre','dame','the' place) # 'hunchback','of','notre','dame','the'
``` ```
...@@ -578,10 +592,10 @@ with fluid.scope_guard(inference_scope): ...@@ -578,10 +592,10 @@ with fluid.scope_guard(inference_scope):
2. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)*Proceedings of the 10th international conference on World Wide Web*. ACM, 2001. 2. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)*Proceedings of the 10th international conference on World Wide Web*. ACM, 2001.
3. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: combining social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA 3. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: combining social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA
4. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325. 4. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
5. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2. 5. Robin Burke , [Hybrid Web Recommender Systems](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.435.7538&rep=rep1&type=pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016). 6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198. 7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
# Personalized Recommendation # Recommender System
The source code from this tutorial is at [here](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). For instructions on getting started with this book,see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code of this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). For new users, please refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book) .
## Background Introduction
## Background With the continuous development of network technology and the ever-expanding scale of e-commerce, the number and variety of goods have grown rapidly, and users need to spend a lot of time finding the goods they want to buy. This is the problem of information overload. Recommendation systems came into being to solve it.
The recommender system is a component of e-commerce, online videos, and online reading services. There are several different approaches for recommender systems to learn from user behavior and product properties and to understand users' interests. A recommendation system is a kind of information filtering system that can be used in areas such as movies, music, e-commerce, and feed recommendations. It discovers users' personalized needs and interests by analyzing and mining their behavior, and recommends information or products that may interest them. Unlike search engines, recommendation systems do not require users to describe their needs precisely; instead, they model users' historical behavior to proactively provide information that matches their interests and needs.
- User behavior-based approach. A well-known method of this approach is collaborative filtering, which assumes that if two users made similar purchases, they share common interests and would likely go on making the same decision. Some variants of collaborative filtering are user-based[[3](#references)], item-based [[4](#references)], social network based[[5](#references)], and model-based. The GroupLens system \[[1](#references)\], introduced by the University of Minnesota in 1994, is generally considered to mark the emergence of recommendation systems as a relatively independent research direction. It was the first to propose completing the recommendation task with collaborative filtering. After that, model-based collaborative filtering led the development of recommendation systems for more than ten years.
- Content-based approach[[1](#references)]. This approach represents product properties and user interests as feature vectors of the same space so that it could measure how much a user is interested in a product by the distance between two feature vectors. The traditional personalized recommendation system methods mainly include:
- Hybrid approach[[2](#references)]: This one combines above two to help with each other about the data sparsity problem[[6](#references)]. - Collaborative Filtering Recommendation: This method is one of the most widely used technologies which requires the collection and analysis of users' historical behaviors, activities and preferences. It can usually be divided into two sub-categories: User-Based Recommendation \[[1](#references)\] and Item-Based Recommendation \[[2](#references)\]. A key advantage of this method is that it does not rely on the machine to analyze the content characteristics of the item, so it does not need to understand the item itself to accurately recommend complex items such as movies. However, the disadvantage is that there is a cold start problem for new users without any behavior. At the same time, there is also a sparsity problem caused by insufficient interaction data between users and commodities. It is worth mentioning that social network \[[3](#references)\] or geographic location and other context information can be integrated into collaborative filtering.
- Content-Based Filtering Recommendation \[[4](#references)\] : This method uses the content description of the product to abstract meaningful features by calculating the similarity between the user's interest and the product description to make recommendations to users. The advantage is that it is simple and straightforward. It does not need to evaluate products based on the comments of users. Instead, it compares the product similarity by product attributes to recommend similar products to the users of interest. The disadvantage is that there is also a cold start problem for new users without any behavior.
- Hybrid Recommendation \[[5](#references)\]: Use different inputs and techniques to jointly recommend items to complement each single recommendation technique.
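To make the user-based variant of collaborative filtering concrete, here is a small sketch on a made-up rating matrix. It uses a similarity-weighted average of other users' ratings, which is one common formulation and not necessarily the one used in the cited papers; `predict_rating` and all values are hypothetical.

```python
import numpy as np


def predict_rating(ratings, user, item):
    """Predict ratings[user, item] from users who rated `item`.

    ratings: (num_users, num_items) matrix, 0 means "not rated".
    Each neighbour's rating is weighted by the cosine similarity
    between its rating vector and the target user's rating vector.
    """
    target = ratings[user]
    weighted_sum, weight_total = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        norm = np.linalg.norm(target) * np.linalg.norm(ratings[other])
        if norm == 0:
            continue  # a user without any ratings has no usable similarity
        sim = float(np.dot(target, ratings[other]) / norm)
        weighted_sum += sim * ratings[other, item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Made-up ratings: rows are users, columns are items.
ratings = np.array([[5.0, 4.0, 0.0],
                    [5.0, 4.0, 5.0],
                    [1.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0]])  # user 3 is new and has no history

pred = predict_rating(ratings, user=0, item=2)
cold = predict_rating(ratings, user=3, item=0)
print(pred, cold)
```

User 3 illustrates the cold start problem mentioned above: with no behavior history, no similarity can be computed and the sketch falls back to 0.0.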
This tutorial explains a deep learning based hybrid approach and its implement in PaddlePaddle. We are going to train a model using a dataset that includes user information, movie information, and ratings. Once we train the model, we will be able to get a predicted rating given a pair of user and movie IDs. In recent years, deep learning has achieved great success in many fields. Both academia and industry are trying to apply deep learning to the field of recommendation systems. Deep learning has excellent ability to automatically extract features, can learn multi-level abstract feature representations, and learn heterogeneous or cross-domain content information, which can deal with the cold start problem \[[6](#references)\] of recommendation system to some extent. This tutorial focuses on the deep learning model of recommendation system and how to implement the model with PaddlePaddle.
## Result Demo
## Model Overview We use a dataset containing user information, movie information, and movie ratings to build a recommendation system. Once the model is trained, we only need to input a user ID and a movie ID to get a matching score (range [0, 5]; a higher score indicates greater interest). We can then sort all movies by their scores and recommend the ones most likely to interest the user.
To learn more about deep learning based recommendation, let us start by going over the YouTube recommender system[[7](#references)] before introducing our hybrid model. ```
Input movie_id: 1962
Input user_id: 1
Prediction Score is 4.25
```
## Model Overview
In this chapter, we first introduce YouTube's video personalization recommendation system \[[7](#references)\], and then introduce the fusion recommendation model we implemented.
### YouTube's Deep Learning Recommendation Model ### YouTube's Deep Neural Network Personalized Recommendation System
YouTube is a video-sharing Web site with one of the largest user bases in the world. Its recommender system serves more than a billion users. This system is composed of two major parts: candidate generation and ranking. The former selects a few hundred candidates from millions of videos, and the latter ranks them and outputs the top 10. YouTube is the world's largest video uploading, sharing and discovery site, and the YouTube Personalized Recommendation System recommends personalized content from a growing library to more than 1 billion users. The entire system consists of two neural networks: a candidate generation network and a ranking network. The candidate generation network generates hundreds of candidates from a library of millions of videos, and the ranking network sorts the candidates and outputs the top few dozen results. The system structure is shown in Figure 1:
<p align="center"> <p align="center">
<img src="image/YouTube_Overview.en.png" width="70%" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/YouTube_Overview.png?raw=true" width="70%" ><br/>
Figure 1. YouTube recommender system overview. Figure 1. YouTube personalized recommendation system structure
</p> </p>
#### Candidate Generation Network #### Candidate Generation Network
YouTube models candidate generation as a multi-class classification problem with a huge number of classes equal to the number of videos. The architecture of the model is as follows: The candidate generation network models the recommendation problem as a multi-class classification problem with a large number of categories. For a YouTube user, it takes the user's watch history (video IDs), search tokens, demographic information (such as geographic location and login device), binary features (such as gender and whether the user is logged in), and continuous features (such as user age), and classifies over all videos in the video library to obtain a score for each category (i.e., the recommendation probability of each video), eventually outputting the few hundred videos with the highest probabilities.
First, historical information such as the watch history and search token records is mapped to vectors and averaged to obtain a fixed-length representation. At the same time, demographic features are added to improve recommendations for new users, and the binary and continuous features are normalized to the range [0, 1]. Next, all the feature representations are concatenated into one vector and fed to a non-linear multilayer perceptron (MLP, see the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/README.md) tutorial). Finally, during training, the output of the MLP is classified by softmax. At prediction time, the similarity between the user's combined features (the MLP output) and the features of every video is computed, and the $k$ videos with the highest scores are returned as the result of the candidate generation network. Figure 2 shows the candidate generation network structure.
<p align="center"> <p align="center">
<img src="image/Deep_candidate_generation_model_architecture.en.png" width="70%" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/Deep_candidate_generation_model_architecture.png?raw=true" width="70%" ><br/>
Figure 2. Deep candidate generation model. Figure 2. Candidate generation network structure
</p> </p>
The first stage of this model maps watching history and search queries into fixed-length representative features. Then, an MLP (multi-layer Perceptron, as described in the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/recognize_digits/README.md) tutorial) takes the concatenation of all representative vectors. The output of the MLP represents the user's *intrinsic interests*. At training time, it is used together with a softmax output layer for minimizing the classification error. At serving time, it is used to compute the relevance of the user with all movies. For a user $U$, the probability that the video $\omega$ the user wants to watch at the moment is video $i$ is:
For a user $U$, the predicted watching probability of video $i$ is
$$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
where $u$ is the representative vector of user $U$, $V$ is the corpus of all videos, and $v_i$ is the representative vector of the $i$-th video. $u$ and $v_i$ are vectors of the same length, so we can compute their dot product using a fully connected layer. Here $u$ is the feature representation of the user $U$, $V$ is the video library, and $v_i$ is the feature representation of the $i$-th video in the video library. $u$ and $v_i$ are vectors of equal length, and the dot product can be implemented by a fully connected layer.
This model could have a performance issue as the softmax output covers millions of classification labels. To optimize performance, at the training time, the authors down-sample negative samples, so the actual number of classes is reduced to thousands. At serving time, the authors ignore the normalization of the softmax outputs, because the results are just for ranking. Considering that the number of categories in the softmax classification is very large, in order to ensure a certain computational efficiency: 1) in the training phase, use negative sample category sampling to reduce the number of actually calculated categories to thousands; 2) in the recommendation (prediction) phase, ignore the normalized calculation of softmax (does not affect the result), and simplifies the category scoring problem into the nearest neighbor search problem in the dot product space, then takes the nearest $k$ video of $u$ as a candidate for generation.
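The claim that skipping the softmax normalization does not change the ranking can be checked with a small sketch. It is a toy illustration in plain Python, not YouTube's or PaddlePaddle's implementation; the helper names (`dot`, `softmax_scores`, `top_k_by_dot`) and all vector values are made up for the example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_scores(u, videos):
    """P(w = i | u) for every video i, following the softmax formula above."""
    logits = [dot(v, u) for v in videos]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_by_dot(u, videos, k):
    """Rank videos by the raw dot product only (no normalization)."""
    logits = [dot(v, u) for v in videos]
    order = sorted(range(len(videos)), key=lambda i: logits[i], reverse=True)
    return order[:k]


# Illustrative user vector and video vectors (assumed values).
u = [0.5, -1.0, 2.0]
videos = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.2, -0.3, 1.0], [-1.0, 0.5, 0.1]]

probs = softmax_scores(u, videos)
top2 = top_k_by_dot(u, videos, 2)
top2_by_prob = sorted(range(len(videos)), key=lambda i: probs[i], reverse=True)[:2]
print(top2, top2_by_prob)
```

Because the exponential is monotonic and the denominator is shared by all videos, both rankings agree, which is why the prediction phase can reduce to a nearest neighbor search in dot product space.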
#### Ranking Network #### Ranking Network
The structure of the ranking network is similar to the candidate generation network, but its goal is to perform finer ranking of the candidates. Similar to the feature extraction method in traditional advertisement ranking, a large number of related features (such as video ID, last watching time, etc.) for video sorting are also constructed here. These features are treated similarly to the candidate generation network, except that at the top of the ranking network is a weighted logistic regression that scores all candidate videos and sorts them from high to low. Then, return to the user.
The architecture of the ranking network is similar to that of the candidate generation network. Similar to ranking models widely used in online advertising, it uses rich features like video ID, last watching time, etc. The output layer of the ranking network is a weighted logistic regression, which rates all candidate videos. ### Fusion recommendation model
This section uses Convolutional Neural Networks to learn the representation of movie titles. The convolutional neural network for text and the fusion recommendation model are introduced in turn.
### Hybrid Model
In this section, let us introduce our movie recommendation system. In particular, we feed movie titles into a text convolution network to get a fixed-length representative feature vector. Accordingly, we will introduce the convolutional neural network for texts and the hybrid recommendation model in turn.
#### Convolutional Neural Networks for Texts (CNN)
**Convolutional Neural Networks** are frequently applied to data with grid-like topology such as two-dimensional images and one-dimensional texts. A CNN can extract multiple local features, combine them, and produce high-level abstractions, which correspond to semantic understanding. Empirically, CNN is shown to be efficient for image and text modeling. #### Convolutional Neural Network (CNN) for text
CNN mainly contains convolution and pooling operation, with versatile combinations in various applications. Here, we briefly describe a CNN as shown in Figure 3. Convolutional neural networks are often used to deal with data of a grid-like topology. For example, an image can be viewed as a pixel of a two-dimensional grid, and a natural language can be viewed as a one-dimensional sequence of words. Convolutional neural networks can extract a variety of local features and combine them to obtain more advanced feature representations. Experiments show that convolutional neural networks can efficiently model image and text problems.
The convolutional neural network is mainly composed of convolution and pooling operations, and its application and combination methods are flexible and varied. In this section we will explain the network as shown in Figure 3:
<p align="center"> <p align="center">
<img src="image/text_cnn_en.png" width = "80%" align="center"/><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/text_cnn.png?raw=true" width = "80%" align="center"/><br />
Figure 3. CNN for text modeling. Figure 3. Convolutional neural network text classification model
</p> </p>
Let $n$ be the length of the sentence to process, and the $i$-th word has embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality. Suppose the length of the sentence to be processed is $n$, where the word vector of the $i$-th word is $x_i\in\mathbb{R}^k$, and $k$ is the dimension size.
First, we concatenate the words by piecing together every $h$ words, each as a window of length $h$. This window is denoted as $x_{i:i+h-1}$, consisting of $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $x_i$ is the first word in the window and $i$ takes value ranging from $1$ to $n-h+1$: $x_{i:i+h-1}\in\mathbb{R}^{hk}$. First, concatenate the word vectors: every $h$ consecutive words are concatenated to form a word window of size $h$, denoted as $x_{i:i+h-1}$, which is the concatenation of $x_{i}, x_{i+1}, \ldots, x_{i+h-1}$, where $i$ is the position of the first word of the window in the sentence, ranging from $1$ to $n-h+1$, and $x_{i:i+h-1}\in\mathbb{R}^{hk}$.
Next, we apply the convolution operation: we apply the kernel $w\in\mathbb{R}^{hk}$ to each window, extracting features $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a non-linear activation function such as $sigmoid$. Convolving with the kernel at every window $\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$ produces a feature map in the following form: Second, perform the convolution operation: apply the convolution kernel $w\in\mathbb{R}^{hk}$ to the window $x_{i:i+h-1}$ containing $h$ words to get the feature $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a non-linear activation function such as $sigmoid$. Applying the convolution kernel to all word windows $\{x_{1:h}, x_{2:h+1},\ldots,x_{n-h+1:n}\}$ in the sentence produces a feature map:
$$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$ $$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$
Next, we apply *max pooling* over time to represent the whole sentence $\hat c$, which is the maximum element across the feature map: Next, apply max pooling over time to the feature map to obtain the feature $\hat c$ of the whole sentence for this convolution kernel, which is the maximum of all elements in the feature map:
$$\hat c=max(c)$$ $$\hat c=max(c)$$
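The window concatenation, convolution, and max pooling over time described above can be sketched for a single kernel in plain Python. This is only an illustration of the formulas, not the `sequence_conv_pool` implementation; the function name `conv_max_pool` and all numeric values are assumptions, and the sketch assumes $n \ge h$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conv_max_pool(word_vectors, w, b, h):
    """Apply one kernel w (length h*k) to every window of h words,
    then take the maximum over the resulting feature map."""
    n = len(word_vectors)
    feature_map = []
    for i in range(n - h + 1):
        # x_{i:i+h-1}: concatenate h consecutive word vectors into one vector
        window = [x for vec in word_vectors[i:i + h] for x in vec]
        z = sum(wi * xi for wi, xi in zip(w, window)) + b
        feature_map.append(sigmoid(z))  # c_i = f(w . x_{i:i+h-1} + b)
    return feature_map, max(feature_map)  # c and c_hat


# Illustrative sentence with n = 5 words and embedding size k = 2.
sentence = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4], [-0.2, 0.1], [0.3, 0.3]]
kernel = [0.5, -0.5, 1.0, 0.25, -1.0, 0.75]  # h * k = 3 * 2 weights
feature_map, c_hat = conv_max_pool(sentence, kernel, 0.1, 3)
print(len(feature_map), c_hat)
```

Whatever the sentence length, each kernel contributes exactly one number $\hat c$, so stacking the outputs of several kernels yields a fixed-length vector.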
#### Model Structure Of The Hybrid Model #### Fusion recommendation model overview
In our network, the input includes features of users and movies. The user feature includes four properties: user ID, gender, occupation, and age. Movie features include their IDs, genres, and titles. In the personalized movie recommendation system built on the fusion recommendation model:
We use fully-connected layers to map user features into representative feature vectors and concatenate them. The process of movie features is similar, except that for movie titles -- we feed titles into a text convolution network as described in the above section to get a fixed-length representative feature vector. 1. First, take user features and movie features as input to the neural network, where:
Given the feature vectors of users and movies, we compute the relevance using cosine similarity. We minimize the squared error at training time. - The user features incorporate four attribute information: user ID, gender, occupation, and age.
- The movie features incorporate three attributes: movie ID, movie type ID, and movie name.
2. For the user features, map the user ID to a vector representation with a dimension size of 256, feed it into a fully connected layer, and process the other three attributes in the same way. The feature representations of the four attributes are then each passed through a fully connected layer and summed.
3. For the movie features, the movie ID is processed in a manner similar to the user ID. The movie type ID is fed directly into a fully connected layer in the form of a vector, and the movie name is converted to a fixed-length vector by a text convolutional neural network. The feature representations of the three attributes are then each passed through a fully connected layer and summed.
4. After obtaining the vector representations of the user and the movie, compute their cosine similarity as the score of the personalized recommendation system. Finally, the square of the difference between the similarity score and the user's true score is used as the loss function of the regression model.
<p align="center"> <p align="center">
<img src="image/rec_regression_network_en.png" width="90%" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/rec_regression_network.png?raw=true" width="90%" ><br/>
Figure 4. A hybrid recommendation model. Figure 4. Fusion recommendation model
</p> </p>
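The scoring and loss in step 4 can be sketched as follows. This is a minimal plain-Python illustration rather than the model code; it assumes the cosine similarity is multiplied by 5 so that it is comparable with the rating scale, and the two feature vectors and the rating of 4.0 are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_error(prediction, label):
    return (prediction - label) ** 2


# Hypothetical combined feature vectors for one user and one movie.
usr_features = [1.0, 2.0, 2.0]
mov_features = [2.0, 4.0, 4.0]

# Assumption: scale the similarity by 5 to match the rating range.
score = 5.0 * cosine_similarity(usr_features, mov_features)
loss = squared_error(score, 4.0)  # 4.0 stands in for the user's true rating
print(score, loss)
```

The two vectors here point in the same direction, so the similarity is at its maximum and the loss comes only from the gap between the scaled score and the rating.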
## Dataset ## Data Preparation
### Data Introduction and Download
We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) to train our model. This dataset includes 1,000,000 ratings of 4,000 movies from 6,000 users. Each rating is in the range of 1~5. Thanks to GroupLens Research for collecting, processing and publishing the dataset. We take [MovieLens Million Dataset (ml-1m)](http://files.grouplens.org/datasets/movielens/ml-1m.zip) as an example. The ml-1m dataset contains 1,000,000 ratings of 4,000 movies by 6,000 users (integer scores ranging from 1 to 5), collected by the GroupLens Research lab.
`paddle.datasets` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens` and `wmt14`, etc. There's no need for us to manually download and preprocess `MovieLens` dataset. Paddle provides modules for automatically loading data in the API. The data module is `paddle.dataset.movielens`
The raw `MovieLens` dataset contains movie ratings and relevant features of both movies and users.
For instance, one movie's feature could be:
```python ```python
from __future__ import print_function
import paddle import paddle
movie_info = paddle.dataset.movielens.movie_info() movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0] print(list(movie_info.values())[0])
```
```python
# Run this block to show dataset's documentation
# help(paddle.dataset.movielens)
``` ```
```text The original data includes feature data of the movie, user's feature data, and the user's rating of the movie.
<MovieInfo id(1), title(Toy Story), categories(['Animation', "Children's", 'Comedy'])>
For example, one of the movie features is:
```python
movie_info = paddle.dataset.movielens.movie_info()
print(list(movie_info.values())[0])
``` ```
One user's feature could be: <MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>
This means that the movie ID is 1 and the title is Toy Story, which belongs to three categories: animation, children's, and comedy.
```python ```python
user_info = paddle.dataset.movielens.user_info() user_info = paddle.dataset.movielens.user_info()
print user_info.values()[0] print(list(user_info.values())[0])
``` ```
```text <UserInfo id(1), gender(F), age(1), job(10)>
<UserInfo id(1), gender(F), age(1), job(10)>
```
In this dataset, the distribution of age is shown as follows:
```text This means that the user ID is 1, female, and younger than 18 years old. The occupation ID is 10.
1: "Under 18"
18: "18-24"
25: "25-34"
35: "35-44"
45: "45-49"
50: "50-55"
56: "56+"
```
User's occupation is selected from the following options:
```text
0: "other" or not specified
1: "academic/educator"
2: "artist"
3: "clerical/admin"
4: "college/grad student"
5: "customer service"
6: "doctor/health care"
7: "executive/managerial"
8: "farmer"
9: "homemaker"
10: "K-12 student"
11: "lawyer"
12: "programmer"
13: "retired"
14: "sales/marketing"
15: "scientist"
16: "self-employed"
17: "technician/engineer"
18: "tradesman/craftsman"
19: "unemployed"
20: "writer"
```
Each record consists of three main components: user features, movie features and movie ratings. Among them, the age uses the following distribution
Likewise, as a simple example, consider the following:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
The occupation is selected from the following options:
* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
For each training or test data, it is <user features> + <movie feature> + rating.
For example, we get the first training data:
```python ```python
train_set_creator = paddle.dataset.movielens.train() train_set_creator = paddle.dataset.movielens.train()
train_sample = next(train_set_creator()) train_sample = next(train_set_creator())
uid = train_sample[0] uid = train_sample[0]
mov_id = train_sample[len(user_info[uid].value())] mov_id = train_sample[len(user_info[uid].value())]
print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1]) print("User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1]))
``` ```
```text ```python
User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest), categories(['Drama'])> with Score [5.0] User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]
``` ```
The output shows that user 1 gave movie `1193` a rating of 5. That is, the user 1 evaluates the movie 1193 as 5 points.
After issuing a command `python train.py`, training will start immediately. The details will be unpacked by the following sessions to see how it works. ## Configuration Instruction
Below we begin to configure the model based on the form of the input data. First import the required library functions and define global variables.
- IS_SPARSE: whether to use sparse update in embedding
- PASS_NUM: number of epoch
## Model Configuration
Our program starts with importing necessary packages and initializing some global variables:
```python ```python
from __future__ import print_function
import math import math
import sys import sys
import numpy as np import numpy as np
...@@ -193,26 +228,17 @@ import paddle ...@@ -193,26 +228,17 @@ import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import paddle.fluid.layers as layers import paddle.fluid.layers as layers
import paddle.fluid.nets as nets import paddle.fluid.nets as nets
try:
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
IS_SPARSE = True IS_SPARSE = True
USE_GPU = False
BATCH_SIZE = 256 BATCH_SIZE = 256
PASS_NUM = 20
``` ```
Then define the model configuration for our user feature synthesis model
Then we define the model configuration for user combined features:
```python ```python
def get_usr_combined_features(): def get_usr_combined_features():
"""network definition for user part"""
USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1 USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1
...@@ -269,14 +295,16 @@ def get_usr_combined_features(): ...@@ -269,14 +295,16 @@ def get_usr_combined_features():
return usr_combined_features return usr_combined_features
``` ```
As shown in the above code, the input for each user is four integer features: `user_id`, `gender_id`, `age_id` and `job_id`. In order to deal with these features conveniently, we borrow the embedding technique used by language models in NLP to transform these discrete values into embedding values `usr_emb`, `usr_gender_emb`, `usr_age_emb` and `usr_job_emb`. As shown in the code above, for each user we input four features: user_id, gender_id, age_id, and job_id. These features are simple integer values. To make them easier for the subsequent neural network to process, we borrow the embedding technique used by language models in NLP to transform these discrete integer values into embeddings, namely usr_emb, usr_gender_emb, usr_age_emb, and usr_job_emb.
Then, we enter all the user features into a fully connected layer(fc). Combine all features into one 200-dimension feature.
Then we can use user features as input, directly connecting to a fully-connected layer, which is used to reduce dimension to 200. Furthermore, we make a similar transformation for each movie feature, the network configuration is:
Furthermore, we do a similar transformation for each movie feature. The model configuration is:
```python ```python
def get_mov_combined_features(): def get_mov_combined_features():
"""network definition for item(movie) part"""
MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1 MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1
...@@ -325,13 +353,15 @@ def get_mov_combined_features(): ...@@ -325,13 +353,15 @@ def get_mov_combined_features():
return mov_combined_features return mov_combined_features
``` ```
Movie title, which is a sequence of words represented by an integer word index sequence, will be fed into a `sequence_conv_pool` layer, which will apply convolution and pooling on time dimension. Because pooling is done on time dimension, the output will be a fixed-length vector regardless of the length of the input sequence.
The title of a movie is a sequence of integers, where each integer is the index of a word in the vocabulary. This sequence is sent to the `sequence_conv_pool` layer, which applies convolution and pooling along the time dimension. Because of this, the output has a fixed length even though the length of the input sequence varies.
Finally, we can define a `inference_program` that uses cosine similarity to calculate the similarity between user characteristics and movie features. Finally, we define an `inference_program` to calculate the similarity between user features and movie features using cosine similarity.
```python ```python
def inference_program(): def inference_program():
"""the combined network"""
usr_combined_features = get_usr_combined_features() usr_combined_features = get_usr_combined_features()
mov_combined_features = get_mov_combined_features() mov_combined_features = get_mov_combined_features()
...@@ -341,11 +371,11 @@ def inference_program(): ...@@ -341,11 +371,11 @@ def inference_program():
return scale_infer return scale_infer
``` ```
Then we define a `training_program` that uses the result from `inference_program` to compute the cost with label data. Furthermore, we define a `train_program` that takes the result computed by `inference_program` and calculates the error against the label data. We also define an `optimizer_func` to specify the optimizer.
Also define `optimizer_func` to specify the optimizer.
```python ```python
def train_program(): def train_program():
"""define the cost function"""
scale_infer = inference_program() scale_infer = inference_program()
...@@ -360,21 +390,19 @@ def optimizer_func(): ...@@ -360,21 +390,19 @@ def optimizer_func():
return fluid.optimizer.SGD(learning_rate=0.2) return fluid.optimizer.SGD(learning_rate=0.2)
``` ```
## Model Training
### Specify training environment ## Training Model
Specify your training environment, you should specify if the training is on CPU or GPU. ### Defining the training environment
Define your training environment and specify whether the training takes place on CPU or GPU.
```python ```python
use_cuda = False use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
``` ```
### Datafeeder Configuration ### Defining the data provider
The next step is to define a data provider for training and testing. The provider reads in batches of `BATCH_SIZE` records. `paddle.dataset.movielens.train` provides batches of size `BATCH_SIZE` after shuffling, and the size of the shuffle buffer is `buf_size`.
Next we define data feeders for test and train. The feeder reads `buf_size` records each time and feeds them to the training/testing process.
`paddle.dataset.movielens.train` yields records during each pass. After shuffling, batches of `BATCH_SIZE` records are generated for training.
```python ```python
train_reader = paddle.batch( train_reader = paddle.batch(
...@@ -386,87 +414,111 @@ test_reader = paddle.batch( ...@@ -386,87 +414,111 @@ test_reader = paddle.batch(
paddle.dataset.movielens.test(), batch_size=BATCH_SIZE) paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
``` ```
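To make the roles of `buf_size` and `BATCH_SIZE` concrete, here is a toy sketch of buffered shuffling followed by batching. The helpers `shuffle_reader` and `batch_reader` are hypothetical stand-ins written for this illustration; they are not the Paddle functions, and the real ones may differ in details.

```python
import random


def shuffle_reader(reader, buf_size, seed=0):
    """Collect up to buf_size records, shuffle them, and yield them one by one."""
    rng = random.Random(seed)

    def new_reader():
        buf = []
        for item in reader():
            buf.append(item)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for record in buf:
                    yield record
                buf = []
        rng.shuffle(buf)  # flush whatever is left at the end of the pass
        for record in buf:
            yield record

    return new_reader


def batch_reader(reader, batch_size):
    """Group consecutive records into lists of batch_size (last one may be shorter)."""

    def new_reader():
        batch = []
        for item in reader():
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    return new_reader


def toy_records():
    # Stand-in for a dataset reader: yields the integers 0..9.
    for i in range(10):
        yield i


toy_train_reader = batch_reader(shuffle_reader(toy_records, buf_size=4), batch_size=3)
batches = list(toy_train_reader())
print(batches)
```

A larger buffer mixes records from further apart in the stream, at the cost of holding more records in memory.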
### Create Trainer ### Constructing a training process (trainer)
We have constructed a training process here, including training optimization functions.
Create a trainer that takes `train_program` as input and specify optimizer function.
```python ### Provide data
trainer = Trainer(
train_func=train_program, place=place, optimizer_func=optimizer_func)
```
### Feeding Data `feed_order` is used to define the mapping between each generated data and `paddle.layer.data`. For example, the data in the first column generated by `movielens.train` corresponds to the feature `user_id`.
`feed_order` is devoted to specifying the correspondence between each yield record and `paddle.layer.data`. For instance, the first column of data generated by `movielens.train` corresponds to `user_id` feature.
```python ```python
feed_order = [ feed_order = [
'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id', 'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
'movie_title', 'score' 'movie_title', 'score'
] ]
``` ```
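The mapping expressed by `feed_order` is purely positional: the first column of each record goes to `user_id`, the second to `gender_id`, and so on. A small sketch with a made-up record shows the idea; the helper `name_columns` and the record values are hypothetical and only illustrate the pairing, not how the data feeder is implemented.

```python
feed_order = [
    'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
    'movie_title', 'score'
]

# Hypothetical record with made-up values; real records come from the dataset reader.
record = [1, 1, 0, 10, 1193, [7], [12, 45, 3], [5.0]]


def name_columns(record, order):
    """Pair each column of a record with the input name at the same position."""
    if len(record) != len(order):
        raise ValueError("record has %d columns, expected %d" % (len(record), len(order)))
    return dict(zip(order, record))


named = name_columns(record, feed_order)
print(named['user_id'], named['movie_title'], named['score'])
```

If the order of names does not match the order of columns, values are silently fed to the wrong inputs, so the list has to follow the record layout exactly.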
### Event Handler ### Building training programs and testing programs
The training program and the test program are separately constructed, and the training optimizer is imported.
Callback function `event_handler` will be called during training when a pre-defined event happens. ```python
For example, we can check the cost by `trainer.test` when `EndStepEvent` occurs main_program = fluid.default_main_program()
star_program = fluid.default_startup_program()
[avg_cost, scale_infer] = train_program()
test_program = main_program.clone(for_test=True)
sgd_optimizer = optimizer_func()
sgd_optimizer.minimize(avg_cost)
exe = fluid.Executor(place)
def train_test(program, reader):
count = 0
feed_var_list = [
program.global_block().var(var_name) for var_name in feed_order
]
feeder_test = fluid.DataFeeder(
feed_list=feed_var_list, place=place)
test_exe = fluid.Executor(place)
accumulated = 0
for test_data in reader():
avg_cost_np = test_exe.run(program=program,
feed=feeder_test.feed(test_data),
fetch_list=[avg_cost])
accumulated += avg_cost_np[0]
count += 1
return accumulated / count
```
### Build a training main loop and start training
We run the training loop for the number of passes (`PASS_NUM`) defined above, together with some other parameters, and evaluate on the test set after every mini-batch. When the test result is good enough, we exit the training and save the trained parameters.
```python ```python
# Specify the directory path to save the parameters # Specify the directory path to save the parameters
params_dirname = "recommender_system.inference.model" params_dirname = "recommender_system.inference.model"
def event_handler(event): from paddle.utils.plot import Ploter
if isinstance(event, EndStepEvent): train_prompt = "Train cost"
test_reader = paddle.batch( test_prompt = "Test cost"
paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
avg_cost_set = trainer.test(
reader=test_reader, feed_order=feed_order)
# get avg cost
avg_cost = np.array(avg_cost_set).mean()
print("avg_cost: %s" % avg_cost) plot_cost = Ploter(train_prompt, test_prompt)
if float(avg_cost) < 4: # Change this number to adjust accuracy def train_loop():
trainer.save_params(params_dirname) feed_list = [
trainer.stop() main_program.global_block().var(var_name) for var_name in feed_order
else: ]
print('BatchID {0}, Test Loss {1:0.2}'.format(event.epoch + 1, feeder = fluid.DataFeeder(feed_list, place)
float(avg_cost))) exe.run(star_program)
if math.isnan(float(avg_cost)):
for pass_id in range(PASS_NUM):
for batch_id, data in enumerate(train_reader()):
# train a mini-batch
outs = exe.run(program=main_program,
feed=feeder.feed(data),
fetch_list=[avg_cost])
out = np.array(outs[0])
# get test avg_cost
test_avg_cost = train_test(test_program, test_reader)
plot_cost.append(train_prompt, batch_id, outs[0])
plot_cost.append(test_prompt, batch_id, test_avg_cost)
plot_cost.plot()
if batch_id == 20:
if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, [
"user_id", "gender_id", "age_id", "job_id",
"movie_id", "category_id", "movie_title"
], [scale_infer], exe)
return
print('EpochID {0}, BatchID {1}, Test Loss {2:0.2}'.format(
pass_id + 1, batch_id + 1, float(test_avg_cost)))
if math.isnan(float(out[0])):
sys.exit("got NaN loss, training failed.") sys.exit("got NaN loss, training failed.")
``` ```
Start training
### Training
Finally, we invoke `trainer.train` to start training with `num_epochs` and other parameters.
```python ```python
trainer.train( train_loop()
num_epochs=1,
event_handler=event_handler,
reader=train_reader,
feed_order=feed_order)
``` ```
## Inference ## Model Application
### Create Inferencer ### Generate test data
Use the `create_lod_tensor(data, lod, place)` API to generate a level-of-detail (LoD) tensor. `data` is a list of sequences, each of which is a sequence of index numbers. `lod` is the level-of-detail information corresponding to `data`. For example, data = [[10, 2, 3], [2, 3]] means that it contains two sequences of lengths 3 and 2. Correspondingly, lod = [[3, 2]] contains one level of detail information, meaning that `data` has two sequences of lengths 3 and 2.
Initialize Inferencer with `inference_program` and `params_dirname` which is where we save params from training. In this prediction example, we try to predict the score given by user with ID1 for the movie 'Hunchback of Notre Dame'.
```python
inferencer = Inferencer(
inference_program, param_path=params_dirname, place=place)
```
### Generate input data for testing
Use create_lod_tensor(data, lod, place) API to generate LoD Tensor, where `data` is a list of sequences of index numbers, `lod` is the level of detail (lod) info associated with `data`.
For example, data = [[10, 2, 3], [2, 3]] means that it contains two sequences of indices, of length 3 and 2, respectively.
Correspondingly, lod = [[3, 2]] contains one level of detail info, indicating that `data` consists of two sequences of length 3 and 2.
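The relationship between `data` and `lod` described above can be sketched in plain Python, without PaddlePaddle. The helpers `length_based_lod` and `flatten` are hypothetical names introduced only for this illustration; they are not part of the Paddle API, and the sketch assumes a single level of detail.

```python
def length_based_lod(data):
    """Return the one-level, length-based LoD for a list of sequences."""
    return [[len(seq) for seq in data]]


def flatten(data):
    """Concatenate all sequences into one flat list of index numbers."""
    return [index for seq in data for index in seq]


data = [[10, 2, 3], [2, 3]]
lod = length_based_lod(data)
flat = flatten(data)

# The lengths recorded in the LoD add up to the number of flattened elements.
assert sum(lod[0]) == len(flat)

print(lod)   # [[3, 2]]
print(flat)  # [10, 2, 3, 2, 3]
```

A single-element input such as `[[1]]` gives the LoD `[[1]]`, which matches the `[[1]]` passed for `user_id` in the example that follows.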
In this infer example, we try to predict rating of movie 'Hunchback of Notre Dame' from the info of user id 1.
```python ```python
infer_movie_id = 783 infer_movie_id = 783
infer_movie_name = paddle.dataset.movielens.movie_info()[infer_movie_id].title infer_movie_name = paddle.dataset.movielens.movie_info()[infer_movie_id].title
...@@ -480,13 +532,27 @@ movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]], ...@@ -480,13 +532,27 @@ movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]],
place) # 'hunchback','of','notre','dame','the' place) # 'hunchback','of','notre','dame','the'
``` ```
### Infer ### Building the prediction process and testing
Similar to the training process, we need to build a prediction process, where `params_dirname` is the directory in which the parameters were saved during training.
```python
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
inference_scope = fluid.core.Scope()
```
### Testing
Now we can make predictions. The `feed_order` we provide should be consistent with the training process.
Now we can infer with inputs that we provide in `feed_order` during training.
```python ```python
results = inferencer.infer( with fluid.scope_guard(inference_scope):
{ [inferencer, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)
results = exe.run(inferencer,
feed={
'user_id': user_id, 'user_id': user_id,
'gender_id': gender_id, 'gender_id': gender_id,
'age_id': age_id, 'age_id': age_id,
...@@ -495,27 +561,30 @@ results = inferencer.infer( ...@@ -495,27 +561,30 @@ results = inferencer.infer(
'category_id': category_id, 'category_id': category_id,
'movie_title': movie_title 'movie_title': movie_title
}, },
fetch_list=fetch_targets,
return_numpy=False) return_numpy=False)
predict_rating = np.array(results[0])
predict_rating = np.array(results[0]) print("Predict Rating of user id 1 on movie \"" + infer_movie_name +
print("Predict Rating of user id 1 on movie \"" + infer_movie_name + "\" is " + str(predict_rating[0][0])) "\" is " + str(predict_rating[0][0]))
print("Actual Rating of user id 1 on movie \"" + infer_movie_name + "\" is 4.") print("Actual Rating of user id 1 on movie \"" + infer_movie_name +
"\" is 4.")
``` ```
## Conclusion ## Summary
This tutorial goes over traditional approaches to recommender systems and a deep learning based approach. We also show how to train and use the model with PaddlePaddle. Deep learning has been widely used in computer vision and NLP, and we look forward to its new successes in recommender systems. This chapter introduced traditional personalized recommendation methods and YouTube's deep neural network personalized recommendation system. It then took movie recommendation as an example and used PaddlePaddle to train a personalized recommendation neural network model. Personalized recommendation is used in almost every area of e-commerce systems, social networks, advertising recommendation, search engines, and so on. Deep learning has played an important role in image processing and natural language processing, and is expected to do the same in personalized recommendation systems.
<a name="references"></a>
## References ## References
1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325. 1. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.
2. Robin Burke, [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2. 2. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th international conference on World Wide Web*. ACM, 2001.
3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186. 3. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: combining social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA
4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001. 4. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA 5. Robin Burke , [Hybrid Web recommendation systems](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.435.7538&rep=rep1&type=pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016). 6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198. 7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on recommendation systems. ACM, 2016: 191-198.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
### This file is only used for continuous evaluation test!
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi
test_cost_kpi = CostKpi('test_cost', 0.02, 0, actived=True, desc='test cost')
tracking_kpis = [test_cost_kpi]
def parse_log(log):
for line in log.split('\n'):
fs = line.strip().split('\t')
print(fs)
if len(fs) == 3 and fs[0] == 'kpis':
kpi_name = fs[1]
kpi_value = float(fs[2])
yield kpi_name, kpi_value
def log_to_ce(log):
kpi_tracker = {}
for kpi in tracking_kpis:
kpi_tracker[kpi.name] = kpi
for (kpi_name, kpi_value) in parse_log(log):
print(kpi_name, kpi_value)
kpi_tracker[kpi_name].add_record(kpi_value)
kpi_tracker[kpi_name].persist()
if __name__ == '__main__':
log = sys.stdin.read()
log_to_ce(log)
...@@ -44,6 +44,14 @@ ...@@ -44,6 +44,14 @@
The source code of this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). First-time users, please refer to the [Book documentation instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
### Notes: ###
1. Hardware requirements:
This tutorial can be run on either CPU or GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, please note that the GPU environment of the default image provided here is CUDA 8/cuDNN 5. For GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in the document and in the script:
Please note: to make this tutorial easier to read and use, we have split and adjusted the code of train.py and placed it in this document. The code in this document produces the same results as train.py, which you can run directly to verify: [train.py](https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/train.py).
## Background Introduction
With the continuous development of network technology and the ever-expanding scale of e-commerce, the number and variety of goods are growing rapidly, and users need to spend a lot of time to find the goods they want to buy. This is the problem of information overload. To solve this problem, the personalized recommender system (Recommender System) came into being.
...@@ -96,7 +104,9 @@ YouTube是世界上最大的视频上传、分享和发现网站,YouTube个性 ...@@ -96,7 +104,9 @@ YouTube是世界上最大的视频上传、分享和发现网站,YouTube个性
For a user $U$, the probability that the video $\omega$ the user wants to watch at this moment is video $i$ is given by:
$$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/formula1.png?raw=true" width="20%" ><br/>
</p>
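Here $u$ is the feature representation of user $U$, $V$ is the set of videos in the video library, and $v_i$ is the feature representation of the $i$-th video in the library. $u$ and $v_i$ are vectors of equal length, and their dot product can be implemented with a fully connected layer.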
...@@ -125,11 +135,15 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ ...@@ -125,11 +135,15 @@ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
Next, perform the convolution: apply a convolution kernel $w\in\mathbb{R}^{hk}$ to a window of $h$ words $x_{i:i+h-1}$ to obtain the feature $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias term and $f$ is a non-linear activation function such as $sigmoid$. Applying the kernel to every word window in the sentence, $\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$, produces a feature map:
$$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/formula2.png?raw=true" width="40%" ><br/>
</p>
Next, apply max pooling over time to the feature map to obtain the feature $\hat c$ of the whole sentence for this kernel, which is the maximum of all elements in the feature map:
$$\hat c=max(c)$$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/formula3.png?raw=true" width="15%" ><br/>
</p>
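The convolution and max-pooling-over-time steps above can be sketched in plain Python. This is a minimal illustration rather than the tutorial's PaddlePaddle implementation: the function name `text_conv_feature_map`, the toy word vectors, and the kernel weights are all made up for the example, and $f$ is assumed to be the sigmoid.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def text_conv_feature_map(words, kernel, bias, h):
    """Apply one kernel to every window of h consecutive word vectors.

    words:  list of n word vectors, each of length k
    kernel: flat list of h * k weights (w in R^{hk})
    Returns the feature map [c_1, ..., c_{n-h+1}].
    """
    feature_map = []
    for i in range(len(words) - h + 1):
        # Concatenate the h word vectors of the window x_{i:i+h-1}.
        window = [value for word in words[i:i + h] for value in word]
        dot = sum(w * x for w, x in zip(kernel, window))
        feature_map.append(sigmoid(dot + bias))
    return feature_map


# Toy sentence: n = 4 words, embedding size k = 2 (made-up numbers).
words = [[0.1, 0.2], [0.4, -0.3], [0.0, 0.5], [-0.2, 0.1]]
h = 2
kernel = [0.5, -0.5, 1.0, 1.0]  # h * k = 4 weights
bias = 0.0

c = text_conv_feature_map(words, kernel, bias, h)
c_hat = max(c)  # max pooling over time: one scalar per kernel

print(len(c))  # 3, i.e. n - h + 1
print(c_hat)
```

Each kernel contributes a single scalar $\hat c$ regardless of sentence length, which is why titles of different lengths can be mapped to a fixed-size feature vector by using several kernels.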
#### Fusion Recommendation Model Overview
...@@ -162,9 +176,10 @@ Paddle在API中提供了自动加载数据的模块。数据模块为 `paddle.da ...@@ -162,9 +176,10 @@ Paddle在API中提供了自动加载数据的模块。数据模块为 `paddle.da
```python ```python
from __future__ import print_function
import paddle import paddle
movie_info = paddle.dataset.movielens.movie_info() movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0] print(list(movie_info.values())[0])
``` ```
...@@ -180,7 +195,7 @@ print movie_info.values()[0] ...@@ -180,7 +195,7 @@ print movie_info.values()[0]
```python ```python
movie_info = paddle.dataset.movielens.movie_info() movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0] print(list(movie_info.values())[0])
``` ```
<MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])> <MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>
...@@ -191,7 +206,7 @@ print movie_info.values()[0] ...@@ -191,7 +206,7 @@ print movie_info.values()[0]
```python ```python
user_info = paddle.dataset.movielens.user_info() user_info = paddle.dataset.movielens.user_info()
print user_info.values()[0] print(list(user_info.values())[0])
``` ```
<UserInfo id(1), gender(F), age(1), job(10)> <UserInfo id(1), gender(F), age(1), job(10)>
...@@ -244,7 +259,7 @@ train_set_creator = paddle.dataset.movielens.train() ...@@ -244,7 +259,7 @@ train_set_creator = paddle.dataset.movielens.train()
train_sample = next(train_set_creator()) train_sample = next(train_set_creator())
uid = train_sample[0] uid = train_sample[0]
mov_id = train_sample[len(user_info[uid].value())] mov_id = train_sample[len(user_info[uid].value())]
print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1]) print ("User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1]))
``` ```
User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0] User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]
...@@ -260,7 +275,6 @@ print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id] ...@@ -260,7 +275,6 @@ print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id]
```python ```python
from __future__ import print_function
import math import math
import sys import sys
import numpy as np import numpy as np
...@@ -561,13 +575,13 @@ train_loop() ...@@ -561,13 +575,13 @@ train_loop()
```python ```python
infer_movie_id = 783 infer_movie_id = 783
infer_movie_name = paddle.dataset.movielens.movie_info()[infer_movie_id].title infer_movie_name = paddle.dataset.movielens.movie_info()[infer_movie_id].title
user_id = fluid.create_lod_tensor([[1]], [[1]], place) user_id = fluid.create_lod_tensor([[np.int64(1)]], [[1]], place)
gender_id = fluid.create_lod_tensor([[1]], [[1]], place) gender_id = fluid.create_lod_tensor([[np.int64(1)]], [[1]], place)
age_id = fluid.create_lod_tensor([[0]], [[1]], place) age_id = fluid.create_lod_tensor([[np.int64(0)]], [[1]], place)
job_id = fluid.create_lod_tensor([[10]], [[1]], place) job_id = fluid.create_lod_tensor([[np.int64(10)]], [[1]], place)
movie_id = fluid.create_lod_tensor([[783]], [[1]], place) # Hunchback of Notre Dame movie_id = fluid.create_lod_tensor([[np.int64(783)]], [[1]], place) # Hunchback of Notre Dame
category_id = fluid.create_lod_tensor([[10, 8, 9]], [[3]], place) # Animation, Children's, Musical category_id = fluid.create_lod_tensor([np.array([10, 8, 9], dtype='int64')], [[3]], place) # Animation, Children's, Musical
movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]], movie_title = fluid.create_lod_tensor([np.array([1069, 4140, 2923, 710, 988], dtype='int64')], [[5]],
place) # 'hunchback','of','notre','dame','the' place) # 'hunchback','of','notre','dame','the'
``` ```
...@@ -620,13 +634,13 @@ with fluid.scope_guard(inference_scope): ...@@ -620,13 +634,13 @@ with fluid.scope_guard(inference_scope):
2. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th international conference on World Wide Web*. ACM, 2001. 2. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th international conference on World Wide Web*. ACM, 2001.
3. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: combining social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA 3. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: combining social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA
4. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325. 4. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
5. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2. 5. Robin Burke , [Hybrid Web Recommender Systems](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.435.7538&rep=rep1&type=pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016). 6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198. 7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -40,194 +40,229 @@ ...@@ -40,194 +40,229 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Personalized Recommendation # Recommender System
The source code from this tutorial is at [here](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). For instructions on getting started with this book,see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code of this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system). For new users, please refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book) .
## Background Introduction
## Background With the continuous development of network technology and the ever-expanding scale of e-commerce, the number and variety of goods are growing rapidly, and users need to spend a lot of time to find the goods they want to buy. This is the problem of information overload. To solve this problem, recommendation systems came into being.
The recommender system is a component of e-commerce, online videos, and online reading services. There are several different approaches for recommender systems to learn from user behavior and product properties and to understand users' interests. The recommendation system is a subset of the Information Filtering System, which can be used in a range of areas such as movies, music, e-commerce, and feed stream recommendations. The recommendation system discovers the user's personalized needs and interests by analyzing and mining user behaviors, and recommends information or products that may be of interest to the user. Unlike search engines, recommendation systems do not require users to describe their needs accurately; instead, they model users' historical behavior to proactively provide information that meets their interests and needs.
- User behavior-based approach. A well-known method of this approach is collaborative filtering, which assumes that if two users made similar purchases, they share common interests and would likely go on making the same decision. Some variants of collaborative filtering are user-based[[3](#references)], item-based [[4](#references)], social network based[[5](#references)], and model-based. The GroupLens system \[[1](#references)\] introduced by the University of Minnesota in 1994 is generally considered to be a relatively independent research direction for the recommendation system. The system first proposed the idea of completing recommendation task based on collaborative filtering. After that, the collaborative filtering recommendation based on the model led the development of recommendation system for more than ten years.
- Content-based approach[[1](#references)]. This approach represents product properties and user interests as feature vectors of the same space so that it could measure how much a user is interested in a product by the distance between two feature vectors. The traditional personalized recommendation system methods mainly include:
- Hybrid approach[[2](#references)]: This one combines above two to help with each other about the data sparsity problem[[6](#references)]. - Collaborative Filtering Recommendation: This method is one of the most widely used technologies which requires the collection and analysis of users' historical behaviors, activities and preferences. It can usually be divided into two sub-categories: User-Based Recommendation \[[1](#references)\] and Item-Based Recommendation \[[2](#references)\]. A key advantage of this method is that it does not rely on the machine to analyze the content characteristics of the item, so it does not need to understand the item itself to accurately recommend complex items such as movies. However, the disadvantage is that there is a cold start problem for new users without any behavior. At the same time, there is also a sparsity problem caused by insufficient interaction data between users and commodities. It is worth mentioning that social network \[[3](#references)\] or geographic location and other context information can be integrated into collaborative filtering.
- Content-Based Filtering Recommendation \[[4](#references)\] : This method uses the content description of the product to abstract meaningful features by calculating the similarity between the user's interest and the product description to make recommendations to users. The advantage is that it is simple and straightforward. It does not need to evaluate products based on the comments of users. Instead, it compares the product similarity by product attributes to recommend similar products to the users of interest. The disadvantage is that there is also a cold start problem for new users without any behavior.
- Hybrid Recommendation \[[5](#references)\]: Use different inputs and techniques to jointly recommend items to complement each single recommendation technique.
This tutorial explains a deep learning based hybrid approach and its implement in PaddlePaddle. We are going to train a model using a dataset that includes user information, movie information, and ratings. Once we train the model, we will be able to get a predicted rating given a pair of user and movie IDs. In recent years, deep learning has achieved great success in many fields. Both academia and industry are trying to apply deep learning to the field of recommendation systems. Deep learning has excellent ability to automatically extract features, can learn multi-level abstract feature representations, and learn heterogeneous or cross-domain content information, which can deal with the cold start problem \[[6](#references)\] of recommendation system to some extent. This tutorial focuses on the deep learning model of recommendation system and how to implement the model with PaddlePaddle.
## Result Demo
## Model Overview We use a dataset containing user information, movie information, and movie ratings to build a recommendation system. Once the model is trained, we only need to input a user ID and a movie ID to get a matching score (in the range [0, 5]; a higher score is regarded as greater interest). We can then sort all movies by their scores and recommend to the user the movies that may be of interest.
To know more about deep learning based recommendation, let us start from going over the Youtube recommender system[[7](#references)] before introducing our hybrid model. ```
Input movie_id: 1962
Input user_id: 1
Prediction Score is 4.25
```
## Model Overview
In this chapter, we first introduce YouTube's video personalization recommendation system \[[7](#references)\], and then introduce the fusion recommendation model we implemented.
### YouTube's Deep Learning Recommendation Model ### YouTube's Deep Neural Network Personalized Recommendation System
YouTube is a video-sharing Web site with one of the largest user base in the world. Its recommender system serves more than a billion users. This system is composed of two major parts: candidate generation and ranking. The former selects few hundreds of candidates from millions of videos, and the latter ranks and outputs the top 10. YouTube is the world's largest video uploading, sharing and discovery site, and the YouTube Personalized Recommendation System recommends personalized content from a growing library to more than 1 billion users. The entire system consists of two neural networks: a candidate generation network and a ranking network. The candidate generation network generates hundreds of candidates from a million-level video library, and the ranking network sorts the candidates and outputs the highest ranked tens of results. The system structure is shown in Figure 1:
<p align="center"> <p align="center">
<img src="image/YouTube_Overview.en.png" width="70%" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/YouTube_Overview.png?raw=true" width="70%" ><br/>
Figure 1. YouTube recommender system overview. Figure 1. YouTube personalized recommendation system structure
</p> </p>
#### Candidate Generation Network #### Candidate Generation Network
YouTube models candidate generation as a multi-class classification problem with a huge number of classes equal to the number of videos. The architecture of the model is as follows: The candidate generation network models the recommendation problem as a multi-class classification problem with a large number of classes. For a YouTube user, the network uses the watching history (video IDs), search tokens, demographic information (such as geographic location and login device), binary features (such as gender and whether the user is logged in), and continuous features (such as user age) to classify over all videos in the video library. This yields a score for each class (i.e., the recommendation probability of each video), and the network finally outputs the few hundred videos with the highest probability.
First, historical information such as the watching history and search tokens is mapped to vectors and averaged to obtain fixed-length representations. At the same time, demographic features are fed in to improve recommendations for new users, and binary and continuous features are normalized to the range [0, 1]. Next, all feature representations are concatenated into a single vector and fed to a non-linear multilayer perceptron (MLP, see the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/02.recognize_digits/README.md) tutorial). Finally, during training, the output of the MLP is classified by softmax. During prediction, the similarity between the user's combined feature vector (the MLP output) and the feature vectors of all videos is computed, and the $k$ videos with the highest scores are returned as the result of the candidate generation network. Figure 2 shows the candidate generation network structure.
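The feature preparation described above (averaging variable-length histories, scaling continuous features, then concatenating) can be sketched in plain Python. The helper names, the 3-dimensional embeddings, and the assumed age range of 0 to 100 are illustrative assumptions, not values from the YouTube paper or from this tutorial's code.

```python
def average_vectors(vectors):
    """Average a variable-length list of equal-sized vectors into one vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def min_max_normalize(value, low, high):
    """Scale a continuous feature into the range [0, 1]."""
    return (value - low) / (high - low)


# Hypothetical 3-dimensional embeddings of three watched videos.
watched = [[0.2, 0.4, 0.6], [0.4, 0.0, 0.2], [0.6, 0.2, 0.4]]
history_vector = average_vectors(watched)  # fixed length, whatever the history size

age = min_max_normalize(30, low=0, high=100)  # continuous feature in [0, 1]
is_logged_in = 1.0                            # binary feature

# Concatenate everything into the single vector that would feed the MLP.
mlp_input = history_vector + [age, is_logged_in]
print(len(mlp_input))  # 5
```

Averaging keeps the input size independent of how many videos a user has watched, which is what allows a fixed-width MLP to consume it.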
<p align="center"> <p align="center">
<img src="image/Deep_candidate_generation_model_architecture.en.png" width="70%" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/Deep_candidate_generation_model_architecture.png?raw=true" width="70%" ><br/>
Figure 2. Deep candidate generation model. Figure 2. Candidate generation network structure
</p> </p>
The first stage of this model maps watching history and search queries into fixed-length representative features. Then, an MLP (multi-layer Perceptron, as described in the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/recognize_digits/README.md) tutorial) takes the concatenation of all representative vectors. The output of the MLP represents the user' *intrinsic interests*. At training time, it is used together with a softmax output layer for minimizing the classification error. At serving time, it is used to compute the relevance of the user with all movies. For a user $U$, the formula for predicting whether the video $\omega$ that the user wants to watch at the moment is video $i$ is:
For a user $U$, the predicted watching probability of video $i$ is
$$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$ $$P(\omega=i|u)=\frac{e^{v_{i}u}}{\sum_{j \in V}e^{v_{j}u}}$$
where $u$ is the representative vector of user $U$, $V$ is the corpus of all videos, and $v_i$ is the representative vector of the $i$-th video. $u$ and $v_i$ are vectors of the same length, so we can compute their dot product using a fully connected layer. Here $u$ is the feature representation of user $U$, $V$ is the set of videos in the video library, and $v_i$ is the feature representation of the $i$-th video in the library. $u$ and $v_i$ are vectors of equal length, and their dot product can be implemented by a fully connected layer.
This model could have a performance issue as the softmax output covers millions of classification labels. To optimize performance, at the training time, the authors down-sample negative samples, so the actual number of classes is reduced to thousands. At serving time, the authors ignore the normalization of the softmax outputs, because the results are just for ranking. Considering that the number of categories in the softmax classification is very large, in order to ensure a certain computational efficiency: 1) in the training phase, use negative sample category sampling to reduce the number of actually calculated categories to thousands; 2) in the recommendation (prediction) phase, ignore the normalized calculation of softmax (does not affect the result), and simplifies the category scoring problem into the nearest neighbor search problem in the dot product space, then takes the nearest $k$ video of $u$ as a candidate for generation.
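The claim that skipping the softmax normalization does not change the ranking can be checked numerically. The following is a minimal NumPy sketch, not the YouTube implementation: the video matrix `V`, the user vector `u`, and all sizes are randomly generated assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical sizes and random features, for illustration only.
rng = np.random.default_rng(0)
num_videos, dim, k = 1000, 16, 5

V = rng.normal(size=(num_videos, dim))  # row i plays the role of v_i
u = rng.normal(size=dim)                # user feature vector u

# Dot products v_i . u for every video.
logits = V @ u

# Softmax over all videos: P(w = i | u).
probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
probs /= probs.sum()

# Top-k by normalized probability vs. top-k by raw dot product.
top_k_by_prob = np.argsort(-probs)[:k]
top_k_by_dot = np.argsort(-logits)[:k]
```

Because the exponential is monotonic and the normalizer is shared by all videos, both orderings pick the same $k$ videos, which is why serving can work directly in dot-product space.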
#### Ranking Network #### Ranking Network
The structure of the ranking network is similar to the candidate generation network, but its goal is to perform finer ranking of the candidates. Similar to the feature extraction method in traditional advertisement ranking, a large number of features relevant to video ranking (such as video ID, last watching time, etc.) are constructed here. These features are treated similarly to the candidate generation network, except that the top of the ranking network is a weighted logistic regression that scores all candidate videos and sorts them from high to low before they are returned to the user.
The architecture of the ranking network is similar to that of the candidate generation network. Similar to ranking models widely used in online advertising, it uses rich features like video ID, last watching time, etc. The output layer of the ranking network is a weighted logistic regression, which rates all candidate videos. ### Fusion recommendation model
This section uses Convolutional Neural Networks to learn the representation of movie titles. The convolutional neural network for text and the fusion recommendation model are introduced in turn.
### Hybrid Model
In this section, we introduce our movie recommendation system. In particular, we feed movie titles into a text convolution network to get a fixed-length representative feature vector. Accordingly, we introduce the convolutional neural network for texts and the hybrid recommendation model in turn.
#### Convolutional Neural Networks for Texts (CNN)
**Convolutional Neural Networks** are frequently applied to data with grid-like topology such as two-dimensional images and one-dimensional texts. A CNN can extract multiple local features, combine them, and produce high-level abstractions, which correspond to semantic understanding. Empirically, CNN is shown to be efficient for image and text modeling. #### Convolutional Neural Network (CNN) for text
CNN mainly contains convolution and pooling operation, with versatile combinations in various applications. Here, we briefly describe a CNN as shown in Figure 3. Convolutional neural networks are often used to deal with data of a grid-like topology. For example, an image can be viewed as a pixel of a two-dimensional grid, and a natural language can be viewed as a one-dimensional sequence of words. Convolutional neural networks can extract a variety of local features and combine them to obtain more advanced feature representations. Experiments show that convolutional neural networks can efficiently model image and text problems.
The convolutional neural network is mainly composed of convolution and pooling operations, and its application and combination methods are flexible and varied. In this section we will explain the network as shown in Figure 3:
<p align="center"> <p align="center">
<img src="image/text_cnn_en.png" width = "80%" align="center"/><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/text_cnn.png?raw=true" width = "80%" align="center"/><br />
Figure 3. CNN for text modeling. Figure 3. Convolutional neural network text classification model
</p> </p>
Let $n$ be the length of the sentence to process, and let the $i$-th word have embedding $x_i\in\mathbb{R}^k$, where $k$ is the embedding dimensionality. Suppose the length of the sentence to be processed is $n$, where the word vector of the $i$-th word is $x_i\in\mathbb{R}^k$, and $k$ is the dimension size.
First, we concatenate the words by piecing together every $h$ words, each as a window of length $h$. This window is denoted as $x_{i:i+h-1}$, consisting of $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $x_i$ is the first word in the window and $i$ takes value ranging from $1$ to $n-h+1$: $x_{i:i+h-1}\in\mathbb{R}^{hk}$. First, concatenate the word vectors: every $h$ consecutive words are concatenated to form a word window of size $h$, denoted as $x_{i:i+h-1}$, which is the concatenation of $x_{i}, x_{i+1}, \ldots, x_{i+h-1}$. Here $i$ is the position in the sentence of the first word in the window, ranging from $1$ to $n-h+1$, and $x_{i:i+h-1}\in\mathbb{R}^{hk}$.
Next, we apply the convolution operation: we apply the kernel $w\in\mathbb{R}^{hk}$ in each window, extracting features $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a non-linear activation function such as $sigmoid$. Convolving by the kernel at every window $\{x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}\}$ produces a feature map in the following form: Second, perform the convolution operation: apply the convolution kernel $w\in\mathbb{R}^{hk}$ to the window $x_{i:i+h-1}$ containing $h$ words to get the feature $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a non-linear activation function, such as $sigmoid$. Applying the convolution kernel to all word windows $\{x_{1:h}, x_{2:h+1},\ldots,x_{n-h+1:n}\}$ in the sentence produces a feature map:
$$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$ $$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$
Next, we apply *max pooling* over time to represent the whole sentence $\hat c$, which is the maximum element across the feature map: Next, apply max pooling over time to the feature map to obtain the feature $\hat c$ of the whole sentence for this convolution kernel, which is the maximum value of all elements in the feature map:
$$\hat c=max(c)$$ $$\hat c=max(c)$$
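The window concatenation, convolution, and max-over-time pooling steps above can be sketched for a single kernel in NumPy. This is an illustrative sketch only, not the `sequence_conv_pool` implementation; the helper names and the random inputs are assumptions.

```python
import numpy as np

def sigmoid(z):
    """Non-linear activation f used in c_i = f(w . x_{i:i+h-1} + b)."""
    return 1.0 / (1.0 + np.exp(-z))

def text_conv_max_pool(x, w, b, h):
    """Apply one kernel over all word windows, then max-pool over time.

    x: (n, k) array of word vectors, w: (h*k,) kernel, b: scalar bias.
    Returns the feature map c of length n-h+1 and its maximum c_hat.
    """
    n, k = x.shape
    # Each window x_{i:i+h-1} is the concatenation of h consecutive word vectors.
    windows = [x[i:i + h].reshape(-1) for i in range(n - h + 1)]
    c = np.array([sigmoid(w @ win + b) for win in windows])
    return c, c.max()

# Hypothetical sentence of n=7 words with k=4 dimensional embeddings, window h=3.
rng = np.random.default_rng(0)
n, k, h = 7, 4, 3
x = rng.normal(size=(n, k))
w = rng.normal(size=h * k)
c, c_hat = text_conv_max_pool(x, w, b=0.1, h=h)
```

Whatever the sentence length $n$, each kernel contributes exactly one number $\hat c$, so a bank of kernels yields a fixed-length sentence representation.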
#### Model Structure Of The Hybrid Model #### Fusion recommendation model overview
In our network, the input includes features of users and movies. The user feature includes four properties: user ID, gender, occupation, and age. Movie features include their IDs, genres, and titles. In the film personalized recommendation system that incorporates the recommendation model:
We use fully-connected layers to map user features into representative feature vectors and concatenate them. The process of movie features is similar, except that for movie titles -- we feed titles into a text convolution network as described in the above section to get a fixed-length representative feature vector. 1. First, take user features and movie features as input to the neural network, where:
Given the feature vectors of users and movies, we compute the relevance using cosine similarity. We minimize the squared error at training time. - The user features combine four attributes: user ID, gender, occupation, and age.
- The movie features combine three attributes: movie ID, movie type ID, and movie name.
2. For the user features, map the user ID to a vector representation of dimension 256 and feed it into a fully connected layer; the other three attributes are processed similarly. The feature representations of the four attributes are then each passed through a fully connected layer and summed.
3. For the movie features, the movie ID is processed in the same way as the user ID. The movie type ID is fed directly into a fully connected layer as a vector, and the movie name is converted into a fixed-length vector by a text convolutional neural network. The feature representations of the three attributes are then each passed through a fully connected layer and summed.
4. After obtaining the vector representations of the user and the movie, compute their cosine similarity as the score of the personalized recommendation system. Finally, the square of the difference between this similarity score and the user's true rating is used as the loss function of the regression model.
<p align="center"> <p align="center">
<img src="image/rec_regression_network_en.png" width="90%" ><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/rec_regression_network.png?raw=true" width="90%" ><br/>
Figure 4. A hybrid recommendation model. Figure 4. Fusion recommendation model
</p> </p>
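The scoring and loss described for the fusion model can be sketched in a few lines of NumPy. This is a sketch under stated assumptions: the feature vectors are made up, and the scale factor of 5 (used to map the cosine similarity onto the 1 to 5 rating range) is an assumption, not something this section specifies.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical combined feature vectors for one user and one movie.
usr_features = np.array([1.0, 2.0, 0.5])
mov_features = np.array([2.0, 4.0, 1.0])  # parallel to usr_features

RATING_SCALE = 5.0  # assumption: stretch the similarity to the rating range
true_rating = 4.0   # hypothetical ground-truth rating

predicted = RATING_SCALE * cos_sim(usr_features, mov_features)
square_error = (predicted - true_rating) ** 2
```

Here the two vectors point in the same direction, so the similarity is 1, the predicted score is 5, and the squared error against a true rating of 4 is 1.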
## Dataset ## Data Preparation
### Data Introduction and Download
We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) to train our model. This dataset includes 1,000,000 ratings of 4,000 movies from 6,000 users. Each rating is in the range of 1~5. Thanks to GroupLens Research for collecting, processing and publishing the dataset. We take [MovieLens Million Dataset (ml-1m)](http://files.grouplens.org/datasets/movielens/ml-1m.zip) as an example. The ml-1m dataset contains 1,000,000 ratings of 4,000 movies by 6,000 users (integer scores ranging from 1 to 5), collected by the GroupLens Research lab.
`paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens` and `wmt14`, etc. There's no need for us to manually download and preprocess `MovieLens` dataset. Paddle provides modules for automatically loading data in the API. The data module is `paddle.dataset.movielens`.
The raw `MovieLens` contains movie ratings, relevant features from both movies and users.
For instance, one movie's feature could be:
```python ```python
from __future__ import print_function
import paddle import paddle
movie_info = paddle.dataset.movielens.movie_info() movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0] print(list(movie_info.values())[0])
```
```python
# Run this block to show dataset's documentation
# help(paddle.dataset.movielens)
``` ```
```text The original data includes feature data of the movie, user's feature data, and the user's rating of the movie.
<MovieInfo id(1), title(Toy Story), categories(['Animation', "Children's", 'Comedy'])>
For example, one of the movie features is:
```python
movie_info = paddle.dataset.movielens.movie_info()
print(list(movie_info.values())[0])
``` ```
One user's feature could be: <MovieInfo id(1), title(Toy Story ), categories(['Animation', "Children's", 'Comedy'])>
This means that the movie ID is 1 and the title is "Toy Story", which belongs to three categories: animation, children's, and comedy.
```python ```python
user_info = paddle.dataset.movielens.user_info() user_info = paddle.dataset.movielens.user_info()
print user_info.values()[0] print(list(user_info.values())[0])
``` ```
```text <UserInfo id(1), gender(F), age(1), job(10)>
<UserInfo id(1), gender(F), age(1), job(10)>
```
In this dataset, the distribution of age is shown as follows:
```text This means that the user ID is 1, female, and younger than 18 years old. The occupation ID is 10.
1: "Under 18"
18: "18-24"
25: "25-34"
35: "35-44"
45: "45-49"
50: "50-55"
56: "56+"
```
User's occupation is selected from the following options:
```text
0: "other" or not specified
1: "academic/educator"
2: "artist"
3: "clerical/admin"
4: "college/grad student"
5: "customer service"
6: "doctor/health care"
7: "executive/managerial"
8: "farmer"
9: "homemaker"
10: "K-12 student"
11: "lawyer"
12: "programmer"
13: "retired"
14: "sales/marketing"
15: "scientist"
16: "self-employed"
17: "technician/engineer"
18: "tradesman/craftsman"
19: "unemployed"
20: "writer"
```
Each record consists of three main components: user features, movie features and movie ratings. Among them, the age uses the following distribution
Likewise, as a simple example, consider the following:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
The occupation is selected from the following options:
* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
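To make the coded fields easier to read, the age and occupation tables above can be turned into lookup structures. This is a small illustrative sketch; `AGE_GROUPS`, `JOBS`, and `describe_user` are hypothetical names, not part of `paddle.dataset.movielens`.

```python
# Age codes and occupation IDs copied from the lists above.
AGE_GROUPS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}
JOBS = [
    "other or not specified", "academic/educator", "artist", "clerical/admin",
    "college/grad student", "customer service", "doctor/health care",
    "executive/managerial", "farmer", "homemaker", "K-12 student", "lawyer",
    "programmer", "retired", "sales/marketing", "scientist", "self-employed",
    "technician/engineer", "tradesman/craftsman", "unemployed", "writer",
]

def describe_user(age_code, job_id):
    """Translate the coded age and job fields into readable labels."""
    return AGE_GROUPS[age_code], JOBS[job_id]

# The example user printed earlier: age(1), job(10).
age_label, job_label = describe_user(1, 10)
```

For the user shown earlier, this yields the "Under 18" age group and the "K-12 student" occupation, matching the description in the text.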
Each training or test sample has the form <user features> + <movie features> + rating.
For example, we get the first training data:
```python ```python
train_set_creator = paddle.dataset.movielens.train() train_set_creator = paddle.dataset.movielens.train()
train_sample = next(train_set_creator()) train_sample = next(train_set_creator())
uid = train_sample[0] uid = train_sample[0]
mov_id = train_sample[len(user_info[uid].value())] mov_id = train_sample[len(user_info[uid].value())]
print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1]) print("User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1]))
``` ```
```text ```python
User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest), categories(['Drama'])> with Score [5.0] User <UserInfo id(1), gender(F), age(1), job(10)> rates Movie <MovieInfo id(1193), title(One Flew Over the Cuckoo's Nest ), categories(['Drama'])> with Score [5.0]
``` ```
The output shows that user 1 gave movie `1193` a rating of 5. That is, user 1 rated movie 1193 with 5 points.
After issuing a command `python train.py`, training will start immediately. The details will be unpacked by the following sessions to see how it works. ## Configuration Instruction
Below we begin to configure the model based on the form of the input data. First import the required library functions and define global variables.
- IS_SPARSE: whether to use sparse update in embedding
- PASS_NUM: number of training epochs
## Model Configuration
Our program starts with importing necessary packages and initializing some global variables:
```python ```python
from __future__ import print_function
import math import math
import sys import sys
import numpy as np import numpy as np
...@@ -235,26 +270,17 @@ import paddle ...@@ -235,26 +270,17 @@ import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import paddle.fluid.layers as layers import paddle.fluid.layers as layers
import paddle.fluid.nets as nets import paddle.fluid.nets as nets
try:
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
IS_SPARSE = True IS_SPARSE = True
USE_GPU = False
BATCH_SIZE = 256 BATCH_SIZE = 256
PASS_NUM = 20
``` ```
Then define the model configuration for our user feature synthesis model
Then we define the model configuration for user combined features:
```python ```python
def get_usr_combined_features(): def get_usr_combined_features():
"""network definition for user part"""
USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1 USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1
...@@ -311,14 +337,16 @@ def get_usr_combined_features(): ...@@ -311,14 +337,16 @@ def get_usr_combined_features():
return usr_combined_features return usr_combined_features
``` ```
As shown in the above code, the input is four integer features for each user: `user_id`, `gender_id`, `age_id` and `job_id`. In order to deal with these features conveniently, we use the language model in NLP to transform these discrete values into embedding values `usr_emb`, `usr_gender_emb`, `usr_age_emb` and `usr_job_emb`. As shown in the code above, each user is described by four features: `user_id`, `gender_id`, `age_id`, and `job_id`. Each feature is a simple integer value. To make these features easier for the subsequent neural network to process, we borrow the approach of language models in NLP and transform these discrete integer values into embeddings, named `usr_emb`, `usr_gender_emb`, `usr_age_emb`, and `usr_job_emb`, respectively.
Then, we feed all the user features into a fully connected layer (fc), which combines them into a single 200-dimensional feature.
Then we can use user features as input, directly connecting to a fully-connected layer, which is used to reduce dimension to 200. Furthermore, we make a similar transformation for each movie feature, the network configuration is:
Furthermore, we do a similar transformation for each movie feature. The model configuration is:
```python ```python
def get_mov_combined_features(): def get_mov_combined_features():
"""network definition for item(movie) part"""
MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1 MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1
...@@ -367,13 +395,15 @@ def get_mov_combined_features(): ...@@ -367,13 +395,15 @@ def get_mov_combined_features():
return mov_combined_features return mov_combined_features
``` ```
Movie title, which is a sequence of words represented by an integer word index sequence, will be fed into a `sequence_conv_pool` layer, which will apply convolution and pooling on time dimension. Because pooling is done on time dimension, the output will be a fixed-length vector regardless the length of the input sequence.
The title of a movie is a sequence of integers, each of which is the index of a word. This sequence is fed to the `sequence_conv_pool` layer, which applies convolution and pooling over the time dimension. As a result, the output has a fixed length even though the length of the input sequence varies.
Finally, we can define a `inference_program` that uses cosine similarity to calculate the similarity between user characteristics and movie features. Finally, we define an `inference_program` to calculate the similarity between user features and movie features using cosine similarity.
```python ```python
def inference_program(): def inference_program():
"""the combined network"""
usr_combined_features = get_usr_combined_features() usr_combined_features = get_usr_combined_features()
mov_combined_features = get_mov_combined_features() mov_combined_features = get_mov_combined_features()
...@@ -383,11 +413,11 @@ def inference_program(): ...@@ -383,11 +413,11 @@ def inference_program():
return scale_infer return scale_infer
``` ```
Then we define a `train_program` that uses the result from `inference_program` to compute the cost with label data. Furthermore, we define a `train_program` that takes the result computed by `inference_program` and calculates the error against the label data. We also define an `optimizer_func` to specify the optimizer.
Also define `optimizer_func` to specify the optimizer.
```python ```python
def train_program(): def train_program():
"""define the cost function"""
scale_infer = inference_program() scale_infer = inference_program()
...@@ -402,21 +432,19 @@ def optimizer_func(): ...@@ -402,21 +432,19 @@ def optimizer_func():
return fluid.optimizer.SGD(learning_rate=0.2) return fluid.optimizer.SGD(learning_rate=0.2)
``` ```
## Model Training
### Specify training environment ## Training Model
Specify your training environment, you should specify if the training is on CPU or GPU. ### Defining the training environment
Define your training environment and specify whether the training takes place on CPU or GPU.
```python ```python
use_cuda = False use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
``` ```
### Datafeeder Configuration ### Defining the data provider
The next step is to define data providers for training and testing. `paddle.dataset.movielens.train` yields records that are shuffled and then grouped into batches of size `BATCH_SIZE`; the size of the shuffle buffer is `buf_size`.
Next we define data feeders for test and train. The feeder reads a `buf_size` of data each time and feed them to the training/testing process.
`paddle.dataset.movielens.train` will yield records during each pass, after shuffling, a batch input of `BATCH_SIZE` is generated for training.
```python ```python
train_reader = paddle.batch( train_reader = paddle.batch(
...@@ -428,87 +456,111 @@ test_reader = paddle.batch( ...@@ -428,87 +456,111 @@ test_reader = paddle.batch(
paddle.dataset.movielens.test(), batch_size=BATCH_SIZE) paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
``` ```
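The interplay of the shuffle buffer (`buf_size`) and the batch size can be illustrated with plain Python generators. This is a simplified sketch, not Paddle's implementation: `toy_shuffle`, `toy_batch`, and `toy_reader` are hypothetical stand-ins for the reader decorators used above.

```python
import random

def toy_shuffle(reader, buf_size, seed=None):
    """Shuffle samples within consecutive buffers of at most buf_size items."""
    def shuffled():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the last, possibly smaller, buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item
    return shuffled

def toy_batch(reader, batch_size):
    """Group consecutive samples into lists of batch_size (last one may be short)."""
    def batched():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            yield current
    return batched

def toy_reader():
    """Hypothetical dataset of ten integer samples."""
    for i in range(10):
        yield i

batches = list(toy_batch(toy_shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)())
```

A larger buffer mixes samples from further apart at the cost of more memory, while the batch size only controls how many shuffled samples are handed to each training step.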
### Create Trainer ### Constructing a training process (trainer)
We have constructed a training process here, including training optimization functions.
Create a trainer that takes `train_program` as input and specify optimizer function.
```python ### Provide data
trainer = Trainer(
train_func=train_program, place=place, optimizer_func=optimizer_func)
```
### Feeding Data `feed_order` is used to define the mapping between each generated data and `paddle.layer.data`. For example, the data in the first column generated by `movielens.train` corresponds to the feature `user_id`.
`feed_order` is devoted to specifying the correspondence between each yield record and `paddle.layer.data`. For instance, the first column of data generated by `movielens.train` corresponds to `user_id` feature.
```python ```python
feed_order = [ feed_order = [
'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id', 'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
'movie_title', 'score' 'movie_title', 'score'
] ]
``` ```
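How `feed_order` pairs column positions with input names can be seen by zipping it against one record. The sketch below is illustrative only; apart from the user ID, movie ID, and score taken from the sample printed earlier, every value in `sample` is a hypothetical placeholder.

```python
feed_order = [
    'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
    'movie_title', 'score'
]

# One record laid out as <user features> + <movie features> + rating.
# The gender, age, job, category, and title entries are placeholders.
sample = [1, 0, 0, 10, 1193, [7], [10, 20, 30], [5.0]]

# Position i in the record is fed to the input named feed_order[i].
named_sample = dict(zip(feed_order, sample))
```

Reordering `feed_order` without reordering the data would silently feed, for example, the job ID into the movie ID input, so the two must stay aligned.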
### Event Handler ### Building training programs and testing programs
The training program and the test program are separately constructed, and the training optimizer is imported.
Callback function `event_handler` will be called during training when a pre-defined event happens. ```python
For example, we can check the cost by `trainer.test` when `EndStepEvent` occurs main_program = fluid.default_main_program()
star_program = fluid.default_startup_program()
[avg_cost, scale_infer] = train_program()
test_program = main_program.clone(for_test=True)
sgd_optimizer = optimizer_func()
sgd_optimizer.minimize(avg_cost)
exe = fluid.Executor(place)
def train_test(program, reader):
count = 0
feed_var_list = [
program.global_block().var(var_name) for var_name in feed_order
]
feeder_test = fluid.DataFeeder(
feed_list=feed_var_list, place=place)
test_exe = fluid.Executor(place)
accumulated = 0
for test_data in reader():
avg_cost_np = test_exe.run(program=program,
feed=feeder_test.feed(test_data),
fetch_list=[avg_cost])
accumulated += avg_cost_np[0]
count += 1
return accumulated / count
```
### Build a training main loop and start training
We run the training loop for the number of passes defined above (`PASS_NUM`) and evaluate on the test set after every mini-batch. In this example, training stops and the inference model is saved once `batch_id` reaches 20.
```python ```python
# Specify the directory path to save the parameters # Specify the directory path to save the parameters
params_dirname = "recommender_system.inference.model" params_dirname = "recommender_system.inference.model"
def event_handler(event): from paddle.utils.plot import Ploter
if isinstance(event, EndStepEvent): train_prompt = "Train cost"
test_reader = paddle.batch( test_prompt = "Test cost"
paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
avg_cost_set = trainer.test(
reader=test_reader, feed_order=feed_order)
# get avg cost
avg_cost = np.array(avg_cost_set).mean()
print("avg_cost: %s" % avg_cost) plot_cost = Ploter(train_prompt, test_prompt)
if float(avg_cost) < 4: # Change this number to adjust accuracy def train_loop():
trainer.save_params(params_dirname) feed_list = [
trainer.stop() main_program.global_block().var(var_name) for var_name in feed_order
else: ]
print('BatchID {0}, Test Loss {1:0.2}'.format(event.epoch + 1, feeder = fluid.DataFeeder(feed_list, place)
float(avg_cost))) exe.run(star_program)
if math.isnan(float(avg_cost)):
for pass_id in range(PASS_NUM):
for batch_id, data in enumerate(train_reader()):
# train a mini-batch
outs = exe.run(program=main_program,
feed=feeder.feed(data),
fetch_list=[avg_cost])
out = np.array(outs[0])
# get test avg_cost
test_avg_cost = train_test(test_program, test_reader)
plot_cost.append(train_prompt, batch_id, outs[0])
plot_cost.append(test_prompt, batch_id, test_avg_cost)
plot_cost.plot()
if batch_id == 20:
if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, [
"user_id", "gender_id", "age_id", "job_id",
"movie_id", "category_id", "movie_title"
], [scale_infer], exe)
return
print('EpochID {0}, BatchID {1}, Test Loss {2:0.2}'.format(
pass_id + 1, batch_id + 1, float(test_avg_cost)))
if math.isnan(float(out[0])):
sys.exit("got NaN loss, training failed.") sys.exit("got NaN loss, training failed.")
``` ```
Start training
### Training
Finally, we invoke `trainer.train` to start training with `num_epochs` and other parameters.
```python ```python
trainer.train( train_loop()
num_epochs=1,
event_handler=event_handler,
reader=train_reader,
feed_order=feed_order)
``` ```
## Inference ## Model Application
### Create Inferencer ### Generate test data
Use the `create_lod_tensor(data, lod, place)` API to generate a LoD (level of detail) tensor. `data` is a list of sequences, each of which is a sequence of index numbers. `lod` is the level-of-detail information corresponding to `data`. For example, data = [[10, 2, 3], [2, 3]] contains two sequences of lengths 3 and 2. Correspondingly, lod = [[3, 2]] contains one level of detail information, meaning that `data` has two sequences of lengths 3 and 2.
Initialize Inferencer with `inference_program` and `params_dirname` which is where we save params from training. In this prediction example, we try to predict the score given by user with ID1 for the movie 'Hunchback of Notre Dame'.
```python
inferencer = Inferencer(
inference_program, param_path=params_dirname, place=place)
```
### Generate input data for testing
Use create_lod_tensor(data, lod, place) API to generate LoD Tensor, where `data` is a list of sequences of index numbers, `lod` is the level of detail (lod) info associated with `data`.
For example, data = [[10, 2, 3], [2, 3]] means that it contains two sequences of indices, of length 3 and 2, respectively.
Correspondingly, lod = [[3, 2]] contains one level of detail info, indicating that `data` consists of two sequences of length 3 and 2.
In this infer example, we try to predict rating of movie 'Hunchback of Notre Dame' from the info of user id 1.
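The relationship between `data` and its length-based `lod` can be sketched in plain Python without Paddle. This is only an illustration of the bookkeeping; `split_by_lod` is a hypothetical helper, not a Paddle API.

```python
def split_by_lod(flat, lengths):
    """Cut a flat list back into sequences using one level of lengths."""
    sequences, start = [], 0
    for length in lengths:
        sequences.append(flat[start:start + length])
        start += length
    return sequences

data = [[10, 2, 3], [2, 3]]

# One level of detail: the length of each sequence in data.
lod = [[len(seq) for seq in data]]

# The values stored contiguously, as a flat list of indices.
flat = [index for seq in data for index in seq]

# The lengths are enough to recover the original sequences.
recovered = split_by_lod(flat, lod[0])
```

Storing the lengths next to the flattened values is what lets variable-length inputs such as movie titles be handled without padding.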
```python ```python
infer_movie_id = 783 infer_movie_id = 783
infer_movie_name = paddle.dataset.movielens.movie_info()[infer_movie_id].title infer_movie_name = paddle.dataset.movielens.movie_info()[infer_movie_id].title
...@@ -522,13 +574,27 @@ movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]], ...@@ -522,13 +574,27 @@ movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]],
place) # 'hunchback','of','notre','dame','the' place) # 'hunchback','of','notre','dame','the'
``` ```
### Infer ### Building the prediction process and testing
Similar to the training process, we need to build a prediction process, where `params_dirname` is the directory in which the parameters saved during training are stored.
```python
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
inference_scope = fluid.core.Scope()
```
### Testing
Now we can make predictions. The `feed_order` we provide should be consistent with the training process.
Now we can infer with inputs that we provide in `feed_order` during training.
```python ```python
results = inferencer.infer( with fluid.scope_guard(inference_scope):
{ [inferencer, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)
results = exe.run(inferencer,
feed={
'user_id': user_id, 'user_id': user_id,
'gender_id': gender_id, 'gender_id': gender_id,
'age_id': age_id, 'age_id': age_id,
...@@ -537,30 +603,33 @@ results = inferencer.infer( ...@@ -537,30 +603,33 @@ results = inferencer.infer(
'category_id': category_id, 'category_id': category_id,
'movie_title': movie_title 'movie_title': movie_title
}, },
fetch_list=fetch_targets,
return_numpy=False) return_numpy=False)
predict_rating = np.array(results[0])
predict_rating = np.array(results[0]) print("Predict Rating of user id 1 on movie \"" + infer_movie_name +
print("Predict Rating of user id 1 on movie \"" + infer_movie_name + "\" is " + str(predict_rating[0][0])) "\" is " + str(predict_rating[0][0]))
print("Actual Rating of user id 1 on movie \"" + infer_movie_name + "\" is 4.") print("Actual Rating of user id 1 on movie \"" + infer_movie_name +
"\" is 4.")
``` ```
## Conclusion ## Summary
This tutorial goes over traditional approaches in recommender system and a deep learning based approach. We also show how to train and use the model with PaddlePaddle. Deep learning has been well used in computer vision and NLP, we look forward to its new successes in recommender systems. This chapter introduced traditional personalized recommendation methods and YouTube's deep neural network personalized recommendation system. It then took movie recommendation as an example and used PaddlePaddle to train a personalized recommendation neural network model. Personalized recommendation systems cover almost all aspects of e-commerce systems, social networks, advertising recommendations, search engines, etc. Deep learning technologies have played an important role in image processing, natural language processing, etc., and will also prevail in personalized recommendation systems.
<a name="references"></a>
## References ## References
1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325. 1. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.
2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2. 2. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th international conference on World Wide Web*. ACM, 2001.
3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186. 3. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: combining social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65.
4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001. 4. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA 5. Robin Burke , [Hybrid Web recommendation systems](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.435.7538&rep=rep1&type=pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016). 6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198. 7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on recommendation systems. ACM, 2016: 191-198.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -15,6 +15,7 @@ ...@@ -15,6 +15,7 @@
from __future__ import print_function from __future__ import print_function
import math import math
import sys import sys
import argparse
import numpy as np import numpy as np
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
...@@ -22,9 +23,21 @@ import paddle.fluid.layers as layers ...@@ -22,9 +23,21 @@ import paddle.fluid.layers as layers
import paddle.fluid.nets as nets import paddle.fluid.nets as nets
IS_SPARSE = True IS_SPARSE = True
USE_GPU = False
BATCH_SIZE = 256 BATCH_SIZE = 256
PASS_NUM = 100
def parse_args():
parser = argparse.ArgumentParser("recommender_system")
parser.add_argument(
'--enable_ce',
action='store_true',
help="If set, run the task with continuous evaluation logs.")
parser.add_argument(
'--use_gpu', type=int, default=0, help="Whether to use GPU or not.")
parser.add_argument(
'--num_epochs', type=int, default=1, help="number of epochs.")
args = parser.parse_args()
return args
def get_usr_combined_features(): def get_usr_combined_features():
...@@ -154,8 +167,15 @@ def optimizer_func(): ...@@ -154,8 +167,15 @@ def optimizer_func():
def train(use_cuda, params_dirname): def train(use_cuda, params_dirname):
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
if args.enable_ce:
train_reader = paddle.batch(
paddle.dataset.movielens.train(), batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
else:
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.movielens.train(), buf_size=8192), paddle.reader.shuffle(
paddle.dataset.movielens.train(), buf_size=8192),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
test_reader = paddle.batch( test_reader = paddle.batch(
paddle.dataset.movielens.test(), batch_size=BATCH_SIZE) paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
...@@ -167,6 +187,10 @@ def train(use_cuda, params_dirname): ...@@ -167,6 +187,10 @@ def train(use_cuda, params_dirname):
main_program = fluid.default_main_program() main_program = fluid.default_main_program()
star_program = fluid.default_startup_program() star_program = fluid.default_startup_program()
if args.enable_ce:
main_program.random_seed = 90
star_program.random_seed = 90
scale_infer, avg_cost = inference_program() scale_infer, avg_cost = inference_program()
test_program = main_program.clone(for_test=True) test_program = main_program.clone(for_test=True)
...@@ -212,6 +236,10 @@ def train(use_cuda, params_dirname): ...@@ -212,6 +236,10 @@ def train(use_cuda, params_dirname):
# if test_avg_cost < 4.0: # Change this number to adjust accuracy # if test_avg_cost < 4.0: # Change this number to adjust accuracy
if batch_id == 20: if batch_id == 20:
if args.enable_ce:
print("kpis\ttest_cost\t%f" % float(test_avg_cost))
if params_dirname is not None: if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, [ fluid.io.save_inference_model(params_dirname, [
"user_id", "gender_id", "age_id", "job_id", "user_id", "gender_id", "age_id", "job_id",
...@@ -319,4 +347,7 @@ def main(use_cuda): ...@@ -319,4 +347,7 @@ def main(use_cuda):
if __name__ == '__main__': if __name__ == '__main__':
main(USE_GPU) args = parse_args()
PASS_NUM = args.num_epochs
use_cuda = args.use_gpu
main(use_cuda)
#!/bin/bash
#This file is only used for continuous evaluation.
export FLAGS_cudnn_deterministic=true
export CUDA_VISIBLE_DEVICES=0
python train_conv.py --use_gpu 1 --num_epochs=1 --enable_ce | python _ce.py
python train_dyn_rnn.py --use_gpu 1 --num_epochs=1 --enable_ce | python _ce.py
python train_stacked_lstm.py --use_gpu 1 --num_epochs=1 --enable_ce | python _ce.py
...@@ -21,6 +21,15 @@ ...@@ -21,6 +21,15 @@
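The deep learning model introduced in this chapter overcomes the above shortcomings of the BOW representation. It maps text to a low-dimensional semantic space while taking word order into account, and performs text representation and classification end to end; its performance improves significantly over traditional methods\[[1](#参考文献)\] The deep learning model introduced in this chapter overcomes the above shortcomings of the BOW representation. It maps text to a low-dimensional semantic space while taking word order into account, and performs text representation and classification end to end; its performance improves significantly over traditional methods\[[1](#参考文献)\]
## Notes:
1. Hardware requirements:
This tutorial can run on CPU or GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, note that the default image provided here uses a CUDA 8/cuDNN 5 GPU environment; on GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in this document and the scripts:
Note: to make this document easier to read and use, we split up and adjusted the code of train.py before including it here. The code in this document produces the same results as train.py; you can run [train.py](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/train_stacked_lstm.py) directly to verify.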
## 模型概览 ## 模型概览
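The text representation models used in this chapter are Convolutional Neural Networks and Recurrent Neural Networks and their extensions. These models are described in turn below. The text representation models used in this chapter are Convolutional Neural Networks and Recurrent Neural Networks and their extensions. These models are described in turn below.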
...@@ -48,7 +57,9 @@ ...@@ -48,7 +57,9 @@
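Figure 2 shows a recurrent neural network unrolled over time: at time step $t$, the network reads the $t$-th input $x_t$ (a vector) and the hidden state of the previous time step $h_{t-1}$ (a vector; $h_0$ is usually initialized to the zero vector), and computes the hidden state $h_t$ of the current time step. This step is repeated until all inputs have been read. If the function represented by the recurrent neural network is denoted $f$, it can be written as: Figure 2 shows a recurrent neural network unrolled over time: at time step $t$, the network reads the $t$-th input $x_t$ (a vector) and the hidden state of the previous time step $h_{t-1}$ (a vector; $h_0$ is usually initialized to the zero vector), and computes the hidden state $h_t$ of the current time step. This step is repeated until all inputs have been read. If the function represented by the recurrent neural network is denoted $f$, it can be written as: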
$$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/formula_rnn.png?raw=true" width = "65%" align="center"/><br/>
</p>
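where $W_{xh}$ is the input-to-hidden weight matrix, $W_{hh}$ is the hidden-to-hidden weight matrix, $b_h$ is the hidden-layer bias vector, and $\sigma$ is the $sigmoid$ function. where $W_{xh}$ is the input-to-hidden weight matrix, $W_{hh}$ is the hidden-to-hidden weight matrix, $b_h$ is the hidden-layer bias vector, and $\sigma$ is the $sigmoid$ function.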
...@@ -60,14 +71,17 @@ $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$ ...@@ -60,14 +71,17 @@ $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$
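Compared with a simple recurrent neural network, LSTM adds a memory cell $c$, an input gate $i$, a forget gate $f$, and an output gate $o$. Together, these gates and the memory cell greatly improve the ability of recurrent neural networks to process long sequences. If the function represented by the LSTM-based recurrent neural network is denoted $F$, then: Compared with a simple recurrent neural network, LSTM adds a memory cell $c$, an input gate $i$, a forget gate $f$, and an output gate $o$. Together, these gates and the memory cell greatly improve the ability of recurrent neural networks to process long sequences. If the function represented by the LSTM-based recurrent neural network is denoted $F$, then: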
$$ h_t=F(x_t,h_{t-1})$$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/formula_lstm.png?raw=true" width = "30%" align="center"/><br/>
</p>
$F$ is composed of the following formulas\[[7](#参考文献)\] $F$ is composed of the following formulas\[[7](#参考文献)\]
$$ i_t = \sigma{(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)} $$
$$ f_t = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f) $$ <p align="center">
$$ c_t = f_t\odot c_{t-1}+i_t\odot tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) $$ <img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/formula_lstm_more.png?raw=true" width = "65%" align="center"/><br/>
$$ o_t = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o) $$ </p>
$$ h_t = o_t\odot tanh(c_t) $$
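where $i_t, f_t, c_t, o_t$ denote the vector values of the input gate, forget gate, memory cell, and output gate respectively; the subscripted $W$ and $b$ are model parameters; $tanh$ is the hyperbolic tangent function; and $\odot$ denotes elementwise multiplication. The input gate controls how strongly new input enters the memory cell $c$, the forget gate controls how strongly the memory cell retains its value from the previous time step, and the output gate controls how strongly the memory cell is output. The three gates are computed in a similar way but with entirely different parameters, and each controls the memory cell $c$ in its own way, as shown in Figure 3: where $i_t, f_t, c_t, o_t$ denote the vector values of the input gate, forget gate, memory cell, and output gate respectively; the subscripted $W$ and $b$ are model parameters; $tanh$ is the hyperbolic tangent function; and $\odot$ denotes elementwise multiplication. The input gate controls how strongly new input enters the memory cell $c$, the forget gate controls how strongly the memory cell retains its value from the previous time step, and the output gate controls how strongly the memory cell is output. The three gates are computed in a similar way but with entirely different parameters, and each controls the memory cell $c$ in its own way, as shown in Figure 3: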
<p align="center"> <p align="center">
...@@ -77,9 +91,11 @@ $$ h_t = o_t\odot tanh(c_t) $$ ...@@ -77,9 +91,11 @@ $$ h_t = o_t\odot tanh(c_t) $$
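LSTM strengthens the ability of a simple recurrent neural network to handle long-range dependencies by adding memory and control gates. A similar improvement is the Gated Recurrent Unit (GRU)\[[8](#参考文献)\], whose design is somewhat simpler. **Although these improvements differ, their high-level description is the same as that of a simple recurrent neural network (as shown in Figure 2): the hidden state changes according to the current input and the hidden state of the previous time step, and this process repeats until the input has been fully processed:** LSTM strengthens the ability of a simple recurrent neural network to handle long-range dependencies by adding memory and control gates. A similar improvement is the Gated Recurrent Unit (GRU)\[[8](#参考文献)\], whose design is somewhat simpler. **Although these improvements differ, their high-level description is the same as that of a simple recurrent neural network (as shown in Figure 2): the hidden state changes according to the current input and the hidden state of the previous time step, and this process repeats until the input has been fully processed:**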
$$ h_t=Recrurent(x_t,h_{t-1})$$ <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/formula_recrurent.png?raw=true" width = "50%" align="center"/><br/>
</p>
where $Recrurent$ can denote a simple recurrent neural network, GRU, or LSTM. where $Recurrent$ can denote a simple recurrent neural network, GRU, or LSTM.
### 栈式双向LSTM(Stacked Bidirectional LSTM) ### 栈式双向LSTM(Stacked Bidirectional LSTM)
...@@ -154,7 +170,7 @@ def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim): ...@@ -154,7 +170,7 @@ def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim):
return prediction return prediction
``` ```
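The network input `input_dim` is the size of the dictionary, and `class_dim` is the number of classes. Here, we use the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API to implement the convolution and pooling operations. The network input `input_dim` is the size of the dictionary, and `class_dim` is the number of classes. Here, we use the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/nets.py) API to implement the convolution and pooling operations.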
<a name="栈值双向LSTM"></a> <a name="栈值双向LSTM"></a>
...@@ -433,4 +449,4 @@ with fluid.scope_guard(inference_scope): ...@@ -433,4 +449,4 @@ with fluid.scope_guard(inference_scope):
9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
# Sentiment Analysis # Sentiment Analysis
The source codes of this section is located at [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment). For instructions on getting started with this book,see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code of this tutorial is in [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment). For new users, please refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book) .
## Background ## Background Introduction
In natural language processing, sentiment analysis refers to determining the emotion expressed in a piece of text. The text can be a sentence, a paragraph, or a document. Emotion categorization can be binary -- positive/negative or happy/sad -- or in three classes -- positive/neutral/negative. Sentiment analysis is applicable in a wide range of services, such as e-commerce sites like Amazon and Taobao, hospitality services like Airbnb and hotels.com, and movie rating sites like Rotten Tomatoes and IMDB. It can be used to gauge from the reviews how the customers feel about the product. Table 1 illustrates an example of sentiment analysis in movie reviews: In natural language processing, sentiment analysis generally refers to judging the emotion expressed by a piece of text. The text can be a sentence, a paragraph, or a document. The emotional state can fall into two categories, such as (positive, negative) or (happy, sad), or three categories, such as (positive, negative, neutral). Sentiment analysis has broad applications, such as classifying the comments that users post on shopping websites (Amazon, Tmall, Taobao, etc.), travel websites, and movie review websites as positive or negative, or collecting user reviews of a product and analyzing their sentiment to gauge the overall user experience. Table 1 shows an example of sentiment analysis of movie reviews:
| Movie Review | Category | | Movie Comments | Category |
| -------- | ----- | | -------- | ----- |
| Best movie of Xiaogang Feng in recent years!| Positive | |In Feng Xiaogang’s movies of the past few years, it is the best one | Positive |
| Pretty bad. Feels like a tv-series from a local TV-channel | Negative | |Very bad feat, like a local TV series | Negative |
| Politically correct version of Taken ... and boring as Heck| Negative| |The round-lens lens is full of brilliance, and the tonal background is beautiful, but the plot is procrastinating, the accent is not good, and even though taking an effort but it is hard to focus on the show | Negative |
|delightful, mesmerizing, and completely unexpected. The plot is nicely designed.|Positive| |The plot could be scored 4 stars. In addition, the angle of the round lens plusing the scenery of Wuyuan is very much like the feeling of Chinese landscape painting. It satisfied me. | Positive |
<p align="center">Table 1 Sentiment Analysis in Movie Reviews</p> <p align="center">Table 1 Sentiment analysis of movie comments</p>
In natural language processing, sentiment analysis can be categorized as a **Text Classification problem**, i.e., to categorize a piece of text to a specific class. It involves two related tasks: text representation and classification. Before the emergence of deep learning techniques, the mainstream methods for text representation include BOW (*bag of words*) and topic modeling, while the latter contains SVM (*support vector machine*) and LR (*logistic regression*). In natural language processing, sentiment analysis is a typical **text categorization** problem: the text to be analyzed is assigned to its sentiment category. Text categorization involves two issues: text representation and classification methods. Before the emergence of deep learning, the mainstream text representation methods were BOW (bag of words), topic models, etc., and the mainstream classification methods were SVM (support vector machine), LR (logistic regression), and so on.
The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad“ and “boring, dull, and empty work” describe very similar semantic meaning, yet their BOW representations have very little similarity. Furthermore, “the movie is bad“ and “the movie is not bad“ have high similarity with BOW features, but they express completely opposite semantics. For a piece of text, BOW ignores its word order, grammar, and syntax and treats the text only as a collection of words, so the BOW method does not adequately represent the semantic information of the text. For example, the sentences "This movie is awful" and "a boring, empty, non-connotative work" are semantically very similar in sentiment analysis, but the similarity of their BOW representations is zero. Conversely, the BOW representations of "an empty work without connotations" and "a work that is not empty and has connotations" are very similar, but the two sentences mean the opposite.
This chapter introduces a deep learning model that handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and it has large performance improvement over traditional methods \[[1](#references)\]. The deep learning model we introduce in this chapter overcomes the above shortcomings of the BOW representation. It maps text to a low-dimensional semantic space while taking word order into account, and performs text representation and classification end to end. Its performance is significantly improved compared to traditional methods \[[1](#References)\].
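The BOW limitation described above can be reproduced in a few lines. The following is a minimal sketch, assuming whitespace tokenization and cosine similarity over raw word counts; the helper name `bow_cosine` is hypothetical and is not part of the tutorial's code.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between two whitespace-tokenized bag-of-words vectors."""
    bag_a = Counter(text_a.lower().split())
    bag_b = Counter(text_b.lower().split())
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Similar meaning, no shared words: the BOW similarity is zero.
print(bow_cosine("this movie is extremely bad", "boring dull and empty work"))
# Opposite meaning, almost identical words: the BOW similarity is high.
print(bow_cosine("the movie is bad", "the movie is not bad"))
```

Because word order is discarded, the single word "not" barely changes the vector, which is the failure the sequence models in this chapter are meant to address.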
## Model Overview ## Model Overview
The text representation models used in this chapter are Convolutional Neural Networks and Recurrent Neural Networks and their extensions. These models are described below.
The model we used in this chapter uses **Convolutional Neural Networks** (**CNNs**) and **Recurrent Neural Networks** (**RNNs**) with some specific extensions. ### Introduction of Text Convolutional Neural Networks (CNN)
### Revisit to the Convolutional Neural Networks for Texts (CNN) We introduced the calculation process of the CNN model applied to text data in the [Recommender System](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system) section. Here is a brief review.
The convolutional neural network for texts is introduced in chapter [recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system), here is a brief overview. A CNN first convolves the input word vector sequence to generate a feature map, and then applies max pooling over time to the feature map to obtain the feature of the whole sentence corresponding to that kernel. Finally, concatenating all the features obtained gives a fixed-length vector representation of the text. For text classification, this representation is connected to a softmax to construct a complete model. In practice, we use multiple convolution kernels to process sentences, and kernels with the same window size are stacked into a matrix so that the operation can be completed more efficiently. We can also use convolution kernels with different window sizes to process the sentence. Figure 3 in the [Recommender System](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system) section, reproduced as Figure 1 below, shows four convolution kernels, with different colors representing convolution kernel operations of different sizes.
CNN mainly contains convolution and pooling operation, with versatile combinations in various applications. We firstly apply the convolution operation: we apply the kernel in each window, extracting features. Convolving by the kernel at every window produces a feature map. Next, we apply *max pooling* over time to represent the whole sentence, which is the maximum element across the feature map. In real applications, we will apply multiple CNN kernels on the sentences. It can be implemented efficiently by concatenating the kernels together as a matrix. Also, we can use CNN kernels with different kernel size. Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax to form the model for the sentiment analysis problem. <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/text_cnn.png?raw=true" width = "80%" align="center"/><br/>
Figure 1. CNN text classification model
</p>
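The convolution and max-pooling-over-time steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the tutorial's `sequence_conv_pool` implementation: the function names are hypothetical, each kernel is a list of per-position weight rows, and the nonlinearity and softmax layer are omitted.

```python
def conv_max_over_time(word_vectors, kernel, bias=0.0):
    """Slide one kernel over the word sequence and max-pool the feature map.

    word_vectors: list of word embeddings, one list of floats per word.
    kernel: list of weight rows; len(kernel) is the window size.
    """
    window = len(kernel)
    if len(word_vectors) < window:
        raise ValueError("sentence is shorter than the kernel window")
    feature_map = []
    for start in range(len(word_vectors) - window + 1):
        response = bias
        for offset, kernel_row in enumerate(kernel):
            word = word_vectors[start + offset]
            response += sum(w * x for w, x in zip(kernel_row, word))
        feature_map.append(response)
    # Max pooling over time keeps one value per kernel, whatever the sentence length.
    return max(feature_map)


def sentence_features(word_vectors, kernels):
    """Concatenate the pooled features of all kernels into a fixed-length vector."""
    return [conv_max_over_time(word_vectors, kernel) for kernel in kernels]


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three words, embedding size 2
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],                      # window size 2
    [[1.0, -1.0]],                                 # window size 1
]
print(sentence_features(sentence, kernels))        # one feature per kernel
```

The output length equals the number of kernels, which is why sentences of different lengths map to vectors of the same size.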
For short texts, the aforementioned CNN model can achieve very high accuracy \[[1](#references)\]. If we want to extract more abstract representations, we may apply a deeper CNN model \[[2](#references),[3](#references)\]. For the general short text classification problem, the simple text convolution network described above can achieve a high accuracy rate \[[1](#References)\]. If you want a more abstract and advanced text feature representation, you can construct a deep text convolutional neural network\[[2](#References), [3](#References)\].
### Recurrent Neural Network (RNN) ### Recurrent Neural Network (RNN)
RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete \[[4](#references)\]. Since NLP is a classical problem of sequential data, the RNN, especially its variant LSTM\[[5](#references)\]), achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth. RNN is a powerful tool for accurately modeling sequence data. In fact, the theoretical computational power of the RNN is Turing-complete \[[4](#References)\]. Natural language is typical sequence data (a sequence of words). In recent years, RNNs and their variants (such as long short term memory\[[5](#References)\]) have been applied in many natural language tasks, such as language modeling, syntactic parsing, semantic role labeling (or general sequence labeling), semantic representation, image captioning, dialogue, and machine translation, where they perform well and are often the best methods at present.
<p align="center"> <p align="center">
<img src="image/rnn.png" width = "60%" align="center"/><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/rnn.png?raw=true" width = "60%" align="center"/><br />
Figure 1. An illustration of an unfolded RNN in time. Figure 2. Schematic diagram of the RNN expanded by time
</p> </p>
As shown in Figure 1, we unfold an RNN: at the $t$-th time step, the network takes two inputs: the $t$-th input vector $\vec{x_t}$ and the latent state from the last time-step $\vec{h_{t-1}}$. From those, it computes the latent state of the current step $\vec{h_t}$. This process is repeated until all inputs are consumed. Denoting the RNN as function $f$, it can be formulated as follows: Figure 2 shows the RNN unrolled over time: at time $t$, the network reads the $t$-th input $x_t$ (vector representation) and the state value of the hidden layer at the previous moment $h_{t-1}$ (vector representation; $h_0$ is normally initialized to the $0$ vector), and calculates the state value $h_t$ of the hidden layer at this moment. This step is repeated until all the inputs have been read. If the function represented by the RNN is denoted as $f$, its formula can be expressed as:
$$\vec{h_t}=f(\vec{x_t},\vec{h_{t-1}})=\sigma(W_{xh}\vec{x_t}+W_{hh}\vec{h_{t-1}}+\vec{b_h})$$ $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$
where $W_{xh}$ is the weight matrix to feed into the latent layer; $W_{hh}$ is the latent-to-latent matrix; $b_h$ is the latent bias and $\sigma$ refers to the $sigmoid$ function. Where $W_{xh}$ is the matrix parameter of the input to the hidden layer, $W_{hh}$ is the matrix parameter of the hidden layer to the hidden layer, and $b_h$ is the bias vector parameter of the hidden layer, $\sigma $ is the $sigmoid$ function.
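To make the recurrence concrete, here is a minimal pure-Python sketch of $h_t=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$. The function names (`rnn_step`, `run_rnn`) and the list-of-lists matrix layout are assumptions made for illustration; the tutorial itself builds its networks with PaddlePaddle layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step: h_t = sigmoid(W_xh x_t + W_hh h_prev + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_hidden = matvec(W_hh, h_prev)
    return [sigmoid(a + b + c) for a, b, c in zip(from_input, from_hidden, b_h)]


def run_rnn(inputs, W_xh, W_hh, b_h):
    """Read the inputs in order, starting from h_0 = 0, and return the last state."""
    h = [0.0] * len(b_h)
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h


# Two time steps, input size 2, hidden size 1 (weights chosen arbitrarily).
final_state = run_rnn(
    inputs=[[1.0, 0.0], [0.0, 1.0]],
    W_xh=[[0.5, -0.5]],
    W_hh=[[1.0]],
    b_h=[0.0],
)
print(final_state)
```

The same parameters are reused at every time step; only the hidden state carries information from earlier inputs forward.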
In NLP, words are often represented as one-hot vectors and then mapped to an embedding. The embedded feature goes through an RNN as input $x_t$ at every time step. Moreover, we can add other layers on top of RNN, such as a deep or stacked RNN. Finally, the last latent state may be used as a feature for sentence classification. When dealing with natural language, the word (one-hot representation) is usually mapped to its word vector representation, and then used as the input $x_t$ for each moment of the recurrent neural network. In addition, other layers may be connected to the hidden layer of the RNN depending on actual needs. For example, you can connect the hidden layer output of a RNN to the input of the next RNN to build a deep or stacked RNN, or extract the hidden layer state at the last moment as a sentence representation and then implement a classification model, etc.
### Long-Short Term Memory (LSTM) ### Long Short-Term Memory Network (LSTM)
Training an RNN on long sequential data sometimes leads to the gradient vanishing or exploding\[[6](#references)\]. To solve this problem Hochreiter S, Schmidhuber J. (1997) proposed **Long Short Term Memory** (LSTM)\[[5](#references)\]). For longer sequence data, vanishing or exploding gradients are likely to occur when training an RNN\[[6](#References)\]. To solve this problem, Hochreiter S, Schmidhuber J. (1997) proposed LSTM (long short term memory\[[5](#References)\]).
Compared to the structure of a simple RNN, an LSTM includes memory cell $c$, input gate $i$, forget gate $f$ and output gate $o$. These gates and memory cells dramatically improve the ability for the network to handle long sequences. We can formulate the **LSTM-RNN**, denoted as a function $F$, as follows: Compared to a simple RNN, LSTM adds memory unit $c$, input gate $i$, forget gate $f$, and output gate $o$. The combination of these gates and memory units greatly enhances the ability of the recurrent neural network to process long sequence data. If the function represented by the LSTM-based recurrent neural network is denoted as $F$, the formula is:
$$ h_t=F(x_t,h_{t-1})$$ $$ h_t=F(x_t,h_{t-1})$$
$F$ contains following formulations\[[7](#references)\] $F$ is a combination of the following formulas\[[7](#References)\]:
$$ i_t = \sigma{(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)} $$ $$ i_t = \sigma{(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)} $$
$$ f_t = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f) $$ $$ f_t = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f) $$
$$ c_t = f_t\odot c_{t-1}+i_t\odot tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) $$ $$ c_t = f_t\odot c_{t-1}+i_t\odot tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) $$
$$ o_t = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o) $$ $$ o_t = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o) $$
$$ h_t = o_t\odot tanh(c_t) $$ $$ h_t = o_t\odot tanh(c_t) $$
Here $i_t, f_t, c_t, o_t$ denote the vector values of the input gate, the forget gate, the memory unit, and the output gate respectively; the subscripted $W$ and $b$ are the model parameters; $tanh$ is the hyperbolic tangent function; and $\odot$ represents elementwise multiplication. The input gate controls the intensity of the new input into the memory unit $c$, the forget gate controls how strongly the memory unit maintains its value from the previous time step, and the output gate controls the intensity of the output of the memory unit. The three gates are calculated in a similar way, but with completely different parameters. They control the memory unit $c$ in different ways, as shown in Figure 3:
In the equation, $i_t, f_t, c_t, o_t$ stand for input gate, forget gate, memory cell and output gate, respectively. $W$ and $b$ are model parameters, $\tanh$ is a hyperbolic tangent, and $\odot$ denotes an element-wise product operation. The input gate controls the magnitude of the new input into the memory cell $c$; the forget gate controls the memory propagated from the last time step; the output gate controls the magnitude of the output. The three gates are computed similarly with different parameters, and they influence memory cell $c$ separately, as shown in Figure 2:
<p align="center"> <p align="center">
<img src="image/lstm_en.png" width = "65%" align="center"/><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/lstm.png?raw=true" width = "65%" align="center"/><br />
Figure 2. LSTM at time step $t$ [7]. Figure 3. LSTM for time $t$ [7]
</p> </p>
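The five LSTM equations above translate almost line by line into code. The sketch below is a hedged illustration in pure Python: the function name `lstm_step`, the parameter dictionary `p`, and its keys are hypothetical, and the peephole weights $W_{ci}$, $W_{cf}$, $W_{co}$ are treated as full matrices because that is how the equations are written here (some formulations restrict them to diagonal matrices).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def add(*vectors):
    """Elementwise sum of several vectors of equal length."""
    return [sum(values) for values in zip(*vectors)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the equations for i_t, f_t, c_t, o_t, h_t."""
    i_t = [sigmoid(z) for z in add(matvec(p["W_xi"], x_t), matvec(p["W_hi"], h_prev),
                                   matvec(p["W_ci"], c_prev), p["b_i"])]
    f_t = [sigmoid(z) for z in add(matvec(p["W_xf"], x_t), matvec(p["W_hf"], h_prev),
                                   matvec(p["W_cf"], c_prev), p["b_f"])]
    candidate = [math.tanh(z) for z in add(matvec(p["W_xc"], x_t),
                                           matvec(p["W_hc"], h_prev), p["b_c"])]
    # c_t = f_t * c_prev + i_t * tanh(...), all elementwise.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, candidate)]
    # The output gate looks at the new cell state c_t, not c_prev.
    o_t = [sigmoid(z) for z in add(matvec(p["W_xo"], x_t), matvec(p["W_ho"], h_prev),
                                   matvec(p["W_co"], c_t), p["b_o"])]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Input size 1, hidden size 1, all parameters zero: every gate equals 0.5.
weights = ["W_xi", "W_hi", "W_ci", "W_xf", "W_hf", "W_cf",
           "W_xc", "W_hc", "W_xo", "W_ho", "W_co"]
params = {name: [[0.0]] for name in weights}
params.update({name: [0.0] for name in ["b_i", "b_f", "b_c", "b_o"]})
h_t, c_t = lstm_step([1.0], [0.0], [2.0], params)
print(h_t, c_t)
```

With all-zero parameters the forget gate halves the previous cell value and the candidate contributes nothing, which makes the role of each gate easy to check by hand.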
LSTM enhances the ability of considering long-term reliance, with the help of memory cell and gate. Similar structures are also proposed in Gated Recurrent Unit (GRU)\[[8](#references)\] with a simpler design. **The structures are still similar to RNN, though with some modifications (As shown in Figure 2), i.e., latent status depends on input as well as the latent status of the last time step, and the process goes on recurrently until all inputs are consumed:** LSTM enhances the ability of the RNN to handle long-range dependencies by adding memory and control gates. A similar improvement is the Gated Recurrent Unit (GRU)\[[8](#References)\], which is more concise in design. **These improvements differ, but their high-level description is the same as that of a simple recurrent neural network (as shown in Figure 2). That is, the hidden state changes according to the current input and the hidden state of the previous moment, and this process repeats until the input has been fully processed:**
$$ h_t=Recurrent(x_t,h_{t-1})$$
Among them, $Recurrent$ can represent a RNN, GRU or LSTM.
$$ h_t=Recrurent(x_t,h_{t-1})$$
where $Recrurent$ is a simple RNN, GRU or LSTM.
<a name="Stacked Bidirectional LSTM"></a>
### Stacked Bidirectional LSTM ### Stacked Bidirectional LSTM
For vanilla LSTM, $h_t$ contains input information from previous time-step $1..t-1$ context. We can also apply an RNN with reverse-direction to take successive context $t+1…n$ into consideration. Combining constructing deep RNN (deeper RNN can contain more abstract and higher level semantic), we can design structures with deep stacked bidirectional LSTM to model sequential data\[[9](#references)\]. For a standard unidirectional RNN, $h_t$ contains the input information before time $t$, that is, the preceding context. Similarly, to capture the following context, we can use an RNN that runs in the opposite direction (processing the input in reverse order). Combined with the method of constructing deep recurrent neural networks (deep neural networks often yield more abstract, higher-level feature representations), we can build a more powerful LSTM-based stacked bidirectional recurrent neural network\[[9](#References)\] to model sequence data.
As shown in Figure 3 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification. As shown in Figure 4 (taking three layers as an example), the odd-numbered LSTM layers run forward and the even-numbered LSTM layers run backward. Each higher LSTM layer takes the output of the layer below and of all earlier layers as input. Max pooling over the time dimension of the top LSTM layer's output sequence yields a fixed-length vector representation of the text (this representation fuses the contextual information of the text and abstracts it deeply), and finally we connect this representation to a softmax to build the classification model.
<p align="center"> <p align="center">
<img src="image/stacked_lstm_en.png" width=450><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/stacked_lstm.jpg?raw=true" width=450><br/>
Figure 3. Stacked Bidirectional LSTM for NLP modeling. Figure 4. Stacked bidirectional LSTM for text categorization
</p> </p>
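The pooling step described above can be sketched in plain Python. This is only an illustration, not PaddlePaddle code: the helper name `max_pool_over_time` and the toy values are assumptions, and each sequence is represented as a list of per-time-step hidden vectors.

```python
def max_pool_over_time(hidden_states):
    """Take the element-wise maximum over time steps.

    hidden_states: list of time steps, each a list of hidden_dim floats.
    Returns one list of hidden_dim floats, whatever the sequence length.
    """
    return [max(column) for column in zip(*hidden_states)]


# Two toy "sentences" with different lengths but the same hidden size (3).
short_seq = [[0.1, 0.9, 0.3],
             [0.4, 0.2, 0.8]]
long_seq = [[0.5, 0.1, 0.0],
            [0.2, 0.7, 0.6],
            [0.9, 0.3, 0.4],
            [0.1, 0.8, 0.2]]

short_vec = max_pool_over_time(short_seq)
long_vec = max_pool_over_time(long_seq)
print(short_vec)
print(long_vec)
```

Both results have length 3 even though the inputs have 2 and 4 time steps, which is why the pooled vector can be fed to a fixed-size softmax layer.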
## Dataset Introduction
The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, and `wmt14`, so there is no need to download and preprocess IMDB manually.
After issuing the command `python train.py`, training starts immediately. The following sections explain how it works.
We use the [IMDB sentiment analysis dataset](http://ai.stanford.edu/%7Eamaas/data/sentiment/) as an example. Its training set and test set each contain 25,000 labeled movie reviews. On a 10-point scale, a negative review has a score of at most 4 and a positive review has a score of at least 7.
```text
aclImdb
|- test
|-- neg
|-- pos
|- train
|-- neg
|-- pos
```
Paddle implements automatic downloading and reading of the IMDB dataset in `dataset/imdb.py`, and provides APIs for reading the dictionary, the training data, the test data, and so on.
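The labeling rule stated above can be written down as a small function. This is a sketch only: `label_from_score` is a hypothetical helper, not part of `paddle.dataset.imdb`, and it returns strings rather than the integer ids the dataset reader uses.

```python
def label_from_score(score):
    """Map a 1-10 review score to a sentiment label, following the IMDB rule."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None  # scores 5 and 6 belong to neither class


labels = [label_from_score(s) for s in (1, 4, 5, 7, 10)]
print(labels)
```

Reviews with scores of 5 or 6 fall outside both ranges, so the sketch returns `None` for them.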
## Model Configuration
In this example, we implement two text categorization algorithms, based on the text convolutional neural network described in the [Recommender System](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system) section and on the [Stacked Bidirectional LSTM](#Stacked Bidirectional LSTM). We first import the packages we need and define the global variables:
```python
from __future__ import print_function
import paddle
import paddle.fluid as fluid
import numpy as np
import sys
import math

CLASS_DIM = 2     # Number of categories for sentiment analysis
EMB_DIM = 128     # Dimension of the word vectors
HID_DIM = 512     # Dimension of the hidden layer
STACKED_NUM = 3   # Number of layers in the stacked bidirectional LSTM
BATCH_SIZE = 128  # Batch size
```
### Text Convolutional Neural Network
We build the neural network `convolution_net` as shown in the following sample code.
Note that `fluid.nets.sequence_conv_pool` contains both convolution and pooling layers.
```python
# Text convolutional neural network
def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim):
    emb = fluid.layers.embedding(
        input=data, size=[input_dim, emb_dim], is_sparse=True)
    # ... (lines omitted in the source diff) ...
    prediction = fluid.layers.fc(
        input=[conv_3, conv_4], size=class_dim, act="softmax")
    return prediction
```
The network input `input_dim` is the size of the dictionary, and `class_dim` is the number of categories. Here, we implement the convolution and pooling operations using the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/nets.py) API.
<a name="Stack value bidirectional LSTM"></a>
### Stacked bidirectional LSTM
The code of the stacked bidirectional LSTM network `stacked_lstm_net` is as follows:
```python
# Stacked bidirectional LSTM
def stacked_lstm_net(data, input_dim, class_dim, emb_dim, hid_dim, stacked_num):
    # Look up the word vectors
    emb = fluid.layers.embedding(
        input=data, size=[input_dim, emb_dim], is_sparse=True)

    # First stack: a fully connected layer followed by an LSTM layer
    fc1 = fluid.layers.fc(input=emb, size=hid_dim)
    lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim)

    inputs = [fc1, lstm1]

    # All remaining stacks; even-numbered layers run in reverse
    for i in range(2, stacked_num + 1):
        fc = fluid.layers.fc(input=inputs, size=hid_dim)
        lstm, cell = fluid.layers.dynamic_lstm(
            input=fc, size=hid_dim, is_reverse=(i % 2) == 0)
        inputs = [fc, lstm]

    # Max pooling over the time dimension
    fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max')
    lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max')

    # Fully connected layer with softmax for the prediction
    prediction = fluid.layers.fc(
        input=[fc_last, lstm_last], size=class_dim, act='softmax')
    return prediction
```
The stacked bidirectional LSTM above extracts high-level features and maps them to a vector whose size equals the number of categories. The `softmax` activation of the last fully connected layer then computes the probability of each category.
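To illustrate how softmax turns the final layer's scores into class probabilities, here is a minimal sketch in plain Python. It is independent of PaddlePaddle; the function name and the input scores are made up for the example.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two scores, one per sentiment class (illustrative values).
probs = softmax([2.0, 0.5])
print(probs)
```

The larger score receives the larger probability, and the two probabilities always sum to 1.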
As before, we can train with either `convolution_net` or `stacked_lstm_net`. The following steps use `convolution_net` as the example.
Next we define the prediction program (`inference_program`), which uses `convolution_net` to make predictions on the input defined by `fluid.layers.data`.
```python
def inference_program(word_dict):
    # ... (lines omitted in the source diff) ...
    return net
```
We define `train_program` here, which uses the result returned from `inference_program` to compute the loss. We also define the optimization function `optimizer_func`.
Because this is supervised learning, the labels of the training set are also defined with `fluid.layers.data`. During training, cross-entropy, computed by `fluid.layers.cross_entropy`, is used as the loss function.
During testing, the classifier computes the probability of each category. The first returned value must be the cost.
```python
def train_program(prediction):
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")
    cost = fluid.layers.cross_entropy(input=prediction, label=label)
    avg_cost = fluid.layers.mean(cost)
    accuracy = fluid.layers.accuracy(input=prediction, label=label)
    return [avg_cost, accuracy]  # return the average cost and the accuracy

# Optimization function
def optimizer_func():
    return fluid.optimizer.Adagrad(learning_rate=0.002)
```
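As a rough illustration of what the average cost and accuracy measure, the following plain-Python sketch computes both for a tiny hand-made batch. It does not call PaddlePaddle; the probabilities, labels, and helper name are assumptions chosen for the example.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


# Predicted class probabilities and true labels for three samples.
batch_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
batch_labels = [0, 1, 1]

losses = [cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels)]
avg_cost = sum(losses) / len(losses)

# A prediction is correct when the most probable class equals the label.
correct = sum(1 for p, y in zip(batch_probs, batch_labels)
              if p.index(max(p)) == y)
accuracy = correct / float(len(batch_labels))
print(avg_cost, accuracy)
```

The third sample is misclassified, so it contributes the largest loss and lowers the accuracy to two out of three.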
## Model Training
### Defining the training environment
Define whether your training runs on the CPU or the GPU:
```python
use_cuda = False  # train on CPU
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
```
### Defining the data creator
The next step is to define the data readers for training and testing. `paddle.dataset.imdb.train` yields records on each pass; `paddle.reader.shuffle` shuffles them within a buffer of `buf_size` records, and `paddle.batch` then groups them into batches of `BATCH_SIZE`.
Note: It may take a few minutes to read the IMDB data. Please be patient.
```python
print("Loading IMDB word dict....")
word_dict = paddle.dataset.imdb.word_dict()
# ... (lines omitted in the source diff) ...
train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.imdb.train(word_dict), buf_size=25000),
    batch_size=BATCH_SIZE)
print("Reading testing data....")
test_reader = paddle.batch(
    paddle.dataset.imdb.test(word_dict), batch_size=BATCH_SIZE)
```
`word_dict` is a dictionary that maps each word to an integer id. You can inspect it by running the following code:
```python
word_dict
```
Each entry is a mapping such as ('limited': 1726), which indicates that the id of the word `limited` is 1726.
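The word-to-id lookup can be sketched with a toy dictionary. Everything here is hypothetical: `toy_word_dict`, its ids, and the `'<unk>'` key used for out-of-vocabulary words are assumptions for illustration and are not taken from the real IMDB dictionary.

```python
# A tiny stand-in for word_dict; the real one is loaded from the dataset.
toy_word_dict = {'read': 0, 'the': 1, 'book': 2, '<unk>': 3}


def words_to_ids(words, word_dict, unk_token='<unk>'):
    """Replace each word by its id, falling back to the unknown-word id."""
    unk_id = word_dict[unk_token]
    return [word_dict.get(w, unk_id) for w in words]


ids = words_to_ids('read the book forget the movie'.split(), toy_word_dict)
print(ids)
```

Words missing from the dictionary (`forget`, `movie`) all map to the same unknown-word id, so the model always receives valid indices.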
### Constructing the trainer
The trainer requires a training program and an optimization function.
```python
exe = fluid.Executor(place)
prediction = inference_program(word_dict)
[avg_cost, accuracy] = train_program(prediction)  # training program
sgd_optimizer = optimizer_func()  # optimization function
sgd_optimizer.minimize(avg_cost)
```
The following function computes the model's metrics on the test dataset.
```python
def train_test(program, reader):
    count = 0
    feed_var_list = [
        program.global_block().var(var_name) for var_name in feed_order
    ]
    feeder_test = fluid.DataFeeder(feed_list=feed_var_list, place=place)
    test_exe = fluid.Executor(place)
    accumulated = len([avg_cost, accuracy]) * [0]
    for test_data in reader():
        avg_cost_np = test_exe.run(
            program=program,
            feed=feeder_test.feed(test_data),
            fetch_list=[avg_cost, accuracy])
        accumulated = [
            x[0] + x[1][0] for x in zip(accumulated, avg_cost_np)
        ]
        count += 1
    return [x / count for x in accumulated]
```
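The accumulate-then-divide pattern used by `train_test` can be shown in isolation. This is a simplified sketch with made-up per-batch numbers; `average_metrics` is a hypothetical helper and works on plain floats instead of the arrays returned by the executor.

```python
def average_metrics(batch_metrics):
    """Average each metric over all batches.

    batch_metrics: list of (cost, accuracy) tuples, one per batch.
    """
    accumulated = [0.0] * len(batch_metrics[0])
    count = 0
    for metrics in batch_metrics:
        accumulated = [a + m for a, m in zip(accumulated, metrics)]
        count += 1
    return [a / count for a in accumulated]


avg_loss, avg_acc = average_metrics([(0.5, 0.75), (0.25, 1.0)])
print(avg_loss, avg_acc)
```

Note that averaging per-batch averages weights every batch equally, so the result matches the per-sample average only when all batches have the same size.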
### Providing data and building the main training loop
`feed_order` defines the mapping between each column of the generated data and the `fluid.layers.data` inputs. For example, the first column of data generated by `imdb.train` corresponds to the `words` feature.
```python
# Specify the directory path to save the parameters
params_dirname = "understand_sentiment_conv.inference.model"

feed_order = ['words', 'label']
pass_num = 1  # Number of passes of the training loop

# Main loop of the program
def train_loop(main_program):
    # Initialize the parameters with the executor built above
    exe.run(fluid.default_startup_program())

    feed_var_list_loop = [
        main_program.global_block().var(var_name) for var_name in feed_order
    ]
    feeder = fluid.DataFeeder(
        feed_list=feed_var_list_loop, place=place)

    test_program = fluid.default_main_program().clone(for_test=True)

    # Training loop
    for epoch_id in range(pass_num):
        for step_id, data in enumerate(train_reader()):
            # Run one training step
            metrics = exe.run(main_program,
                              feed=feeder.feed(data),
                              fetch_list=[avg_cost, accuracy])

            # Evaluate on the test set
            avg_cost_test, acc_test = train_test(test_program, test_reader)
            print('Step {0}, Test Loss {1:0.2}, Acc {2:0.2}'.format(
                step_id, avg_cost_test, acc_test))

            print("Step {0}, Epoch {1} Metrics {2}".format(
                step_id, epoch_id, list(map(np.array,
                                            metrics))))

            if step_id == 30:
                if params_dirname is not None:
                    fluid.io.save_inference_model(params_dirname, ["words"],
                                                  prediction, exe)  # Save model
                return
```
### Training process
We print the output of each step in the main training loop so that we can observe how training progresses.
### Start training
Finally, we run the main training loop to start training. Training takes a long time. To get results faster, you can shorten it by relaxing the target loss value or reducing the number of training steps, at the cost of lower accuracy.
```python
train_loop(fluid.default_main_program())
```
## Applying the Model
### Building a predictor
As in the training process, we need to create a prediction process and use the trained model and parameters to make predictions. `params_dirname` is the directory in which the model and its parameters were saved during training.
```python
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
inference_scope = fluid.core.Scope()
```
### Generating test input data
To make predictions, we pick 3 reviews of our own as test data. We map each word in a review to its id in `word_dict`; if a word is not in the dictionary, it is mapped to `unknown`.
Then we use `create_lod_tensor` to create a LoD (level-of-detail) tensor. For a detailed explanation of this function, please refer to the [API documentation](http://paddlepaddle.org/documentation/docs/en/1.2/user_guides/howto/basic_concept/lod_tensor.html).
```python
reviews_str = [
# ... (lines omitted in the source diff) ...
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
```
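To clarify what `base_shape` describes, the sketch below builds it for three made-up id sequences in plain Python. The ids are invented, and the flattened list and offsets are only a conceptual picture of how variable-length sequences can be stored together; this is an assumption for illustration, not a description of PaddlePaddle internals.

```python
# Three reviews already converted to word ids (illustrative values).
lod = [[5, 12, 7], [3, 9], [4, 8, 15, 2]]

# One level of sequence lengths, as in the tutorial code.
base_shape = [[len(c) for c in lod]]

# Conceptually, the ids are stored back to back...
flat_ids = [word_id for seq in lod for word_id in seq]

# ...and the lengths tell us where each review starts and ends.
offsets = [0]
for length in base_shape[0]:
    offsets.append(offsets[-1] + length)

print(base_shape, flat_ids, offsets)
```

With the offsets, the second review can be recovered as `flat_ids[offsets[1]:offsets[2]]`, which is why no padding is needed for sequences of different lengths.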
### Applying the model and making predictions
Now we can predict whether each review is positive or negative.
```python
with fluid.scope_guard(inference_scope):

    [inferencer, feed_target_names,
     fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)

    assert feed_target_names[0] == "words"
    results = exe.run(inferencer,
                      feed={feed_target_names[0]: tensor_words},
                      fetch_list=fetch_targets,
                      return_numpy=False)
    np_data = np.array(results[0])
    for i, r in enumerate(np_data):
        print("Predict probability of ", r[0], " to be positive and ", r[1],
              " to be negative for review \'", reviews_str[i], "\'")
```
## Conclusion
In this chapter, we used sentiment analysis as an example to introduce end-to-end short text classification with deep learning, and completed all the related experiments with PaddlePaddle. We also briefly introduced two text processing models: convolutional neural networks and recurrent neural networks. In the following chapters, we will see how these two basic deep learning models are applied to other tasks.
<a name="References"></a>
## References
1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
3. Yann N. Dauphin, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J] arXiv preprint arXiv:1612.08083, 2016.
4. Siegelmann H T, Sontag E D. [On the computational power of neural nets](http://research.cs.queensu.ca/home/akl/cisc879/papers/SELECTED_PAPERS_FROM_VARIOUS_SOURCES/05070215382317071.pdf)[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 440-449.
5. Hochreiter S, Schmidhuber J. [Long short-term memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)[J]. Neural computation, 1997, 9(8): 1735-1780.
...@@ -374,4 +440,4 @@ In this chapter, we use sentiment analysis as an example to introduce applying d ...@@ -374,4 +440,4 @@ In this chapter, we use sentiment analysis as an example to introduce applying d
9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
### This file is only used for continuous evaluation test!
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi
from kpi import AccKpi
conv_train_cost_kpi = CostKpi(
'conv_train_cost', 0.02, 0, actived=True, desc='train cost')
conv_train_acc_kpi = AccKpi(
'conv_train_acc', 0.02, 0, actived=True, desc='train acc')
conv_test_cost_kpi = CostKpi(
'conv_test_cost', 0.02, 0, actived=True, desc='test cost')
conv_test_acc_kpi = AccKpi(
'conv_test_acc', 0.02, 0, actived=True, desc='test acc')
rnn_train_cost_kpi = CostKpi(
'rnn_train_cost', 0.02, 0, actived=True, desc='train cost')
rnn_train_acc_kpi = AccKpi(
'rnn_train_acc', 0.02, 0, actived=True, desc='train acc')
rnn_test_cost_kpi = CostKpi(
'rnn_test_cost', 0.02, 0, actived=True, desc='test cost')
rnn_test_acc_kpi = AccKpi(
'rnn_test_acc', 0.02, 0, actived=True, desc='test acc')
lstm_train_cost_kpi = CostKpi(
'lstm_train_cost', 0.02, 0, actived=True, desc='train cost')
lstm_train_acc_kpi = AccKpi(
'lstm_train_acc', 0.02, 0, actived=True, desc='train acc')
lstm_test_cost_kpi = CostKpi(
'lstm_test_cost', 0.02, 0, actived=True, desc='test cost')
lstm_test_acc_kpi = AccKpi(
'lstm_test_acc', 0.02, 0, actived=True, desc='test acc')
tracking_kpis = [
conv_train_cost_kpi, conv_train_acc_kpi, conv_test_cost_kpi,
conv_test_acc_kpi, rnn_train_cost_kpi, rnn_train_acc_kpi, rnn_test_cost_kpi,
rnn_test_acc_kpi, lstm_train_cost_kpi, lstm_train_acc_kpi,
lstm_test_cost_kpi, lstm_test_acc_kpi
]
def parse_log(log):
for line in log.split('\n'):
fs = line.strip().split('\t')
print(fs)
if len(fs) == 3 and fs[0] == 'kpis':
kpi_name = fs[1]
kpi_value = float(fs[2])
yield kpi_name, kpi_value
def log_to_ce(log):
kpi_tracker = {}
for kpi in tracking_kpis:
kpi_tracker[kpi.name] = kpi
for (kpi_name, kpi_value) in parse_log(log):
print(kpi_name, kpi_value)
kpi_tracker[kpi_name].add_record(kpi_value)
kpi_tracker[kpi_name].persist()
if __name__ == '__main__':
log = sys.stdin.read()
log_to_ce(log)
...@@ -63,6 +63,15 @@ ...@@ -63,6 +63,15 @@
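The deep learning models introduced in this chapter overcome the above shortcomings of the BOW representation. They map text into a low-dimensional semantic space while taking word order into account, and perform text representation and classification in an end-to-end manner, which significantly improves performance over traditional methods\[[1](#参考文献)\].
## Notes:
1. Hardware requirements:
This tutorial can run on either CPU or GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, please note that the GPU environment of the default image provided here is CUDA 8/cuDNN 5. For GPUs that require CUDA 9, such as the NVIDIA Tesla V100, running with this image may fail.
3. Consistency between the code in the document and in the scripts:
Please note that, to make this tutorial easier to read and use, we split and adjusted the code of train.py and put it into this document. The code in this document produces the same results as train.py; you can run [train.py](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/train_stacked_lstm.py) directly to verify.
## Model Overview
The text representation models used in this chapter are convolutional neural networks, recurrent neural networks, and their extensions. We introduce these models in turn below.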
...@@ -90,7 +99,9 @@ ...@@ -90,7 +99,9 @@
When unrolled over time, a recurrent neural network looks as shown in Figure 2: at time step $t$, the network reads the $t$-th input $x_t$ (a vector) and the hidden state of the previous time step $h_{t-1}$ (a vector; $h_0$ is usually initialized to the zero vector), and computes the hidden state $h_t$ of the current time step. This step is repeated until all inputs have been read. If we denote the function represented by the recurrent neural network as $f$, it can be written as:

$$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$

where $W_{xh}$ is the input-to-hidden weight matrix, $W_{hh}$ is the hidden-to-hidden weight matrix, $b_h$ is the bias vector of the hidden layer, and $\sigma$ is the $sigmoid$ function.
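The recurrence $h_t=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$ can be traced by hand with a small plain-Python sketch. The weights, inputs, and function names below are arbitrary assumptions chosen for illustration; a real model learns the weights and uses a tensor library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """Compute h_t = sigmoid(W_xh x_t + W_hh h_prev + b_h) element by element."""
    h_t = []
    for j in range(len(b_h)):
        z = b_h[j]
        z += sum(W_xh[j][k] * x_t[k] for k in range(len(x_t)))
        z += sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(sigmoid(z))
    return h_t


# Illustrative parameters: 2 input features, 2 hidden units.
W_xh = [[0.5, -0.5], [0.0, 1.0]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]

h = [0.0, 0.0]  # h_0 is the zero vector
for x_t in ([1.0, 1.0], [0.0, 2.0], [1.0, 0.0]):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)
```

The same `rnn_step` is applied at every time step, and only the hidden state `h` carries information from earlier inputs forward.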
...@@ -102,14 +113,17 @@ $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$ ...@@ -102,14 +113,17 @@ $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$
Compared with a simple recurrent neural network, LSTM adds a memory cell $c$, an input gate $i$, a forget gate $f$, and an output gate $o$. Together, these gates and the memory cell greatly improve the ability of recurrent neural networks to process long sequences. If we denote the function represented by the LSTM-based recurrent neural network as $F$, it can be written as:

$$ h_t=F(x_t,h_{t-1})$$

$F$ is composed of the following formulas\[[7](#参考文献)\]:

$$ i_t = \sigma{(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)} $$
$$ f_t = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f) $$
$$ c_t = f_t\odot c_{t-1}+i_t\odot tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) $$
$$ o_t = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o) $$
$$ h_t = o_t\odot tanh(c_t) $$

where $i_t, f_t, c_t, o_t$ are the vector values of the input gate, forget gate, memory cell, and output gate, respectively; the subscripted $W$ and $b$ are model parameters; $tanh$ is the hyperbolic tangent function; and $\odot$ denotes elementwise multiplication. The input gate controls how strongly the new input enters the memory cell $c$, the forget gate controls how strongly the memory cell retains its value from the previous time step, and the output gate controls how strongly the memory cell is output. The three gates are computed in similar ways but with entirely different parameters, and each controls the memory cell $c$ in its own way, as shown in Figure 3:
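The five LSTM formulas can be followed step by step in a scalar sketch, where every quantity is a single number so that $\odot$ becomes ordinary multiplication. The parameter values and the names `lstm_step` and `PARAM_NAMES` are assumptions for illustration only; this is not how PaddlePaddle's `dynamic_lstm` is implemented.

```python
import math

PARAM_NAMES = ["W_xi", "W_hi", "W_ci", "b_i",
               "W_xf", "W_hf", "W_cf", "b_f",
               "W_xc", "W_hc", "b_c",
               "W_xo", "W_ho", "W_co", "b_o"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step with peephole terms, mirroring the formulas above."""
    i_t = sigmoid(p["W_xi"] * x_t + p["W_hi"] * h_prev + p["W_ci"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] * x_t + p["W_hf"] * h_prev + p["W_cf"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * math.tanh(p["W_xc"] * x_t + p["W_hc"] * h_prev + p["b_c"])
    # The output gate looks at the updated cell c_t, not c_prev.
    o_t = sigmoid(p["W_xo"] * x_t + p["W_ho"] * h_prev + p["W_co"] * c_t + p["b_o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


# Illustrative parameters: every weight and bias set to 0.1.
params = {name: 0.1 for name in PARAM_NAMES}

h, c = 0.0, 0.0
for x_t in (1.0, -0.5, 2.0):
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Setting all parameters to zero makes every gate equal to 0.5, which is a convenient way to check the update of `c_t` and `h_t` by hand.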
<p align="center"> <p align="center">
...@@ -119,9 +133,11 @@ $$ h_t = o_t\odot tanh(c_t) $$ ...@@ -119,9 +133,11 @@ $$ h_t = o_t\odot tanh(c_t) $$
By adding memory and control gates to the simple recurrent neural network, LSTM strengthens its ability to handle long-range dependencies. A similar improvement is the Gated Recurrent Unit (GRU)\[[8](#参考文献)\], whose design is somewhat simpler. **Although these improvements differ from one another, their high-level description is the same as that of the simple recurrent neural network (as shown in Figure 2): the hidden state changes according to the current input and the hidden state of the previous time step, and this process repeats until the input has been fully processed:**

$$ h_t=Recurrent(x_t,h_{t-1})$$

where $Recurrent$ can represent a simple recurrent neural network, a GRU, or an LSTM.

### Stacked Bidirectional LSTM
...@@ -196,7 +212,7 @@ def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim): ...@@ -196,7 +212,7 @@ def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim):
return prediction return prediction
``` ```
The network input `input_dim` is the size of the dictionary, and `class_dim` is the number of categories. Here, we implement the convolution and pooling operations using the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/nets.py) API.
<a name="栈值双向LSTM"></a> <a name="栈值双向LSTM"></a>
...@@ -475,7 +491,7 @@ with fluid.scope_guard(inference_scope): ...@@ -475,7 +491,7 @@ with fluid.scope_guard(inference_scope):
9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -40,143 +40,147 @@ ...@@ -40,143 +40,147 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Sentiment Analysis # Sentiment Analysis
The source code of this section is located at [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment). For instructions on getting started with this book, see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code of this tutorial is in [book/understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment). New users should refer to [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).
## Background ## Background Introduction
In natural language processing, sentiment analysis refers to determining the emotion expressed in a piece of text. The text can be a sentence, a paragraph, or a document. Emotion categorization can be binary -- positive/negative or happy/sad -- or in three classes -- positive/neutral/negative. Sentiment analysis is applicable in a wide range of services, such as e-commerce sites like Amazon and Taobao, hospitality services like Airbnb and hotels.com, and movie rating sites like Rotten Tomatoes and IMDB. It can be used to gauge from the reviews how the customers feel about the product. Table 1 illustrates an example of sentiment analysis in movie reviews: In natural language processing, sentiment analysis generally refers to judging the emotion expressed by a piece of text, which can be a sentence, a paragraph, or a document. The emotional state can fall into two categories, such as (positive, negative) or (happy, sad), or three categories, such as (positive, negative, neutral). Sentiment analysis has broad applications, such as dividing the comments that users post on shopping websites (Amazon, Tmall, Taobao, etc.), travel websites, and movie review websites into positive and negative comments, or collecting user reviews of a product and analyzing their sentiment to gauge the overall user experience. Table 1 shows an example of sentiment analysis of movie reviews:
| Movie Review | Category | | Movie Comments | Category |
| -------- | ----- | | -------- | ----- |
| Best movie of Xiaogang Feng in recent years!| Positive | | The best of Feng Xiaogang's movies in recent years | Positive |
| Pretty bad. Feels like a tv-series from a local TV-channel | Negative | | Very bad, like a local TV series | Negative |
| Politically correct version of Taken ... and boring as Heck| Negative| | The round-frame shots are brilliant and the color palette is beautiful, but the plot drags, the accents are poor, and it is hard to stay focused on the film despite the effort | Negative |
|delightful, mesmerizing, and completely unexpected. The plot is nicely designed.|Positive| | The plot deserves 4 stars. In addition, the round-frame shots combined with the scenery of Wuyuan feel very much like a Chinese landscape painting. I was satisfied. | Positive |
<p align="center">Table 1 Sentiment Analysis in Movie Reviews</p> <p align="center">Table 1 Sentiment analysis of movie reviews</p>
In natural language processing, sentiment analysis can be categorized as a **Text Classification problem**, i.e., to categorize a piece of text to a specific class. It involves two related tasks: text representation and classification. Before the emergence of deep learning techniques, the mainstream methods for text representation include BOW (*bag of words*) and topic modeling, while mainstream classification methods include SVM (*support vector machine*) and LR (*logistic regression*). In natural language processing, sentiment analysis is a typical **text classification** problem: the text to be analyzed is assigned to a sentiment category. Text classification involves two issues: text representation and classification methods. Before the emergence of deep learning, the mainstream text representation methods were BOW (bag of words), topic models, etc., and the mainstream classification methods were SVM (support vector machine), LR (logistic regression), and so on.
The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad“ and “boring, dull, and empty work” describe very similar semantic meaning, yet their BOW representations have very little similarity. Furthermore, “the movie is bad“ and “the movie is not bad“ have high similarity with BOW features, but they express completely opposite semantics. For a piece of text, BOW ignores word order, grammar, and syntax and treats the text only as a collection of words, so it does not adequately represent the semantic information of the text. For example, the sentences "This movie is awful" and "a boring, empty work without connotation" are semantically very similar for sentiment analysis, but the similarity of their BOW representations is zero. Conversely, the BOW representations of "an empty work without connotation" and "a work that is not empty and has connotation" are very similar, yet the two sentences mean the opposite.
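The BOW limitation described above can be checked with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and cosine similarity over raw word counts; the helper names `bow_vector` and `cosine_similarity` are illustrative and are not part of the tutorial code.

```python
from collections import Counter
import math


def bow_vector(text):
    """Count lowercase whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Similar sentiment, but no shared words.
similar_meaning = cosine_similarity(
    bow_vector("this movie is awful"),
    bow_vector("a boring empty work"))

# Opposite sentiment, but the sentences differ by a single word.
opposite_meaning = cosine_similarity(
    bow_vector("the movie is bad"),
    bow_vector("the movie is not bad"))

print(similar_meaning)
print(opposite_meaning)
```

The first pair shares no tokens, so its similarity is zero even though both sentences are negative. The second pair shares four of five tokens, so BOW rates the sentences as close even though the negation reverses the meaning.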
This chapter introduces a deep learning model that handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and it has large performance improvement over traditional methods \[[1](#references)\]. The deep learning models introduced in this chapter overcome the above shortcomings of the BOW representation. They map text into a low-dimensional semantic space while taking word order into account, and perform text representation and classification end to end. Their performance is significantly better than that of traditional methods \[[1](#References)\].
## Model Overview ## Model Overview
The text representation models used in this chapter are Convolutional Neural Networks and Recurrent Neural Networks and their extensions. These models are described below.
The model we used in this chapter uses **Convolutional Neural Networks** (**CNNs**) and **Recurrent Neural Networks** (**RNNs**) with some specific extensions. ### Introduction of Text Convolutional Neural Networks (CNN)
### Revisit to the Convolutional Neural Networks for Texts (CNN) We introduced the calculation process of the CNN model applied to text data in the [Recommender System](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system) section. Here is a brief review.
The convolutional neural network for texts is introduced in chapter [recommender_system](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system), here is a brief overview. A text CNN first convolves the input word vector sequence to generate a feature map, and then applies max pooling over time to the feature map to obtain the sentence-level feature corresponding to that kernel. Concatenating the features from all kernels gives a fixed-length vector representation of the text, which is connected to a softmax layer to build a complete text classification model. In practice, we process a sentence with multiple convolution kernels; kernels with the same window size are stacked into a matrix so that the operation can be computed more efficiently. We can also use kernels with different window sizes. Figure 1 below, which is Figure 3 of the [Recommender System](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system) section, shows four convolution kernels, with different colors representing kernels of different sizes.
CNN mainly contains convolution and pooling operation, with versatile combinations in various applications. We firstly apply the convolution operation: we apply the kernel in each window, extracting features. Convolving by the kernel at every window produces a feature map. Next, we apply *max pooling* over time to represent the whole sentence, which is the maximum element across the feature map. In real applications, we will apply multiple CNN kernels on the sentences. It can be implemented efficiently by concatenating the kernels together as a matrix. Also, we can use CNN kernels with different kernel size. Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax to form the model for the sentiment analysis problem. <p align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/image/text_cnn.png?raw=true" width = "80%" align="center"/><br/>
Figure 1. CNN text classification model
</p>
For short texts, the aforementioned CNN model can achieve very high accuracy \[[1](#references)\]. If we want to extract more abstract representations, we may apply a deeper CNN model \[[2](#references),[3](#references)\]. For the general short text classification problem, the simple text convolution network described above can achieve a high accuracy rate \[[1](#References)\]. If you want a more abstract and advanced text feature representation, you can construct a deep text convolutional neural network\[[2](#References), [3](#References)\].
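The convolution and max-pooling-over-time steps reviewed above can be sketched in NumPy. This is only an illustration under stated assumptions: a single sentence, one scalar feature per kernel, a `tanh` activation, and random weights. The function name `conv_max_over_time` is hypothetical, and this is not how `fluid.nets.sequence_conv_pool` is implemented.

```python
import numpy as np


def conv_max_over_time(embeddings, kernel):
    """Slide one kernel over a sentence and max-pool the feature map.

    embeddings: (sentence_length, emb_dim) word vectors.
    kernel:     (window, emb_dim) weights of one convolution kernel.
    """
    window = kernel.shape[0]
    steps = embeddings.shape[0] - window + 1
    # One activation per window position forms the feature map.
    feature_map = np.array([
        np.tanh(np.sum(embeddings[t:t + window] * kernel))
        for t in range(steps)
    ])
    # Max pooling over time keeps the strongest response of this kernel.
    return feature_map.max()


rng = np.random.RandomState(0)
sentence = rng.randn(7, 4)                    # 7 words, 4-dim embeddings
kernels = [rng.randn(3, 4), rng.randn(4, 4)]  # window sizes 3 and 4

# Concatenating one pooled feature per kernel gives a fixed-length vector.
text_vector = np.array([conv_max_over_time(sentence, k) for k in kernels])
print(text_vector.shape)
```

Because each kernel contributes exactly one pooled value, the length of `text_vector` depends only on the number of kernels and not on the sentence length.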
### Recurrent Neural Network (RNN) ### Recurrent Neural Network (RNN)
RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete \[[4](#references)\]. Since NLP is a classical problem of sequential data, the RNN, especially its variant LSTM\[[5](#references)\], achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth. RNN is a powerful tool for accurately modeling sequence data. In fact, RNNs are Turing-complete in theoretical computational power \[[4](#References)\]. Natural language is typical sequence data (a sequence of words). In recent years, RNNs and their variants (such as long short-term memory\[[5](#References)\]) have been applied to many natural language tasks, such as language modeling, syntactic parsing, semantic role labeling (or sequence labeling in general), semantic representation, image captioning, dialogue, and machine translation, where they perform well and in some cases achieve the best results to date.
<p align="center"> <p align="center">
<img src="image/rnn.png" width = "60%" align="center"/><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/rnn.png?raw=true" width = "60%" align="center"/><br />
Figure 1. An illustration of an unfolded RNN in time. Figure 2. Schematic diagram of the RNN expanded by time
</p> </p>
As shown in Figure 1, we unfold an RNN: at the $t$-th time step, the network takes two inputs: the $t$-th input vector $\vec{x_t}$ and the latent state from the last time-step $\vec{h_{t-1}}$. From those, it computes the latent state of the current step $\vec{h_t}$. This process is repeated until all inputs are consumed. Denoting the RNN as function $f$, it can be formulated as follows: Figure 2 shows the RNN unfolded in time: at time step $t$, the network reads the $t$-th input $x_t$ (a vector) and the hidden state of the previous time step $h_{t-1}$ (a vector; $h_0$ is normally initialized to the zero vector), and computes the hidden state $h_t$ of the current time step. This step is repeated until all inputs have been read. Denoting the RNN as the function $f$, it can be written as:
$$\vec{h_t}=f(\vec{x_t},\vec{h_{t-1}})=\sigma(W_{xh}\vec{x_t}+W_{hh}\vec{h_{t-1}}+\vec{b_h})$$ $$h_t=f(x_t,h_{t-1})=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$$
where $W_{xh}$ is the weight matrix to feed into the latent layer; $W_{hh}$ is the latent-to-latent matrix; $b_h$ is the latent bias and $\sigma$ refers to the $sigmoid$ function. where $W_{xh}$ is the input-to-hidden weight matrix, $W_{hh}$ is the hidden-to-hidden weight matrix, $b_h$ is the bias vector of the hidden layer, and $\sigma$ is the $sigmoid$ function.
In NLP, words are often represented as one-hot vectors and then mapped to an embedding. The embedded feature goes through an RNN as input $x_t$ at every time step. Moreover, we can add other layers on top of RNN, such as a deep or stacked RNN. Finally, the last latent state may be used as a feature for sentence classification. When dealing with natural language, the word (one-hot representation) is usually mapped to its word vector representation, and then used as the input $x_t$ for each moment of the recurrent neural network. In addition, other layers may be connected to the hidden layer of the RNN depending on actual needs. For example, you can connect the hidden layer output of a RNN to the input of the next RNN to build a deep or stacked RNN, or extract the hidden layer state at the last moment as a sentence representation and then implement a classification model, etc.
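The recurrence $h_t=\sigma(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$ can be written out directly. The following is a minimal NumPy sketch, assuming random weights and a zero initial state; the function name `rnn_forward` and the toy dimensions are illustrative only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Apply h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h) over a sequence."""
    h = np.zeros(W_hh.shape[0])  # h_0 is initialized to the zero vector
    states = []
    for x_t in inputs:
        h = sigmoid(np.dot(W_xh, x_t) + np.dot(W_hh, h) + b_h)
        states.append(h)
    return np.stack(states)


rng = np.random.RandomState(0)
emb_dim, hid_dim, seq_len = 4, 3, 5
inputs = rng.randn(seq_len, emb_dim)  # one word vector per time step
W_xh = rng.randn(hid_dim, emb_dim)
W_hh = rng.randn(hid_dim, hid_dim)
b_h = np.zeros(hid_dim)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
sentence_repr = states[-1]  # last hidden state as a sentence representation
print(states.shape)
```

Taking `states[-1]` corresponds to using the hidden state at the last time step as the sentence representation; pooling over all rows of `states` would be another option.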
### Long-Short Term Memory (LSTM) ### Long Short-Term Memory Network (LSTM)
Training an RNN on long sequential data sometimes leads to the gradient vanishing or exploding\[[6](#references)\]. To solve this problem Hochreiter S, Schmidhuber J. (1997) proposed **Long Short Term Memory** (LSTM)\[[5](#references)\]. For longer sequences, vanishing or exploding gradients are likely to occur when training an RNN\[[6](#References)\]. To solve this problem, Hochreiter S, Schmidhuber J. (1997) proposed LSTM (long short-term memory\[[5](#References)\]).
Compared to the structure of a simple RNN, an LSTM includes memory cell $c$, input gate $i$, forget gate $f$ and output gate $o$. These gates and memory cells dramatically improve the ability for the network to handle long sequences. We can formulate the **LSTM-RNN**, denoted as a function $F$, as follows: Compared to a simple RNN, LSTM adds a memory unit $c$, an input gate $i$, a forget gate $f$, and an output gate $o$. The combination of these gates and the memory unit greatly enhances the ability of the recurrent neural network to process long sequences. Denoting the LSTM-based recurrent neural network as the function $F$, the formula is:
$$ h_t=F(x_t,h_{t-1})$$ $$ h_t=F(x_t,h_{t-1})$$
$F$ contains the following formulations\[[7](#references)\]: $F$ is a combination of the following formulas\[[7](#References)\]:
$$ i_t = \sigma{(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)} $$ $$ i_t = \sigma{(W_{xi}x_t+W_{hi}h_{t-1}+W_{ci}c_{t-1}+b_i)} $$
$$ f_t = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f) $$ $$ f_t = \sigma(W_{xf}x_t+W_{hf}h_{t-1}+W_{cf}c_{t-1}+b_f) $$
$$ c_t = f_t\odot c_{t-1}+i_t\odot tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) $$ $$ c_t = f_t\odot c_{t-1}+i_t\odot tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) $$
$$ o_t = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o) $$ $$ o_t = \sigma(W_{xo}x_t+W_{ho}h_{t-1}+W_{co}c_{t}+b_o) $$
$$ h_t = o_t\odot tanh(c_t) $$ $$ h_t = o_t\odot tanh(c_t) $$
where $i_t, f_t, c_t, o_t$ denote the vector values of the input gate, the forget gate, the memory unit, and the output gate respectively, the subscripted $W$ and $b$ are model parameters, $tanh$ is the hyperbolic tangent function, and $\odot$ denotes elementwise multiplication. The input gate controls how strongly the new input enters the memory unit $c$, the forget gate controls how strongly the memory unit retains its value from the previous time step, and the output gate controls how strongly the memory unit is output. The three gates are computed in a similar way but with completely different parameters. They control the memory unit $c$ in different ways, as shown in Figure 3:
In the equation, $i_t, f_t, c_t, o_t$ stand for input gate, forget gate, memory cell and output gate, respectively. $W$ and $b$ are model parameters, $\tanh$ is a hyperbolic tangent, and $\odot$ denotes an element-wise product operation. The input gate controls the magnitude of the new input into the memory cell $c$; the forget gate controls the memory propagated from the last time step; the output gate controls the magnitude of the output. The three gates are computed similarly with different parameters, and they influence memory cell $c$ separately, as shown in Figure 2:
<p align="center"> <p align="center">
<img src="image/lstm_en.png" width = "65%" align="center"/><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/lstm.png?raw=true" width = "65%" align="center"/><br />
Figure 2. LSTM at time step $t$ [7]. Figure 3. LSTM for time $t$ [7]
</p> </p>
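The five LSTM formulas above translate almost line for line into NumPy. This is a hedged sketch: the dictionary keys, the `init_params` helper, and the use of full matrices for the peephole weights $W_{ci}$, $W_{cf}$, $W_{co}$ are assumptions made for illustration (many implementations restrict these to diagonal matrices), and it is not how `fluid.layers.dynamic_lstm` is implemented.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(input_dim, hid_dim, rng):
    """Create small random weights for the four LSTM components."""
    p = {}
    for gate in "ifco":
        p["W_x" + gate] = rng.randn(hid_dim, input_dim) * 0.1
        p["W_h" + gate] = rng.randn(hid_dim, hid_dim) * 0.1
        p["b_" + gate] = np.zeros(hid_dim)
    for gate in "ifo":  # peephole connections from the memory unit
        p["W_c" + gate] = rng.randn(hid_dim, hid_dim) * 0.1
    return p


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of the LSTM defined by the equations above."""
    i_t = sigmoid(np.dot(p["W_xi"], x_t) + np.dot(p["W_hi"], h_prev)
                  + np.dot(p["W_ci"], c_prev) + p["b_i"])
    f_t = sigmoid(np.dot(p["W_xf"], x_t) + np.dot(p["W_hf"], h_prev)
                  + np.dot(p["W_cf"], c_prev) + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(
        np.dot(p["W_xc"], x_t) + np.dot(p["W_hc"], h_prev) + p["b_c"])
    # The output gate looks at the updated memory unit c_t.
    o_t = sigmoid(np.dot(p["W_xo"], x_t) + np.dot(p["W_ho"], h_prev)
                  + np.dot(p["W_co"], c_t) + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


rng = np.random.RandomState(0)
params = init_params(input_dim=4, hid_dim=3, rng=rng)
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.randn(6, 4):  # a sequence of 6 word vectors
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Note that the memory unit update is additive (`f_t * c_prev + ...`), which is what lets information persist across many time steps when the forget gate stays close to 1.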
LSTM enhances the ability of considering long-term reliance, with the help of memory cell and gate. Similar structures are also proposed in Gated Recurrent Unit (GRU)\[[8](#references)\] with a simpler design. **The structures are still similar to RNN, though with some modifications (As shown in Figure 2), i.e., latent status depends on input as well as the latent status of the last time step, and the process goes on recurrently until all inputs are consumed:** LSTM enhances the ability to handle long-range dependencies by adding memory units and control gates to the RNN. A similar improvement is the Gated Recurrent Unit (GRU)\[[8](#References)\], which has a more concise design. **These improvements differ in detail, but at the macro level they are described in the same way as a simple recurrent neural network (as shown in Figure 2): the hidden state changes according to the current input and the hidden state of the previous time step, and this process continues until all inputs have been processed:**
$$ h_t=Recurrent(x_t,h_{t-1})$$
Among them, $Recurrent$ can represent a RNN, GRU or LSTM.
$$ h_t=Recurrent(x_t,h_{t-1})$$
where $Recurrent$ is a simple RNN, GRU or LSTM.
<a name="Stacked Bidirectional LSTM"></a>
### Stacked Bidirectional LSTM ### Stacked Bidirectional LSTM
For vanilla LSTM, $h_t$ contains input information from previous time-step $1..t-1$ context. We can also apply an RNN with reverse-direction to take successive context $t+1…n$ into consideration. Combining constructing deep RNN (deeper RNN can contain more abstract and higher level semantic), we can design structures with deep stacked bidirectional LSTM to model sequential data\[[9](#references)\]. For a unidirectional RNN, $h_t$ contains the input information up to time $t$, that is, the preceding context. To also capture the following context, we can use an RNN that runs in the opposite direction (processing the inputs in reverse order). Combined with the method of building deep recurrent neural networks (deeper networks often produce more abstract, higher-level feature representations), we can build a more powerful LSTM-based stacked bidirectional recurrent neural network\[[9](#References)\] to model sequence data.
As shown in Figure 3 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification. As shown in Figure 4 (taking three layers as an example), the odd-numbered LSTM layers are forward and the even-numbered LSTM layers are reversed. Each higher LSTM layer takes the output of the LSTM layer below it and the information of all previous layers as input. Max pooling over the time dimension of the top LSTM layer's output sequence gives a fixed-length vector representation of the text (this representation fully fuses the contextual information and is a deep abstraction of the text), and finally we connect this text representation to a softmax layer to build the classification model.
<p align="center"> <p align="center">
<img src="image/stacked_lstm_en.png" width=450><br/> <img src="https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/image/stacked_lstm.jpg?raw=true" width=450><br/>
Figure 3. Stacked Bidirectional LSTM for NLP modeling. Figure 4. Stacked bidirectional LSTM for text categorization
</p> </p>
## Dataset
We use [IMDB](http://ai.stanford.edu/%7Eamaas/data/sentiment/) dataset for sentiment analysis in this tutorial, which consists of 50,000 movie reviews split evenly into a 25k train set and a 25k test set. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. ## Dataset Introduction
The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, and `wmt14`. There is no need for us to manually download and preprocess IMDB.
After issuing the command `python train.py`, training will start immediately. The following sections unpack the details to show how it works.
We use the [IMDB sentiment analysis dataset](http://ai.stanford.edu/%7Eamaas/data/sentiment/) as an example. Its training set and test set each contain 25,000 labeled movie reviews. On a 10-point scale, negative reviews have a score of at most 4 and positive reviews have a score of at least 7.
```text
aclImdb
|- test
|-- neg
|-- pos
|- train
|-- neg
|-- pos
```
Paddle implements automatic download and reading of the IMDB dataset in `dataset/imdb.py`, and provides APIs for reading the dictionary, training data, test data, and so on.
## Model Configuration ## Model Configuration
Our program starts with importing necessary packages and initializing some global variables: In this example, we implement two text categorization algorithms based on the text convolutional neural network described in the [Recommender System](https://github.com/PaddlePaddle/book/tree/develop/05.recommender_system) section and [Stacked Bidirectional LSTM](#Stacked Bidirectional LSTM). We first import the packages we need to use and define global variables:
```python ```python
from __future__ import print_function from __future__ import print_function
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
from functools import partial
import numpy as np import numpy as np
try: import sys
from paddle.fluid.contrib.trainer import * import math
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
CLASS_DIM = 2
EMB_DIM = 128
HID_DIM = 512
STACKED_NUM = 3
BATCH_SIZE = 128
USE_GPU = False
```
As alluded to in section [Model Overview](#model-overview), here we provide the implementations of both Text CNN and Stacked-bidirectional LSTM models. CLASS_DIM = 2 #Number of categories for sentiment analysis
EMB_DIM = 128  # Dimension of the word vectors
HID_DIM = 512  # Dimension of the hidden layer
STACKED_NUM = 3  # Number of layers in the stacked bidirectional LSTM
BATCH_SIZE = 128  # Batch size
### Text Convolution Neural Network (Text CNN) ```
We create a neural network `convolution_net` as the following snippet code.
Note: `fluid.nets.sequence_conv_pool` includes both convolution and pooling layer operations. ### Text Convolutional Neural Network
We build the neural network `convolution_net`, the sample code is as follows.
Note that `fluid.nets.sequence_conv_pool` contains both convolution and pooling layers.
```python ```python
# Text convolutional neural network
def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim): def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim):
emb = fluid.layers.embedding( emb = fluid.layers.embedding(
input=data, size=[input_dim, emb_dim], is_sparse=True) input=data, size=[input_dim, emb_dim], is_sparse=True)
...@@ -195,48 +199,53 @@ def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim): ...@@ -195,48 +199,53 @@ def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim):
prediction = fluid.layers.fc( prediction = fluid.layers.fc(
input=[conv_3, conv_4], size=class_dim, act="softmax") input=[conv_3, conv_4], size=class_dim, act="softmax")
return prediction return prediction
``` ```
Parameter `input_dim` denotes the dictionary size, and `class_dim` is the number of categories.
The above Text CNN network extracts high-level features and maps them to a vector of the same size as the categories. `paddle.activation.Softmax` function or classifier is then used for calculating the probability of the sentence belonging to each category. The network input `input_dim` indicates the size of the dictionary, and `class_dim` indicates the number of categories. Here, we implement the convolution and pooling operations using the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/nets.py) API.
<a name="Stack value bidirectional LSTM"></a>
### Stacked bidirectional LSTM ### Stacked bidirectional LSTM
We create a neural network `stacked_lstm_net` as below. The code of the stack bidirectional LSTM `stacked_lstm_net` is as follows:
```python ```python
# Stacked bidirectional LSTM
def stacked_lstm_net(data, input_dim, class_dim, emb_dim, hid_dim, stacked_num): def stacked_lstm_net(data, input_dim, class_dim, emb_dim, hid_dim, stacked_num):
# Compute the word vectors
emb = fluid.layers.embedding( emb = fluid.layers.embedding(
input=data, size=[input_dim, emb_dim], is_sparse=True) input=data, size=[input_dim, emb_dim], is_sparse=True)
#First stack
#Fully connected layer
fc1 = fluid.layers.fc(input=emb, size=hid_dim) fc1 = fluid.layers.fc(input=emb, size=hid_dim)
#lstm layer
lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim) lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim)
inputs = [fc1, lstm1] inputs = [fc1, lstm1]
#All remaining stack structures
for i in range(2, stacked_num + 1): for i in range(2, stacked_num + 1):
fc = fluid.layers.fc(input=inputs, size=hid_dim) fc = fluid.layers.fc(input=inputs, size=hid_dim)
lstm, cell = fluid.layers.dynamic_lstm( lstm, cell = fluid.layers.dynamic_lstm(
input=fc, size=hid_dim, is_reverse=(i % 2) == 0) input=fc, size=hid_dim, is_reverse=(i % 2) == 0)
inputs = [fc, lstm] inputs = [fc, lstm]
fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max') #pooling layer
fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max')
lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max') lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max')
prediction = fluid.layers.fc(input=[fc_last, lstm_last], #Fully connected layer, softmax prediction
size=class_dim, prediction = fluid.layers.fc(
act='softmax') input=[fc_last, lstm_last], size=class_dim, act='softmax')
return prediction return prediction
``` ```
The above stacked bidirectional LSTM network extracts high-level features and maps them to a vector of the same size as the categories. `paddle.activation.Softmax` function or classifier is then used for calculating the probability of the sentence belonging to each category. The above stacked bidirectional LSTM extracts high-level features and maps them to a vector whose size equals the number of categories. The `softmax` activation function of the last fully connected layer is used to calculate the probability of each category.
To reiterate, we can either invoke `convolution_net` or `stacked_lstm_net`. In below steps, we will go with `convolution_net`. Again, here we can call any network structure of `convolution_net` or `stacked_lstm_net` for training and learning. Let's take `convolution_net` as an example.
Next we define an `inference_program` that simply uses `convolution_net` to predict output with the input from `fluid.layers.data`. Next we define the prediction program (`inference_program`). We use `convolution_net` to make predictions on the input from `fluid.layers.data`.
```python ```python
def inference_program(word_dict): def inference_program(word_dict):
...@@ -249,47 +258,44 @@ def inference_program(word_dict): ...@@ -249,47 +258,44 @@ def inference_program(word_dict):
return net return net
``` ```
Then we define a `training_program` that uses the result from `inference_program` to compute the cost with label data. We define `training_program` here, which uses the result returned from `inference_program` to calculate the error. We also define the optimization function `optimizer_func`.
Also define `optimizer_func` to specify the optimizer.
Because this is supervised learning, the labels of the training set are also defined with `fluid.layers.data`. During training, cross-entropy computed by `fluid.layers.cross_entropy` is used as the loss function.
In the context of supervised learning, labels of the training set are defined in `paddle.layer.data` too. During training, cross-entropy is used as loss function in `paddle.layer.classification_cost` and as the output of the network; During testing, the outputs are the probabilities calculated in the classifier. During testing, the classifier outputs the probability of each category. The first returned value must be the cost.
First result that returns from the list must be cost.
```python ```python
def train_program(word_dict): def train_program(prediction):
prediction = inference_program(word_dict)
label = fluid.layers.data(name="label", shape=[1], dtype="int64") label = fluid.layers.data(name="label", shape=[1], dtype="int64")
cost = fluid.layers.cross_entropy(input=prediction, label=label) cost = fluid.layers.cross_entropy(input=prediction, label=label)
avg_cost = fluid.layers.mean(cost) avg_cost = fluid.layers.mean(cost)
accuracy = fluid.layers.accuracy(input=prediction, label=label) accuracy = fluid.layers.accuracy(input=prediction, label=label)
return [avg_cost, accuracy] return [avg_cost, accuracy] #return average cost and accuracy acc
#Optimization function
def optimizer_func(): def optimizer_func():
return fluid.optimizer.Adagrad(learning_rate=0.002) return fluid.optimizer.Adagrad(learning_rate=0.002)
``` ```
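To make the returned `avg_cost` and `accuracy` concrete, the same two quantities can be computed by hand for a tiny batch. This is a plain-Python sketch of the arithmetic only; the function names and the toy probabilities are made up for illustration and do not come from the tutorial or from `fluid`.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def mean_cost_and_accuracy(batch_probs, labels):
    """Average cross-entropy and fraction of correct argmax predictions."""
    costs = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    correct = [
        int(max(range(len(p)), key=lambda k: p[k]) == y)
        for p, y in zip(batch_probs, labels)
    ]
    return sum(costs) / len(costs), sum(correct) / float(len(correct))


# Softmax outputs for three reviews over (negative, positive).
batch_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

avg_cost, accuracy = mean_cost_and_accuracy(batch_probs, labels)
print(avg_cost, accuracy)
```

The third review is misclassified, so it lowers the accuracy to 2/3 and also contributes the largest cross-entropy term, $-\ln 0.4 \approx 0.916$, to the average cost.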
## Model Training ## Training Model
### Specify training environment ### Defining the training environment
Define whether your training is on the CPU or GPU:
Specify your training environment, you should specify if the training is on CPU or GPU.
```python ```python
use_cuda = False use_cuda = False #train on cpu
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
``` ```
### Datafeeder Configuration ### Defining the data creator
Next we define data feeders for test and train. The feeder reads a `buf_size` of data each time and feed them to the training/testing process. The next step is to define data readers for training and testing. `paddle.dataset.imdb.train` yields the training records; they are shuffled using a buffer of size `buf_size` and then grouped into batches of size `BATCH_SIZE`.
`paddle.dataset.imdb.train` will yield records during each pass, after shuffling, a batch input of `BATCH_SIZE` is generated for training.
Notice for loading and reading IMDB data, it could take up to 1 minute. Please be patient. Note: It may take a few minutes to read the IMDB data, please be patient.
```python ```python
print("Loading IMDB word dict....") print("Loading IMDB word dict....")
word_dict = paddle.dataset.imdb.word_dict() word_dict = paddle.dataset.imdb.word_dict()
...@@ -298,77 +304,125 @@ train_reader = paddle.batch( ...@@ -298,77 +304,125 @@ train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.imdb.train(word_dict), buf_size=25000), paddle.dataset.imdb.train(word_dict), buf_size=25000),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
print("Reading testing data....")
test_reader = paddle.batch(
paddle.dataset.imdb.test(word_dict), batch_size=BATCH_SIZE)
``` ```
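The reader pipeline above can be illustrated with a small pure-Python sketch. It only mimics the idea of buffered shuffling followed by batching; `shuffle_reader`, `batch_reader`, and `toy_records` are hypothetical names introduced here, not Paddle's implementation.

```python
import random


def shuffle_reader(reader, buf_size, seed=0):
    """Hypothetical stand-in: shuffle records within a buffer of buf_size."""
    def shuffled():
        rng = random.Random(seed)
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the buffer at the end of the pass.
        rng.shuffle(buf)
        for item in buf:
            yield item
    return shuffled


def batch_reader(reader, batch_size):
    """Hypothetical stand-in: group records into lists of batch_size."""
    def batched():
        batch = []
        for record in reader():
            batch.append(record)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # the last batch may be smaller
            yield batch
    return batched


def toy_records():
    """Toy (word ids, label) records standing in for IMDB samples."""
    for i in range(10):
        yield ([i, i + 1], i % 2)


toy_train_reader = batch_reader(
    shuffle_reader(toy_records, buf_size=4), batch_size=3)
batches = list(toy_train_reader())
print([len(b) for b in batches])
```

Note that with a buffer smaller than the dataset, records are only shuffled locally; a `buf_size` that covers the whole training set, as in the code above, shuffles all records together.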
`word_dict` is a dictionary that maps each word to an integer id. You can inspect it by running the following code:
### Create Trainer
Create a trainer that takes `train_program` as input and specify optimizer function.
```python ```python
trainer = Trainer( word_dict
train_func=partial(train_program, word_dict),
place=place,
optimizer_func=optimizer_func)
``` ```
Each entry is a pair such as `'limited': 1726`, which means that the word `limited` is mapped to the id 1726.
### Feeding Data ### Construction Trainer
The trainer requires a training program and a training optimization function.
`feed_order` is devoted to specifying the correspondence between each yield record and `paddle.layer.data`. For instance, the first column of data generated by `imdb.train` corresponds to `words`. ```python
exe = fluid.Executor(place)
prediction = inference_program(word_dict)
[avg_cost, accuracy] = train_program(prediction)#training program
sgd_optimizer = optimizer_func()# training optimization function
sgd_optimizer.minimize(avg_cost)
```
The following function computes the model's average loss and accuracy on the test dataset.
```python ```python
feed_order = ['words', 'label'] def train_test(program, reader):
count = 0
feed_var_list = [
program.global_block().var(var_name) for var_name in feed_order
]
feeder_test = fluid.DataFeeder(feed_list=feed_var_list, place=place)
test_exe = fluid.Executor(place)
accumulated = len([avg_cost, accuracy]) * [0]
for test_data in reader():
avg_cost_np = test_exe.run(
program=program,
feed=feeder_test.feed(test_data),
fetch_list=[avg_cost, accuracy])
accumulated = [
x[0] + x[1][0] for x in zip(accumulated, avg_cost_np)
]
count += 1
return [x / count for x in accumulated]
``` ```
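The accumulation logic in `train_test` can be checked in isolation. The sketch below replaces the executor with hand-made per-batch outputs; `average_metrics` and `fake_outputs` are hypothetical names used only for illustration.

```python
import numpy as np


def average_metrics(batch_outputs):
    """Average [loss, accuracy] over batches, as train_test does."""
    accumulated = [0.0, 0.0]
    count = 0
    for outputs in batch_outputs:
        # Each fetched value is a one-element array; take its scalar.
        accumulated = [
            total + value[0] for total, value in zip(accumulated, outputs)
        ]
        count += 1
    return [total / count for total in accumulated]


# Two fake batches of fetched [avg_cost, accuracy] arrays.
fake_outputs = [
    [np.array([0.75]), np.array([0.5])],
    [np.array([0.25]), np.array([1.0])],
]
avg_loss, avg_acc = average_metrics(fake_outputs)
print(avg_loss, avg_acc)
```

This is an unweighted mean over batches, so a smaller final batch counts as much as a full one.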
### Event Handler ### Providing data and building a main training loop
The callback function `event_handler` is called during training whenever a pre-defined event occurs. `feed_order` defines the mapping between each column of the generated data and the corresponding `fluid.layers.data` layer. For example, the first column of data generated by `imdb.train` corresponds to the `words` feature.
For example, we can check the cost by `trainer.test` when `EndStepEvent` occurs
```python ```python
# Specify the directory path to save the parameters # Specify the directory path to save the parameters
params_dirname = "understand_sentiment_conv.inference.model" params_dirname = "understand_sentiment_conv.inference.model"
def event_handler(event): feed_order = ['words', 'label']
if isinstance(event, EndStepEvent): pass_num = 1 #Number rounds of the training loop
print("Step {0}, Epoch {1} Metrics {2}".format(
event.step, event.epoch, list(map(np.array, event.metrics)))) # Main loop part of the program
def train_loop(main_program):
# Start the trainer built above
exe.run(fluid.default_startup_program())
feed_var_list_loop = [
main_program.global_block().var(var_name) for var_name in feed_order
]
feeder = fluid.DataFeeder(
feed_list=feed_var_list_loop, place=place)
test_program = fluid.default_main_program().clone(for_test=True)
# Training loop
for epoch_id in range(pass_num):
for step_id, data in enumerate(train_reader()):
# Running trainer
metrics = exe.run(main_program,
feed=feeder.feed(data),
fetch_list=[avg_cost, accuracy])
# Testing Results
avg_cost_test, acc_test = train_test(test_program, test_reader)
print('Step {0}, Test Loss {1:0.2}, Acc {2:0.2}'.format(
step_id, avg_cost_test, acc_test))
if event.step == 10: print("Step {0}, Epoch {1} Metrics {2}".format(
trainer.save_params(params_dirname) step_id, epoch_id, list(map(np.array,
trainer.stop() metrics))))
if step_id == 30:
if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, ["words"],
prediction, exe)# Save model
return
``` ```
### Training ### Training process
The main training loop prints the metrics at each step, so we can monitor how training is progressing.
Finally, we invoke `trainer.train` to start training with `num_epochs` and other parameters. ### Start training
Finally, we run the main training loop. Training can take a long time. To get results faster, you can reduce the number of training steps or relax the target loss, at the cost of lower accuracy.
```python ```python
trainer.train( train_loop(fluid.default_main_program())
num_epochs=1,
event_handler=event_handler,
reader=train_reader,
feed_order=feed_order)
``` ```
## Inference ## Application Model
### Create Inferencer ### Building a predictor
Initialize Inferencer with `inference_program` and `params_dirname`, which is where we saved the parameters from training. As in training, we need to create a prediction process and use the trained model and parameters to make predictions. `params_dirname` is the directory in which the parameters were saved during training.
```python ```python
inferencer = Inferencer( place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
infer_func=partial(inference_program, word_dict), exe = fluid.Executor(place)
param_path=params_dirname, inference_scope = fluid.core.Scope()
place=place)
``` ```
### Create Lod Tensor with test data ### Generating test input data
To do inference, we make up 3 reviews as testing data. Feel free to modify any of them. To make predictions, we select 3 comments. We map each word in a comment to its id in `word_dict`; a word that is not in the dictionary is mapped to `unknown`.
We map each word in the reviews to its id in `word_dict`, replacing it with 'unknown' if the word is not in `word_dict`. Then we use `create_lod_tensor` to create a LoD (level-of-detail) tensor. For a detailed explanation of this function, please refer to the [API documentation](http://paddlepaddle.org/documentation/docs/en/1.2/user_guides/howto/basic_concept/lod_tensor.html).
Then we build the LoD data from the id lists and use `create_lod_tensor` to create the LoD tensor.
```python ```python
reviews_str = [ reviews_str = [
...@@ -386,27 +440,39 @@ base_shape = [[len(c) for c in lod]] ...@@ -386,27 +440,39 @@ base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place) tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
``` ```
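The word-to-id mapping and the length information passed to `create_lod_tensor` can be sketched without Paddle. The toy dictionary below and its `'<unk>'` key are assumptions made for illustration; the real `word_dict` comes from `paddle.dataset.imdb.word_dict()`.

```python
# Hypothetical toy dictionary; '<unk>' as the unknown-word key is an assumption.
toy_word_dict = {
    'read': 0, 'the': 1, 'book': 2, 'forget': 3, 'movie': 4, '<unk>': 5
}
UNK = toy_word_dict['<unk>']

toy_reviews_str = ['read the book forget the movie', 'this is a great movie']
toy_reviews = [c.split() for c in toy_reviews_str]

# Map every word to its id, falling back to UNK for out-of-dictionary words.
toy_lod = [[toy_word_dict.get(word, UNK) for word in c] for c in toy_reviews]

# Length-based LoD: one level, holding the length of each sequence.
toy_base_shape = [[len(c) for c in toy_lod]]

print(toy_lod)
print(toy_base_shape)
```

Here every word of the second review except `movie` falls back to the unknown id, which shows why a small dictionary loses information.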
### Infer ## Applying models and making predictions
Now we can infer and predict probability of positive or negative from each review above. Now we can make positive or negative predictions for each comment.
```python ```python
results = inferencer.infer({'words': tensor_words}) with fluid.scope_guard(inference_scope):
for i, r in enumerate(results[0]): [inferencer, feed_target_names,
print("Predict probability of ", r[0], " to be positive and ", r[1], " to be negative for review \'", reviews_str[i], "\'") fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)
assert feed_target_names[0] == "words"
results = exe.run(inferencer,
feed={feed_target_names[0]: tensor_words},
fetch_list=fetch_targets,
return_numpy=False)
np_data = np.array(results[0])
for i, r in enumerate(np_data):
print("Predict probability of ", r[0], " to be positive and ", r[1],
" to be negative for review \'", reviews_str[i], "\'")
``` ```
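To turn the printed probabilities into a label per review, one option is to compare the two columns. The sketch below follows the convention of the print statement above (column 0 is positive, column 1 is negative); `label_reviews` and the toy inputs are hypothetical and stand in for the real `np_data` and `reviews_str`.

```python
import numpy as np


def label_reviews(probabilities, reviews):
    """Pair each review with its more likely label and that probability."""
    results = []
    for review, row in zip(reviews, probabilities):
        label = 'positive' if row[0] >= row[1] else 'negative'
        results.append((review, label, float(max(row[0], row[1]))))
    return results


# Toy probabilities standing in for the model output.
toy_probs = np.array([[0.9, 0.1], [0.25, 0.75]])
toy_reviews = ['a great movie', 'this is very bad']
labeled = label_reviews(toy_probs, toy_reviews)
for review, label, prob in labeled:
    print(review, '->', label, prob)
```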
## Conclusion ## Conclusion
In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters, we will see how these models can be applied in other tasks. In this chapter, we take sentiment analysis as an example to introduce end-to-end short text classification using deep learning, and complete all relevant experiments using PaddlePaddle. At the same time, we briefly introduce two text processing models: convolutional neural networks and recurrent neural networks. In the following chapters, we will see the application of these two basic deep learning models on other tasks.
<a name="References"></a>
## References ## References
1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014. 1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modeling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014. 2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modelling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
3. Yann N. Dauphin, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J] arXiv preprint arXiv:1612.08083, 2016. 3. Yann N. Dauphin, et al. [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v1.pdf)[J] arXiv preprint arXiv:1612.08083, 2016.
4. Siegelmann H T, Sontag E D. [On the computational power of neural nets](http://research.cs.queensu.ca/home/akl/cisc879/papers/SELECTED_PAPERS_FROM_VARIOUS_SOURCES/05070215382317071.pdf)[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 440-449. 4. Siegelmann H T, Sontag E D. [On the computational power of neural nets](http://research.cs.queensu.ca/home/akl/cisc879/papers/SELECTED_PAPERS_FROM_VARIOUS_SOURCES/05070215382317071.pdf)[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 440-449.
5. Hochreiter S, Schmidhuber J. [Long short-term memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)[J]. Neural computation, 1997, 9(8): 1735-1780. 5. Hochreiter S, Schmidhuber J. [Long short-term memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)[J]. Neural computation, 1997, 9(8): 1735-1780.
...@@ -416,7 +482,7 @@ In this chapter, we use sentiment analysis as an example to introduce applying d ...@@ -416,7 +482,7 @@ In this chapter, we use sentiment analysis as an example to introduce applying d
9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 9. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -19,6 +19,7 @@ import paddle.fluid as fluid ...@@ -19,6 +19,7 @@ import paddle.fluid as fluid
import numpy as np import numpy as np
import sys import sys
import math import math
import argparse
CLASS_DIM = 2 CLASS_DIM = 2
EMB_DIM = 128 EMB_DIM = 128
...@@ -26,6 +27,20 @@ HID_DIM = 512 ...@@ -26,6 +27,20 @@ HID_DIM = 512
BATCH_SIZE = 128 BATCH_SIZE = 128
def parse_args():
parser = argparse.ArgumentParser("conv")
parser.add_argument(
'--enable_ce',
action='store_true',
help="If set, run the task with continuous evaluation logs.")
parser.add_argument(
'--use_gpu', type=int, default=0, help="Whether to use GPU or not.")
parser.add_argument(
'--num_epochs', type=int, default=1, help="number of epochs.")
args = parser.parse_args()
return args
def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim): def convolution_net(data, input_dim, class_dim, emb_dim, hid_dim):
emb = fluid.layers.embedding( emb = fluid.layers.embedding(
input=data, size=[input_dim, emb_dim], is_sparse=True) input=data, size=[input_dim, emb_dim], is_sparse=True)
...@@ -74,6 +89,10 @@ def train(use_cuda, params_dirname): ...@@ -74,6 +89,10 @@ def train(use_cuda, params_dirname):
word_dict = paddle.dataset.imdb.word_dict() word_dict = paddle.dataset.imdb.word_dict()
print("Reading training data....") print("Reading training data....")
if args.enable_ce:
train_reader = paddle.batch(
paddle.dataset.imdb.train(word_dict), batch_size=BATCH_SIZE)
else:
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.imdb.train(word_dict), buf_size=25000), paddle.dataset.imdb.train(word_dict), buf_size=25000),
...@@ -84,10 +103,15 @@ def train(use_cuda, params_dirname): ...@@ -84,10 +103,15 @@ def train(use_cuda, params_dirname):
paddle.dataset.imdb.test(word_dict), batch_size=BATCH_SIZE) paddle.dataset.imdb.test(word_dict), batch_size=BATCH_SIZE)
feed_order = ['words', 'label'] feed_order = ['words', 'label']
pass_num = 1 pass_num = args.num_epochs
main_program = fluid.default_main_program() main_program = fluid.default_main_program()
star_program = fluid.default_startup_program() star_program = fluid.default_startup_program()
if args.enable_ce:
main_program.random_seed = 90
star_program.random_seed = 90
prediction = inference_program(word_dict) prediction = inference_program(word_dict)
train_func_outputs = train_program(prediction) train_func_outputs = train_program(prediction)
avg_cost = train_func_outputs[0] avg_cost = train_func_outputs[0]
...@@ -147,6 +171,11 @@ def train(use_cuda, params_dirname): ...@@ -147,6 +171,11 @@ def train(use_cuda, params_dirname):
if params_dirname is not None: if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, ["words"], fluid.io.save_inference_model(params_dirname, ["words"],
prediction, exe) prediction, exe)
if args.enable_ce and epoch_id == pass_num - 1:
print("kpis\tconv_train_cost\t%f" % metrics[0])
print("kpis\tconv_train_acc\t%f" % metrics[1])
print("kpis\tconv_test_cost\t%f" % avg_cost_test)
print("kpis\tconv_test_acc\t%f" % acc_test)
train_loop() train_loop()
...@@ -211,5 +240,6 @@ def main(use_cuda): ...@@ -211,5 +240,6 @@ def main(use_cuda):
if __name__ == '__main__': if __name__ == '__main__':
use_cuda = False # set to True if training with GPU args = parse_args()
use_cuda = args.use_gpu # set to True if training with GPU
main(use_cuda) main(use_cuda)
...@@ -19,6 +19,7 @@ import paddle.fluid as fluid ...@@ -19,6 +19,7 @@ import paddle.fluid as fluid
import numpy as np import numpy as np
import sys import sys
import math import math
import argparse
CLASS_DIM = 2 CLASS_DIM = 2
EMB_DIM = 128 EMB_DIM = 128
...@@ -26,6 +27,20 @@ BATCH_SIZE = 128 ...@@ -26,6 +27,20 @@ BATCH_SIZE = 128
LSTM_SIZE = 128 LSTM_SIZE = 128
def parse_args():
parser = argparse.ArgumentParser("dyn_rnn")
parser.add_argument(
'--enable_ce',
action='store_true',
help="If set, run the task with continuous evaluation logs.")
parser.add_argument(
'--use_gpu', type=int, default=0, help="Whether to use GPU or not.")
parser.add_argument(
'--num_epochs', type=int, default=1, help="number of epochs.")
args = parser.parse_args()
return args
def dynamic_rnn_lstm(data, input_dim, class_dim, emb_dim, lstm_size): def dynamic_rnn_lstm(data, input_dim, class_dim, emb_dim, lstm_size):
emb = fluid.layers.embedding( emb = fluid.layers.embedding(
input=data, size=[input_dim, emb_dim], is_sparse=True) input=data, size=[input_dim, emb_dim], is_sparse=True)
...@@ -89,6 +104,10 @@ def train(use_cuda, params_dirname): ...@@ -89,6 +104,10 @@ def train(use_cuda, params_dirname):
word_dict = paddle.dataset.imdb.word_dict() word_dict = paddle.dataset.imdb.word_dict()
print("Reading training data....") print("Reading training data....")
if args.enable_ce:
train_reader = paddle.batch(
paddle.dataset.imdb.train(word_dict), batch_size=BATCH_SIZE)
else:
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.imdb.train(word_dict), buf_size=25000), paddle.dataset.imdb.train(word_dict), buf_size=25000),
...@@ -99,10 +118,15 @@ def train(use_cuda, params_dirname): ...@@ -99,10 +118,15 @@ def train(use_cuda, params_dirname):
paddle.dataset.imdb.test(word_dict), batch_size=BATCH_SIZE) paddle.dataset.imdb.test(word_dict), batch_size=BATCH_SIZE)
feed_order = ['words', 'label'] feed_order = ['words', 'label']
pass_num = 1 pass_num = args.num_epochs
main_program = fluid.default_main_program() main_program = fluid.default_main_program()
star_program = fluid.default_startup_program() star_program = fluid.default_startup_program()
if args.enable_ce:
main_program.random_seed = 90
star_program.random_seed = 90
prediction = inference_program(word_dict) prediction = inference_program(word_dict)
train_func_outputs = train_program(prediction) train_func_outputs = train_program(prediction)
avg_cost = train_func_outputs[0] avg_cost = train_func_outputs[0]
...@@ -148,9 +172,10 @@ def train(use_cuda, params_dirname): ...@@ -148,9 +172,10 @@ def train(use_cuda, params_dirname):
fetch_list=[var.name for var in train_func_outputs]) fetch_list=[var.name for var in train_func_outputs])
if (step_id + 1) % 10 == 0: if (step_id + 1) % 10 == 0:
#avg_cost_test, acc_test = train_test(test_program, test_reader) avg_cost_test, acc_test = train_test(test_program,
#print('Step {0}, Test Loss {1:0.2}, Acc {2:0.2}'.format( test_reader)
# step_id, avg_cost_test, acc_test)) print('Step {0}, Test Loss {1:0.2}, Acc {2:0.2}'.format(
step_id, avg_cost_test, acc_test))
print("Step {0}, Epoch {1} Metrics {2}".format( print("Step {0}, Epoch {1} Metrics {2}".format(
step_id, epoch_id, list(map(np.array, metrics)))) step_id, epoch_id, list(map(np.array, metrics))))
...@@ -159,6 +184,11 @@ def train(use_cuda, params_dirname): ...@@ -159,6 +184,11 @@ def train(use_cuda, params_dirname):
if params_dirname is not None: if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, ["words"], fluid.io.save_inference_model(params_dirname, ["words"],
prediction, exe) prediction, exe)
if args.enable_ce and epoch_id == pass_num - 1:
print("kpis\trnn_train_cost\t%f" % metrics[0])
print("kpis\trnn_train_acc\t%f" % metrics[1])
print("kpis\trnn_test_cost\t%f" % avg_cost_test)
print("kpis\trnn_test_acc\t%f" % acc_test)
train_loop() train_loop()
...@@ -223,5 +253,6 @@ def main(use_cuda): ...@@ -223,5 +253,6 @@ def main(use_cuda):
if __name__ == '__main__': if __name__ == '__main__':
use_cuda = False # set to True if training with GPU args = parse_args()
use_cuda = args.use_gpu # set to True if training with GPU
main(use_cuda) main(use_cuda)
...@@ -20,13 +20,27 @@ import paddle.fluid as fluid ...@@ -20,13 +20,27 @@ import paddle.fluid as fluid
import numpy as np import numpy as np
import sys import sys
import math import math
import argparse
CLASS_DIM = 2 CLASS_DIM = 2
EMB_DIM = 128 EMB_DIM = 128
HID_DIM = 512 HID_DIM = 512
STACKED_NUM = 3 STACKED_NUM = 3
BATCH_SIZE = 128 BATCH_SIZE = 128
USE_GPU = False
def parse_args():
parser = argparse.ArgumentParser("stacked_lstm")
parser.add_argument(
'--enable_ce',
action='store_true',
help="If set, run the task with continuous evaluation logs.")
parser.add_argument(
'--use_gpu', type=int, default=0, help="Whether to use GPU or not.")
parser.add_argument(
'--num_epochs', type=int, default=1, help="number of epochs.")
args = parser.parse_args()
return args
def stacked_lstm_net(data, input_dim, class_dim, emb_dim, hid_dim, stacked_num): def stacked_lstm_net(data, input_dim, class_dim, emb_dim, hid_dim, stacked_num):
...@@ -84,6 +98,11 @@ def train(use_cuda, params_dirname): ...@@ -84,6 +98,11 @@ def train(use_cuda, params_dirname):
word_dict = paddle.dataset.imdb.word_dict() word_dict = paddle.dataset.imdb.word_dict()
print("Reading training data....") print("Reading training data....")
if args.enable_ce:
train_reader = paddle.batch(
paddle.dataset.imdb.train(word_dict), batch_size=BATCH_SIZE)
else:
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.imdb.train(word_dict), buf_size=25000), paddle.dataset.imdb.train(word_dict), buf_size=25000),
...@@ -94,10 +113,15 @@ def train(use_cuda, params_dirname): ...@@ -94,10 +113,15 @@ def train(use_cuda, params_dirname):
paddle.dataset.imdb.test(word_dict), batch_size=BATCH_SIZE) paddle.dataset.imdb.test(word_dict), batch_size=BATCH_SIZE)
feed_order = ['words', 'label'] feed_order = ['words', 'label']
pass_num = 1 pass_num = args.num_epochs
main_program = fluid.default_main_program() main_program = fluid.default_main_program()
star_program = fluid.default_startup_program() star_program = fluid.default_startup_program()
if args.enable_ce:
main_program.random_seed = 90
star_program.random_seed = 90
prediction = inference_program(word_dict) prediction = inference_program(word_dict)
train_func_outputs = train_program(prediction) train_func_outputs = train_program(prediction)
avg_cost = train_func_outputs[0] avg_cost = train_func_outputs[0]
...@@ -134,7 +158,7 @@ def train(use_cuda, params_dirname): ...@@ -134,7 +158,7 @@ def train(use_cuda, params_dirname):
main_program.global_block().var(var_name) for var_name in feed_order main_program.global_block().var(var_name) for var_name in feed_order
] ]
feeder = fluid.DataFeeder(feed_list=feed_var_list_loop, place=place) feeder = fluid.DataFeeder(feed_list=feed_var_list_loop, place=place)
exe.run(fluid.default_startup_program()) exe.run(star_program)
for epoch_id in range(pass_num): for epoch_id in range(pass_num):
for step_id, data in enumerate(train_reader()): for step_id, data in enumerate(train_reader()):
...@@ -157,6 +181,11 @@ def train(use_cuda, params_dirname): ...@@ -157,6 +181,11 @@ def train(use_cuda, params_dirname):
if params_dirname is not None: if params_dirname is not None:
fluid.io.save_inference_model(params_dirname, ["words"], fluid.io.save_inference_model(params_dirname, ["words"],
prediction, exe) prediction, exe)
if args.enable_ce and epoch_id == pass_num - 1:
print("kpis\tlstm_train_cost\t%f" % metrics[0])
print("kpis\tlstm_train_acc\t%f" % metrics[1])
print("kpis\tlstm_test_cost\t%f" % avg_cost_test)
print("kpis\tlstm_test_acc\t%f" % acc_test)
train_loop() train_loop()
...@@ -221,5 +250,6 @@ def main(use_cuda): ...@@ -221,5 +250,6 @@ def main(use_cuda):
if __name__ == '__main__': if __name__ == '__main__':
use_cuda = False # set to True if training with GPU args = parse_args()
use_cuda = args.use_gpu # set to True if training with GPU
main(use_cuda) main(use_cuda)
#!/bin/bash
#This file is only used for continuous evaluation.
python train.py --enable_ce | python _ce.py
...@@ -2,13 +2,28 @@ ...@@ -2,13 +2,28 @@
The source code for this tutorial is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles). If you are new to the book, please refer to the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
### Notes
1. This tutorial can run in both CPU and GPU environments.
2. CUDA/cuDNN versions supported by the Docker image
If you run the book with Docker, note that the default image provided here uses a GPU environment of CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in this document and the script
Note: to make this document easier to read and use, we split and adjusted the code of [train.py](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles/train.py) and placed it in this document. The code in this document produces the same results as train.py, which you can run directly to verify.
## Background
Natural language analysis techniques fall roughly into three levels: lexical analysis, syntactic analysis, and semantic analysis. Semantic role labeling is one way to perform shallow semantic analysis. In a sentence, the predicate is a statement or explanation about the subject that indicates "what it does", "what it is", or "how it is", and it represents the core of an event. The nouns that accompany the predicate are called arguments. A semantic role is the role that an argument plays in the event denoted by the verb. The main roles are Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal, and Source.
Consider the example below: “遇到” (met) is the predicate (Predicate, usually abbreviated as “Pred”), “小明” (Xiao Ming) is the Agent, “小红” (Xiao Hong) is the Patient, “昨天” (yesterday) is the time at which the event occurred (Time), and “公园” (the park) is the place where it occurred (Location).
$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mbox{Time}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/Eqn1.png?raw=true"><br/>
</p>
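Semantic Role Labeling (SRL) centers on the predicate of a sentence. It does not analyze the semantic information of the sentence in depth; it only analyzes the relationship between each constituent and the predicate, that is, the predicate-argument structure of the sentence, and describes these structural relationships with semantic roles. It is an important intermediate step in many natural language understanding tasks, such as information extraction, discourse analysis, and deep question answering. Research generally assumes that the predicate is given, so the task is to find the arguments of the given predicate and their semantic roles.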
...@@ -85,15 +100,22 @@ CRF是一种概率化结构模型,可以看作是一个概率无向图模型 ...@@ -85,15 +100,22 @@ CRF是一种概率化结构模型,可以看作是一个概率无向图模型
According to the factorization theorem for linear-chain conditional random fields \[[5](#参考文献)\], given an observation sequence $X$, the probability of a particular label sequence $Y$ can be defined as:
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/Eqn2.gif?raw=true"><br/>
</p>
Here $Z(X)$ is the normalization factor. $t_j$ is a feature function defined on edges; it depends on the current and previous positions and is called a transition feature, representing the transition probability between the labels at positions $i$ and $i - 1$ for the input sequence $X$ and its label sequence. $s_k$ is a feature function defined on nodes; it depends on the current position and is called a state feature, representing the probability of the label at position $i$ for the observation sequence $X$. $\lambda_j$ and $\mu_k$ are the weights of the transition and state feature functions, respectively. In fact, $t$ and $s$ can be written in the same mathematical form. Summing the transition and state features over all positions $i$ gives $f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$. Calling $f$ collectively the feature functions, $P(Y|X)$ can be written as:
$$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/Eqn3.gif?raw=true"><br/>
</p>
$\omega$ denotes the weights of the feature functions, which are the parameters that the CRF model learns. During training, given the set of input sequences and their corresponding label sequences $D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$, the following objective is optimized by regularized maximum likelihood estimation:
$$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/Eqn4.png?raw=true"><br/>
</p>
This objective can be optimized jointly with the whole neural network by backpropagation. During decoding, for a given input sequence $X$, a decoding algorithm (typically the Viterbi algorithm or beam search) finds the output sequence $\bar{Y}$ that maximizes the conditional probability $\bar{P}(Y|X)$.
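As an illustration of the decoding step, here is a minimal NumPy sketch of Viterbi decoding for a linear chain. It assumes the weighted state features have already been summed into a per-position score matrix `emissions` and the weighted transition features into a matrix `transitions`; both names, and the function `viterbi_decode`, are hypothetical and not part of the tutorial's code. Because $Z(X)$ does not depend on $Y$, maximizing the unnormalized score is enough.

```python
import numpy as np


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path and its unnormalized score.

    emissions:   (n, k) array, score of label c at position i.
    transitions: (k, k) array, score of moving from label p to label c.
    """
    n, k = emissions.shape
    score = emissions[0].copy()            # best score ending in each label
    backptr = np.zeros((n, k), dtype=int)  # best previous label per position
    for i in range(1, n):
        # candidate[p, c]: best path ending in p at i-1, then moving to c.
        candidate = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow the back-pointers from the best final label.
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(backptr[i][path[-1]]))
    path.reverse()
    return path, float(score.max())


toy_emissions = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]])
toy_transitions = np.array([[0.5, -0.5], [-0.5, 0.5]])
best_path, best_score = viterbi_decode(toy_emissions, toy_transitions)
print(best_path, best_score)
```

The dynamic program costs on the order of $n k^2$ operations instead of enumerating all $k^n$ label sequences.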
...@@ -455,6 +477,7 @@ save_dirname = "label_semantic_roles.inference.model" #调用训练好的模型 ...@@ -455,6 +477,7 @@ save_dirname = "label_semantic_roles.inference.model" #调用训练好的模型
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place) exe = fluid.Executor(place)
``` ```
Set up the input. A LoDTensor represents the input word sequence. The base_shape of each word is [1] because each word is represented by a single id. If the length-based LoD is [[3, 4, 2]], which is a single-level LoD, the resulting LoDTensor contains 3 sequences of lengths 3, 4, and 2.
Note that the LoD is a list of lists.
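The relationship between a length-based LoD and the sequences it describes can be sketched in plain Python. The helpers `lengths_to_offsets` and `split_by_lod` are hypothetical and only illustrate the bookkeeping; they are not Paddle APIs.

```python
def lengths_to_offsets(lengths):
    """Convert sequence lengths [3, 4, 2] into offsets [0, 3, 7, 9]."""
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets


def split_by_lod(flat_ids, length_lod):
    """Split a flat list of word ids using a single-level, length-based LoD."""
    offsets = lengths_to_offsets(length_lod[0])
    return [flat_ids[s:e] for s, e in zip(offsets[:-1], offsets[1:])]


length_lod = [[3, 4, 2]]   # a list of lists: one level, three sequences
flat_ids = list(range(9))  # 3 + 4 + 2 = 9 word ids stored contiguously
sequences = split_by_lod(flat_ids, length_lod)
print(lengths_to_offsets(length_lod[0]))
print(sequences)
```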
...@@ -538,7 +561,7 @@ print("Inference Shape: ", np_data.shape) ...@@ -538,7 +561,7 @@ print("Inference Shape: ", np_data.shape)
2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013. 2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014. 3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014. 4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014.
5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](http://www.jmlr.org/papers/volume15/doppa14a/source/biblio.bib.old)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289. 5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
6. Li Hang. Statistical Learning Methods (统计学习方法)[J]. Tsinghua University Press, Beijing, 2012.
7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330. 7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330.
8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106. 8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106.
...@@ -546,4 +569,4 @@ print("Inference Shape: ", np_data.shape) ...@@ -546,4 +569,4 @@ print("Inference Shape: ", np_data.shape)
10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
# Semantic Role Labeling
The source code of this chapter is located at [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles). # Label Semantic Roles
For instructions on getting started with this book, see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code of this tutorial is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles). New users of the Paddle book should refer to [Book Documentation Instructions](https://github.com/PaddlePaddle/book#running-the-book).
## Background ## Background
Natural language analysis techniques consist of lexical, syntactic, and semantic analysis. **Semantic Role Labeling (SRL)** is an instance of **Shallow Semantic Analysis**. Natural language analysis techniques are roughly divided into three levels: lexical analysis, syntactic analysis, and semantic analysis. Semantic role labeling is one way to implement shallow semantic analysis. In a sentence, the predicate is a statement or explanation of the subject, indicating "what it does", "what it is", or "how it is", and it represents the core of an event. A noun that accompanies a predicate is called an argument. A semantic role is the role an argument plays in the event. The main roles include Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal, and Source.
In a sentence, a **predicate** states a property or a characterization of a *subject*, such as what it does and what it is like. The predicate represents the core of an event, whereas the words accompanying the predicate are **arguments**. A **semantic role** refers to the abstract role an argument of a predicate take on in the event, including *agent*, *patient*, *theme*, *experiencer*, *beneficiary*, *instrument*, *location*, *goal*, and *source*. Please look at the following example. "Encounter" is a predicate (Predicate, usually abbreviated as "Pred"), "Xiaoming" is an agent, "Xiaohong" is a patient, "Yesterday" is the time when the event occurred, the "park" is the location where the event occurred.
In the following example of a Chinese sentence, "to encounter" is the predicate (*pred*); "Ming" is the *agent*; "Hong" is the *patient*; "yesterday" and "evening" are the *time*; finally, "the park" is the *location*. $$\mbox{[Xiaoming]}_{\mbox{Agent}}\mbox{[yesterday]}_{\mbox{Time}}\mbox{[evening]}_\mbox{Time}\mbox{in[Park]}_{\mbox{Location}}\mbox{[encounter]}_{\mbox{Predicate}}\mbox{[Xiaohong]}_{\mbox{Patient}}\mbox{. }$$
$$\mbox{[小明 Ming]}_{\mbox{Agent}}\mbox{[昨天 yesterday]}_{\mbox{Time}}\mbox{[晚上 evening]}_\mbox{Time}\mbox{在[公园 a park]}_{\mbox{Location}}\mbox{[遇到 to encounter]}_{\mbox{Predicate}}\mbox{了[小红 Hong]}_{\mbox{Patient}}\mbox{。}$$ Semantic role labeling (SRL) is centered on the predicate of the sentence. It does not analyze all the semantic information contained in the sentence; it only analyzes the relationship between the predicate and the other components of the sentence, that is, the predicate-argument structure. Describing these structural relationships with semantic roles is an important intermediate step in many natural language understanding tasks (such as information extraction, discourse analysis, and deep question answering). Research generally assumes that the predicate is given, so all that remains is to find the individual arguments of the given predicate and their semantic roles.
Instead of analyzing the semantic information, **Semantic Role Labeling** (**SRL**) identifies the relationship between the predicate and the other constituents surrounding it. The predicate-argument structures are labeled as specific semantic roles. A wide range of natural language understanding tasks, including *information extraction*, *discourse analysis*, and *deepQA*. Research usually assumes a predicate of a sentence to be specified; the only task is to identify its arguments and their semantic roles. Traditional SRL systems are mostly based on syntactic analysis and usually consist of five processes:
Conventional SRL systems mostly build on top of syntactic analysis, usually consisting of five steps:
1. Construct a syntax tree, as shown in Fig. 1
2. Identify the candidate arguments of the given predicate on the tree.
3. Prune the most unlikely candidate arguments.
4. Identify the real arguments, often by a binary classifier.
5. Multi-classify on results from step 4 to label the semantic roles. Steps 2 and 3 usually introduce hand-designed features based on syntactic analysis (step 1).
1. Construct a parse tree. For example, Figure 1 is a syntactic tree for the dependency syntax analysis of the above example.
2. Identify candidate arguments for a given predicate from the syntax tree.
3. Prune the candidate arguments. A sentence may contain many candidates, and pruning removes those least likely to be real arguments.
4. Argument recognition: decide which of the candidates that survive pruning are real arguments, usually treated as a binary classification problem.
5. For the results of step 4, obtain the semantic role label of each argument by multi-class classification. Syntactic analysis is therefore the foundation of the pipeline, and the hand-designed features constructed in later steps are often derived from it as well.
<div align="center"> <div align="center">
<img src="image/dependency_parsing_en.png" width = "80%" align=center /><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/dependency_parsing.png?raw=true" width = "80%" align=center /><br>
Fig 1. Syntax tree Figure 1. Example of dependency syntax analysis tree
</div> </div>
However, complete syntactic analysis must determine all the syntactic information in a sentence and the relationships between its components, which is a very difficult task. The accuracy of current syntactic analysis is limited, and small errors in syntactic analysis propagate into SRL errors. To reduce the complexity of the problem while still obtaining some syntactic structure information, the idea of "shallow syntactic analysis" emerged. Shallow syntactic analysis is also called partial parsing or chunking. Unlike full syntactic analysis, which produces a complete syntax tree, shallow syntactic analysis only needs to identify some relatively simple, independent components of the sentence, such as verb phrases; these identified structures are called chunks. To avoid the difficulty of obtaining a highly accurate syntax tree, some studies \[[1](#References)\] proposed chunk-based SRL methods. Chunk-based SRL treats SRL as a sequence labeling problem. Sequence labeling tasks generally use the BIO representation to define the label set, so we introduce it first. In BIO notation, B marks the beginning of a chunk, I marks the inside of a chunk, and O marks a position outside any chunk. Chunks with different roles receive different B and I labels. For example, for a group of chunks that forms the argument with role A, the first chunk is tagged B-A, the remaining chunks are tagged I-A, and chunks that belong to no argument are tagged O.
However, a complete syntactic analysis requires identifying the relationship among all constituents. Thus, the accuracy of SRL is sensitive to the preciseness of the syntactic analysis, making SRL challenging. To reduce its complexity and obtain some information on the syntactic structures, we often use *shallow syntactic analysis* a.k.a. partial parsing or chunking. Unlike complete syntactic analysis, which requires the construction of the complete parsing tree, *Shallow Syntactic Analysis* only requires identifying some independent constituents with relatively simple structures, such as verb phrases (chunk). To avoid difficulties in constructing a syntax tree with high accuracy, some work\[[1](#reference)\] proposed semantic chunking-based SRL methods, which reduce SRL to a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using **BIO representation**. For syntactic chunks forming role A, its first chunk receives the B-A tag (Begin) and the remaining ones receive the tag I-A (Inside); in the end, the chunks left out will receive the tag O. Let's continue with the sentence above as an example. Figure 2 shows its BIO representation.
The BIO representation of the above example is shown in Fig. 2.
<div align="center"> <div align="center">
<img src="image/bio_example_en.png" width = "90%" align=center /><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/bio_example.png?raw=true" width = "90%" align=center /><br>
Fig 2. BIO representation Figure 2. Example of BIO labeling method
</div> </div>
This example illustrates the simplicity of sequence tagging, since As the example above shows, deriving the semantic role labels of the arguments directly from the sequence labeling result is a relatively simple process. This simplicity shows in three ways: (1) it relies only on shallow syntactic analysis, which lowers the requirements and difficulty of syntactic analysis; (2) no candidate-argument pruning step is needed; (3) arguments are identified and labeled at the same time. Integrating argument identification and labeling simplifies the process, reduces the risk of error accumulation, and often achieves better results.
1. It only relies on shallow syntactic analysis, reduces the precision requirement of syntactic analysis;
2. Pruning the candidate arguments is no longer necessary;
3. Arguments are identified and tagged at the same time. Simplifying the workflow reduces the risk of accumulating errors; oftentimes, methods that unify multiple steps boost performance.
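The BIO scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the tutorial's `train.py`: the function names, the English token list, and the span indices are assumptions made for the example, and the predicate itself is simply left as `O`.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled argument spans to BIO tags.

    spans: list of (start, end, role) with end exclusive; spans must not overlap.
    """
    tags = ["O"] * len(tokens)  # every token starts outside any argument
    for start, end, role in spans:
        tags[start] = "B-" + role  # first chunk of the argument
        for i in range(start + 1, end):
            tags[i] = "I-" + role  # remaining chunks of the same argument
    return tags


def bio_to_spans(tags):
    """Recover (start, end, role) spans from a BIO tag sequence."""
    spans = []
    start, role = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and role == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, role))
            start, role = None, None
        if tag.startswith("B-"):
            start, role = i, tag[2:]
    return spans


# Hypothetical English rendering of the example sentence.
tokens = ["Xiaoming", "yesterday", "evening", "in", "the", "park",
          "encountered", "Xiaohong"]
spans = [(0, 1, "Agent"), (1, 3, "Time"), (4, 6, "Location"), (7, 8, "Patient")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because the tag sequence can be converted back to spans with `bio_to_spans`, predicting one tag per token is enough to identify and label the arguments in a single pass.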
In this tutorial, our SRL system is built as an end-to-end system via a neural network. The system takes only text sequences as input, without using any syntactic parsing results or complex hand-designed features. The public dataset [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) is used for the following task: given a sentence with predicates marked, identify the corresponding arguments and their semantic roles through sequence tagging. Like chunk-based SRL methods, this tutorial also treats SRL as a sequence labeling problem. The difference is that we rely only on the input text sequence, without any additional syntactic analysis results or complex hand-designed features, and build an end-to-end SRL system with deep neural networks. We use the public SRL dataset from the [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) to practice the following task: given a sentence and a predicate in that sentence, use sequence labeling to find the arguments of the predicate in the sentence and mark their semantic roles.
## Model ## Model Overview
**Recurrent Neural Networks** (*RNN*) are important tools for sequence modeling and have been successfully used in some natural language processing tasks. Unlike feed-forward neural networks, RNNs can model the dependencies between elements of sequences. As a variant of RNNs', LSTMs aim modeling long-term dependency in long sequences. We have introduced this in [understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/05.understand_sentiment). In this chapter, we continue to use LSTMs to solve SRL problems. Recurrent Neural Network is an important model for modeling sequences. It is widely used in natural language processing tasks. Unlike the feed-forward neural network, the RNN is able to handle the contextual correlation between inputs. LSTM is an important variant of RNN that is commonly used to learn the long-range dependencies contained in long sequences. We have already introduced in [Sentiment Analysis](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment), in this article we still use LSTM to solve the SRL problem.
### Stacked Recurrent Neural Network ### Stacked Recurrent Neural Network
*Deep Neural Networks* can extract hierarchical representations. The higher layers can form relatively abstract/complex representations, based on primitive features discovered through the lower layers. Unfolding LSTMs through time results in a deep feed-forward neural network. This is because any computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. On the other hand, due to parameter sharing over time, LSTMs are also *shallow*; that is, the computation carried out at each time-step is just a linear transformation. Deep LSTM networks are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical neural networks can be efficient at representing some functions and modeling varying-length dependencies\[[2](#reference)\]. A deep network helps form hierarchical features: the upper layers build more complex, higher-level features on top of the primitive features learned in the lower layers. An LSTM unrolled along the time axis is equivalent to a very "deep" feed-forward network. However, because the LSTM parameters are shared across time steps, the mapping from the state at time $t-1$ to the state at time $t$ always passes through only one nonlinear mapping, which means that a single-layer LSTM models state transitions in a "shallow" way. Stacking multiple LSTM units, so that the output of the lower LSTM at time $t$ becomes the input of the next LSTM unit at time $t$, lets us build a deep network. We call this the first version of the stacked recurrent neural network. Deep networks improve the model's ability to fit complex patterns and to model patterns across different time steps\[[2](#References)\].
However, in a deep LSTM network, any gradient propagated back in depth needs to traverse a large number of nonlinear steps. As a result, while LSTMs of 4 layers can be trained properly, those with 4-8 have much worse performance. Conventional LSTMs prevent back-propagated errors from vanishing or exploding by introducing shortcut connections to skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well.
A single LSTM cell has three operations: However, training a deep LSTM network is not easy. Stacking multiple LSTM units vertically may cause problems when gradients propagate through the depth of the stack. In general, a stack of up to 4 LSTM layers can be trained normally, but performance degrades when the number of layers reaches 4 to 8. New structures must then be considered to ensure that gradients flow smoothly in the vertical direction; this problem must be solved to train deep LSTM networks. We can borrow one of the techniques that LSTM itself uses against vanishing and exploding gradients: there is no nonlinear mapping on the information propagation path of the memory cell, so the gradient neither decays nor explodes as it propagates back along that path. The deep LSTM model can likewise add a path that ensures smooth gradient propagation in the vertical direction.
1. input-to-hidden: map input $x$ to the input of the forget gates, input gates, memory cells and output gates by linear transformation (i.e., matrix mapping); The operation performed by an LSTM unit can be divided into three parts: (1) Input-to-hidden: at each time step, the input $x$ first passes through a matrix mapping and then serves as the input to the forget gate, input gate, memory cell, and output gate. Note that this mapping introduces no nonlinear activation; (2) Hidden-to-hidden: this step is the main body of the LSTM computation, including the forget gate, input gate, memory cell update, and output gate calculations; (3) Hidden-to-output: usually a simple activation applied to the hidden layer vector. On top of the first version of the stacked network, we add a new path: in addition to the output of the previous LSTM layer, the input-to-hidden mapping of the previous LSTM layer is used as a new input, and a linear mapping is added to learn a new transformation.
2. hidden-to-hidden: calculate forget gates, input gates, output gates and update memory cell, this is the main part of LSTMs;
3. hidden-to-output: this part typically involves an activation operation on hidden states.
Based on the stacked LSTMs, we add shortcut connections: take the input-to-hidden from the previous layer as a new input and learn another linear transformation. Figure 3 is a schematic structural diagram of a finally obtained stack recurrent neural network.
Fig.3 illustrates the final stacked recurrent neural networks.
<p align="center"> <p align="center">
<img src="./image/stacked_lstm_en.png" width = "40%" align=center><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/stacked_lstm.png?raw=true" width = "40%" align=center><br>
Fig 3. Stacked Recurrent Neural Networks Figure 3. Schematic diagram of stack-based recurrent neural network based on LSTM
</p> </p>
### Bidirectional Recurrent Neural Network ### Bidirectional Recurrent Neural Network
While LSTMs can summarize the history, they can not see the future. Because most NLP (natural language processing) tasks provide the entirety of sentences, sequential learning can benefit from having the future encoded as well as the history. In LSTM, the hidden layer vector at the time of $t$ encodes all input information until the time of $t$. The LSTM at $t$ can see the history, but cannot see the future. In most natural language processing tasks, we almost always get the whole sentence. In this case, if you can get future information like the historical information, it will be of great help to the sequence learning task.
To address this, we can design a bidirectional recurrent neural network by making a minor modification. A higher LSTM layer can process the sequence in reversed direction with regards to its immediate lower LSTM layer, i.e., deep LSTM layers take turns to train on input sequences from left-to-right and right-to-left. Therefore, LSTM layers at time-step $t$ can see both histories and the future, starting from the second layer. Fig. 4 illustrates the bidirectional recurrent neural networks.
To overcome this shortcoming, we can design a bidirectional recurrent network unit. The idea is simple and direct: make a small modification to the stacked recurrent neural network of the previous section by stacking multiple LSTM units and letting each layer read the output sequence of the previous layer in alternating order (forward, reverse, forward, and so on). Starting from layer 2, the LSTM unit at time $t$ therefore always sees both historical and future information. Figure 4 is a schematic diagram of the structure of a bidirectional recurrent neural network based on LSTM.
<p align="center"> <p align="center">
<img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/bidirectional_stacked_lstm.png?raw=true" width = "60%" align=center><br>
Fig 4. Bidirectional LSTMs Figure 4. Schematic diagram of a bidirectional recurrent neural network based on LSTM
</p> </p>
Note that this bidirectional RNN is different from the one proposed by Bengio et al. in machine translation tasks \[[3](#reference), [4](#reference)\]. We will introduce another bidirectional RNN in the following chapter [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) It should be noted that this bidirectional RNN structure is not the same as the one used by Bengio et al. in machine translation tasks \[[3](#References), [4](#References)\]. Another bidirectional recurrent neural network is introduced in the later [Machine Translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation) chapter.
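To make the claim about "seeing history and future from layer 2 onward" concrete, the sketch below tracks which input positions each layer output can depend on. It is a dependency model only, not an LSTM: each recurrent layer is abstracted as "the state at a position accumulates everything read so far in the scan direction", and the function names are hypothetical.

```python
def run_layer(deps, reverse):
    """Propagate dependency sets through one recurrent layer.

    deps[t] is the set of original input positions that the layer input at
    position t depends on. The recurrent state accumulates every input it
    has read so far in its scan direction.
    """
    n = len(deps)
    order = range(n - 1, -1, -1) if reverse else range(n)
    out = [None] * n
    seen = set()
    for t in order:
        seen = seen | deps[t]  # new set each step, so earlier outputs stay intact
        out[t] = seen
    return out


def stacked_alternating(n_steps, n_layers):
    """Stack layers that scan forward, reverse, forward, and so on."""
    deps = [{t} for t in range(n_steps)]  # each input depends only on itself
    history = []
    for layer in range(n_layers):
        deps = run_layer(deps, reverse=(layer % 2 == 1))
        history.append(deps)
    return history


history = stacked_alternating(n_steps=5, n_layers=3)
for layer, deps in enumerate(history, start=1):
    print("layer", layer, [sorted(d) for d in deps])
```

In this model the first (forward) layer at position $t$ depends on positions $0$ to $t$ only, while every position of the second (reverse) layer already depends on the whole sequence.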
### Conditional Random Field (CRF)
Typically, a neural network's lower layers learn representations while its very top layer accomplishes the final task. These principles can guide our problem-solving approaches. In SRL tasks, a **Conditional Random Field** (*CRF*) is built on top of the network in order to perform the final prediction to tag sequences. It takes representations provided by the last LSTM layer as input. ### Conditional Random Field
A neural network model usually solves a problem as follows: the lower layers learn a feature representation of the input, and the last layer completes the final task based on those features. In the SRL task, the deep LSTM network learns the feature representation of the input, and a Conditional Random Field (CRF) at the end of the network performs the sequence labeling on the basis of those features.
The CRF is an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. In essence, CRFs learn the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are sequences of input and $Y = (y_1, y_2, ... , y_n)$ are label sequences; to decode, simply search through $Y$ for a sequence that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$. CRF is a probabilistic structured model that can be regarded as a probabilistic undirected graphical model. Nodes represent random variables and edges represent probabilistic dependencies between them. In simple terms, a CRF learns the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ is the input sequence and $Y = (y_1, y_2, ... , y_n)$ is the label sequence; decoding, given the sequence $X$, finds the sequence $Y$ with the largest $P(Y|X)$, that is, $Y^* = \mbox{arg max}_{Y} P(Y | X)$.
Sequence tagging tasks do not assume a lot of conditional independence, because they only concern about the input and the output being linear sequences. Thus, the graph model of sequence tagging tasks is usually a simple chain or line, which results in a **Linear-Chain Conditional Random Field**, shown in Fig.5. The sequence labeling task only needs to consider that both the input and the output are a linear sequence. And since we only use the input sequence as a condition and do not make any conditional independent assumptions, there is no graph structure between the elements of the input sequence. In summary, the CRF defined on the chain diagram shown in Figure 5 is used in the sequence labeling task, which is called Linear Chain Conditional Random Field.
<p align="center"> <p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/linear_chain_crf.png?raw=true" width = "35%" align=center><br>
Fig 5. Linear Chain Conditional Random Field used in SRL tasks Figure 5. Linear chain conditional random field used in sequence labeling tasks
</p> </p>
By the fundamental theorem of random fields \[[5](#reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form: According to the factorization theorem on the linear chain condition random field \[[5](#References)\], the probability of a particular tag sequence $Y$ can be defined as given in the observation sequence $X$:
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$ $$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{ j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
Where $Z(X)$ is the normalization factor, and $t_j$ is a feature function defined on edges, called the transition feature, which depends on the current and previous positions. It scores the transition between the labels at positions $i - 1$ and $i$ given the input sequence $X$. $s_k$ is a feature function defined on nodes, called the state feature, which depends only on the current position. It scores the label at position $i$ given the observation sequence $X$. $\lambda_j$ and $\mu_k$ are the weights of the transition feature functions and the state feature functions respectively. In fact, $t$ and $s$ can be written in the same mathematical form, and the transition and state features can be summed over all positions $i$: $f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$. Referring to $f$ collectively as the feature function, $P(Y|X)$ can be expressed as:
where, $Z(X)$ is normalization constant, ${t_j}$ represents the feature functions defined on edges called the *transition feature*, which denotes the transition probabilities from $y_{i-1}$ to $y_i$ given input sequence $X$. ${s_k}$ represents the feature function defined on nodes, called the state feature, denoting the probability of $y_i$ given input sequence $X$. In addition, $\lambda_j$ and $\mu_k$ are weights corresponding to $t_j$ and $s_k$. Alternatively, $t$ and $s$ can be written in the same form that depends on $y_{i - 1}$, $y_i$, $X$, and $i$. Taking its summation over all nodes $i$, we have: $f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$, which defines the *feature function* $f$. Thus, $P(Y|X)$ can be written as:
$$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$ $$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$
where $\omega$ are the weights to the feature function that the CRF learns. While training, given input sequences and label sequences $D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$, by maximum likelihood estimation (**MLE**), we construct the following objective function: $\omega$ is the weight corresponding to the feature function and is the parameter to be learned by the CRF model. During training, for a given input sequence and the corresponding set of markup sequences $D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$ , by regularizing the maximum likelihood estimation to solve the following optimization objectives:
$$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
$$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W )\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
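The negative log-likelihood term of this objective can be sketched in plain Python. The sketch makes several simplifying assumptions that are not in the text: the weighted state features at position $i$ are collapsed into one score `emit[i][y]`, the weighted transition features into `trans[y_prev][y]`, there is no special start transition, and the regularization term $C \frac{1}{2}\lVert W\rVert^{2}$ is omitted. All function names are hypothetical. The normalizer $Z(X)$ is computed with the forward algorithm and checked against brute-force enumeration.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emit, trans, path):
    """Unnormalized log-score of one label sequence."""
    score = emit[0][path[0]]
    for i in range(1, len(path)):
        score += trans[path[i - 1]][path[i]] + emit[i][path[i]]
    return score


def log_partition(emit, trans):
    """log Z(X) via the forward algorithm, O(n * L^2) for L labels."""
    n_labels = len(trans)
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            logsumexp([alpha[p] + trans[p][y] for p in range(n_labels)]) + emit[i][y]
            for y in range(n_labels)
        ]
    return logsumexp(alpha)


def brute_force_log_partition(emit, trans):
    """log Z(X) by enumerating every label sequence; only for tiny inputs."""
    n_labels = len(trans)
    paths = product(range(n_labels), repeat=len(emit))
    return logsumexp([path_score(emit, trans, p) for p in paths])


def neg_log_likelihood(emit, trans, path):
    """-log p(Y | X) for a single training pair."""
    return log_partition(emit, trans) - path_score(emit, trans, path)


# Toy scores: 3 positions, 2 labels (values chosen arbitrarily).
emit = [[0.5, 1.0], [0.2, -0.3], [1.5, 0.1]]
trans = [[0.3, -0.2], [0.0, 0.4]]
print(log_partition(emit, trans), brute_force_log_partition(emit, trans))
print(neg_log_likelihood(emit, trans, [1, 0, 0]))
```

Summing this quantity over the $N$ training pairs gives the first term of $L$; in the neural model the emission scores would come from the last LSTM layer rather than from fixed numbers.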
This objective function can be solved via back-propagation in an end-to-end manner. While decoding, given input sequences $X$, search for sequence $\bar{Y}$ to maximize the conditional probability $\bar{P}(Y|X)$ via decoding methods (such as *Viterbi*, or [Beam Search Algorithm](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md#beam-search-algorithm)). This optimization objective can be solved by the back-propagation algorithm together with the rest of the neural network. When decoding, for a given input sequence $X$, a decoding algorithm (such as the Viterbi algorithm or beam search) finds the output sequence $\bar{Y}$ that maximizes the conditional probability $\bar{P}(Y|X)$.
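As an illustration of the decoding step, here is a minimal Viterbi sketch for a linear-chain model. It assumes the weighted features have already been collapsed into per-position label scores `emit[i][y]` and label-to-label scores `trans[y_prev][y]`; these names and the toy numbers are assumptions for the example, not part of the tutorial code.

```python
def viterbi(emit, trans):
    """Return the highest-scoring label sequence and its score.

    emit[i][y]: score of label y at position i.
    trans[p][y]: score of moving from label p to label y.
    """
    n_labels = len(trans)
    score = list(emit[0])  # best score of any path ending in each label
    backptr = []           # backptr[i - 1][y]: best previous label for y at i
    for i in range(1, len(emit)):
        new_score, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels), key=lambda p: score[p] + trans[p][y])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + trans[best_prev][y] + emit[i][y])
        score = new_score
        backptr.append(ptr)
    # Follow the back-pointers from the best final label.
    last = max(range(n_labels), key=lambda y: score[y])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Toy scores: 3 positions, 2 labels (values chosen arbitrarily).
emit = [[0.5, 1.0], [0.2, -0.3], [1.5, 0.1]]
trans = [[0.3, -0.2], [0.0, 0.4]]
best_path, best_score = viterbi(emit, trans)
print(best_path, best_score)
```

Because $Z(X)$ does not depend on $Y$, maximizing the unnormalized score is equivalent to maximizing $P(Y|X)$, so the decoder never needs the normalizer.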
### Deep Bidirectional LSTM (DB-LSTM) SRL model ### Deep bidirectional LSTM (DB-LSTM) SRL model
Given predicates and a sentence, SRL tasks aim to identify arguments of the given predicate and their semantic roles. If a sequence has $n$ predicates, we will process this sequence $n$ times. Here is the breakdown of a straight-forward model: In the SRL task, the input is “predicate” and “a sentence”. The goal is to find the argument of the predicate from this sentence and mark the semantic role of the argument. If a sentence contains $n$ predicates, the sentence will be processed for $n$ times. One of the most straightforward models is the following:
1. Construct inputs; 1. Construct the input;
- input 1: predicate, input 2: sentence - Input 1 is the predicate and 2 is the sentence
- expand input 1 into a sequence of the same length with input 2's sentence, using one-hot representation; - Extend input 1 to a sequence as long as input 2, expressed by one-hot mode;
2. Convert the one-hot sequences from step 1 to vector sequences via a word embedding's lookup table; 2. The predicate sequence and sentence sequence of the one-hot format are converted into a sequence of word vectors represented by real vectors through a vocabulary;
3. Learn the representation of input sequences by taking vector sequences from step 2 as inputs; 3. The two word vector sequences in step 2 are used as input of the bidirectional LSTM to learn the feature representation of the input sequence;
4. Take the representation from step 3 as input, label sequence as a supervisory signal, and realize sequence tagging tasks. 4. The CRF takes the features learned in the model in step 3 as input, and uses the tag sequence as the supervised signal to implement sequence labeling;
Here, we propose some improvements by introducing two simple but effective features: You can try this method. Here, we propose some improvements that introduce two simple features that are very effective in improving system performance:
- predicate context (**ctx-p**): A single predicate word may not describe all the predicate information, especially when the same words appear multiple times in a sentence. With the expanded context, the ambiguity can be largely eliminated. Thus, we extract $n$ words before and after predicate to construct a window chunk. - Predicate's context: In the above method, only the word vector of the predicate is used to express all the information related to the predicate. This method is always very weak, especially if the predicate appears multiple times in the sentence, it may cause certain ambiguity. From experience, a small segment of several words before and after the predicate can provide more information to help resolve ambiguity. So, we add this kind of experience to the model, and extract a "predicate context" fragment for each predicate, that is, a window fragment composed of $n$ words before and after the predicate;
- Predicate context area's tag: Introduces a 0-1 binary variable for each word in the sentence, which indicates whether the word is in the "predicate context" fragment;
- region mark ($m_r$): The binary marker on a word, $m_r$, takes the value of $1$ when the word is in the predicate context region, and $0$ if not. The modified model is as follows (Figure 6 is a schematic diagram of the model structure with a depth of 4):
After these modifications, the model is as follows, as illustrated in Figure 6:
1. Construct inputs
- Input 1: word sequence. Input 2: predicate. Input 3: predicate context, extract $n$ words before and after predicate. Input 4: region mark sequence, where an entry is 1 if the word is located in the predicate context region, 0 otherwise.
- expand input 2~3 into sequences with the same length with input 1
2. Convert input 1~4 to vector sequences via word embedding lookup tables; While input 1 and 3 shares the same lookup table, input 2 and 4 have separate lookup tables.
3. Take the four vector sequences from step 2 as inputs to bidirectional LSTMs; Train the LSTMs to update representations.
4. Take the representation from step 3 as input to CRF, label sequence as a supervisory signal, and complete sequence tagging tasks.
1. Construct input
- Input 1 is a sentence sequence, input 2 is a predicate sequence, input 3 is a predicate context, and $n$ words before and after the predicate are extracted from the sentence to form a predicate context, which represented by one-hot. Input 4 is a predicate context area which marks whether each word in the sentence is in the context of the predicate;
- Extend the input 2~3 to a sequence as long as the input 1;
2. Input 1~4 are converted into a sequence of word vectors represented by real vectors in vocabulary; where inputs 1 and 3 share the same vocabulary, and inputs 2 and 4 each have their own vocabulary;
3. The four word vector sequences in step 2 are used as input to the bidirectional LSTM model; the LSTM model learns the feature representation of the input sequence to obtain a new feature representation sequence;
4. The CRF takes the features learned in step 3 of the LSTM as input, and uses the marked sequence as the supervised signal to complete the sequence labeling;
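The two extra features described above (predicate context and region mark) can be sketched in plain Python. This is a minimal illustration, not the tutorial's preprocessing code: the helper name `predicate_context`, the padding token `<pad>`, and the middle tokens of the example sentence are assumptions made for this sketch.

```python
def predicate_context(words, pred_idx, n=2, pad="<pad>"):
    """Return the (2n+1)-word window around the predicate and the 0/1 region marks.

    `pad` is a hypothetical filler for window positions outside the sentence.
    """
    window = []
    for offset in range(-n, n + 1):
        j = pred_idx + offset
        window.append(words[j] if 0 <= j < len(words) else pad)
    # A word is marked 1 when it lies inside the predicate context window.
    marks = [1 if abs(i - pred_idx) <= n else 0 for i in range(len(words))]
    return window, marks


# Example sentence; the tokens between "record" and "n't" are assumed.
words = ["A", "record", "date", "has", "n't", "been", "set", "."]
window, marks = predicate_context(words, words.index("set"))

# Inputs 2 and 3 are expanded so that every word position carries the same
# predicate and the same context window.
expanded_predicate = ["set"] * len(words)
expanded_context = [window] * len(words)
```

With `n = 2` the window has five slots, which corresponds to the five predicate-context columns used later in the tutorial.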
<div align="center"> <div align="center">
<img src="image/db_lstm_network_en.png" width = "60%" align=center /><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/db_lstm_network.png?raw=true" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks Figure 6. Deep bidirectional LSTM model on the SRL task
</div> </div>
## Data Preparation
In the tutorial, we use [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task open dataset as an example. Note that the training set and development set of the CoNLL 2005 SRL task are not free to download after the competition. Currently, only the test set can be obtained, including 23 sections of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ corpus as the training dataset to explain the model. However, since the training set is small, for a usable neural network SRL system, please consider paying for the full corpus. ## Data Introduction
In this tutorial, we use the data set released for the [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task as an example. Note that the training set and development set of the CoNLL 2005 SRL task are not freely available to the public after the competition. Currently, only the test set is available, including 23 sections of the Wall Street Journal and 3 sections of the Brown corpus. In this tutorial, we use the WSJ data in the test set as the training set to explain the model. However, since the test set contains far too few samples, if you want to train a usable neural network SRL system, consider paying for the full data set.
The original data includes a variety of information such as POS tagging, naming entity recognition, syntax tree, etc. In this tutorial, we only use the data under `test.wsj/words/` (text sequence) and `test.wsj/props/` (label results). The data directory used in this tutorial is as follows: The original data also includes a variety of information such as part-of-speech tagging, named entity recognition, and syntax parse tree. In this tutorial, we use the data in the test.wsj folder for training and testing, and only use the data under the words folder (text sequence) and the props folder (labeled results). The data directories used in this tutorial are as follows:
```text ```text
conll05st-release/ conll05st-release/
└── test.wsj └── test.wsj
├── props # label results     ├── props # Label result
└── words # text sequence     └── words # Input text sequence
``` ```
The annotation information is derived from the results of Penn TreeBank\[[7](#references)\] and PropBank \[[8](#references)\]. The labeling of the PropBank is different from the labeling methods mentioned before, but shares with it the same underlying principle. For descriptions of the labeling, please refer to the paper \[[9](#references)\]. The labeling information is derived from the labeling results of Penn TreeBank\[[7](#References)\] and PropBank\[[8](#References)\]. The label of the PropBank labeling result is different from the labeling result label we used in the first example of the article, but the principle is the same. For the description of the meaning of the labeling result label, please refer to the paper \[[9](#References)\].
The raw data needs to be preprocessed into formats that PaddlePaddle can handle. The preprocessing consists of the following steps: The raw data needs to be preprocessed in order to be processed by PaddlePaddle. The preprocessing includes the following steps:
1. Merge the text sequence and the tag sequence into the same record; 1. Combine text sequences and tag sequences into one record;
2. If a sentence contains $n$ predicates, the sentence will be processed $n$ times into $n$ separate training samples, each sample with a different predicate; 2. If a sentence contains $n$ predicates, the sentence will be processed $n$ times, producing $n$ independent training samples, each with a different predicate;
3. Extract the predicate context and construct the predicate context region marker; 3. Extract the predicate context and construct the predicate context area tag;
4. Construct the markings in BIO format; 4. Construct a tag represented by the BIO method;
5. Obtain the integer index corresponding to the word according to the dictionary. 5. Get the integer index corresponding to the word according to the dictionary.
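Step 4 of the list above (BIO-format tags) can be illustrated with a short sketch. The helper `spans_to_bio` and the span boundaries below are hypothetical; they only show how argument spans are turned into per-token `B-`/`I-`/`O` tags.

```python
def spans_to_bio(num_tokens, spans):
    """Convert (start, end, role) spans, with `end` exclusive, into BIO tags."""
    tags = ["O"] * num_tokens
    for start, end, role in spans:
        tags[start] = "B-" + role          # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + role          # remaining tokens of the span
    return tags


# Hypothetical spans for an 8-token sentence whose predicate is token 6.
tags = spans_to_bio(8, [(0, 3, "A1"), (4, 5, "AM-NEG"), (6, 7, "V")])
```

Tokens that belong to no span keep the tag `O`, and a one-token span such as the predicate itself receives only a `B-` tag.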
After preprocessing, a training sample contains nine features, namely: word sequence, predicate, predicate context (5 columns), region mark sequence, label sequence. The following table is an example of a training sample. After the pre-processing is completed, a training sample data contains 9 fields, namely: sentence sequence, predicate, predicate context (accounting for 5 columns), predicate context area tag, and labeling sequence. The following table is an example of a training sample.
| word sequence | predicate | predicate context(5 columns) | region mark sequence | label sequence| | Sentence Sequence | Predicate | Predicate Context (Window = 5) | Predicate Context Area Tag | Label Sequence |
|---|---|---|---|---| |---|---|---|---|---|
| A | set | n't been set . × | 0 | B-A1 | | A | set | n't been set . × | 0 | B-A1 |
| record | set | n't been set . × | 0 | I-A1 | | record | set | n't been set . × | 0 | I-A1 |
...@@ -188,18 +165,19 @@ After preprocessing, a training sample contains nine features, namely: word sequ ...@@ -188,18 +165,19 @@ After preprocessing, a training sample contains nine features, namely: word sequ
| set | set | n't been set . × | 1 | B-V | | set | set | n't been set . × | 1 | B-V |
| . | set | n't been set . × | 1 | O | | . | set | n't been set . × | 1 | O |
In addition to the data, we provide following resources:
| filename | explanation | In addition to the data, we also provide the following resources:
| File Name | Description |
|---|---| |---|---|
| word_dict | dictionary of input sentences, total 44068 words | | word_dict | Dictionary of input sentences, 44068 words in total |
| label_dict | dictionary of labels, total 106 labels | | label_dict | Dictionary of tags, 106 tags in total |
| predicate_dict | predicate dictionary, total 3162 predicates | | predicate_dict | Dictionary of predicates, 3162 predicates in total |
| emb | a pre-trained word vector lookup table, 32-dimensional | | emb | A pre-trained word vector lookup table, 32-dimensional |
We trained a language model on the English Wikipedia to get a word vector lookup table used to initialize the SRL model. While training the SRL model, the word vector lookup table is no longer updated. To learn more about the language model and the word vector lookup table, please refer to the tutorial [word vector](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/README.md). There are 995,000,000 tokens in the training corpus, and the dictionary size is 4,900,000 words. In the CoNLL 2005 training corpus, 5% of the words are not in the 4,900,000 words, and we see them all as unknown words, represented by `<unk>`. We trained a language model on English Wikipedia to obtain word vectors for initializing the SRL model. During training of the SRL model, the word vectors are no longer updated. For the language model and word vectors, refer to the [Word Vector](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec) tutorial. The corpus used to train the language model has 995,000,000 tokens in total, and the dictionary size is limited to 4,900,000 words. In the CoNLL 2005 training corpus, 5% of the words are not among these 4,900,000 words; we treat all of them as unknown words, represented by `<unk>`.
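The handling of out-of-dictionary words can be sketched as a dictionary lookup with a fallback. The toy dictionary and the helper `words_to_ids` are assumptions for illustration; the real `word_dict` is the one shipped with the dataset.

```python
UNK = "<unk>"


def words_to_ids(words, word_dict):
    """Map words to integer ids, sending unseen words to the <unk> id."""
    unk_id = word_dict[UNK]
    return [word_dict.get(w, unk_id) for w in words]


# Toy dictionary, for illustration only.
toy_dict = {"<unk>": 0, "a": 1, "record": 2, "set": 3}
ids = words_to_ids(["a", "record", "zyzzyva", "set"], toy_dict)
```

Every word outside the dictionary shares the single `<unk>` id, so all such words also share one row of the embedding table.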
Here we fetch the dictionary, and print its size: Get the dictionary and print the dictionary size:
```python ```python
from __future__ import print_function from __future__ import print_function
...@@ -229,95 +207,132 @@ print('pred_dict_len: ', pred_dict_len) ...@@ -229,95 +207,132 @@ print('pred_dict_len: ', pred_dict_len)
- Define input data dimensions and model hyperparameters. - Define input data dimensions and model hyperparameters.
```python ```python
mark_dict_len = 2 mark_dict_len = 2 # Size of the context area flag dictionary; the flag is a 0-1 binary feature, so the size is 2
word_dim = 32 word_dim = 32 # Word vector dimension
mark_dim = 5 mark_dim = 5 # The predicate context area flag is mapped to a real vector by an embedding table; this is the dimension of that vector
hidden_dim = 512 hidden_dim = 512 # LSTM hidden layer vector dimension: 512 / 4
depth = 8 depth = 8 # Depth of the stacked LSTM
mix_hidden_lr = 1e-3 mix_hidden_lr = 1e-3 # Base learning rate of the linear_chain_crf layer
IS_SPARSE = True IS_SPARSE = True # Whether to update embedding in sparse way
PASS_NUM = 10 PASS_NUM = 10 # Number of training epochs
BATCH_SIZE = 10 BATCH_SIZE = 10 # Batch size
embedding_name = 'emb' embedding_name = 'emb'
``` ```
Note that `hidden_dim = 512` means a LSTM hidden vector of 128 dimension (512/4). Please refer to PaddlePaddle's official documentation for detail: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory) It should be specially noted that the parameter `hidden_dim = 512` actually specifies the dimension of the LSTM hidden layer's vector is 128. For this, please refer to the description of `dynamic_lstm` in the official PaddlePaddle API documentation.
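The relation between `hidden_dim = 512` and the 128-dimensional hidden vector is plain arithmetic: the value passed as the LSTM size is four times the hidden width, because the projected input is divided into four equally sized blocks (three gates plus the cell candidate). The sketch below uses NumPy only and does not call PaddlePaddle; the array shapes are illustrative assumptions.

```python
import numpy as np

hidden_dim = 512                       # value passed as the LSTM `size`
lstm_hidden_size = hidden_dim // 4     # width of the actual hidden state

# A projected input for a 7-step sequence, split into four equal blocks.
projected = np.zeros((7, hidden_dim), dtype=np.float32)
blocks = np.split(projected, 4, axis=1)
block_shapes = [b.shape for b in blocks]
```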
- Define a parameter loader method to load the pre-trained word lookup tables from word embeddings trained on the English language Wikipedia. - As is mentioned above, we use the trained word vector based on English Wikipedia to initialize the embedding layer parameters of the total six features of the sequence input and predicate context, which are not updated during training.
```python ```python
#Here load the binary parameters saved by PaddlePaddle
def load_parameter(file_name, h, w): def load_parameter(file_name, h, w):
with open(file_name, 'rb') as f: with open(file_name, 'rb') as f:
f.read(16) # skip header. f.read(16) # skip header.
return np.fromfile(f, dtype=np.float32).reshape(h, w) return np.fromfile(f, dtype=np.float32).reshape(h, w)
``` ```
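To see what `load_parameter` expects, the following sketch writes a small file with a 16-byte header followed by raw `float32` values and reads it back. The header contents (all zero bytes) and the file name `toy_emb.bin` are assumptions for this round trip; the real header written by PaddlePaddle is simply skipped, as in the function above.

```python
import os
import tempfile

import numpy as np


def load_parameter(file_name, h, w):
    with open(file_name, 'rb') as f:
        f.read(16)  # skip the 16-byte header
        return np.fromfile(f, dtype=np.float32).reshape(h, w)


expected = np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_emb.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"\x00" * 16)        # placeholder header
        f.write(expected.tobytes())  # row-major float32 payload
    loaded = load_parameter(path, 2, 3)
```

The `h` and `w` arguments must multiply to the number of stored values, otherwise `reshape` raises a `ValueError`.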
- Transform the word sequence itself, the predicate, the predicate context, and the region mark sequence into embedded vector sequences.
- 8 LSTM units are trained through alternating left-to-right / right-to-left order denoted by the variable `reverse`.
## Training Model
- We train according to the network topology and model parameters. We also need to specify the optimization method when constructing the model. Here we use the most basic SGD method (momentum is set to 0), and set the learning rate, regularization, and so on.
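The training code in this section configures the learning rate with `exponential_decay` (`learning_rate=0.01`, `decay_steps=100000`, `decay_rate=0.5`, `staircase=True`). A minimal sketch of that schedule in plain Python, assuming the usual staircase formula `lr = base_lr * decay_rate ** floor(step / decay_steps)`, looks like this; the function name is hypothetical.

```python
def staircase_exponential_decay(base_lr, global_step, decay_steps, decay_rate):
    """Learning rate that is multiplied by `decay_rate` every `decay_steps` steps."""
    return base_lr * decay_rate ** (global_step // decay_steps)


schedule = [
    staircase_exponential_decay(0.01, step, 100000, 0.5)
    for step in (0, 99999, 100000, 250000)
]
```

Under these settings the rate stays at 0.01 for the first 100000 steps and is halved at each multiple of 100000 after that.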
Define hyperparameters for the training process
```python
use_cuda = False #Execute training on cpu
save_dirname = "label_semantic_roles.inference.model" #The model parameters obtained by training are saved in the file.
is_local = True
```
### Data input layer definition
Defines the format of the model input features, including the sentence sequence, the predicate, the five features of the predicate context, and the predicate context area flags.
```python
# Sentence sequences
word = fluid.layers.data(
name='word_data', shape=[1], dtype='int64', lod_level=1)
# predicate
predicate = fluid.layers.data(
name='verb_data', shape=[1], dtype='int64', lod_level=1)
# predicate context's 5 features
ctx_n2 = fluid.layers.data(
name='ctx_n2_data', shape=[1], dtype='int64', lod_level=1)
ctx_n1 = fluid.layers.data(
name='ctx_n1_data', shape=[1], dtype='int64', lod_level=1)
ctx_0 = fluid.layers.data(
name='ctx_0_data', shape=[1], dtype='int64', lod_level=1)
ctx_p1 = fluid.layers.data(
name='ctx_p1_data', shape=[1], dtype='int64', lod_level=1)
ctx_p2 = fluid.layers.data(
name='ctx_p2_data', shape=[1], dtype='int64', lod_level=1)
# Predicate context area flag
mark = fluid.layers.data(
name='mark_data', shape=[1], dtype='int64', lod_level=1)
```
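Each input above is declared with `lod_level=1`, meaning it is a batch of variable-length sequences stored as one flat array plus length information. The sketch below uses NumPy only, not the PaddlePaddle API, to show how a length-based LoD such as `[[3, 4, 2]]` partitions a flat array of word ids; the helper names are assumptions.

```python
import numpy as np


def lengths_to_offsets(lengths):
    """Turn sequence lengths [3, 4, 2] into boundaries [0, 3, 7, 9]."""
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets


def split_by_lod(flat, lod):
    """Split a flat (total_len, 1) array using a single-level length-based LoD."""
    offsets = lengths_to_offsets(lod[0])
    return [flat[s:e] for s, e in zip(offsets[:-1], offsets[1:])]


flat_ids = np.arange(9, dtype=np.int64).reshape(-1, 1)  # 9 word ids, shape [9, 1]
sentences = split_by_lod(flat_ids, [[3, 4, 2]])
```

The outer list of the LoD has one entry per nesting level, which is why a batch of three sentences is written as `[[3, 4, 2]]` rather than `[3, 4, 2]`.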
### Defining the network structure
First define the embedding layers for the model inputs; the word embedding table is pre-trained
```python ```python
def db_lstm(word, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark, #pre-training predicate and predicate context area flags
**ignored): predicate_embedding = fluid.layers.embedding(
# 8 features
predicate_embedding = fluid.layers.embedding(
input=predicate, input=predicate,
size=[pred_dict_len, word_dim], size=[pred_dict_len, word_dim],
dtype='float32', dtype='float32',
is_sparse=IS_SPARSE, is_sparse=IS_SPARSE,
param_attr='vemb') param_attr='vemb')
mark_embedding = fluid.layers.embedding( mark_embedding = fluid.layers.embedding(
input=mark, input=mark,
size=[mark_dict_len, mark_dim], size=[mark_dict_len, mark_dim],
dtype='float32', dtype='float32',
is_sparse=IS_SPARSE) is_sparse=IS_SPARSE)
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2] #Sentence sequences and predicate context 5 features then pre-trained
# Since word vector lookup table is pre-trained, we won't update it this time. word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
# trainable being False prevents updating the lookup table during training. #Because word vector is pre-trained, no longer training embedding table,
emb_layers = [ # The trainable's parameter attribute set to False prevents the embedding table from being updated during training
emb_layers = [
fluid.layers.embedding( fluid.layers.embedding(
size=[word_dict_len, word_dim], size=[word_dict_len, word_dim],
input=x, input=x,
param_attr=fluid.ParamAttr( param_attr=fluid.ParamAttr(
name=embedding_name, trainable=False)) for x in word_input name=embedding_name, trainable=False)) for x in word_input
] ]
emb_layers.append(predicate_embedding) # Pre-training results for adding predicate and predicate context area tags
emb_layers.append(mark_embedding) emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding)
```
Define eight LSTM units to learn all input sequences in "forward/reverse" order.
# 8 LSTM units are trained through alternating left-to-right / right-to-left order ```python
# denoted by the variable `reverse`. # A total of 8 LSTM units are trained, each unit is oriented from left to right or right to left.
hidden_0_layers = [ # Determined by the parameter `is_reverse`
# First stack structure
hidden_0_layers = [
fluid.layers.fc(input=emb, size=hidden_dim, act='tanh') fluid.layers.fc(input=emb, size=hidden_dim, act='tanh')
for emb in emb_layers for emb in emb_layers
] ]
hidden_0 = fluid.layers.sums(input=hidden_0_layers) hidden_0 = fluid.layers.sums(input=hidden_0_layers)
lstm_0 = fluid.layers.dynamic_lstm( lstm_0 = fluid.layers.dynamic_lstm(
input=hidden_0, input=hidden_0,
size=hidden_dim, size=hidden_dim,
candidate_activation='relu', candidate_activation='relu',
gate_activation='sigmoid', gate_activation='sigmoid',
cell_activation='sigmoid') cell_activation='sigmoid')
# stack L-LSTM and R-LSTM with direct edges # Stack L-LSTM and R-LSTM with directly connected sides
input_tmp = [hidden_0, lstm_0] input_tmp = [hidden_0, lstm_0]
# In PaddlePaddle, state features and transition features of a CRF are implemented
    # by a fully connected layer and a CRF layer separately. The fully connected layer
    # with linear activation learns the state features, here we use fluid.layers.sums
    # (fluid.layers.fc can be used as well), and the CRF layer in PaddlePaddle:
# fluid.layers.linear_chain_crf only
# learns the transition features, which is a cost layer and is the last layer of the network.
# fluid.layers.linear_chain_crf outputs the log probability of true tag sequence
# as the cost by given the input sequence and it requires the true tag sequence
# as target in the learning process.
for i in range(1, depth): # remaining stack structure
for i in range(1, depth):
mix_hidden = fluid.layers.sums(input=[ mix_hidden = fluid.layers.sums(input=[
fluid.layers.fc(input=input_tmp[0], size=hidden_dim, act='tanh'), fluid.layers.fc(input=input_tmp[0], size=hidden_dim, act='tanh'),
fluid.layers.fc(input=input_tmp[1], size=hidden_dim, act='tanh') fluid.layers.fc(input=input_tmp[1], size=hidden_dim, act='tanh')
...@@ -333,102 +348,82 @@ def db_lstm(word, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark, ...@@ -333,102 +348,82 @@ def db_lstm(word, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark,
input_tmp = [mix_hidden, lstm] input_tmp = [mix_hidden, lstm]
feature_out = fluid.layers.sums(input=[ # Fetch the output of the last stack LSTM and the input of this LSTM unit to the hidden layer mapping,
# Learn the state feature of CRF after a fully connected layer maps to the dimensions of the tags dictionary
feature_out = fluid.layers.sums(input=[
fluid.layers.fc(input=input_tmp[0], size=label_dict_len, act='tanh'), fluid.layers.fc(input=input_tmp[0], size=label_dict_len, act='tanh'),
fluid.layers.fc(input=input_tmp[1], size=label_dict_len, act='tanh') fluid.layers.fc(input=input_tmp[1], size=label_dict_len, act='tanh')
]) ])
return feature_out
```
## Train model
- In the `train` method, we will create trainer given model topology, parameters, and optimization method. We will use the most basic **SGD** method, which is a momentum optimizer with 0 momentum. Meanwhile, we will set learning rate and decay.
- As mentioned in data preparation section, we will use CoNLL 2005 test corpus as the training data set. `conll05.test()` outputs one training instance at a time. It is shuffled and batched into mini batches, and used as input.
- `feeding` is used to specify the correspondence between data instance and data layer. For example, according to the `feeding`, the 0th column of data instance produced by`conll05.test()` is matched to the data layer named `word_data`.
- `event_handler` can be used as callback for training events, it will be used as an argument for the `train` method. Following `event_handler` prints cost during training.
- `trainer.train` will train the model.
```python # tag/label sequence
def train(use_cuda, save_dirname=None, is_local=True): target = fluid.layers.data(
# define network topology
word = fluid.layers.data(
name='word_data', shape=[1], dtype='int64', lod_level=1)
predicate = fluid.layers.data(
name='verb_data', shape=[1], dtype='int64', lod_level=1)
ctx_n2 = fluid.layers.data(
name='ctx_n2_data', shape=[1], dtype='int64', lod_level=1)
ctx_n1 = fluid.layers.data(
name='ctx_n1_data', shape=[1], dtype='int64', lod_level=1)
ctx_0 = fluid.layers.data(
name='ctx_0_data', shape=[1], dtype='int64', lod_level=1)
ctx_p1 = fluid.layers.data(
name='ctx_p1_data', shape=[1], dtype='int64', lod_level=1)
ctx_p2 = fluid.layers.data(
name='ctx_p2_data', shape=[1], dtype='int64', lod_level=1)
mark = fluid.layers.data(
name='mark_data', shape=[1], dtype='int64', lod_level=1)
# define network topology
feature_out = db_lstm(**locals())
target = fluid.layers.data(
name='target', shape=[1], dtype='int64', lod_level=1) name='target', shape=[1], dtype='int64', lod_level=1)
crf_cost = fluid.layers.linear_chain_crf(
# Learning CRF transfer features
crf_cost = fluid.layers.linear_chain_crf(
input=feature_out, input=feature_out,
label=target, label=target,
param_attr=fluid.ParamAttr( param_attr=fluid.ParamAttr(
name='crfw', learning_rate=mix_hidden_lr)) name='crfw', learning_rate=mix_hidden_lr))
avg_cost = fluid.layers.mean(crf_cost)
sgd_optimizer = fluid.optimizer.SGD( avg_cost = fluid.layers.mean(crf_cost)
# Use the most basic SGD optimization method (momentum is set to 0)
sgd_optimizer = fluid.optimizer.SGD(
learning_rate=fluid.layers.exponential_decay( learning_rate=fluid.layers.exponential_decay(
learning_rate=0.01, learning_rate=0.01,
decay_steps=100000, decay_steps=100000,
decay_rate=0.5, decay_rate=0.5,
staircase=True)) staircase=True))
sgd_optimizer.minimize(avg_cost) sgd_optimizer.minimize(avg_cost)
# The CRF decoding layer is used for evaluation and inference. ```
# It shares weights with CRF layer. The sharing of parameters among multiple layers
# is specified by using the same parameter name in these layers. If true tag sequence The data introduction section mentions the payment of the CoNLL 2005 training set. Here we use the test set training for everyone to learn. Conll05.test() produces one sample every time, containing 9 features, then shuffle and after batching as the input for training.
# is provided in training process, `fluid.layers.crf_decoding` calculates labelling error
# for each input token and sums the error over the entire sequence. ```python
# Otherwise, `fluid.layers.crf_decoding` generates the labelling tags. crf_decode = fluid.layers.crf_decoding(
crf_decode = fluid.layers.crf_decoding(
input=feature_out, param_attr=fluid.ParamAttr(name='crfw')) input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
train_data = paddle.batch( train_data = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.conll05.test(), buf_size=8192), paddle.dataset.conll05.test(), buf_size=8192),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
```
The corresponding relationship between each data and data_layer is specified by the feeder. The following feeder indicates that the data_layer corresponding to the 0th column of the data generated by conll05.test() is `word`.
feeder = fluid.DataFeeder( ```python
feeder = fluid.DataFeeder(
feed_list=[ feed_list=[
word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, predicate, mark, target word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, predicate, mark, target
], ],
place=place) place=place)
exe = fluid.Executor(place) exe = fluid.Executor(place)
```
Start training
```python
main_program = fluid.default_main_program()
def train_loop(main_program): exe.run(fluid.default_startup_program())
exe.run(fluid.default_startup_program()) embedding_param = fluid.global_scope().find_var(
embedding_param = fluid.global_scope().find_var(
embedding_name).get_tensor() embedding_name).get_tensor()
embedding_param.set( embedding_param.set(
load_parameter(conll05.get_embedding(), word_dict_len, word_dim), load_parameter(conll05.get_embedding(), word_dict_len, word_dim),
place) place)
start_time = time.time() start_time = time.time()
batch_id = 0 batch_id = 0
for pass_id in six.moves.xrange(PASS_NUM): for pass_id in six.moves.xrange(PASS_NUM):
for data in train_data(): for data in train_data():
cost = exe.run(main_program, cost = exe.run(main_program,
feed=feeder.feed(data), feed=feeder.feed(data),
...@@ -448,77 +443,75 @@ def train(use_cuda, save_dirname=None, is_local=True): ...@@ -448,77 +443,75 @@ def train(use_cuda, save_dirname=None, is_local=True):
'ctx_n1_data', 'ctx_0_data', 'ctx_p1_data', 'ctx_n1_data', 'ctx_0_data', 'ctx_p1_data',
'ctx_p2_data', 'mark_data' 'ctx_p2_data', 'mark_data'
], [feature_out], exe) ], [feature_out], exe)
return break
batch_id = batch_id + 1 batch_id = batch_id + 1
train_loop(fluid.default_main_program())
``` ```
## Application ## Model Application
After completing the training, the optimal model needs to be selected according to a performance indicator we care about. You can simply select the model with the fewest labeling errors on the test set. We give an example of using a trained model for prediction as follows.
- When training is completed, we need to select an optimal model based on a performance index to do inference. In this task, one can simply select the model with the fewest labeling errors on the test set. We demonstrate doing an inference using the trained model. First set the parameters of the prediction process
```python ```python
def infer(use_cuda, save_dirname=None): use_cuda = False #predict on cpu
if save_dirname is None: save_dirname = "label_semantic_roles.inference.model" #call trained model for prediction
return
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() exe = fluid.Executor(place)
exe = fluid.Executor(place) ```
Set the input. A LoDTensor represents the input word sequences; the base_shape of each word is [1], because each word is represented by a single id. If the length-based LoD is [[3, 4, 2]], a single-level LoD, then the constructed LoDTensor contains three sequences whose lengths are 3, 4, and 2.
inference_scope = fluid.core.Scope()
with fluid.scope_guard(inference_scope): Note that LoD is a list of lists.
# Use fluid.io.load_inference_model to obtain the inference program desc,
# the feed_target_names (the names of variables that will be fed
# data using feed operators), and the fetch_targets (variables that
# we want to obtain data from using fetch operators).
[inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(save_dirname, exe)
# Setup inputs by creating LoDTensors to represent sequences of words. ```python
# Here each word is the basic element of these LoDTensors and the shape of lod = [[3, 4, 2]]
# each word (base_shape) should be [1] since it is simply an index to base_shape = [1]
# look up for the corresponding word vector.
# Suppose the length_based level of detail (lod) info is set to [[3, 4, 2]], # Construct fake data as input, the range of random integer numbers is [low, high]
# which has only one lod level. Then the created LoDTensors will have only word = fluid.create_random_int_lodtensor(
# one higher level structure (sequence of words, or sentence) than the basic
# element (word). Hence the LoDTensor will hold data for three sentences of
# length 3, 4 and 2, respectively.
# Note that lod info should be a list of lists.
lod = [[3, 4, 2]]
base_shape = [1]
# The range of random integers is [low, high]
word = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
pred = fluid.create_random_int_lodtensor( pred = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=pred_dict_len - 1) lod, base_shape, place, low=0, high=pred_dict_len - 1)
ctx_n2 = fluid.create_random_int_lodtensor( ctx_n2 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_n1 = fluid.create_random_int_lodtensor( ctx_n1 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_0 = fluid.create_random_int_lodtensor( ctx_0 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_p1 = fluid.create_random_int_lodtensor( ctx_p1 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_p2 = fluid.create_random_int_lodtensor( ctx_p2 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
mark = fluid.create_random_int_lodtensor( mark = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=mark_dict_len - 1) lod, base_shape, place, low=0, high=mark_dict_len - 1)
```
# Construct feed as a dictionary of {feed_target_name: feed_target_data} Using fluid.io.load_inference_model to load inference_program, feed_target_names is the name of the model's input variable, and fetch_targets is the predicted object.
# and results will contain a list of data corresponding to fetch_targets.
assert feed_target_names[0] == 'word_data' ```python
assert feed_target_names[1] == 'verb_data' [inference_program, feed_target_names,
assert feed_target_names[2] == 'ctx_n2_data' fetch_targets] = fluid.io.load_inference_model(save_dirname, exe)
assert feed_target_names[3] == 'ctx_n1_data' ```
assert feed_target_names[4] == 'ctx_0_data' Construct the feed dictionary {feed_target_name: feed_target_data}, where the results are a list of predicted targets
assert feed_target_names[5] == 'ctx_p1_data'
assert feed_target_names[6] == 'ctx_p2_data' ```python
assert feed_target_names[7] == 'mark_data' assert feed_target_names[0] == 'word_data'
assert feed_target_names[1] == 'verb_data'
results = exe.run(inference_program, assert feed_target_names[2] == 'ctx_n2_data'
assert feed_target_names[3] == 'ctx_n1_data'
assert feed_target_names[4] == 'ctx_0_data'
assert feed_target_names[5] == 'ctx_p1_data'
assert feed_target_names[6] == 'ctx_p2_data'
assert feed_target_names[7] == 'mark_data'
```
Execute prediction
```python
results = exe.run(inference_program,
feed={ feed={
feed_target_names[0]: word, feed_target_names[0]: word,
feed_target_names[1]: pred, feed_target_names[1]: pred,
...@@ -531,44 +524,33 @@ def infer(use_cuda, save_dirname=None): ...@@ -531,44 +524,33 @@ def infer(use_cuda, save_dirname=None):
}, },
fetch_list=fetch_targets, fetch_list=fetch_targets,
return_numpy=False) return_numpy=False)
print(results[0].lod())
np_data = np.array(results[0])
print("Inference Shape: ", np_data.shape)
``` ```
- The main entrance of the whole program is as below: Output result
```python ```python
def main(use_cuda, is_local=True): print(results[0].lod())
if use_cuda and not fluid.core.is_compiled_with_cuda(): np_data = np.array(results[0])
return print("Inference Shape: ", np_data.shape)
# Directory for saving the trained model
save_dirname = "label_semantic_roles.inference.model"
train(use_cuda, save_dirname, is_local)
infer(use_cuda, save_dirname)
main(use_cuda=False)
``` ```
## Conclusion ## Conclusion
Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we use SRL as an example to illustrate using PaddlePaddle to do sequence tagging tasks. The models proposed are from our published paper\[[10](#Reference)\]. We only use test data for illustration since the training data on the CoNLL 2005 dataset is not completely public. This aims to propose an end-to-end neural network model with fewer dependencies on natural language processing tools but is comparable, or even better than traditional models in terms of performance. Please check out our paper for more information and discussions. Labeling semantic roles is an important intermediate step in many natural language understanding tasks. In this tutorial, we take the label semantic roles task as an example to introduce how to use PaddlePaddle for sequence labeling tasks. The model presented in the tutorial comes from our published paper \[[10](#References)\]. Since the training data for the CoNLL 2005 SRL task is not currently fully open, only the test data is used as an example in the tutorial. In this process, we hope to reduce our reliance on other natural language processing tools. We can use neural network data-driven, end-to-end learning capabilities to get a model that is comparable or even better than traditional methods. In the paper, we confirmed this possibility. More information and discussion about the model can be found in the paper.
<a name="References"></a>
## References ## References
1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483. 1. Sun W, Sui Z, Wang M, et al. [Chinese label semantic roles with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings Of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013. 2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014. 3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv: 1406.1078, 2014.
4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014. 4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014.
5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](http://www.jmlr.org/papers/volume15/doppa14a/source/biblio.bib.old)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289. 5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers) [C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
6. 李航. 统计学习方法[J]. 清华大学出版社, 北京, 2012. 6. Li Hang. Statistical Learning Method[J]. Tsinghua University Press, Beijing, 2012.
7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330. 7. Marcus MP, Marcinkiewicz MA, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J] Computational linguistics, 1993, 19(2): 313-330.
8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106. 8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264) [J]. Computational linguistics, 2005 , 31(1): 71-106.
9. Carreras X, Màrquez L. [Introduction to the CoNLL-2005 shared task: Semantic role labeling](http://www.cs.upc.edu/~srlconll/st05/papers/intro.pdf)[C]//Proceedings of the Ninth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2005: 152-164. 9. Carreras X, Màrquez L. [Introduction to the CoNLL-2005 shared task: label semantic roles](http://www.cs.upc.edu/~srlconll/st05/papers/intro.pdf)[C]/ /Proceedings of the Ninth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2005: 152-164.
10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 10. Zhou J, Xu W. [End-to-end learning of label semantic roles using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C] //Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
### This file is only used for continuous evaluation test!
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi

train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True, desc='train cost')
tracking_kpis = [train_cost_kpi]


def parse_log(log):
    # Each KPI line is expected to look like: kpis<TAB>name<TAB>value
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        print(fs)
        if len(fs) == 3 and fs[0] == 'kpis':
            kpi_name = fs[1]
            kpi_value = float(fs[2])
            yield kpi_name, kpi_value


def log_to_ce(log):
    kpi_tracker = {}
    for kpi in tracking_kpis:
        kpi_tracker[kpi.name] = kpi
    for (kpi_name, kpi_value) in parse_log(log):
        print(kpi_name, kpi_value)
        kpi_tracker[kpi_name].add_record(kpi_value)
        kpi_tracker[kpi_name].persist()


if __name__ == '__main__':
    log = sys.stdin.read()
    log_to_ce(log)
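As a rough illustration of the log format that `parse_log` above filters for, the following standalone sketch re-implements only the parsing step. The names `parse_kpi_lines` and `sample_log` are hypothetical, and the sample log text is an assumption: this section does not show what `train.py --enable_ce` actually prints.

```python
def parse_kpi_lines(log):
    """Yield (name, value) for each line shaped like 'kpis<TAB>name<TAB>value'."""
    for line in log.split('\n'):
        fields = line.strip().split('\t')
        # Lines that do not have exactly three tab-separated fields are ignored.
        if len(fields) == 3 and fields[0] == 'kpis':
            yield fields[1], float(fields[2])


# Hypothetical training output: only the middle line is a KPI record.
sample_log = "epoch 0 done\nkpis\ttrain_cost\t0.0125\nsome other output\n"
records = list(parse_kpi_lines(sample_log))
print(records)
```

Any line without the `kpis` prefix is skipped, so ordinary training output can be mixed with KPI records in the same stream.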
...@@ -44,13 +44,28 @@ ...@@ -44,13 +44,28 @@
The source code of this tutorial is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles). If you are new to the book, please refer to the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
### Notes
1. This tutorial can run in either a CPU or a GPU environment.
2. CUDA/cuDNN versions supported by the Docker image
If you run the Book with Docker, note that the default image provides a GPU environment with CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in the document and in the script
Note: to make this tutorial easier to read and use, we split and adjusted the code of [train.py](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles/train.py) and placed it in this document. The code in this document produces the same results as train.py, and you can run train.py directly to verify.
## Background
Natural language analysis techniques are roughly divided into three levels: lexical analysis, syntactic analysis, and semantic analysis. Semantic role labeling is one way to implement shallow semantic analysis. In a sentence, the predicate is a statement or explanation of the subject, indicating "what it does", "what it is", or "what it is like", and represents the core of an event; the nouns that accompany the predicate are called arguments. A semantic role is the role an argument plays in the event denoted by the verb. The main roles are Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal, and Source.
Consider the following example: "遇到" (encounter) is the predicate (Predicate, usually abbreviated as "Pred"), "小明" (Xiaoming) is the Agent, "小红" (Xiaohong) is the Patient, "昨天" (yesterday) is the time at which the event occurred (Time), and "公园" (park) is the location where the event occurred (Location).
$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mbox{Time}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/Eqn1.png?raw=true"><br/>
</p>
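Semantic Role Labeling (SRL) is centered on the predicate of a sentence. It does not analyze the semantic information of the sentence in depth; it only analyzes the relationships between the components of the sentence and the predicate, that is, the predicate-argument structure of the sentence, and describes these structural relationships with semantic roles. It is an important intermediate step in many natural language understanding tasks (such as information extraction, discourse analysis, and deep question answering). Research generally assumes that the predicate is given, and the task is to find the arguments of the given predicate and their semantic roles.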
...@@ -127,15 +142,22 @@ CRF是一种概率化结构模型,可以看作是一个概率无向图模型 ...@@ -127,15 +142,22 @@ CRF是一种概率化结构模型,可以看作是一个概率无向图模型
According to the factorization theorem for linear-chain conditional random fields \[[5](#参考文献)\], given an observation sequence $X$, the probability of a particular label sequence $Y$ can be defined as:
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/Eqn2.gif?raw=true"><br/>
</p>
Here $Z(X)$ is the normalization factor. $t_j$ is a feature function defined on edges; it depends on the current and previous positions and is called a transition feature, representing the transition probability of the labels at positions $i$ and $i - 1$ for the input sequence $X$ and its label sequence. $s_k$ is a feature function defined on nodes, called a state feature; it depends on the current position and represents the label probability at position $i$ for the observation sequence $X$. $\lambda_j$ and $\mu_k$ are the weights of the transition and state feature functions, respectively. In fact, $t$ and $s$ can be written in the same mathematical form, and summing the transition and state features over all positions $i$ gives $f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$. Calling $f$ collectively the feature functions, $P(Y|X)$ can be written as:
$$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/Eqn3.gif?raw=true"><br/>
</p>
$\omega$ denotes the weights of the feature functions, which are the parameters the CRF model learns. During training, given the set of input sequences and corresponding label sequences $D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$, the following objective is optimized through regularized maximum likelihood estimation:
$$\DeclareMathOperator*{\argmax}{arg\,max} L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$ <p align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/Eqn4.png?raw=true"><br/>
</p>
This optimization objective can be solved jointly with the whole neural network through backpropagation. When decoding, for a given input sequence $X$, a decoding algorithm (typically the Viterbi algorithm or beam search) finds the output sequence $\bar{Y}$ that maximizes the conditional probability $\bar{P}(Y|X)$.
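The decoding step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the tutorial: `emissions[i][y]` stands in for the weighted state features at position `i`, `transitions[p][y]` stands in for the weighted transition features from label `p` to label `y`, and all numbers are made up. The normalization factor $Z(X)$ is omitted because it is the same for every label sequence and does not change the maximizing sequence.

```python
def viterbi(emissions, transitions):
    """Return (best_path, best_score) for a linear-chain model.

    emissions[i][y]   : score of label y at position i
    transitions[p][y] : score of moving from label p to label y
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for i in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(num_labels):
            best_prev = max(range(num_labels),
                            key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[i][y])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back pointers from the best final label.
    last = max(range(num_labels), key=lambda y: score[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Labels: 0 = O, 1 = B, 2 = I. The transition O -> I is heavily penalized.
emissions = [[2.0, 1.0, 0.0], [0.0, 0.5, 1.75], [1.0, 0.0, 0.25]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
best_path, best_score = viterbi(emissions, transitions)
print(best_path, best_score)  # [1, 2, 0] 3.75
```

Picking the best label at each position independently would give O, I, O here, which uses the penalized O to I transition. Viterbi instead returns B, I, O because it scores the sequence as a whole.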
...@@ -497,6 +519,7 @@ save_dirname = "label_semantic_roles.inference.model" #调用训练好的模型 ...@@ -497,6 +519,7 @@ save_dirname = "label_semantic_roles.inference.model" #调用训练好的模型
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place) exe = fluid.Executor(place)
``` ```
Set up the input. A LoDTensor represents the input word sequence. Each word has base_shape [1] because each word is represented by a single id. If the length-based LoD is [[3, 4, 2]], which is a single-level LoD, the constructed LoDTensor contains 3 sequences of lengths 3, 4, and 2.
Note that the LoD is a list of lists.
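To make the length-based LoD concrete, the sketch below shows how `[[3, 4, 2]]` partitions a flat list of word ids into three sequences. It deliberately does not call PaddlePaddle; the helper names `lengths_to_offsets` and `split_by_lod` and the word ids are hypothetical and only illustrate the bookkeeping.

```python
def lengths_to_offsets(seq_lens):
    """Convert sequence lengths, e.g. [3, 4, 2], into offsets [0, 3, 7, 9]."""
    offsets = [0]
    for n in seq_lens:
        offsets.append(offsets[-1] + n)
    return offsets


def split_by_lod(flat, length_lod):
    """Split a flat list using a single-level, length-based LoD (a list of lists)."""
    offsets = lengths_to_offsets(length_lod[0])
    return [flat[start:end] for start, end in zip(offsets, offsets[1:])]


# Nine made-up word ids, stored flat as they would be in one tensor.
word_ids = [11, 12, 13, 21, 22, 23, 24, 31, 32]
sequences = split_by_lod(word_ids, [[3, 4, 2]])
print(sequences)  # [[11, 12, 13], [21, 22, 23, 24], [31, 32]]
```

The outer list holds one entry per LoD level, which is why even a single-level LoD is written as a list of lists.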
...@@ -580,7 +603,7 @@ print("Inference Shape: ", np_data.shape) ...@@ -580,7 +603,7 @@ print("Inference Shape: ", np_data.shape)
2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013. 2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014. 3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014. 4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014.
5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](http://www.jmlr.org/papers/volume15/doppa14a/source/biblio.bib.old)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289. 5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
6. 李航. 统计学习方法[J]. 清华大学出版社, 北京, 2012. 6. 李航. 统计学习方法[J]. 清华大学出版社, 北京, 2012.
7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330. 7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330.
8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106. 8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106.
...@@ -588,7 +611,7 @@ print("Inference Shape: ", np_data.shape) ...@@ -588,7 +611,7 @@ print("Inference Shape: ", np_data.shape)
10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -40,186 +40,163 @@ ...@@ -40,186 +40,163 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Semantic Role Labeling
The source code of this chapter locates at [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles). # Label Semantic Roles
For instructions on getting started with this book, see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code of this tutorial is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles). If you are new to the Paddle book, please refer to [Book Documentation Instructions](https://github.com/PaddlePaddle/book#running-the-book).
## Background ## Background
Natural language analysis techniques consist of lexical, syntactic, and semantic analysis. **Semantic Role Labeling (SRL)** is an instance of **Shallow Semantic Analysis**. Natural language analysis techniques are roughly divided into three levels: lexical analysis, syntactic analysis, and semantic analysis. Semantic role labeling is one way to implement shallow semantic analysis. In a sentence, the predicate is a statement or explanation of the subject, indicating "what it does", "what it is", or "what it is like", and it represents the core of an event. The nouns that accompany a predicate are called arguments. A semantic role is the role an argument plays in the event denoted by the verb. The main roles include Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal, and Source.
In a sentence, a **predicate** states a property or a characterization of a *subject*, such as what it does and what it is like. The predicate represents the core of an event, whereas the words accompanying the predicate are **arguments**. A **semantic role** refers to the abstract role an argument of a predicate take on in the event, including *agent*, *patient*, *theme*, *experiencer*, *beneficiary*, *instrument*, *location*, *goal*, and *source*. Please look at the following example. "Encounter" is a predicate (Predicate, usually abbreviated as "Pred"), "Xiaoming" is an agent, "Xiaohong" is a patient, "Yesterday" is the time when the event occurred, the "park" is the location where the event occurred.
In the following example of a Chinese sentence, "to encounter" is the predicate (*pred*); "Ming" is the *agent*; "Hong" is the *patient*; "yesterday" and "evening" are the *time*; finally, "the park" is the *location*. $$\mbox{[Xiaoming]}_{\mbox{Agent}}\mbox{[yesterday]}_{\mbox{Time}}\mbox{[evening]}_\mbox{Time}\mbox{in[Park]}_{\mbox{Location}}\mbox{[encounter]}_{\mbox{Predicate}}\mbox{[Xiaohong]}_{\mbox{Patient}}\mbox{. }$$
$$\mbox{[小明 Ming]}_{\mbox{Agent}}\mbox{[昨天 yesterday]}_{\mbox{Time}}\mbox{[晚上 evening]}_\mbox{Time}\mbox{在[公园 a park]}_{\mbox{Location}}\mbox{[遇到 to encounter]}_{\mbox{Predicate}}\mbox{了[小红 Hong]}_{\mbox{Patient}}\mbox{。}$$ Semantic role labeling (SRL) is centered on the predicate of the sentence. It does not analyze the semantic information contained in the sentence. It only analyzes the relationship between the components and the predicate in the sentence, that is, the predicate of the sentence--the Argument structure. And using semantic roles to describe these structural relationships is an important intermediate step in many natural language understanding tasks (such as information extraction, text analysis, deep question and answer, etc.). It is generally assumed in the research that the predicate is given, and all that has to be done is to find the individual arguments of the given predicate and their semantic roles.
Instead of analyzing the semantic information, **Semantic Role Labeling** (**SRL**) identifies the relationship between the predicate and the other constituents surrounding it. The predicate-argument structures are labeled as specific semantic roles. SRL is an important intermediate step in a wide range of natural language understanding tasks, including *information extraction*, *discourse analysis*, and *deepQA*. Research usually assumes a predicate of a sentence to be specified; the only task is to identify its arguments and their semantic roles. Traditional SRL systems are mostly based on syntactic analysis and usually consist of five processes:
Conventional SRL systems mostly build on top of syntactic analysis, usually consisting of five steps:
1. Construct a syntax tree, as shown in Fig. 1
2. Identify the candidate arguments of the given predicate on the tree.
3. Prune the most unlikely candidate arguments.
4. Identify the real arguments, often by a binary classifier.
5. Multi-classify on results from step 4 to label the semantic roles. Steps 2 and 3 usually introduce hand-designed features based on syntactic analysis (step 1).
1. Construct a parse tree. For example, Figure 1 is a syntactic tree for the dependency syntax analysis of the above example.
2. Identify candidate arguments for a given predicate from the syntax tree.
3. Prune the candidate arguments. A sentence may contain many candidate arguments, and pruning removes those that are least likely to be real arguments.
4. Argument recognition: decide which of the candidates remaining after pruning are real arguments, usually treated as a binary classification problem.
5. For the result of step 4, obtain the semantic role label of each argument through multi-class classification. As can be seen, syntactic analysis is the foundation, and the subsequent steps often rely on hand-crafted features that are themselves derived from syntactic analysis.
<div align="center"> <div align="center">
<img src="image/dependency_parsing_en.png" width = "80%" align=center /><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/dependency_parsing.png?raw=true" width = "80%" align=center /><br>
Fig 1. Syntax tree Figure 1. Example of dependency syntax analysis tree
</div> </div>
However, complete syntactic analysis needs to determine all the syntactic information contained in a sentence and the relationships between its components, which is a very difficult task. The accuracy of current syntactic analysis is limited, and even small errors in syntactic analysis cause SRL errors. To reduce the complexity of the problem while still obtaining some syntactic structure information, the idea of "shallow syntactic analysis" came into being. Shallow syntactic analysis is also called partial parsing or chunking. Unlike full syntactic analysis, which produces a complete syntactic tree, shallow syntactic analysis only needs to identify some relatively simple, independent components of the sentence, such as verb phrases; these identified structures are called chunks. To avoid the difficulty of obtaining a highly accurate syntactic tree, some studies \[[1](#References)\] proposed a chunk-based SRL method, which treats SRL as a sequence labeling problem. Sequence labeling tasks generally use the BIO representation to define the label set, so let us introduce this representation first. In the BIO notation, B stands for the beginning of a chunk, I stands for the inside of a chunk, and O stands for outside any chunk. Chunks are assigned different labels through B, I, and O. For example, for a group of chunks that forms role A, the first chunk is assigned the tag B-A, the remaining chunks are assigned the tag I-A, and chunks that do not belong to any argument are assigned the tag O.
However, a complete syntactic analysis requires identifying the relationship among all constituents. Thus, the accuracy of SRL is sensitive to the preciseness of the syntactic analysis, making SRL challenging. To reduce its complexity and obtain some information on the syntactic structures, we often use *shallow syntactic analysis* a.k.a. partial parsing or chunking. Unlike complete syntactic analysis, which requires the construction of the complete parsing tree, *Shallow Syntactic Analysis* only requires identifying some independent constituents with relatively simple structures, such as verb phrases (chunk). To avoid difficulties in constructing a syntax tree with high accuracy, some work\[[1](#reference)\] proposed semantic chunking-based SRL methods, which reduce SRL to a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using **BIO representation**. For syntactic chunks forming role A, its first chunk receives the B-A tag (Begin) and the remaining ones receive the tag I-A (Inside); in the end, the chunks left out will receive the tag O. Let's continue to take the above sentence as an example. Figure 2 shows the BIO representation method.
The BIO representation of the above example is shown in Fig. 2.
<div align="center"> <div align="center">
<img src="image/bio_example_en.png" width = "90%" align=center /><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/bio_example.png?raw=true" width = "90%" align=center /><br>
Fig 2. BIO representation Figure 2. Example of BIO labeling method
</div> </div>
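The BIO scheme described above can be sketched in a few lines of Python. The function name `spans_to_bio`, the English tokenization, and the span boundaries are assumptions made for illustration; they are not taken from the figure or from the CoNLL data.

```python
def spans_to_bio(num_tokens, spans):
    """Convert labeled spans into BIO tags.

    spans: list of (start, end_exclusive, role) over token indices.
    Tokens not covered by any span receive the tag "O".
    """
    tags = ["O"] * num_tokens
    for start, end, role in spans:
        tags[start] = "B-" + role          # first chunk of the argument
        for i in range(start + 1, end):
            tags[i] = "I-" + role          # remaining chunks of the argument
    return tags


tokens = ["Xiaoming", "yesterday", "evening", "in", "park", "encountered", "Xiaohong"]
spans = [(0, 1, "Agent"), (1, 3, "Time"), (4, 5, "Location"),
         (5, 6, "Predicate"), (6, 7, "Patient")]
tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Reading the tags back from left to right recovers both the argument boundaries and their roles, which is why identification and labeling can happen in a single pass.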
This example illustrates the simplicity of sequence tagging, since As can be seen from the above example, deriving the semantic role labels of the arguments directly from the sequence labeling result is a relatively simple process. This simplicity is reflected in: (1) relying only on shallow syntactic analysis, which lowers the requirements and difficulty of syntactic analysis; (2) no pruning of candidate arguments is needed in this step; (3) the identification and labeling of arguments are performed at the same time. This integrated approach to argument identification and labeling simplifies the process, reduces the risk of error accumulation, and often achieves better results.
1. It only relies on shallow syntactic analysis, reduces the precision requirement of syntactic analysis;
2. Pruning the candidate arguments is no longer necessary;
3. Arguments are identified and tagged at the same time. Simplifying the workflow reduces the risk of accumulating errors; oftentimes, methods that unify multiple steps boost performance.
In this tutorial, our SRL system is built as an end-to-end system via a neural network. The system takes only text sequences as input, without using any syntactic parsing results or complex hand-designed features. The public dataset [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) is used for the following task: given a sentence with predicates marked, identify the corresponding arguments and their semantic roles through sequence tagging. Similar to the chunk-based SRL method, in this tutorial we also treat SRL as a sequence labeling problem. The difference is that we rely only on input text sequences, without any additional syntactic analysis results or complex hand-crafted features, and build an end-to-end SRL system with deep neural networks. We take the public dataset of the SRL task in the [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) as an example for the following task: given a sentence and a predicate in it, find the arguments corresponding to the predicate through sequence labeling and mark their semantic roles.
## Model ## Model Overview
**Recurrent Neural Networks** (*RNN*) are important tools for sequence modeling and have been successfully used in some natural language processing tasks. Unlike feed-forward neural networks, RNNs can model the dependencies between elements of sequences. As a variant of RNNs, LSTMs aim to model long-term dependencies in long sequences. We have introduced this in [understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/05.understand_sentiment). In this chapter, we continue to use LSTMs to solve SRL problems. The Recurrent Neural Network is an important model for modeling sequences and is widely used in natural language processing tasks. Unlike the feed-forward neural network, the RNN is able to handle the contextual correlation between inputs. LSTM is an important variant of RNN that is commonly used to learn the long-range dependencies contained in long sequences. We have already introduced LSTM in [Sentiment Analysis](https://github.com/PaddlePaddle/book/tree/develop/06.understand_sentiment); in this article we still use LSTM to solve the SRL problem.
### Stacked Recurrent Neural Network ### Stacked Recurrent Neural Network
*Deep Neural Networks* can extract hierarchical representations. The higher layers can form relatively abstract/complex representations, based on primitive features discovered through the lower layers. Unfolding LSTMs through time results in a deep feed-forward neural network. This is because any computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. On the other hand, due to parameter sharing over time, LSTMs are also *shallow*; that is, the computation carried out at each time-step is just a linear transformation. Deep LSTM networks are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical neural networks can be efficient at representing some functions and modeling varying-length dependencies\[[2](#reference)\]. The deep network helps to form hierarchical features: the upper layers of the network form more complex, higher-level features based on the primitive features learned in the lower layers. Although an LSTM unrolled along the time axis is equivalent to a very "deep" feedforward network, the LSTM parameters are shared across time steps, so the mapping from the state at time $t-1$ to the state at time $t$ always passes through only one nonlinear mapping. This means that a single-layer LSTM models state transitions in a “shallow” way. Stacking multiple LSTM units, so that the output of the previous LSTM at time $t$ becomes the input of the next LSTM unit at time $t$, helps us build a deep network. We call this the first version of the stacked recurrent neural network. Deep networks improve the ability of models to fit complex patterns and better model patterns across different time steps\[[2](#References)\].
However, in a deep LSTM network, any gradient propagated back in depth needs to traverse a large number of nonlinear steps. As a result, while LSTMs of 4 layers can be trained properly, those with 4-8 layers have much worse performance. Conventional LSTMs prevent back-propagated errors from vanishing or exploding by introducing shortcut connections to skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well.
A single LSTM cell has three operations: However, training a deep LSTM network is not easy. Stacking multiple LSTM units vertically may cause problems with gradient propagation along the depth direction. Generally, a stack of 4 LSTM layers can be trained normally, but performance degrades when the number of layers reaches 4~8. At this point, new structures must be considered to ensure that gradients propagate smoothly in the vertical direction; this is a problem that must be solved when training deep LSTM networks. We can borrow one of the techniques LSTM uses to address vanishing and exploding gradients: there is no nonlinear mapping on the information propagation path of the memory cell, so the gradient neither decays nor explodes as it propagates back. Therefore, the deep LSTM model can also add a path that ensures smooth gradient propagation in the vertical direction.
1. input-to-hidden: map input $x$ to the input of the forget gates, input gates, memory cells and output gates by linear transformation (i.e., matrix mapping); The operation performed by an LSTM unit can be divided into three parts: (1) Input-to-hidden: at each time step, the input $x$ first passes through a matrix mapping and then serves as the input to the forget gate, input gate, memory cell, and output gate. Note that this mapping does not introduce a nonlinear activation; (2) Hidden-to-hidden: this step is the main body of the LSTM computation, including the forget gate, input gate, memory cell update, and output gate; (3) Hidden-to-output: usually a simple activation of the hidden layer vector. On the basis of the first version of the stacked network, we add a new path: in addition to the output of the previous LSTM layer, the input-to-hidden mapping of the previous LSTM layer is used as a new input, and a linear mapping is added to learn a new transform.
2. hidden-to-hidden: calculate forget gates, input gates, output gates and update memory cell, this is the main part of LSTMs;
3. hidden-to-output: this part typically involves an activation operation on hidden states.
Based on the stacked LSTMs, we add shortcut connections: take the input-to-hidden from the previous layer as a new input and learn another linear transformation. Figure 3 is a schematic structural diagram of a finally obtained stack recurrent neural network.
Fig.3 illustrates the final stacked recurrent neural networks.
<p align="center"> <p align="center">
<img src="./image/stacked_lstm_en.png" width = "40%" align=center><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/stacked_lstm.png?raw=true" width = "40%" align=center><br>
Fig 3. Stacked Recurrent Neural Networks Figure 3. Schematic diagram of stack-based recurrent neural network based on LSTM
</p> </p>
### Bidirectional Recurrent Neural Network ### Bidirectional Recurrent Neural Network
While LSTMs can summarize the history, they can not see the future. Because most NLP (natural language processing) tasks provide the entirety of sentences, sequential learning can benefit from having the future encoded as well as the history. In LSTM, the hidden layer vector at the time of $t$ encodes all input information until the time of $t$. The LSTM at $t$ can see the history, but cannot see the future. In most natural language processing tasks, we almost always get the whole sentence. In this case, if you can get future information like the historical information, it will be of great help to the sequence learning task.
To address this, we can design a bidirectional recurrent neural network by making a minor modification. A higher LSTM layer can process the sequence in reversed direction with regards to its immediate lower LSTM layer, i.e., deep LSTM layers take turns to train on input sequences from left-to-right and right-to-left. Therefore, LSTM layers at time-step $t$ can see both histories and the future, starting from the second layer. Fig. 4 illustrates the bidirectional recurrent neural networks.
In order to overcome this shortcoming, we can design a bidirectional recurrent network unit, which is simple and straightforward: make a small modification to the stack recurrent neural network of the previous section, stack multiple LSTM units, and let each layer of LSTM units learn the output sequence of the previous layer in the order of forward, reverse, forward …… So, starting from layer 2, our LSTM unit will always see historical and future information at $t$. Figure 4 is a schematic diagram showing the structure of a bidirectional recurrent neural network based on LSTM.
<p align="center"> <p align="center">
<img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/bidirectional_stacked_lstm.png?raw=true" width = "60%" align=center><br>
Fig 4. Bidirectional LSTMs Figure 4. Schematic diagram of a bidirectional recurrent neural network based on LSTM
</p> </p>
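The claim that every layer from the second one onward sees both history and future can be checked with a small bookkeeping sketch. It does not implement an LSTM; it only tracks which input positions can influence each time step when layers alternate direction. The function name `receptive_positions` and the choice that the first layer runs left-to-right are assumptions made for illustration.

```python
def receptive_positions(seq_len, depth):
    """For each layer, return the set of input positions visible at each time step.

    Assumption: layer 0 reads left-to-right, layer 1 right-to-left, and so on.
    """
    # Before any recurrent layer, position t only knows about itself.
    visible = [{t} for t in range(seq_len)]
    per_layer = []
    for layer in range(depth):
        reverse = (layer % 2 == 1)
        order = range(seq_len - 1, -1, -1) if reverse else range(seq_len)
        new_visible = [set() for _ in range(seq_len)]
        carried = set()  # what the recurrent state has accumulated so far
        for t in order:
            carried = carried | visible[t]
            new_visible[t] = set(carried)
        visible = new_visible
        per_layer.append(visible)
    return per_layer


layers = receptive_positions(seq_len=5, depth=2)
print(sorted(layers[0][2]))  # first layer at t=2: history only
print(sorted(layers[1][2]))  # second layer at t=2: history and future
```

With a single forward layer, position 2 depends on positions 0 to 2 only; after the reversed second layer, every position depends on the whole sequence.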
Note that this bidirectional RNN is different from the one proposed by Bengio et al. in machine translation tasks \[[3](#reference), [4](#reference)\]. We will introduce another bidirectional RNN in the following chapter [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md). It should be noted that this bidirectional RNN structure is not the same as the bidirectional RNN structure used by Bengio et al. in machine translation tasks\[[3](#References), [4](#References)\]. Another bidirectional recurrent neural network will be introduced in the following [Machine Translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation) task.
### Conditional Random Field (CRF)
Typically, a neural network's lower layers learn representations while its very top layer accomplishes the final task. These principles can guide our problem-solving approaches. In SRL tasks, a **Conditional Random Field** (*CRF*) is built on top of the network in order to perform the final prediction to tag sequences. It takes representations provided by the last LSTM layer as input. ### Conditional Random Field
The idea of using a neural network model to solve a problem is usually this: the lower layers of the network learn a feature representation of the input, and the last layer of the network completes the final task based on those features. In the SRL task, the deep LSTM network learns the feature representation of the input, and a Conditional Random Field (CRF) at the end of the entire network completes the sequence labeling on the basis of those features.
The CRF is an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. In essence, CRFs learn the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are sequences of input and $Y = (y_1, y_2, ... , y_n)$ are label sequences; to decode, simply search through $Y$ for a sequence that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$. CRF is a probabilistic structural model, which can be regarded as a probabilistic undirected graph model. Nodes represent random variables and edges represent probability dependencies between random variables. In simple terms, CRF learns the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ is the input sequence and $Y = (y_1, y_2, ..., y_n)$ is the tag sequence; the decoding process is, given the sequence $X$, to find the sequence $Y$ with the largest $P(Y|X)$, that is $Y^* = \mbox{arg max}_{Y} P(Y | X)$.
Sequence tagging tasks do not assume a lot of conditional independence, because they only concern about the input and the output being linear sequences. Thus, the graph model of sequence tagging tasks is usually a simple chain or line, which results in a **Linear-Chain Conditional Random Field**, shown in Fig.5. The sequence labeling task only needs to consider that both the input and the output are a linear sequence. And since we only use the input sequence as a condition and do not make any conditional independent assumptions, there is no graph structure between the elements of the input sequence. In summary, the CRF defined on the chain diagram shown in Figure 5 is used in the sequence labeling task, which is called Linear Chain Conditional Random Field.
<p align="center"> <p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/linear_chain_crf.png?raw=true" width = "35%" align=center><br>
Fig 5. Linear Chain Conditional Random Field used in SRL tasks Figure 5. Linear chain conditional random field used in sequence labeling tasks
</p> </p>
By the fundamental theorem of random fields \[[5](#reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form: According to the factorization theorem for linear-chain conditional random fields \[[5](#References)\], the probability of a particular tag sequence $Y$ given the observation sequence $X$ can be defined as:
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$ $$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{ j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
Where $Z(X)$ is the normalization factor, and $t_j$ is the feature function defined on the edge, which depends on the current and previous positions and is called the transition feature. It represents the transition probability between the labels at positions $i - 1$ and $i$ for the input sequence $X$. $s_k$ is a feature function defined on the node, called a state feature, which depends only on the current position. It represents the probability of the label at position $i$ for the observation sequence $X$. $\lambda_j$ and $\mu_k$ are the weights corresponding to the transition feature function and the state feature function respectively. In fact, $t$ and $s$ can be represented in the same mathematical form, and the transition and state features are summed over all positions $i$: $f_{k}(Y, X) = \sum_{i =1}^{n}f_k({y_{i - 1}, y_i, X, i})$. Calling $f$ collectively the feature function, $P(Y|X)$ can be expressed as:
where, $Z(X)$ is normalization constant, ${t_j}$ represents the feature functions defined on edges called the *transition feature*, which denotes the transition probabilities from $y_{i-1}$ to $y_i$ given input sequence $X$. ${s_k}$ represents the feature function defined on nodes, called the state feature, denoting the probability of $y_i$ given input sequence $X$. In addition, $\lambda_j$ and $\mu_k$ are weights corresponding to $t_j$ and $s_k$. Alternatively, $t$ and $s$ can be written in the same form that depends on $y_{i - 1}$, $y_i$, $X$, and $i$. Taking its summation over all nodes $i$, we have: $f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$, which defines the *feature function* $f$. Thus, $P(Y|X)$ can be written as:
$$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$ $$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$
where $\omega$ are the weights to the feature function that the CRF learns. While training, given input sequences and label sequences $D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$, by maximum likelihood estimation (**MLE**), we construct the following objective function: $\omega$ is the weight corresponding to the feature function and is the parameter to be learned by the CRF model. During training, for a given set of input sequences and corresponding label sequences $D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$, the following optimization objective is solved by regularized maximum likelihood estimation:
$$\DeclareMathOperator*{\argmax}{arg\,max} L(W, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
$$\DeclareMathOperator*{\argmax}{arg\,max} L(W, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W )\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
This objective function can be solved via back-propagation in an end-to-end manner. While decoding, given input sequences $X$, search for sequence $\bar{Y}$ to maximize the conditional probability $\bar{P}(Y|X)$ via decoding methods (such as *Viterbi*, or [Beam Search Algorithm](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md#beam-search-algorithm)). This optimization objective can be solved by the back-propagation algorithm together with the entire neural network. When decoding, for a given input sequence $X$, a decoding algorithm (such as the Viterbi algorithm or Beam Search) finds the output sequence $\bar{Y}$ that maximizes the conditional probability $\bar{P}(Y|X)$.
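Because $Z(X)$ does not depend on $Y$, maximizing $P(Y|X)$ is the same as maximizing the unnormalized score, which is a sum of state and transition terms along the chain. The following is a minimal NumPy sketch of Viterbi decoding under that view. It is an illustration, not PaddlePaddle's implementation: the names `emission` and `transition` are assumptions, the scores are random, and any start or end scores are ignored. A brute-force search over all label sequences is included only to cross-check the result on a tiny example.

```python
import itertools

import numpy as np


def viterbi_decode(emission, transition):
    """emission: (n, K) state scores; transition[a, b]: score of moving from tag a to tag b."""
    n, num_tags = emission.shape
    score = emission[0].copy()  # best score of any path ending in each tag at position 0
    backptr = np.zeros((n, num_tags), dtype=np.int64)
    for i in range(1, n):
        # candidate[a, b]: best path ending in tag a at i-1, extended with tag b at i
        candidate = score[:, None] + transition + emission[i][None, :]
        backptr[i] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(backptr[i, path[-1]]))
    path.reverse()
    return path, float(score.max())


def sequence_score(path, emission, transition):
    """Unnormalized score of one complete label sequence."""
    total = emission[0, path[0]]
    for i in range(1, len(path)):
        total += transition[path[i - 1], path[i]] + emission[i, path[i]]
    return float(total)


rng = np.random.RandomState(0)
emission = rng.randn(5, 3)    # 5 positions, 3 tags (toy sizes)
transition = rng.randn(3, 3)

best_path, best_score = viterbi_decode(emission, transition)
brute_force_best = max(
    sequence_score(list(p), emission, transition)
    for p in itertools.product(range(3), repeat=5))
print(best_path, np.isclose(best_score, brute_force_best))
```

Viterbi needs on the order of $nK^2$ operations, whereas the brute-force check enumerates all $K^n$ sequences, which is only feasible for toy sizes.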
### Deep Bidirectional LSTM (DB-LSTM) SRL model ### Deep bidirectional LSTM (DB-LSTM) SRL model
Given predicates and a sentence, SRL tasks aim to identify arguments of the given predicate and their semantic roles. If a sequence has $n$ predicates, we will process this sequence $n$ times. Here is the breakdown of a straight-forward model: In the SRL task, the input is “predicate” and “a sentence”. The goal is to find the argument of the predicate from this sentence and mark the semantic role of the argument. If a sentence contains $n$ predicates, the sentence will be processed for $n$ times. One of the most straightforward models is the following:
1. Construct inputs; 1. Construct the input;
- input 1: predicate, input 2: sentence - Input 1 is the predicate and 2 is the sentence
- expand input 1 into a sequence of the same length with input 2's sentence, using one-hot representation; - Extend input 1 to a sequence as long as input 2, expressed by one-hot mode;
2. Convert the one-hot sequences from step 1 to vector sequences via a word embedding's lookup table; 2. The predicate sequence and sentence sequence of the one-hot format are converted into a sequence of word vectors represented by real vectors through a vocabulary;
3. Learn the representation of input sequences by taking vector sequences from step 2 as inputs; 3. The two word vector sequences in step 2 are used as input of the bidirectional LSTM to learn the feature representation of the input sequence;
4. Take the representation from step 3 as input, label sequence as a supervisory signal, and realize sequence tagging tasks. 4. The CRF takes the features learned in the model in step 3 as input, and uses the tag sequence as the supervised signal to implement sequence labeling;
Here, we propose some improvements by introducing two simple but effective features: You can try this method. Here, we propose some improvements that introduce two simple features that are very effective in improving system performance:
- predicate context (**ctx-p**): A single predicate word may not describe all the predicate information, especially when the same words appear multiple times in a sentence. With the expanded context, the ambiguity can be largely eliminated. Thus, we extract $n$ words before and after predicate to construct a window chunk. - Predicate's context: In the above method, only the word vector of the predicate is used to express all the information related to the predicate. This method is always very weak, especially if the predicate appears multiple times in the sentence, it may cause certain ambiguity. From experience, a small segment of several words before and after the predicate can provide more information to help resolve ambiguity. So, we add this kind of experience to the model, and extract a "predicate context" fragment for each predicate, that is, a window fragment composed of $n$ words before and after the predicate;
- Predicate context area's tag: Introduces a 0-1 binary variable for each word in the sentence, which indicates whether the word is in the "predicate context" fragment;
- region mark ($m_r$): The binary marker on a word, $m_r$, takes the value of $1$ when the word is in the predicate context region, and $0$ if not. The modified model is as follows (Figure 6 is a schematic diagram of the model structure with a depth of 4):
After these modifications, the model is as follows, as illustrated in Figure 6:
1. Construct inputs
- Input 1: word sequence. Input 2: predicate. Input 3: predicate context, extract $n$ words before and after predicate. Input 4: region mark sequence, where an entry is 1 if the word is located in the predicate context region, 0 otherwise.
- expand input 2~3 into sequences with the same length with input 1
2. Convert input 1~4 to vector sequences via word embedding lookup tables; While input 1 and 3 shares the same lookup table, input 2 and 4 have separate lookup tables.
3. Take the four vector sequences from step 2 as inputs to bidirectional LSTMs; Train the LSTMs to update representations.
4. Take the representation from step 3 as input to CRF, label sequence as a supervisory signal, and complete sequence tagging tasks.
1. Construct input
- Input 1 is a sentence sequence, input 2 is a predicate sequence, input 3 is a predicate context, and $n$ words before and after the predicate are extracted from the sentence to form a predicate context, which represented by one-hot. Input 4 is a predicate context area which marks whether each word in the sentence is in the context of the predicate;
- Extend the input 2~3 to a sequence as long as the input 1;
2. Input 1~4 are converted into a sequence of word vectors represented by real vectors in vocabulary; where inputs 1 and 3 share the same vocabulary, and inputs 2 and 4 each have their own vocabulary;
3. The four word vector sequences in step 2 are used as input to the bidirectional LSTM model; the LSTM model learns the feature representation of the input sequence to obtain a new feature representation sequence;
4. The CRF takes the features learned in step 3 of the LSTM as input, and uses the marked sequence as the supervised signal to complete the sequence labeling;
<div align="center"> <div align="center">
<img src="image/db_lstm_network_en.png" width = "60%" align=center /><br> <img src="https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/image/db_lstm_network.png?raw=true" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks Figure 6. Deep bidirectional LSTM model on the SRL task
</div> </div>
## Data Preparation
In the tutorial, we use [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task open dataset as an example. Note that the training set and development set of the CoNLL 2005 SRL task are not free to download after the competition. Currently, only the test set can be obtained, including 23 sections of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ corpus as the training dataset to explain the model. However, since the training set is small, for a usable neural network SRL system, please consider paying for the full corpus. ## Data Introduction
In this tutorial, we use the dataset released for the [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task as an example. It is important to note that the training set and development set of the CoNLL 2005 SRL task are not freely available to the public after the competition. Currently, only the test set is available, including 23 sections of the Wall Street Journal and 3 sections of the Brown corpus. In this tutorial, we use the WSJ data in the test set as the training set to explain the model. However, since the number of samples in the test set is far from enough, if you want to train a usable neural network SRL system, consider paying for the full dataset.
The original data includes a variety of information such as POS tagging, naming entity recognition, syntax tree, etc. In this tutorial, we only use the data under `test.wsj/words/` (text sequence) and `test.wsj/props/` (label results). The data directory used in this tutorial is as follows: The original data also includes a variety of information such as part-of-speech tagging, named entity recognition, and syntax parse tree. In this tutorial, we use the data in the test.wsj folder for training and testing, and only use the data under the words folder (text sequence) and the props folder (labeled results). The data directories used in this tutorial are as follows:
```text ```text
conll05st-release/ conll05st-release/
└── test.wsj └── test.wsj
├── props # label results     ├── props # Label result
└── words # text sequence     └── words # Input text sequence
``` ```
The annotation information is derived from the results of Penn TreeBank\[[7](#references)\] and PropBank \[[8](#references)\]. The labeling of the PropBank is different from the labeling methods mentioned before, but shares with it the same underlying principle. For descriptions of the labeling, please refer to the paper \[[9](#references)\]. The labeling information is derived from the labeling results of Penn TreeBank\[[7](#References)\] and PropBank\[[8](#References)\]. The label of the PropBank labeling result is different from the labeling result label we used in the first example of the article, but the principle is the same. For the description of the meaning of the labeling result label, please refer to the paper \[[9](#References)\].
The raw data needs to be preprocessed into formats that PaddlePaddle can handle. The preprocessing consists of the following steps: The raw data needs to be preprocessed in order to be processed by PaddlePaddle. The preprocessing includes the following steps:
1. Merge the text sequence and the tag sequence into the same record; 1. Combine text sequences and tag sequences into one record;
2. If a sentence contains $n$ predicates, the sentence will be processed $n$ times into $n$ separate training samples, each sample with a different predicate; 2. If a sentence contains $n$ predicates, the sentence will be processed $n$ times, becoming $n$ independent training samples, each with a different predicate;
3. Extract the predicate context and construct the predicate context region marker; 3. Extract the predicate context and construct the predicate context area tag;
4. Construct the markings in BIO format; 4. Construct a tag represented by the BIO method;
5. Obtain the integer index corresponding to the word according to the dictionary. 5. Get the integer index corresponding to the word according to the dictionary.
After preprocessing, a training sample contains nine features, namely: word sequence, predicate, predicate context (5 columns), region mark sequence, label sequence. The following table is an example of a training sample. After the pre-processing is completed, a training sample data contains 9 fields, namely: sentence sequence, predicate, predicate context (accounting for 5 columns), predicate context area tag, and labeling sequence. The following table is an example of a training sample.
| word sequence | predicate | predicate context(5 columns) | region mark sequence | label sequence| | Sentence Sequence | Predicate | Predicate Context (Window = 5) | Predicate Context Area Tag | Label Sequence |
|---|---|---|---|---| |---|---|---|---|---|
| A | set | n't been set . × | 0 | B-A1 | | A | set | n't been set . × | 0 | B-A1 |
| record | set | n't been set . × | 0 | I-A1 | | record | set | n't been set . × | 0 | I-A1 |
...@@ -230,18 +207,19 @@ After preprocessing, a training sample contains nine features, namely: word sequ ...@@ -230,18 +207,19 @@ After preprocessing, a training sample contains nine features, namely: word sequ
| set | set | n't been set . × | 1 | B-V | | set | set | n't been set . × | 1 | B-V |
| . | set | n't been set . × | 1 | O | | . | set | n't been set . × | 1 | O |
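Preprocessing steps 3 and 4 can be sketched in plain Python. This is an illustrative sketch, not the tutorial's actual preprocessing code: the helper names, the padding tokens `bos`/`eos`, the window of two words on each side, and the toy sentence with its argument spans are all assumptions.

```python
def predicate_context(words, verb_index, pad_bos="bos", pad_eos="eos"):
    """Return the 5-word predicate context and the 0-1 region mark sequence."""
    n = len(words)

    def word_at(i):
        # Positions outside the sentence are filled with padding tokens (assumed names).
        if i < 0:
            return pad_bos
        if i >= n:
            return pad_eos
        return words[i]

    context = [word_at(verb_index + offset) for offset in (-2, -1, 0, 1, 2)]
    mark = [1 if abs(i - verb_index) <= 2 else 0 for i in range(n)]
    return context, mark


def spans_to_bio(length, spans):
    """Convert (start, end_inclusive, label) argument spans to BIO tags."""
    tags = ["O"] * length
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label
    return tags


# Hypothetical sentence and annotation, for illustration only.
words = ["The", "cat", "sat", "on", "the", "mat", "."]
context, mark = predicate_context(words, verb_index=2)
tags = spans_to_bio(len(words), [(0, 1, "A0"), (2, 2, "V"), (3, 5, "AM-LOC")])

# One row per word: the predicate and its context are repeated on every row.
for word, m, tag in zip(words, mark, tags):
    print(word, "sat", " ".join(context), m, tag)
```

Each printed row has the same layout as the sample table: the word, the predicate, the five context words, the region mark and the label.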
In addition to the data, we provide following resources:
| filename | explanation | In addition to the data, we also provide the following resources:
| File Name | Description |
|---|---| |---|---|
| word_dict | dictionary of input sentences, total 44068 words | | word_dict | Dictionary of input sentences, 44068 words in total |
| label_dict | dictionary of labels, total 106 labels | | label_dict | Tag dictionary, total 106 tags |
| predicate_dict | predicate dictionary, total 3162 predicates | | predicate_dict | Dictionary of predicates, totaling 3162 words |
| emb | a pre-trained word vector lookup table, 32-dimensional | | emb | A pre-trained word vector lookup table, 32-dimensional |
We trained a language model on the English Wikipedia to get a word vector lookup table used to initialize the SRL model. While training the SRL model, the word vector lookup table is no longer updated. To learn more about the language model and the word vector lookup table, please refer to the tutorial [word vector](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/README.md). There are 995,000,000 tokens in the training corpus, and the dictionary size is 4,900,000 words. In the CoNLL 2005 training corpus, 5% of the words are not in the 4,900,000 words, and we see them all as unknown words, represented by `<unk>`. We trained a language model on English Wikipedia to get word vectors for initializing the SRL model. During the training of the SRL model, the word vectors are no longer updated. For the language model and word vectors, refer to the [Word Vector](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec) tutorial. The corpus used to train the language model has a total of 995,000,000 tokens, and the dictionary size is limited to 4,900,000 words. In the CoNLL 2005 training corpus, 5% of the words are not among these 4,900,000 words; we treat them all as unknown words, represented by `<unk>`.
Here we fetch the dictionary, and print its size: Get the dictionary and print the dictionary size:
```python ```python
from __future__ import print_function from __future__ import print_function
...@@ -271,95 +249,132 @@ print('pred_dict_len: ', pred_dict_len) ...@@ -271,95 +249,132 @@ print('pred_dict_len: ', pred_dict_len)
- Define input data dimensions and model hyperparameters. - Define input data dimensions and model hyperparameters.
```python ```python
mark_dict_len = 2 mark_dict_len = 2 # Dimension of the context area flag; it is a 0-1 binary feature, so the dimension is 2
word_dim = 32 word_dim = 32 # Word vector dimension
mark_dim = 5 mark_dim = 5 # The predicate context area flag is mapped to a real vector through a lookup table; this is the dimension of that vector
hidden_dim = 512 hidden_dim = 512 # LSTM hidden layer vector dimension: 512 / 4
depth = 8 depth = 8 # Depth of the stacked LSTM
mix_hidden_lr = 1e-3 mix_hidden_lr = 1e-3 # Base learning rate of the linear_chain_crf layer
IS_SPARSE = True IS_SPARSE = True # Whether to update the embedding in a sparse way
PASS_NUM = 10 PASS_NUM = 10 # Number of training epochs
BATCH_SIZE = 10 BATCH_SIZE = 10 # Batch size
embedding_name = 'emb' embedding_name = 'emb'
``` ```
Note that `hidden_dim = 512` means a LSTM hidden vector of 128 dimension (512/4). Please refer to PaddlePaddle's official documentation for detail: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory). It should be specially noted that the parameter `hidden_dim = 512` actually specifies that the dimension of the LSTM hidden layer vector is 128. For this, please refer to the description of `dynamic_lstm` in the official PaddlePaddle API documentation.
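The factor of 4 comes from the input-to-hidden projection, which produces the pre-activations of four components at once, so the projected input is four times as wide as the hidden state. The sketch below only illustrates this arithmetic with NumPy; the variable names are assumptions, and the order in which the four blocks are laid out inside the projection is deliberately left unspecified.

```python
import numpy as np

hidden_dim = 512                    # value passed as `size` in this tutorial
lstm_hidden_size = hidden_dim // 4  # actual width of the LSTM hidden state

seq_len = 7                         # toy sequence length
projected = np.zeros((seq_len, hidden_dim), dtype=np.float32)

# Split the projection into four equally sized blocks, one per gate/cell input.
gate_blocks = np.split(projected, 4, axis=1)
print(lstm_hidden_size, [block.shape for block in gate_blocks])
```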
- Define a parameter loader method to load the pre-trained word lookup tables from word embeddings trained on the English language Wikipedia. - As is mentioned above, we use the trained word vector based on English Wikipedia to initialize the embedding layer parameters of the total six features of the sequence input and predicate context, which are not updated during training.
```python ```python
#Here load the binary parameters saved by PaddlePaddle
def load_parameter(file_name, h, w): def load_parameter(file_name, h, w):
with open(file_name, 'rb') as f: with open(file_name, 'rb') as f:
f.read(16) # skip header. f.read(16) # skip header.
return np.fromfile(f, dtype=np.float32).reshape(h, w) return np.fromfile(f, dtype=np.float32).reshape(h, w)
``` ```
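To see what `load_parameter` expects, the following sketch writes a toy file with a 16-byte header followed by raw `float32` values and reads it back. The header content (all zero bytes), the file name `toy_emb` and the table size are placeholders chosen for illustration; the real `emb` file has its own header, which the function simply skips.

```python
import os
import tempfile

import numpy as np


def load_parameter(file_name, h, w):
    with open(file_name, 'rb') as f:
        f.read(16)  # skip header.
        return np.fromfile(f, dtype=np.float32).reshape(h, w)


h, w = 4, 32  # toy table: 4 words, 32-dimensional vectors
table = np.arange(h * w, dtype=np.float32).reshape(h, w)

path = os.path.join(tempfile.mkdtemp(), "toy_emb")
with open(path, "wb") as f:
    f.write(b"\x00" * 16)    # placeholder header, 16 bytes
    f.write(table.tobytes())  # row-major float32 payload

loaded = load_parameter(path, h, w)
print(loaded.shape, np.array_equal(loaded, table))
```

The `reshape(h, w)` call fails if the payload does not contain exactly `h * w` values, which is a useful sanity check that the dictionary size and word vector dimension match the file.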
- Transform the word sequence itself, the predicate, the predicate context, and the region mark sequence into embedded vector sequences.
- 8 LSTM units are trained through alternating left-to-right / right-to-left order denoted by the variable `reverse`.
## Training Model
- We train according to the network topology and model parameters. We also need to specify the optimization method when constructing. Here we use the most basic SGD method (momentum is set to 0), and set the learning rate, regularization, and so on.
Define hyperparameters for the training process
```python
use_cuda = False #Execute training on cpu
save_dirname = "label_semantic_roles.inference.model" #The model parameters obtained by training are saved in the file.
is_local = True
```
### Data input layer definition
Defines the format of the model input features, including the sentence sequence, the predicate, the five features of the predicate context, and the predicate context area flags.
```python
# Sentence sequences
word = fluid.layers.data(
name='word_data', shape=[1], dtype='int64', lod_level=1)
# predicate
predicate = fluid.layers.data(
name='verb_data', shape=[1], dtype='int64', lod_level=1)
# predicate context's 5 features
ctx_n2 = fluid.layers.data(
name='ctx_n2_data', shape=[1], dtype='int64', lod_level=1)
ctx_n1 = fluid.layers.data(
name='ctx_n1_data', shape=[1], dtype='int64', lod_level=1)
ctx_0 = fluid.layers.data(
name='ctx_0_data', shape=[1], dtype='int64', lod_level=1)
ctx_p1 = fluid.layers.data(
name='ctx_p1_data', shape=[1], dtype='int64', lod_level=1)
ctx_p2 = fluid.layers.data(
name='ctx_p2_data', shape=[1], dtype='int64', lod_level=1)
# Predicate context area flag
mark = fluid.layers.data(
name='mark_data', shape=[1], dtype='int64', lod_level=1)
```
### Defining the network structure
First pre-train and define the model input layer
```python ```python
def db_lstm(word, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark, #pre-training predicate and predicate context area flags
**ignored): predicate_embedding = fluid.layers.embedding(
# 8 features
predicate_embedding = fluid.layers.embedding(
input=predicate, input=predicate,
size=[pred_dict_len, word_dim], size=[pred_dict_len, word_dim],
dtype='float32', dtype='float32',
is_sparse=IS_SPARSE, is_sparse=IS_SPARSE,
param_attr='vemb') param_attr='vemb')
mark_embedding = fluid.layers.embedding( mark_embedding = fluid.layers.embedding(
input=mark, input=mark,
size=[mark_dict_len, mark_dim], size=[mark_dict_len, mark_dim],
dtype='float32', dtype='float32',
is_sparse=IS_SPARSE) is_sparse=IS_SPARSE)
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2] #Sentence sequences and predicate context 5 features then pre-trained
# Since word vector lookup table is pre-trained, we won't update it this time. word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
# trainable being False prevents updating the lookup table during training. #Because word vector is pre-trained, no longer training embedding table,
emb_layers = [ # The trainable's parameter attribute set to False prevents the embedding table from being updated during training
emb_layers = [
fluid.layers.embedding( fluid.layers.embedding(
size=[word_dict_len, word_dim], size=[word_dict_len, word_dim],
input=x, input=x,
param_attr=fluid.ParamAttr( param_attr=fluid.ParamAttr(
name=embedding_name, trainable=False)) for x in word_input name=embedding_name, trainable=False)) for x in word_input
] ]
emb_layers.append(predicate_embedding) # Pre-training results for adding predicate and predicate context area tags
emb_layers.append(mark_embedding) emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding)
```
Define eight LSTM units to learn all input sequences in "forward/reverse" order.
# 8 LSTM units are trained through alternating left-to-right / right-to-left order ```python
# denoted by the variable `reverse`. # A total of 8 LSTM units are trained, each unit is oriented from left to right or right to left.
hidden_0_layers = [ # Determined by the parameter `is_reverse`
# First stack structure
hidden_0_layers = [
fluid.layers.fc(input=emb, size=hidden_dim, act='tanh') fluid.layers.fc(input=emb, size=hidden_dim, act='tanh')
for emb in emb_layers for emb in emb_layers
] ]
hidden_0 = fluid.layers.sums(input=hidden_0_layers) hidden_0 = fluid.layers.sums(input=hidden_0_layers)
lstm_0 = fluid.layers.dynamic_lstm( lstm_0 = fluid.layers.dynamic_lstm(
input=hidden_0, input=hidden_0,
size=hidden_dim, size=hidden_dim,
candidate_activation='relu', candidate_activation='relu',
gate_activation='sigmoid', gate_activation='sigmoid',
cell_activation='sigmoid') cell_activation='sigmoid')
# stack L-LSTM and R-LSTM with direct edges # Stack L-LSTM and R-LSTM with directly connected sides
input_tmp = [hidden_0, lstm_0] input_tmp = [hidden_0, lstm_0]
# In PaddlePaddle, state features and transition features of a CRF are implemented
# by a fully connected layer and a CRF layer separately. The fully connected layer
# with linear activation learns the state features, here we use fluid.layers.sums
# (fluid.layers.fc can be used as well), and the CRF layer in PaddlePaddle:
# fluid.layers.linear_chain_crf only
# learns the transition features, which is a cost layer and is the last layer of the network.
# fluid.layers.linear_chain_crf outputs the log probability of true tag sequence
# as the cost by given the input sequence and it requires the true tag sequence
# as target in the learning process.
for i in range(1, depth): # remaining stack structure
for i in range(1, depth):
mix_hidden = fluid.layers.sums(input=[ mix_hidden = fluid.layers.sums(input=[
fluid.layers.fc(input=input_tmp[0], size=hidden_dim, act='tanh'), fluid.layers.fc(input=input_tmp[0], size=hidden_dim, act='tanh'),
fluid.layers.fc(input=input_tmp[1], size=hidden_dim, act='tanh') fluid.layers.fc(input=input_tmp[1], size=hidden_dim, act='tanh')
...@@ -375,102 +390,82 @@ def db_lstm(word, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark, ...@@ -375,102 +390,82 @@ def db_lstm(word, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark,
input_tmp = [mix_hidden, lstm] input_tmp = [mix_hidden, lstm]
feature_out = fluid.layers.sums(input=[ # Take the output of the last stacked LSTM and the hidden-layer projection of that LSTM's input,
# then map both through fully connected layers to the size of the label dictionary to learn the CRF state features
feature_out = fluid.layers.sums(input=[
fluid.layers.fc(input=input_tmp[0], size=label_dict_len, act='tanh'), fluid.layers.fc(input=input_tmp[0], size=label_dict_len, act='tanh'),
fluid.layers.fc(input=input_tmp[1], size=label_dict_len, act='tanh') fluid.layers.fc(input=input_tmp[1], size=label_dict_len, act='tanh')
]) ])
return feature_out
```
## Train model
- In the `train` method, we create a trainer from the model topology, parameters, and optimization method. We use the most basic **SGD** method, which is equivalent to a momentum optimizer with zero momentum, and we also set the learning rate and its decay.
- As mentioned in the data preparation section, we use the CoNLL 2005 test corpus as the training set. `conll05.test()` yields one training instance at a time; the instances are shuffled, grouped into mini-batches, and used as input.
- `feeding` specifies the correspondence between the columns of a data instance and the data layers. For example, according to `feeding`, column 0 of a data instance produced by `conll05.test()` is matched to the data layer named `word_data`.
- `event_handler` can be used as a callback for training events and is passed as an argument to the `train` method. The following `event_handler` prints the cost during training.
- `trainer.train` trains the model.
```python # tag/label sequence
def train(use_cuda, save_dirname=None, is_local=True): target = fluid.layers.data(
# define network topology
word = fluid.layers.data(
name='word_data', shape=[1], dtype='int64', lod_level=1)
predicate = fluid.layers.data(
name='verb_data', shape=[1], dtype='int64', lod_level=1)
ctx_n2 = fluid.layers.data(
name='ctx_n2_data', shape=[1], dtype='int64', lod_level=1)
ctx_n1 = fluid.layers.data(
name='ctx_n1_data', shape=[1], dtype='int64', lod_level=1)
ctx_0 = fluid.layers.data(
name='ctx_0_data', shape=[1], dtype='int64', lod_level=1)
ctx_p1 = fluid.layers.data(
name='ctx_p1_data', shape=[1], dtype='int64', lod_level=1)
ctx_p2 = fluid.layers.data(
name='ctx_p2_data', shape=[1], dtype='int64', lod_level=1)
mark = fluid.layers.data(
name='mark_data', shape=[1], dtype='int64', lod_level=1)
# define network topology
feature_out = db_lstm(**locals())
target = fluid.layers.data(
name='target', shape=[1], dtype='int64', lod_level=1) name='target', shape=[1], dtype='int64', lod_level=1)
crf_cost = fluid.layers.linear_chain_crf(
# Learn the CRF transition features
crf_cost = fluid.layers.linear_chain_crf(
input=feature_out, input=feature_out,
label=target, label=target,
param_attr=fluid.ParamAttr( param_attr=fluid.ParamAttr(
name='crfw', learning_rate=mix_hidden_lr)) name='crfw', learning_rate=mix_hidden_lr))
avg_cost = fluid.layers.mean(crf_cost)
sgd_optimizer = fluid.optimizer.SGD( avg_cost = fluid.layers.mean(crf_cost)
# Use the most basic SGD optimization method (momentum is set to 0)
sgd_optimizer = fluid.optimizer.SGD(
learning_rate=fluid.layers.exponential_decay( learning_rate=fluid.layers.exponential_decay(
learning_rate=0.01, learning_rate=0.01,
decay_steps=100000, decay_steps=100000,
decay_rate=0.5, decay_rate=0.5,
staircase=True)) staircase=True))
sgd_optimizer.minimize(avg_cost) sgd_optimizer.minimize(avg_cost)
# The CRF decoding layer is used for evaluation and inference. ```
# It shares weights with CRF layer. The sharing of parameters among multiple layers
# is specified by using the same parameter name in these layers. If true tag sequence As mentioned in the data introduction section, the CoNLL 2005 training set is not freely available (it must be paid for), so here we train on the test set for demonstration purposes. `conll05.test()` yields one sample at a time, each containing 9 features; the samples are shuffled and batched to form the training input.
# is provided in training process, `fluid.layers.crf_decoding` calculates labelling error
# for each input token and sums the error over the entire sequence. ```python
# Otherwise, `fluid.layers.crf_decoding` generates the labelling tags. crf_decode = fluid.layers.crf_decoding(
crf_decode = fluid.layers.crf_decoding(
input=feature_out, param_attr=fluid.ParamAttr(name='crfw')) input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
train_data = paddle.batch( train_data = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
paddle.dataset.conll05.test(), buf_size=8192), paddle.dataset.conll05.test(), buf_size=8192),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
```
The feeder specifies which data layer each column of a sample is fed to. In the feeder below, column 0 of the data produced by `conll05.test()` is fed to the `word` data layer.
feeder = fluid.DataFeeder( ```python
feeder = fluid.DataFeeder(
feed_list=[ feed_list=[
word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, predicate, mark, target word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, predicate, mark, target
], ],
place=place) place=place)
exe = fluid.Executor(place) exe = fluid.Executor(place)
```
Start training
```python
main_program = fluid.default_main_program()
def train_loop(main_program): exe.run(fluid.default_startup_program())
exe.run(fluid.default_startup_program()) embedding_param = fluid.global_scope().find_var(
embedding_param = fluid.global_scope().find_var(
embedding_name).get_tensor() embedding_name).get_tensor()
embedding_param.set( embedding_param.set(
load_parameter(conll05.get_embedding(), word_dict_len, word_dim), load_parameter(conll05.get_embedding(), word_dict_len, word_dim),
place) place)
start_time = time.time() start_time = time.time()
batch_id = 0 batch_id = 0
for pass_id in six.moves.xrange(PASS_NUM): for pass_id in six.moves.xrange(PASS_NUM):
for data in train_data(): for data in train_data():
cost = exe.run(main_program, cost = exe.run(main_program,
feed=feeder.feed(data), feed=feeder.feed(data),
...@@ -490,77 +485,75 @@ def train(use_cuda, save_dirname=None, is_local=True): ...@@ -490,77 +485,75 @@ def train(use_cuda, save_dirname=None, is_local=True):
'ctx_n1_data', 'ctx_0_data', 'ctx_p1_data', 'ctx_n1_data', 'ctx_0_data', 'ctx_p1_data',
'ctx_p2_data', 'mark_data' 'ctx_p2_data', 'mark_data'
], [feature_out], exe) ], [feature_out], exe)
return break
batch_id = batch_id + 1 batch_id = batch_id + 1
train_loop(fluid.default_main_program())
``` ```
## Application ## Model Application
After training, the best model should be selected according to a performance metric we care about. A simple choice is the model with the fewest labeling errors on the test set. An example of using a trained model for prediction follows.
- When training is completed, we need to select an optimal model based on a performance index to do inference. In this task, one can simply select the model with the fewest labeling errors on the test set. We demonstrate doing an inference using the trained model. First, set the parameters of the prediction process:
```python ```python
def infer(use_cuda, save_dirname=None): use_cuda = False #predict on cpu
if save_dirname is None: save_dirname = "label_semantic_roles.inference.model" #call trained model for prediction
return
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() exe = fluid.Executor(place)
exe = fluid.Executor(place) ```
Set up the input. A LoDTensor represents the input word sequences; the base_shape of each word is [1], because each word is represented by a single id. If the length-based LoD is [[3, 4, 2]], which is a single-level LoD, the constructed LoDTensor contains three sequences of lengths 3, 4, and 2.
inference_scope = fluid.core.Scope()
with fluid.scope_guard(inference_scope): Note that LoD is a list of lists.
# Use fluid.io.load_inference_model to obtain the inference program desc,
# the feed_target_names (the names of variables that will be fed
# data using feed operators), and the fetch_targets (variables that
# we want to obtain data from using fetch operators).
[inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(save_dirname, exe)
# Setup inputs by creating LoDTensors to represent sequences of words. ```python
# Here each word is the basic element of these LoDTensors and the shape of lod = [[3, 4, 2]]
# each word (base_shape) should be [1] since it is simply an index to base_shape = [1]
# look up for the corresponding word vector.
# Suppose the length_based level of detail (lod) info is set to [[3, 4, 2]], # Construct fake data as input, the range of random integer numbers is [low, high]
# which has only one lod level. Then the created LoDTensors will have only word = fluid.create_random_int_lodtensor(
# one higher level structure (sequence of words, or sentence) than the basic
# element (word). Hence the LoDTensor will hold data for three sentences of
# length 3, 4 and 2, respectively.
# Note that lod info should be a list of lists.
lod = [[3, 4, 2]]
base_shape = [1]
# The range of random integers is [low, high]
word = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
pred = fluid.create_random_int_lodtensor( pred = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=pred_dict_len - 1) lod, base_shape, place, low=0, high=pred_dict_len - 1)
ctx_n2 = fluid.create_random_int_lodtensor( ctx_n2 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_n1 = fluid.create_random_int_lodtensor( ctx_n1 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_0 = fluid.create_random_int_lodtensor( ctx_0 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_p1 = fluid.create_random_int_lodtensor( ctx_p1 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
ctx_p2 = fluid.create_random_int_lodtensor( ctx_p2 = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=word_dict_len - 1) lod, base_shape, place, low=0, high=word_dict_len - 1)
mark = fluid.create_random_int_lodtensor( mark = fluid.create_random_int_lodtensor(
lod, base_shape, place, low=0, high=mark_dict_len - 1) lod, base_shape, place, low=0, high=mark_dict_len - 1)
```
# Construct feed as a dictionary of {feed_target_name: feed_target_data} Use `fluid.io.load_inference_model` to load `inference_program`; `feed_target_names` holds the names of the model's input variables, and `fetch_targets` holds the variables to fetch as predictions.
# and results will contain a list of data corresponding to fetch_targets.
assert feed_target_names[0] == 'word_data' ```python
assert feed_target_names[1] == 'verb_data' [inference_program, feed_target_names,
assert feed_target_names[2] == 'ctx_n2_data' fetch_targets] = fluid.io.load_inference_model(save_dirname, exe)
assert feed_target_names[3] == 'ctx_n1_data' ```
assert feed_target_names[4] == 'ctx_0_data' Construct the feed dictionary {feed_target_name: feed_target_data}; `results` will be a list of data corresponding to `fetch_targets`.
assert feed_target_names[5] == 'ctx_p1_data'
assert feed_target_names[6] == 'ctx_p2_data' ```python
assert feed_target_names[7] == 'mark_data' assert feed_target_names[0] == 'word_data'
assert feed_target_names[1] == 'verb_data'
results = exe.run(inference_program, assert feed_target_names[2] == 'ctx_n2_data'
assert feed_target_names[3] == 'ctx_n1_data'
assert feed_target_names[4] == 'ctx_0_data'
assert feed_target_names[5] == 'ctx_p1_data'
assert feed_target_names[6] == 'ctx_p2_data'
assert feed_target_names[7] == 'mark_data'
```
Execute prediction
```python
results = exe.run(inference_program,
feed={ feed={
feed_target_names[0]: word, feed_target_names[0]: word,
feed_target_names[1]: pred, feed_target_names[1]: pred,
...@@ -573,47 +566,36 @@ def infer(use_cuda, save_dirname=None): ...@@ -573,47 +566,36 @@ def infer(use_cuda, save_dirname=None):
}, },
fetch_list=fetch_targets, fetch_list=fetch_targets,
return_numpy=False) return_numpy=False)
print(results[0].lod())
np_data = np.array(results[0])
print("Inference Shape: ", np_data.shape)
``` ```
- The main entrance of the whole program is as below: Output result
```python ```python
def main(use_cuda, is_local=True): print(results[0].lod())
if use_cuda and not fluid.core.is_compiled_with_cuda(): np_data = np.array(results[0])
return print("Inference Shape: ", np_data.shape)
# Directory for saving the trained model
save_dirname = "label_semantic_roles.inference.model"
train(use_cuda, save_dirname, is_local)
infer(use_cuda, save_dirname)
main(use_cuda=False)
``` ```
## Conclusion ## Conclusion
Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we use SRL as an example to illustrate using PaddlePaddle to do sequence tagging tasks. The models proposed are from our published paper\[[10](#Reference)\]. We only use test data for illustration since the training data on the CoNLL 2005 dataset is not completely public. This aims to propose an end-to-end neural network model with fewer dependencies on natural language processing tools but is comparable, or even better than traditional models in terms of performance. Please check out our paper for more information and discussions. Semantic role labeling is an important intermediate step in many natural language understanding tasks. In this tutorial, we take semantic role labeling as an example to introduce how to use PaddlePaddle for sequence labeling tasks. The model presented in the tutorial comes from our published paper \[[10](#References)\]. Since the training data for the CoNLL 2005 SRL task is not currently fully open, only the test data is used as an example in the tutorial. With this model, we hope to reduce reliance on other natural language processing tools and to use the data-driven, end-to-end learning capability of neural networks to obtain a model that is comparable to or even better than traditional methods. The paper confirms this possibility. More information and discussion about the model can be found in the paper.
<a name="References"></a>
## References ## References
1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483. 1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013. 2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014. 3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014. 4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014.
5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](http://www.jmlr.org/papers/volume15/doppa14a/source/biblio.bib.old)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289. 5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers) [C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
6. 李航. 统计学习方法[J]. 清华大学出版社, 北京, 2012. 6. Li Hang. Statistical Learning Method[J]. Tsinghua University Press, Beijing, 2012.
7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330. 7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330.
8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106. 8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106.
9. Carreras X, Màrquez L. [Introduction to the CoNLL-2005 shared task: Semantic role labeling](http://www.cs.upc.edu/~srlconll/st05/papers/intro.pdf)[C]//Proceedings of the Ninth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2005: 152-164. 9. Carreras X, Màrquez L. [Introduction to the CoNLL-2005 shared task: Semantic role labeling](http://www.cs.upc.edu/~srlconll/st05/papers/intro.pdf)[C]//Proceedings of the Ninth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2005: 152-164.
10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015. 10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -7,6 +7,7 @@ import paddle.dataset.conll05 as conll05 ...@@ -7,6 +7,7 @@ import paddle.dataset.conll05 as conll05
import paddle.fluid as fluid import paddle.fluid as fluid
import six import six
import time import time
import argparse
with_gpu = os.getenv('WITH_GPU', '0') != '0' with_gpu = os.getenv('WITH_GPU', '0') != '0'
...@@ -29,6 +30,20 @@ BATCH_SIZE = 10 ...@@ -29,6 +30,20 @@ BATCH_SIZE = 10
embedding_name = 'emb' embedding_name = 'emb'
def parse_args():
parser = argparse.ArgumentParser("label_semantic_roles")
parser.add_argument(
'--enable_ce',
action='store_true',
help="If set, run the task with continuous evaluation logs.")
parser.add_argument(
'--use_gpu', type=int, default=0, help="Whether to use GPU or not.")
parser.add_argument(
'--num_epochs', type=int, default=100, help="number of epochs.")
args = parser.parse_args()
return args
def load_parameter(file_name, h, w): def load_parameter(file_name, h, w):
with open(file_name, 'rb') as f: with open(file_name, 'rb') as f:
f.read(16) # skip header. f.read(16) # skip header.
...@@ -122,6 +137,10 @@ def train(use_cuda, save_dirname=None, is_local=True): ...@@ -122,6 +137,10 @@ def train(use_cuda, save_dirname=None, is_local=True):
mark = fluid.layers.data( mark = fluid.layers.data(
name='mark_data', shape=[1], dtype='int64', lod_level=1) name='mark_data', shape=[1], dtype='int64', lod_level=1)
if args.enable_ce:
fluid.default_startup_program().random_seed = 90
fluid.default_main_program().random_seed = 90
# define network topology # define network topology
feature_out = db_lstm(**locals()) feature_out = db_lstm(**locals())
target = fluid.layers.data( target = fluid.layers.data(
...@@ -145,6 +164,10 @@ def train(use_cuda, save_dirname=None, is_local=True): ...@@ -145,6 +164,10 @@ def train(use_cuda, save_dirname=None, is_local=True):
crf_decode = fluid.layers.crf_decoding( crf_decode = fluid.layers.crf_decoding(
input=feature_out, param_attr=fluid.ParamAttr(name='crfw')) input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
if args.enable_ce:
train_data = paddle.batch(
paddle.dataset.conll05.test(), batch_size=BATCH_SIZE)
else:
train_data = paddle.batch( train_data = paddle.batch(
paddle.reader.shuffle(paddle.dataset.conll05.test(), buf_size=8192), paddle.reader.shuffle(paddle.dataset.conll05.test(), buf_size=8192),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
...@@ -181,6 +204,9 @@ def train(use_cuda, save_dirname=None, is_local=True): ...@@ -181,6 +204,9 @@ def train(use_cuda, save_dirname=None, is_local=True):
time.time() - start_time) / batch_id)) time.time() - start_time) / batch_id))
# Set the threshold low to speed up the CI test # Set the threshold low to speed up the CI test
if float(cost) < 60.0: if float(cost) < 60.0:
if args.enable_ce:
print("kpis\ttrain_cost\t%f" % cost)
if save_dirname is not None: if save_dirname is not None:
# TODO(liuyiqun): Change the target to crf_decode # TODO(liuyiqun): Change the target to crf_decode
fluid.io.save_inference_model(save_dirname, [ fluid.io.save_inference_model(save_dirname, [
...@@ -282,4 +308,8 @@ def main(use_cuda, is_local=True): ...@@ -282,4 +308,8 @@ def main(use_cuda, is_local=True):
infer(use_cuda, save_dirname) infer(use_cuda, save_dirname)
main(use_cuda=False) if __name__ == '__main__':
args = parse_args()
use_cuda = args.use_gpu
PASS_NUM = args.num_epochs
main(use_cuda)
...@@ -2,6 +2,11 @@ ...@@ -2,6 +2,11 @@
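The source code for this tutorial is in [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation). If you are new to this book, please refer to the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书) The source code for this tutorial is in [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation). If you are new to this book, please refer to the [Book usage instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书)
### Notes
1. Hardware requirements: this tutorial can run on either CPU or GPU.
2. CUDA/cuDNN support in the Docker image: if you use the Docker image provided with this tutorial, note that its GPU support is limited to CUDA 8 and cuDNN 5.
3. Differences between the code in this document and train.py: to make this tutorial easier to read and use, we split up and adjusted the code of train.py before placing it here. The code in this tutorial produces the same results as train.py; to see the output of the training script directly, run [train.py](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/train.py)
## Background ## Background
Machine translation (MT) is the technology of using computers to translate between different languages. The language being translated is usually called the source language, and the language of the translation result is called the target language. Machine translation is the process of converting from the source language to the target language, and is one of the important research areas in natural language processing. Machine translation (MT) is the technology of using computers to translate between different languages. The language being translated is usually called the source language, and the language of the translation result is called the target language. Machine translation is the process of converting from the source language to the target language, and is one of the important research areas in natural language processing.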
...@@ -37,7 +42,21 @@ ...@@ -37,7 +42,21 @@
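## Model Overview ## Model Overview
This section introduces, in order, the bi-directional recurrent neural network, the typical encoder-decoder framework in NMT models, and the beam search algorithm. This section introduces, in order, the GRU (Gated Recurrent Unit), the bi-directional recurrent neural network, the typical encoder-decoder framework and attention mechanism in NMT models, and the beam search algorithm.
### GRU
We introduced recurrent neural networks (RNN) and long short-term memory networks (LSTM) in the [sentiment analysis](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md) chapter. Compared with a simple RNN, the LSTM adds a memory cell, an input gate, a forget gate, and an output gate. Together, these gates and the memory cell greatly improve the RNN's ability to handle long-distance dependencies.
The GRU\[[2](#参考文献)\], proposed by Cho et al., is a simplified version of the LSTM and also an extension of the RNN, as shown in the figure below. A GRU unit has only two gates:
- Reset gate: if the reset gate is closed, historical information is ignored, i.e., irrelevant history does not affect future outputs.
- Update gate: merges the LSTM's input gate and forget gate, and controls the influence of historical information on the hidden-layer output at the current time step. If the update gate is close to 1, the historical information is passed on.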
<p align="center">
<img src="image/gru.png" width=700><br/>
Figure 2. GRU (Gated Recurrent Unit)
</p>
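In general, for sequences with short-distance dependencies the reset gate is more active; conversely, for sequences with long-distance dependencies the update gate is more active. In addition, Chung et al.\[[3](#参考文献)\] showed through several sets of experiments that although the GRU has fewer parameters, it performs comparably to the LSTM on many tasks.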
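To make the two gates concrete, here is a minimal NumPy sketch of a single GRU time step. It is not PaddlePaddle's implementation: the function name `gru_step`, the weight names, and the omission of bias terms are assumptions made for illustration. It follows the convention described above, where an update gate close to 1 passes the previous hidden state through; some formulations swap the roles of `u` and `1 - u`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h_prev, W_u, U_u, W_r, U_r, W_c, U_c):
    """One GRU time step (hypothetical helper; biases omitted for brevity)."""
    u = sigmoid(x @ W_u + h_prev @ U_u)        # update gate
    r = sigmoid(x @ W_r + h_prev @ U_r)        # reset gate
    # Candidate state: the reset gate scales how much history enters it.
    c = np.tanh(x @ W_c + (r * h_prev) @ U_c)
    # Update gate near 1 keeps the history, near 0 takes the candidate.
    return u * h_prev + (1.0 - u) * c


rng = np.random.default_rng(0)
input_dim, hidden = 4, 3
x = rng.normal(size=input_dim)
h_prev = rng.uniform(-1, 1, size=hidden)
# Order: W_u, U_u, W_r, U_r, W_c, U_c
weights = [rng.normal(size=s) for s in [(input_dim, hidden), (hidden, hidden)] * 3]
h_next = gru_step(x, h_prev, *weights)
print(h_next.shape)
```

Because the new state is a convex combination of the previous state and a `tanh` output, it stays within [-1, 1] whenever the previous state does.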
### Bi-directional Recurrent Neural Network ### Bi-directional Recurrent Neural Network
...@@ -81,19 +100,56 @@ ...@@ -81,19 +100,56 @@
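During training for the machine translation task, the goal of the decoding stage is to maximize the probability of the next correct target-language word. The idea is: During training for the machine translation task, the goal of the decoding stage is to maximize the probability of the next correct target-language word. The idea is:
1. At each time step, compute the next hidden state $z_{i+1}$ from the encoding of the source sentence (also called the context vector) $c$, the $i$-th word $u_i$ of the ground-truth target sequence, and the RNN hidden state $z_i$ at time $i$. The formula is: 1. At each time step, compute the next hidden state $z_{i+1}$ from the encoding of the source sentence (also called the context vector) $c$, the $i$-th word $u_i$ of the ground-truth target sequence, and the RNN hidden state $z_i$ at time $i$. The formula is: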
$$z_{i+1}=\phi_{\theta '} \left ( c,u_i,z_i \right )$$
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/decoder_formula.png?raw=true" width="200"><br/>
</div>
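where $\phi _{\theta '}$ is a non-linear activation function; $c$ is the context vector of the source sentence. Without attention, if the output of the [encoder](#编码器) is the last element of the encoded source sentence, we can define $c=h_T$; $u_i$ is the $i$-th word of the target sequence, and $u_0$ is the start token `<s>` of the target sequence, marking the start of decoding; $z_i$ is the hidden state of the decoder RNN at time $i$, and $z_0$ is an all-zero vector. where $\phi _{\theta '}$ is a non-linear activation function; $c$ is the context vector of the source sentence. Without attention, if the output of the [encoder](#编码器) is the last element of the encoded source sentence, we can define $c=h_T$; $u_i$ is the $i$-th word of the target sequence, and $u_0$ is the start token `<s>` of the target sequence, marking the start of decoding; $z_i$ is the hidden state of the decoder RNN at time $i$, and $z_0$ is an all-zero vector.
2. Normalize $z_{i+1}$ with `softmax` to obtain the probability distribution $p_{i+1}$ of the $(i+1)$-th word of the target sequence. The formula is: 1. Normalize $z_{i+1}$ with `softmax` to obtain the probability distribution $p_{i+1}$ of the $(i+1)$-th word of the target sequence. The formula is: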
$$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/probability_formula.png?raw=true" width="400"><br/>
</div>
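where $W_sz_{i+1}+b_z$ scores each possible output word, and softmax normalization then yields the probability $p_{i+1}$ of the $(i+1)$-th word. where $W_sz_{i+1}+b_z$ scores each possible output word, and softmax normalization then yields the probability $p_{i+1}$ of the $(i+1)$-th word.
3. Compute the cost from $p_{i+1}$ and $u_{i+1}$. 1. Compute the cost from $p_{i+1}$ and $u_{i+1}$.
4. Repeat steps 1-3 until all words in the target sequence have been processed. 2. Repeat steps 1-3 until all words in the target sequence have been processed.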
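One iteration of the training-time decoding steps listed above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the tutorial does not specify the form of $\phi_{\theta '}$, so a single `tanh` layer over linear projections stands in for it; the cost is assumed to be cross-entropy; vectors are treated as rows, so $W_s z_{i+1}$ is written `z_next @ W_s`; and all names (`decoder_train_step`, `W_c`, `W_u`, `W_z`) are hypothetical.

```python
import numpy as np


def decoder_train_step(c, u_emb, z, target_id, W_c, W_u, W_z, W_s, b_z):
    """One decoding step during training (hypothetical sketch)."""
    # Step 1: next hidden state from context c, current word u_i, and state z_i.
    z_next = np.tanh(c @ W_c + u_emb @ W_u + z @ W_z)
    # Step 2: score every vocabulary word, then normalize with softmax.
    scores = z_next @ W_s + b_z
    scores = scores - scores.max()              # numerical stability
    p_next = np.exp(scores) / np.exp(scores).sum()
    # Step 3: cost of the true next word u_{i+1} (assumed cross-entropy).
    cost = -np.log(p_next[target_id])
    return z_next, p_next, cost


rng = np.random.default_rng(0)
ctx_dim, emb_dim, hid_dim, vocab = 5, 4, 6, 10
params = dict(
    W_c=rng.normal(size=(ctx_dim, hid_dim)),
    W_u=rng.normal(size=(emb_dim, hid_dim)),
    W_z=rng.normal(size=(hid_dim, hid_dim)),
    W_s=rng.normal(size=(hid_dim, vocab)),
    b_z=np.zeros(vocab),
)
c = rng.normal(size=ctx_dim)        # context vector of the source sentence
u_emb = rng.normal(size=emb_dim)    # embedding of the current target word
z0 = np.zeros(hid_dim)              # z_0 is an all-zero vector
z1, p1, cost = decoder_train_step(c, u_emb, z0, target_id=3, **params)
print(z1.shape, p1.shape)
```

Step 4 corresponds to calling this function once per target word, feeding `z1` back in as the next state and summing the costs.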
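The generation process of machine translation is, simply put, translating source-language sentences with a previously trained model. The decoding stage during generation differs from the training process above; see the [beam search algorithm](#柱搜索算法) for details The generation process of machine translation is, simply put, translating source-language sentences with a previously trained model. The decoding stage during generation differs from the training process above; see the [beam search algorithm](#柱搜索算法) for details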
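### Attention Mechanism
If the output of the encoding stage is a fixed-dimensional vector, two problems arise: 1) whether the source sequence has 5 words or 50, encoding all of its semantic and syntactic information in a fixed-dimensional vector is a very demanding requirement for the model, especially for long sentences; 2) intuitively, when a human translates a sentence, they pay more attention to the source fragments most relevant to the current piece of translation, and the focus shifts as translation proceeds. A fixed-dimensional vector, by contrast, amounts to paying equal attention to all source information at every time step, which is unreasonable. Therefore, Bahdanau et al.\[[4](#参考文献)\] introduced the attention mechanism, which decodes from encoded context fragments and thereby addresses feature learning for long sentences. The decoder structure under the attention mechanism is described below.
Unlike the simple decoder, $z_i$ is computed here as: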
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/attention_decoder_formula.png?raw=true" width="200"><br/>
</div>
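As can be seen, the encoding of the source sentence is represented as the context fragment $c_i$ for the $i$-th word, i.e., each target-language word $u_i$ has its own $c_i$. $c_i$ is computed as follows: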
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/sum_formula.png?raw=true" width="300"><br/>
</div>
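As the formula shows, the attention mechanism is implemented as a weighted average of the encoder RNN states $h_j$ at each time step. The weight $a_{ij}$ represents how much attention the $i$-th target word pays to the $j$-th source word, and is computed as follows: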
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/weight_formula.png?raw=true" width="300"><br/>
</div>
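Here, $align$ can be viewed as an alignment model that measures how well the $i$-th target word matches the $j$-th source word. Specifically, this score is computed from the $i$-th hidden state $z_i$ of the decoder RNN and the $j$-th context fragment $h_j$ of the source sentence. In traditional alignment models, each target word explicitly corresponds to one or more source words (hard alignment); the attention model uses soft alignment, in which every pair of target and source words is related to some degree, and the strength of the relation is a real number computed by the model, so it can be incorporated into the overall NMT framework and trained by backpropagation.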
<p align="center">
<img src="image/decoder_attention.png" width=500><br/>
Figure 6. Attention-based decoder
</p>
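The attention computation described in this section can be sketched in NumPy as below. The formulas themselves appear only as images in the tutorial, so this sketch assumes the standard formulation: scores $e_{ij}=align(z_i,h_j)$, weights $a_{ij}$ obtained by a softmax over $j$, and $c_i=\sum_j a_{ij}h_j$. The additive form chosen for $align$ and the names `attention_context`, `W_z`, `W_h`, and `v` are assumptions for illustration, not the tutorial's code.

```python
import numpy as np


def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()


def attention_context(z_i, h, W_z, W_h, v):
    """Context vector c_i for decoder state z_i (hypothetical sketch).

    z_i: (dec_dim,) decoder hidden state
    h:   (T, enc_dim) encoder states h_1..h_T
    """
    # Assumed additive alignment model: e_ij = v . tanh(z_i W_z + h_j W_h)
    scores = np.tanh(z_i @ W_z + h @ W_h) @ v    # shape (T,)
    a = softmax(scores)                          # attention weights a_ij
    c = a @ h                                    # weighted average of h_j
    return c, a


rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 5, 4, 3, 6
h = rng.normal(size=(T, enc_dim))
z_i = rng.normal(size=dec_dim)
W_z = rng.normal(size=(dec_dim, att_dim))
W_h = rng.normal(size=(enc_dim, att_dim))
v = rng.normal(size=att_dim)
c_i, a_i = attention_context(z_i, h, W_z, W_h, v)
print(a_i.sum(), c_i.shape)
```

Because every weight is strictly positive, each source position contributes to $c_i$ to some degree, which is the soft alignment described above; all operations are differentiable, so the alignment parameters can be trained by backpropagation together with the rest of the model.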
<a name="柱搜索算法"></a> <a name="柱搜索算法"></a>
### Beam Search Algorithm ### Beam Search Algorithm
...@@ -114,7 +170,7 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$ ...@@ -114,7 +170,7 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
## Data Introduction ## Data Introduction
This tutorial uses the [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) from the [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) dataset as the training set, and the [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the test and generation sets This tutorial uses the dataset of the [translation task](http://www.statmt.org/wmt16/multimodal-task.html#task1) in the [multimodal task](http://www.statmt.org/wmt16/multimodal-task.html) newly added in [WMT-16](http://www.statmt.org/wmt16/). The dataset is English-German translation data, containing 29001 training samples and 1000 test samples
### Data Preprocessing ### Data Preprocessing
...@@ -130,347 +186,408 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$ ...@@ -130,347 +186,408 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
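### Sample Data ### Sample Data
Because the complete dataset is large, to validate the training pipeline, the PaddlePaddle interface paddle.dataset.wmt14 provides by default a preprocessed [smaller dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) To validate the training pipeline, the PaddlePaddle interface `paddle.dataset.wmt16` provides a [preprocessed version](http://paddlemodels.bj.bcebos.com/wmt/wmt16.tar.gz) of this dataset, which can be used directly by calling the interface. Because of the limited data size, it is used here only as an example; it performs reasonably on the corresponding test set, but performance on other test data cannot be guaranteed.
This dataset has 193319 training samples, 6003 test samples, and a dictionary size of 30000. Because of the limited data size, the quality of a model trained on this dataset cannot be guaranteed.
## Model Configuration ## Model Configuration
Next, we configure the model according to the form of the input data. First, import the required libraries and define the global variables Next, we configure the model according to the form of the input data. First, import the required libraries and define the global variables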
```python ```python
from __future__ import print_function from __future__ import print_function
import contextlib import os
import six
import numpy as np
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import paddle.fluid.framework as framework
import paddle.fluid.layers as pd
from paddle.fluid.executor import Executor
from functools import partial
import os
try:
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
dict_size = 30000 # dictionary dimension
source_dict_dim = target_dict_dim = dict_size # source/target language dictionary dimension
hidden_dim = 32 # hidden layer size in the encoder
word_dim = 16 # word embedding dimension
batch_size = 2 # number of samples in a batch
max_length = 8 # maximum length of a generated sentence
beam_size = 2 # beam width
dict_size = 30000 # dictionary size
source_dict_size = target_dict_size = dict_size # source/target language dictionary size
word_dim = 512 # word embedding dimension
hidden_dim = 512 # hidden layer size in the encoder
decoder_size = hidden_dim # hidden layer size in the decoder
max_length = 256 # maximum length of a decoded sentence
beam_size = 4 # beam width of beam search
batch_size = 64 # number of samples in a batch
is_sparse = True
model_save_dir = "machine_translation.inference.model"
``` ```
Then implement the encoder as follows:
```python ```python
def encoder(is_sparse): def encoder():
# 定义源语言id序列的输入数据 # 定义源语言id序列的输入数据
src_word_id = pd.data( src_word_id = fluid.layers.data(
name="src_word_id", shape=[1], dtype='int64', lod_level=1) name="src_word_id", shape=[1], dtype='int64', lod_level=1)
# 将上述编码映射到低维语言空间的词向量 # 将上述编码映射到低维语言空间的词向量
src_embedding = pd.embedding( src_embedding = fluid.layers.embedding(
input=src_word_id, input=src_word_id,
size=[dict_size, word_dim], size=[source_dict_size, word_dim],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse)
param_attr=fluid.ParamAttr(name='vemb')) # 用双向GRU编码源语言序列,拼接两个GRU的编码结果得到h
# LSTM层:fc + dynamic_lstm fc_forward = fluid.layers.fc(
fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh') input=src_embedding, size=hidden_dim * 3, bias_attr=False)
lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4) src_forward = fluid.layers.dynamic_gru(input=fc_forward, size=hidden_dim)
# 取源语言序列编码后的最后一个状态 fc_backward = fluid.layers.fc(
encoder_out = pd.sequence_last_step(input=lstm_hidden0) input=src_embedding, size=hidden_dim * 3, bias_attr=False)
return encoder_out src_backward = fluid.layers.dynamic_gru(
input=fc_backward, size=hidden_dim, is_reverse=True)
encoded_vector = fluid.layers.concat(
input=[src_forward, src_backward], axis=1)
return encoded_vector
```
Next, implement the attention-based decoder:
- First define the single-step computation in the decoder, i.e., $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$, as follows:
```python
# Define the single-step computation of the RNN
def cell(x, hidden, encoder_out, encoder_out_proj):
    # Define the attention that computes the context, i.e. c_i;
    # Bahdanau attention is used here
    def simple_attention(encoder_vec, encoder_proj, decoder_state):
        decoder_state_proj = fluid.layers.fc(
            input=decoder_state, size=decoder_size, bias_attr=False)
        # sequence_expand expands the single-step content into a sequence
        # that matches the encoder output
        decoder_state_expand = fluid.layers.sequence_expand(
            x=decoder_state_proj, y=encoder_proj)
        mixed_state = fluid.layers.elementwise_add(encoder_proj,
                                                   decoder_state_expand)
        attention_weights = fluid.layers.fc(
            input=mixed_state, size=1, bias_attr=False)
        attention_weights = fluid.layers.sequence_softmax(
            input=attention_weights)
        weights_reshape = fluid.layers.reshape(x=attention_weights, shape=[-1])
        scaled = fluid.layers.elementwise_mul(
            x=encoder_vec, y=weights_reshape, axis=0)
        context = fluid.layers.sequence_pool(input=scaled, pool_type='sum')
        return context

    context = simple_attention(encoder_out, encoder_out_proj, hidden)
    out = fluid.layers.fc(
        input=[x, context], size=decoder_size * 3, bias_attr=False)
    out = fluid.layers.gru_unit(
        input=out, hidden=hidden, size=decoder_size * 3)[0]
    return out, out
``` ```
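Then implement the decoder in training mode: - Based on the single-step computation defined above, use `DynamicRNN` to implement the multi-step decoder in training mode, as follows: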
```python
def train_decoder(encoder_out):
# 获取编码器输出的最后一步并进行非线性映射以构造解码器RNN的初始状态
encoder_last = fluid.layers.sequence_last_step(input=encoder_out)
encoder_last_proj = fluid.layers.fc(
input=encoder_last, size=decoder_size, act='tanh')
# 编码器输出在attention中计算结果的cache
encoder_out_proj = fluid.layers.fc(
input=encoder_out, size=decoder_size, bias_attr=False)
```python
def train_decoder(context, is_sparse):
# 定义目标语言id序列的输入数据,并映射到低维语言空间的词向量 # 定义目标语言id序列的输入数据,并映射到低维语言空间的词向量
trg_language_word = pd.data( trg_language_word = fluid.layers.data(
name="target_language_word", shape=[1], dtype='int64', lod_level=1) name="target_language_word", shape=[1], dtype='int64', lod_level=1)
trg_embedding = pd.embedding( trg_embedding = fluid.layers.embedding(
input=trg_language_word, input=trg_language_word,
size=[dict_size, word_dim], size=[target_dict_size, word_dim],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse)
param_attr=fluid.ParamAttr(name='vemb'))
rnn = pd.DynamicRNN() rnn = fluid.layers.DynamicRNN()
with rnn.block(): # 使用 DynamicRNN 定义每一步的计算 with rnn.block():
# 获取当前步目标语言输入的词向量 # 获取当前步目标语言输入的词向量
current_word = rnn.step_input(trg_embedding) x = rnn.step_input(trg_embedding)
# 获取隐层状态 # 获取隐层状态
pre_state = rnn.memory(init=context) pre_state = rnn.memory(init=encoder_last_proj, need_reorder=True)
# 解码器计算单元:单层前馈网络 # 在DynamicRNN中需使用static_input获取encoder相关的内容
current_state = pd.fc(input=[current_word, pre_state], # 对decoder来说这些内容在每个时间步都是固定的
size=decoder_size, encoder_out = rnn.static_input(encoder_out)
act='tanh') encoder_out_proj = rnn.static_input(encoder_out_proj)
# 执行单步的计算单元
out, current_state = cell(x, pre_state, encoder_out, encoder_out_proj)
# 计算归一化的单词预测概率 # 计算归一化的单词预测概率
current_score = pd.fc(input=current_state, prob = fluid.layers.fc(input=out, size=target_dict_size, act='softmax')
size=target_dict_dim, # 更新隐层状态
act='softmax')
# 更新RNN的隐层状态
rnn.update_memory(pre_state, current_state) rnn.update_memory(pre_state, current_state)
# 输出预测概率 # 输出预测概率
rnn.output(current_score) rnn.output(prob)
return rnn() return rnn()
``` ```
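Implement the decoder in inference mode: Next, the encoder and decoder can be used to define the whole training network; training also requires an optimizer, defined as follows: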
```python ```python
def decode(context, is_sparse): def train_model():
init_state = context encoder_out = encoder()
# 定义解码过程循环计数变量 rnn_out = train_decoder(encoder_out)
array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length) label = fluid.layers.data(
counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True) name="target_language_next_word", shape=[1], dtype='int64', lod_level=1)
# 定义损失函数
cost = fluid.layers.cross_entropy(input=rnn_out, label=label)
avg_cost = fluid.layers.mean(cost)
return avg_cost
# 定义 tensor array 用以保存各个时间步的内容,并写入初始id,score和state def optimizer_func():
state_array = pd.create_array('float32') # 设置梯度裁剪
pd.array_write(init_state, array=state_array, i=counter) fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=5.0))
# 定义先增后降的学习率策略
lr_decay = fluid.layers.learning_rate_scheduler.noam_decay(hidden_dim, 1000)
return fluid.optimizer.Adam(
learning_rate=lr_decay,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4))
```
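To make the optimizer settings more concrete, here is a plain-Python sketch of the two ideas used in `optimizer_func`: a learning rate that first rises and then decays (the Noam schedule) and gradient clipping by global norm. The function names `noam_lr` and `clip_by_global_norm` are hypothetical helpers written for illustration; they follow the commonly documented formulas and are not PaddlePaddle's implementation. The defaults mirror the `hidden_dim` of 512, the 1000 warm-up steps, and the `clip_norm` of 5.0 used in the configuration.

```python
import math

def noam_lr(step, d_model=512, warmup_steps=1000):
    """Noam schedule: linear warm-up followed by inverse-square-root decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def clip_by_global_norm(grads, clip_norm=5.0):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads]

# The rate grows until step == warmup_steps and shrinks afterwards (step starts at 1).
for step in (1, 500, 1000, 4000):
    print(step, noam_lr(step))

# Gradients whose joint norm is 13 are scaled down to a joint norm of 5.
print(clip_by_global_norm([[3.0, 4.0], [12.0]]))
```

Clipping by the global norm keeps the direction of the overall update unchanged, since every gradient is multiplied by the same factor.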
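ids_array = pd.create_array('int64') The above are the model components needed for training. The beam-search-based decoder in prediction (generation) mode is implemented with `while_op`, as follows: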
scores_array = pd.create_array('float32')
init_ids = pd.data(name="init_ids", shape=[1], dtype="int64", lod_level=2) ```python
init_scores = pd.data( def infer_decoder(encoder_out):
name="init_scores", shape=[1], dtype="float32", lod_level=2) # 获取编码器输出的最后一步并进行非线性映射以构造解码器RNN的初始状态
encoder_last = fluid.layers.sequence_last_step(input=encoder_out)
encoder_last_proj = fluid.layers.fc(
input=encoder_last, size=decoder_size, act='tanh')
# 编码器输出在attention中计算结果的cache
encoder_out_proj = fluid.layers.fc(
input=encoder_out, size=decoder_size, bias_attr=False)
# 最大解码步数
max_len = fluid.layers.fill_constant(
shape=[1], dtype='int64', value=max_length)
# 解码步数计数变量
counter = fluid.layers.zeros(shape=[1], dtype='int64', force_cpu=True)
pd.array_write(init_ids, array=ids_array, i=counter) # 定义 tensor array 用以保存各个时间步的内容,并写入初始id,score和state
pd.array_write(init_scores, array=scores_array, i=counter) init_ids = fluid.layers.data(
name="init_ids", shape=[1], dtype="int64", lod_level=2)
init_scores = fluid.layers.data(
name="init_scores", shape=[1], dtype="float32", lod_level=2)
ids_array = fluid.layers.array_write(init_ids, i=counter)
scores_array = fluid.layers.array_write(init_scores, i=counter)
state_array = fluid.layers.array_write(encoder_last_proj, i=counter)
# 定义循环终止条件变量 # 定义循环终止条件变量
cond = pd.less_than(x=counter, y=array_len) cond = fluid.layers.less_than(x=counter, y=max_len)
# 定义 while_op while_op = fluid.layers.While(cond=cond)
while_op = pd.While(cond=cond) with while_op.block():
with while_op.block(): # 定义每一步的计算
# 获取解码器在当前步的输入,包括上一步选择的id,对应的score和上一步的state # 获取解码器在当前步的输入,包括上一步选择的id,对应的score和上一步的state
pre_ids = pd.array_read(array=ids_array, i=counter) pre_ids = fluid.layers.array_read(array=ids_array, i=counter)
pre_state = pd.array_read(array=state_array, i=counter) pre_score = fluid.layers.array_read(array=scores_array, i=counter)
pre_score = pd.array_read(array=scores_array, i=counter) pre_state = fluid.layers.array_read(array=state_array, i=counter)
# 更新输入的state为上一步选择id对应的state # 同train_decoder中的内容,进行RNN的单步计算
pre_state_expanded = pd.sequence_expand(pre_state, pre_score) pre_ids_emb = fluid.layers.embedding(
# 同训练模式下解码器中的计算逻辑,包括获取输入向量,解码器计算单元计算和
# 归一化单词预测概率的计算
pre_ids_emb = pd.embedding(
input=pre_ids, input=pre_ids,
size=[dict_size, word_dim], size=[target_dict_size, word_dim],
dtype='float32', dtype='float32',
is_sparse=is_sparse) is_sparse=is_sparse)
current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb], out, current_state = cell(pre_ids_emb, pre_state, encoder_out,
size=decoder_size, encoder_out_proj)
act='tanh') prob = fluid.layers.fc(
current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score) input=current_state, size=target_dict_size, act='softmax')
current_score = pd.fc(input=current_state_with_lod,
size=target_dict_dim,
act='softmax')
topk_scores, topk_indices = pd.topk(current_score, k=beam_size)
# 计算累计得分,进行beam search # 计算累计得分,进行beam search
accu_scores = pd.elementwise_add( topk_scores, topk_indices = fluid.layers.topk(prob, k=beam_size)
x=pd.log(topk_scores), y=pd.reshape(pre_score, shape=[-1]), axis=0) accu_scores = fluid.layers.elementwise_add(
selected_ids, selected_scores = pd.beam_search( x=fluid.layers.log(topk_scores),
pre_ids, y=fluid.layers.reshape(pre_score, shape=[-1]),
pre_score, axis=0)
topk_indices, accu_scores = fluid.layers.lod_reset(x=accu_scores, y=pre_ids)
accu_scores, selected_ids, selected_scores = fluid.layers.beam_search(
beam_size, pre_ids, pre_score, topk_indices, accu_scores, beam_size, end_id=1)
end_id=10,
level=0) fluid.layers.increment(x=counter, value=1, in_place=True)
# 将 search 结果写入 tensor array 中
pd.increment(x=counter, value=1, in_place=True) fluid.layers.array_write(selected_ids, array=ids_array, i=counter)
# 将 search 结果和对应的隐层状态写入 tensor array 中 fluid.layers.array_write(selected_scores, array=scores_array, i=counter)
pd.array_write(current_state, array=state_array, i=counter) # sequence_expand 作为 gather 使用以获取search结果对应的状态,并更新
pd.array_write(selected_ids, array=ids_array, i=counter) current_state = fluid.layers.sequence_expand(current_state,
pd.array_write(selected_scores, array=scores_array, i=counter) selected_ids)
fluid.layers.array_write(current_state, array=state_array, i=counter)
current_enc_out = fluid.layers.sequence_expand(encoder_out,
selected_ids)
fluid.layers.assign(current_enc_out, encoder_out)
current_enc_out_proj = fluid.layers.sequence_expand(
encoder_out_proj, selected_ids)
fluid.layers.assign(current_enc_out_proj, encoder_out_proj)
# 更新循环终止条件 # 更新循环终止条件
length_cond = pd.less_than(x=counter, y=array_len) length_cond = fluid.layers.less_than(x=counter, y=max_len)
finish_cond = pd.logical_not(pd.is_empty(x=selected_ids)) finish_cond = fluid.layers.logical_not(
pd.logical_and(x=length_cond, y=finish_cond, out=cond) fluid.layers.is_empty(x=selected_ids))
fluid.layers.logical_and(x=length_cond, y=finish_cond, out=cond)
translation_ids, translation_scores = pd.beam_search_decode( # 根据保存的每一步的结果,回溯生成最终解码结果
ids=ids_array, scores=scores_array, beam_size=beam_size, end_id=10) translation_ids, translation_scores = fluid.layers.beam_search_decode(
ids=ids_array, scores=scores_array, beam_size=beam_size, end_id=1)
return translation_ids, translation_scores return translation_ids, translation_scores
``` ```
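The decoding loop above can be hard to follow because it is expressed with tensor arrays and a `While` operator. The following is a simplified, framework-free sketch of beam search under stated assumptions: `step_fn` is a hypothetical callback returning next-token probabilities, and the toy table, like the ids 0 (start) and 1 (end), is made up for illustration. It is not the behavior of `fluid.layers.beam_search` in every detail.

```python
import math

def beam_search(step_fn, start_id, end_id, beam_size, max_length):
    """Return hypotheses as (accumulated log-probability, token ids), best first."""
    beams = [(0.0, [start_id])]
    finished = []
    for _ in range(max_length):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, prob in step_fn(tokens).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the beam_size best candidates by accumulated score.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == end_id:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:  # every kept hypothesis has produced the end token
            break
    finished.extend(beams)  # hypotheses cut off by max_length, if any
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished

def toy_step(tokens):
    # Hypothetical next-token distribution that depends only on the last token.
    table = {
        0: {2: 0.6, 3: 0.4},
        2: {1: 0.9, 3: 0.1},
        3: {1: 0.5, 2: 0.5},
    }
    return table[tokens[-1]]

results = beam_search(toy_step, start_id=0, end_id=1, beam_size=2, max_length=5)
for score, tokens in results:
    print(score, tokens)
```

As in the decoder above, scores are accumulated as sums of log-probabilities, and the loop stops either when the length limit is reached or when no unfinished hypothesis remains.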
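Next, we define a `train_program` that uses the result computed by `inference_program` and computes the error with the help of labeled data. We also define an `optimizer_func` to specify the optimizer. Using the encoder and the inference-mode decoder, the inference network is defined as follows: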
```python ```python
def train_program(is_sparse): def infer_model():
context = encoder(is_sparse) encoder_out = encoder()
rnn_out = train_decoder(context, is_sparse) translation_ids, translation_scores = infer_decoder(encoder_out)
label = pd.data( return translation_ids, translation_scores
name="target_language_next_word", shape=[1], dtype='int64', lod_level=1)
cost = pd.cross_entropy(input=rnn_out, label=label)
avg_cost = pd.mean(cost)
return avg_cost
def optimizer_func():
return fluid.optimizer.Adagrad(
learning_rate=1e-4,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.1))
``` ```
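## Training the Model
### Define the Training Environment ### Build the Training Program
Define your training environment; you can specify whether training runs on the CPU or the GPU.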
```python
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
```
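### Define the Data Provider Define the `Program` used for training, in which the training network is built and the optimizer is added. Also define the `Program` used for initialization; parameter-initialization operations are added to it implicitly while the training network is built.
The next step is to define data providers for training and testing. A provider reads in data of size `BATCH_SIZE`. `paddle.dataset.wmt.train` yields a batch of `BATCH_SIZE` examples after shuffling, where the shuffle buffer size is `buf_size`.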
```python ```python
train_reader = paddle.batch( train_prog = fluid.Program()
paddle.reader.shuffle( startup_prog = fluid.Program()
paddle.dataset.wmt14.train(dict_size), buf_size=1000), with fluid.program_guard(train_prog, startup_prog):
batch_size=batch_size) with fluid.unique_name.guard():
avg_cost = train_model()
optimizer = optimizer_func()
optimizer.minimize(avg_cost)
``` ```
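### Construct the Trainer ### Define the Training Environment and Executor
The trainer needs a training program and a training optimizer function.
Define your training environment, specifying whether training runs on the CPU or the GPU, and define the executor based on this environment.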
```python ```python
is_sparse = False use_cuda = False
trainer = Trainer( # 定义使用设备和执行器
train_func=partial(train_program, is_sparse), place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
place=place, exe = fluid.Executor(place)
optimizer_func=optimizer_func)
``` ```
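### Feed the Data ### Build the Data Provider
`feed_order` defines the mapping between each generated data column and `paddle.layer.data`. For example, the first column produced by `wmt14.train` corresponds to the `src_word_id` feature. Use the wrapped `paddle.dataset.wmt16.train` interface to define the data generator, which yields one sample at a time; after shuffling and batching, the samples serve as training input. The correspondence between the fields of the input data and the inputs defined by `data_layer` must also be specified, which can be done with `DataFeeder`. The feeder below maps the first column of the generated data to the `src_word_id` input.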
```python ```python
feed_order = [ # 定义训练数据生成器
train_data = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt16.train(source_dict_size, target_dict_size),
buf_size=10000),
batch_size=batch_size)
# DataFeeder maps each field of a sample to the corresponding input layer
feeder = fluid.DataFeeder(
feed_list=[
'src_word_id', 'target_language_word', 'target_language_next_word' 'src_word_id', 'target_language_word', 'target_language_next_word'
] ],
place=place,
program=train_prog)
``` ```
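The three fields in the feed list correspond to the source sentence, the decoder input, and the prediction target. The sketch below illustrates how the last two typically relate in teacher-forced training: the decoder input is the target sentence prefixed with a start mark, and the label is the same sentence followed by an end mark. This relationship and the helper `make_decoder_pair` are assumptions for illustration; the ids 0 and 1 are taken from the `init_ids` start token and `end_id=1` used elsewhere in this tutorial.

```python
START_ID, END_ID = 0, 1  # assumed ids of the start and end marks

def make_decoder_pair(target_ids):
    """Build (decoder input, next-word label) from one target sentence."""
    decoder_input = [START_ID] + target_ids  # what the decoder reads at each step
    next_word = target_ids + [END_ID]        # what it should predict at each step
    return decoder_input, next_word

# Hypothetical word ids of a three-word target sentence.
decoder_input, next_word = make_decoder_pair([5, 6, 7])
print(decoder_input)
print(next_word)
```

Both sequences have the same length, so the cross-entropy loss can compare the prediction at every step with the word that should come next.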
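### Event Handler ### Main Training Loop
The callback function `event_handler` is called after a previously defined event occurs. For example, we can inspect the error after each training step.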
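```python Training runs for the number of epochs given by `EPOCH_NUM`, and the trained parameters are saved after each epoch. Note that the initialization `Program` must be run first to initialize the parameters before the training loop starts. Also, `EPOCH_NUM` is set to a small value here as an example; on this dataset, convergence actually takes roughly 20 epochs.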
def event_handler(event):
if isinstance(event, EndStepEvent):
if event.step % 10 == 0:
print('pass_id=' + str(event.epoch) + ' batch=' + str(event.step))
if event.step == 20:
trainer.stop()
```
### Start Training
Finally, we pass in the number of training epochs (`num_epoch`) and some other parameters, and call `trainer.train` to start training.
```python ```python
EPOCH_NUM = 1 # 执行初始化 Program,进行参数初始化
exe.run(startup_prog)
trainer.train( # 循环迭代执行训练
reader=train_reader, EPOCH_NUM = 2
num_epochs=EPOCH_NUM, for pass_id in six.moves.xrange(EPOCH_NUM):
event_handler=event_handler, batch_id = 0
feed_order=feed_order) for data in train_data():
cost = exe.run(
train_prog, feed=feeder.feed(data), fetch_list=[avg_cost])[0]
print('pass_id: %d, batch_id: %d, loss: %f' % (pass_id, batch_id,
cost))
batch_id += 1
# 保存模型
fluid.io.save_params(exe, model_save_dir, main_program=train_prog)
``` ```
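## Applying the Model
### Define the Decoding Part ### Build the Inference Program
Use the `encoder` and `decoder` functions defined above to infer the translated ids and scores. Define the `Program` used for inference, in which the inference network is built.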
```python ```python
context = encoder(is_sparse) infer_prog = fluid.Program()
translation_ids, translation_scores = decode(context, is_sparse) startup_prog = fluid.Program()
with fluid.program_guard(infer_prog, startup_prog):
with fluid.unique_name.guard():
translation_ids, translation_scores = infer_model()
``` ```
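### Define the Data ### Build the Data Provider
We first initialize the ids and scores to generate tensors as input data. In this prediction example, we use the first record of the `wmt14.test` data for inference, and finally print the corresponding sentences using the "source dictionary" and the "target dictionary". As in training, the wrapped `paddle.dataset.wmt16.test` interface is used here to define the test data generator. There are 1000 test examples in total, which are batched and used as inference input. We also obtain the id-to-word dictionaries of the source and target languages so that id sequences can be converted to plain text and printed.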
```python ```python
init_ids_data = np.array([1 for _ in range(batch_size)], dtype='int64')
init_scores_data = np.array(
[1. for _ in range(batch_size)], dtype='float32')
init_ids_data = init_ids_data.reshape((batch_size, 1))
init_scores_data = init_scores_data.reshape((batch_size, 1))
init_lod = [1] * batch_size
init_lod = [init_lod, init_lod]
init_ids = fluid.create_lod_tensor(init_ids_data, init_lod, place)
init_scores = fluid.create_lod_tensor(init_scores_data, init_lod, place)
test_data = paddle.batch( test_data = paddle.batch(
paddle.reader.shuffle( paddle.dataset.wmt16.test(source_dict_size, target_dict_size),
paddle.dataset.wmt14.test(dict_size), buf_size=1000),
batch_size=batch_size) batch_size=batch_size)
src_idx2word = paddle.dataset.wmt16.get_dict(
feed_order = ['src_word_id'] "en", source_dict_size, reverse=True)
feed_list = [ trg_idx2word = paddle.dataset.wmt16.get_dict(
framework.default_main_program().global_block().var(var_name) "de", target_dict_size, reverse=True)
for var_name in feed_order
]
feeder = fluid.DataFeeder(feed_list, place)
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
``` ```
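### Testing
Now we can run prediction. We provide the corresponding parameters in `feed_order` and run on the `executor` to obtain the id and score results. First load the model saved during training; then loop over the test data to run prediction. On each run we create a `dict` that maps each `data_layer` to its input data and pass it in, which has the same effect as `DataFeeder`. For each test example, the generation process prints the source sentence and `beam_size` generated sentences.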
```python ```python
exe = Executor(place) fluid.io.load_params(exe, model_save_dir, main_program=infer_prog)
exe.run(framework.default_startup_program())
for data in test_data():
for data in test_data(): src_word_id = fluid.create_lod_tensor(
feed_data = map(lambda x: [x[0]], data) data=[x[0] for x in data],
feed_dict = feeder.feed(feed_data) recursive_seq_lens=[[len(x[0]) for x in data]],
feed_dict['init_ids'] = init_ids place=place)
feed_dict['init_scores'] = init_scores # init_ids内容为start token
init_ids = fluid.create_lod_tensor(
results = exe.run( data=np.array([[0]] * len(data), dtype='int64'),
framework.default_main_program(), recursive_seq_lens=[[1] * len(data)] * 2,
feed=feed_dict, place=place)
# init_scores为beam search过程累积得分的初值
init_scores = fluid.create_lod_tensor(
data=np.array([[0.]] * len(data), dtype='float32'),
recursive_seq_lens=[[1] * len(data)] * 2,
place=place)
seq_ids, seq_scores = exe.run(
infer_prog,
feed={
'src_word_id': src_word_id,
'init_ids': init_ids,
'init_scores': init_scores
},
fetch_list=[translation_ids, translation_scores], fetch_list=[translation_ids, translation_scores],
return_numpy=False) return_numpy=False)
# 如何解析翻译结果详见 train.py 中对应代码的注释说明
result_ids = np.array(results[0]) hyps = [[] for i in range(len(seq_ids.lod()[0]) - 1)]
result_ids_lod = results[0].lod() scores = [[] for i in range(len(seq_scores.lod()[0]) - 1)]
result_scores = np.array(results[1]) for i in range(len(seq_ids.lod()[0]) - 1):
start = seq_ids.lod()[0][i]
end = seq_ids.lod()[0][i + 1]
print("Original sentence:") print("Original sentence:")
print(" ".join([src_dict[w] for w in feed_data[0][0][1:-1]])) print(" ".join([src_idx2word[idx] for idx in data[i][0][1:-1]]))
print("Translated score and sentence:") print("Translated score and sentence:")
for i in xrange(beam_size): for j in range(end - start):
start_pos = result_ids_lod[1][i] + 1 sub_start = seq_ids.lod()[1][start + j]
end_pos = result_ids_lod[1][i+1] sub_end = seq_ids.lod()[1][start + j + 1]
print("%d\t%.4f\t%s\n" % (i+1, result_scores[end_pos-1], hyps[i].append(" ".join([
" ".join([trg_dict[w] for w in result_ids[start_pos:end_pos]]))) trg_idx2word[idx]
for idx in np.array(seq_ids)[sub_start:sub_end][1:-1]
break ]))
scores[i].append(np.array(seq_scores)[sub_end - 1])
print(scores[i][-1], hyps[i][-1].encode('utf8'))
```
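The parsing loop above walks a two-level LoD: the first level delimits which hypotheses belong to each source sentence, and the second level delimits the tokens of each hypothesis inside one flat array. The sketch below reproduces that indexing on plain Python lists. The helper `split_hypotheses` and the sample ids and offsets are hypothetical and only illustrate the offset arithmetic.

```python
def split_hypotheses(flat_ids, lod):
    """Group a flat id list into [source sentence][hypothesis] via two offset levels."""
    sentence_offsets, hypothesis_offsets = lod
    result = []
    for i in range(len(sentence_offsets) - 1):
        # Hypotheses of sentence i are numbered start .. end - 1.
        start, end = sentence_offsets[i], sentence_offsets[i + 1]
        hyps = []
        for j in range(start, end):
            # Tokens of hypothesis j occupy flat_ids[sub_start:sub_end].
            sub_start, sub_end = hypothesis_offsets[j], hypothesis_offsets[j + 1]
            hyps.append(flat_ids[sub_start:sub_end])
        result.append(hyps)
    return result

# Two source sentences: the first has two hypotheses, the second has one.
flat_ids = [0, 7, 8, 1, 0, 7, 1, 0, 9, 1]
lod = [[0, 2, 3], [0, 4, 7, 10]]
grouped = split_hypotheses(flat_ids, lod)
print(grouped)
```

In the tutorial code the start and end tokens are additionally dropped with `[1:-1]` before the ids are mapped back to words.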
The following prediction output can be observed:
```txt
Original sentence:
Two adults and two children sit on a park bench .
Translated score and sentence:
-2.5993705 Zwei Erwachsene und zwei Kinder sitzen auf einer Parkbank .
-2.6617606 Zwei Erwachsene und zwei Kinder spielen auf einer Parkbank .
-3.186554 Zwei Erwachsene und zwei Kinder sitzen auf einer Bank .
-3.4353821 Zwei Erwachsene und zwei Kinder spielen auf einer Bank .
``` ```
## Summary
...@@ -487,4 +604,4 @@ for data in test_data(): ...@@ -487,4 +604,4 @@ for data in test_data():
5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318. 5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
# Machine Translation # Machine Translation
The source code of this tutorial is live at [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation). For instructions on getting started with this book,see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). Source code of this tutorial is in [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation). For users new to Paddle book, please refer to [the user guide of Book Documentation](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#run_the_book).
## Background ## Background
Machine translation (MT) leverages computers to translate from one language to another. The language to be translated is referred to as the source language, while the language to be translated into is referred to as the target language. Thus, Machine translation is the process of translating from the source language to the target language. It is one of the most important research topics in the field of natural language processing. Machine translation is to translate different languages with computer. The language to be translated is usually called source language, and the language representing the result of translation is called target language. Machine translation is the process of transformation from source language to target language, which is an important research assignment of Natural Language Processing.
Early machine translation systems were mostly rule-based: linguists wrote transformation rules between two languages, and these rules were then entered into a computer. This method demands strong linguistic expertise, yet it is hard to cover all the rules of even one language, let alone two or more. Therefore, the major challenge of traditional machine translation is the impossibility of obtaining a complete set of rules\[[1](#References)\].
Early machine translation systems are mainly rule-based i.e. they rely on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one language. So it is quite a challenge for language experts to specify all possible rules in two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty in obtaining a complete rule set \[[1](#references)\]. To solve the problem mentioned above, Statistical Machine Translation technology emerged afterwards. For Statistical Machine Translation, transformation rules are automatically learned from a large scale corpus instead of handcrafted rule. So it tackles with the limit of obtaining knowledge in rule-based machine translation systems. However, it still faces certain challenges: 1. man-made feature can never cover all language phenomena. 2. it is hard to use global feature. 3. it depends on many pre-processing parts, such as Word Alignment, Tokenization, Rule Extraction, Parsing. Errors accumulated by those parts will have a great influence on translation.
In recent years, deep learning has offered new solutions to overcome these bottlenecks. It is applied to machine translation in two ways: 1. Keeping the framework of the statistical machine translation system, neural networks are used to improve core components such as the language model and the reordering model (see the left part of Figure One). 2. Abandoning the framework of the statistical machine translation system, a neural network directly transforms the source language into the target language; this is End-to-End Neural Machine Translation (see the right part of Figure One), or NMT for short.
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/nmt.png?raw=true" width = "400" align=center/><br/>
Figure One. Neural Network Machine Translation System
</div>
To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example: In the following parts, we'll guide you through NMT model and its hands-on implementation in PaddlePaddle
1. Human designed features cannot cover all possible linguistic variations; ## Result Demo
2. It is difficult to use global features; Take Chinese to English translation model as an example. For a trained model, if input the following tokenized Chinese sentence :
3. The techniques heavily rely on pre-processing techniques like word alignment, word segmentation and tokenization, rule-extraction and syntactic parsing etc. The error introduced in any of these steps could accumulate and impact translation quality.
The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
1. Techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
2. Techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT).
<p align="center">
<img src="image/nmt_en.png" width=400><br/>
Figure 1. Neural Network based Machine Translation
</p>
This tutorial will mainly introduce an NMT model and how to use PaddlePaddle to train it.
## Illustrative Results
Let's consider an example of Chinese-to-English translation. The model is given the following segmented sentence in Chinese
```text ```text
这些 是 希望 的 曙光 和 解脱 的 迹象 . 这些 是 希望 的 曙光 和 解脱 的 迹象 .
``` ```
After training and with a beam-search size of 3, the generated translations are as follows:
If the number of translation results (i.e., the width of the [beam search algorithm](#beam search algorithm)) is set to 3, the generated English sentences are as follows:
```text ```text
0 -5.36816 These are signs of hope and relief . <e> 0 -5.36816 These are signs of hope and relief . <e>
1 -6.23177 These are the light of hope and relief . <e> 1 -6.23177 These are the light of hope and relief . <e>
2 -7.7914 These are the light of hope and the relief of hope . <e> 2 -7.7914 These are the light of hope and the relief of hope . <e>
``` ```
- The first column corresponds to the id of the generated sentence; the second column corresponds to the score of the generated sentence (in descending order), where a larger value indicates better quality; the last column corresponds to the generated sentence.
- There are two special tokens: `<e>` denotes the end of a sentence while `<unk>` denotes unknown word, i.e., a word not in the training dictionary.
## Overview of the Model - The first column from the left is the serial number of each generated sentence. The second column is the score of each sentence, in descending order; a higher score is better. The third column contains the generated English sentences.
This section will introduce Bi-directional Recurrent Neural Network, the Encoder-Decoder framework used in NMT, as well as the beam search algorithm.
### Bi-directional Recurrent Neural Network
We already introduced an instance of bi-directional RNN in the [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/label_semantic_roles/README.md) chapter. Here we present another bi-directional RNN model with a different architecture proposed by Bengio et al. in \[[2](#references),[4](#references)\]. This model takes a sequence as input and outputs a fixed dimensional feature vector at each step, encoding the context information at the corresponding time step. - In addition, there are two special marks. One is `<e>`, indicating the end of a sentence and another one is `<unk>`, representing unknown word, which have never appeared in dictionary.
Specifically, this bi-directional RNN processes the input sequence in the original and reverse order respectively, and then concatenates the output feature vectors at each time step as the final output. Thus the output node at each time step contains information from the past and future as context. The figure below shows an unrolled bi-directional RNN. This network contains a forward RNN and backward RNN with six weight matrices: weight matrices from input to forward hidden layer and backward hidden ($W_1, W_3$), weight matrices from hidden to itself ($W_2, W_5$), matrices from forward hidden and backward hidden to output layer ($W_4, W_6$). Note that there are no connections between forward hidden and backward hidden layers. ## Exploration of Models
<p align="center"> This section examines the bi-directional recurrent neural network, the typical Encoder-Decoder structure of NMT models, and the beam search algorithm.
<img src="image/bi_rnn_en.png" width=450><br/>
Figure 3. Temporally unrolled bi-directional RNN
</p>
### Encoder-Decoder Framework ### Bi-directional Recurrent Neural Network
The Encoder-Decoder\[[2](#references)\] framework aims to solve the mapping of a sequence to another sequence, for sequences with arbitrary lengths. The source sequence is encoded into a vector via an encoder, which is then decoded to a target sequence via a decoder by maximizing the predictive probability. Both the encoder and the decoder are typically implemented via RNN.
<p align="center"> We introduced a bi-directional recurrent neural network in the chapter [label_semantic_roles](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.md). Here we introduce another network, proposed by Bengio's team in \[[2](#References),[4](#References)\]. This network takes a sequence as input and produces its features at each time step. Specifically, the output at each time step is a fixed-length vector that represents the contextual semantic information at that step.
<img src="image/encoder_decoder_en.png" width=700><br/>
Figure 4. Encoder-Decoder Framework
</p>
#### Encoder Concretely, the bi-directional recurrent neural network processes the input sequence along the time dimension in both the original and the reverse order, i.e., forward and backward. The RNN outputs at each time step are concatenated to form the final output layer, so the output node at each time step contains the complete past and future context of the current input position. The figure below shows a bi-directional recurrent neural network unrolled over time steps. The network consists of a forward and a backward RNN with six weight matrices: the weight matrices from the input layer to the forward and backward hidden layers ($W_1, W_3$), the weight matrices from each hidden layer to itself ($W_2, W_5$), and the weight matrices from the forward and backward hidden layers to the output layer ($W_4, W_6$). Note that there is no connection between the forward hidden layer and the backward hidden layer.
There are three steps for encoding a sentence:
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\epsilon \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$ where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere. <div align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/bi_rnn.png?raw=true" width="400"><br/>
Figure 2. Bi-directional Recurrent Neural Network expanded by time step.
</div>
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation ### Encoder-Decoder Structure
* The dimensionality of the vector is typically large, leading to the curse of dimensionality; The Encoder-Decoder\[[2](#References)\] structure helps transform a source sequence with arbitrary length to another target sequence with arbitrary length. In the encoding phase, it encodes the entire source sequence into a vector. And in the decoding phase, it decodes the entire target sequence by maximizing the predicted sequence probability. The encoding and decoding process is usually implemented by RNN.
* It is hard to capture the relationships between words, i.e., semantic similarities. Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, with $C\epsilon R^{K\times \left | V \right |}$ as the projection matrix and $K$ is the dimensionality of the word embedding vector. <div align="center">
<img src ="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/encoder_decoder.png?raw=true" width="700"><br/>
Figure 3. Encoder-Decoder Framework
</div>
3. Encoding of the source sequence via RNN: This can be described mathematically as: <a name="encoder"></a>
#### Encoder
$$h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$$ Encoding can be done in 3 steps:
where 1.One-hot vector : each word $x_i$ of the source language sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a column vector $w_i\epsilon \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$. The dimension of this vector $w_i$ is the same as the vocabulary size $\left | V \right |$, and there is only one dimension with a value of 1 (the position corresponds to the position of the word in the vocabulary), and the rest are all zero.
$h_0$ is a zero vector,
$\varnothing _\theta$ is a non-linear activation function, and
$\mathbf{h}=\left \{ h_1,..., h_T \right \}$
is the sequential encoding of the first $T$ words from the source sequence. The vector representation of the whole sentence can be represented as the encoding vector at the last time step $T$ from $\mathbf{h}$, or by temporal pooling over $\mathbf{h}$.
2.Word vector mapped to a low-dimensional semantic space: the one-hot representation has two problems: (1) the dimension of the generated vector is often large, which easily leads to the curse of dimensionality; (2) it is difficult to characterize relationships between words (such as semantic similarity), so semantics are not expressed well. Therefore, the one-hot vector is mapped to a low-dimensional semantic space and represented by a dense vector of fixed dimension (called a word vector). The mapping matrix is $C\epsilon R^{K\times \left | V \right |}$, $s_i=Cw_i$ is the word vector of the $i$-th word, and $K$ is the dimension of the word vector.
Bi-directional RNN can also be used in step (3) for a more complicated sentence encoding. This can be implemented using a bi-directional GRU. Forward GRU encodes the source sequence in its original order $(x_1,x_2,...,x_T)$, and generates a sequence of hidden states $(\overrightarrow{h_1},\overrightarrow{h_2},...,\overrightarrow{h_T})$. The backward GRU encodes the source sequence in reverse order, i.e., $(x_T,x_{T-1},...,x_1)$ and generates $(\overleftarrow{h_1},\overleftarrow{h_2},...,\overleftarrow{h_T})$. Then for each word $x_i$, its complete hidden state is the concatenation of the corresponding hidden states from the two GRUs, i.e., $h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T}$. 3.Encode the source language word sequence with an RNN: the computation is $h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$, where $h_0$ is an all-zero vector, $\varnothing _\theta$ is a non-linear activation function, and $\mathbf{h}=\left \{ h_1,..., h_T \right \}$ is the sequence of hidden states produced as the RNN reads the $T$ source language words. The vector representation of the entire sentence can be the state $h_T$ at the last time step, or the result of pooling $\mathbf{h}$ over the time dimension.
<p align="center"> Step 3 can also use a bi-directional recurrent neural network to build a richer sentence encoding, for example with a bi-directional GRU. The forward GRU encodes the source language words in the order of the word sequence $(x_1, x_2,..., x_T)$ and obtains a series of hidden layer states $(\overrightarrow{h_1},\overrightarrow{h_2},...,\overrightarrow{h_T})$. Similarly, the backward GRU encodes the source language words in the order $(x_T,x_{T-1},...,x_1)$, resulting in $(\overleftarrow{h_1},\overleftarrow{h_2},...,\overleftarrow{h_T})$. Finally, for the word $x_i$, the hidden layer state is obtained by concatenating the states of the two GRUs, namely $h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T}$.
<img src="image/encoder_attention_en.png" width=500><br/> <div align="center">
Figure 5. Encoder using bi-directional GRU <img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/encoder_attention.png?raw=true" width="400"><br/>
</p> Figure 4. Encoder using bi-directional GRU
</div>
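The three encoding steps can be sketched in plain Python. This is only an illustration under stated assumptions: it uses a simple `tanh` recurrence in place of the GRU, and the toy sizes, random weights, and helper names (`one_hot`, `matvec`, `rnn_encode`, `random_matrix`) are hypothetical and not part of the tutorial's code.

```python
import math
import random

def one_hot(index, vocab_size):
    """Column vector w_i with a single 1 at the word's vocabulary position."""
    return [1.0 if j == index else 0.0 for j in range(vocab_size)]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_encode(embeddings, W_in, W_rec):
    """Run h_i = tanh(W_in s_i + W_rec h_{i-1}) with h_0 = 0; return every h_i."""
    hidden = [0.0] * len(W_rec)
    states = []
    for s in embeddings:
        pre = [a + b for a, b in zip(matvec(W_in, s), matvec(W_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states

def random_matrix(rows, cols, rng):
    """Small random weights, used here only to make the sketch runnable."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
V, K, H = 6, 3, 4                  # toy vocabulary, embedding and hidden sizes
C = random_matrix(K, V, rng)       # projection matrix C in R^{K x |V|}
sentence = [2, 5, 1]               # word ids x_1..x_T

# Steps 1 and 2: one-hot vectors, then s_i = C w_i (a column lookup in C).
s = [matvec(C, one_hot(i, V)) for i in sentence]

# Step 3, bi-directional variant: separate weights for each direction.
fwd = rnn_encode(s, random_matrix(H, K, rng), random_matrix(H, H, rng))
bwd = rnn_encode(s[::-1], random_matrix(H, K, rng), random_matrix(H, H, rng))[::-1]

# h_i concatenates the forward and backward states for the same position i.
h = [f + b for f, b in zip(fwd, bwd)]
```

Because `w_i` is one-hot, the product `C w_i` simply selects one column of `C`, which is why embedding layers are implemented as table lookups rather than matrix multiplications.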
#### Decoder #### Decoder
The goal of the decoder is to maximize the probability of the next correct word in the target language. The main idea is as follows: During training of the machine translation task, the goal of the decoding phase is to maximize the probability of the next correct target language word. The idea is:
1.At each time, according to the encoding information of the source language sentence (also called context vector) $c$, the $i$th word $u_i$ of the real target language sequence and the hidden layer state $z_i $ of RNN at $i$th time, it calculates the next hidden layer state $z_{i+1}$. The formula is as follows:
$$z_{i+1}=\phi_{\theta '} \left ( c,u_i,z_i \right )$$
Where $\phi _{\theta '}$ is a nonlinear activation function; $c$ is the context vector of the source language sentence, if the attention mechanism is not used and the output of [encoder](#encoder) is the last element after the source language sentence is encoded, you can define $c=h_T$; $u_i$ is the $i$th word of the target language sequence, and $u_0$ is the start tag `<s>` of the target language sequence, indicating the start of decoding; $z_i$ is the hidden layer state of the RNN at $i$th time, and $z_0$ is an all-zero vector.
1. At each time step $i$, given the encoding vector (or context vector) $c$ of the source sentence, the $i$-th word $u_i$ from the ground-truth target language and the RNN hidden state $z_i$, the next hidden state $z_{i+1}$ is computed as: 2.Normalize $z_{i+1}$ by `softmax` to get the probability distribution $p_{i+1}$ of the $i+1$th word of the target language sequence. The probability distribution formula is as follows:
$$p\left ( u_{i+1}|u_{<i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
Where $W_sz_{i+1}+b_z$ scores each possible output word and normalizes with softmax to get the probability $p_{i+1}$ of $i+1$th word.
$$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$ 3.Calculate the cost according to $p_{i+1}$ and $u_{i+1}$.
where $\phi _{\theta '}$ is a non-linear activation function and $c=q\mathbf{h}$ is the context vector of the source sentence. Without using [attention](#Attention Mechanism), if the output of the [encoder](#Encoder) is the encoding vector at the last time step of the source sentence, then $c$ can be defined as $c=h_T$. $u_i$ denotes the $i$-th word from the target language sentence and $u_0$ denotes the beginning of the target language sentence (i.e., `<s>`), indicating the beginning of decoding. $z_i$ is the RNN hidden state at time step $i$ and $z_0$ is an all zero vector.
2. Calculate the probability $p_{i+1}$ for the $i+1$-th word in the target language sequence by normalizing $z_{i+1}$ using `softmax` as follows 4.Repeat steps 1~3 until all words in the target language sequence have been processed.
$$p\left ( u_{i+1}|u_{<i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$ At generation time, machine translation generally translates source language sentences with a pre-trained model. Decoding during generation differs from the training-time decoding described above. For details, please refer to [Beam Search Algorithm](#Beam Search Algorithm).
where $W_sz_{i+1}+b_z$ scores each possible words and is then normalized via softmax to produce the probability $p_{i+1}$ for the $i+1$-th word. <a name="beam search algorithm"></a>
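The softmax normalization and the per-step cost of the training decoder can be illustrated with a small sketch. The score values and the helper names `softmax` and `step_cost` are assumptions made for this example; the cost shown is the negative log-probability of the ground-truth next word, which is one common choice and not necessarily the exact cost used in `train.py`.

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def step_cost(scores, target_id):
    """Negative log-probability of the ground-truth next word u_{i+1}."""
    p = softmax(scores)
    return -math.log(p[target_id])

# Hypothetical scores W_s z_{i+1} + b_z over a 4-word target vocabulary.
scores = [2.0, 0.5, -1.0, 0.5]
p_next = softmax(scores)                     # distribution p_{i+1}
cost = step_cost(scores, target_id=0)        # cost when the correct word has id 0
```

Summing this cost over all positions of the target sentence gives a sequence-level training objective, so maximizing the probability of each correct next word is the same as minimizing the total cost.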
### Beam Search Algorithm
3. Compute the cost according to $p_{i+1}$ and $u_{i+1}$. [Beam search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic graph search algorithm that explores a graph or tree by expanding only the most promising nodes in a limited set. It is usually used in systems with a very large solution space (such as machine translation and speech recognition), where memory cannot hold all expanded solutions. For example, to translate "`<s>Hello<e>`" in a machine translation task, even if the target language dictionary contains only 3 words (`<s>`, `<e>`, `hello`), it is possible to generate infinitely many sentences (the number of occurrences of `hello` is unbounded). Beam search can be used to find a good translation among them.
4. Repeat Steps 1-3, until all the words in the target language sentence have been processed.
The generation process of machine translation is to translate the source sentence into a sentence in the target language according to a pre-trained model. There are some differences between the decoding step in generation and training. Please refer to [Beam Search Algorithm](#Beam Search Algorithm) for details. The beam search algorithm uses a breadth-first strategy to build a search tree. At each level of the tree, the nodes are sorted according to the heuristic cost (in this tutorial, the sum of the log probabilities of the generated words), and only a predetermined number of nodes (commonly referred to in the literature as the beam width or beam size) are kept. Only these nodes are expanded at the next level, and the others are pruned; that is, higher-quality nodes are retained and lower-quality nodes are discarded. This greatly reduces the space and time required by the search, at the cost of no longer guaranteeing an optimal solution.
### Beam Search Algorithm When decoding with the beam search algorithm, the goal is to maximize the probability of the generated sequence. The idea is:
1.At each time step, the next hidden layer state $z_{i+1}$ is calculated from the encoding $c$ of the source language sentence, the $i$-th generated target language word $u_i$, and the RNN hidden layer state $z_i$ at step $i$.
[Beam Search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. It is typically used when the solution space is huge (e.g., for machine translation, speech recognition), and there is not enough memory for all the possible solutions. For example, if we want to translate “`<s>你好<e>`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, where the word `hello` can appear different number of times. Beam search could be used to find a good translation among them. 2.Normalize $z_{i+1}$ by `softmax` to get the probability distribution $p_{i+1}$ of the $i+1$th words of the target language sequence.
Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed. 3.The word $u_{i+1}$ is sampled according to $p_{i+1}$.
The goal is to maximize the probability of the generated sequence when using beam search in decoding, The procedure is as follows: 4.Repeat steps 1~3 until you get the sentence end tag `<e>` or exceed the maximum generation length of the sentence.
1. At each time step $i$, compute the hidden state $z_{i+1}$ of the next time step according to the context vector $c$ of the source sentence, the $i$-th word $u_i$ generated for the target language sentence and the RNN hidden state $z_i$. Note: The formula for $z_{i+1}$ and $p_{i+1}$ is the same as in [Decoder](#Decoder). And since each step of the generation is implemented by the greedy method, it is not guaranteed to obtain the global optimal solution.
2. Normalize $z_{i+1}$ using `softmax` to get the probability $p_{i+1}$ for the $i+1$-th word for the target language sentence.
3. Sample the word $u_{i+1}$ according to $p_{i+1}$.
4. Repeat Steps 1-3, until end-of-sentence token `<e>` is generated or the maximum length of the sentence is reached.
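The procedure above can be sketched as a small, framework-free beam search. This is a minimal illustration, not the tutorial's `pd.beam_search` implementation: the `next_probs` callback and the `toy_model` distribution are hypothetical, and at each level the sketch keeps the `beam_size` highest-scoring candidates (scored by the sum of log probabilities) rather than sampling.

```python
import math

def beam_search(next_probs, beam_size, max_length, start="<s>", end="<e>"):
    """Return finished hypotheses as (log_prob, tokens), best first.

    next_probs(tokens) must return a dict that maps each candidate next
    word to its probability given the prefix `tokens`.
    """
    beam = [(0.0, [start])]                # (accumulated log-probability, prefix)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, tokens in beam:
            for word, prob in next_probs(tokens).items():
                if prob > 0.0:
                    candidates.append((score + math.log(prob), tokens + [word]))
        # Keep only the beam_size best expansions at this level of the tree.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((score, tokens))   # hypothesis is complete
            else:
                beam.append((score, tokens))       # expand again at the next level
        if not beam:
            break
    finished.sort(key=lambda item: item[0], reverse=True)
    return finished

def toy_model(tokens):
    """Hypothetical model: after <s> prefer 'hello', afterwards prefer to stop."""
    if tokens[-1] == "<s>":
        return {"hello": 0.9, "<e>": 0.1}
    return {"hello": 0.3, "<e>": 0.7}

results = beam_search(toy_model, beam_size=2, max_length=8)
best_score, best_tokens = results[0]
```

With this toy distribution the best hypothesis is `<s> hello <e>`, with probability 0.9 × 0.7 = 0.63. Hypotheses that fall outside the beam are discarded for good, which is why the result is not guaranteed to be globally optimal.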
Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder). In generation mode, each step is greedy, so there is no guarantee of a global optimum. ## Data Preparation
## BLEU Score This tutorial uses [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) from the [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) dataset as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the test set and generation set.
Bilingual Evaluation Understudy (BLEU) is a metric widely used for automatic evaluation of machine translation, proposed by IBM Watson Research Center in 2002\[[5](#References)\]. The closer the translation produced by a machine is to the translation produced by a human expert, the better the performance of the translation system. ### Data Pre-processing
To measure the closeness between machine translation and human translation, sentence precision is used. It compares the number of matched n-grams. More matches will lead to higher BLEU scores. Pre-processing consists of two steps:
## Data Preparation -Merge parallel corpora files from source language to target language into one file:
This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set. -Merge every `XXX.src` and `XXX.trg` into one file as `XXX`.
-The $i$-th row of `XXX` is the concatenation of the $i$-th row of `XXX.src` and the $i$-th row of `XXX.trg`, separated by '\t'.
### Data Preprocessing -Create source language dictionary and target language dictionary of train data. There are **DICTSIZE** words in each dictionary, including: (DICTSIZE - 3) words with highest frequency in the corpus, and 3 special symbols `<s>` (the beginning of the sequence), `<e>` ( the end of the sequence) and `<unk>` (unknown word).
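The two pre-processing steps can be sketched on in-memory data. The sample sentences and the helper names `merge_parallel` and `build_dict` are hypothetical, and the order of the special-token ids is an assumption for illustration; the real pipeline reads the `XXX.src` and `XXX.trg` files from disk.

```python
from collections import Counter

def merge_parallel(src_lines, trg_lines):
    """Join the i-th source line and the i-th target line with a tab."""
    return [s.strip() + "\t" + t.strip() for s, t in zip(src_lines, trg_lines)]

def build_dict(sentences, dict_size):
    """Map words to ids: 3 special tokens plus the (dict_size - 3) most frequent words."""
    counts = Counter(word for line in sentences for word in line.split())
    words = ["<s>", "<e>", "<unk>"]          # assumed id order of the special tokens
    words += [w for w, _ in counts.most_common(dict_size - 3)]
    return {word: idx for idx, word in enumerate(words)}

# Toy stand-ins for the contents of XXX.src and XXX.trg.
src = ["les chats dorment", "les chiens dorment"]
trg = ["cats sleep", "dogs sleep"]

merged = merge_parallel(src, trg)            # one "source<TAB>target" line per pair
src_dict = build_dict(src, dict_size=5)      # 3 special tokens + 2 frequent words
```

Words that do not make it into the dictionary, such as `chats` here, would be mapped to the id of `<unk>` when the corpus is converted to id sequences.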
There are two steps for pre-processing: ### Sample Data
- Merge the source and target parallel corpus files into one file
- Merge `XXX.src` and `XXX.trg` file pair as `XXX`
- The $i$-th row in `XXX` is the concatenation of the $i$-th row from `XXX.src` with the $i$-th row from `XXX.trg`, separated with '\t'.
- Create source dictionary and target dictionary, each containing **DICTSIZE** number of words, including the most frequent (DICTSIZE - 3) words from the corpus and 3 special tokens `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary). Because the complete dataset is large, the PaddlePaddle interface `paddle.dataset.wmt14` provides a pre-processed [smaller scale dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) by default so that the training process can be verified.
### A Subset of Dataset This dataset contains 193,319 training samples, 6,003 test samples, and a dictionary of size 30,000. Because of the limited data size, the quality of models trained on this dataset is not guaranteed.
Because the full dataset is very large, the PaddlePaddle package `paddle.dataset.wmt14` provides a preprocessed [subset of the dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) to reduce download time. ## Model Configuration
This subset has 193319 instances of training data and 6003 instances of test data. Dictionary size is 30000. Because of the limitation of size of the subset, the effectiveness of trained model from this subset is not guaranteed. Next we start configuring model according to input data. First we import necessary library functions and define global variables.
## Model Configuration
Our program starts with importing necessary packages and initializing some global variables:
```python ```python
from __future__ import print_function from __future__ import print_function
...@@ -190,41 +165,51 @@ except ImportError: ...@@ -190,41 +165,51 @@ except ImportError:
from paddle.fluid.trainer import * from paddle.fluid.trainer import *
from paddle.fluid.inferencer import * from paddle.fluid.inferencer import *
dict_size = 30000 dict_size = 30000 # dictionary dimension
source_dict_dim = target_dict_dim = dict_size source_dict_dim = target_dict_dim = dict_size # source/target language dictionary dimension
hidden_dim = 32 hidden_dim = 32 # size of hidden layer in encoder
word_dim = 16 word_dim = 16 # dimension of word vector
batch_size = 2 batch_size = 2 # the number of samples in batch
max_length = 8 max_length = 8 # the maximum length of generated sentence
topk_size = 50 beam_size = 2 # width of beam
beam_size = 2
decoder_size = hidden_dim decoder_size = hidden_dim # size of hidden layer in decoder
``` ```
Then we implement encoder as follows:
```python
Then the frame of encoder is implemented as follows:
```python
def encoder(is_sparse): def encoder(is_sparse):
# define input data id of source language
src_word_id = pd.data( src_word_id = pd.data(
name="src_word_id", shape=[1], dtype='int64', lod_level=1) name="src_word_id", shape=[1], dtype='int64', lod_level=1)
# map the word ids above to word vectors in a low-dimensional semantic space
src_embedding = pd.embedding( src_embedding = pd.embedding(
input=src_word_id, input=src_word_id,
size=[dict_size, word_dim], size=[dict_size, word_dim],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse,
param_attr=fluid.ParamAttr(name='vemb')) param_attr=fluid.ParamAttr(name='vemb'))
# LSTM layer:fc + dynamic_lstm
fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh') fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh')
lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4) lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4)
# Fetch the final state after the sequence encode of source language
encoder_out = pd.sequence_last_step(input=lstm_hidden0) encoder_out = pd.sequence_last_step(input=lstm_hidden0)
return encoder_out return encoder_out
``` ```
Then implement decoder in training mode:
Implement the decoder for training as follows:
```python ```python
def train_decoder(context, is_sparse): def train_decoder(context, is_sparse):
# define the input word ids of the target language sequence and map them to word vectors in a low-dimensional semantic space
trg_language_word = pd.data( trg_language_word = pd.data(
name="target_language_word", shape=[1], dtype='int64', lod_level=1) name="target_language_word", shape=[1], dtype='int64', lod_level=1)
trg_embedding = pd.embedding( trg_embedding = pd.embedding(
...@@ -235,35 +220,44 @@ Implement the decoder for training as follows: ...@@ -235,35 +220,44 @@ Implement the decoder for training as follows:
param_attr=fluid.ParamAttr(name='vemb')) param_attr=fluid.ParamAttr(name='vemb'))
rnn = pd.DynamicRNN() rnn = pd.DynamicRNN()
with rnn.block(): with rnn.block(): # use DynamicRNN to define computation at each step
# Fetch input word vector of target language at present step
current_word = rnn.step_input(trg_embedding) current_word = rnn.step_input(trg_embedding)
# obtain state of hidden layer
pre_state = rnn.memory(init=context) pre_state = rnn.memory(init=context)
# computing unit of decoder: single-layer forward network
current_state = pd.fc(input=[current_word, pre_state], current_state = pd.fc(input=[current_word, pre_state],
size=decoder_size, size=decoder_size,
act='tanh') act='tanh')
# compute the normalized predicted probability of each word
current_score = pd.fc(input=current_state, current_score = pd.fc(input=current_state,
size=target_dict_dim, size=target_dict_dim,
act='softmax') act='softmax')
# update hidden layer of RNN
rnn.update_memory(pre_state, current_state) rnn.update_memory(pre_state, current_state)
# output predicted probability
rnn.output(current_score) rnn.output(current_score)
return rnn() return rnn()
``` ```
Implement the decoder for prediction as follows:
Implement the decoder in inference mode:
```python ```python
def decode(context, is_sparse): def decode(context, is_sparse):
init_state = context init_state = context
# define counter variable in the decoding
array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length) array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length)
counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True) counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True)
# fill the first element with init_state # define tensor array to save content at each time step, and write initial id, score and state
state_array = pd.create_array('float32') state_array = pd.create_array('float32')
pd.array_write(init_state, array=state_array, i=counter) pd.array_write(init_state, array=state_array, i=counter)
# ids, scores as memory
ids_array = pd.create_array('int64') ids_array = pd.create_array('int64')
scores_array = pd.create_array('float32') scores_array = pd.create_array('float32')
...@@ -274,34 +268,35 @@ def decode(context, is_sparse): ...@@ -274,34 +268,35 @@ def decode(context, is_sparse):
pd.array_write(init_ids, array=ids_array, i=counter) pd.array_write(init_ids, array=ids_array, i=counter)
pd.array_write(init_scores, array=scores_array, i=counter) pd.array_write(init_scores, array=scores_array, i=counter)
# define conditional variable to stop loop
cond = pd.less_than(x=counter, y=array_len) cond = pd.less_than(x=counter, y=array_len)
# define while_op
while_op = pd.While(cond=cond) while_op = pd.While(cond=cond)
with while_op.block(): with while_op.block(): # define the computing of each step
# obtain input at present step of decoder, including id chosen at previous step, corresponding score and state at previous step.
pre_ids = pd.array_read(array=ids_array, i=counter) pre_ids = pd.array_read(array=ids_array, i=counter)
pre_state = pd.array_read(array=state_array, i=counter) pre_state = pd.array_read(array=state_array, i=counter)
pre_score = pd.array_read(array=scores_array, i=counter) pre_score = pd.array_read(array=scores_array, i=counter)
# expand the lod of pre_state to be the same with pre_score # update input state as state correspondent with id chosen at previous step
pre_state_expanded = pd.sequence_expand(pre_state, pre_score) pre_state_expanded = pd.sequence_expand(pre_state, pre_score)
# computing logic of decoder under the same train mode, including input vector and computing unit of decoder
# compute predicting probability of normalized word
pre_ids_emb = pd.embedding( pre_ids_emb = pd.embedding(
input=pre_ids, input=pre_ids,
size=[dict_size, word_dim], size=[dict_size, word_dim],
dtype='float32', dtype='float32',
is_sparse=is_sparse) is_sparse=is_sparse)
# use rnn unit to update rnn
current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb], current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb],
size=decoder_size, size=decoder_size,
act='tanh') act='tanh')
current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score) current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score)
# use score to do beam search
current_score = pd.fc(input=current_state_with_lod, current_score = pd.fc(input=current_state_with_lod,
size=target_dict_dim, size=target_dict_dim,
act='softmax') act='softmax')
topk_scores, topk_indices = pd.topk(current_score, k=beam_size) topk_scores, topk_indices = pd.topk(current_score, k=beam_size)
# calculate accumulated scores after topk to reduce computation cost
# compute accumulated score and perform beam search
accu_scores = pd.elementwise_add( accu_scores = pd.elementwise_add(
x=pd.log(topk_scores), y=pd.reshape(pre_score, shape=[-1]), axis=0) x=pd.log(topk_scores), y=pd.reshape(pre_score, shape=[-1]), axis=0)
selected_ids, selected_scores = pd.beam_search( selected_ids, selected_scores = pd.beam_search(
...@@ -314,14 +309,12 @@ def decode(context, is_sparse): ...@@ -314,14 +309,12 @@ def decode(context, is_sparse):
level=0) level=0)
pd.increment(x=counter, value=1, in_place=True) pd.increment(x=counter, value=1, in_place=True)
# write search result and corresponding hidden layer into tensor array
# update the memories
pd.array_write(current_state, array=state_array, i=counter) pd.array_write(current_state, array=state_array, i=counter)
pd.array_write(selected_ids, array=ids_array, i=counter) pd.array_write(selected_ids, array=ids_array, i=counter)
pd.array_write(selected_scores, array=scores_array, i=counter) pd.array_write(selected_scores, array=scores_array, i=counter)
# update the break condition: up to the max length or all candidates of # update condition to stop loop
# source sentences have ended.
length_cond = pd.less_than(x=counter, y=array_len) length_cond = pd.less_than(x=counter, y=array_len)
finish_cond = pd.logical_not(pd.is_empty(x=selected_ids)) finish_cond = pd.logical_not(pd.is_empty(x=selected_ids))
pd.logical_and(x=length_cond, y=finish_cond, out=cond) pd.logical_and(x=length_cond, y=finish_cond, out=cond)
...@@ -333,8 +326,9 @@ def decode(context, is_sparse): ...@@ -333,8 +326,9 @@ def decode(context, is_sparse):
``` ```
Then we define a `training_program` that uses the result from `encoder` and `train_decoder` to compute the cost with label data.
Also define `optimizer_func` to specify the optimizer.
Furthermore, we define a `train_program` that uses the result computed by `inference_program` and computes the error against the labeled data. We also define an `optimizer_func` to specify the optimizer.
```python ```python
def train_program(is_sparse): def train_program(is_sparse):
...@@ -354,21 +348,18 @@ def optimizer_func(): ...@@ -354,21 +348,18 @@ def optimizer_func():
regularization_coeff=0.1)) regularization_coeff=0.1))
``` ```
## Model Training ## Train Model
### Specify training environment ### Define Training Environment
Define your training environment and specify whether training runs on the CPU or the GPU.
Specify your training environment, you should specify if the training is on CPU or GPU.
```python ```python
use_cuda = False use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
``` ```
### Datafeeder Configuration ### Define Data Provider
The next step is to define the data providers for training and testing. A data provider reads a batch of `BATCH_SIZE` samples each time. `paddle.dataset.wmt14.train` provides batches of `BATCH_SIZE` samples after shuffling; the size of the shuffle buffer is `buf_size`.
Next we define data feeders for test and train. The feeder reads a `buf_size` of data each time and feed them to the training/testing process.
`paddle.dataset.wmt14.train` will yield records during each pass, after shuffling, a batch input of `BATCH_SIZE` is generated for training.
```python ```python
train_reader = paddle.batch( train_reader = paddle.batch(
...@@ -378,8 +369,7 @@ train_reader = paddle.batch( ...@@ -378,8 +369,7 @@ train_reader = paddle.batch(
``` ```
### Create Trainer ### Create Trainer
Trainer needs a train program and a train optimizer.
Create a trainer that takes `train_program` as input and specify optimizer function.
```python ```python
is_sparse = False is_sparse = False
...@@ -389,20 +379,17 @@ trainer = Trainer( ...@@ -389,20 +379,17 @@ trainer = Trainer(
optimizer_func=optimizer_func) optimizer_func=optimizer_func)
``` ```
### Feeding Data ### Provide Data
`feed_order` specifies the correspondence between each yielded record and `paddle.layer.data`. For instance, the first column of data generated by `wmt14.train` corresponds to `src_word_id`. `feed_order` defines the mapping between each column of generated data and `paddle.layer.data`. For example, the first column of data generated by `wmt14.train` corresponds to the feature `src_word_id`.
```python ```python
feed_order = [ feed_order = ['src_word_id', 'target_language_word', 'target_language_next_word'
'src_word_id', 'target_language_word', 'target_language_next_word' ]
]
``` ```
### Event Handler ### Event Handler
The callback function `event_handler` is called whenever a predefined event is triggered. For example, we can examine the loss after each training step.
The callback function `event_handler` is called during training when a pre-defined event happens.
For example, we can check the cost with `trainer.test` when an `EndStepEvent` occurs.
```python ```python
def event_handler(event): def event_handler(event):
...@@ -414,10 +401,8 @@ def event_handler(event): ...@@ -414,10 +401,8 @@ def event_handler(event):
trainer.stop() trainer.stop()
``` ```
### Start Training
### Training Finally, we feed in `num_epoch` and other parameters and call `trainer.train` to start training.
Finally, we invoke `trainer.train` to start training with `num_epochs` and other parameters.
```python ```python
EPOCH_NUM = 1 EPOCH_NUM = 1
...@@ -429,20 +414,20 @@ trainer.train( ...@@ -429,20 +414,20 @@ trainer.train(
feed_order=feed_order) feed_order=feed_order)
``` ```
## Inference ## Model Application
### Define the decode part ### Define Decoder Part
Use the `encoder` and `decoder` function we defined above to predict translation ids and scores. Use `encoder` and `decoder` function defined above to infer corresponding id and score after the translation.
```python ```python
context = encoder(is_sparse) context = encoder(is_sparse)
translation_ids, translation_scores = decode(context, is_sparse) translation_ids, translation_scores = decode(context, is_sparse)
``` ```
### Define DataSet ### Define Data
We initialize ids and scores and create tensors for input. In this test we are using first record data from `wmt14.test` for inference. At the end we get src dict and target dict for printing out results later. First we initialize id and score to generate tensor as input data. In this prediction, we use the first record in `wmt14.test` to infer and finally use "source language dictionary" and "target language dictionary" to output corresponding sentence.
```python ```python
init_ids_data = np.array([1 for _ in range(batch_size)], dtype='int64') init_ids_data = np.array([1 for _ in range(batch_size)], dtype='int64')
...@@ -471,9 +456,9 @@ feeder = fluid.DataFeeder(feed_list, place) ...@@ -471,9 +456,9 @@ feeder = fluid.DataFeeder(feed_list, place)
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size) src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
``` ```
### Infer ### Test
Now we can start predicting. We need provide corresponding parameters in `feed_order` and run it on `executor` to obtain id and score.
We create `feed_dict` with all the inputs we need and run with `executor` to get predicted results id and corresponding scores.
```python ```python
exe = Executor(place) exe = Executor(place)
...@@ -507,6 +492,11 @@ for data in test_data(): ...@@ -507,6 +492,11 @@ for data in test_data():
break break
``` ```
## Summary
End-to-end neural machine translation is a recently acclaimed machine translation method. In this section, we introduced the typical Encoder-Decoder framework of NMT. Because NMT is a typical Seq2Seq (Sequence to Sequence) learning task, other Seq2Seq tasks, such as query rewriting, summarization, and single-round dialogue, can also be tackled by this model.
<a name="References"></a>
## References ## References
1. Koehn P. [Statistical machine translation](https://books.google.com.hk/books?id=4v_Cx1wIMLkC&printsec=frontcover&hl=zh-CN&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)[M]. Cambridge University Press, 2009. 1. Koehn P. [Statistical machine translation](https://books.google.com.hk/books?id=4v_Cx1wIMLkC&printsec=frontcover&hl=zh-CN&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)[M]. Cambridge University Press, 2009.
...@@ -516,4 +506,5 @@ for data in test_data(): ...@@ -516,4 +506,5 @@ for data in test_data():
5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318. 5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318.
<br/> <br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
...@@ -44,6 +44,11 @@ ...@@ -44,6 +44,11 @@
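The source code for this tutorial is in [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation). If this is your first time here, please refer to the [Book documentation instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书). The source code for this tutorial is in [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation). If this is your first time here, please refer to the [Book documentation instructions](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#运行这本书).
### Notes
1. Hardware requirements: this tutorial can run on either CPU or GPU.
2. CUDA/cuDNN support in the Docker image: if you use the Docker image that accompanies this tutorial, note that its GPU support is limited to CUDA 8 and cuDNN 5.
3. Differences between the code in this document and train.py: to make this tutorial easier to read and use, we split up and rearranged the code from train.py before placing it here. The code in this document produces the same results as train.py. To see the output of the training script directly, run [train.py](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/train.py).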
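## Background ## Background
Machine translation (MT) is the technology of using computers to translate between languages. The language being translated is usually called the source language, and the language of the result is called the target language. Machine translation carries out the conversion from source language to target language and is one of the major research areas in natural language processing. Machine translation (MT) is the technology of using computers to translate between languages. The language being translated is usually called the source language, and the language of the result is called the target language. Machine translation carries out the conversion from source language to target language and is one of the major research areas in natural language processing.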
...@@ -79,7 +84,21 @@ ...@@ -79,7 +84,21 @@
## Model Overview ## Model Overview
This section introduces, in order, the bi-directional recurrent neural network, the typical encoder-decoder framework used in NMT models, and the beam search algorithm. This section introduces, in order, the GRU (gated recurrent unit), the bi-directional recurrent neural network, the typical encoder-decoder framework and attention mechanism used in NMT models, and the beam search algorithm.
### GRU
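We introduced recurrent neural networks (RNN) and long short-term memory networks (LSTM) in the [sentiment analysis](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md) chapter. Compared with a simple RNN, LSTM adds a memory cell, an input gate, a forget gate, and an output gate. Together, these gates and the memory cell greatly improve an RNN's ability to handle long-range dependencies.
GRU\[[2](#参考文献)\], proposed by Cho et al., is a simplified version of LSTM and also an extension of RNN, as shown in the figure below. A GRU unit has only two gates:
- Reset gate: when the reset gate is closed, historical information is ignored, so irrelevant history does not affect future output.
- Update gate: merges LSTM's input gate and forget gate and controls how much historical information influences the hidden output at the current time step. When the update gate is close to 1, the historical information is passed on.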
<p align="center">
<img src="image/gru.png" width=700><br/>
Figure 2. GRU (gated recurrent unit)
</p>
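Generally speaking, in sequences with short-range dependencies the reset gate is more active; conversely, in sequences with long-range dependencies the update gate is more active. In addition, Chung et al.\[[3](#参考文献)\] showed through several sets of experiments that GRU, despite having fewer parameters, performs comparably to LSTM on many tasks.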
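The role of the two gates can be made concrete with a scalar GRU step in plain Python. This is only a sketch under stated assumptions: the weight names in `p` are hypothetical, the state is a single number rather than a vector, and the update follows the convention used in the text above (an update gate near 1 keeps the history), which is the opposite sign convention from some other presentations.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds hypothetical weights and biases."""
    # Reset gate: near 0 means the candidate ignores the previous state.
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])
    # Update gate: near 1 means the previous state is carried over.
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])
    # Candidate state built from the input and the reset-scaled history.
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # Assumed convention: z weights the history, (1 - z) the candidate.
    return z * h_prev + (1.0 - z) * h_cand


params = {
    "w_r": 0.5, "u_r": 0.5, "b_r": 0.0,
    "w_z": 0.5, "u_z": 0.5, "b_z": 0.0,
    "w_h": 1.0, "u_h": 1.0, "b_h": 0.0,
}
h = 0.0
for x in [0.3, -0.1, 0.8]:
    h = gru_step(x, h, params)
print(h)
```

Pushing `b_z` to a large positive value makes the step return the previous state almost unchanged, while large negative `b_z` and `b_r` make it depend on the current input alone.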
### Bi-directional Recurrent Neural Network ### Bi-directional Recurrent Neural Network
...@@ -123,19 +142,56 @@ ...@@ -123,19 +142,56 @@
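During training for the machine translation task, the goal of the decoding stage is to maximize the probability of the next correct target-language word. The idea is: During training for the machine translation task, the goal of the decoding stage is to maximize the probability of the next correct target-language word. The idea is:
1. At each time step, compute the next hidden state $z_{i+1}$ from the encoding of the source sentence (also called the context vector) $c$, the $i$-th word $u_i$ of the true target-language sequence, and the RNN hidden state $z_i$ at time $i$. The formula is: 1. At each time step, compute the next hidden state $z_{i+1}$ from the encoding of the source sentence (also called the context vector) $c$, the $i$-th word $u_i$ of the true target-language sequence, and the RNN hidden state $z_i$ at time $i$. The formula is: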
$$z_{i+1}=\phi_{\theta '} \left ( c,u_i,z_i \right )$$
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/decoder_formula.png?raw=true" width="200"><br/>
</div>
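Here $\phi _{\theta '}$ is a non-linear activation function and $c$ is the context vector of the source sentence. Without attention, if the output of the [encoder](#编码器) is the last element of the encoded source sentence, we can define $c=h_T$. $u_i$ is the $i$-th word of the target-language sequence, and $u_0$ is the start mark `<s>` of the target-language sequence, indicating the start of decoding. $z_i$ is the hidden state of the decoding RNN at time $i$, and $z_0$ is an all-zero vector. Here $\phi _{\theta '}$ is a non-linear activation function and $c$ is the context vector of the source sentence. Without attention, if the output of the [encoder](#编码器) is the last element of the encoded source sentence, we can define $c=h_T$. $u_i$ is the $i$-th word of the target-language sequence, and $u_0$ is the start mark `<s>` of the target-language sequence, indicating the start of decoding. $z_i$ is the hidden state of the decoding RNN at time $i$, and $z_0$ is an all-zero vector.
2. Normalize $z_{i+1}$ with `softmax` to obtain the probability distribution $p_{i+1}$ of the $(i+1)$-th word of the target-language sequence. The formula is: 1. Normalize $z_{i+1}$ with `softmax` to obtain the probability distribution $p_{i+1}$ of the $(i+1)$-th word of the target-language sequence. The formula is: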
$$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/probability_formula.png?raw=true" width="400"><br/>
</div>
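Here $W_sz_{i+1}+b_z$ scores every possible output word, and softmax normalization then gives the probability $p_{i+1}$ of the $(i+1)$-th word. Here $W_sz_{i+1}+b_z$ scores every possible output word, and softmax normalization then gives the probability $p_{i+1}$ of the $(i+1)$-th word.
3. Compute the cost from $p_{i+1}$ and $u_{i+1}$. 1. Compute the cost from $p_{i+1}$ and $u_{i+1}$.
4. Repeat steps 1 to 3 until all words in the target-language sequence have been processed. 2. Repeat steps 1 to 3 until all words in the target-language sequence have been processed.
Generation in machine translation simply means translating source-language sentences with a pre-trained model. The decoding stage during generation differs from the training process above; see the [beam search algorithm](#柱搜索算法) for details. Generation in machine translation simply means translating source-language sentences with a pre-trained model. The decoding stage during generation differs from the training process above; see the [beam search algorithm](#柱搜索算法) for details.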
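The four training steps listed above can be traced with a small pure-Python sketch. It is an illustration under loud assumptions: `phi` is a toy stand-in for the recurrent unit (an element-wise `tanh` of a sum), the embedding table `embed`, the matrix `W_s`, the bias `b_z`, and all word ids are made-up values, and the cost is taken to be the average negative log-likelihood of the true next word.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def phi(c, u, z):
    # Toy stand-in for the recurrent unit, not the tutorial's actual cell.
    return [math.tanh(ck + uk + zk) for ck, uk, zk in zip(c, u, z)]


def teacher_forced_cost(c, target_ids, embed, W_s, b_z, start_id=0):
    """Average cost over one target sequence, feeding the true words."""
    z = [0.0] * len(c)   # z_0 is an all-zero vector
    prev = start_id      # u_0 is the start mark <s>
    total = 0.0
    for gold in target_ids:
        z = phi(c, embed[prev], z)                       # step 1
        scores = [sum(w * h for w, h in zip(row, z)) + b
                  for row, b in zip(W_s, b_z)]
        p = softmax(scores)                              # step 2
        total += -math.log(p[gold])                      # step 3
        prev = gold      # teacher forcing: feed the true word, not the guess
    return total / len(target_ids)                       # step 4 ends the loop


embed = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.1]]   # 3 words, 2 dimensions
c = [0.5, -0.5]                                # made-up context vector
W_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_z = [0.0, 0.0, 0.0]
print(teacher_forced_cost(c, [1, 2], embed, W_s, b_z))
```

During generation the true next word is unknown, so the loop above cannot be reused as is; the model's own predictions have to be fed back, which is where beam search comes in.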
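### Attention Mechanism
If the output of the encoding stage is a fixed-dimensional vector, two problems arise: 1) whether the source sequence has 5 words or 50, encoding all of its semantic and syntactic information into a vector of fixed dimension places a very high demand on the model, especially for long sentences; 2) intuitively, when a person translates a sentence, they pay more attention to the source fragments most relevant to the part currently being translated, and the focus shifts as translation proceeds. A fixed-dimensional vector, by contrast, amounts to paying equal attention to all source information at every time step, which is unreasonable. Bahdanau et al.\[[4](#参考文献)\] therefore introduced the attention mechanism, which decodes based on encoded context fragments and thereby addresses feature learning for long sentences. The decoder structure under the attention mechanism is described below.
Unlike the simple decoder, $z_i$ is computed here as: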
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/attention_decoder_formula.png?raw=true" width="200"><br/>
</div>
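As can be seen, the encoded representation of the source sentence becomes the context fragment $c_i$ for the $i$-th word; that is, every target-language word $u_i$ has its own specific $c_i$. $c_i$ is computed as follows: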
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/sum_formula.png?raw=true" width="300"><br/>
</div>
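The formula shows that attention is implemented as a weighted average of the encoder's RNN states $h_j$ at each time step. The weight $a_{ij}$ expresses how much attention the $i$-th target-language word pays to the $j$-th source-language word. $a_{ij}$ is computed as follows: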
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/weight_formula.png?raw=true" width="300"><br/>
</div>
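Here $align$ can be viewed as an alignment model that measures how well the $i$-th target-language word matches the $j$-th source-language word. Specifically, this degree is computed from the $i$-th hidden state $z_i$ of the decoding RNN and the $j$-th context fragment $h_j$ of the source sentence. In traditional alignment models, each target-language word corresponds explicitly to one or more source-language words (hard alignment). The attention model instead uses soft alignment, in which any pair of target and source words is associated to some degree, and the strength of that association is a real number computed by the model. It can therefore be integrated into the whole NMT framework and trained with backpropagation.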
<p align="center">
<img src="image/decoder_attention.png" width=500><br/>
Figure 6. Decoder with attention mechanism
</p>
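The weighted average described in this section can be sketched in a few lines of plain Python. Note the swapped-in technique: the tutorial follows Bahdanau-style attention with a learned alignment model, whereas this sketch assumes a plain dot product for $align(z_i, h_j)$ purely to keep the example short. The function name `attention_context` and the sample vectors are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(z_i, encoder_states):
    """Return the weights a_ij and the context c_i for one decoder state."""
    # Assumption: align(z_i, h_j) is a dot product, not a learned network.
    scores = [sum(a * b for a, b in zip(z_i, h_j)) for h_j in encoder_states]
    weights = softmax(scores)  # soft alignment: every h_j gets some weight
    dim = len(encoder_states[0])
    context = [
        sum(w * h_j[k] for w, h_j in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up h_1..h_3
weights, context = attention_context([2.0, 0.0], encoder_states)
print(weights, context)
```

Because the weights come from a softmax, they are all positive and sum to 1, so every source position contributes to $c_i$ and gradients can flow to all of them.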
<a name="柱搜索算法"></a> <a name="柱搜索算法"></a>
### Beam Search Algorithm ### Beam Search Algorithm
...@@ -156,7 +212,7 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$ ...@@ -156,7 +212,7 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
## Data Description ## Data Description
This tutorial uses [bitexts(after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) from the [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) dataset as the training set and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the test and generation set. This tutorial uses the dataset of the [translation task](http://www.statmt.org/wmt16/multimodal-task.html#task1) in the [multimodal task](http://www.statmt.org/wmt16/multimodal-task.html) newly added in [WMT-16](http://www.statmt.org/wmt16/). It is an English-German translation dataset containing 29001 training examples and 1000 test examples.
### Data Preprocessing ### Data Preprocessing
...@@ -172,347 +228,408 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$ ...@@ -172,347 +228,408 @@ $$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
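### Example Data ### Example Data
Because the full dataset is large, and in order to validate the training pipeline, the PaddlePaddle interface paddle.dataset.wmt14 provides by default a preprocessed [smaller dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz). To validate the training pipeline, the PaddlePaddle interface `paddle.dataset.wmt16` provides a [preprocessed version](http://paddlemodels.bj.bcebos.com/wmt/wmt16.tar.gz) of this dataset that can be used directly by calling the interface. Because of the limited data size, it serves only as an example here: it has some effect on the corresponding test set, but its performance on more test data cannot be guaranteed.
This dataset has 193319 training examples and 6003 test examples, with a dictionary size of 30000. Because of the limited data size, the quality of a model trained on it cannot be guaranteed.
## Model Configuration ## Model Configuration
Next we configure the model according to the form of the input data. First import the required libraries and define the global variables Next we configure the model according to the form of the input data. First import the required libraries and define the global variables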
```python ```python
from __future__ import print_function from __future__ import print_function
import contextlib import os
import six
import numpy as np
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
import paddle.fluid.framework as framework
import paddle.fluid.layers as pd
from paddle.fluid.executor import Executor
from functools import partial
import os
try:
from paddle.fluid.contrib.trainer import *
from paddle.fluid.contrib.inferencer import *
except ImportError:
print(
"In the fluid 1.0, the trainer and inferencer are moving to paddle.fluid.contrib",
file=sys.stderr)
from paddle.fluid.trainer import *
from paddle.fluid.inferencer import *
dict_size = 30000 # 字典维度
source_dict_dim = target_dict_dim = dict_size # 源/目标语言字典维度
hidden_dim = 32 # 编码器中的隐层大小
word_dim = 16 # 词向量维度
batch_size = 2 # batch 中的样本数
max_length = 8 # 生成句子的最大长度
beam_size = 2 # 柱宽度
dict_size = 30000 # 词典大小
source_dict_size = target_dict_size = dict_size # 源/目标语言字典大小
word_dim = 512 # 词向量维度
hidden_dim = 512 # 编码器中的隐层大小
decoder_size = hidden_dim # 解码器中的隐层大小 decoder_size = hidden_dim # 解码器中的隐层大小
max_length = 256 # 解码生成句子的最大长度
beam_size = 4 # beam search的柱宽度
batch_size = 64 # batch 中的样本数
is_sparse = True
model_save_dir = "machine_translation.inference.model"
``` ```
然后如下实现编码器框架: 然后如下实现编码器框架:
```python ```python
def encoder(is_sparse): def encoder():
# 定义源语言id序列的输入数据 # 定义源语言id序列的输入数据
src_word_id = pd.data( src_word_id = fluid.layers.data(
name="src_word_id", shape=[1], dtype='int64', lod_level=1) name="src_word_id", shape=[1], dtype='int64', lod_level=1)
# 将上述编码映射到低维语言空间的词向量 # 将上述编码映射到低维语言空间的词向量
src_embedding = pd.embedding( src_embedding = fluid.layers.embedding(
input=src_word_id, input=src_word_id,
size=[dict_size, word_dim], size=[source_dict_size, word_dim],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse)
param_attr=fluid.ParamAttr(name='vemb')) # 用双向GRU编码源语言序列,拼接两个GRU的编码结果得到h
# LSTM层:fc + dynamic_lstm fc_forward = fluid.layers.fc(
fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh') input=src_embedding, size=hidden_dim * 3, bias_attr=False)
lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4) src_forward = fluid.layers.dynamic_gru(input=fc_forward, size=hidden_dim)
# 取源语言序列编码后的最后一个状态 fc_backward = fluid.layers.fc(
encoder_out = pd.sequence_last_step(input=lstm_hidden0) input=src_embedding, size=hidden_dim * 3, bias_attr=False)
return encoder_out src_backward = fluid.layers.dynamic_gru(
input=fc_backward, size=hidden_dim, is_reverse=True)
encoded_vector = fluid.layers.concat(
input=[src_forward, src_backward], axis=1)
return encoded_vector
```
Next, implement the attention-based decoder:
- First define the single-step computation in the decoder, that is, $z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$, as follows:
```python
# 定义RNN中的单步计算
def cell(x, hidden, encoder_out, encoder_out_proj):
# 定义attention用以计算context,即 c_i,这里使用Bahdanau attention机制
def simple_attention(encoder_vec, encoder_proj, decoder_state):
decoder_state_proj = fluid.layers.fc(
input=decoder_state, size=decoder_size, bias_attr=False)
# sequence_expand将单步内容扩展为与encoder输出相同的序列
decoder_state_expand = fluid.layers.sequence_expand(
x=decoder_state_proj, y=encoder_proj)
mixed_state = fluid.layers.elementwise_add(encoder_proj,
decoder_state_expand)
attention_weights = fluid.layers.fc(
input=mixed_state, size=1, bias_attr=False)
attention_weights = fluid.layers.sequence_softmax(
input=attention_weights)
weigths_reshape = fluid.layers.reshape(x=attention_weights, shape=[-1])
scaled = fluid.layers.elementwise_mul(
x=encoder_vec, y=weigths_reshape, axis=0)
context = fluid.layers.sequence_pool(input=scaled, pool_type='sum')
return context
context = simple_attention(encoder_out, encoder_out_proj, hidden)
out = fluid.layers.fc(
input=[x, context], size=decoder_size * 3, bias_attr=False)
out = fluid.layers.gru_unit(
input=out, hidden=hidden, size=decoder_size * 3)[0]
return out, out
``` ```
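Next, implement the decoder in training mode: - Based on the single-step computation defined above, use `DynamicRNN` to implement the multi-step recurrent decoder in training mode, as follows: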
```python
def train_decoder(encoder_out):
# 获取编码器输出的最后一步并进行非线性映射以构造解码器RNN的初始状态
encoder_last = fluid.layers.sequence_last_step(input=encoder_out)
encoder_last_proj = fluid.layers.fc(
input=encoder_last, size=decoder_size, act='tanh')
# 编码器输出在attention中计算结果的cache
encoder_out_proj = fluid.layers.fc(
input=encoder_out, size=decoder_size, bias_attr=False)
```python
def train_decoder(context, is_sparse):
# 定义目标语言id序列的输入数据,并映射到低维语言空间的词向量 # 定义目标语言id序列的输入数据,并映射到低维语言空间的词向量
trg_language_word = pd.data( trg_language_word = fluid.layers.data(
name="target_language_word", shape=[1], dtype='int64', lod_level=1) name="target_language_word", shape=[1], dtype='int64', lod_level=1)
trg_embedding = pd.embedding( trg_embedding = fluid.layers.embedding(
input=trg_language_word, input=trg_language_word,
size=[dict_size, word_dim], size=[target_dict_size, word_dim],
dtype='float32', dtype='float32',
is_sparse=is_sparse, is_sparse=is_sparse)
param_attr=fluid.ParamAttr(name='vemb'))
rnn = pd.DynamicRNN() rnn = fluid.layers.DynamicRNN()
with rnn.block(): # 使用 DynamicRNN 定义每一步的计算 with rnn.block():
# 获取当前步目标语言输入的词向量 # 获取当前步目标语言输入的词向量
current_word = rnn.step_input(trg_embedding) x = rnn.step_input(trg_embedding)
# 获取隐层状态 # 获取隐层状态
pre_state = rnn.memory(init=context) pre_state = rnn.memory(init=encoder_last_proj, need_reorder=True)
# 解码器计算单元:单层前馈网络 # 在DynamicRNN中需使用static_input获取encoder相关的内容
current_state = pd.fc(input=[current_word, pre_state], # 对decoder来说这些内容在每个时间步都是固定的
size=decoder_size, encoder_out = rnn.static_input(encoder_out)
act='tanh') encoder_out_proj = rnn.static_input(encoder_out_proj)
# 执行单步的计算单元
out, current_state = cell(x, pre_state, encoder_out, encoder_out_proj)
# 计算归一化的单词预测概率 # 计算归一化的单词预测概率
current_score = pd.fc(input=current_state, prob = fluid.layers.fc(input=out, size=target_dict_size, act='softmax')
size=target_dict_dim, # 更新隐层状态
act='softmax')
# 更新RNN的隐层状态
rnn.update_memory(pre_state, current_state) rnn.update_memory(pre_state, current_state)
# 输出预测概率 # 输出预测概率
rnn.output(current_score) rnn.output(prob)
return rnn() return rnn()
``` ```
实现推测模式下的解码器 接着就可以使用编码器和解码器定义整个训练网络;为了进行训练还需要定义优化器,如下
```python ```python
def decode(context, is_sparse): def train_model():
init_state = context encoder_out = encoder()
# 定义解码过程循环计数变量 rnn_out = train_decoder(encoder_out)
array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length) label = fluid.layers.data(
counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True) name="target_language_next_word", shape=[1], dtype='int64', lod_level=1)
# 定义损失函数
cost = fluid.layers.cross_entropy(input=rnn_out, label=label)
avg_cost = fluid.layers.mean(cost)
return avg_cost
# 定义 tensor array 用以保存各个时间步的内容,并写入初始id,score和state def optimizer_func():
state_array = pd.create_array('float32') # 设置梯度裁剪
pd.array_write(init_state, array=state_array, i=counter) fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=5.0))
# 定义先增后降的学习率策略
lr_decay = fluid.layers.learning_rate_scheduler.noam_decay(hidden_dim, 1000)
return fluid.optimizer.Adam(
learning_rate=lr_decay,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4))
```
ids_array = pd.create_array('int64') 以上是训练所需的模型构件,预测(生成)模式下基于beam search的解码器需要借助`while_op`实现,如下:
scores_array = pd.create_array('float32')
init_ids = pd.data(name="init_ids", shape=[1], dtype="int64", lod_level=2) ```python
init_scores = pd.data( def infer_decoder(encoder_out):
name="init_scores", shape=[1], dtype="float32", lod_level=2) # 获取编码器输出的最后一步并进行非线性映射以构造解码器RNN的初始状态
encoder_last = fluid.layers.sequence_last_step(input=encoder_out)
encoder_last_proj = fluid.layers.fc(
input=encoder_last, size=decoder_size, act='tanh')
# 编码器输出在attention中计算结果的cache
encoder_out_proj = fluid.layers.fc(
input=encoder_out, size=decoder_size, bias_attr=False)
# 最大解码步数
max_len = fluid.layers.fill_constant(
shape=[1], dtype='int64', value=max_length)
# 解码步数计数变量
counter = fluid.layers.zeros(shape=[1], dtype='int64', force_cpu=True)
pd.array_write(init_ids, array=ids_array, i=counter) # 定义 tensor array 用以保存各个时间步的内容,并写入初始id,score和state
pd.array_write(init_scores, array=scores_array, i=counter) init_ids = fluid.layers.data(
name="init_ids", shape=[1], dtype="int64", lod_level=2)
init_scores = fluid.layers.data(
name="init_scores", shape=[1], dtype="float32", lod_level=2)
ids_array = fluid.layers.array_write(init_ids, i=counter)
scores_array = fluid.layers.array_write(init_scores, i=counter)
state_array = fluid.layers.array_write(encoder_last_proj, i=counter)
# 定义循环终止条件变量 # 定义循环终止条件变量
cond = pd.less_than(x=counter, y=array_len) cond = fluid.layers.less_than(x=counter, y=max_len)
# 定义 while_op while_op = fluid.layers.While(cond=cond)
while_op = pd.While(cond=cond) with while_op.block():
with while_op.block(): # 定义每一步的计算
# 获取解码器在当前步的输入,包括上一步选择的id,对应的score和上一步的state # 获取解码器在当前步的输入,包括上一步选择的id,对应的score和上一步的state
pre_ids = pd.array_read(array=ids_array, i=counter) pre_ids = fluid.layers.array_read(array=ids_array, i=counter)
pre_state = pd.array_read(array=state_array, i=counter) pre_score = fluid.layers.array_read(array=scores_array, i=counter)
pre_score = pd.array_read(array=scores_array, i=counter) pre_state = fluid.layers.array_read(array=state_array, i=counter)
# 更新输入的state为上一步选择id对应的state # 同train_decoder中的内容,进行RNN的单步计算
pre_state_expanded = pd.sequence_expand(pre_state, pre_score) pre_ids_emb = fluid.layers.embedding(
# 同训练模式下解码器中的计算逻辑,包括获取输入向量,解码器计算单元计算和
# 归一化单词预测概率的计算
pre_ids_emb = pd.embedding(
input=pre_ids, input=pre_ids,
size=[dict_size, word_dim], size=[target_dict_size, word_dim],
dtype='float32', dtype='float32',
is_sparse=is_sparse) is_sparse=is_sparse)
current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb], out, current_state = cell(pre_ids_emb, pre_state, encoder_out,
size=decoder_size, encoder_out_proj)
act='tanh') prob = fluid.layers.fc(
current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score) input=current_state, size=target_dict_size, act='softmax')
current_score = pd.fc(input=current_state_with_lod,
size=target_dict_dim,
act='softmax')
topk_scores, topk_indices = pd.topk(current_score, k=beam_size)
# 计算累计得分,进行beam search # 计算累计得分,进行beam search
accu_scores = pd.elementwise_add( topk_scores, topk_indices = fluid.layers.topk(prob, k=beam_size)
x=pd.log(topk_scores), y=pd.reshape(pre_score, shape=[-1]), axis=0) accu_scores = fluid.layers.elementwise_add(
selected_ids, selected_scores = pd.beam_search( x=fluid.layers.log(topk_scores),
pre_ids, y=fluid.layers.reshape(pre_score, shape=[-1]),
pre_score, axis=0)
topk_indices, accu_scores = fluid.layers.lod_reset(x=accu_scores, y=pre_ids)
accu_scores, selected_ids, selected_scores = fluid.layers.beam_search(
beam_size, pre_ids, pre_score, topk_indices, accu_scores, beam_size, end_id=1)
end_id=10,
level=0) fluid.layers.increment(x=counter, value=1, in_place=True)
# 将 search 结果写入 tensor array 中
pd.increment(x=counter, value=1, in_place=True) fluid.layers.array_write(selected_ids, array=ids_array, i=counter)
# 将 search 结果和对应的隐层状态写入 tensor array 中 fluid.layers.array_write(selected_scores, array=scores_array, i=counter)
pd.array_write(current_state, array=state_array, i=counter) # sequence_expand 作为 gather 使用以获取search结果对应的状态,并更新
pd.array_write(selected_ids, array=ids_array, i=counter) current_state = fluid.layers.sequence_expand(current_state,
pd.array_write(selected_scores, array=scores_array, i=counter) selected_ids)
fluid.layers.array_write(current_state, array=state_array, i=counter)
current_enc_out = fluid.layers.sequence_expand(encoder_out,
selected_ids)
fluid.layers.assign(current_enc_out, encoder_out)
current_enc_out_proj = fluid.layers.sequence_expand(
encoder_out_proj, selected_ids)
fluid.layers.assign(current_enc_out_proj, encoder_out_proj)
# 更新循环终止条件 # 更新循环终止条件
length_cond = pd.less_than(x=counter, y=array_len) length_cond = fluid.layers.less_than(x=counter, y=max_len)
finish_cond = pd.logical_not(pd.is_empty(x=selected_ids)) finish_cond = fluid.layers.logical_not(
pd.logical_and(x=length_cond, y=finish_cond, out=cond) fluid.layers.is_empty(x=selected_ids))
fluid.layers.logical_and(x=length_cond, y=finish_cond, out=cond)
translation_ids, translation_scores = pd.beam_search_decode( # 根据保存的每一步的结果,回溯生成最终解码结果
ids=ids_array, scores=scores_array, beam_size=beam_size, end_id=10) translation_ids, translation_scores = fluid.layers.beam_search_decode(
ids=ids_array, scores=scores_array, beam_size=beam_size, end_id=1)
return translation_ids, translation_scores return translation_ids, translation_scores
``` ```
进而,我们定义一个`train_program`来使用`inference_program`计算出的结果,在标记数据的帮助下来计算误差。我们还定义了一个`optimizer_func`来定义优化器。 使用编码器和预测模式的解码器,预测网络定义如下:
```python ```python
def train_program(is_sparse): def infer_model():
context = encoder(is_sparse) encoder_out = encoder()
rnn_out = train_decoder(context, is_sparse) translation_ids, translation_scores = infer_decoder(encoder_out)
label = pd.data( return translation_ids, translation_scores
name="target_language_next_word", shape=[1], dtype='int64', lod_level=1)
cost = pd.cross_entropy(input=rnn_out, label=label)
avg_cost = pd.mean(cost)
return avg_cost
def optimizer_func():
return fluid.optimizer.Adagrad(
learning_rate=1e-4,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.1))
``` ```
## 训练模型 ## 训练模型
### 定义训练环境 ### 构建训练程序
定义您的训练环境,可以指定训练是发生在CPU还是GPU上。
```python 定义用于训练的`Program`,在其中创建训练的网络结构并添加优化器。同时还要定义用于初始化的`Program`,在创建训练网络的同时隐式的加入参数初始化的操作。
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
```
### 定义数据提供器
下一步是为训练和测试定义数据提供器。提供器读入一个大小为 `BATCH_SIZE`的数据。`paddle.dataset.wmt.train` 每次会在乱序化后提供一个大小为`BATCH_SIZE`的数据,乱序化的大小为缓存大小`buf_size`。
```python ```python
train_reader = paddle.batch( train_prog = fluid.Program()
paddle.reader.shuffle( startup_prog = fluid.Program()
paddle.dataset.wmt14.train(dict_size), buf_size=1000), with fluid.program_guard(train_prog, startup_prog):
batch_size=batch_size) with fluid.unique_name.guard():
avg_cost = train_model()
optimizer = optimizer_func()
optimizer.minimize(avg_cost)
``` ```
### 构造训练器(trainer) ### 定义训练环境与执行器
训练器需要一个训练程序和一个训练优化函数。
定义您的训练环境,可以指定训练是发生在CPU还是GPU上;并基于这个训练环境定义执行器。
```python ```python
is_sparse = False use_cuda = False
trainer = Trainer( # 定义使用设备和执行器
train_func=partial(train_program, is_sparse), place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
place=place, exe = fluid.Executor(place)
optimizer_func=optimizer_func)
``` ```
### 提供数据 ### 构建数据提供器
`feed_order`用来定义每条产生的数据和`paddle.layer.data`之间的映射关系。比如,`wmt14.train`产生的第一列的数据对应的是`src_word_id`这个特征 使用封装的`paddle.dataset.wmt16.train`接口定义数据生成器,其每次产生一条样本,shuffle和组完batch后作为训练的输入;另外还需要指明输入数据中各字段和`data_layer`定义的各输入的对应关系,这可以通过`DataFeeder`完成, 下面的feeder将产生数据的第一列映射到`src_word_id`这个输入
```python ```python
feed_order = [ # 定义训练数据生成器
train_data = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt16.train(source_dict_size, target_dict_size),
buf_size=10000),
batch_size=batch_size)
# DataFeeder完成
feeder = fluid.DataFeeder(
feed_list=[
'src_word_id', 'target_language_word', 'target_language_next_word' 'src_word_id', 'target_language_word', 'target_language_next_word'
] ],
place=place,
program=train_prog)
``` ```
### 事件处理器 ### 训练主循环
回调函数`event_handler`在一个之前定义好的事件发生后会被调用。例如,我们可以在每步训练结束后查看误差。
```python
def event_handler(event):
if isinstance(event, EndStepEvent):
if event.step % 10 == 0:
print('pass_id=' + str(event.epoch) + ' batch=' + str(event.step))
if event.step == 20:
trainer.stop()
```
### 开始训练 通过训练循环数(EPOCH_NUM)来进行训练循环,并且每次循环都保存训练好的参数。注意,循环训练前要首先执行初始化的`Program`来初始化参数。另外作为示例这里EPOCH_NUM设置较小,该数据集上实际大概需要20个epoch左右收敛。
最后,我们传入训练循环数(`num_epoch`)和一些别的参数,调用 `trainer.train` 来开始训练。
```python ```python
EPOCH_NUM = 1 # 执行初始化 Program,进行参数初始化
exe.run(startup_prog)
trainer.train( # 循环迭代执行训练
reader=train_reader, EPOCH_NUM = 2
num_epochs=EPOCH_NUM, for pass_id in six.moves.xrange(EPOCH_NUM):
event_handler=event_handler, batch_id = 0
feed_order=feed_order) for data in train_data():
cost = exe.run(
train_prog, feed=feeder.feed(data), fetch_list=[avg_cost])[0]
print('pass_id: %d, batch_id: %d, loss: %f' % (pass_id, batch_id,
cost))
batch_id += 1
# 保存模型
fluid.io.save_params(exe, model_save_dir, main_program=train_prog)
``` ```
## 应用模型 ## 应用模型
### 定义解码部分 ### 构建预测程序
使用上面定义的 `encoder` 和 `decoder` 函数来推测翻译后的对应id和分数. 定义用于预测的`Program`,在其中创建预测的网络结构。
```python ```python
context = encoder(is_sparse) infer_prog = fluid.Program()
translation_ids, translation_scores = decode(context, is_sparse) startup_prog = fluid.Program()
with fluid.program_guard(infer_prog, startup_prog):
with fluid.unique_name.guard():
translation_ids, translation_scores = infer_model()
``` ```
### 定义数据 ### 构建数据提供器
我们先初始化id和分数来生成tensors来作为输入数据。在这个预测例子中,我们用`wmt14.test`数据中的第一个记录来做推测,最后我们用"源字典"和"目标字典"来列印对应的句子结果 和训练类似,这里使用封装的`paddle.dataset.wmt16.test`接口定义测试数据生成器,测试数据共1000条,组完batch后作为预测的输入;另外我们获取源语言和目标语言id到word的词典,以将id序列转换为明文序列打印输出
```python ```python
init_ids_data = np.array([1 for _ in range(batch_size)], dtype='int64')
init_scores_data = np.array(
[1. for _ in range(batch_size)], dtype='float32')
init_ids_data = init_ids_data.reshape((batch_size, 1))
init_scores_data = init_scores_data.reshape((batch_size, 1))
init_lod = [1] * batch_size
init_lod = [init_lod, init_lod]
init_ids = fluid.create_lod_tensor(init_ids_data, init_lod, place)
init_scores = fluid.create_lod_tensor(init_scores_data, init_lod, place)
test_data = paddle.batch( test_data = paddle.batch(
paddle.reader.shuffle( paddle.dataset.wmt16.test(source_dict_size, target_dict_size),
paddle.dataset.wmt14.test(dict_size), buf_size=1000),
batch_size=batch_size) batch_size=batch_size)
src_idx2word = paddle.dataset.wmt16.get_dict(
feed_order = ['src_word_id'] "en", source_dict_size, reverse=True)
feed_list = [ trg_idx2word = paddle.dataset.wmt16.get_dict(
framework.default_main_program().global_block().var(var_name) "de", target_dict_size, reverse=True)
for var_name in feed_order
]
feeder = fluid.DataFeeder(feed_list, place)
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
``` ```
### 测试 ### 测试
现在我们可以进行预测了。我们要在`feed_order`提供对应参数,放在`executor`上运行以取得id和分数结果 首先要加载训练过程保存下来的模型,然后就可以循环测试数据进行预测了。这里每次运行我们都会创建`data_layer`对应输入数据的`dict`传入,这个和`DataFeeder`相同的效果。生成过程对于每个测试数据都会将源语言句子和`beam_size`个生成句子打印输出。
```python ```python
exe = Executor(place) fluid.io.load_params(exe, model_save_dir, main_program=infer_prog)
exe.run(framework.default_startup_program())
for data in test_data():
for data in test_data(): src_word_id = fluid.create_lod_tensor(
feed_data = map(lambda x: [x[0]], data) data=[x[0] for x in data],
feed_dict = feeder.feed(feed_data) recursive_seq_lens=[[len(x[0]) for x in data]],
feed_dict['init_ids'] = init_ids place=place)
feed_dict['init_scores'] = init_scores # init_ids内容为start token
init_ids = fluid.create_lod_tensor(
results = exe.run( data=np.array([[0]] * len(data), dtype='int64'),
framework.default_main_program(), recursive_seq_lens=[[1] * len(data)] * 2,
feed=feed_dict, place=place)
# init_scores为beam search过程累积得分的初值
init_scores = fluid.create_lod_tensor(
data=np.array([[0.]] * len(data), dtype='float32'),
recursive_seq_lens=[[1] * len(data)] * 2,
place=place)
seq_ids, seq_scores = exe.run(
infer_prog,
feed={
'src_word_id': src_word_id,
'init_ids': init_ids,
'init_scores': init_scores
},
fetch_list=[translation_ids, translation_scores], fetch_list=[translation_ids, translation_scores],
return_numpy=False) return_numpy=False)
# 如何解析翻译结果详见 train.py 中对应代码的注释说明
result_ids = np.array(results[0]) hyps = [[] for i in range(len(seq_ids.lod()[0]) - 1)]
result_ids_lod = results[0].lod() scores = [[] for i in range(len(seq_scores.lod()[0]) - 1)]
result_scores = np.array(results[1]) for i in range(len(seq_ids.lod()[0]) - 1):
start = seq_ids.lod()[0][i]
end = seq_ids.lod()[0][i + 1]
print("Original sentence:") print("Original sentence:")
print(" ".join([src_dict[w] for w in feed_data[0][0][1:-1]])) print(" ".join([src_idx2word[idx] for idx in data[i][0][1:-1]]))
print("Translated score and sentence:") print("Translated score and sentence:")
for i in xrange(beam_size): for j in range(end - start):
start_pos = result_ids_lod[1][i] + 1 sub_start = seq_ids.lod()[1][start + j]
end_pos = result_ids_lod[1][i+1] sub_end = seq_ids.lod()[1][start + j + 1]
print("%d\t%.4f\t%s\n" % (i+1, result_scores[end_pos-1], hyps[i].append(" ".join([
" ".join([trg_dict[w] for w in result_ids[start_pos:end_pos]]))) trg_idx2word[idx]
for idx in np.array(seq_ids)[sub_start:sub_end][1:-1]
break ]))
scores[i].append(np.array(seq_scores)[sub_end - 1])
print(scores[i][-1], hyps[i][-1].encode('utf8'))
```
The prediction output looks like this:
```txt
Original sentence:
Two adults and two children sit on a park bench .
Translated score and sentence:
-2.5993705 Zwei Erwachsene und zwei Kinder sitzen auf einer Parkbank .
-2.6617606 Zwei Erwachsene und zwei Kinder spielen auf einer Parkbank .
-3.186554 Zwei Erwachsene und zwei Kinder sitzen auf einer Bank .
-3.4353821 Zwei Erwachsene und zwei Kinder spielen auf einer Bank .
``` ```
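The result-parsing loop above walks a two-level LoD (level-of-detail) offset table: level 0 maps each source sentence to its range of hypotheses, and level 1 maps each hypothesis to its range of tokens. The following is a minimal sketch of that indexing using plain NumPy arrays and Python lists instead of Paddle tensors. The function name `split_by_lod` and the toy data are assumptions made for illustration and do not appear in `train.py`.

```python
import numpy as np

def split_by_lod(flat_ids, flat_scores, lod):
    """Group flattened beam-search output into per-source hypotheses.

    lod[0][i]:lod[0][i + 1] is the range of hypotheses for source sentence i.
    lod[1][k]:lod[1][k + 1] is the range of tokens for hypothesis k.
    """
    results = []
    for i in range(len(lod[0]) - 1):
        start, end = lod[0][i], lod[0][i + 1]
        hyps = []
        for k in range(start, end):
            sub_start, sub_end = lod[1][k], lod[1][k + 1]
            # Drop the first and last token (start and end marks).
            tokens = flat_ids[sub_start:sub_end][1:-1].tolist()
            # The accumulated score sits at the last token of the hypothesis.
            score = float(flat_scores[sub_end - 1])
            hyps.append((score, tokens))
        results.append(hyps)
    return results

# Toy data (hypothetical): one source sentence with two hypotheses.
flat_ids = np.array([0, 5, 6, 1, 0, 7, 1])
flat_scores = np.array([0.0, -0.5, -1.0, -1.5, 0.0, -0.9, -2.0])
lod = [[0, 2], [0, 4, 7]]
parsed = split_by_lod(flat_ids, flat_scores, lod)
```

Here `parsed[0]` holds the two hypotheses of the single source sentence, each as a `(score, token_ids)` pair; mapping the ids back to words would use a dictionary such as `trg_idx2word`.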
## Summary ## Summary
...@@ -529,7 +646,7 @@ for data in test_data(): ...@@ -529,7 +646,7 @@ for data in test_data():
5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318. 5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">知识共享 署名-相同方式共享 4.0 国际 许可协议</a>进行许可。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
...@@ -42,173 +42,148 @@ ...@@ -42,173 +42,148 @@
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Machine Translation # Machine Translation
The source code of this tutorial is live at [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation). For instructions on getting started with this book, see [Running This Book](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book). The source code of this tutorial is in [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation). Users new to the Paddle book should refer to [the user guide of Book Documentation](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md#run_the_book).
## Background ## Background
Machine translation (MT) leverages computers to translate from one language to another. The language to be translated is referred to as the source language, while the language to be translated into is referred to as the target language. Thus, Machine translation is the process of translating from the source language to the target language. It is one of the most important research topics in the field of natural language processing. Machine translation is to translate different languages with computer. The language to be translated is usually called source language, and the language representing the result of translation is called target language. Machine translation is the process of transformation from source language to target language, which is an important research assignment of Natural Language Processing.
Early machine translation systems were mostly rule-based: linguists had to write the transformation rules between two languages and enter them into a computer. This approach demands deep linguistic expertise, and it is hard to cover all the rules of even one language, let alone two or more. The major challenge of traditional machine translation is therefore the impossibility of obtaining a complete rule set\[[1](#References)\].
Early machine translation systems are mainly rule-based i.e. they rely on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one language. So it is quite a challenge for language experts to specify all possible rules in two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty in obtaining a complete rule set \[[1](#references)\]. To solve this problem, Statistical Machine Translation emerged. Its transformation rules are learned automatically from a large corpus instead of being handcrafted, which removes the knowledge-acquisition bottleneck of rule-based systems. However, it still faces challenges: 1. hand-designed features can never cover all language phenomena; 2. it is hard to use global features; 3. it depends on many pre-processing steps, such as word alignment, tokenization, rule extraction, and parsing, and the errors accumulated across these steps strongly affect translation quality.
In recent years, deep learning has offered new solutions to these bottlenecks. It is applied to machine translation in two ways: 1. Within the framework of a statistical machine translation system, neural networks improve core components such as the language model and the reordering model (see the left part of Figure One). 2. Abandoning the statistical machine translation framework, a neural network directly maps the source language to the target language; this is End-to-End Neural Machine Translation (see the right part of Figure One), or the NMT model for short.
<div align="center">
<img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/nmt.png?raw=true" width = "400" align=center/><br/>
Figure One. Neural Network Machine Translation System
</div>
To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example: In the following parts, we'll guide you through NMT model and its hands-on implementation in PaddlePaddle
1. Human designed features cannot cover all possible linguistic variations;
2. It is difficult to use global features;
3. The techniques heavily rely on pre-processing techniques like word alignment, word segmentation and tokenization, rule-extraction and syntactic parsing etc. The error introduced in any of these steps could accumulate and impact translation quality.
The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
1. Techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
2. Techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT).
<p align="center">
<img src="image/nmt_en.png" width=400><br/>
Figure 1. Neural Network based Machine Translation
</p>
This tutorial will mainly introduce an NMT model and how to use PaddlePaddle to train it. ## Result Demo
## Illustrative Results Take a Chinese-to-English translation model as an example. Given a trained model, suppose the input is the following tokenized Chinese sentence:
Let's consider an example of Chinese-to-English translation. The model is given the following segmented sentence in Chinese
```text ```text
这些 是 希望 的 曙光 和 解脱 的 迹象 . 这些 是 希望 的 曙光 和 解脱 的 迹象 .
``` ```
After training and with a beam-search size of 3, the generated translations are as follows:
If the number of translation results (i.e. the width of the [beam search algorithm](#beam search algorithm)) is set to 3, the generated English sentences are as follows:
```text ```text
0 -5.36816 These are signs of hope and relief . <e> 0 -5.36816 These are signs of hope and relief . <e>
1 -6.23177 These are the light of hope and relief . <e> 1 -6.23177 These are the light of hope and relief . <e>
2 -7.7914 These are the light of hope and the relief of hope . <e> 2 -7.7914 These are the light of hope and the relief of hope . <e>
``` ```
- The first column corresponds to the id of the generated sentence; the second column corresponds to the score of the generated sentence (in descending order), where a larger value indicates better quality; the last column corresponds to the generated sentence.
- There are two special tokens: `<e>` denotes the end of a sentence while `<unk>` denotes unknown word, i.e., a word not in the training dictionary.
## Overview of the Model
This section will introduce Bi-directional Recurrent Neural Network, the Encoder-Decoder framework used in NMT, as well as the beam search algorithm. - The first column from the left is the serial number of each generated sentence. The second column is the sentence score, listed in descending order; a higher score is better. The third column contains the generated English sentences.
### Bi-directional Recurrent Neural Network
We already introduced an instance of bi-directional RNN in the [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/label_semantic_roles/README.md) chapter. Here we present another bi-directional RNN model with a different architecture proposed by Bengio et al. in \[[2](#references),[4](#references)\]. This model takes a sequence as input and outputs a fixed dimensional feature vector at each step, encoding the context information at the corresponding time step. - In addition, there are two special marks: `<e>` indicates the end of a sentence, and `<unk>` represents an unknown word, i.e. a word that never appears in the dictionary.
Specifically, this bi-directional RNN processes the input sequence in the original and reverse order respectively, and then concatenates the output feature vectors at each time step as the final output. Thus the output node at each time step contains information from the past and future as context. The figure below shows an unrolled bi-directional RNN. This network contains a forward RNN and backward RNN with six weight matrices: weight matrices from input to forward hidden layer and backward hidden ($W_1, W_3$), weight matrices from hidden to itself ($W_2, W_5$), matrices from forward hidden and backward hidden to output layer ($W_4, W_6$). Note that there are no connections between forward hidden and backward hidden layers. ## Exploration of Models
<p align="center"> In this section, let's scrutinize Bi-directional Recurrent Neural Network, typical Encoder-Decoder structure in NMT model and beam search algorithm.
<img src="image/bi_rnn_en.png" width=450><br/>
Figure 3. Temporally unrolled bi-directional RNN
</p>
### Encoder-Decoder Framework ### Bi-directional Recurrent Neural Network
The Encoder-Decoder\[[2](#references)\] framework aims to solve the mapping of a sequence to another sequence, for sequences with arbitrary lengths. The source sequence is encoded into a vector via an encoder, which is then decoded to a target sequence via a decoder by maximizing the predictive probability. Both the encoder and the decoder are typically implemented via RNN.
<p align="center"> We have introduced a bi-directional recurrent neural network in the chapter [label_semantic_roles](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.md). Here we introduce another network, proposed by Bengio's team in \[[2](#References),[4](#References)\]. This network takes a sequence as input and extracts its features at each time step. Specifically, the output at each time step is a fixed-length vector that represents the contextual semantic information at that step.
<img src="image/encoder_decoder_en.png" width=700><br/>
Figure 4. Encoder-Decoder Framework
</p>
#### Encoder To be concrete, the Bi-directional recurrent neural network sequentially processes the input sequences in time dimension in sequential order or in reverse order, i.e., forward and backward. And the output of RNN at each time step are concatenated to be the final output layer. Hereby the output node of each time step contains complete past and future context information of current time step of input sequence. The figure below shows a bi-directional recurrent neural network expanded by time step. The network consists of a forward and a backward RNN with six weight matrices: a weight matrix ($W_1, W_3$) from input layer to the forward and backward hidden layers, and a weight matrix ($W_2, W_5$) from a hidden layer to itself (self-loop), the weight matrix from the forward hidden layer and the backward hidden layer to the output layer ($W_4, W_6$). Note that there is no connection between the forward hidden layer and the backward hidden layer.
There are three steps for encoding a sentence:
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\epsilon \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$ where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere. <div align="center">
<img src = "https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/bi_rnn.png?raw=true" width="400"><br/>
Figure 2. Bi-directional Recurrent Neural Network expanded by time step.
</div>
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation ### Encoder-Decoder Structure
* The dimensionality of the vector is typically large, leading to the curse of dimensionality; The Encoder-Decoder\[[2](#References)\] structure helps transform a source sequence with arbitrary length to another target sequence with arbitrary length. In the encoding phase, it encodes the entire source sequence into a vector. And in the decoding phase, it decodes the entire target sequence by maximizing the predicted sequence probability. The encoding and decoding process is usually implemented by RNN.
* It is hard to capture the relationships between words, i.e., semantic similarities. Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, with $C\epsilon R^{K\times \left | V \right |}$ as the projection matrix and $K$ is the dimensionality of the word embedding vector. <div align="center">
<img src ="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/encoder_decoder.png?raw=true" width="700"><br/>
Figure 3. Encoder-Decoder Frame
</div>
3. Encoding of the source sequence via RNN: This can be described mathematically as: <a name="encoder"></a>
#### Encoder
$$h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$$ Encoding can be done in 3 steps:
where 1.One-hot vector : each word $x_i$ of the source language sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a column vector $w_i\epsilon \left \{ 0,1 \right \}^{\left | V \right |},i=1,2,...,T$. The dimension of this vector $w_i$ is the same as the vocabulary size $\left | V \right |$, and there is only one dimension with a value of 1 (the position corresponds to the position of the word in the vocabulary), and the rest are all zero.
$h_0$ is a zero vector,
$\varnothing _\theta$ is a non-linear activation function, and
$\mathbf{h}=\left \{ h_1,..., h_T \right \}$
is the sequential encoding of the first $T$ words from the source sequence. The vector representation of the whole sentence can be represented as the encoding vector at the last time step $T$ from $\mathbf{h}$, or by temporal pooling over $\mathbf{h}$.
2.Word vector mapped to a low-dimensional semantic space: the one-hot representation has two problems: 1. the dimension of the generated vector is often large, which easily leads to the curse of dimensionality; 2. it is difficult to characterize the relationship between words (such as semantic similarity), so semantics cannot be expressed well. Therefore, the one-hot vector needs to be mapped to a low-dimensional semantic space and represented by a dense vector with fixed dimensions (called a word vector). The mapping matrix is $C\epsilon R^{K\times \left | V \right |}$, with $s_i=Cw_i$ representing the word vector of the $i$-th word and $K$ representing the dimension of the vector.
Bi-directional RNN can also be used in step (3) for more a complicated sentence encoding. This can be implemented using a bi-directional GRU. Forward GRU encodes the source sequence in its original order $(x_1,x_2,...,x_T)$, and generates a sequence of hidden states $(\overrightarrow{h_1},\overrightarrow{h_2},...,\overrightarrow{h_T})$. The backward GRU encodes the source sequence in reverse order, i.e., $(x_T,x_T-1,...,x_1)$ and generates $(\overleftarrow{h_1},\overleftarrow{h_2},...,\overleftarrow{h_T})$. Then for each word $x_i$, its complete hidden state is the concatenation of the corresponding hidden states from the two GRUs, i.e., $h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T}$. 3.Encode a source language word sequence with RNN: The calculation for this process is $h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$, where $h_0$ is an all-zero vector. $\varnothing _\theta$ is a non-linear activation function, and the result $\mathbf{h}=\left \{ h_1,..., h_T \right \}$ is the state code sequence of source language $T$ words read by RNN respectively. The vector representation of the entire sentence can use $\mathbf{h}$ in the state code of the last time step $T$, or use the pooling result in the time dimension.
<p align="center"> Step 3 can also use bi-directional recurrent neural network to implement more complex sentence-coded representation, which can be implemented with bi-directional GRU. The forward GRU sequentially encodes the source language word in the order of the word sequence $(x_1, x_2,..., x_T)$, and obtains a series of hidden layer states $(\overrightarrow{h_1},\overrightarrow{h_2},. ..,\overrightarrow{h_T})$. Similarly, the backward GRU encodes the source language word in the order of $(x_T,x_{T-1},...,x_1)$, resulting in $(\overleftarrow{h_1},\overleftarrow{h_2},. ..,\overleftarrow{h_T})$. Finally, for the word $x_i$, the hidden layer state is obtained by jointing the two GRUs, namely $h_i=\left [ \overrightarrow{h_i^T},\overleftarrow{h_i^T} \right ]^{T} $.
<img src="image/encoder_attention_en.png" width=500><br/> <div align="center">
Figure 5. Encoder using bi-directional GRU <img src="https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/image/encoder_attention.png?raw=true" width="400"><br/>
</p> Figure 4. Use bi-directional GRU encoder
</div>
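The three encoding steps and the bi-directional concatenation can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the tutorial's implementation: it uses a plain `tanh` recurrent cell in place of a GRU, random untrained weights, and made-up sizes `V`, `K`, `H`. All function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, H = 6, 4, 3  # toy vocabulary size |V|, word-vector dimension K, hidden size

def one_hot(word_id, vocab_size):
    """Step 1: vector with a single 1 at the word's vocabulary position."""
    w = np.zeros(vocab_size)
    w[word_id] = 1.0
    return w

C = rng.normal(size=(K, V))  # Step 2: mapping matrix C in R^{K x |V|}

def make_rnn():
    """Random weights for one direction (assumed tanh cell, not a GRU)."""
    return rng.normal(size=(H, K)), rng.normal(size=(H, H))

def run_rnn(weights, embeddings):
    """Step 3: h_i = tanh(W_in s_i + W_rec h_{i-1}), with h_0 = 0."""
    w_in, w_rec = weights
    h = np.zeros(H)
    states = []
    for s in embeddings:
        h = np.tanh(w_in @ s + w_rec @ h)
        states.append(h)
    return states

forward_rnn, backward_rnn = make_rnn(), make_rnn()

def encode(word_ids):
    embeddings = [C @ one_hot(i, V) for i in word_ids]  # s_i = C w_i
    fwd = run_rnn(forward_rnn, embeddings)
    # The backward RNN reads x_T ... x_1; reverse again to realign positions.
    bwd = run_rnn(backward_rnn, embeddings[::-1])[::-1]
    # The state of word i concatenates the forward and backward hidden states.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

states = encode([2, 0, 5])
```

Multiplying `C` by a one-hot vector simply selects one column of `C`, which is why frameworks implement this step as an embedding lookup rather than a matrix product.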
#### Decoder #### Decoder
The goal of the decoder is to maximize the probability of the next correct word in the target language. The main idea is as follows: When training a machine translation model, the goal of the decoding phase is to maximize the probability of the next correct target-language word. The idea is:
1.At each time, according to the encoding information of the source language sentence (also called context vector) $c$, the $i$th word $u_i$ of the real target language sequence and the hidden layer state $z_i $ of RNN at $i$th time, it calculates the next hidden layer state $z_{i+1}$. The formula is as follows:
$$z_{i+1}=\phi_{\theta '} \left ( c,u_i,z_i \right )$$
Where $\phi _{\theta '}$ is a nonlinear activation function; $c$ is the context vector of the source language sentence, if the attention mechanism is not used and the output of [encoder](#encoder) is the last element after the source language sentence is encoded, you can define $c=h_T$; $u_i$ is the $i$th word of the target language sequence, and $u_0$ is the start tag `<s>` of the target language sequence, indicating the start of decoding; $z_i$ is the hidden layer state of the RNN at $i$th time, and $z_0$ is an all-zero vector.
1. At each time step $i$, given the encoding vector (or context vector) $c$ of the source sentence, the $i$-th word $u_i$ from the ground-truth target language and the RNN hidden state $z_i$, the next hidden state $z_{i+1}$ is computed as: 2.Normalize $z_{i+1}$ by `softmax` to get the probability distribution $p_{i+1}$ of the $i+1$th word of the target language sequence. The probability distribution formula is as follows:
$$p\left ( u_{i+1}|u_{&lt;i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$
Where $W_sz_{i+1}+b_z$ scores each possible output word and normalizes with softmax to get the probability $p_{i+1}$ of $i+1$th word.
$$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$ 3.Calculate the cost according to $p_{i+1}$ and $u_{i+1}$.
where $\phi _{\theta '}$ is a non-linear activation function and $c=q\mathbf{h}$ is the context vector of the source sentence. Without using [attention](#Attention Mechanism), if the output of the [encoder](#Encoder) is the encoding vector at the last time step of the source sentence, then $c$ can be defined as $c=h_T$. $u_i$ denotes the $i$-th word from the target language sentence and $u_0$ denotes the beginning of the target language sentence (i.e., `<s>`), indicating the beginning of decoding. $z_i$ is the RNN hidden state at time step $i$ and $z_0$ is an all zero vector.
2. Calculate the probability $p_{i+1}$ for the $i+1$-th word in the target language sequence by normalizing $z_{i+1}$ using `softmax` as follows 4.Repeat steps 1~3 until all words in the target language sequence have been processed.
$$p\left ( u_{i+1}|u_{<i+1},\mathbf{x} \right )=softmax(W_sz_{i+1}+b_z)$$ The process of generating machine translation tasks, in general, is to translate source language sentences based on pre-trained model. The decode period in the generation process is different from the training process above. For details, please refer to [Beam Search Algorithm](#Beam Search Algorithm).
where $W_sz_{i+1}+b_z$ scores each possible words and is then normalized via softmax to produce the probability $p_{i+1}$ for the $i+1$-th word. <a name="beam search algorithm"></a>
### Beam Search Algorithm
3. Compute the cost according to $p_{i+1}$ and $u_{i+1}$. [Beam search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic graph search algorithm that explores a graph or tree by expanding only the most promising nodes in a limited set. It is usually used in systems with a very large solution space (such as machine translation and speech recognition), where memory cannot hold all the expanded solutions of the graph or tree. For example, when translating "`<s>Hello<e>`" in a machine translation task, even if the target language dictionary has only 3 words (`<s>`, `<e>`, `hello`), infinitely many sentences can be generated (the number of occurrences of `hello` is unbounded). To find good translations among them, we can use the beam search algorithm.
4. Repeat Steps 1-3, until all the words in the target language sentence have been processed.
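Steps 2 and 3 of the training-time decoder can be sketched in NumPy as below. The text does not name the cost function, so this sketch assumes the usual cross-entropy (negative log probability of the correct next word); `decoder_step_cost` and the toy shapes are hypothetical names and values chosen for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decoder_step_cost(z_next, W_s, b_z, target_id):
    """Score every target word from z_{i+1}, normalize, and compute the cost.

    Returns the distribution p_{i+1} and the assumed cross-entropy cost
    -log p_{i+1}[u_{i+1}] for the ground-truth word id `target_id`.
    """
    p = softmax(W_s @ z_next + b_z)
    return p, -np.log(p[target_id])

# Toy example: hidden size 3, target vocabulary of 4 words.
z_next = np.zeros(3)
W_s = np.ones((4, 3))
b_z = np.zeros(4)
p, cost = decoder_step_cost(z_next, W_s, b_z, target_id=2)
```

With an all-zero hidden state every word gets the same score, so `p` is uniform and the cost equals the log of the vocabulary size; training lowers this cost by raising the probability of the correct word.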
The generation process of machine translation is to translate the source sentence into a sentence in the target language according to a pre-trained model. There are some differences between the decoding step in generation and training. Please refer to [Beam Search Algorithm](#Beam Search Algorithm) for details. The beam search algorithm uses a breadth-first strategy to build a search tree. At each level of the tree, the nodes are sorted by heuristic cost (in this tutorial, the sum of the log probabilities of the generated words), and only a predetermined number of nodes (commonly called the beam width or beam size in the literature) are kept. Only these nodes are expanded at the next level and the others are pruned; that is, higher-quality nodes are retained and lower-quality nodes are cut off. This greatly reduces the space and time used by the search, at the cost of no guarantee that an optimal solution is found.
### Beam Search Algorithm In the decode period of using beam search algorithm, the goal is to maximize the probability of generated sequence. The idea is:
1.At each time, the next hidden layer state $z_{i+1}$ is calculated according to the encoding information $c$ of the source language sentence, the generated $i$th target language sequence words $u_i$, and the hidden layer state $z_i$ of RNN at $i$th time.
[Beam Search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. It is typically used when the solution space is huge (e.g., for machine translation, speech recognition), and there is not enough memory for all the possible solutions. For example, if we want to translate “`<s>你好<e>`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, where the word `hello` can appear different number of times. Beam search could be used to find a good translation among them. 2.Normalize $z_{i+1}$ by `softmax` to get the probability distribution $p_{i+1}$ of the $i+1$th words of the target language sequence.
Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed. 3.The word $u_{i+1}$ is sampled according to $p_{i+1}$.
The goal is to maximize the probability of the generated sequence when using beam search in decoding, The procedure is as follows: 4.Repeat steps 1~3 until you get the sentence end tag `<e>` or exceed the maximum generation length of the sentence.
1. At each time step $i$, compute the hidden state $z_{i+1}$ of the next time step according to the context vector $c$ of the source sentence, the $i$-th word $u_i$ generated for the target language sentence and the RNN hidden state $z_i$. Note: The formula for $z_{i+1}$ and $p_{i+1}$ is the same as in [Decoder](#Decoder). And since each step of the generation is implemented by the greedy method, it is not guaranteed to obtain the global optimal solution.
2. Normalize $z_{i+1}$ using `softmax` to get the probability $p_{i+1}$ for the $i+1$-th word for the target language sentence.
3. Sample the word $u_{i+1}$ according to $p_{i+1}$.
4. Repeat Steps 1-3, until end-of-sentence token `<e>` is generated or the maximum length of the sentence is reached.
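The pruning idea behind beam search (keep only the `beam_size` best partial sentences at each level, scored by the sum of log probabilities) can be sketched in plain Python. This is one common variant written for illustration, not the PaddlePaddle implementation; `beam_search`, `toy_model`, and the toy probabilities are assumptions.

```python
import math

def beam_search(step_log_probs, beam_size, max_length, start="<s>", end="<e>"):
    """Return up to `beam_size` (score, tokens) pairs, best first.

    `step_log_probs(tokens)` must return a dict mapping each candidate next
    word to its log probability given the partial sentence `tokens`.
    """
    beams = [(0.0, [start])]  # live hypotheses: (sum of log probs, tokens)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, seq in beams:
            for word, logp in step_log_probs(seq).items():
                candidates.append((score + logp, seq + [word]))
        # Keep only the best `beam_size` nodes at this level; prune the rest.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_length
    return sorted(finished, key=lambda c: c[0], reverse=True)[:beam_size]

def toy_model(seq):
    """Hypothetical 3-word model over <s>, <e>, hello."""
    if seq[-1] == "<s>":
        return {"hello": math.log(0.6), "<e>": math.log(0.4)}
    return {"hello": math.log(0.1), "<e>": math.log(0.9)}

best = beam_search(toy_model, beam_size=2, max_length=4)
```

Because only a fixed number of nodes survive each level, memory and time stay bounded even though the toy vocabulary admits arbitrarily long sentences; the price is that a pruned branch can never be recovered.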
Note: $z_{i+1}$ and $p_{i+1}$ are computed the same way as in [Decoder](#Decoder). In generation mode, each step is greedy, so there is no guarantee of a global optimum. ## Data Preparation
## BLEU Score This tutorial uses [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) from the [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) dataset as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the test set and generation set.
Bilingual Evaluation understudy (BLEU) is a metric widely used for automatic machine translation proposed by IBM Watson Research Center in 2002\[[5](#References)\]. The closer the translation produced by a machine is to the translation produced by a human expert, the better the performance of the translation system. ### Data Pre-processing
To measure the closeness between machine translation and human translation, sentence precision is used. It compares the number of matched n-grams. More matches will lead to higher BLEU scores. It contains two steps in pre-processing:
## Data Preparation -Merge parallel corpora files from source language to target language into one file:
This tutorial uses a dataset from [WMT-14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), where [bitexts (after selection)](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training set, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) is used as test and generation set. -Merge every `XXX.src` and `XXX.trg` into one file as `XXX`.
-Content in $i$th row of `XXX` is the connection of $i$th row of `XXX.src` with $i$th row of `XXX.trg`, which is divided by '\t'.
### Data Preprocessing -Create source language dictionary and target language dictionary of train data. There are **DICTSIZE** words in each dictionary, including: (DICTSIZE - 3) words with highest frequency in the corpus, and 3 special symbols `<s>` (the beginning of the sequence), `<e>` ( the end of the sequence) and `<unk>` (unknown word).
There are two steps for pre-processing: ### Sample Data
- Merge the source and target parallel corpus files into one file
- Merge `XXX.src` and `XXX.trg` file pair as `XXX`
- The $i$-th row in `XXX` is the concatenation of the $i$-th row from `XXX.src` with the $i$-th row from `XXX.trg`, separated with '\t'.
- Create a source dictionary and a target dictionary, each containing **DICTSIZE** words: the (DICTSIZE - 3) most frequent words in the corpus and 3 special tokens `<s>` (begin of sequence), `<e>` (end of sequence) and `<unk>` (unknown words that are not in the vocabulary). Because the complete dataset is large, the PaddlePaddle interface `paddle.dataset.wmt14` provides a pre-processed [smaller scale dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) by default so that the training process can be verified.
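The two pre-processing steps can be sketched in a few lines of Python. This is an illustration only: the helper names `merge_parallel` and `build_dict`, the in-memory toy corpus, and the choice to place the three special tokens at ids 0 to 2 are assumptions, not the behavior of the actual pre-processing script.

```python
from collections import Counter

SPECIAL_TOKENS = ["<s>", "<e>", "<unk>"]  # assumed to occupy ids 0, 1, 2

def merge_parallel(src_lines, trg_lines):
    """Join the i-th source line and the i-th target line with a tab."""
    return [s.strip() + "\t" + t.strip() for s, t in zip(src_lines, trg_lines)]

def build_dict(sentences, dict_size):
    """Keep the (dict_size - 3) most frequent words plus 3 special tokens."""
    counts = Counter(word for line in sentences for word in line.split())
    frequent = [word for word, _ in counts.most_common(dict_size - 3)]
    return {word: idx for idx, word in enumerate(SPECIAL_TOKENS + frequent)}

# Toy parallel corpus standing in for XXX.src and XXX.trg.
src = ["a b a", "a c"]
trg = ["x y", "x z"]
merged = merge_parallel(src, trg)
src_dict = build_dict(src, dict_size=5)
```

Words outside the dictionary would be mapped to the id of `<unk>` when sentences are converted to id sequences, for example with `src_dict.get(word, src_dict["<unk>"])`.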
### A Subset of Dataset The dataset has 193,319 training samples, 6,003 test samples, and a dictionary of 30,000 words. Because the dataset is small, the quality of models trained on it is not guaranteed.
Because the full dataset is very big, the PaddlePaddle package `paddle.dataset.wmt14` provides a preprocessed [subset of the dataset](http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz) to reduce download time. ## Model Configuration
This subset has 193319 instances of training data and 6003 instances of test data. Dictionary size is 30000. Because of the limitation of size of the subset, the effectiveness of trained model from this subset is not guaranteed. Next we start configuring model according to input data. First we import necessary library functions and define global variables.
## Model Configuration
Our program starts with importing necessary packages and initializing some global variables:
```python
from __future__ import print_function
...@@ -232,41 +207,51 @@ except ImportError:
    from paddle.fluid.trainer import *
    from paddle.fluid.inferencer import *

dict_size = 30000  # dictionary dimension
source_dict_dim = target_dict_dim = dict_size  # source/target language dictionary dimension
hidden_dim = 32  # size of hidden layer in encoder
word_dim = 16  # dimension of word vector
batch_size = 2  # the number of samples in batch
max_length = 8  # the maximum length of generated sentence
beam_size = 2  # width of beam
decoder_size = hidden_dim  # size of hidden layer in decoder
```
Then we implement the encoder as follows:

```python
def encoder(is_sparse):
    # define the input: word ids of the source language sequence
    src_word_id = pd.data(
        name="src_word_id", shape=[1], dtype='int64', lod_level=1)
    # map the ids above to word vectors in a low-dimensional space
    src_embedding = pd.embedding(
        input=src_word_id,
        size=[dict_size, word_dim],
        dtype='float32',
        is_sparse=is_sparse,
        param_attr=fluid.ParamAttr(name='vemb'))
    # LSTM layer: fc + dynamic_lstm
    fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh')
    lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4)
    # take the final state after encoding the source language sequence
    encoder_out = pd.sequence_last_step(input=lstm_hidden0)
    return encoder_out
```
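The encoder keeps only the last hidden state of each variable-length source sentence. The sketch below shows, in plain Python, what "last step" means when the sequences of a batch are stored back to back and delimited by offsets. The helper `last_step` and the toy values are hypothetical; this illustrates the idea and is not the Paddle operator itself.

```python
def last_step(rows, offsets):
    """Return the last row of every sequence.

    rows: the time steps of all sequences, concatenated in order.
    offsets: sequence i occupies rows[offsets[i]:offsets[i + 1]].
    """
    result = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        if end <= start:
            raise ValueError("empty sequence")
        result.append(rows[end - 1])
    return result


# two sentences of length 3 and 2, each time step with a hidden vector of size 2
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
final_states = last_step(hidden, [0, 3, 5])
print(final_states)
```

Each sentence contributes exactly one vector, regardless of its length, which is why this context can initialize the decoder state.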
Implement the decoder for training as follows:

```python
def train_decoder(context, is_sparse):
    # define the input: word ids of the target language sequence, mapped to word vectors in a low-dimensional space
    trg_language_word = pd.data(
        name="target_language_word", shape=[1], dtype='int64', lod_level=1)
    trg_embedding = pd.embedding(
...@@ -277,35 +262,44 @@ Implement the decoder for training as follows:
        param_attr=fluid.ParamAttr(name='vemb'))

    rnn = pd.DynamicRNN()
    with rnn.block():  # use DynamicRNN to define the computation at each step
        # fetch the target language word vector for the current step
        current_word = rnn.step_input(trg_embedding)
        # obtain the hidden state of the previous step
        pre_state = rnn.memory(init=context)
        # computing unit of the decoder: a single-layer feed-forward network
        current_state = pd.fc(input=[current_word, pre_state],
                              size=decoder_size,
                              act='tanh')
        # compute the normalized probability of each predicted word
        current_score = pd.fc(input=current_state,
                              size=target_dict_dim,
                              act='softmax')
        # update the hidden state of the RNN
        rnn.update_memory(pre_state, current_state)
        # output the predicted probability
        rnn.output(current_score)

    return rnn()
```
Implement the decoder for inference as follows:

```python
def decode(context, is_sparse):
    init_state = context
    # define the counter variables used in decoding
    array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length)
    counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True)

    # define tensor arrays to save the content of each time step, and write the initial id, score and state
    state_array = pd.create_array('float32')
    pd.array_write(init_state, array=state_array, i=counter)

    ids_array = pd.create_array('int64')
    scores_array = pd.create_array('float32')

...@@ -316,34 +310,35 @@ def decode(context, is_sparse):
    pd.array_write(init_ids, array=ids_array, i=counter)
    pd.array_write(init_scores, array=scores_array, i=counter)

    # define the condition variable that stops the loop
    cond = pd.less_than(x=counter, y=array_len)
    # define while_op
    while_op = pd.While(cond=cond)
    with while_op.block():  # define the computation of each step
        # obtain the decoder input for the current step: the ids chosen at the
        # previous step, their scores, and the previous state
        pre_ids = pd.array_read(array=ids_array, i=counter)
        pre_state = pd.array_read(array=state_array, i=counter)
        pre_score = pd.array_read(array=scores_array, i=counter)

        # expand the previous state so that it lines up with the ids chosen at the previous step
        pre_state_expanded = pd.sequence_expand(pre_state, pre_score)

        # same decoder logic as in training mode: input word vector and computing unit,
        # followed by the normalized probability of each predicted word
        pre_ids_emb = pd.embedding(
            input=pre_ids,
            size=[dict_size, word_dim],
            dtype='float32',
            is_sparse=is_sparse)

        current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb],
                              size=decoder_size,
                              act='tanh')
        current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score)
        current_score = pd.fc(input=current_state_with_lod,
                              size=target_dict_dim,
                              act='softmax')
        topk_scores, topk_indices = pd.topk(current_score, k=beam_size)

        # compute the accumulated scores and perform beam search
        accu_scores = pd.elementwise_add(
            x=pd.log(topk_scores), y=pd.reshape(pre_score, shape=[-1]), axis=0)
        selected_ids, selected_scores = pd.beam_search(
...@@ -356,14 +351,12 @@ def decode(context, is_sparse):
            level=0)

        pd.increment(x=counter, value=1, in_place=True)
        # write the search result and the corresponding hidden state into the tensor arrays
        pd.array_write(current_state, array=state_array, i=counter)
        pd.array_write(selected_ids, array=ids_array, i=counter)
        pd.array_write(selected_scores, array=scores_array, i=counter)

        # update the condition that stops the loop: the maximum length is reached
        # or all candidates of the source sentences have ended
        length_cond = pd.less_than(x=counter, y=array_len)
        finish_cond = pd.logical_not(pd.is_empty(x=selected_ids))
        pd.logical_and(x=length_cond, y=finish_cond, out=cond)
...@@ -375,8 +368,9 @@ def decode(context, is_sparse):
```
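The core of each decoding step (take the top words, add their log-probabilities to the score of the prefix, keep the best candidates) can be sketched without Paddle. The helper `beam_step`, the start id `0`, and the toy probabilities are assumptions for illustration. Unlike `pd.beam_search`, this simplified version handles a single source sentence and ignores the end token.

```python
import math


def beam_step(prefixes, next_word_probs, beam_size):
    """Expand every prefix by its top words and keep the best `beam_size` candidates.

    prefixes: list of (word_ids, accumulated_log_prob)
    next_word_probs: next_word_probs[i][w] is the probability of word w after prefix i
    """
    candidates = []
    for (ids, score), probs in zip(prefixes, next_word_probs):
        # like topk: only the beam_size most probable words of a prefix can survive
        top_words = sorted(range(len(probs)), key=lambda w: -probs[w])[:beam_size]
        for w in top_words:
            # accumulated score = previous score + log-probability of the new word
            candidates.append((ids + [w], score + math.log(probs[w])))
    candidates.sort(key=lambda c: -c[1])
    return candidates[:beam_size]


prefixes = [([0], 0.0)]  # a single prefix holding the assumed start id 0
step1 = beam_step(prefixes, [[0.1, 0.6, 0.3]], beam_size=2)
step2 = beam_step(step1, [[0.5, 0.4, 0.1], [0.05, 0.05, 0.9]], beam_size=2)
for ids, score in step2:
    print(ids, round(math.exp(score), 3))
```

Working in log space turns the product of word probabilities into a sum, which is the role of `pd.log(topk_scores)` and `elementwise_add` in the decoder above.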
Then we define a `train_program` that uses the results of `encoder` and `train_decoder` to compute the cost against the label data. We also define an `optimizer_func` to specify the optimizer.

```python
def train_program(is_sparse):
...@@ -396,21 +390,18 @@ def optimizer_func():
            regularization_coeff=0.1))
```
## Model Training

### Define Training Environment

Define the training environment and specify whether training runs on CPU or GPU.

```python
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
```
### Define Data Provider

Next we define the data providers for training and testing. `paddle.dataset.wmt14.train` yields the records of each pass; the provider reads `buf_size` records at a time, shuffles them, and then generates batches of `BATCH_SIZE` records for training.

```python
train_reader = paddle.batch(
...@@ -420,8 +411,7 @@ train_reader = paddle.batch(
```
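To make the roles of `buf_size` and `BATCH_SIZE` concrete, here is a minimal pure-Python sketch of a buffered shuffle followed by batching. The functions `shuffle` and `batch` below are simplified stand-ins written for this illustration (including the `seed` parameter), not the PaddlePaddle implementations.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Buffered shuffle: fill a buffer of buf_size records, shuffle it, then yield from it."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        if buf:  # flush the last, partially filled buffer
            rng.shuffle(buf)
            for item in buf:
                yield item

    return shuffled_reader


def batch(reader, batch_size):
    """Group consecutive records into lists of at most batch_size records."""

    def batch_reader():
        current = []
        for record in reader():
            current.append(record)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            yield current

    return batch_reader


def toy_reader():
    for i in range(7):
        yield i


train_reader = batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=2)
batches = list(train_reader())
print(batches)
```

Records are only shuffled within one buffer, so a larger `buf_size` gives a more thorough shuffle at the cost of more memory.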
### Create Trainer

Create a trainer that takes `train_program` as input and specify the optimizer function.

```python
is_sparse = False
...@@ -431,20 +421,17 @@ trainer = Trainer(
    optimizer_func=optimizer_func)
```
### Feeding Data

`feed_order` specifies the correspondence between each column of the generated records and `paddle.layer.data`. For example, the first column of the data generated by `wmt14.train` corresponds to `src_word_id`.

```python
feed_order = [
    'src_word_id', 'target_language_word', 'target_language_next_word'
]
```
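The following sketch illustrates how the three names in `feed_order` line up with the columns of one training record, and why the last two columns are shifted copies of the same target sentence. The helper `make_record` and the ids chosen for `<s>` and `<e>` are assumptions for illustration; the exact record layout produced by `wmt14.train` should be checked against the dataset code.

```python
feed_order = ['src_word_id', 'target_language_word', 'target_language_next_word']

START_ID, END_ID = 0, 1  # assumed ids of <s> and <e>, for illustration only


def make_record(src_ids, trg_ids):
    """Build one training record: source, decoder input, decoder label."""
    trg_input = [START_ID] + trg_ids  # what the decoder reads at each step
    trg_next = trg_ids + [END_ID]     # what the decoder should predict at each step
    return (src_ids, trg_input, trg_next)


record = make_record(src_ids=[5, 8, 13], trg_ids=[7, 9])
named = dict(zip(feed_order, record))  # column i is fed to the layer named feed_order[i]
for name in feed_order:
    print(name, named[name])
```

At step $t$ the decoder reads element $t$ of `target_language_word` and is trained to predict element $t$ of `target_language_next_word`.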
### Event Handler

The callback function `event_handler` is called during training whenever a pre-defined event occurs. For example, we can check the cost with `trainer.test` when an `EndStepEvent` occurs.

```python
def event_handler(event):
...@@ -456,10 +443,8 @@ def event_handler(event):
            trainer.stop()
```
### Start Training

Finally, we invoke `trainer.train` with `num_epochs` and the other parameters to start training.

```python
EPOCH_NUM = 1
...@@ -471,20 +456,20 @@ trainer.train(
    feed_order=feed_order)
```
## Inference

### Define the Decoder Part

Use the `encoder` and `decode` functions defined above to predict the translated ids and their scores.

```python
context = encoder(is_sparse)
translation_ids, translation_scores = decode(context, is_sparse)
```
### Define Data

First we initialize the ids and scores and create tensors from them as input. In this prediction we use the first record of `wmt14.test` for inference, and at the end we fetch the source language dictionary and the target language dictionary to print the corresponding sentences.

```python
init_ids_data = np.array([1 for _ in range(batch_size)], dtype='int64')
...@@ -513,9 +498,9 @@ feeder = fluid.DataFeeder(feed_list, place)
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
```
### Infer

Now we can start predicting. We create `feed_dict` with all the inputs required and run it on `executor` to obtain the predicted ids and the corresponding scores.

```python
exe = Executor(place)
...@@ -549,6 +534,11 @@ for data in test_data():
    break
```
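The inference script prints each candidate translation by cutting the flat array of predicted ids at the offsets stored in its LoD and skipping the first id of each candidate. A minimal sketch of that slicing, with a hypothetical helper `split_translations` and a toy dictionary invented for the example:

```python
def split_translations(flat_ids, offsets, id_to_word):
    """Cut a flat id list into sentences using offsets and map the ids to words."""
    sentences = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        # skip the first id of each candidate, which is the start token
        words = [id_to_word[i] for i in flat_ids[start + 1:end]]
        sentences.append(" ".join(words))
    return sentences


id_to_word = {0: "<s>", 1: "<e>", 2: "hello", 3: "world", 4: "hi"}  # toy dictionary
flat_ids = [0, 2, 3, 1, 0, 4, 3, 1]  # two candidates stored back to back
offsets = [0, 4, 8]                  # candidate i occupies flat_ids[offsets[i]:offsets[i + 1]]
sentences = split_translations(flat_ids, offsets, id_to_word)
print(sentences)
```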
## Summary

End-to-end neural machine translation is a recently acclaimed machine translation method. In this section, we introduced the typical Encoder-Decoder framework of NMT. Because NMT is a typical Seq2Seq (Sequence to Sequence) learning task, other Seq2Seq tasks, such as query rewriting, summarization, and single-round dialogue, can also be tackled with this model.

<a name="References"></a>
## References
1. Koehn P. [Statistical machine translation](https://books.google.com.hk/books?id=4v_Cx1wIMLkC&printsec=frontcover&hl=zh-CN&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)[M]. Cambridge University Press, 2009.
...@@ -558,7 +548,8 @@ for data in test_data():
5. Papineni K, Roukos S, Ward T, et al. [BLEU: a method for automatic evaluation of machine translation](http://dl.acm.org/citation.cfm?id=1073135)[C]//Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002: 311-318.
<br/> <br/>
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://paddlepaddleimage.cdn.bcebos.com/bookimage/camo.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">This tutorial</span> is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.framework as framework
import paddle.fluid.layers as pd
from paddle.fluid.executor import Executor
import os
dict_size = 30000
source_dict_dim = target_dict_dim = dict_size
hidden_dim = 32
word_dim = 32
batch_size = 2
max_length = 8
topk_size = 50
beam_size = 2
is_sparse = True
decoder_size = hidden_dim
model_save_dir = "machine_translation.inference.model"


def encoder():
    src_word_id = pd.data(
        name="src_word_id", shape=[1], dtype='int64', lod_level=1)
    src_embedding = pd.embedding(
        input=src_word_id,
        size=[dict_size, word_dim],
        dtype='float32',
        is_sparse=is_sparse,
        param_attr=fluid.ParamAttr(name='vemb'))

    fc1 = pd.fc(input=src_embedding, size=hidden_dim * 4, act='tanh')
    lstm_hidden0, lstm_0 = pd.dynamic_lstm(input=fc1, size=hidden_dim * 4)
    encoder_out = pd.sequence_last_step(input=lstm_hidden0)
    return encoder_out


def decode(context):
    init_state = context
    array_len = pd.fill_constant(shape=[1], dtype='int64', value=max_length)
    counter = pd.zeros(shape=[1], dtype='int64', force_cpu=True)

    # fill the first element with init_state
    state_array = pd.create_array('float32')
    pd.array_write(init_state, array=state_array, i=counter)

    # ids, scores as memory
    ids_array = pd.create_array('int64')
    scores_array = pd.create_array('float32')

    init_ids = pd.data(name="init_ids", shape=[1], dtype="int64", lod_level=2)
    init_scores = pd.data(
        name="init_scores", shape=[1], dtype="float32", lod_level=2)

    pd.array_write(init_ids, array=ids_array, i=counter)
    pd.array_write(init_scores, array=scores_array, i=counter)

    cond = pd.less_than(x=counter, y=array_len)

    while_op = pd.While(cond=cond)
    with while_op.block():
        pre_ids = pd.array_read(array=ids_array, i=counter)
        pre_state = pd.array_read(array=state_array, i=counter)
        pre_score = pd.array_read(array=scores_array, i=counter)

        # expand the lod of pre_state to be the same with pre_score
        pre_state_expanded = pd.sequence_expand(pre_state, pre_score)

        pre_ids_emb = pd.embedding(
            input=pre_ids,
            size=[dict_size, word_dim],
            dtype='float32',
            is_sparse=is_sparse,
            param_attr=fluid.ParamAttr(name='vemb'))

        # use rnn unit to update rnn
        current_state = pd.fc(
            input=[pre_state_expanded, pre_ids_emb],
            size=decoder_size,
            act='tanh')
        current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score)
        # use score to do beam search
        current_score = pd.fc(
            input=current_state_with_lod, size=target_dict_dim, act='softmax')
        topk_scores, topk_indices = pd.topk(current_score, k=beam_size)
        # calculate accumulated scores after topk to reduce computation cost
        accu_scores = pd.elementwise_add(
            x=pd.log(topk_scores), y=pd.reshape(pre_score, shape=[-1]), axis=0)
        selected_ids, selected_scores = pd.beam_search(
            pre_ids,
            pre_score,
            topk_indices,
            accu_scores,
            beam_size,
            end_id=10,
            level=0)

        with pd.Switch() as switch:
            with switch.case(pd.is_empty(selected_ids)):
                pd.fill_constant(
                    shape=[1], value=0, dtype='bool', force_cpu=True, out=cond)
            with switch.default():
                pd.increment(x=counter, value=1, in_place=True)

                # update the memories
                pd.array_write(current_state, array=state_array, i=counter)
                pd.array_write(selected_ids, array=ids_array, i=counter)
                pd.array_write(selected_scores, array=scores_array, i=counter)

                # update the break condition: up to the max length or all candidates of
                # source sentences have ended.
                length_cond = pd.less_than(x=counter, y=array_len)
                finish_cond = pd.logical_not(pd.is_empty(x=selected_ids))
                pd.logical_and(x=length_cond, y=finish_cond, out=cond)

    translation_ids, translation_scores = pd.beam_search_decode(
        ids=ids_array, scores=scores_array, beam_size=beam_size, end_id=10)

    return translation_ids, translation_scores


def decode_main(use_cuda):
    if use_cuda and not fluid.core.is_compiled_with_cuda():
        return
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

    exe = Executor(place)
    exe.run(framework.default_startup_program())

    context = encoder()
    translation_ids, translation_scores = decode(context)
    fluid.io.load_persistables(executor=exe, dirname=model_save_dir)

    init_ids_data = np.array([1 for _ in range(batch_size)], dtype='int64')
    init_scores_data = np.array(
        [1. for _ in range(batch_size)], dtype='float32')
    init_ids_data = init_ids_data.reshape((batch_size, 1))
    init_scores_data = init_scores_data.reshape((batch_size, 1))
    init_lod = [1] * batch_size
    init_lod = [init_lod, init_lod]

    init_ids = fluid.create_lod_tensor(init_ids_data, init_lod, place)
    init_scores = fluid.create_lod_tensor(init_scores_data, init_lod, place)

    test_data = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.test(dict_size), buf_size=1000),
        batch_size=batch_size)

    feed_order = ['src_word_id']
    feed_list = [
        framework.default_main_program().global_block().var(var_name)
        for var_name in feed_order
    ]
    feeder = fluid.DataFeeder(feed_list, place)

    src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)

    for data in test_data():
        # build a list (not a lazy map object) so it can be fed and indexed on Python 3 as well
        feed_data = [[x[0]] for x in data]
        feed_dict = feeder.feed(feed_data)
        feed_dict['init_ids'] = init_ids
        feed_dict['init_scores'] = init_scores

        results = exe.run(
            framework.default_main_program(),
            feed=feed_dict,
            fetch_list=[translation_ids, translation_scores],
            return_numpy=False)

        result_ids = np.array(results[0])
        result_ids_lod = results[0].lod()
        result_scores = np.array(results[1])

        print("Original sentence:")
        print(" ".join([src_dict[w] for w in feed_data[0][0][1:-1]]))
        print("Translated score and sentence:")
        # range works on both Python 2 and 3, unlike xrange
        for i in range(beam_size):
            start_pos = result_ids_lod[1][i] + 1
            end_pos = result_ids_lod[1][i + 1]
            print("%d\t%.4f\t%s\n" % (
                i + 1, result_scores[end_pos - 1],
                " ".join([trg_dict[w] for w in result_ids[start_pos:end_pos]])))

        break


def main(use_cuda):
    decode_main(False)  # Beam Search does not support CUDA


if __name__ == '__main__':
    use_cuda = os.getenv('WITH_GPU', '0') != '0'
    main(use_cuda)
...@@ -12,130 +12,315 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import six

import numpy as np
import paddle
import paddle.fluid as fluid

dict_size = 30000
source_dict_size = target_dict_size = dict_size
word_dim = 512
hidden_dim = 512
decoder_size = hidden_dim
max_length = 256
beam_size = 4
batch_size = 64

is_sparse = True
model_save_dir = "machine_translation.inference.model"


def encoder():
    src_word_id = fluid.layers.data(
        name="src_word_id", shape=[1], dtype='int64', lod_level=1)
    src_embedding = fluid.layers.embedding(
        input=src_word_id,
        size=[source_dict_size, word_dim],
        dtype='float32',
        is_sparse=is_sparse)

    fc_forward = fluid.layers.fc(
        input=src_embedding, size=hidden_dim * 3, bias_attr=False)
    src_forward = fluid.layers.dynamic_gru(input=fc_forward, size=hidden_dim)
    fc_backward = fluid.layers.fc(
        input=src_embedding, size=hidden_dim * 3, bias_attr=False)
    src_backward = fluid.layers.dynamic_gru(
        input=fc_backward, size=hidden_dim, is_reverse=True)
    encoded_vector = fluid.layers.concat(
        input=[src_forward, src_backward], axis=1)
    return encoded_vector


def cell(x, hidden, encoder_out, encoder_out_proj):
    def simple_attention(encoder_vec, encoder_proj, decoder_state):
        decoder_state_proj = fluid.layers.fc(
            input=decoder_state, size=decoder_size, bias_attr=False)
        decoder_state_expand = fluid.layers.sequence_expand(
            x=decoder_state_proj, y=encoder_proj)
        mixed_state = fluid.layers.elementwise_add(encoder_proj,
                                                   decoder_state_expand)
        attention_weights = fluid.layers.fc(
            input=mixed_state, size=1, bias_attr=False)
        attention_weights = fluid.layers.sequence_softmax(
            input=attention_weights)
        weights_reshape = fluid.layers.reshape(x=attention_weights, shape=[-1])
        scaled = fluid.layers.elementwise_mul(
            x=encoder_vec, y=weights_reshape, axis=0)
        context = fluid.layers.sequence_pool(input=scaled, pool_type='sum')
        return context

    context = simple_attention(encoder_out, encoder_out_proj, hidden)
    out = fluid.layers.fc(
        input=[x, context], size=decoder_size * 3, bias_attr=False)
    out = fluid.layers.gru_unit(
        input=out, hidden=hidden, size=decoder_size * 3)[0]
    return out, out


def train_decoder(encoder_out):
    encoder_last = fluid.layers.sequence_last_step(input=encoder_out)
    encoder_last_proj = fluid.layers.fc(
        input=encoder_last, size=decoder_size, act='tanh')
    # cache the encoder_out's computed result in attention
    encoder_out_proj = fluid.layers.fc(
        input=encoder_out, size=decoder_size, bias_attr=False)

    trg_language_word = fluid.layers.data(
        name="target_language_word", shape=[1], dtype='int64', lod_level=1)
    trg_embedding = fluid.layers.embedding(
        input=trg_language_word,
        size=[target_dict_size, word_dim],
        dtype='float32',
        is_sparse=is_sparse)

    rnn = fluid.layers.DynamicRNN()
    with rnn.block():
        x = rnn.step_input(trg_embedding)
        pre_state = rnn.memory(init=encoder_last_proj, need_reorder=True)
        encoder_out = rnn.static_input(encoder_out)
        encoder_out_proj = rnn.static_input(encoder_out_proj)
        out, current_state = cell(x, pre_state, encoder_out, encoder_out_proj)
        prob = fluid.layers.fc(input=out, size=target_dict_size, act='softmax')

        rnn.update_memory(pre_state, current_state)
        rnn.output(prob)

    return rnn()


def train_model():
    encoder_out = encoder()
    rnn_out = train_decoder(encoder_out)
    label = fluid.layers.data(
        name="target_language_next_word", shape=[1], dtype='int64', lod_level=1)
    cost = fluid.layers.cross_entropy(input=rnn_out, label=label)
    avg_cost = fluid.layers.mean(cost)
    return avg_cost


def optimizer_func():
    fluid.clip.set_gradient_clip(
        clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=5.0))
    lr_decay = fluid.layers.learning_rate_scheduler.noam_decay(hidden_dim, 1000)
    return fluid.optimizer.Adam(
        learning_rate=lr_decay,
        regularization=fluid.regularizer.L2DecayRegularizer(
            regularization_coeff=1e-4))


def train(use_cuda):
    train_prog = fluid.Program()
    startup_prog = fluid.Program()
    with fluid.program_guard(train_prog, startup_prog):
        with fluid.unique_name.guard():
            avg_cost = train_model()
            optimizer = optimizer_func()
            optimizer.minimize(avg_cost)

    if use_cuda and not fluid.core.is_compiled_with_cuda():
        return
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    exe = fluid.Executor(place)

    train_data = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt16.train(source_dict_size, target_dict_size),
            buf_size=10000),
        batch_size=batch_size)

    feeder = fluid.DataFeeder(
        feed_list=[
            'src_word_id', 'target_language_word', 'target_language_next_word'
        ],
        place=place,
        program=train_prog)

    exe.run(startup_prog)

    EPOCH_NUM = 20
    for pass_id in six.moves.xrange(EPOCH_NUM):
        batch_id = 0
        for data in train_data():
            cost = exe.run(
                train_prog, feed=feeder.feed(data), fetch_list=[avg_cost])[0]
            print('pass_id: %d, batch_id: %d, loss: %f' % (pass_id, batch_id,
                                                           cost))
            batch_id += 1
        fluid.io.save_params(exe, model_save_dir, main_program=train_prog)


def infer_decoder(encoder_out):
    encoder_last = fluid.layers.sequence_last_step(input=encoder_out)
    encoder_last_proj = fluid.layers.fc(
        input=encoder_last, size=decoder_size, act='tanh')
    encoder_out_proj = fluid.layers.fc(
        input=encoder_out, size=decoder_size, bias_attr=False)

    max_len = fluid.layers.fill_constant(
        shape=[1], dtype='int64', value=max_length)
    counter = fluid.layers.zeros(shape=[1], dtype='int64', force_cpu=True)

    init_ids = fluid.layers.data(
        name="init_ids", shape=[1], dtype="int64", lod_level=2)
    init_scores = fluid.layers.data(
        name="init_scores", shape=[1], dtype="float32", lod_level=2)

    # create and init arrays to save selected ids, scores and states for each step
    ids_array = fluid.layers.array_write(init_ids, i=counter)
    scores_array = fluid.layers.array_write(init_scores, i=counter)
    state_array = fluid.layers.array_write(encoder_last_proj, i=counter)

    cond = fluid.layers.less_than(x=counter, y=max_len)
    while_op = fluid.layers.While(cond=cond)
    with while_op.block():
        pre_ids = fluid.layers.array_read(array=ids_array, i=counter)
        pre_score = fluid.layers.array_read(array=scores_array, i=counter)
        pre_state = fluid.layers.array_read(array=state_array, i=counter)

        pre_ids_emb = fluid.layers.embedding(
            input=pre_ids,
            size=[target_dict_size, word_dim],
            dtype='float32',
            is_sparse=is_sparse)
        out, current_state = cell(pre_ids_emb, pre_state, encoder_out,
                                  encoder_out_proj)
        prob = fluid.layers.fc(
            input=current_state, size=target_dict_size, act='softmax')

        # beam search
        topk_scores, topk_indices = fluid.layers.topk(prob, k=beam_size)
        accu_scores = fluid.layers.elementwise_add(
            x=fluid.layers.log(topk_scores),
            y=fluid.layers.reshape(pre_score, shape=[-1]),
            axis=0)
        accu_scores = fluid.layers.lod_reset(x=accu_scores, y=pre_ids)
        selected_ids, selected_scores = fluid.layers.beam_search(
            pre_ids, pre_score, topk_indices, accu_scores, beam_size, end_id=1)

        fluid.layers.increment(x=counter, value=1, in_place=True)
        # save selected ids and corresponding scores of each step
        fluid.layers.array_write(selected_ids, array=ids_array, i=counter)
fluid.layers.array_write(selected_scores, array=scores_array, i=counter)
# update rnn state by sequence_expand acting as gather
current_state = fluid.layers.sequence_expand(current_state,
selected_ids)
fluid.layers.array_write(current_state, array=state_array, i=counter)
current_enc_out = fluid.layers.sequence_expand(encoder_out,
selected_ids)
fluid.layers.assign(current_enc_out, encoder_out)
current_enc_out_proj = fluid.layers.sequence_expand(encoder_out_proj,
selected_ids)
fluid.layers.assign(current_enc_out_proj, encoder_out_proj)
# update conditional variable
length_cond = fluid.layers.less_than(x=counter, y=max_len)
finish_cond = fluid.layers.logical_not(
fluid.layers.is_empty(x=selected_ids))
fluid.layers.logical_and(x=length_cond, y=finish_cond, out=cond)
translation_ids, translation_scores = fluid.layers.beam_search_decode(
ids=ids_array, scores=scores_array, beam_size=beam_size, end_id=1)
return translation_ids, translation_scores
def infer_model():
encoder_out = encoder()
translation_ids, translation_scores = infer_decoder(encoder_out)
return translation_ids, translation_scores
def infer(use_cuda):
infer_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(infer_prog, startup_prog):
with fluid.unique_name.guard():
translation_ids, translation_scores = infer_model()
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
test_data = paddle.batch(
paddle.dataset.wmt16.test(source_dict_size, target_dict_size),
batch_size=batch_size)
src_idx2word = paddle.dataset.wmt16.get_dict(
"en", source_dict_size, reverse=True)
trg_idx2word = paddle.dataset.wmt16.get_dict(
"de", target_dict_size, reverse=True)
trainer = Trainer( fluid.io.load_params(exe, model_save_dir, main_program=infer_prog)
train_func=train_program, place=place, optimizer_func=optimizer_func)
trainer.train( for data in test_data():
reader=train_reader, src_word_id = fluid.create_lod_tensor(
num_epochs=EPOCH_NUM, data=[x[0] for x in data],
event_handler=event_handler, recursive_seq_lens=[[len(x[0]) for x in data]],
feed_order=feed_order) place=place)
init_ids = fluid.create_lod_tensor(
data=np.array([[0]] * len(data), dtype='int64'),
recursive_seq_lens=[[1] * len(data)] * 2,
place=place)
init_scores = fluid.create_lod_tensor(
data=np.array([[0.]] * len(data), dtype='float32'),
recursive_seq_lens=[[1] * len(data)] * 2,
place=place)
seq_ids, seq_scores = exe.run(
infer_prog,
feed={
'src_word_id': src_word_id,
'init_ids': init_ids,
'init_scores': init_scores
},
fetch_list=[translation_ids, translation_scores],
return_numpy=False)
# How to parse the results:
# Suppose the lod of seq_ids is:
# [[0, 3, 6], [0, 12, 24, 40, 54, 67, 82]]
# then from lod[0]:
# there are 2 source sentences, beam width is 3.
# from lod[1]:
# the first source sentence has 3 hyps; the lengths are 12, 12, 16
# the second source sentence has 3 hyps; the lengths are 14, 13, 15
hyps = [[] for i in range(len(seq_ids.lod()[0]) - 1)]
scores = [[] for i in range(len(seq_scores.lod()[0]) - 1)]
for i in range(len(seq_ids.lod()[0]) - 1): # for each source sentence
start = seq_ids.lod()[0][i]
end = seq_ids.lod()[0][i + 1]
print("Original sentence:")
print(" ".join([src_idx2word[idx] for idx in data[i][0][1:-1]]))
print("Translated score and sentence:")
for j in range(end - start): # for each candidate
sub_start = seq_ids.lod()[1][start + j]
sub_end = seq_ids.lod()[1][start + j + 1]
hyps[i].append(" ".join([
trg_idx2word[idx]
for idx in np.array(seq_ids)[sub_start:sub_end][1:-1]
]))
scores[i].append(np.array(seq_scores)[sub_end - 1])
print(scores[i][-1], hyps[i][-1].encode('utf8'))
def main(use_cuda): def main(use_cuda):
train(use_cuda) train(use_cuda)
infer(use_cuda)
if __name__ == '__main__': if __name__ == '__main__':
......
#!/bin/bash
#This file is only used for continuous evaluation.
export FLAGS_cudnn_deterministic=True
export CUDA_VISIBLE_DEVICES=0
python dc_gan.py --enable_ce true --epoch 1 --use_gpu True | python _ce.py
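# Generative Adversarial Networks
The source code for this tutorial is in book/09.gan. If this is your first time using the Book, please refer to the Book documentation usage instructions.
### Notes ###
1. Hardware requirements:
This tutorial can run on either CPU or GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, note that the default image provides a GPU environment with CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in this document and the script:
Note: to make this tutorial easier to read and use, we split up and adjusted the code of dc_gan.py before placing it here. The code in this document produces the same results as dc_gan.py, and you can run [dc_gan.py](https://github.com/PaddlePaddle/book/blob/develop/09.gan/dc_gan.py) directly to verify this.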
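## Background
Generative adversarial networks (GANs) \[[1](#参考文献)\] are an unsupervised learning method in which two neural networks learn by competing against each other. The method was first proposed by Ian Goodfellow et al. in 2014; see the original paper, [Generative Adversarial Network](https://arxiv.org/abs/1406.2661).
A GAN consists of a generator network and a discriminator network. The generator takes random samples from a latent space as input, and its output should imitate the real samples in the training set as closely as possible. The discriminator takes either a real sample or the generator's output as input, and its goal is to tell the generator's output apart from real samples as well as it can, while the generator tries to fool the discriminator. The two networks compete and keep adjusting their parameters, with the final goal that the discriminator can no longer tell whether the generator's output is real \[[2](#参考文献)\].
GANs are often used to generate realistic-looking images \[[3](#参考文献)\]. The method has also been used to generate videos, 3D object models, and more.
## Results
This tutorial trains the network on the MNIST dataset. After 19 epochs of training, the generated images are already very close to real ones. In the figure below, the first 8 rows are real images and the last 8 rows are images generated by the network: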
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/book/develop/09.gan/image/dcgan_demo.png" width="550"><br/>
Figure 1. Handwritten digits generated by the GAN
</p>
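## Model Overview
### GAN
As the name suggests, a GAN is a generative model that learns the data distribution through an adversarial process. Here, "adversarial" refers to the competition between the generator network (Generator) and the discriminator network (Discriminator). We use image generation as an example:
- The generator (G) receives random noise z and tries to produce an image that resembles the real samples, denoted G(z).
- The discriminator (D) receives an input image x and tries to determine whether it is a real sample or a fake one produced by the generator. Its output D(x) is the probability that x is a real image: D(x)=1 means the discriminator is certain the input is real, and D(x)=0 means it is certain the input is fake.
During training the two networks compete with each other and eventually reach a dynamic equilibrium. This process can be described by the following formula: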
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/book/develop/09.gan/image/01.gif?raw=true"><br/>
</p>
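In the ideal case, G generates images G(z) that closely resemble real samples, and D can hardly tell whether a generated image is real, so it is reduced to guessing at random, i.e. D(G(z))=0.5.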
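As a quick numeric check of the ideal case, the sketch below evaluates the standard GAN value function from the original paper, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], for a discriminator whose output is a constant probability. The helper name `gan_value` and the constant-output discriminator are assumptions made for illustration; they are not part of the tutorial code.

```python
import math


def gan_value(d_real, d_fake):
    """V(D, G) when D outputs d_real on every real image and d_fake on every generated one."""
    return math.log(d_real) + math.log(1.0 - d_fake)


# Ideal equilibrium: the discriminator guesses at random, D(x) = D(G(z)) = 0.5.
equilibrium = gan_value(0.5, 0.5)

# A discriminator that is confident and correct pushes the value towards its maximum of 0.
confident = gan_value(0.99, 0.01)

print(equilibrium)  # equals -log(4)
print(confident)
```

The discriminator maximizes this value and the generator minimizes it, so a value of -log(4) corresponds to the point where the discriminator has no advantage left.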
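The figure below illustrates GAN training. Suppose that at the start of training the real data distribution, the generated data distribution, and the discriminator are the black, green, and blue lines in the figure, respectively. At this point the discriminator cannot distinguish real samples from generated ones well. Next, when we fix the generator and optimize the discriminator, we obtain the result shown in the second panel: the discriminator can now separate generated data from real data reasonably well. The third step fixes the discriminator and improves the generator so that the discriminator can no longer tell generated images from real ones; in this process the distribution of generated images moves closer to the real distribution. These iterations continue until convergence, when the generated and real distributions coincide and the discriminator cannot distinguish real images from generated ones.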
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/book/develop/09.gan/image/process.png" width="550"><br/>
Figure 2. The GAN training process
</p>
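In practice, however, this perfect equilibrium is hard to reach, and the convergence theory of GANs is still an active research area.
### DCGAN
[DCGAN](https://arxiv.org/abs/1511.06434) \[[4](#参考文献)\] combines deep convolutional networks with GANs. Its basic principle is the same as that of a GAN, except that the generator and discriminator are replaced by two convolutional neural networks (CNNs). To improve the quality of generated samples and the speed of convergence, the DCGAN paper makes several changes to the network architecture:
- Remove pooling layers: all pooling layers are replaced with strided convolutions (in the discriminator) and fractional-strided convolutions (in the generator).
- Add batch normalization: batchnorm is used in both the generator and the discriminator.
- Use fully convolutional networks: FC layers are removed to allow deeper architectures.
- Activation functions: the generator (G) uses Tanh in the last layer and ReLU in all other layers; the discriminator (D) uses LeakyReLU in all layers.
The structure of the generator (G) in DCGAN is shown below: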
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/book/develop/09.gan/image/dcgan_g.png" width="550"><br/>
Figure 3. The generator (G) in DCGAN
</p>
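## Data Preparation
This tutorial trains the generator and discriminator on MNIST, a relatively small dataset, which the paddle.dataset module downloads automatically.
For a detailed introduction to MNIST, see [Recognize Digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits).
## Training the Model
`09.gan/dc_gan.py` demonstrates the overall training process.
### Loading Packages
First, import PaddlePaddle Fluid and the other required packages: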
```python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import os
import matplotlib
import PIL
import six
import numpy as np
import math
import time
import paddle
import paddle.fluid as fluid
matplotlib.use('agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
```
### Defining Helper Utilities
Define a plot function to visualize the image generation process:
```python
def plot(gen_data):
    pad_dim = 1
    paded = pad_dim + img_dim
    gen_data = gen_data.reshape(gen_data.shape[0], img_dim, img_dim)
    n = int(math.ceil(math.sqrt(gen_data.shape[0])))
    gen_data = (np.pad(
        gen_data, [[0, n * n - gen_data.shape[0]], [pad_dim, 0], [pad_dim, 0]],
        'constant').reshape((n, n, paded, paded)).transpose((0, 2, 1, 3))
                .reshape((n * paded, n * paded)))
    fig = plt.figure(figsize=(8, 8))
    plt.axis('off')
    plt.imshow(gen_data, cmap='Greys_r', vmin=-1, vmax=1)
    return fig
```
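### Defining Hyperparameters
```python
gf_dim = 64  # base number of feature-map channels in the generator; every feature map in the generator has a multiple of this many channels
df_dim = 64  # base number of feature-map channels in the discriminator; every feature map in the discriminator has a multiple of this many channels
gfc_dim = 1024 * 2  # size of the generator's fully connected layer
dfc_dim = 1024  # size of the discriminator's fully connected layer
img_dim = 28  # height and width of the input images
NOISE_SIZE = 100  # dimension of the input noise
LEARNING_RATE = 2e-4  # learning rate for training
epoch = 20  # number of training epochs
output = "./output_dcgan"  # directory where the model and test results are saved
use_cudnn = False  # whether to use cuDNN
use_gpu = False  # whether to train on GPU
```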
### Defining the Network Structure
- Batch normalization (bn) layer
The bn layer is implemented by calling `fluid.layers.batch_norm`; the activation function defaults to ReLU.
```python
def bn(x, name=None, act='relu'):
    return fluid.layers.batch_norm(
        x,
        param_attr=name + '1',
        bias_attr=name + '2',
        moving_mean_name=name + '3',
        moving_variance_name=name + '4',
        name=name,
        act=act)
```
- Convolution layer
The convolution-pooling group is implemented by calling `fluid.nets.simple_img_conv_pool`. The convolution kernel size is 5x5, the pooling window is 2x2 with a stride of 2, and the activation function is chosen by the network that uses the layer.
```python
def conv(x, num_filters, name=None, act=None):
    return fluid.nets.simple_img_conv_pool(
        input=x,
        filter_size=5,
        num_filters=num_filters,
        pool_size=2,
        pool_stride=2,
        param_attr=name + 'w',
        bias_attr=name + 'b',
        use_cudnn=use_cudnn,
        act=act)
```
- Fully connected layer
```python
def fc(x, num_filters, name=None, act=None):
    return fluid.layers.fc(input=x,
                           size=num_filters,
                           act=act,
                           param_attr=name + 'w',
                           bias_attr=name + 'b')
```
- Transposed convolution layer
The generator needs to produce a full-size image from randomly sampled values. DCGAN upsamples with transposed convolution layers, which in Fluid are implemented by calling `fluid.layers.conv2d_transpose`.
```python
def deconv(x,
           num_filters,
           name=None,
           filter_size=5,
           stride=2,
           dilation=1,
           padding=2,
           output_size=None,
           act=None):
    return fluid.layers.conv2d_transpose(
        input=x,
        param_attr=name + 'w',
        bias_attr=name + 'b',
        num_filters=num_filters,
        output_size=output_size,
        filter_size=filter_size,
        stride=stride,
        dilation=dilation,
        padding=padding,
        use_cudnn=use_cudnn,
        act=act)
```
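To see why the generator passes an explicit `output_size`, the following sketch applies the usual size formulas for a convolution and its transpose to the `deconv` defaults above (`filter_size=5`, `stride=2`, `padding=2`). The two helper functions are hypothetical and written only for this illustration; they compute sizes along a single spatial dimension and are not Fluid APIs.

```python
def conv_out_size(in_size, filter_size, stride, padding, dilation=1):
    """Output size of an ordinary convolution along one spatial dimension."""
    effective = dilation * (filter_size - 1) + 1
    return (in_size + 2 * padding - effective) // stride + 1


def deconv_min_out_size(in_size, filter_size, stride, padding, dilation=1):
    """Smallest output size of the matching transposed convolution."""
    effective = dilation * (filter_size - 1) + 1
    return (in_size - 1) * stride - 2 * padding + effective


# Transposed convolution with the tutorial's defaults.
first = deconv_min_out_size(7, 5, 2, 2)     # 7x7 feature map
second = deconv_min_out_size(14, 5, 2, 2)   # 14x14 feature map
print(first, second)  # 13 27

# With stride 2, two neighbouring input sizes of the forward convolution
# collapse to the same output size, so the transpose is ambiguous.
print(conv_out_size(13, 5, 2, 2), conv_out_size(14, 5, 2, 2))  # 7 7
print(conv_out_size(27, 5, 2, 2), conv_out_size(28, 5, 2, 2))  # 14 14
```

Because both 13 and 14 are consistent with a 7x7 input (and both 27 and 28 with a 14x14 input), a size such as 14 or 28 has to be requested explicitly rather than left to the formula's minimum.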
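- Discriminator
The discriminator is trained on both real data and fake images produced by the generator. During training it should output values as close to 1 as possible for real data and as close to 0 as possible for generated images. The discriminator implemented in this tutorial consists of two convolution-pooling layers and two fully connected layers; the last fully connected layer has a single neuron and outputs a binary classification result.
```python
def D(x):
    x = fluid.layers.reshape(x=x, shape=[-1, 1, 28, 28])
    x = conv(x, df_dim, act='leaky_relu', name='conv1')
    x = bn(conv(x, df_dim * 2, name='conv2'), act='leaky_relu', name='bn1')
    x = bn(fc(x, dfc_dim, name='fc1'), act='leaky_relu', name='bn2')
    x = fc(x, 1, act='sigmoid', name='fc2')
    return x
```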
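The feature-map sizes inside the discriminator can be traced with a few lines of arithmetic. This is a rough sketch that assumes the convolution inside the conv-pool group uses stride 1 and no padding and that the pooling uses no padding (the tutorial does not set these values, so they are assumptions here); `conv_pool_out` is a hypothetical helper, not a Fluid function.

```python
def conv_pool_out(size, filter_size=5, pool_size=2, pool_stride=2):
    """Spatial size after one convolution (stride 1, no padding) and one pooling step."""
    conv_size = size - filter_size + 1
    return (conv_size - pool_size) // pool_stride + 1


df_dim = 64

h1 = conv_pool_out(28)  # after conv1: 28 -> 24 -> 12
h2 = conv_pool_out(h1)  # after conv2: 12 -> 8 -> 4

# Number of features fed into fc1 after the second conv-pool group.
flat_features = (df_dim * 2) * h2 * h2

print(h1, h2, flat_features)  # 12 4 2048
```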
- Generator
The generator consists of two fully connected layers with batch normalization followed by two transposed convolution layers. Its input is random noise, and the last transposed convolution has one filter, so the output is a grayscale image.
```python
def G(x):
    x = bn(fc(x, gfc_dim, name='fc3'), name='bn3')
    x = bn(fc(x, gf_dim * 2 * img_dim // 4 * img_dim // 4, name='fc4'), name='bn4')
    x = fluid.layers.reshape(x, [-1, gf_dim * 2, img_dim // 4, img_dim // 4])
    x = deconv(x, gf_dim * 2, act='relu', output_size=[14, 14], name='deconv1')
    x = deconv(x, num_filters=1, filter_size=5, padding=2, act='tanh', output_size=[28, 28], name='deconv2')
    x = fluid.layers.reshape(x, shape=[-1, 28 * 28])
    return x
```
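The size of `fc4` has to match the number of elements expected by the reshape that follows it. The short check below evaluates both expressions from `G` with the tutorial's hyperparameters; the function names are hypothetical and exist only for this illustration. Since `*` and `//` are evaluated left to right, the two expressions agree when `img_dim` is a multiple of 4, as it is for 28.

```python
def fc4_size(gf_dim, img_dim):
    """Size passed to the second fully connected layer, written exactly as in G."""
    return gf_dim * 2 * img_dim // 4 * img_dim // 4


def reshape_size(gf_dim, img_dim):
    """Number of elements per sample after reshaping to [gf_dim * 2, img_dim // 4, img_dim // 4]."""
    return (gf_dim * 2) * (img_dim // 4) * (img_dim // 4)


print(fc4_size(64, 28), reshape_size(64, 28))  # 6272 6272

# For an image size that is not a multiple of 4 the two expressions differ.
print(fc4_size(64, 30), reshape_size(64, 30))  # 7200 6272
```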
### Loss Function
The loss function is `sigmoid_cross_entropy_with_logits`:
```python
def loss(x, label):
    return fluid.layers.mean(
        fluid.layers.sigmoid_cross_entropy_with_logits(x=x, label=label))
```
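For reference, sigmoid cross entropy on a raw score x with label z is -z·log(σ(x)) - (1 - z)·log(1 - σ(x)), which is commonly evaluated in the numerically stable form max(x, 0) - x·z + log(1 + exp(-|x|)). The pure-Python sketch below compares the two forms on a few values; the function names are hypothetical and the code illustrates the formula only, not Fluid's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def naive_sigmoid_ce(x, label):
    """Direct definition: cross entropy between label and sigmoid(x)."""
    p = sigmoid(x)
    return -label * math.log(p) - (1.0 - label) * math.log(1.0 - p)


def stable_sigmoid_ce(x, label):
    """Equivalent form that avoids taking the log of values close to 0."""
    return max(x, 0.0) - x * label + math.log1p(math.exp(-abs(x)))


for x in (-3.0, -0.5, 0.0, 2.0):
    for label in (0.0, 1.0):
        print(x, label, naive_sigmoid_ce(x, label), stable_sigmoid_ce(x, label))
```

At x = 0 the loss is log(2) for either label, which corresponds to a discriminator output of exactly 0.5.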
### Creating the Programs
```python
d_program = fluid.Program()
dg_program = fluid.Program()
# Define the program for discriminating real images
with fluid.program_guard(d_program):
    # The input image has 28*28=784 values
    img = fluid.layers.data(name='img', shape=[784], dtype='float32')
    # The label has shape 1
    label = fluid.layers.data(name='label', shape=[1], dtype='float32')
    d_logit = D(img)
    d_loss = loss(d_logit, label)
# Define the program for discriminating generated images
with fluid.program_guard(dg_program):
    noise = fluid.layers.data(
        name='noise', shape=[NOISE_SIZE], dtype='float32')
    # Generate images from the input noise
    g_img = G(x=noise)
    g_program = dg_program.clone()
    g_program_test = dg_program.clone(for_test=True)
    # Probability that the generated image is judged to be a real sample
    dg_logit = D(g_img)
    # Loss for the generated image being classified as a real sample
    dg_loss = loss(
        dg_logit,
        fluid.layers.fill_constant_batch_size_like(
            input=noise, dtype='float32', shape=[-1, 1], value=1.0))
```
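Adam is used as the optimizer, minimizing the discriminator loss `d_loss` and the generator loss `dg_loss` separately. The second `minimize` call passes `parameter_list`, so that it only updates the parameters of the generator.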
```python
opt = fluid.optimizer.Adam(learning_rate=LEARNING_RATE)
opt.minimize(loss=d_loss)
parameters = [p.name for p in g_program.global_block().all_parameters()]
opt.minimize(loss=dg_loss, parameter_list=parameters)
```
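### Configuring the Dataset Feeders
Next, we start the training process. paddle.dataset.mnist.train() serves as the training dataset. This function returns a reader. In PaddlePaddle, a reader is a Python function that returns a Python yield generator each time it is called.
shuffle, used below, is a reader decorator: it takes a reader A and returns another reader B. Reader B reads `buf_size` training samples into a buffer at a time, shuffles them randomly, and then outputs them one by one.
batch is a special decorator whose input is a reader and whose output is a batched reader. In PaddlePaddle, a reader yields one training sample at a time, whereas a batched reader yields one minibatch at a time.
```python
batch_size = 128  # Minibatch size
train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.mnist.train(), buf_size=60000),
    batch_size=batch_size)
```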
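The reader, shuffle, and batch concepts described above can be mimicked in plain Python. The sketch below is a simplified stand-in written for illustration, not PaddlePaddle's implementation: `toy_reader`, the `seed` argument, and the local `shuffle` and `batch` functions are all hypothetical.

```python
import random


def toy_reader():
    """A reader: calling it returns a generator that yields one sample at a time."""
    for i in range(10):
        yield i


def shuffle(reader, buf_size, seed=0):
    """Reader decorator: buffer buf_size samples, shuffle them, yield them one by one."""
    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        if buf:  # flush the partially filled last buffer
            rng.shuffle(buf)
            for item in buf:
                yield item
    return shuffled_reader


def batch(reader, batch_size):
    """Decorator that turns a reader into a batched reader yielding lists of samples."""
    def batched_reader():
        minibatch = []
        for sample in reader():
            minibatch.append(sample)
            if len(minibatch) == batch_size:
                yield minibatch
                minibatch = []
        if minibatch:  # the last minibatch may be smaller than batch_size
            yield minibatch
    return batched_reader


train_reader = batch(shuffle(toy_reader, buf_size=4), batch_size=3)
batches = list(train_reader())
print([len(b) for b in batches])  # [3, 3, 3, 1]
```

The smaller final minibatch in this toy example is the kind of batch that the training loop skips with its `len(data) != batch_size` check.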
### Creating the Executor
```python
if use_gpu:
    exe = fluid.Executor(fluid.CUDAPlace(0))
else:
    exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
```
### Training
In each training iteration, the generator and the discriminator can be updated a different number of times. To keep the discriminator's loss from quickly converging to 0, this tutorial by default trains the discriminator once and the generator twice per iteration.
```python
t_time = 0
losses = [[], []]
# Number of generator updates per iteration
NUM_TRAIN_TIMES_OF_DG = 2
# Fixed noise used to generate the preview images
const_n = np.random.uniform(
    low=-1.0, high=1.0,
    size=[batch_size, NOISE_SIZE]).astype('float32')
for pass_id in range(epoch):
    for batch_id, data in enumerate(train_reader()):
        if len(data) != batch_size:
            continue
        # Noise for this training step
        noise_data = np.random.uniform(
            low=-1.0, high=1.0,
            size=[batch_size, NOISE_SIZE]).astype('float32')
        # Real images
        real_image = np.array(list(map(lambda x: x[0], data))).reshape(
            -1, 784).astype('float32')
        # Labels for real images
        real_labels = np.ones(
            shape=[real_image.shape[0], 1], dtype='float32')
        # Labels for fake images
        fake_labels = np.zeros(
            shape=[real_image.shape[0], 1], dtype='float32')
        total_label = np.concatenate([real_labels, fake_labels])
        s_time = time.time()
        # Fake images
        generated_image = exe.run(g_program,
                                  feed={'noise': noise_data},
                                  fetch_list=[g_img])[0]
        total_images = np.concatenate([real_image, generated_image])
        # Loss of D for classifying fake images as fake
        d_loss_1 = exe.run(d_program,
                           feed={
                               'img': generated_image,
                               'label': fake_labels,
                           },
                           fetch_list=[d_loss])[0][0]
        # Loss of D for classifying real images as real
        d_loss_2 = exe.run(d_program,
                           feed={
                               'img': real_image,
                               'label': real_labels,
                           },
                           fetch_list=[d_loss])[0][0]
        d_loss_n = d_loss_1 + d_loss_2
        losses[0].append(d_loss_n)
        # Train the generator
        for _ in six.moves.xrange(NUM_TRAIN_TIMES_OF_DG):
            noise_data = np.random.uniform(
                low=-1.0, high=1.0,
                size=[batch_size, NOISE_SIZE]).astype('float32')
            dg_loss_n = exe.run(dg_program,
                                feed={'noise': noise_data},
                                fetch_list=[dg_loss])[0][0]
            losses[1].append(dg_loss_n)
        t_time += (time.time() - s_time)
        if batch_id % 10 == 0:
            if not os.path.exists(output):
                os.makedirs(output)
            # Images generated from the fixed noise at this point in training
            generated_images = exe.run(g_program_test,
                                       feed={'noise': const_n},
                                       fetch_list=[g_img])[0]
            # Concatenate the real and generated images
            total_images = np.concatenate([real_image, generated_images])
            fig = plot(total_images)
            msg = "Epoch ID={0} Batch ID={1} D-Loss={2} DG-Loss={3}\n ".format(
                pass_id, batch_id,
                d_loss_n, dg_loss_n)
            print(msg)
            plt.title(msg)
            plt.savefig(
                '{}/{:04d}_{:04d}.png'.format(output, pass_id,
                                              batch_id),
                bbox_inches='tight')
            plt.close(fig)
```
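The loop above skips incomplete minibatches and saves a preview image every 10 batches, so the number of steps and saved files per epoch follows from simple arithmetic. The sketch below works this out, assuming the MNIST training set has 60000 images and using the tutorial's `batch_size` and `NUM_TRAIN_TIMES_OF_DG`; the other variable names are introduced only for this illustration.

```python
num_train_images = 60000   # assumed size of the MNIST training set
batch_size = 128
NUM_TRAIN_TIMES_OF_DG = 2

# Only full minibatches are used; the incomplete last one is skipped.
full_batches, leftover = divmod(num_train_images, batch_size)

# Per epoch: two discriminator runs per batch (fake, then real) and
# NUM_TRAIN_TIMES_OF_DG generator runs per batch.
d_runs_per_epoch = 2 * full_batches
g_runs_per_epoch = NUM_TRAIN_TIMES_OF_DG * full_batches

# Batch ids at which a preview image is saved (batch_id % 10 == 0).
snapshot_ids = list(range(0, full_batches, 10))

print(full_batches, leftover)               # 468 96
print(d_runs_per_epoch, g_runs_per_epoch)   # 936 936
print(len(snapshot_ids), snapshot_ids[-1])  # 47 460
```

Under these assumptions, batch 460 is the last preview saved in each epoch.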
Display the image generated at a specific epoch and batch:
```python
def display_image(epoch_no, batch_id):
    return PIL.Image.open('output_dcgan/{:04d}_{:04d}.png'.format(epoch_no, batch_id))
# View the image generated at epoch 10, batch 460:
display_image(10, 460)
```
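## Summary
DCGAN takes a random noise vector as input and passes it through a structure that resembles a CNN in reverse, upscaling the input into two-dimensional data. With a generator of this form and a CNN discriminator, DCGAN achieves quite impressive results in image generation. In this example we used DCGAN to generate handwritten digit images; you can try a different dataset to generate images that suit your needs, or modify the network structure to see how the results change.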
<a name="参考文献"></a>
## References
[1] Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua. Generative Adversarial Networks. 2014. arXiv:1406.2661 [stat.ML].
[2] Andrej Karpathy, Pieter Abbeel, Greg Brockman, Peter Chen, Vicki Cheung, Rocky Duan, Ian Goodfellow, Durk Kingma, Jonathan Ho, Rein Houthooft, Tim Salimans, John Schulman, Ilya Sutskever, and Wojciech Zaremba. Generative Models. OpenAI, April 7, 2016.
[3] Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi. Improved Techniques for Training GANs. 2016. arXiv:1606.03498 [cs.LG].
[4] Radford A, Metz L, Chintala S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks[J]. Computer Science, 2015.
### This file is only used for continuous evaluation test!
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi
dcgan_d_train_cost_kpi = CostKpi(
'dcgan_d_train_cost',
0.02,
0,
actived=True,
desc='train cost of discriminator')
dcgan_g_train_cost_kpi = CostKpi(
'dcgan_g_train_cost', 0.02, 0, actived=True, desc='train cost of generator')
tracking_kpis = [dcgan_d_train_cost_kpi, dcgan_g_train_cost_kpi]
def parse_log(log):
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        print(fs)
        if len(fs) == 3 and fs[0] == 'kpis':
            kpi_name = fs[1]
            kpi_value = float(fs[2])
            yield kpi_name, kpi_value
def log_to_ce(log):
    kpi_tracker = {}
    for kpi in tracking_kpis:
        kpi_tracker[kpi.name] = kpi
        print(kpi.name)
        print(kpi)
    for (kpi_name, kpi_value) in parse_log(log):
        print(kpi_name, kpi_value)
        kpi_tracker[kpi_name].add_record(kpi_value)
        kpi_tracker[kpi_name].persist()
if __name__ == '__main__':
    log = sys.stdin.read()
    log_to_ce(log)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import os
import argparse
import functools
import matplotlib
import six
import numpy as np
import paddle
import time
import paddle.fluid as fluid
from utility import get_parent_function_name, plot, check, add_arguments, print_arguments
from network import G, D
matplotlib.use('agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
NOISE_SIZE = 100
LEARNING_RATE = 2e-4
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 128, "Minibatch size.")
add_arg('epoch', int, 20, "The number of epochs to be trained.")
add_arg('output', str, "./output_dcgan", "The directory the model and the test result to be saved to.")
add_arg('use_gpu', bool, True, "Whether to use GPU to train.")
add_arg('enable_ce', bool, False, "If set True, enable continuous evaluation job.")
# yapf: enable
def loss(x, label):
    return fluid.layers.mean(
        fluid.layers.sigmoid_cross_entropy_with_logits(x=x, label=label))
def train(args):
    if args.enable_ce:
        np.random.seed(10)
        fluid.default_startup_program().random_seed = 90
    d_program = fluid.Program()
    dg_program = fluid.Program()
    with fluid.program_guard(d_program):
        img = fluid.layers.data(name='img', shape=[784], dtype='float32')
        label = fluid.layers.data(name='label', shape=[1], dtype='float32')
        d_logit = D(img)
        d_loss = loss(d_logit, label)
    with fluid.program_guard(dg_program):
        noise = fluid.layers.data(
            name='noise', shape=[NOISE_SIZE], dtype='float32')
        g_img = G(x=noise)
        g_program = dg_program.clone()
        g_program_test = dg_program.clone(for_test=True)
        dg_logit = D(g_img)
        dg_loss = loss(dg_logit,
                       fluid.layers.fill_constant_batch_size_like(
                           input=noise,
                           dtype='float32',
                           shape=[-1, 1],
                           value=1.0))
    opt = fluid.optimizer.Adam(learning_rate=LEARNING_RATE)
    opt.minimize(loss=d_loss)
    parameters = [p.name for p in g_program.global_block().all_parameters()]
    opt.minimize(loss=dg_loss, parameter_list=parameters)
    exe = fluid.Executor(fluid.CPUPlace())
    if args.use_gpu:
        exe = fluid.Executor(fluid.CUDAPlace(0))
    exe.run(fluid.default_startup_program())
    if args.enable_ce:
        train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=args.batch_size)
    else:
        train_reader = paddle.batch(
            paddle.reader.shuffle(paddle.dataset.mnist.train(), buf_size=60000),
            batch_size=args.batch_size)
    NUM_TRAIN_TIMES_OF_DG = 2
    const_n = np.random.uniform(
        low=-1.0, high=1.0,
        size=[args.batch_size, NOISE_SIZE]).astype('float32')
    t_time = 0
    losses = [[], []]
    for pass_id in range(args.epoch):
        for batch_id, data in enumerate(train_reader()):
            if len(data) != args.batch_size:
                continue
            noise_data = np.random.uniform(
                low=-1.0, high=1.0,
                size=[args.batch_size, NOISE_SIZE]).astype('float32')
            real_image = np.array(list(map(lambda x: x[0], data))).reshape(
                -1, 784).astype('float32')
            real_labels = np.ones(
                shape=[real_image.shape[0], 1], dtype='float32')
            fake_labels = np.zeros(
                shape=[real_image.shape[0], 1], dtype='float32')
            total_label = np.concatenate([real_labels, fake_labels])
            s_time = time.time()
            generated_image = exe.run(
                g_program, feed={'noise': noise_data}, fetch_list=[g_img])[0]
            total_images = np.concatenate([real_image, generated_image])
            d_loss_1 = exe.run(
                d_program,
                feed={
                    'img': generated_image,
                    'label': fake_labels,
                },
                fetch_list=[d_loss])[0][0]
            d_loss_2 = exe.run(
                d_program,
                feed={
                    'img': real_image,
                    'label': real_labels,
                },
                fetch_list=[d_loss])[0][0]
            d_loss_n = d_loss_1 + d_loss_2
            losses[0].append(d_loss_n)
            for _ in six.moves.xrange(NUM_TRAIN_TIMES_OF_DG):
                noise_data = np.random.uniform(
                    low=-1.0, high=1.0,
                    size=[args.batch_size, NOISE_SIZE]).astype('float32')
                dg_loss_n = exe.run(
                    dg_program,
                    feed={'noise': noise_data},
                    fetch_list=[dg_loss])[0][0]
                losses[1].append(dg_loss_n)
            t_time += (time.time() - s_time)
            if batch_id % 10 == 0:
                if not os.path.exists(args.output):
                    os.makedirs(args.output)
                # generate image each batch
                generated_images = exe.run(
                    g_program_test, feed={'noise': const_n},
                    fetch_list=[g_img])[0]
                total_images = np.concatenate([real_image, generated_images])
                fig = plot(total_images)
                msg = "Epoch ID={0} Batch ID={1} D-Loss={2} DG-Loss={3}\n gen={4}".format(
                    pass_id, batch_id, d_loss_n, dg_loss_n,
                    check(generated_images))
                print(msg)
                plt.title(msg)
                plt.savefig(
                    '{}/{:04d}_{:04d}.png'.format(args.output, pass_id,
                                                  batch_id),
                    bbox_inches='tight')
                plt.close(fig)
        if args.enable_ce and pass_id == args.epoch - 1:
            print("kpis\tdcgan_d_train_cost\t%f" % np.mean(losses[0]))
            print("kpis\tdcgan_g_train_cost\t%f" % np.mean(losses[1]))
if __name__ == "__main__":
    args = parser.parse_args()
    print_arguments(args)
    train(args)
<html>
<head>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'] ],
displayMath: [ ['$$','$$'] ],
processEscapes: true
},
"HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js" async></script>
<script type="text/javascript" src="../.tools/theme/marked.js">
</script>
<link href="http://cdn.bootcss.com/highlight.js/9.9.0/styles/darcula.min.css" rel="stylesheet">
<script src="http://cdn.bootcss.com/highlight.js/9.9.0/highlight.min.js"></script>
<link href="http://cdn.bootcss.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" rel="stylesheet">
<link href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" rel="stylesheet">
<link href="../.tools/theme/github-markdown.css" rel='stylesheet'>
</head>
<style type="text/css" >
.markdown-body {
box-sizing: border-box;
min-width: 200px;
max-width: 980px;
margin: 0 auto;
padding: 45px;
}
</style>
<body>
<div id="context" class="container-fluid markdown-body">
</div>
<!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'>
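# Generative Adversarial Networks
The source code for this tutorial is in book/09.gan. If this is your first time using the Book, please refer to the Book documentation usage instructions.
### Notes ###
1. Hardware requirements:
This tutorial can run on either CPU or GPU.
2. CUDA/cuDNN versions supported by the Docker image:
If you run the Book with Docker, note that the default image provides a GPU environment with CUDA 8/cuDNN 5. On GPUs that require CUDA 9, such as the NVIDIA Tesla V100, this image may fail to run.
3. Consistency between the code in this document and the script:
Note: to make this tutorial easier to read and use, we split up and adjusted the code of dc_gan.py before placing it here. The code in this document produces the same results as dc_gan.py, and you can run [dc_gan.py](https://github.com/PaddlePaddle/book/blob/develop/09.gan/dc_gan.py) directly to verify this.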
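## Background
Generative adversarial networks (GANs) \[[1](#参考文献)\] are an unsupervised learning method in which two neural networks learn by competing against each other. The method was first proposed by Ian Goodfellow et al. in 2014; see the original paper, [Generative Adversarial Network](https://arxiv.org/abs/1406.2661).
A GAN consists of a generator network and a discriminator network. The generator takes random samples from a latent space as input, and its output should imitate the real samples in the training set as closely as possible. The discriminator takes either a real sample or the generator's output as input, and its goal is to tell the generator's output apart from real samples as well as it can, while the generator tries to fool the discriminator. The two networks compete and keep adjusting their parameters, with the final goal that the discriminator can no longer tell whether the generator's output is real \[[2](#参考文献)\].
GANs are often used to generate realistic-looking images \[[3](#参考文献)\]. The method has also been used to generate videos, 3D object models, and more.
## Results
This tutorial trains the network on the MNIST dataset. After 19 epochs of training, the generated images are already very close to real ones. In the figure below, the first 8 rows are real images and the last 8 rows are images generated by the network: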
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/book/develop/09.gan/image/dcgan_demo.png" width="550"><br/>
Figure 1. Handwritten digits generated by the GAN
</p>
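## Model Overview
### GAN
As the name suggests, a GAN is a generative model that learns the data distribution through an adversarial process. Here, "adversarial" refers to the competition between the generator network (Generator) and the discriminator network (Discriminator). We use image generation as an example:
- The generator (G) receives random noise z and tries to produce an image that resembles the real samples, denoted G(z).
- The discriminator (D) receives an input image x and tries to determine whether it is a real sample or a fake one produced by the generator. Its output D(x) is the probability that x is a real image: D(x)=1 means the discriminator is certain the input is real, and D(x)=0 means it is certain the input is fake.
During training the two networks compete with each other and eventually reach a dynamic equilibrium. This process can be described by the following formula: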
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/book/develop/09.gan/image/01.gif?raw=true"><br/>
</p>
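In the ideal case, G generates images G(z) that closely resemble real samples, and D can hardly tell whether a generated image is real, so it is reduced to guessing at random, i.e. D(G(z))=0.5.
The figure below illustrates GAN training. Suppose that at the start of training the real data distribution, the generated data distribution, and the discriminator are the black, green, and blue lines in the figure, respectively. At this point the discriminator cannot distinguish real samples from generated ones well. Next, when we fix the generator and optimize the discriminator, we obtain the result shown in the second panel: the discriminator can now separate generated data from real data reasonably well. The third step fixes the discriminator and improves the generator so that the discriminator can no longer tell generated images from real ones; in this process the distribution of generated images moves closer to the real distribution. These iterations continue until convergence, when the generated and real distributions coincide and the discriminator cannot distinguish real images from generated ones.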
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/book/develop/09.gan/image/process.png" width="550"><br/>
Figure 2. The GAN training process
</p>
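In practice, however, this perfect equilibrium is hard to reach, and the convergence theory of GANs is still an active research area.
### DCGAN
[DCGAN](https://arxiv.org/abs/1511.06434) \[[4](#参考文献)\] combines deep convolutional networks with GANs. Its basic principle is the same as that of a GAN, except that the generator and discriminator are replaced by two convolutional neural networks (CNNs). To improve the quality of generated samples and the speed of convergence, the DCGAN paper makes several changes to the network architecture:
- Remove pooling layers: all pooling layers are replaced with strided convolutions (in the discriminator) and fractional-strided convolutions (in the generator).
- Add batch normalization: batchnorm is used in both the generator and the discriminator.
- Use fully convolutional networks: FC layers are removed to allow deeper architectures.
- Activation functions: the generator (G) uses Tanh in the last layer and ReLU in all other layers; the discriminator (D) uses LeakyReLU in all layers.
The structure of the generator (G) in DCGAN is shown below: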
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/book/develop/09.gan/image/dcgan_g.png" width="550"><br/>
Figure 3. The generator (G) in DCGAN
</p>
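## Data Preparation
This tutorial trains the generator and discriminator on MNIST, a relatively small dataset, which the paddle.dataset module downloads automatically.
For a detailed introduction to MNIST, see [Recognize Digits](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits).
## Training the Model
`09.gan/dc_gan.py` demonstrates the overall training process.
### Loading Packages
First, import PaddlePaddle Fluid and the other required packages: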
```python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import os
import matplotlib
import PIL
import six
import numpy as np
import math
import time
import paddle
import paddle.fluid as fluid
matplotlib.use('agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
```
### Defining Helper Utilities
Define a plot function to visualize the image generation process:
```python
def plot(gen_data):
    pad_dim = 1
    paded = pad_dim + img_dim
    gen_data = gen_data.reshape(gen_data.shape[0], img_dim, img_dim)
    n = int(math.ceil(math.sqrt(gen_data.shape[0])))
    gen_data = (np.pad(
        gen_data, [[0, n * n - gen_data.shape[0]], [pad_dim, 0], [pad_dim, 0]],
        'constant').reshape((n, n, paded, paded)).transpose((0, 2, 1, 3))
                .reshape((n * paded, n * paded)))
    fig = plt.figure(figsize=(8, 8))
    plt.axis('off')
    plt.imshow(gen_data, cmap='Greys_r', vmin=-1, vmax=1)
    return fig
```
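### Defining Hyperparameters
```python
gf_dim = 64  # base number of feature-map channels in the generator; every feature map in the generator has a multiple of this many channels
df_dim = 64  # base number of feature-map channels in the discriminator; every feature map in the discriminator has a multiple of this many channels
gfc_dim = 1024 * 2  # size of the generator's fully connected layer
dfc_dim = 1024  # size of the discriminator's fully connected layer
img_dim = 28  # height and width of the input images
NOISE_SIZE = 100  # dimension of the input noise
LEARNING_RATE = 2e-4  # learning rate for training
epoch = 20  # number of training epochs
output = "./output_dcgan"  # directory where the model and test results are saved
use_cudnn = False  # whether to use cuDNN
use_gpu = False  # whether to train on GPU
```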
### Defining the Network Structure
- Batch normalization (bn) layer
The bn layer is implemented by calling `fluid.layers.batch_norm`; the activation function defaults to ReLU.
```python
def bn(x, name=None, act='relu'):
    return fluid.layers.batch_norm(
        x,
        param_attr=name + '1',
        bias_attr=name + '2',
        moving_mean_name=name + '3',
        moving_variance_name=name + '4',
        name=name,
        act=act)
```
- Convolution layer
The convolution-pooling group is implemented by calling `fluid.nets.simple_img_conv_pool`. The convolution kernel size is 5x5, the pooling window is 2x2 with a stride of 2, and the activation function is chosen by the network that uses the layer.
```python
def conv(x, num_filters, name=None, act=None):
    return fluid.nets.simple_img_conv_pool(
        input=x,
        filter_size=5,
        num_filters=num_filters,
        pool_size=2,
        pool_stride=2,
        param_attr=name + 'w',
        bias_attr=name + 'b',
        use_cudnn=use_cudnn,
        act=act)
```
- Fully connected layer
```python
def fc(x, num_filters, name=None, act=None):
    return fluid.layers.fc(input=x,
                           size=num_filters,
                           act=act,
                           param_attr=name + 'w',
                           bias_attr=name + 'b')
```
- Transposed convolution layer
The generator needs to produce a full-size image from randomly sampled values. DCGAN upsamples with transposed convolution layers, which in Fluid are implemented by calling `fluid.layers.conv2d_transpose`.
```python
def deconv(x,
           num_filters,
           name=None,
           filter_size=5,
           stride=2,
           dilation=1,
           padding=2,
           output_size=None,
           act=None):
    return fluid.layers.conv2d_transpose(
        input=x,
        param_attr=name + 'w',
        bias_attr=name + 'b',
        num_filters=num_filters,
        output_size=output_size,
        filter_size=filter_size,
        stride=stride,
        dilation=dilation,
        padding=padding,
        use_cudnn=use_cudnn,
        act=act)
```
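- Discriminator
The discriminator is trained on both real data and fake images produced by the generator. During training it should output values as close to 1 as possible for real data and as close to 0 as possible for generated images. The discriminator implemented in this tutorial consists of two convolution-pooling layers and two fully connected layers; the last fully connected layer has a single neuron and outputs a binary classification result.
```python
def D(x):
    x = fluid.layers.reshape(x=x, shape=[-1, 1, 28, 28])
    x = conv(x, df_dim, act='leaky_relu', name='conv1')
    x = bn(conv(x, df_dim * 2, name='conv2'), act='leaky_relu', name='bn1')
    x = bn(fc(x, dfc_dim, name='fc1'), act='leaky_relu', name='bn2')
    x = fc(x, 1, act='sigmoid', name='fc2')
    return x
```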
- Generator
The generator consists of two fully connected layers with BN followed by two transposed convolution layers. Its input is random noise, and the last transposed convolution has 1 filter, so the output is a grayscale image.
```python
def G(x):
    x = bn(fc(x, gfc_dim, name='fc3'), name='bn3')
    x = bn(fc(x, gf_dim * 2 * img_dim // 4 * img_dim // 4, name='fc4'), name='bn4')
    x = fluid.layers.reshape(x, [-1, gf_dim * 2, img_dim // 4, img_dim // 4])
    x = deconv(x, gf_dim * 2, act='relu', output_size=[14, 14], name='deconv1')
    x = deconv(x, num_filters=1, filter_size=5, padding=2, act='tanh', output_size=[28, 28], name='deconv2')
    x = fluid.layers.reshape(x, shape=[-1, 28 * 28])
    return x
```
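The size of `fc4` is written as `gf_dim * 2 * img_dim // 4 * img_dim // 4`, which Python evaluates strictly left to right, while the following reshape uses `img_dim // 4` as a unit. The short check below (plain Python, using the constants `gf_dim = 64` and `img_dim = 28` from the network file) illustrates that the two agree for this image size but not in general. The variable names are only for illustration.

```python
gf_dim = 64
img_dim = 28

# Size passed to fc4: evaluated left to right as (((64*2)*28)//4*28)//4.
fc_size = gf_dim * 2 * img_dim // 4 * img_dim // 4
# Number of elements required by the reshape that follows fc4.
reshape_size = (gf_dim * 2) * (img_dim // 4) * (img_dim // 4)
print(fc_size, reshape_size)

# With an image size that is not a multiple of 4 the two expressions differ.
other_dim = 30
other_fc = gf_dim * 2 * other_dim // 4 * other_dim // 4
other_reshape = (gf_dim * 2) * (other_dim // 4) * (other_dim // 4)
print(other_fc, other_reshape)
```

If you change `img_dim`, writing the size as `gf_dim * 2 * (img_dim // 4) * (img_dim // 4)` keeps the layer and the reshape consistent.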
### Loss function
The loss function is `sigmoid_cross_entropy_with_logits`:
```python
def loss(x, label):
    return fluid.layers.mean(
        fluid.layers.sigmoid_cross_entropy_with_logits(x=x, label=label))
```
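For readers unfamiliar with this loss, the following pure-Python sketch shows the per-element quantity it computes. It assumes the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` for logit `x` and label `z`, which is how the operator is commonly documented; the function names `stable_sigmoid_ce` and `naive_sigmoid_ce` are hypothetical and the code does not call Fluid.

```python
import math


def stable_sigmoid_ce(x, z):
    """Sigmoid cross entropy of logit x against label z, stable for large |x|."""
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))


def naive_sigmoid_ce(x, z):
    """Textbook definition: -[z*log(p) + (1-z)*log(1-p)] with p = sigmoid(x)."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(z * math.log(p) + (1.0 - z) * math.log(1.0 - p))


# The two forms agree for moderate logits.
for x, z in [(0.0, 1.0), (2.0, 0.0), (-3.0, 1.0)]:
    print(x, z, stable_sigmoid_ce(x, z), naive_sigmoid_ce(x, z))

# The stable form still works where the naive one would overflow in exp().
print(stable_sigmoid_ce(-1000.0, 1.0))
```

The `loss` function above then averages these per-element values over the batch with `fluid.layers.mean`.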
### Create the Programs
```python
d_program = fluid.Program()
dg_program = fluid.Program()

# Define the program that discriminates real images
with fluid.program_guard(d_program):
    # Each input image has 28*28=784 values
    img = fluid.layers.data(name='img', shape=[784], dtype='float32')
    # The label has shape 1
    label = fluid.layers.data(name='label', shape=[1], dtype='float32')
    d_logit = D(img)
    d_loss = loss(d_logit, label)

# Define the program that discriminates generated images
with fluid.program_guard(dg_program):
    noise = fluid.layers.data(
        name='noise', shape=[NOISE_SIZE], dtype='float32')
    # Feed the noise to the generator to obtain generated images
    g_img = G(x=noise)

    g_program = dg_program.clone()
    g_program_test = dg_program.clone(for_test=True)

    # Probability that the generated image is judged to be a real sample
    dg_logit = D(g_img)
    # Loss for the generated image being judged as a real sample
    dg_loss = loss(
        dg_logit,
        fluid.layers.fill_constant_batch_size_like(
            input=noise, dtype='float32', shape=[-1, 1], value=1.0))
```
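Adam is used as the optimizer. It minimizes the loss on real images and the loss on generated images separately; the second call updates only the generator's parameters.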
```python
opt = fluid.optimizer.Adam(learning_rate=LEARNING_RATE)
opt.minimize(loss=d_loss)
parameters = [p.name for p in g_program.global_block().all_parameters()]
opt.minimize(loss=dg_loss, parameter_list=parameters)
```
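### Dataset feeder configuration
Next, we start the training process. `paddle.dataset.mnist.train()` serves as the training dataset. This function returns a reader. In PaddlePaddle, a reader is a Python function that returns a Python yield generator each time it is called.
`shuffle` below is a reader decorator: it takes a reader A and returns another reader B. Reader B reads buffer_size training samples into a buffer at a time, shuffles them, and yields them one by one.
`batch` is a special decorator whose input is a reader and whose output is a batched reader. In PaddlePaddle, a reader yields one training sample at a time, while a batched reader yields one minibatch at a time.
```python
batch_size = 128   # Minibatch size

train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.mnist.train(), buf_size=60000),
    batch_size=batch_size)
```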
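The reader, shuffle and batch concepts described above can be imitated in a few lines of plain Python. This is only an illustrative sketch of the idea, not PaddlePaddle's implementation: `shuffle`, `batch` and `toy_reader` here are local stand-ins defined in the snippet, and whether the real `paddle.batch` keeps or drops a trailing partial batch depends on its own options.

```python
import random


def shuffle(reader, buf_size):
    """Decorator: buffer up to buf_size samples, shuffle, yield one by one."""
    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        random.shuffle(buf)          # flush the remaining samples
        for item in buf:
            yield item
    return shuffled_reader


def batch(reader, batch_size):
    """Decorator: group single samples into lists of batch_size samples."""
    def batched_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:                  # this sketch keeps the last partial batch
            yield current
    return batched_reader


def toy_reader():
    """A reader: each call returns a generator over (feature, label) samples."""
    for i in range(10):
        yield (i, i % 2)


train_reader = batch(shuffle(toy_reader, buf_size=4), batch_size=3)
batches = list(train_reader())
print([len(b) for b in batches])
```

In this sketch the last batch is smaller than `batch_size`. The training loop in this tutorial guards against that case by skipping any batch whose length differs from `batch_size`.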
### Create the executor
```python
if use_gpu:
    exe = fluid.Executor(fluid.CUDAPlace(0))
else:
    exe = fluid.Executor(fluid.CPUPlace())

exe.run(fluid.default_startup_program())
```
### Start training
In each training iteration, the generator and the discriminator each run their own number of update steps. To keep the discriminator loss from converging to 0 too quickly, this tutorial by default trains the discriminator once and the generator twice per iteration.
```python
t_time = 0
losses = [[], []]

# Number of generator updates per iteration
NUM_TRAIN_TIMES_OF_DG = 2

# Fixed noise used to produce the sample images
const_n = np.random.uniform(
    low=-1.0, high=1.0,
    size=[batch_size, NOISE_SIZE]).astype('float32')

for pass_id in range(epoch):
    for batch_id, data in enumerate(train_reader()):
        if len(data) != batch_size:
            continue

        # Noise for this training step
        noise_data = np.random.uniform(
            low=-1.0, high=1.0,
            size=[batch_size, NOISE_SIZE]).astype('float32')

        # Real images
        real_image = np.array(list(map(lambda x: x[0], data))).reshape(
            -1, 784).astype('float32')
        # Labels for real images
        real_labels = np.ones(
            shape=[real_image.shape[0], 1], dtype='float32')
        # Labels for fake images
        fake_labels = np.zeros(
            shape=[real_image.shape[0], 1], dtype='float32')
        total_label = np.concatenate([real_labels, fake_labels])
        s_time = time.time()

        # Fake images
        generated_image = exe.run(g_program,
                                  feed={'noise': noise_data},
                                  fetch_list=[g_img])[0]

        total_images = np.concatenate([real_image, generated_image])

        # Loss of D for judging fake images as fake
        d_loss_1 = exe.run(d_program,
                           feed={
                               'img': generated_image,
                               'label': fake_labels,
                           },
                           fetch_list=[d_loss])[0][0]

        # Loss of D for judging real images as real
        d_loss_2 = exe.run(d_program,
                           feed={
                               'img': real_image,
                               'label': real_labels,
                           },
                           fetch_list=[d_loss])[0][0]

        d_loss_n = d_loss_1 + d_loss_2
        losses[0].append(d_loss_n)

        # Train the generator
        for _ in six.moves.xrange(NUM_TRAIN_TIMES_OF_DG):
            noise_data = np.random.uniform(
                low=-1.0, high=1.0,
                size=[batch_size, NOISE_SIZE]).astype('float32')
            dg_loss_n = exe.run(dg_program,
                                feed={'noise': noise_data},
                                fetch_list=[dg_loss])[0][0]
            losses[1].append(dg_loss_n)
        t_time += (time.time() - s_time)
        if batch_id % 10 == 0:
            if not os.path.exists(output):
                os.makedirs(output)
            # Images generated from the fixed noise at this point
            generated_images = exe.run(g_program_test,
                                       feed={'noise': const_n},
                                       fetch_list=[g_img])[0]

            # Concatenate the real and the generated images
            total_images = np.concatenate([real_image, generated_images])
            fig = plot(total_images)
            msg = "Epoch ID={0} Batch ID={1} D-Loss={2} DG-Loss={3}\n ".format(
                pass_id, batch_id,
                d_loss_n, dg_loss_n)
            print(msg)
            plt.title(msg)
            plt.savefig(
                '{}/{:04d}_{:04d}.png'.format(output, pass_id,
                                              batch_id),
                bbox_inches='tight')
            plt.close(fig)
```
Display the images generated at a specific point in training:
```python
def display_image(epoch_no, batch_id):
    return PIL.Image.open('output_dcgan/{:04d}_{:04d}.png'.format(epoch_no, batch_id))

# View the images generated at epoch 10, batch 460:
display_image(10, 460)
```
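The batch number 460 is not arbitrary. The arithmetic below is a small sketch of which files the training loop writes per epoch, assuming the MNIST training set has 60000 samples, `batch_size = 128`, and that the `output` directory used during training is `output_dcgan` (the path hard-coded in `display_image`). The variable names are only for illustration.

```python
num_samples = 60000   # assumed size of the MNIST training set
batch_size = 128

# Only full batches are used; the loop skips a trailing partial batch.
full_batches = num_samples // batch_size

# An image is saved whenever batch_id % 10 == 0.
saved_batch_ids = [b for b in range(full_batches) if b % 10 == 0]
last_saved = saved_batch_ids[-1]

path = '{}/{:04d}_{:04d}.png'.format('output_dcgan', 10, last_saved)
print(full_batches, last_saved, path)
```

Under these assumptions there are 468 full batches per epoch (ids 0 to 467), so 460 is the last batch of each epoch for which an image is saved.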
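## Summary
DCGAN takes a random noise vector as input and passes it through a structure that mirrors a CNN in reverse, upsampling it into two-dimensional data. With a generator of this form and a CNN-based discriminator, DCGAN achieves impressive results in image generation. In this example we used DCGAN to generate handwritten digit images. You can try other datasets to generate images that suit your needs, or modify the network structure and observe how the results change.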
<a name="参考文献"></a>
## References
[1] Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua. Generative Adversarial Networks. 2014. arXiv:1406.2661 [stat.ML].
[2] Andrej Karpathy, Pieter Abbeel, Greg Brockman, Peter Chen, Vicki Cheung, Rocky Duan, Ian Goodfellow, Durk Kingma, Jonathan Ho, Rein Houthooft, Tim Salimans, John Schulman, Ilya Sutskever, and Wojciech Zaremba. Generative Models. OpenAI, April 7, 2016.
[3] Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi. Improved Techniques for Training GANs. 2016. arXiv:1606.03498 [cs.LG].
[4] Radford A, Metz L, Chintala S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks[J]. Computer Science, 2015.
</div>
<!-- You can change the lines below now. -->
<script type="text/javascript">
marked.setOptions({
renderer: new marked.Renderer(),
gfm: true,
breaks: false,
smartypants: true,
highlight: function(code, lang) {
code = code.replace(/&amp;/g, "&")
code = code.replace(/&gt;/g, ">")
code = code.replace(/&lt;/g, "<")
code = code.replace(/&nbsp;/g, " ")
return hljs.highlightAuto(code, [lang]).value;
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
</script>
</body>
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle
import paddle.fluid as fluid
from utility import get_parent_function_name
import os
gf_dim = 64
df_dim = 64
gfc_dim = 1024 * 2
dfc_dim = 1024
img_dim = 28
c_dim = 3
y_dim = 1
output_height = 28
output_width = 28
use_cudnn = True
if 'ce_mode' in os.environ:
    use_cudnn = False


def bn(x, name=None, act='relu'):
    if name is None:
        name = get_parent_function_name()
    #return fluid.layers.leaky_relu(x)
    return fluid.layers.batch_norm(
        x,
        param_attr=name + '1',
        bias_attr=name + '2',
        moving_mean_name=name + '3',
        moving_variance_name=name + '4',
        name=name,
        act=act)


def conv(x, num_filters, name=None, act=None):
    if name is None:
        name = get_parent_function_name()
    return fluid.nets.simple_img_conv_pool(
        input=x,
        filter_size=5,
        num_filters=num_filters,
        pool_size=2,
        pool_stride=2,
        param_attr=name + 'w',
        bias_attr=name + 'b',
        use_cudnn=use_cudnn,
        act=act)


def fc(x, num_filters, name=None, act=None):
    if name is None:
        name = get_parent_function_name()
    return fluid.layers.fc(
        input=x,
        size=num_filters,
        act=act,
        param_attr=name + 'w',
        bias_attr=name + 'b')


def deconv(x,
           num_filters,
           name=None,
           filter_size=5,
           stride=2,
           dilation=1,
           padding=2,
           output_size=None,
           act=None):
    if name is None:
        name = get_parent_function_name()
    return fluid.layers.conv2d_transpose(
        input=x,
        param_attr=name + 'w',
        bias_attr=name + 'b',
        num_filters=num_filters,
        output_size=output_size,
        filter_size=filter_size,
        stride=stride,
        dilation=dilation,
        padding=padding,
        use_cudnn=use_cudnn,
        act=act)
def conv_cond_concat(x, y):
    """Concatenate conditioning vector on feature map axis."""
    ones = fluid.layers.fill_constant_batch_size_like(
        x, [-1, y.shape[1], x.shape[2], x.shape[3]], "float32", 1.0)
    return fluid.layers.concat([x, ones * y], 1)


def D_cond(image, y):
    image = fluid.layers.reshape(x=image, shape=[-1, 1, 28, 28])
    yb = fluid.layers.reshape(y, [-1, y_dim, 1, 1])
    x = conv_cond_concat(image, yb)

    h0 = conv(x, c_dim + y_dim, act="leaky_relu")
    h0 = conv_cond_concat(h0, yb)
    h1 = bn(conv(h0, df_dim + y_dim), act="leaky_relu")
    h1 = fluid.layers.flatten(h1, axis=1)

    h1 = fluid.layers.concat([h1, y], 1)

    h2 = bn(fc(h1, dfc_dim), act='leaky_relu')
    h2 = fluid.layers.concat([h2, y], 1)

    h3 = fc(h2, 1, act='sigmoid')
    return h3


def G_cond(z, y):
    s_h, s_w = output_height, output_width
    s_h2, s_h4 = int(s_h // 2), int(s_h // 4)
    s_w2, s_w4 = int(s_w // 2), int(s_w // 4)

    yb = fluid.layers.reshape(y, [-1, y_dim, 1, 1])  #NCHW

    z = fluid.layers.concat([z, y], 1)
    h0 = bn(fc(z, gfc_dim // 2), act='relu')
    h0 = fluid.layers.concat([h0, y], 1)

    h1 = bn(fc(h0, gf_dim * 2 * s_h4 * s_w4), act='relu')
    h1 = fluid.layers.reshape(h1, [-1, gf_dim * 2, s_h4, s_w4])

    h1 = conv_cond_concat(h1, yb)
    h2 = bn(deconv(h1, gf_dim * 2, output_size=[s_h2, s_w2]), act='relu')
    h2 = conv_cond_concat(h2, yb)
    h3 = deconv(h2, 1, output_size=[s_h, s_w], act='tanh')
    return fluid.layers.reshape(h3, shape=[-1, s_h * s_w])


def D(x):
    x = fluid.layers.reshape(x=x, shape=[-1, 1, 28, 28])
    x = conv(x, df_dim, act='leaky_relu')
    x = bn(conv(x, df_dim * 2), act='leaky_relu')
    x = bn(fc(x, dfc_dim), act='leaky_relu')
    x = fc(x, 1, act='sigmoid')
    return x


def G(x):
    x = bn(fc(x, gfc_dim))
    x = bn(fc(x, gf_dim * 2 * img_dim // 4 * img_dim // 4))
    x = fluid.layers.reshape(x, [-1, gf_dim * 2, img_dim // 4, img_dim // 4])
    x = deconv(x, gf_dim * 2, act='relu', output_size=[14, 14])
    x = deconv(x, 1, filter_size=5, padding=2, act='tanh', output_size=[28, 28])
    x = fluid.layers.reshape(x, shape=[-1, 28 * 28])
    return x
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import distutils.util
import numpy as np
import inspect
import matplotlib
import six
matplotlib.use('agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
img_dim = 28
def get_parent_function_name():
    return inspect.stack()[2][3] + '.' + inspect.stack()[1][3] + '.' + str(
        inspect.stack()[2][2]) + '.'


def plot(gen_data):
    pad_dim = 1
    paded = pad_dim + img_dim
    gen_data = gen_data.reshape(gen_data.shape[0], img_dim, img_dim)
    n = int(math.ceil(math.sqrt(gen_data.shape[0])))
    gen_data = (np.pad(
        gen_data, [[0, n * n - gen_data.shape[0]], [pad_dim, 0], [pad_dim, 0]],
        'constant').reshape((n, n, paded, paded)).transpose((0, 2, 1, 3))
                .reshape((n * paded, n * paded)))
    fig = plt.figure(figsize=(8, 8))
    plt.axis('off')
    plt.imshow(gen_data, cmap='Greys_r', vmin=-1, vmax=1)
    return fig


def check(a):
    a = np.sort(np.array(a).flatten())
    return [
        np.average(a), np.min(a), np.max(a), a[int(len(a) * 0.25)],
        a[int(len(a) * 0.75)]
    ]


def print_arguments(args):
    """Print argparse's arguments.

    Usage:

    .. code-block:: python

        parser = argparse.ArgumentParser()
        parser.add_argument("name", default="Jonh", type=str, help="User name.")
        args = parser.parse_args()
        print_arguments(args)

    :param args: Input argparse.Namespace for printing.
    :type args: argparse.Namespace
    """
    print("----------- Configuration Arguments -----------")
    for arg, value in sorted(six.iteritems(vars(args))):
        print("%s: %s" % (arg, value))
    print("------------------------------------------------")


def add_arguments(argname, type, default, help, argparser, **kwargs):
    """Add argparse's argument.

    Usage:

    .. code-block:: python

        parser = argparse.ArgumentParser()
        add_argument("name", str, "Jonh", "User name.", parser)
        args = parser.parse_args()
    """
    type = distutils.util.strtobool if type == bool else type
    argparser.add_argument(
        "--" + argname,
        default=default,
        type=type,
        help=help + ' Default: %(default)s.',
        **kwargs)
...@@ -4,14 +4,15 @@ ...@@ -4,14 +4,15 @@
[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://github.com/PaddlePaddle/book/blob/develop/README.md) [![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://github.com/PaddlePaddle/book/blob/develop/README.md)
[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md) [![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md)
1. [Linear Regression](http://www.paddlepaddle.org/documentation/book/zh/develop/01.fit_a_line/index.cn.html) 1. [Linear Regression](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/basics/fit_a_line/README.cn.html)
1. [Recognize Digits](http://www.paddlepaddle.org/documentation/book/zh/develop/02.recognize_digits/index.cn.html) 1. [Recognize Digits](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/basics/recognize_digits/README.cn.html)
1. [Image Classification](http://www.paddlepaddle.org/documentation/book/zh/develop/03.image_classification/index.cn.html) 1. [Image Classification](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/basics/image_classification/index.html)
1. [Word Vectors](http://www.paddlepaddle.org/documentation/book/zh/develop/04.word2vec/index.cn.html) 1. [Word Vectors](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/basics/word2vec/index.html)
1. [Personalized Recommendation](http://www.paddlepaddle.org/documentation/book/zh/develop/05.recommender_system/index.cn.html) 1. [Personalized Recommendation](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/basics/recommender_system/index.html)
1. [Sentiment Analysis](http://www.paddlepaddle.org/documentation/book/zh/develop/06.understand_sentiment/index.cn.html) 1. [Sentiment Analysis](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/basics/understand_sentiment/index.html)
1. [Semantic Role Labeling](http://www.paddlepaddle.org/documentation/book/zh/develop/07.label_semantic_roles/index.cn.html) 1. [Semantic Role Labeling](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/basics/label_semantic_roles/index.html)
1. [Machine Translation](http://www.paddlepaddle.org/documentation/book/zh/develop/08.machine_translation/index.cn.html) 1. [Machine Translation](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/basics/machine_translation/index.html)
1. [Generative Adversarial Networks](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/basics/gan/index.html)
For more learning material, visit the PaddlePaddle [video classroom](http://bit.baidu.com/Course/datalist/column/117.html) For more learning material, visit the PaddlePaddle [video classroom](http://bit.baidu.com/Course/datalist/column/117.html)
......
...@@ -4,14 +4,14 @@ ...@@ -4,14 +4,14 @@
[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://github.com/PaddlePaddle/book/blob/develop/README.md) [![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://github.com/PaddlePaddle/book/blob/develop/README.md)
[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md) [![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](https://github.com/PaddlePaddle/book/blob/develop/README.cn.md)
1. [Fit a Line](http://www.paddlepaddle.org/documentation/book/en/develop/01.fit_a_line/index.html) 1. [Fit a Line](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/basics/fit_a_line/README.html)
1. [Recognize Digits](http://www.paddlepaddle.org/documentation/book/en/develop/02.recognize_digits/index.html) 1. [Recognize Digits](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/basics/recognize_digits/README.html)
1. [Image Classification](http://www.paddlepaddle.org/documentation/book/en/develop/03.image_classification/index.html) 1. [Image Classification](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/basics/image_classification/index_en.html)
1. [Word to Vector](http://www.paddlepaddle.org/documentation/book/en/develop/04.word2vec/index.html) 1. [Word to Vector](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/basics/word2vec/index_en.html)
1. [Recommender System](http://www.paddlepaddle.org/documentation/book/en/develop/05.recommender_system/index.html) 1. [Recommender System](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/basics/recommender_system/index_en.html)
1. [Understand Sentiment](http://www.paddlepaddle.org/documentation/book/en/develop/06.understand_sentiment/index.html) 1. [Understand Sentiment](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/basics/understand_sentiment/index_en.html)
1. [Label Semantic Roles](http://www.paddlepaddle.org/documentation/book/en/develop/07.label_semantic_roles/index.html) 1. [Label Semantic Roles](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/basics/label_semantic_roles/index_en.html)
1. [Machine Translation](http://www.paddlepaddle.org/documentation/book/en/develop/08.machine_translation/index.html) 1. [Machine Translation](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/basics/machine_translation/index_en.html)
## Running the Book ## Running the Book
......