提交 abd859c4 编写于 作者: L liaogang

auto format and fix conflict

- repo: https://github.com/Lucas-C/pre-commit-hooks.git
sha: c25201a00e6b0514370501050cf2a8538ac12270
hooks:
- id: remove-crlf
- repo: https://github.com/reyoung/mirrors-yapf.git
sha: v0.13.2
hooks:
- id: yapf
files: (.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$ # Bazel BUILD files follow Python syntax.
- repo: https://github.com/pre-commit/pre-commit-hooks
sha: 7539d8bd1a00a3c1bfd34cdb606d3a6372e83469
sha: v0.7.1
hooks:
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: git://github.com/Lucas-C/pre-commit-hooks
sha: v1.0.1
hooks:
- id: forbid-crlf
- id: remove-crlf
- id: forbid-tabs
- id: remove-tabs
......@@ -67,7 +67,7 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
EOF
......@@ -1093,7 +1093,7 @@ function escape(html, encode) {
}
function unescape(html) {
// explicitly match decimal, hex, and named HTML entities
// explicitly match decimal, hex, and named HTML entities
return html.replace(/&(#(?:\d+)|(?:#x[0-9A-Fa-f]+)|(?:\w+));?/g, function(_, n) {
n = n.toLowerCase();
if (n === 'colon') return ':';
......
此差异已折叠。
......@@ -45,14 +45,25 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
3. 根据损失函数进行反向误差传播 ([backpropagation](https://en.wikipedia.org/wiki/Backpropagation)),将网络误差从输出层依次向前传递, 并更新网络中的参数。
4. 重复2~3步骤,直至网络训练误差达到规定的程度或训练轮次达到设定值。
## 数据集
## 数据准备
执行以下命令来准备数据:
```bash
cd data && python prepare_data.py
### 数据集接口的封装
首先加载需要的包
```python
import paddle.v2 as paddle
import paddle.v2.dataset.uci_housing as uci_housing
```
这段代码将从[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)下载数据并进行[预处理](#数据预处理),最后数据将被分为训练集和测试集。
我们通过uci_housing模块引入了数据集合[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)
其中,在uci_housing模块中封装了:
1. 数据下载的过程。下载数据保存在~/.cache/paddle/dataset/uci_housing/housing.data。
2. [数据预处理](#数据预处理)的过程。
### 数据集介绍
这份数据集共506行,每行包含了波士顿郊区的一类房屋的相关信息及该类房屋价格的中位数。其各维属性的意义如下:
| 属性名 | 解释 | 类型 |
......@@ -90,89 +101,89 @@ cd data && python prepare_data.py
</p>
#### 整理训练集与测试集
我们将数据集分割为两份:一份用于调整模型的参数,即进行模型的训练,模型在这份数据集上的误差被称为**训练误差**;另外一份被用来测试,模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据,所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素:更多的训练数据会降低参数估计的方差,从而得到更可信的模型;而更多的测试数据会降低测试误差的方差,从而得到更可信的测试误差。一种常见的分割比例为$8:2$,感兴趣的读者朋友们也可以尝试不同的设置来观察这两种误差的变化。
我们将数据集分割为两份:一份用于调整模型的参数,即进行模型的训练,模型在这份数据集上的误差被称为**训练误差**;另外一份被用来测试,模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据,所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素:更多的训练数据会降低参数估计的方差,从而得到更可信的模型;而更多的测试数据会降低测试误差的方差,从而得到更可信的测试误差。我们这个例子中设置的分割比例为$8:2$
在更复杂的模型训练过程中,我们往往还会多使用一种数据集:验证集。因为复杂的模型中常常还有一些超参数([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization))需要调节,所以我们会尝试多种超参数的组合来分别训练多个模型,然后对比它们在验证集上的表现选择相对最好的一组超参数,最后才使用这组参数下训练的模型在测试集上评估测试误差。由于本章训练的模型比较简单,我们暂且忽略掉这个过程。
## 训练
`fit_a_line/trainer.py`演示了训练的整体过程。
### 初始化PaddlePaddle
执行如下命令可以分割数据集,并将训练集和测试集的地址分别写入train.list 和 test.list两个文件中,供PaddlePaddle读取。
```python
python prepare_data.py -r 0.8 #默认使用8:2的比例进行分割
paddle.init(use_gpu=False, trainer_count=1)
```
在更复杂的模型训练过程中,我们往往还会多使用一种数据集:验证集。因为复杂的模型中常常还有一些超参数([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization))需要调节,所以我们会尝试多种超参数的组合来分别训练多个模型,然后对比它们在验证集上的表现选择相对最好的一组超参数,最后才使用这组参数下训练的模型在测试集上评估测试误差。由于本章训练的模型比较简单,我们暂且忽略掉这个过程。
### 模型配置
### 提供数据给PaddlePaddle
准备好数据之后,我们使用一个Python data provider来为PaddlePaddle的训练过程提供数据。一个 data provider 就是一个Python函数,它会被PaddlePaddle的训练过程调用。在这个例子里,只需要读取已经保存好的数据,然后一行一行地返回给PaddlePaddle的训练进程即可。
线性回归的模型其实就是一个采用线性激活函数(linear activation,`LinearActivation`)的全连接层(fully-connected layer,`fc_layer`):
```python
from paddle.trainer.PyDataProvider2 import *
import numpy as np
#定义数据的类型和维度
@provider(input_types=[dense_vector(13), dense_vector(1)])
def process(settings, input_file):
data = np.load(input_file.strip())
for row in data:
yield row[:-1].tolist(), row[-1:].tolist()
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
y_predict = paddle.layer.fc(input=x,
size=1,
act=paddle.activation.Linear())
y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
cost = paddle.layer.regression_cost(input=y_predict, label=y)
```
### 创建参数
```python
parameters = paddle.parameters.create(cost)
```
## 模型配置说明
### 创建Trainer
### 数据定义
首先,通过 `define_py_data_sources2` 来配置PaddlePaddle从上面的`dataprovider.py`里读入训练数据和测试数据。 PaddlePaddle接受从命令行读入的配置信息,例如这里我们传入一个名为`is_predict`的变量来控制模型在训练和测试时的不同结构。
```python
from paddle.trainer_config_helpers import *
optimizer = paddle.optimizer.Momentum(momentum=0)
is_predict = get_config_arg('is_predict', bool, False)
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
```
define_py_data_sources2(
train_list='data/train.list',
test_list='data/test.list',
module='dataprovider',
obj='process')
### 读取数据且打印训练的中间信息
```
PaddlePaddle提供一个
[reader机制](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
来读取数据。 Reader返回的数据可以包括多列,我们需要一个Python dict把列
序号映射到网络里的数据层。
### 算法配置
接着,指定模型优化算法的细节。由于线性回归模型比较简单,我们只要设置基本的`batch_size`即可,它指定每次更新参数的时候使用多少条数据计算梯度信息。
```python
settings(batch_size=2)
feeding={'x': 0, 'y': 1}
```
### 网络结构
最后,使用`fc_layer``LinearActivation`来表示线性回归的模型本身。
此外,我们还可以提供一个 event handler,来打印训练的进度:
```python
#输入数据,13维的房屋信息
x = data_layer(name='x', size=13)
y_predict = fc_layer(
input=x,
param_attr=ParamAttr(name='w'),
size=1,
act=LinearActivation(),
bias_attr=ParamAttr(name='b'))
if not is_predict: #训练时,我们使用MSE,即regression_cost作为损失函数
y = data_layer(name='y', size=1)
cost = regression_cost(input=y_predict, label=y)
outputs(cost) #训练时输出MSE来监控损失的变化
else: #测试时,输出预测值
outputs(y_predict)
# event_handler to print training and testing info
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f" % (
event.pass_id, event.batch_id, event.cost)
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=paddle.batch(
uci_housing.test(), batch_size=2),
feeding=feeding)
print "Test %d, Cost %f" % (event.pass_id, result.cost)
```
## 训练模型
在对应代码的根目录下执行PaddlePaddle的命令行训练程序。这里指定模型配置文件为`trainer_config.py`,训练30轮,结果保存在`output`路径下。
```bash
./train.sh
```
### 开始训练
## 应用模型
现在来看下如何使用已经训练好的模型进行预测。
```bash
python predict.py
```
这里默认使用`output/pass-00029`中保存的模型进行预测,并将数据中的房价与预测结果进行对比,结果保存在 `predictions.png`中。
如果你想使用别的模型或者其它的数据进行预测,只要传入新的路径即可:
```bash
python predict.py -m output/pass-00020 -t data/housing.test.npy
```python
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
uci_housing.train(), buf_size=500),
batch_size=2),
feeding=feeding,
event_handler=event_handler,
num_passes=30)
```
## 总结
......
......@@ -54,8 +54,8 @@ where $\omega_{d}$ and $b$ are the model parameters we want to estimate. Once th
## Results Demonstration
We first show the training result of our model. We use the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) to train a linear model and predict the house prices in Boston. The figure below shows the predictions the model makes for some house prices. The $X$ coordinate of each point represents the median value of the prices of a certain type of houses, while the $Y$ coordinate represents the predicted value by our linear model. When $X=Y$, the point lies exactly on the dotted line. In other words, the more precise the model predicts, the closer the point is to the dotted line.
<p align="center">
<img src = "image/predictions_en.png" width=400><br/>
Figure 1. Predicted Value V.S. Actual Value
<img src = "image/predictions_en.png" width=400><br/>
Figure 1. Predicted Value V.S. Actual Value
</p>
## Model Overview
......@@ -126,8 +126,8 @@ There are at least three reasons for [Feature Normalization](https://en.wikipedi
- Many Machine Learning techniques or models (e.g., L1/L2 regularization and Vector Space Model) are based on the assumption that all the features have roughly zero means and their value ranges are similar.
<p align="center">
<img src = "image/ranges_en.png" width=550><br/>
Figure 2. The value ranges of the features
<img src = "image/ranges_en.png" width=550><br/>
Figure 2. The value ranges of the features
</p>
#### Prepare Training and Test Sets
......@@ -151,7 +151,7 @@ import numpy as np
def process(settings, input_file):
data = np.load(input_file.strip())
for row in data:
yield row[:-1].tolist(), row[-1:].tolist()
yield row[:-1].tolist(), row[-1:].tolist()
```
......@@ -246,6 +246,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
......@@ -56,8 +56,8 @@ $$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, i=1,\ldo
## 效果展示
我们使用从[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)获得的波士顿房价数据集进行模型的训练和预测。下面的散点图展示了使用模型对部分房屋价格进行的预测。其中,每个点的横坐标表示同一类房屋真实价格的中位数,纵坐标表示线性回归模型根据特征预测的结果,当二者值完全相等的时候就会落在虚线上。所以模型预测得越准确,则点离虚线越近。
<p align="center">
<img src = "image/predictions.png" width=400><br/>
图1. 预测值 V.S. 真实值
<img src = "image/predictions.png" width=400><br/>
图1. 预测值 V.S. 真实值
</p>
## 模型概览
......@@ -126,8 +126,8 @@ cd data && python prepare_data.py
- 很多的机器学习技巧/模型(例如L1,L2正则项,向量空间模型-Vector Space Model)都基于这样的假设:所有的属性取值都差不多是以0为均值且取值范围相近的。
<p align="center">
<img src = "image/ranges.png" width=550><br/>
图2. 各维属性的取值范围
<img src = "image/ranges.png" width=550><br/>
图2. 各维属性的取值范围
</p>
#### 整理训练集与测试集
......@@ -151,7 +151,7 @@ import numpy as np
def process(settings, input_file):
data = np.load(input_file.strip())
for row in data:
yield row[:-1].tolist(), row[-1:].tolist()
yield row[:-1].tolist(), row[-1:].tolist()
```
......@@ -246,6 +246,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
import paddle.v2 as paddle
import paddle.v2.dataset.uci_housing as uci_housing
def main():
# init
paddle.init(use_gpu=False, trainer_count=1)
# network config
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear())
y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
cost = paddle.layer.regression_cost(input=y_predict, label=y)
# create parameters
parameters = paddle.parameters.create(cost)
# create optimizer
optimizer = paddle.optimizer.Momentum(momentum=0)
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
feeding = {'x': 0, 'y': 1}
# event_handler to print training and testing info
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f" % (
event.pass_id, event.batch_id, event.cost)
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=paddle.batch(
uci_housing.test(), batch_size=2),
feeding=feeding)
print "Test %d, Cost %f" % (event.pass_id, result.cost)
# training
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
uci_housing.train(), buf_size=500),
batch_size=2),
feeding=feeding,
event_handler=event_handler,
num_passes=30)
if __name__ == '__main__':
main()
......@@ -58,6 +58,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
......@@ -57,6 +57,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
......@@ -3,7 +3,7 @@
本教程源代码目录在[book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification), 初次使用请参考PaddlePaddle[安装教程](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html)
## 背景介绍
## 背景介绍
图像相比文字能够提供更加生动、容易理解及更具艺术感的信息,是人们转递与交换信息的重要来源。在本教程中,我们专注于图像识别领域的一个重要问题,即图像分类。
......@@ -51,7 +51,7 @@
2). **特征编码**: 底层特征中包含了大量冗余与噪声,为了提高特征表达的鲁棒性,需要使用一种特征变换算法对底层特征进行编码,称作特征编码。常用的特征编码包括向量量化编码 \[[4](#参考文献)\]、稀疏编码 \[[5](#参考文献)\]、局部线性约束编码 \[[6](#参考文献)\]、Fisher向量编码 \[[7](#参考文献)\] 等。
3). **空间特征约束**: 特征编码之后一般会经过空间特征约束,也称作**特征汇聚**。特征汇聚是指在一个空间范围内,对每一维特征取最大值或者平均值,可以获得一定特征不变形的特征表达。金字塔特征匹配是一种常用的特征聚会方法,这种方法提出将图像均匀分块,在分块内做特征汇聚。
4). **通过分类器分类**: 经过前面步骤之后一张图像可以用一个固定维度的向量进行描述,接下来就是经过分类器对图像进行分类。通常使用的分类器包括SVM(Support Vector Machine, 支持向量机)、随机森林等。而使用核方法的SVM是最为广泛的分类器,在传统图像分类任务上性能很好。
这种方法在PASCAL VOC竞赛中的图像分类算法中被广泛使用 \[[18](#参考文献)\][NEC实验室](http://www.nec-labs.com/)在ILSVRC2010中采用SIFT和LBP特征,两个非线性编码器以及SVM分类器获得图像分类的冠军 \[[8](#参考文献)\]
Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得了历史性的突破,效果大幅度超越传统方法,获得了ILSVRC2012冠军,该模型被称作AlexNet。这也是首次将深度学习用于大规模图像分类中。从AlexNet之后,涌现了一系列CNN模型,不断地在ImageNet上刷新成绩,如图4展示。随着模型变得越来越深以及精妙的结构设计,Top-5的错误率也越来越低,降到了3.5%附近。而在同样的ImageNet数据集上,人眼的辨识错误率大概在5.1%,也就是目前的深度学习模型的识别能力已经超过了人眼。
......@@ -67,8 +67,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
<p align="center">
<img src="image/lenet.png"><br/>
图5. CNN网络示例[20]
</p>
图5. CNN网络示例[20]
</p>
- 卷积层(convolution layer): 执行卷积操作提取底层到高层的特征,发掘出图片局部关联性质和空间不变性质。
- 池化层(pooling layer): 执行降采样操作。通过取卷积输出特征图中局部区块的最大值(max-pooling)或者均值(avg-pooling)。降采样也是图像处理中常见的一种操作,可以过滤掉一些不重要的高频信息。
......@@ -108,7 +108,7 @@ GoogleNet整体网络结构如图8所示,总共22层网络:开始由3层普
<p align="center">
<img src="image/googlenet.jpeg" ><br/>
图8. GoogleNet[12]
图8. GoogleNet[12]
</p>
......@@ -174,7 +174,7 @@ paddle.init(use_gpu=False, trainer_count=1)
1. 定义数据输入及其维度
网络输入定义为 `data_layer` (数据层),在图像分类中即为图像像素信息。CIFRAR10是RGB 3通道32x32大小的彩色图,因此输入数据大小为3072(3x32x32),类别大小为10,即10分类。
```python
datadim = 3 * 32 * 32
classdim = 10
......@@ -189,7 +189,7 @@ paddle.init(use_gpu=False, trainer_count=1)
net = vgg_bn_drop(image)
```
VGG核心模块的输入是数据层,`vgg_bn_drop` 定义了16层VGG结构,每层卷积后面引入BN层和Dropout层,详细的定义如下:
```python
def vgg_bn_drop(input):
def conv_block(ipt, num_filter, groups, dropouts, num_channels=None):
......@@ -220,11 +220,11 @@ paddle.init(use_gpu=False, trainer_count=1)
fc2 = paddle.layer.fc(input=bn, size=512, act=paddle.activation.Linear())
return fc2
```
2.1. 首先定义了一组卷积网络,即conv_block。卷积核大小为3x3,池化窗口大小为2x2,窗口滑动大小为2,groups决定每组VGG模块是几次连续的卷积操作,dropouts指定Dropout操作的概率。所使用的`img_conv_group`是在`paddle.networks`中预定义的模块,由若干组 `Conv->BN->ReLu->Dropout` 和 一组 `Pooling` 组成,
2.2. 五组卷积操作,即 5个conv_block。 第一、二组采用两次连续的卷积操作。第三、四、五组采用三次连续的卷积操作。每组最后一个卷积后面Dropout概率为0,即不使用Dropout操作。
2.3. 最后接两层512维的全连接。
3. 定义分类器
......@@ -240,7 +240,7 @@ paddle.init(use_gpu=False, trainer_count=1)
4. 定义损失函数和网络输出
在有监督训练中需要输入图像对应的类别信息,同样通过`paddle.layer.data`来定义。训练中采用多类交叉熵作为损失函数,并作为网络的输出,预测阶段定义网络的输出为分类器得到的概率信息。
```python
lbl = paddle.layer.data(
name="label", type=paddle.data_type.integer_value(classdim))
......@@ -305,9 +305,9 @@ def layer_warp(block_func, ipt, features, count, stride):
`resnet_cifar10` 的连接结构主要有以下几个过程。
1. 底层输入连接一层 `conv_bn_layer`,即带BN的卷积层。
1. 底层输入连接一层 `conv_bn_layer`,即带BN的卷积层。
2. 然后连接3组残差模块即下面配置3组 `layer_warp` ,每组采用图 10 左边残差模块组成。
3. 最后对网络做均值池化并返回该层。
3. 最后对网络做均值池化并返回该层。
注意:除过第一层卷积层和最后一层全连接层之外,要求三组 `layer_warp` 总的含参层数能够被6整除,即 `resnet_cifar10` 的 depth 要满足 $(depth - 2) % 6 == 0$ 。
......@@ -452,7 +452,7 @@ Test with Pass 0, {'classification_error_evaluator': 0.885200023651123}
[2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
[4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
......
......@@ -3,7 +3,7 @@
本教程源代码目录在[book/image_classification](https://github.com/PaddlePaddle/book/tree/develop/image_classification), 初次使用请参考PaddlePaddle[安装教程](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html)
## 背景介绍
## 背景介绍
图像相比文字能够提供更加生动、容易理解及更具艺术感的信息,是人们转递与交换信息的重要来源。在本教程中,我们专注于图像识别领域的一个重要问题,即图像分类。
......@@ -51,7 +51,7 @@
2). **特征编码**: 底层特征中包含了大量冗余与噪声,为了提高特征表达的鲁棒性,需要使用一种特征变换算法对底层特征进行编码,称作特征编码。常用的特征编码包括向量量化编码 \[[4](#参考文献)\]、稀疏编码 \[[5](#参考文献)\]、局部线性约束编码 \[[6](#参考文献)\]、Fisher向量编码 \[[7](#参考文献)\] 等。
3). **空间特征约束**: 特征编码之后一般会经过空间特征约束,也称作**特征汇聚**。特征汇聚是指在一个空间范围内,对每一维特征取最大值或者平均值,可以获得一定特征不变形的特征表达。金字塔特征匹配是一种常用的特征聚会方法,这种方法提出将图像均匀分块,在分块内做特征汇聚。
4). **通过分类器分类**: 经过前面步骤之后一张图像可以用一个固定维度的向量进行描述,接下来就是经过分类器对图像进行分类。通常使用的分类器包括SVM(Support Vector Machine, 支持向量机)、随机森林等。而使用核方法的SVM是最为广泛的分类器,在传统图像分类任务上性能很好。
这种方法在PASCAL VOC竞赛中的图像分类算法中被广泛使用 \[[18](#参考文献)\][NEC实验室](http://www.nec-labs.com/)在ILSVRC2010中采用SIFT和LBP特征,两个非线性编码器以及SVM分类器获得图像分类的冠军 \[[8](#参考文献)\]
Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得了历史性的突破,效果大幅度超越传统方法,获得了ILSVRC2012冠军,该模型被称作AlexNet。这也是首次将深度学习用于大规模图像分类中。从AlexNet之后,涌现了一系列CNN模型,不断地在ImageNet上刷新成绩,如图4展示。随着模型变得越来越深以及精妙的结构设计,Top-5的错误率也越来越低,降到了3.5%附近。而在同样的ImageNet数据集上,人眼的辨识错误率大概在5.1%,也就是目前的深度学习模型的识别能力已经超过了人眼。
......@@ -67,8 +67,8 @@ Alex Krizhevsky在2012年ILSVRC提出的CNN模型 \[[9](#参考文献)\] 取得
<p align="center">
<img src="image/lenet.png"><br/>
图5. CNN网络示例[20]
</p>
图5. CNN网络示例[20]
</p>
- 卷积层(convolution layer): 执行卷积操作提取底层到高层的特征,发掘出图片局部关联性质和空间不变性质。
- 池化层(pooling layer): 执行降采样操作。通过取卷积输出特征图中局部区块的最大值(max-pooling)或者均值(avg-pooling)。降采样也是图像处理中常见的一种操作,可以过滤掉一些不重要的高频信息。
......@@ -108,7 +108,7 @@ GoogleNet整体网络结构如图8所示,总共22层网络:开始由3层普
<p align="center">
<img src="image/googlenet.jpeg" ><br/>
图8. GoogleNet[12]
图8. GoogleNet[12]
</p>
......@@ -245,7 +245,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
1. 定义数据输入及其维度
网络输入定义为 `data_layer` (数据层),在图像分类中即为图像像素信息。CIFRAR10是RGB 3通道32x32大小的彩色图,因此输入数据大小为3072(3x32x32),类别大小为10,即10分类。
```python
datadim = 3 * 32 * 32
classdim = 10
......@@ -258,7 +258,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
net = vgg_bn_drop(data)
```
VGG核心模块的输入是数据层,`vgg_bn_drop` 定义了16层VGG结构,每层卷积后面引入BN层和Dropout层,详细的定义如下:
```python
def vgg_bn_drop(input, num_channels):
def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
......@@ -273,26 +273,26 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
conv_with_batchnorm=True,
conv_batchnorm_drop_rate=dropouts,
pool_type=MaxPooling())
conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
conv2 = conv_block(conv1, 128, 2, [0.4, 0])
conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
drop = dropout_layer(input=conv5, dropout_rate=0.5)
fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
bn = batch_norm_layer(
input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
return fc2
```
2.1. 首先定义了一组卷积网络,即conv_block。卷积核大小为3x3,池化窗口大小为2x2,窗口滑动大小为2,groups决定每组VGG模块是几次连续的卷积操作,dropouts指定Dropout操作的概率。所使用的`img_conv_group`是在`paddle.trainer_config_helpers`中预定义的模块,由若干组 `Conv->BN->ReLu->Dropout` 和 一组 `Pooling` 组成,
2.2. 五组卷积操作,即 5个conv_block。 第一、二组采用两次连续的卷积操作。第三、四、五组采用三次连续的卷积操作。每组最后一个卷积后面Dropout概率为0,即不使用Dropout操作。
2.3. 最后接两层512维的全连接。
3. 定义分类器
......@@ -306,7 +306,7 @@ $$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
4. 定义损失函数和网络输出
在有监督训练中需要输入图像对应的类别信息,同样通过`data_layer`来定义。训练中采用多类交叉熵作为损失函数,并作为网络的输出,预测阶段定义网络的输出为分类器得到的概率信息。
```python
if not is_predict:
lbl = data_layer(name="label", size=class_num)
......@@ -383,9 +383,9 @@ def layer_warp(block_func, ipt, features, count, stride):
`resnet_cifar10` 的连接结构主要有以下几个过程。
1. 底层输入连接一层 `conv_bn_layer`,即带BN的卷积层。
1. 底层输入连接一层 `conv_bn_layer`,即带BN的卷积层。
2. 然后连接3组残差模块即下面配置3组 `layer_warp` ,每组采用图 10 左边残差模块组成。
3. 最后对网络做均值池化并返回该层。
3. 最后对网络做均值池化并返回该层。
注意:除过第一层卷积层和最后一层全连接层之外,要求三组 `layer_warp` 总的含参层数能够被6整除,即 `resnet_cifar10` 的 depth 要满足 $(depth - 2) % 6 == 0$ 。
......@@ -487,7 +487,7 @@ python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png
<p align="center">
<img src="image/fea_conv0.png" width="500"><br/>
图13. 卷积特征可视化图
图13. 卷积特征可视化图
</p>
## 总结
......@@ -501,7 +501,7 @@ python classify.py --job=extract --model=output/pass-00299 --data=image/dog.png
[2] N. Dalal, B. Triggs, [Histograms of Oriented Gradients for Human Detection](http://vision.stanford.edu/teaching/cs231b_spring1213/papers/CVPR05_DalalTriggs.pdf), Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
[3] Ahonen, T., Hadid, A., and Pietikinen, M. (2006). [Face description with local binary patterns: Application to face recognition](http://ieeexplore.ieee.org/document/1717463/). PAMI, 28.
[4] J. Sivic, A. Zisserman, [Video Google: A Text Retrieval Approach to Object Matching in Videos](http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf), Proc. Ninth Int'l Conf. Computer Vision, pp. 1470-1478, 2003.
......
......@@ -289,48 +289,48 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
The input to the network is defined as `data_layer`, or image pixels in the context of image classification. The images in CIFAR10 are 32x32 color images of three channels. Therefore, the size of the input data is 3072 (3x32x32), and the number of categories is 10.
```python
datadim = 3 * 32 * 32
classdim = 10
data = data_layer(name='image', size=datadim)
```
```python
datadim = 3 * 32 * 32
classdim = 10
data = data_layer(name='image', size=datadim)
```
2. Define VGG main module
```python
net = vgg_bn_drop(data)
```
```python
net = vgg_bn_drop(data)
```
The input to VGG main module is from data layer. `vgg_bn_drop` defines a 16-layer VGG network, with each convolutional layer followed by BN and dropout layers. Here is the definition in detail:
```python
def vgg_bn_drop(input, num_channels):
def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
return img_conv_group(
input=ipt,
num_channels=num_channels_,
pool_size=2,
pool_stride=2,
conv_num_filter=[num_filter] * groups,
conv_filter_size=3,
conv_act=ReluActivation(),
conv_with_batchnorm=True,
conv_batchnorm_drop_rate=dropouts,
pool_type=MaxPooling())
conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
conv2 = conv_block(conv1, 128, 2, [0.4, 0])
conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
drop = dropout_layer(input=conv5, dropout_rate=0.5)
fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
bn = batch_norm_layer(
input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
return fc2
```
```python
def vgg_bn_drop(input, num_channels):
def conv_block(ipt, num_filter, groups, dropouts, num_channels_=None):
return img_conv_group(
input=ipt,
num_channels=num_channels_,
pool_size=2,
pool_stride=2,
conv_num_filter=[num_filter] * groups,
conv_filter_size=3,
conv_act=ReluActivation(),
conv_with_batchnorm=True,
conv_batchnorm_drop_rate=dropouts,
pool_type=MaxPooling())
conv1 = conv_block(input, 64, 2, [0.3, 0], 3)
conv2 = conv_block(conv1, 128, 2, [0.4, 0])
conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
drop = dropout_layer(input=conv5, dropout_rate=0.5)
fc1 = fc_layer(input=drop, size=512, act=LinearActivation())
bn = batch_norm_layer(
input=fc1, act=ReluActivation(), layer_attr=ExtraAttr(drop_rate=0.5))
fc2 = fc_layer(input=bn, size=512, act=LinearActivation())
return fc2
```
2.1. First defines a convolution block or conv_block. The default convolution kernel is 3x3, and the default pooling size is 2x2 with stride 2. Dropout specifies the probability in dropout operation. Function `img_conv_group` is defined in `paddle.trainer_config_helpers` consisting of a series of `Conv->BN->ReLu->Dropout` and a `Pooling`.
......@@ -344,22 +344,22 @@ First we define VGG network. Since the image size and amount of CIFAR10 are rela
The above VGG network extracts high-level features and maps them to a vector of the same size as the categories. Softmax function or classifier is then used for calculating the probability of the image belonging to each category.
```python
out = fc_layer(input=net, size=class_num, act=SoftmaxActivation())
```
```python
out = fc_layer(input=net, size=class_num, act=SoftmaxActivation())
```
4. Define Loss Function and Outputs
In the context of supervised learning, labels of training images are defined in `data_layer`, too. During training, cross-entropy is used as loss function and as the output of the network; During testing, the outputs are the probabilities calculated in the classifier.
```python
if not is_predict:
lbl = data_layer(name="label", size=class_num)
cost = classification_cost(input=out, label=lbl)
outputs(cost)
else:
outputs(out)
```
```python
if not is_predict:
lbl = data_layer(name="label", size=class_num)
cost = classification_cost(input=out, label=lbl)
outputs(cost)
else:
outputs(out)
```
### ResNet
......@@ -607,6 +607,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
此差异已折叠。
......@@ -57,6 +57,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
......@@ -57,6 +57,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
<html>
<head>
<meta http-equiv="refresh" content="0; url=https://github.com/paddlepaddle/book" />
<meta http-equiv="refresh" content="0; url=https://github.com/paddlepaddle/book" />
</head>
<body>
<a href="https://github.com/paddlepaddle/book">Please access github home page</a>
<a href="https://github.com/paddlepaddle/book">Please access github home page</a>
</body>
此差异已折叠。
此差异已折叠。
此差异已折叠。
......@@ -20,7 +20,7 @@ from optparse import OptionParser
def read_labels(props_file):
'''
a sentence maybe has more than one verb, each verb has its label sequence
label[], is a 3-dimension list.
label[], is a 3-dimension list.
the first dim is to store all sentence's label seqs, len is the sentence number
the second dim is to store all label sequences for one sentences
the third dim is to store each label for one word
......
label_semantic_roles/image/bio_example.png

379.1 KB | W: | H:

label_semantic_roles/image/bio_example.png

31.7 KB | W: | H:

label_semantic_roles/image/bio_example.png
label_semantic_roles/image/bio_example.png
label_semantic_roles/image/bio_example.png
label_semantic_roles/image/bio_example.png
  • 2-up
  • Swipe
  • Onion skin
label_semantic_roles/image/dependency_parsing.png

186.0 KB | W: | H:

label_semantic_roles/image/dependency_parsing.png

58.2 KB | W: | H:

label_semantic_roles/image/dependency_parsing.png
label_semantic_roles/image/dependency_parsing.png
label_semantic_roles/image/dependency_parsing.png
label_semantic_roles/image/dependency_parsing.png
  • 2-up
  • Swipe
  • Onion skin
此差异已折叠。
此差异已折叠。
......@@ -19,7 +19,7 @@ function get_best_pass() {
cat $1 | grep -Pzo 'Test .*\n.*pass-.*' | \
sed -r 'N;s/Test.* cost=([0-9]+\.[0-9]+).*\n.*pass-([0-9]+)/\1 \2/g' | \
sort -n | head -n 1
}
}
log=train.log
LOG=`get_best_pass $log`
......@@ -28,11 +28,11 @@ best_model_path="output/pass-${LOG[1]}"
config_file=db_lstm.py
dict_file=./data/wordDict.txt
label_file=./data/targetDict.txt
label_file=./data/targetDict.txt
predicate_dict_file=./data/verbDict.txt
input_file=./data/feature
output_file=predict.res
python predict.py \
-c $config_file \
-w $best_model_path \
......
......@@ -155,11 +155,11 @@ def main():
parameters=parameters,
update_equation=optimizer)
reader = paddle.reader.batched(
reader = paddle.batch(
paddle.reader.shuffle(
conll05.test(), buf_size=8192), batch_size=10)
reader_dict = {
feeding = {
'word_data': 0,
'ctx_n2_data': 1,
'ctx_n1_data': 2,
......@@ -181,7 +181,7 @@ def main():
reader=reader,
event_handler=event_handler,
num_passes=10000,
reader_dict=reader_dict)
feeding=feeding)
if __name__ == '__main__':
......
......@@ -9,19 +9,19 @@ Machine translation (MT) leverages computers to translate from one language to a
Early machine translation systems are mainly rule-based i.e. they rely on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one languge. So it is quite a challenge for language experts to specify all possible rules in two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty in obtaining a complete rule set \[[1](#References)\]
To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example:
To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example:
1. human designed features cannot cover all possible linguistic variations;
1. human designed features cannot cover all possible linguistic variations;
2. it is difficult to use global features;
2. it is difficult to use global features;
3. the techniques heavily rely on pre-processing techniques like word alignment, word segmentation and tokenization, rule-extraction and syntactic parsing etc. The error introduced in any of these steps could accumulate and impact translation quality.
The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
2. techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT).
......@@ -57,7 +57,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent
We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter.
Compared to a simple RNN, the LSTM added memory cell, input gate, forget gate and output gate. These gates combined with the memory cell greatly improve the ability to handle long-term dependencies.
GRU\[[2](#References)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
GRU\[[2](#References)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
A GRU unit has only two gates:
- reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output.
- update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1.
......@@ -96,20 +96,20 @@ There are three steps for encoding a sentence:
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\epsilon R^{\left | V \right |},i=1,2,...,T$ where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
* the dimensionality of the vector is typically large, leading to the curse of dimensionality;
* the dimensionality of the vector is typically large, leading to the curse of dimensionality;
* it is hard to capture the relationships between words, i.e., semantic similarities. Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, with $C\epsilon R^{K\times \left | V \right |}$ as the projection matrix and $K$ is the dimensionality of the word embedding vector.
3. Encoding of the source sequence via RNN: This can be described mathematically as:
$$h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$$
where
$h_0$ is a zero vector,
$\varnothing _\theta$ is a non-linear activation function, and
$\mathbf{h}=\left \{ h_1,..., h_T \right \}$
where
$h_0$ is a zero vector,
$\varnothing _\theta$ is a non-linear activation function, and
$\mathbf{h}=\left \{ h_1,..., h_T \right \}$
is the sequential encoding of the first $T$ words from the source sequence. The vector representation of the whole sentence can be represented as the encoding vector at the last time step $T$ from $\mathbf{h}$, or by temporal pooling over $\mathbf{h}$.
......@@ -142,8 +142,8 @@ The generation process of machine translation is to translate the source sentenc
### Attention Mechanism
There are a few problems with the fixed dimensional vector representation from the encoding stage:
* It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence.
There are a few problems with the fixed dimensional vector representation from the encoding stage:
* It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence.
* Intuitively, when translating a sentence, we typically pay more attention to the parts in the source sentence more relevant to the current translation. Moreover, the focus changes along the process of the translation. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention. This is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced attention mechanism, which can decode based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. Decoder with attention will be explained in the following.
Different from the simple decoder, $z_i$ is computed as:
......@@ -172,7 +172,7 @@ Figure 6. Decoder with Attention Mechanism
[Beam Search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. It is typically used when the solution space is huge (e.g., for machine translation, speech recognition), and there is not enough memory for all the possible solutions. For example, if we want to translate “`<s>你好<e>`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, where the word `hello` can appear different number of times. Beam search could be used to find a good translation among them.
Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed.
Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed.
The goal is to maximize the probability of the generated sequence when using beam search in decoding, The procedure is as follows:
......@@ -452,7 +452,7 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary
word_vector_dim = 512 # dimensionality of word vector
encoder_size = 512 # dimensionality of the hidden state of encoder GRU
encoder_size = 512 # dimensionality of the hidden state of encoder GRU
decoder_size = 512 # dimentionality of the hidden state of decoder GRU
if is_generating:
......
此差异已折叠。
import paddle.v2 as paddle
def seqToseq_net(source_dict_dim, target_dict_dim):
### Network Architecture
word_vector_dim = 512 # dimension of word vector
decoder_size = 512 # dimension of hidden unit in GRU Decoder network
encoder_size = 512 # dimension of hidden unit in GRU Encoder network
#### Encoder
src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
src_embedding = paddle.layer.embedding(
input=src_word_id,
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
src_forward = paddle.networks.simple_gru(
input=src_embedding, size=encoder_size)
src_backward = paddle.networks.simple_gru(
input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
#### Decoder
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
backward_first = paddle.layer.first_seq(input=src_backward)
with paddle.layer.mixed(
size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
decoder_boot += paddle.layer.full_matrix_projection(
input=backward_first)
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = paddle.networks.simple_attention(
encoded_sequence=enc_vec,
encoded_proj=enc_proj,
decoder_state=decoder_mem)
with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
input=current_word)
gru_step = paddle.layer.gru_step(
name='gru_decoder',
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
with paddle.layer.mixed(
size=target_dict_dim,
bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
decoder_group_name = "decoder_group"
group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2]
trg_embedding = paddle.layer.embedding(
input=paddle.layer.data(
name='target_language_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding)
# For decoder equipped with attention mechanism, in training,
# target embeding (the groudtruth) is the data input,
# while encoded source sequence is accessed to as an unbounded memory.
# Here, the StaticInput defines a read-only memory
# for the recurrent_group.
decoder = paddle.layer.recurrent_group(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs)
lbl = paddle.layer.data(
name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
return cost
def main():
paddle.init(use_gpu=False, trainer_count=1)
# source and target dict dim.
dict_size = 30000
source_dict_dim = target_dict_dim = dict_size
# define network topology
cost = seqToseq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
# define optimize method and trainer
optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
# define data reader
feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
# define event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
# start to train
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=10000,
feeding=feeding)
if __name__ == '__main__':
main()
......@@ -32,17 +32,17 @@ rm dev+test.tgz
# separate the dev and test dataset
mkdir test gen
mv dev/ntst1213.* test
mv dev/ntst14.* gen
mv dev/ntst14.* gen
rm -rf dev
set +x
# rename the suffix, .fr->.src, .en->.trg
for dir in train test gen
do
do
filelist=`ls $dir`
cd $dir
for file in $filelist
do
do
if [ ${file##*.} = "fr" ]; then
mv $file ${file/%fr/src}
elif [ ${file##*.} = 'en' ]; then
......
......@@ -31,7 +31,7 @@ else
print $3;
read_pos += (2 + res_num);
}}' res_num=$beam_size $gen_file >$top1
fi
fi
# evalute bleu value
bleu_script=multi-bleu.perl
......
......@@ -50,19 +50,19 @@ Machine translation (MT) leverages computers to translate from one language to a
Early machine translation systems are mainly rule-based i.e. they rely on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one languge. So it is quite a challenge for language experts to specify all possible rules in two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty in obtaining a complete rule set \[[1](#References)\]。
To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example:
To address the aforementioned problems, statistical machine translation techniques have been developed. These techniques learn the translation rules from a large corpus, instead of being designed by a language expert. While these techniques overcome the bottleneck of knowledge acquisition, there are still quite a lot of challenges, for example:
1. human designed features cannot cover all possible linguistic variations;
1. human designed features cannot cover all possible linguistic variations;
2. it is difficult to use global features;
2. it is difficult to use global features;
3. the techniques heavily rely on pre-processing techniques like word alignment, word segmentation and tokenization, rule-extraction and syntactic parsing etc. The error introduced in any of these steps could accumulate and impact translation quality.
The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
The recent development of deep learning provides new solutions to these challenges. The two main categories for deep learning based machine translation techniques are:
1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
1. techniques based on the statistical machine translation system but with some key components improved with neural networks, e.g., language model, reordering model (please refer to the left part of Figure 1);
2. techniques mapping from source language to target language directly using a neural network, or end-to-end neural machine translation (NMT).
......@@ -98,7 +98,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent
We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter.
Compared to a simple RNN, the LSTM added memory cell, input gate, forget gate and output gate. These gates combined with the memory cell greatly improve the ability to handle long-term dependencies.
GRU\[[2](#References)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
GRU\[[2](#References)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
A GRU unit has only two gates:
- reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output.
- update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1.
......@@ -137,20 +137,20 @@ There are three steps for encoding a sentence:
1. One-hot vector representation of a word: Each word $x_i$ in the source sentence $x=\left \{ x_1,x_2,...,x_T \right \}$ is represented as a vector $w_i\epsilon R^{\left | V \right |},i=1,2,...,T$ where $w_i$ has the same dimensionality as the size of the dictionary, i.e., $\left | V \right |$, and has an element of one at the location corresponding to the location of the word in the dictionary and zero elsewhere.
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
2. Word embedding as a representation in the low-dimensional semantic space: There are two problems with one-hot vector representation
* the dimensionality of the vector is typically large, leading to the curse of dimensionality;
* the dimensionality of the vector is typically large, leading to the curse of dimensionality;
* it is hard to capture the relationships between words, i.e., semantic similarities. Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, with $C\epsilon R^{K\times \left | V \right |}$ as the projection matrix and $K$ is the dimensionality of the word embedding vector.
3. Encoding of the source sequence via RNN: This can be described mathematically as:
$$h_i=\varnothing _\theta \left ( h_{i-1}, s_i \right )$$
where
$h_0$ is a zero vector,
$\varnothing _\theta$ is a non-linear activation function, and
$\mathbf{h}=\left \{ h_1,..., h_T \right \}$
where
$h_0$ is a zero vector,
$\varnothing _\theta$ is a non-linear activation function, and
$\mathbf{h}=\left \{ h_1,..., h_T \right \}$
is the sequential encoding of the first $T$ words from the source sequence. The vector representation of the whole sentence can be represented as the encoding vector at the last time step $T$ from $\mathbf{h}$, or by temporal pooling over $\mathbf{h}$.
......@@ -183,8 +183,8 @@ The generation process of machine translation is to translate the source sentenc
### Attention Mechanism
There are a few problems with the fixed dimensional vector representation from the encoding stage:
* It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence.
There are a few problems with the fixed dimensional vector representation from the encoding stage:
* It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence.
* Intuitively, when translating a sentence, we typically pay more attention to the parts in the source sentence more relevant to the current translation. Moreover, the focus changes along the process of the translation. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention. This is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced attention mechanism, which can decode based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. Decoder with attention will be explained in the following.
Different from the simple decoder, $z_i$ is computed as:
......@@ -213,7 +213,7 @@ Figure 6. Decoder with Attention Mechanism
[Beam Search](http://en.wikipedia.org/wiki/Beam_search) is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. It is typically used when the solution space is huge (e.g., for machine translation, speech recognition), and there is not enough memory for all the possible solutions. For example, if we want to translate “`<s>你好<e>`” into English, even if there are only three words in the dictionary (`<s>`, `<e>`, `hello`), it is still possible to generate an infinite number of sentences, where the word `hello` can appear different number of times. Beam search could be used to find a good translation among them.
Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed.
Beam search builds a search tree using breadth first search and sorts the nodes according to a heuristic cost (sum of the log probability of the generated words) at each level of the tree. Only a fixed number of nodes according to the pre-specified beam size (or beam width) are considered. Thus, only nodes with highest scores are expanded in the next level. This reduces the space and time requirements significantly. However, a globally optimal solution is not guaranteed.
The goal is to maximize the probability of the generated sequence when using beam search in decoding, The procedure is as follows:
......@@ -493,7 +493,7 @@ This tutorial will use the default SGD and Adam learning algorithm, with a learn
source_dict_dim = len(open(src_lang_dict, "r").readlines()) # size of the source language dictionary
target_dict_dim = len(open(trg_lang_dict, "r").readlines()) # size of target language dictionary
word_vector_dim = 512 # dimensionality of word vector
encoder_size = 512 # dimensionality of the hidden state of encoder GRU
encoder_size = 512 # dimensionality of the hidden state of encoder GRU
decoder_size = 512 # dimentionality of the hidden state of decoder GRU
if is_generating:
......@@ -782,6 +782,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
......@@ -134,7 +134,7 @@ GRU\[[2](#参考文献)\]是Cho等人在LSTM上提出的简化版本,也是RNN
机器翻译任务的训练过程中,解码阶段的目标是最大化下一个正确的目标语言词的概率。思路是:
1. 每一个时刻,根据源语言句子的编码信息(又叫上下文向量,context vector)$c$、真实目标语言序列的第$i$个词$u_i$和$i$时刻RNN的隐层状态$z_i$,计算出下一个隐层状态$z_{i+1}$。计算公式如下:
$$z_{i+1}=\phi _{\theta '}\left ( c,u_i,z_i \right )$$
其中$\phi _{\theta '}$是一个非线性激活函数;$c=q\mathbf{h}$是源语言句子的上下文向量,在不使用[注意力机制](#注意力机制)时,如果[编码器](#编码器)的输出是源语言句子编码后的最后一个元素,则可以定义$c=h_T$;$u_i$是目标语言序列的第$i$个单词,$u_0$是目标语言序列的开始标记`<s>`,表示解码开始;$z_i$是$i$时刻解码RNN的隐层状态,$z_0$是一个全零的向量。
......@@ -257,7 +257,7 @@ user_dataset
│   ├── gen_file1.trg
│   └── ...
```
- 一级目录`user_dataset`:用户自定义的数据集名字。
- 二级目录`train`、`test`和`gen`:必须使用这三个文件夹名字。
- 三级目录:存放源语言到目标语言的平行语料库文件,后缀名必须使用`.src`和`.trg`。
......@@ -323,11 +323,11 @@ pre-wmt14
2. 其次,使用初始化函数`hook`,分别定义了训练模式和生成模式下的数据输入格式(`input_types`)。
- 训练模式:有三个输入序列,其中“源语言序列”和“目标语言序列”作为输入数据,“目标语言的下一个词序列”作为标签数据。
- 生成模式:有两个输入序列,其中“源语言序列”作为输入数据,“源语言序列编号”作为输入数据的编号(该输入非必须,可以省略)。
`hook`函数中的`src_dict_path`是源语言字典路径,`trg_dict_path`是目标语言字典路径,`is_generating`(训练或生成模式)是从模型配置中传入的对象。`hook`函数的具体调用方式请见[模型配置说明](#模型配置说明)。
```python
def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list,
def hook(settings, src_dict_path, trg_dict_path, is_generating, file_list,
**kwargs):
# job_mode = 1: 训练模式;0: 生成模式
settings.job_mode = not is_generating
......@@ -483,7 +483,7 @@ settings(
param_attr=ParamAttr(name='_source_language_embedding'))
```
2.3 用双向GRU编码源语言序列,拼接两个GRU的编码结果得到$\mathbf{h}$。
```python
src_forward = simple_gru(input=src_embedding, size=encoder_size)
src_backward = simple_gru(
......@@ -494,7 +494,7 @@ settings(
3. 接着,定义基于注意力机制的解码器框架。分为三步:
3.1 对源语言序列编码后的结果(见2.3),过一个前馈神经网络(Feed Forward Neural Network),得到其映射。
```python
with mixed_layer(size=decoder_size) as encoded_proj:
encoded_proj += full_matrix_projection(input=encoded_vector)
......@@ -514,8 +514,8 @@ settings(
- context通过调用`simple_attention`函数,实现公式$c_i=\sum {j=1}^{T}a_{ij}h_j$。其中,enc_vec是$h_j$,enc_proj是$h_j$的映射(见3.1),权重$a_{ij}$的计算已经封装在`simple_attention`函数中。
- decoder_inputs融合了$c_i$和当前目标词current_word(即$u_i$)的表示。
- gru_step通过调用`gru_step_layer`函数,在decoder_inputs和decoder_mem上做了激活操作,即实现公式$z_{i+1}=\phi _{\theta '}\left ( c_i,u_i,z_i \right )$。
- 最后,使用softmax归一化计算单词的概率,将out结果返回,即实现公式$p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$。
- 最后,使用softmax归一化计算单词的概率,将out结果返回,即实现公式$p\left ( u_i|u_{&lt;i},\mathbf{x} \right )=softmax(W_sz_i+b_z)$。
```python
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
decoder_mem = memory(
......@@ -582,7 +582,7 @@ settings(
- 首先,在序列生成任务中,由于解码阶段的RNN总是引用上一时刻生成出的词的词向量,作为当前时刻的输入,因此,使用`GeneratedInput`来自动完成这一过程。具体说明可见[GeneratedInput文档](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/deep_model/rnn/recurrent_group_cn.md#输入)。
- 其次,使用`beam_search`函数循环调用`gru_decoder_with_attention`函数,生成出序列id。
- 最后,使用`seqtext_printer_evaluator`函数,根据目标字典`trg_lang_dict`,打印出完整的句子保存在`gen_trans_file`中。
```python
else:
trg_embedding = GeneratedInput(
......@@ -746,6 +746,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
......@@ -20,4 +20,4 @@ wget http://paddlepaddle.bj.bcebos.com/model_zoo/wmt14_model.tar.gz
# untar the model
tar -zxvf wmt14_model.tar.gz
rm wmt14_model.tar.gz
rm wmt14_model.tar.gz
......@@ -57,6 +57,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
此差异已折叠。
此差异已折叠。
recognize_digits/image/conv_layer.png

249.7 KB | W: | H:

recognize_digits/image/conv_layer.png

248.3 KB | W: | H:

recognize_digits/image/conv_layer.png
recognize_digits/image/conv_layer.png
recognize_digits/image/conv_layer.png
recognize_digits/image/conv_layer.png
  • 2-up
  • Swipe
  • Onion skin
此差异已折叠。
此差异已折叠。
此差异已折叠。
.idea
.ipynb_checkpoints
......@@ -72,7 +72,7 @@ Given the feature vectors of users and movies, we compute the relevance using co
<img src="image/rec_regression_network_en.png" width="90%" ><br/>
Figure 3. A hybrid recommendation model.
</p>
</p>
## Dataset
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
......@@ -57,6 +57,6 @@ marked.setOptions({
}
});
document.getElementById("context").innerHTML = marked(
document.getElementById("markdown").innerHTML)
document.getElementById("markdown").innerHTML)
</script>
</body>
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册