提交 abd859c4 编写于 作者: L liaogang

auto format and fix conflict

- repo: https://github.com/Lucas-C/pre-commit-hooks.git
sha: c25201a00e6b0514370501050cf2a8538ac12270
hooks:
- id: remove-crlf
- repo: https://github.com/reyoung/mirrors-yapf.git
sha: v0.13.2
hooks:
- id: yapf
files: (.*\.(py|bzl)|BUILD|.*\.BUILD|WORKSPACE)$ # Bazel BUILD files follow Python syntax.
- repo: https://github.com/pre-commit/pre-commit-hooks
sha: 7539d8bd1a00a3c1bfd34cdb606d3a6372e83469
sha: v0.7.1
hooks:
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: git://github.com/Lucas-C/pre-commit-hooks
sha: v1.0.1
hooks:
- id: forbid-crlf
- id: remove-crlf
- id: forbid-tabs
- id: remove-tabs
此差异已折叠。
......@@ -45,14 +45,25 @@ $$MSE=\frac{1}{n}\sum_{i=1}^{n}{(\hat{Y_i}-Y_i)}^2$$
3. 根据损失函数进行反向误差传播 ([backpropagation](https://en.wikipedia.org/wiki/Backpropagation)),将网络误差从输出层依次向前传递, 并更新网络中的参数。
4. 重复2~3步骤,直至网络训练误差达到规定的程度或训练轮次达到设定值。
## 数据集
## 数据准备
执行以下命令来准备数据:
```bash
cd data && python prepare_data.py
### 数据集接口的封装
首先加载需要的包
```python
import paddle.v2 as paddle
import paddle.v2.dataset.uci_housing as uci_housing
```
这段代码将从[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)下载数据并进行[预处理](#数据预处理),最后数据将被分为训练集和测试集。
我们通过uci_housing模块引入了数据集合[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)
其中,在uci_housing模块中封装了:
1. 数据下载的过程。下载数据保存在~/.cache/paddle/dataset/uci_housing/housing.data。
2. [数据预处理](#数据预处理)的过程。
### 数据集介绍
这份数据集共506行,每行包含了波士顿郊区的一类房屋的相关信息及该类房屋价格的中位数。其各维属性的意义如下:
| 属性名 | 解释 | 类型 |
......@@ -90,89 +101,89 @@ cd data && python prepare_data.py
</p>
#### 整理训练集与测试集
我们将数据集分割为两份:一份用于调整模型的参数,即进行模型的训练,模型在这份数据集上的误差被称为**训练误差**;另外一份被用来测试,模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据,所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素:更多的训练数据会降低参数估计的方差,从而得到更可信的模型;而更多的测试数据会降低测试误差的方差,从而得到更可信的测试误差。一种常见的分割比例为$8:2$,感兴趣的读者朋友们也可以尝试不同的设置来观察这两种误差的变化。
我们将数据集分割为两份:一份用于调整模型的参数,即进行模型的训练,模型在这份数据集上的误差被称为**训练误差**;另外一份被用来测试,模型在这份数据集上的误差被称为**测试误差**。我们训练模型的目的是为了通过从训练数据中找到规律来预测未知的新数据,所以测试误差是更能反映模型表现的指标。分割数据的比例要考虑到两个因素:更多的训练数据会降低参数估计的方差,从而得到更可信的模型;而更多的测试数据会降低测试误差的方差,从而得到更可信的测试误差。我们这个例子中设置的分割比例为$8:2$
执行如下命令可以分割数据集,并将训练集和测试集的地址分别写入train.list 和 test.list两个文件中,供PaddlePaddle读取。
```python
python prepare_data.py -r 0.8 #默认使用8:2的比例进行分割
```
在更复杂的模型训练过程中,我们往往还会多使用一种数据集:验证集。因为复杂的模型中常常还有一些超参数([Hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_optimization))需要调节,所以我们会尝试多种超参数的组合来分别训练多个模型,然后对比它们在验证集上的表现选择相对最好的一组超参数,最后才使用这组参数下训练的模型在测试集上评估测试误差。由于本章训练的模型比较简单,我们暂且忽略掉这个过程。
### 提供数据给PaddlePaddle
准备好数据之后,我们使用一个Python data provider来为PaddlePaddle的训练过程提供数据。一个 data provider 就是一个Python函数,它会被PaddlePaddle的训练过程调用。在这个例子里,只需要读取已经保存好的数据,然后一行一行地返回给PaddlePaddle的训练进程即可。
```python
from paddle.trainer.PyDataProvider2 import *
import numpy as np
#定义数据的类型和维度
@provider(input_types=[dense_vector(13), dense_vector(1)])
def process(settings, input_file):
data = np.load(input_file.strip())
for row in data:
yield row[:-1].tolist(), row[-1:].tolist()
## 训练
```
`fit_a_line/trainer.py`演示了训练的整体过程。
## 模型配置说明
### 初始化PaddlePaddle
### 数据定义
首先,通过 `define_py_data_sources2` 来配置PaddlePaddle从上面的`dataprovider.py`里读入训练数据和测试数据。 PaddlePaddle接受从命令行读入的配置信息,例如这里我们传入一个名为`is_predict`的变量来控制模型在训练和测试时的不同结构。
```python
from paddle.trainer_config_helpers import *
paddle.init(use_gpu=False, trainer_count=1)
```
is_predict = get_config_arg('is_predict', bool, False)
### 模型配置
define_py_data_sources2(
train_list='data/train.list',
test_list='data/test.list',
module='dataprovider',
obj='process')
线性回归的模型其实就是一个采用线性激活函数(linear activation,`LinearActivation`)的全连接层(fully-connected layer,`fc_layer`):
```python
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
y_predict = paddle.layer.fc(input=x,
size=1,
act=paddle.activation.Linear())
y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
cost = paddle.layer.regression_cost(input=y_predict, label=y)
```
### 创建参数
### 算法配置
接着,指定模型优化算法的细节。由于线性回归模型比较简单,我们只要设置基本的`batch_size`即可,它指定每次更新参数的时候使用多少条数据计算梯度信息。
```python
settings(batch_size=2)
parameters = paddle.parameters.create(cost)
```
### 网络结构
最后,使用`fc_layer``LinearActivation`来表示线性回归的模型本身。
### 创建Trainer
```python
#输入数据,13维的房屋信息
x = data_layer(name='x', size=13)
optimizer = paddle.optimizer.Momentum(momentum=0)
y_predict = fc_layer(
input=x,
param_attr=ParamAttr(name='w'),
size=1,
act=LinearActivation(),
bias_attr=ParamAttr(name='b'))
if not is_predict: #训练时,我们使用MSE,即regression_cost作为损失函数
y = data_layer(name='y', size=1)
cost = regression_cost(input=y_predict, label=y)
outputs(cost) #训练时输出MSE来监控损失的变化
else: #测试时,输出预测值
outputs(y_predict)
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
```
## 训练模型
在对应代码的根目录下执行PaddlePaddle的命令行训练程序。这里指定模型配置文件为`trainer_config.py`,训练30轮,结果保存在`output`路径下。
```bash
./train.sh
### 读取数据且打印训练的中间信息
PaddlePaddle提供一个
[reader机制](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader)
来读取数据。 Reader返回的数据可以包括多列,我们需要一个Python dict把列
序号映射到网络里的数据层。
```python
feeding={'x': 0, 'y': 1}
```
## 应用模型
现在来看下如何使用已经训练好的模型进行预测。
```bash
python predict.py
此外,我们还可以提供一个 event handler,来打印训练的进度:
```python
# event_handler to print training and testing info
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f" % (
event.pass_id, event.batch_id, event.cost)
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=paddle.batch(
uci_housing.test(), batch_size=2),
feeding=feeding)
print "Test %d, Cost %f" % (event.pass_id, result.cost)
```
这里默认使用`output/pass-00029`中保存的模型进行预测,并将数据中的房价与预测结果进行对比,结果保存在 `predictions.png`中。
如果你想使用别的模型或者其它的数据进行预测,只要传入新的路径即可:
```bash
python predict.py -m output/pass-00020 -t data/housing.test.npy
### 开始训练
```python
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
uci_housing.train(), buf_size=500),
batch_size=2),
feeding=feeding,
event_handler=event_handler,
num_passes=30)
```
## 总结
......
import paddle.v2 as paddle
import paddle.v2.dataset.uci_housing as uci_housing
def main():
# init
paddle.init(use_gpu=False, trainer_count=1)
# network config
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(13))
y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear())
y = paddle.layer.data(name='y', type=paddle.data_type.dense_vector(1))
cost = paddle.layer.regression_cost(input=y_predict, label=y)
# create parameters
parameters = paddle.parameters.create(cost)
# create optimizer
optimizer = paddle.optimizer.Momentum(momentum=0)
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
feeding = {'x': 0, 'y': 1}
# event_handler to print training and testing info
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f" % (
event.pass_id, event.batch_id, event.cost)
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=paddle.batch(
uci_housing.test(), batch_size=2),
feeding=feeding)
print "Test %d, Cost %f" % (event.pass_id, result.cost)
# training
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
uci_housing.train(), buf_size=500),
batch_size=2),
feeding=feeding,
event_handler=event_handler,
num_passes=30)
if __name__ == '__main__':
main()
此差异已折叠。
此差异已折叠。
......@@ -22,34 +22,20 @@ Standard SRL system mostly builds on top of Syntactic Analysis and contains five
<div align="center">
<img src="image/dependency_parsing.png" width = "80%" align=center /><br>
<img src="image/dependency_parsing_en.png" width = "80%" align=center /><br>
Fig 1. Syntactic parse tree
</div>
核心关系-> HED
定中关系-> ATT
主谓关系-> SBV
状中结构-> ADV
介宾关系-> POB
右附加关系-> RAD
动宾关系-> VOB
标点-> WP
However, complete syntactic analysis requires identifying the relation among all constitutes and the performance of SRL is sensitive to the precision of syntactic analysis, which makes SRL a very challenging task. To reduce the complexity and obtain some syntactic structure information, we often use shallow syntactic analysis. Shallow Syntactic Analysis is also called partial parsing or chunking. Unlike complete syntactic analysis which requires the construction of the complete parsing tree, Shallow Syntactic Analysis only need to identify some independent components with relatively simple structure, such as verb phrases (chunk). To avoid difficulties in constructing a syntactic tree with high accuracy, some work\[[1](#Reference)\] proposed semantic chunking based SRL methods, which convert SRL as a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using BIO representation. For syntactic chunks forming a chunk of type A, the first chunk receives the B-A tag (Begin), the remaining ones receive the tag I-A (Inside), and all chunks outside receive the tag O-A.
The BIO representation of above example is shown in Fig.1.
<div align="center">
<img src="image/bio_example.png" width = "90%" align=center /><br>
<img src="image/bio_example_en.png" width = "90%" align=center /><br>
Fig 2. BIO represention
</div>
输入序列-> input sequence
语块-> chunk
标注序列-> label sequence
角色-> role
This example illustrates the simplicity of sequence tagging because (1) shallow syntactic analysis reduces the precision requirement of syntactic analysis; (2) pruning candidate arguments is removed; 3) argument identification and tagging are finished at the same time. Such unified methods simplify the procedure, reduce the risk of accumulating errors and boost the performance further.
In this tutorial, our SRL system is built as an end-to-end system via a neural network. We take only text sequences, without using any syntactic parsing results or complex hand-designed features. We give public dataset [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) as an example to illustrate: given a sentence with predicates marked, identify the corresponding arguments and their semantic roles by sequence tagging method.
......@@ -71,13 +57,10 @@ The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map in
Fig.3 illustrate the final stacked recurrent neural networks.
<p align="center">
<img src="./image/stacked_lstm.png" width = "40%" align=center><br>
<img src="./image/stacked_lstm_en.png" width = "40%" align=center><br>
Fig 3. Stacked Recurrent Neural Networks
</p>
线性变换-> linear transformation
输入层到隐层-> input-to-hidden
### Bidirectional Recurrent Neural Network
LSTMs can summarize the history of previous inputs seen up to now, but can not see the future. In most of NLP (natural language processing) tasks, the entire sentences are ready to use. Therefore, sequential learning might be much efficient if the future can be encoded as well like histories.
......@@ -86,15 +69,10 @@ To address the above drawbacks, we can design bidirectional recurrent neural net
<p align="center">
<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
<img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br>
Fig 4. Bidirectional LSTMs
</p>
线性变换-> linear transformation
输入层到隐层-> input-to-hidden
正向处理输出序列->process sequence in the forward direction
反向处理上一层序列-> process sequence from the previous layer in backward direction
Note that, this bidirectional RNNs is different with the one proposed by Bengio et al. in machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce another bidirectional RNNs in the following tasks[machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md)
### Conditional Random Field
......@@ -156,18 +134,10 @@ After modification, the model is as follows:
<div align="center">
<img src="image/db_lstm_network.png" width = "60%" align=center /><br>
<img src="image/db_lstm_en.png" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks
</div>
论元-> argu
谓词-> pred
谓词上下文-> ctx-p
谓词上下文区域标记-> $m_r$
输入-> input
原句-> sentence
反向LSTM-> LSTM Reverse
## Data Preparation
In the tutorial, we use [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task open dataset as an example. It is important to note that the training set and development set of the CoNLL 2005 SRL task are not free to download after the competition. Currently, only the test set can be obtained, including 23 sections of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ corpus as the training dataset to explain the model. However, since the training set is small, if you want to train a usable neural network SRL system, consider paying for the full corpus.
......@@ -225,6 +195,8 @@ We trained in the English Wikipedia language model to get a word vector lookup t
Get dictionary, print dictionary size:
```python
import math
import numpy as np
import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05
......@@ -233,81 +205,81 @@ word_dict_len = len(word_dict)
label_dict_len = len(label_dict)
pred_len = len(verb_dict)
print len(word_dict_len)
print len(label_dict_len)
print len(pred_len)
print word_dict_len
print label_dict_len
print pred_len
```
## Model configuration
1. Define input data dimensions and model hyperparameters.
- 1. Define input data dimensions and model hyperparameters.
```python
mark_dict_len = 2 # Value range of region mark. Region mark is either 0 or 1, so range is 2
word_dim = 32 # word vector dimension
mark_dim = 5 # adjacent dimension
hidden_dim = 512 # the dimension of LSTM hidden layer vector is 128 (512/4)
depth = 8 # depth of stacked LSTM
# There are 9 features per sample, so we will define 9 data layers.
# They type for each layer is integer_value_sequence.
def d_type(value_range):
```python
mark_dict_len = 2 # Value range of region mark. Region mark is either 0 or 1, so range is 2
word_dim = 32 # word vector dimension
mark_dim = 5 # adjacent dimension
hidden_dim = 512 # the dimension of LSTM hidden layer vector is 128 (512/4)
depth = 8 # depth of stacked LSTM
# There are 9 features per sample, so we will define 9 data layers.
# They type for each layer is integer_value_sequence.
def d_type(value_range):
return paddle.data_type.integer_value_sequence(value_range)
# word sequence
word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
# predicate
predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
# word sequence
word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
# predicate
predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
# 5 features for predicate context
ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
ctx_p2 = paddle.layer.data(name='ctx_p2_data', type=d_type(word_dict_len))
# 5 features for predicate context
ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
ctx_p2 = paddle.layer.data(name='ctx_p2_data', type=d_type(word_dict_len))
# region marker sequence
mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
# region marker sequence
mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
# label sequence
target = paddle.layer.data(name='target', type=d_type(label_dict_len))
```
# label sequence
target = paddle.layer.data(name='target', type=d_type(label_dict_len))
```
Speciala note: hidden_dim = 512 means LSTM hidden vector of 128 dimension (512/4). Please refer PaddlePaddle official documentation for detail: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory)
Speciala note: hidden_dim = 512 means LSTM hidden vector of 128 dimension (512/4). Please refer PaddlePaddle official documentation for detail: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory)
2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences.
- 2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences.
```python
```python
# Since word vectorlookup table is pre-trained, we won't update it this time.
# is_static being True prevents updating the lookup table during training.
emb_para = paddle.attr.Param(name='emb', initial_std=0., is_static=True)
# hyperparameter configurations
default_std = 1 / math.sqrt(hidden_dim) / 3.0
std_default = paddle.attr.Param(initial_std=default_std)
std_0 = paddle.attr.Param(initial_std=0.)
# Since word vectorlookup table is pre-trained, we won't update it this time.
# is_static being True prevents updating the lookup table during training.
emb_para = paddle.attr.Param(name='emb', initial_std=0., is_static=True)
# hyperparameter configurations
default_std = 1 / math.sqrt(hidden_dim) / 3.0
std_default = paddle.attr.Param(initial_std=default_std)
std_0 = paddle.attr.Param(initial_std=0.)
predicate_embedding = paddle.layer.embedding(
predicate_embedding = paddle.layer.embedding(
size=word_dim,
input=predicate,
param_attr=paddle.attr.Param(
name='vemb', initial_std=default_std))
mark_embedding = paddle.layer.embedding(
mark_embedding = paddle.layer.embedding(
size=mark_dim, input=mark, param_attr=std_0)
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [
paddle.layer.embedding(
size=word_dim, input=x, param_attr=emb_para) for x in word_input
]
emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding)
```
]
emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding)
```
3. 8 LSTM units will be trained in "forward / backward" order.
- 3. 8 LSTM units will be trained in "forward / backward" order.
```python
hidden_0 = paddle.layer.mixed(
```python
hidden_0 = paddle.layer.mixed(
size=hidden_dim,
bias_attr=std_default,
input=[
......@@ -315,12 +287,12 @@ print len(pred_len)
input=emb, param_attr=std_default) for emb in emb_layers
])
mix_hidden_lr = 1e-3
lstm_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=1.0)
hidden_para_attr = paddle.attr.Param(
mix_hidden_lr = 1e-3
lstm_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=1.0)
hidden_para_attr = paddle.attr.Param(
initial_std=default_std, learning_rate=mix_hidden_lr)
lstm_0 = paddle.layer.lstmemory(
lstm_0 = paddle.layer.lstmemory(
input=hidden_0,
act=paddle.activation.Relu(),
gate_act=paddle.activation.Sigmoid(),
......@@ -328,10 +300,10 @@ print len(pred_len)
bias_attr=std_0,
param_attr=lstm_para_attr)
# stack L-LSTM and R-LSTM with direct edges
input_tmp = [hidden_0, lstm_0]
# stack L-LSTM and R-LSTM with direct edges
input_tmp = [hidden_0, lstm_0]
for i in range(1, depth):
for i in range(1, depth):
mix_hidden = paddle.layer.mixed(
size=hidden_dim,
bias_attr=std_default,
......@@ -352,9 +324,9 @@ print len(pred_len)
param_attr=lstm_para_attr)
input_tmp = [mix_hidden, lstm]
```
```
4. We will concatenate the output of top LSTM unit with it's input, and project into a hidden layer. Then put a fully connected layer on top of it to get the final vector representation.
- 4. We will concatenate the output of top LSTM unit with it's input, and project into a hidden layer. Then put a fully connected layer on top of it to get the final vector representation.
```python
feature_out = paddle.layer.mixed(
......@@ -368,10 +340,10 @@ print len(pred_len)
], )
```
5. We use CRF as cost function, the parameter of CRF cost will be named `crfw`.
- 5. We use CRF as cost function, the parameter of CRF cost will be named `crfw`.
```python
crf_cost = paddle.layer.crf(
```python
crf_cost = paddle.layer.crf(
size=label_dict_len,
input=feature_out,
label=target,
......@@ -379,18 +351,18 @@ print len(pred_len)
name='crfw',
initial_std=default_std,
learning_rate=mix_hidden_lr))
```
```
6. CRF decoding layer is used for evaluation and inference. It shares parameter with CRF layer. The sharing of parameters among multiple layers is specified by the same parameter name in these layers.
- 6. CRF decoding layer is used for evaluation and inference. It shares parameter with CRF layer. The sharing of parameters among multiple layers is specified by the same parameter name in these layers.
```python
crf_dec = paddle.layer.crf_decoding(
```python
crf_dec = paddle.layer.crf_decoding(
name='crf_dec_l',
size=label_dict_len,
input=feature_out,
label=target,
param_attr=paddle.attr.Param(name='crfw'))
```
```
## Train model
......@@ -440,15 +412,15 @@ trainer = paddle.trainer.SGD(cost=crf_cost,
As mentioned in data preparation section, we will use CoNLL 2005 test corpus as training data set. `conll05.test()` outputs one training instance at a time. It will be shuffled, and batched into mini batches as input.
```python
reader = paddle.reader.batched(
reader = paddle.batch(
paddle.reader.shuffle(
conll05.test(), buf_size=8192), batch_size=20)
```
`reader_dict` is used to specify relationship between data instance and layer layer. For example, according to following `reader_dict`, the 0th column of data instance produced by`conll05.test()` correspond to data layer named `word_data`.
`feeding` is used to specify relationship between data instance and layer layer. For example, according to following `feeding`, the 0th column of data instance produced by`conll05.test()` correspond to data layer named `word_data`.
```python
reader_dict = {
feeding = {
'word_data': 0,
'ctx_n2_data': 1,
'ctx_n1_data': 2,
......@@ -478,7 +450,7 @@ trainer.train(
reader=reader,
event_handler=event_handler,
num_passes=10000,
reader_dict=reader_dict)
feeding=feeding)
```
## Conclusion
......
此差异已折叠。
label_semantic_roles/image/bio_example.png

379.1 KB | W: | H:

label_semantic_roles/image/bio_example.png

31.7 KB | W: | H:

label_semantic_roles/image/bio_example.png
label_semantic_roles/image/bio_example.png
label_semantic_roles/image/bio_example.png
label_semantic_roles/image/bio_example.png
  • 2-up
  • Swipe
  • Onion skin
label_semantic_roles/image/dependency_parsing.png

186.0 KB | W: | H:

label_semantic_roles/image/dependency_parsing.png

58.2 KB | W: | H:

label_semantic_roles/image/dependency_parsing.png
label_semantic_roles/image/dependency_parsing.png
label_semantic_roles/image/dependency_parsing.png
label_semantic_roles/image/dependency_parsing.png
  • 2-up
  • Swipe
  • Onion skin
此差异已折叠。
此差异已折叠。
......@@ -155,11 +155,11 @@ def main():
parameters=parameters,
update_equation=optimizer)
reader = paddle.reader.batched(
reader = paddle.batch(
paddle.reader.shuffle(
conll05.test(), buf_size=8192), batch_size=10)
reader_dict = {
feeding = {
'word_data': 0,
'ctx_n2_data': 1,
'ctx_n1_data': 2,
......@@ -181,7 +181,7 @@ def main():
reader=reader,
event_handler=event_handler,
num_passes=10000,
reader_dict=reader_dict)
feeding=feeding)
if __name__ == '__main__':
......
此差异已折叠。
import paddle.v2 as paddle
def seqToseq_net(source_dict_dim, target_dict_dim):
### Network Architecture
word_vector_dim = 512 # dimension of word vector
decoder_size = 512 # dimension of hidden unit in GRU Decoder network
encoder_size = 512 # dimension of hidden unit in GRU Encoder network
#### Encoder
src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
src_embedding = paddle.layer.embedding(
input=src_word_id,
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
src_forward = paddle.networks.simple_gru(
input=src_embedding, size=encoder_size)
src_backward = paddle.networks.simple_gru(
input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
#### Decoder
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
backward_first = paddle.layer.first_seq(input=src_backward)
with paddle.layer.mixed(
size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
decoder_boot += paddle.layer.full_matrix_projection(
input=backward_first)
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = paddle.networks.simple_attention(
encoded_sequence=enc_vec,
encoded_proj=enc_proj,
decoder_state=decoder_mem)
with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
input=current_word)
gru_step = paddle.layer.gru_step(
name='gru_decoder',
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
with paddle.layer.mixed(
size=target_dict_dim,
bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
decoder_group_name = "decoder_group"
group_input1 = paddle.layer.StaticInputV2(input=encoded_vector, is_seq=True)
group_input2 = paddle.layer.StaticInputV2(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2]
trg_embedding = paddle.layer.embedding(
input=paddle.layer.data(
name='target_language_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding)
# For decoder equipped with attention mechanism, in training,
# target embeding (the groudtruth) is the data input,
# while encoded source sequence is accessed to as an unbounded memory.
# Here, the StaticInput defines a read-only memory
# for the recurrent_group.
decoder = paddle.layer.recurrent_group(
name=decoder_group_name,
step=gru_decoder_with_attention,
input=group_inputs)
lbl = paddle.layer.data(
name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
return cost
def main():
paddle.init(use_gpu=False, trainer_count=1)
# source and target dict dim.
dict_size = 30000
source_dict_dim = target_dict_dim = dict_size
# define network topology
cost = seqToseq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
# define optimize method and trainer
optimizer = paddle.optimizer.Adam(learning_rate=1e-4)
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
# define data reader
feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2
}
wmt14_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size=dict_size), buf_size=8192),
batch_size=5)
# define event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
# start to train
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
num_passes=10000,
feeding=feeding)
if __name__ == '__main__':
main()
......@@ -42,7 +42,7 @@ In such a classification problem, we usually use the cross entropy loss function
$$ crossentropy(label, y) = -\sum_i label_ilog(y_i) $$
Fig. 2 shows a softmax regression network, with weights in black, and bias in red. +1 indicates bias is 1.
Fig. 2 shows a softmax regression network, with weights in blue, and bias in red. +1 indicates bias is 1.
<p align="center">
<img src="image/softmax_regression_en.png" width=400><br/>
......@@ -57,7 +57,7 @@ The Softmax regression model described above uses the simplest two-layer neural
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, after output layer, we get $Y=softmax(W_3H_2 + b_3)$, the final classification result vector.
Fig. 3. is Multilayer Perceptron network, with weights in black, and bias in red. +1 indicates bias is 1.
Fig. 3. is Multilayer Perceptron network, with weights in blue, and bias in red. +1 indicates bias is 1.
<p align="center">
<img src="image/mlp_en.png" width=500><br/>
......@@ -196,32 +196,31 @@ def convolutional_neural_network(img):
PaddlePaddle provides a special layer `layer.data` for reading data. Let us create a data layer for reading images and connect it to a classification network created using one of above three functions. We also need a cost layer for training the model.
```python
def main():
paddle.init(use_gpu=False, trainer_count=1)
paddle.init(use_gpu=False, trainer_count=1)
images = paddle.layer.data(
images = paddle.layer.data(
name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
label = paddle.layer.data(
name='label', type=paddle.data_type.integer_value(10))
predict = softmax_regression(images)
#predict = multilayer_perceptron(images) # uncomment for MLP
#predict = convolutional_neural_network(images) # uncomment for LeNet5
predict = softmax_regression(images)
#predict = multilayer_perceptron(images) # uncomment for MLP
#predict = convolutional_neural_network(images) # uncomment for LeNet5
cost = paddle.layer.classification_cost(input=predict, label=label)
cost = paddle.layer.classification_cost(input=predict, label=label)
```
Now, it is time to specify training parameters. The number 0.9 in the following `Momentum` optimizer means that 90% of the current the momentum comes from the momentum of the previous iteration.
```python
parameters = paddle.parameters.create(cost)
parameters = paddle.parameters.create(cost)
optimizer = paddle.optimizer.Momentum(
optimizer = paddle.optimizer.Momentum(
learning_rate=0.1 / 128.0,
momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))
trainer = paddle.trainer.SGD(cost=cost,
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
```
......@@ -233,9 +232,9 @@ Here `shuffle` is a reader decorator, which takes a reader A as its parameter, a
`batch` is a special decorator, whose input is a reader and output is a *batch reader*, which doesn't yield an instance at a time, but a minibatch.
```python
lists = []
lists = []
def event_handler(event):
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
......@@ -248,7 +247,7 @@ Here `shuffle` is a reader decorator, which takes a reader A as its parameter, a
lists.append((event.pass_id, result.cost,
result.metrics['classification_error_evaluator']))
trainer.train(
trainer.train(
reader=paddle.reader.batched(
paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=8192),
......@@ -260,21 +259,21 @@ Here `shuffle` is a reader decorator, which takes a reader A as its parameter, a
During training, `trainer.train` invokes `event_handler` for certain events. This gives us a chance to print the training progress.
```
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
```
After the training, we can check the model's prediction accuracy.
```
# find the best pass
best = sorted(lists, key=lambda list: float(list[1]))[0]
print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
# find the best pass
best = sorted(lists, key=lambda list: float(list[1]))[0]
print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
```
Usually, with MNIST data, the softmax regression model can get accuracy around 92.34%, MLP can get about 97.66%, and convolution network can get up to around 99.20%. Convolution layers have been widely considered a great invention for image processsing.
......
......@@ -42,7 +42,7 @@ $$ y_i = softmax(\sum_j W_{i,j}x_j + b_i) $$
$$ crossentropy(label, y) = -\sum_i label_ilog(y_i) $$
图2为softmax回归的网络图,图中权重用黑线表示、偏置用红线表示、+1代表偏置参数的系数为1。
图2为softmax回归的网络图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。
<p align="center">
<img src="image/softmax_regression.png" width=400><br/>
......@@ -58,7 +58,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
3. 最后,再经过输出层,得到的$Y=softmax(W_3H_2 + b_3)$,即为最后的分类结果向量。
图3为多层感知器的网络结构图,图中权重用黑线表示、偏置用红线表示、+1代表偏置参数的系数为1。
图3为多层感知器的网络结构图,图中权重用蓝线表示、偏置用红线表示、+1代表偏置参数的系数为1。
<p align="center">
<img src="image/mlp.png" width=500><br/>
......@@ -67,16 +67,37 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
### 卷积神经网络(Convolutional Neural Network, CNN)
在多层感知器模型中,将图像展开成一维向量输入到网络中,忽略了图像的位置和结构信息,而卷积神经网络能够更好的利用图像的结构信息。[LeNet-5](http://yann.lecun.com/exdb/lenet/)是一个较简单的卷积神经网络。图6显示了其结构:输入的二维图像,先经过两次卷积层到池化层,再经过全连接层,最后使用softmax分类作为输出层。下面我们主要介绍卷积层和池化层。
<p align="center">
<img src="image/cnn.png"><br/>
图6. LeNet-5卷积神经网络结构<br/>
</p>
#### 卷积层
卷积层是卷积神经网络的核心基石。在图像识别里我们提到的卷积是二维卷积,即离散二维滤波器(也称作卷积核)与二维图像做卷积操作,简单的讲是二维滤波器滑动到二维图像上所有位置,并在每个位置上与该像素点及其领域像素点做内积。卷积操作被广泛应用与图像处理领域,不同卷积核可以提取不同的特征,例如边沿、线性、角等特征。在深层卷积神经网络中,通过卷积操作可以提取出图像低级到复杂的特征。
<p align="center">
<img src="image/conv_layer.png" width=500><br/>
<img src="image/conv_layer.png"><br/>
图4. 卷积层图片<br/>
</p>
卷积层是卷积神经网络的核心基石。该层的参数由一组可学习的过滤器(也叫作卷积核)组成。在前向过程中,每个卷积核在输入层进行横向和纵向的扫描,与输入层对应扫描位置进行卷积,得到的结果加上偏置并用相应的激活函数进行激活,结果能够得到一个二维的激活图(activation map)。每个特定的卷积核都能得到特定的激活图(activation map),如有的卷积核可能对识别边角,有的可能识别圆圈,那这些卷积核可能对于对应的特征响应要强。
图4给出一个卷积计算过程的示例图,输入图像大小为$H=5,W=5,D=3$,即$5 \times 5$大小的3通道(RGB,也称作深度)彩色图像。这个示例图中包含两(用$K$表示)组卷积核,即图中滤波器$W_0$和$W_1$。在卷积计算中,通常对不同的输入通道采用不同的卷积核,如图示例中每组卷积核包含($D=3)$个$3 \times 3$(用$F \times F$表示)大小的卷积核。另外,这个示例中卷积核在图像的水平方向($W$方向)和垂直方向($H$方向)的滑动步长为2(用$S$表示);对输入图像周围各填充1(用$P$表示)个0,即图中输入层原始数据为蓝色部分,灰色部分是进行了大小为1的扩展,用0来进行扩展。经过卷积操作得到输出为$3 \times 3 \times 2$(用$H_{o} \times W_{o} \times K$表示)大小的特征图,即$3 \times 3$大小的2通道特征图,其中$H_o$计算公式为:$H_o = (H - F + 2 \times P)/S + 1$,$W_o$同理。 而输出特征图中的每个像素,是每组滤波器与输入图像每个特征图的内积再求和,再加上偏置$b_o$,偏置通常对于每个输出特征图是共享的。例如图中输出特征图$o[:,:,0]$中的第一个$2$计算如下:
$$ o[0,0,0] = \sum x[0:3,0:3,0] * w_{0}[:,:,0]] + \sum x[0:3,0:3,1] * w_{0}[:,:,1]] + \sum x[0:3,0:3,2] * w_{0}[:,:,2]] + b_0 = 2 $$
$$ \sum x[0:3,0:3,0] * w_{0}[:,:,0]] = 0*1 + 0*1 + 0*1 + 0*1 + 1*1 + 2*(-1) + 0*(-1) + 0*1 + 0*(-1) = -1 $$
$$ \sum x[0:3,0:3,1] * w_{0}[:,:,1]] = 0*0 + 0*1 + 0*1 + 0*(-1) + 0*0 + 1*1 + 0*1 + 2*0 + 1*1 = 2 $$
$$ \sum x[0:3,0:3,2] * w_{0}[:,:,2]] = 0*(-1) + 0*1 + 0*(-1) + 0*0 + 1*1 + 1*0 + 0*(-1) + 1*0 + 1*(-1) = 0 $$
$$ b_0 = 1 $$
在卷积操作中卷积核是可学习的参数,经过上面示例介绍,每层卷积的参数大小为$D \times F \times F \times K$。在多层感知器模型中,神经元通常是全部连接,参数较多。而卷积层的参数较少,这也是由卷积层的主要特性即局部连接和共享权重所决定。
- 局部连接:每个神经元仅与输入神经元的一块区域连接,这块局部区域称作感受野(receptive field)。在图像卷积操作中,即神经元在空间维度(spatial dimension,即上图示例H和W所在的平面)是局部连接,但在深度上是全部连接。对于二维图像本身而言,也是局部像素关联较强。这种局部连接保证了学习后的过滤器能够对于局部的输入特征有最强的响应。局部连接的思想,也是受启发于生物学里面的视觉系统结构,视觉皮层的神经元就是局部接受信息的。
图4是卷积层的一个动态图。由于3D量难以表示,所有的3D量(输入的3D量(蓝色),权重3D量(红色),输出3D量(绿色))通过将深度在行上堆叠来表示。如图4,输入层是$W_1=5,H_1=5,D_1=3$,我们常见的彩色图片其实就是类似这样的输入层,彩色图片的宽和高对应这里的$W_1$和$H_1$,而彩色图片有RGB三个颜色通道,对应这里的$D_1$;卷积层的参数为$K=2,F=3,S=2,P=1$,这里的$K$是卷积核的数量,如图4中有$Filter W_0$和$Filter W_1$两个卷积核,$F$对应卷积核的大小,图中$W0$和$W1$在每一层深度上都是$3\times3$的矩阵,$S$对应卷积核扫描的步长,从动态图中可以看到,方框每次左移或下移2个单位,$P$对应Padding扩展,是对输入层的扩展,图中输入层,原始数据为蓝色部分,可以看到灰色部分是进行了大小为1的扩展,用0来进行扩展;图4的动态可视化对输出层结果(绿色)进行迭代,显示每个输出元素是通过将突出显示的输入(蓝色)与滤波器(红色)进行元素相乘,将其相加,然后通过偏置抵消结果来计算的。
- 权重共享:计算同一个深度切片的神经元时采用的滤波器是共享的。例如图4中计算$o[:,:,0]$的每个每个神经元的滤波器均相同,都为$W_0$,这样可以很大程度上减少参数。共享权重在一定程度上讲是有意义的,例如图片的底层边缘特征与特征在图中的具体位置无关。但是在一些场景中是无意的,比如输入的图片是人脸,眼睛和头发位于不同的位置,希望在不同的位置学到不同的特征 (参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/))。请注意权重只是对于同一深度切片的神经元是共享的,在卷积层,通常采用多组卷积核提取不同特征,即对应不同深度切片的特征,不同深度切片的神经元权重是不共享。另外,偏重对同一深度切片的所有神经元都是共享的。
通过介绍卷积计算过程及其特性,可以看出卷积是线性操作,并具有平移不变性(shift-invariant),平移不变性即在图像每个位置执行相同的操作。卷积层的局部连接和权重共享使得需要学习的参数大大减小,这样也有利于训练较大卷积神经网络。
#### 池化层
......@@ -87,19 +108,6 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
池化是非线性下采样的一种形式,主要作用是通过减少网络的参数来减小计算量,并且能够在一定程度上控制过拟合。通常在卷积层的后面会加上一个池化层。池化包括最大池化、平均池化等。其中最大池化是用不重叠的矩形框将输入层分成不同的区域,对于每个矩形框的数取最大值作为输出层,如图5所示。
#### LeNet-5网络
<p align="center">
<img src="image/cnn.png"><br/>
图6. LeNet-5卷积神经网络结构<br/>
</p>
[LeNet-5](http://yann.lecun.com/exdb/lenet/)是一个最简单的卷积神经网络。图6显示了其结构:输入的二维图像,先经过两次卷积层到池化层,再经过全连接层,最后使用softmax分类作为输出层。卷积的如下三个特性,决定了LeNet-5能比同样使用全连接层的多层感知器更好地识别图像:
- 神经元的三维特性: 卷积层的神经元在宽度、高度和深度上进行了组织排列。每一层的神经元仅仅与前一层的一块小区域连接,这块小区域被称为感受野(receptive field)。
- 局部连接:CNN通过在相邻层的神经元之间实施局部连接模式来利用空间局部相关性。这样的结构保证了学习后的过滤器能够对于局部的输入特征有最强的响应。堆叠许多这样的层导致非线性“过滤器”变得越来越“全局”。这允许网络首先创建输入的小部分的良好表示,然后从它们组合较大区域的表示。
- 共享权重:在CNN中,每个滤波器在整个视野中重复扫描。 这些复制单元共享相同的参数化(权重向量和偏差)并形成特征图。 这意味着给定卷积层中的所有神经元检测完全相同的特征。 以这种方式的复制单元允许不管它们在视野中的位置都能检测到特征,从而构成平移不变性的性质。
更详细的关于卷积神经网络的具体知识可以参考[斯坦福大学公开课]( http://cs231n.github.io/convolutional-networks/ )[图像分类](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md)教程。
### 常见激活函数介绍
......@@ -195,20 +203,19 @@ def convolutional_neural_network(img):
接着,通过`layer.data`调用来获取数据,然后调用分类器(这里我们提供了三个不同的分类器)得到分类结果。训练时,对该结果计算其损失函数,分类问题常常选择交叉熵损失函数。
```python
def main():
# 该模型运行在单个CPU上
paddle.init(use_gpu=False, trainer_count=1)
# 该模型运行在单个CPU上
paddle.init(use_gpu=False, trainer_count=1)
images = paddle.layer.data(
images = paddle.layer.data(
name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
label = paddle.layer.data(
name='label', type=paddle.data_type.integer_value(10))
predict = softmax_regression(images) # Softmax回归
#predict = multilayer_perceptron(images) #多层感知器
#predict = convolutional_neural_network(images) #LeNet5卷积神经网络
predict = softmax_regression(images) # Softmax回归
#predict = multilayer_perceptron(images) #多层感知器
#predict = convolutional_neural_network(images) #LeNet5卷积神经网络
cost = paddle.layer.classification_cost(input=predict, label=label)
cost = paddle.layer.classification_cost(input=predict, label=label)
```
然后,指定训练相关的参数。
......@@ -217,14 +224,14 @@ def main():
- 正则化(regularization): 是防止网络过拟合的一种手段,此处采用L2正则化。
```python
parameters = paddle.parameters.create(cost)
parameters = paddle.parameters.create(cost)
optimizer = paddle.optimizer.Momentum(
optimizer = paddle.optimizer.Momentum(
learning_rate=0.1 / 128.0,
momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))
trainer = paddle.trainer.SGD(cost=cost,
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
```
......@@ -236,9 +243,9 @@ def main():
`batch`是一个特殊的decorator,它的输入是一个reader,输出是一个batched reader —— 在PaddlePaddle里,一个reader每次yield一条训练数据,而一个batched reader每次yield一个minbatch。
```python
lists = []
lists = []
def event_handler(event):
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
......@@ -251,8 +258,8 @@ def main():
lists.append((event.pass_id, result.cost,
result.metrics['classification_error_evaluator']))
trainer.train(
reader=paddle.batch(
trainer.train(
reader=paddle.reader.batched(
paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=8192),
batch_size=128),
......@@ -263,12 +270,12 @@ def main():
训练过程是完全自动的,event_handler里打印的日志类似如下所示:
```
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
```
训练之后,检查模型的预测准确度。用 MNIST 训练的时候,一般 softmax回归模型的分类准确率为约为 92.34%,多层感知器为97.66%,卷积神经网络可以达到 99.20%。
......
recognize_digits/image/conv_layer.png

249.7 KB | W: | H:

recognize_digits/image/conv_layer.png

248.3 KB | W: | H:

recognize_digits/image/conv_layer.png
recognize_digits/image/conv_layer.png
recognize_digits/image/conv_layer.png
recognize_digits/image/conv_layer.png
  • 2-up
  • Swipe
  • Onion skin
......@@ -83,7 +83,7 @@ In such a classification problem, we usually use the cross entropy loss function
$$ crossentropy(label, y) = -\sum_i label_ilog(y_i) $$
Fig. 2 shows a softmax regression network, with weights in black, and bias in red. +1 indicates bias is 1.
Fig. 2 shows a softmax regression network, with weights in blue, and bias in red. +1 indicates bias is 1.
<p align="center">
<img src="image/softmax_regression_en.png" width=400><br/>
......@@ -98,7 +98,7 @@ The Softmax regression model described above uses the simplest two-layer neural
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, after output layer, we get $Y=softmax(W_3H_2 + b_3)$, the final classification result vector.
Fig. 3. is Multilayer Perceptron network, with weights in black, and bias in red. +1 indicates bias is 1.
Fig. 3. is Multilayer Perceptron network, with weights in blue, and bias in red. +1 indicates bias is 1.
<p align="center">
<img src="image/mlp_en.png" width=500><br/>
......@@ -156,15 +156,8 @@ For more information, please refer to [Activation functions on Wikipedia](https:
## Data Preparation
### Data Download
PaddlePaddle provides a Python module, `paddle.dataset.mnist`, which downloads and caches the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). The cache is under `/home/username/.cache/paddle/dataset/mnist`:
Execute the following command to download the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset and unzip. Add paths to the training set and the test set to train.list and test.list respectively for PaddlePaddle to read.
```bash
./data/get_mnist_data.sh
```
`gzip` downloaded data. The following files can be found in `data/raw_data`:
| File name | Description |
|----------------------|-------------------------|
......@@ -173,283 +166,159 @@ Execute the following command to download the [MNIST](http://yann.lecun.com/exdb
|t10k-images-idx3-ubyte | Evaluation images, 10,000 |
|t10k-labels-idx1-ubyte | Evaluation labels, 10,000 |
Users can randomly generate 10 images with the following script (Refer to Fig. 1.)
```bash
./load_data.py
```
### Provide Data to PaddlePaddle
We use python interface to provide data to system. `mnist_provider.py` shows a complete example for training on MNIST data.
```python
# Define a py data provider
@provider(
input_types={'pixel': dense_vector(28 * 28),
'label': integer_value(10)})
def process(settings, filename): # settings is not used currently.
# Open image file
with open( filename + "-images-idx3-ubyte", "rb") as f:
# Read first 4 parameters. magic is data format. n is number of data. rows and cols are number of rows and columns, respectively
magic, n, rows, cols = struct.upack(">IIII", f.read(16))
# With empty string as a unit, read data one by one
images = np.fromfile(
f, 'ubyte',
count=n * rows * cols).reshape(n, rows, cols).astype('float32')
# Normalize data of [0, 255] to [-1,1]
images = images / 255.0 * 2.0 - 1.0
# Open label file
with open( filename + "-labels-idx1-ubyte", "rb") as l:
# Read first two parameters
magic, n = struct.upack(">II", l.read(8))
# With empty string as a unit, read data one by one
labels = np.fromfile(l, 'ubyte', count=n).astype("int")
for i in xrange(n):
yield {"pixel": images[i, :], 'label': labels[i]}
```
## Model Configurations
### Data Definition
## Model Configuration
In the model configuration, use `define_py_data_sources2` to define reading of data from `dataprovider`. If this configuration is used for prediction, data definition is not necessary.
A PaddlePaddle program starts from importing the API package:
```python
if not is_predict:
data_dir = './data/'
define_py_data_sources2(
train_list=data_dir + 'train.list',
test_list=data_dir + 'test.list',
module='mnist_provider',
obj='process')
import paddle.v2 as paddle
```
### Algorithm Configuration
We want to use this program to demonstrate multiple kinds of models. Let define each of them as a Python function:
Set training related parameters.
- batch_size: use 128 samples in each training step.
- learning_rate: determines step taken in each iteration, it determines how fast the model converges.
- learning_method: use optimizer `MomentumOptimizer` for training. The parameter 0.9 indicates momentum keeps 0.9 of previous speed.
- regularization: A method to prevent overfitting. Here L2 regularization is used.
```python
settings(
batch_size=128,
learning_rate=0.1 / 128.0,
learning_method=MomentumOptimizer(0.9),
regularization=L2Regularization(0.0005 * 128))
```
### Model Architecture
#### Overview
First get reference labels from `data_layer`, and get classification results (predictions) from classifier. Here we provide three different classifiers. In training, we compute loss function, which is usually cross entropy for classification problem. In prediction, we can directly output the results (predictions).
``` python
data_size = 1 * 28 * 28
label_size = 10
img = data_layer(name='pixel', size=data_size)
predict = softmax_regression(img) # Softmax Regression
#predict = multilayer_perceptron(img) # Multilayer Perceptron
#predict = convolutional_neural_network(img) #LeNet5 Convolutional Neural Network
if not is_predict:
lbl = data_layer(name="label", size=label_size)
inputs(img, lbl)
outputs(classification_cost(input=predict, label=lbl))
else:
outputs(predict)
```
#### Softmax Regression
One simple fully connected layer with softmax activation function outputs classification result.
- softmax regression: the network has a fully-connection layer with softmax activation:
```python
def softmax_regression(img):
predict = fc_layer(input=img, size=10, act=SoftmaxActivation())
predict = paddle.layer.fc(input=img,
size=10,
act=paddle.activation.Softmax())
return predict
```
#### MultiLayer Perceptron
The following code implements a Multilayer Perceptron with two fully connected hidden layers and a ReLU activation function. The output layer has a Softmax activation function.
- multi-layer perceptron: this network has two hidden fully-connected layers, one with LeRU and the other with softmax activation:
```python
def multilayer_perceptron(img):
# First fully connected layer with ReLU
hidden1 = fc_layer(input=img, size=128, act=ReluActivation())
# Second fully connected layer with ReLU
hidden2 = fc_layer(input=hidden1, size=64, act=ReluActivation())
# Output layer as fully connected layer and softmax activation. The size must be 10.
predict = fc_layer(input=hidden2, size=10, act=SoftmaxActivation())
hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu())
hidden2 = paddle.layer.fc(input=hidden1,
size=64,
act=paddle.activation.Relu())
predict = paddle.layer.fc(input=hidden2,
size=10,
act=paddle.activation.Softmax())
return predict
```
#### Convolutional Neural Network LeNet-5
The following is the LeNet-5 network architecture. A 2D input image is first fed into two sets of convolutional layers and pooling layers, this result is then fed to a fully connected layer, and another fully connected layer with a softmax activation.
- convolution network LeNet-5: the input image is fed through two convolution-pooling layer, a fully-connected layer, and the softmax output layer:
```python
def convolutional_neural_network(img):
# First convolutional layer - pooling layer
conv_pool_1 = simple_img_conv_pool(
conv_pool_1 = paddle.networks.simple_img_conv_pool(
input=img,
filter_size=5,
num_filters=20,
num_channel=1,
pool_size=2,
pool_stride=2,
act=TanhActivation())
# Second convolutional layer - pooling layer
conv_pool_2 = simple_img_conv_pool(
act=paddle.activation.Tanh())
conv_pool_2 = paddle.networks.simple_img_conv_pool(
input=conv_pool_1,
filter_size=5,
num_filters=50,
num_channel=20,
pool_size=2,
pool_stride=2,
act=TanhActivation())
# Fully connected layer
fc1 = fc_layer(input=conv_pool_2, size=128, act=TanhActivation())
# Output layer as fully connected layer and softmax activation. The size must be 10.
predict = fc_layer(input=fc1, size=10, act=SoftmaxActivation())
return predict
```
## Training Model
### Training Commands and Logs
1.Configure `train.sh` to execute training:
act=paddle.activation.Tanh())
```bash
config=mnist_model.py # Select network in mnist_model.py
output=./softmax_mnist_model
log=softmax_train.log
fc1 = paddle.layer.fc(input=conv_pool_2,
size=128,
act=paddle.activation.Tanh())
paddle train \
--config=$config \ # Scripts for network configuration.
--dot_period=10 \ # After `dot_period` steps, print one `.`
--log_period=100 \ # Print a log every batchs
--test_all_data_in_one_period=1 \ # Whether to use all data in every test
--use_gpu=0 \ # Whether to use GPU
--trainer_count=1 \ # Number of CPU or GPU
--num_passes=100 \ # Passes for training (One pass uses all data.)
--save_dir=$output \ # Path to saved model
2>&1 | tee $log
python -m paddle.utils.plotcurve -i $log > plot.png
predict = paddle.layer.fc(input=fc1,
size=10,
act=paddle.activation.Softmax())
return predict
```
After configuring parameters, execute `./train.sh`. Training log is as follows.
PaddlePaddle provides a special layer `layer.data` for reading data. Let us create a data layer for reading images and connect it to a classification network created using one of above three functions. We also need a cost layer for training the model.
```
I0117 12:52:29.628617 4538 TrainerInternal.cpp:165] Batch=100 samples=12800 AvgCost=2.63996 CurrentCost=2.63996 Eval: classification_error_evaluator=0.241172 CurrentEval: classification_error_evaluator=0.241172
.........
I0117 12:52:29.768741 4538 TrainerInternal.cpp:165] Batch=200 samples=25600 AvgCost=1.74027 CurrentCost=0.840582 Eval: classification_error_evaluator=0.185234 CurrentEval: classification_error_evaluator=0.129297
.........
I0117 12:52:29.916970 4538 TrainerInternal.cpp:165] Batch=300 samples=38400 AvgCost=1.42119 CurrentCost=0.783026 Eval: classification_error_evaluator=0.167786 CurrentEval: classification_error_evaluator=0.132891
.........
I0117 12:52:30.061213 4538 TrainerInternal.cpp:165] Batch=400 samples=51200 AvgCost=1.23965 CurrentCost=0.695054 Eval: classification_error_evaluator=0.160039 CurrentEval: classification_error_evaluator=0.136797
......I0117 12:52:30.223270 4538 TrainerInternal.cpp:181] Pass=0 Batch=469 samples=60000 AvgCost=1.1628 Eval: classification_error_evaluator=0.156233
I0117 12:52:30.366894 4538 Tester.cpp:109] Test samples=10000 cost=0.50777 Eval: classification_error_evaluator=0.0978
```
2.Use `plot_cost.py` to plot error curve during training.
```python
paddle.init(use_gpu=False, trainer_count=1)
```bash
python plot_cost.py softmax_train.log
```
images = paddle.layer.data(
name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
name='label', type=paddle.data_type.integer_value(10))
3.Use `evaluate.py ` to select the best trained model.
predict = softmax_regression(images)
#predict = multilayer_perceptron(images) # uncomment for MLP
#predict = convolutional_neural_network(images) # uncomment for LeNet5
```bash
python evaluate.py softmax_train.log
cost = paddle.layer.classification_cost(input=predict, label=label)
```
### Training Results for Softmax Regression
Now, it is time to specify training parameters. The number 0.9 in the following `Momentum` optimizer means that 90% of the current the momentum comes from the momentum of the previous iteration.
<p align="center">
<img src="image/softmax_train_log_en.png" width="400px"><br/>
Fig. 7 Softmax regression error curve<br/>
</p>
```python
parameters = paddle.parameters.create(cost)
Evaluation results of the models:
optimizer = paddle.optimizer.Momentum(
learning_rate=0.1 / 128.0,
momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))
```text
Best pass is 00013, testing Avgcost is 0.484447
The classification accuracy is 90.01%
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
```
From the evaluation results, the best pass for softmax regression model is pass-00013, where the classification accuracy is 90.01%, and the last pass-00099 has an accuracy of 89.3%. From Fig. 7, we also see that the best accuracy may not appear in the last pass. This is because during training, the model may already arrive at a local optimum, and it just swings around nearby in the following passes, or it gets a lower local optimum.
Then we specify the training data `paddle.dataset.movielens.train()` and testing data `paddle.dataset.movielens.test()`. These two functions are *reader creators*, once called, returns a *reader*. A reader is a Python function, which, once called, returns a Python generator, which yields instances of data.
### Results of Multilayer Perceptron
Here `shuffle` is a reader decorator, which takes a reader A as its parameter, and returns a new reader B, where B calls A to read in `buffer_size` data instances everytime into a buffer, then shuffles and yield instances in the buffer. If you want very shuffled data, try use a larger buffer size.
<p align="center">
<img src="image/mlp_train_log_en.png" width="400px"><br/>
Fig. 8. Multilayer Perceptron error curve<br/>
</p>
`batch` is a special decorator, whose input is a reader and output is a *batch reader*, which doesn't yield an instance at a time, but a minibatch.
Evaluation results of the models:
```text
Best pass is 00085, testing Avgcost is 0.164746
The classification accuracy is 94.95%
```python
lists = []
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=paddle.reader.batched(
paddle.dataset.mnist.test(), batch_size=128))
print "Test with Pass %d, Cost %f, %s\n" % (
event.pass_id, result.cost, result.metrics)
lists.append((event.pass_id, result.cost,
result.metrics['classification_error_evaluator']))
trainer.train(
reader=paddle.reader.batched(
paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=8192),
batch_size=128),
event_handler=event_handler,
num_passes=100)
```
From the evaluation results, the final training accuracy is 94.95%. It is significantly better than the softmax regression model. This is because the softmax regression is simple, and it cannot fit complex data. The Multilayer Perceptron with hidden layers has better capacity to fit complex data than the softmax regression.
### Training results for Convolutional Neural Network
<p align="center">
<img src="image/cnn_train_log_en.png" width="400px"><br/>
Fig. 9. Convolutional Neural Network error curve<br/>
</p>
During training, `trainer.train` invokes `event_handler` for certain events. This gives us a chance to print the training progress.
Results of model evaluation:
```text
Best pass is 00076, testing Avgcost is 0.0244684
The classification accuracy is 99.20%
```
From the evaluation result, the best accuracy of Convolutional Neural Network is 99.20%. So for image classification, a Convolutional Neural Network has better recognition results than a fully connected network. This is related to the local connection and parameter sharing of convolutional layers. In Fig. 9, the Convolutional Neural Network achieves good results in early steps, which indicates that it converges faster.
## Application Model
### Prediction Commands and Results
Script `predict.py` can make prediction for trained models. For example, in softmax regression:
```bash
python predict.py -c mnist_model.py -d data/raw_data/ -m softmax_mnist_model/pass-00047
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
```
- -c sets model architecture
- -d sets data for prediction
- -m sets model parameters, here the best trained model is used for prediction
Follow the instructions to input image ID for prediction. The classifier can output probabilities for each digit, predictions with the highest probability, and ground truth label.
After the training, we can check the model's prediction accuracy.
```
Input image_id [0~9999]: 3
Predicted probability of each digit:
[[ 1.00000000e+00 1.60381094e-28 1.60381094e-28 1.60381094e-28
1.60381094e-28 1.60381094e-28 1.60381094e-28 1.60381094e-28
1.60381094e-28 1.60381094e-28]]
Predict Number: 0
Actual Number: 0
# find the best pass
best = sorted(lists, key=lambda list: float(list[1]))[0]
print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
```
From the result, this classifier recognizes the digit on the third image as digit 0 with near to 100% probability. This predicted result is consistent with the ground truth label.
Usually, with MNIST data, the softmax regression model can get accuracy around 92.34%, MLP can get about 97.66%, and convolution network can get up to around 99.20%. Convolution layers have been widely considered a great invention for image processsing.
## Conclusion
This tutorial describes a few basic Deep Learning models viz. Softmax regression, Multilayer Perceptron Network and Convolutional Neural Network. The subsequent tutorials will derive more sophisticated models from these. So it is crucial to understand these models for future learning. When our model evolved from a simple softmax regression to slightly complex Convolutional Neural Network, the recognition accuracy on the MNIST data set achieved large improvement in accuracy. This is due to the Convolutional layers' local connections and parameter sharing. While learning new models in the future, we encourage the readers to understand the key ideas that lead a new model to improve results of an old one. Moreover, this tutorial introduced the basic flow of PaddlePaddle model design, starting with a dataprovider, model layer construction, to final training and prediction. Readers can leverage the flow used in this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice.
......
此差异已折叠。
此差异已折叠。
.idea
.ipynb_checkpoints
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册