# 语义角色标注
本教程源代码目录在[book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/label_semantic_roles),初次使用请参考PaddlePaddle[安装教程](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html)。
## 背景介绍
自然语言分析技术大致分为三个层面:词法分析、句法分析和语义分析。语义角色标注是实现浅层语义分析的一种方式。在一个句子中,谓词是对主语的陈述或说明,指出“做什么”、“是什么”或“怎么样”,代表了一个事件的核心;跟谓词搭配的名词称为论元。语义角色是指论元在动词所指事件中担任的角色,主要有:施事者(Agent)、受事者(Patient)、客体(Theme)、经验者(Experiencer)、受益者(Beneficiary)、工具(Instrument)、处所(Location)、目标(Goal)和来源(Source)等。
请看下面的例子,“遇到” 是谓词(Predicate,通常简写为“Pred”),“小明”是施事者(Agent),“小红”是受事者(Patient),“昨天” 是事件发生的时间(Time),“公园”是事情发生的地点(Location)。
$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mbox{Time}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$
语义角色标注(Semantic Role Labeling,SRL)以句子的谓词为中心,不对句子所包含的语义信息进行深入分析,只分析句子中各成分与谓词之间的关系,即句子的谓词(Predicate)- 论元(Argument)结构,并用语义角色来描述这些结构关系,是许多自然语言理解任务(如信息抽取,篇章分析,深度问答等)的一个重要中间步骤。在研究中一般都假定谓词是给定的,所要做的就是找出给定谓词的各个论元和它们的语义角色。
传统的SRL系统大多建立在句法分析基础之上,通常包括5个流程:
1. 构建一棵句法分析树,例如,图1是对上面例子进行依存句法分析得到的一棵句法树。
2. 从句法树上识别出给定谓词的候选论元。
3. 候选论元剪除:一个句子中的候选论元可能很多,候选论元剪除就是从大量的候选项中剪除那些最不可能成为论元的候选项。
4. 论元识别:这个过程是从上一步剪除之后的候选中判断哪些是真正的论元,通常当做一个二分类问题来解决。
5. 对第4步的结果,通过多分类得到论元的语义角色标签。可以看到,句法分析是基础,并且后续步骤常常会构造一些人工特征,这些特征往往也来自句法分析。
<div align="center">
<img src="image/dependency_parsing.png" width = "80%" align=center /><br>
图1. 依存句法分析句法树示例
</div>
然而,完全句法分析需要确定句子所包含的全部句法信息,并确定句子各成分之间的关系,是一个非常困难的任务,目前技术下的句法分析准确率并不高,句法分析的细微错误都会导致SRL的错误。为了降低问题的复杂度,同时获得一定的句法结构信息,“浅层句法分析”的思想应运而生。浅层句法分析也称为部分句法分析(partial parsing)或语块划分(chunking)。和完全句法分析得到一棵完整的句法树不同,浅层句法分析只需要识别句子中某些结构相对简单的独立成分,例如:动词短语,这些被识别出来的结构称为语块。为了回避“无法获得准确率较高的句法树”所带来的困难,一些研究\[[1](#参考文献)\]也提出了基于语块(chunk)的SRL方法。基于语块的SRL方法将SRL作为一个序列标注问题来解决。序列标注任务一般都会采用BIO表示方式来定义序列标注的标签集,我们先来介绍这种表示方法。在BIO表示法中,B代表语块的开始,I代表语块的中间,O代表不属于任何语块。通过B、I、O三种标记将不同的语块赋予不同的标签,例如:对于一个角色为A的论元,将它所包含的第一个语块赋予标签B-A,将它所包含的其它语块赋予标签I-A,不属于任何论元的语块赋予标签O。
我们继续以上面的这句话为例,图2展示了BIO表示方法。
<div align="center">
<img src="image/bio_example.png" width = "90%" align=center /><br>
图2. BIO标注方法示例
</div>
从上面的例子可以看到,根据序列标注结果可以直接得到论元的语义角色标注结果,是一个相对简单的过程。这种简单性体现在:(1)依赖浅层句法分析,降低了句法分析的要求和难度;(2)没有了候选论元剪除这一步骤;(3)论元的识别和论元标注是同时实现的。这种一体化处理论元识别和论元标注的方法,简化了流程,降低了错误累积的风险,往往能够取得更好的结果。
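为了更直观地说明“根据序列标注结果可以直接得到论元的语义角色标注结果”,下面给出一段与PaddlePaddle无关的Python示意代码:给定词序列和对应的BIO标签序列,还原出每个论元的角色及其覆盖的词语片段。其中的角色名沿用背景介绍中的例子,仅作示意。

```python
# -*- coding: utf-8 -*-

def bio_to_spans(words, tags):
    """把BIO标签序列还原为 (角色, 词语片段) 列表的示意实现。"""
    spans, cur_role, cur_words = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):                              # 一个新语块开始
            if cur_role is not None:
                spans.append((cur_role, cur_words))
            cur_role, cur_words = tag[2:], [word]
        elif tag.startswith('I-') and cur_role == tag[2:]:    # 当前语块的延续
            cur_words.append(word)
        else:                                                 # O,或与当前角色不匹配的标签
            if cur_role is not None:
                spans.append((cur_role, cur_words))
            cur_role, cur_words = None, []
    if cur_role is not None:
        spans.append((cur_role, cur_words))
    return spans

words = ['小明', '昨天', '晚上', '在', '公园', '遇到', '了', '小红', '。']
tags = ['B-Agent', 'B-Time', 'B-Time', 'O', 'B-Location', 'B-V', 'O', 'B-Patient', 'O']
for role, chunk in bio_to_spans(words, tags):
    print role, ''.join(chunk)
```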
与基于语块的SRL方法类似,在本教程中我们也将SRL看作一个序列标注问题,不同的是,我们只依赖输入文本序列,不依赖任何额外的语法解析结果或是复杂的人造特征,利用深度神经网络构建一个端到端学习的SRL系统。我们以[CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/)中SRL任务的公开数据集为例,实践下面的任务:给定一句话和这句话里的一个谓词,通过序列标注的方式,从句子中找到谓词对应的论元,同时标注它们的语义角色。
## 模型概览
循环神经网络(Recurrent Neural Network)是一种对序列建模的重要模型,在自然语言处理任务中有着广泛的应用。不同于前馈神经网络(Feed-forward Neural Network),RNN能够处理输入之间前后关联的问题。LSTM是RNN的一种重要变种,常用来学习长序列中蕴含的长程依赖关系,我们在[情感分析](https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment)一篇中已经介绍过,这一篇中我们依然利用LSTM来解决SRL问题。
### 栈式循环神经网络(Stacked Recurrent Neural Network)
深层网络有助于形成层次化特征,网络上层在下层已经学习到的初级特征基础上,形成更复杂的高级特征。尽管LSTM沿时间轴展开后等价于一个非常“深”的前馈网络,但由于LSTM各个时间步参数共享,$t-1$时刻状态到$t$时刻的映射,始终只经过了一次非线性映射,也就是说单层LSTM对状态转移的建模是“浅”的。堆叠多个LSTM单元,令前一个LSTM单元$t$时刻的输出,成为下一个LSTM单元$t$时刻的输入,帮助我们构建起一个深层网络,我们把它称为第一个版本的栈式循环神经网络。深层网络提高了模型拟合复杂模式的能力,能够更好地建模跨不同时间步的模式\[[2](#参考文献)\]。
然而,训练一个深层LSTM网络并非易事。纵向堆叠多个LSTM单元可能遇到梯度在纵向深度上传播受阻的问题。通常,堆叠4层LSTM单元可以正常训练,当层数达到4~8层时,会出现性能衰减,这时必须考虑一些新的结构以保证梯度纵向顺畅传播,这是训练深层LSTM网络必须解决的问题。我们可以借鉴LSTM解决“梯度消失、梯度爆炸”问题的智慧之一:在记忆单元(Memory Cell)这条信息传播的路线上没有非线性映射,当梯度反向传播时既不会衰减、也不会爆炸。因此,深层LSTM模型也可以在纵向上添加一条保证梯度顺畅传播的路径。
一个LSTM单元完成的运算可以被分为三部分:(1)输入到隐层的映射(input-to-hidden):每个时间步输入信息$x$会首先经过一个矩阵映射,再作为遗忘门、输入门、记忆单元、输出门的输入,注意,这一次映射没有引入非线性激活;(2)隐层到隐层的映射(hidden-to-hidden):这一步是LSTM计算的主体,包括遗忘门、输入门、记忆单元更新、输出门的计算;(3)隐层到输出的映射(hidden-to-output):通常是简单的对隐层向量进行激活。我们在第一个版本的栈式网络的基础上,加入一条新的路径:除上一层LSTM输出之外,将前层LSTM的输入到隐层的映射作为一个新的输入,同时加入一个线性映射去学习一个新的变换。
图3是最终得到的栈式循环神经网络结构示意图。
<p align="center">
<img src="./image/stacked_lstm.png" width = "40%" align=center><br>
图3. 基于LSTM的栈式循环神经网络结构示意图
</p>
### 双向循环神经网络(Bidirectional Recurrent Neural Network)
在LSTM中,$t$时刻的隐藏层向量编码了到$t$时刻为止所有输入的信息,但$t$时刻的LSTM可以看到历史,却无法看到未来。在绝大多数自然语言处理任务中,我们几乎总是能拿到整个句子。这种情况下,如果能够像获取历史信息一样,得到未来的信息,对序列学习任务会有很大的帮助。
为了克服这一缺陷,我们可以设计一种双向循环网络单元,它的思想简单且直接:对上一节的栈式循环神经网络进行一个小小的修改,堆叠多个LSTM单元,让每一层LSTM单元分别以:正向、反向、正向 …… 的顺序学习上一层的输出序列。于是,从第2层开始,$t$时刻我们的LSTM单元便总是可以看到历史和未来的信息。图4是基于LSTM的双向循环神经网络结构示意图。
<p align="center">
<img src="./image/bidirectional_stacked_lstm.png" width = "60%" align=center><br>
图4. 基于LSTM的双向循环神经网络结构示意图
</p>
需要说明的是,这种双向RNN结构和Bengio等人在机器翻译任务中使用的双向RNN结构\[[3](#参考文献), [4](#参考文献)\] 并不相同,我们会在后续[机器翻译](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md)任务中,介绍另一种双向循环神经网络。
### 条件随机场 (Conditional Random Field)
使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务。在SRL任务中,深层LSTM网络学习输入的特征表示,条件随机场(Conditional Random Field,CRF)在特征的基础上完成序列标注,处于整个网络的末端。
CRF是一种概率化结构模型,可以看作是一个概率无向图模型,结点表示随机变量,边表示随机变量之间的概率依赖关系。简单来讲,CRF学习条件概率$P(Y|X)$,其中 $X = (x_1, x_2, ... , x_n)$ 是输入序列,$Y = (y_1, y_2, ... , y_n)$ 是标记序列;解码过程是给定 $X$序列求解令$P(Y|X)$最大的$Y$序列,即$Y^* = \mbox{arg max}_{Y} P(Y | X)$。
序列标注任务只需要考虑输入和输出都是一个线性序列,并且由于我们只是将输入序列作为条件,不做任何条件独立假设,因此输入序列的元素之间并不存在图结构。综上,在序列标注任务中使用的是如图5所示的定义在链式图上的CRF,称之为线性链条件随机场(Linear Chain Conditional Random Field)。
<p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
图5. 序列标注任务中使用的线性链条件随机场
</p>
根据线性链条件随机场上的因子分解定理\[[5](#参考文献)\],在给定观测序列$X$时,一个特定标记序列$Y$的概率可以定义为:
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
其中$Z(X)$是归一化因子,$t_j$是定义在边上的特征函数,依赖于当前和前一个位置,称为转移特征,表示对于输入序列$X$及其标注序列在$i$及$i-1$位置上标记的转移概率。$s_k$是定义在结点上的特征函数,称为状态特征,依赖于当前位置,表示对于观测序列$X$在$i$位置上的标记概率。$\lambda_j$和$\mu_k$分别是转移特征函数和状态特征函数对应的权值。实际上,$t$和$s$可以用相同的数学形式表示,再对转移特征和状态特征在各个位置$i$求和,有:$f_{k}(Y, X) = \sum_{i=1}^{n}f_k({y_{i - 1}, y_i, X, i})$,把$f$统称为特征函数,于是$P(Y|X)$可表示为:
$$p(Y|X, W) = \frac{1}{Z(X)}\text{exp}\sum_{k}\omega_{k}f_{k}(Y, X)$$
$\omega$是特征函数对应的权值,是CRF模型要学习的参数。训练时,对于给定的输入序列和对应的标记序列集合$D = \left[(X_1, Y_1), (X_2 , Y_2) , ... , (X_N, Y_N)\right]$ ,通过正则化的极大似然估计,求解如下优化目标:
$$L(\lambda, D) = - \text{log}\left(\prod_{m=1}^{N}p(Y_m|X_m, W)\right) + C \frac{1}{2}\lVert W\rVert^{2}$$
这个优化目标可以通过反向传播算法和整个神经网络一起求解。解码时,对于给定的输入序列$X$,通过解码算法(通常有:维特比算法、Beam Search)求出令条件概率$\bar{P}(Y|X)$最大的输出序列 $\bar{Y}$。
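以维特比算法为例,下面给出一段基于numpy、与具体框架无关的示意代码,演示在已知每个位置上各标签的发射分数和标签间转移分数(emission、transition这两个名字是这里为说明而假设的)时,如何求出分数最高的标签序列。在本教程的模型中,这一步由PaddlePaddle的CRF解码层完成,并不需要手工实现。

```python
import numpy as np

def viterbi_decode(emission, transition):
    """emission: [序列长度 x 标签数] 的发射分数;transition: [标签数 x 标签数] 的转移分数。
    返回分数最高的标签下标序列(示意实现,省略了起始、终止位置的特殊处理)。"""
    seq_len, num_tags = emission.shape
    score = emission[0]           # 第0个位置上各标签的累计分数
    backpointers = []
    for t in range(1, seq_len):
        # all_scores[i][j]:前一位置取标签i、当前位置取标签j的累计分数
        all_scores = score[:, None] + transition + emission[t][None, :]
        backpointers.append(np.argmax(all_scores, axis=0))
        score = np.max(all_scores, axis=0)
    # 从最后一个位置开始回溯,得到最优路径
    best_tag = int(np.argmax(score))
    best_path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        best_path.append(best_tag)
    best_path.reverse()
    return best_path

# 一个3个词、2种标签的小例子
emission = np.array([[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]])
transition = np.array([[0.5, -0.5], [-0.2, 0.6]])
print viterbi_decode(emission, transition)
```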
### 深度双向LSTM(DB-LSTM)SRL模型
在SRL任务中,输入是 “谓词” 和 “一句话”,目标是从这句话中找到谓词的论元,并标注论元的语义角色。如果一个句子含有$n$个谓词,这个句子会被处理$n$次。一个最为直接的模型是下面这样:
1. 构造输入;
- 输入1是谓词,输入2是句子
- 将输入1扩展成和输入2一样长的序列,用one-hot方式表示;
2. one-hot方式的谓词序列和句子序列通过词表,转换为实向量表示的词向量序列;
3. 将步骤2中的2个词向量序列作为双向LSTM的输入,学习输入序列的特征表示;
4. CRF以步骤3中模型学习到的特征为输入,以标记序列为监督信号,实现序列标注;
大家可以尝试上面这种方法。这里,我们提出一些改进,引入两个简单但对提高系统性能非常有效的特征:
- 谓词上下文:上面的方法中,只用到了谓词的词向量表达谓词相关的所有信息,这种方法始终是非常弱的,特别是如果谓词在句子中出现多次,有可能引起一定的歧义。从经验出发,谓词前后若干个词的一个小片段,能够提供更丰富的信息,帮助消解歧义。于是,我们把这样的经验也添加到模型中,为每个谓词同时抽取一个“谓词上下文” 片段,也就是从这个谓词前后各取$n$个词构成的一个窗口片段;
- 谓词上下文区域标记:为句子中的每一个词引入一个0-1二值变量,表示它们是否在“谓词上下文”片段中(这两个特征的构造方式可参考下面的示意代码);
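在进入具体模型之前,下面用一小段与PaddlePaddle无关的示意代码说明这两个特征的构造方式:给定分好词的句子和谓词所在位置,抽取谓词前后各 $n$ 个词作为“谓词上下文”,并为句子中每个词生成0-1区域标记。为简化,这里没有处理窗口越界时的填充,实际数据预处理会用占位符补齐。

```python
# -*- coding: utf-8 -*-

def extract_predicate_features(words, predicate_idx, n=2):
    """抽取谓词上下文(谓词前后各n个词)和谓词上下文区域标记的示意实现。"""
    begin = max(predicate_idx - n, 0)
    end = min(predicate_idx + n + 1, len(words))
    context = words[begin:end]                                        # 谓词上下文片段
    mark = [1 if begin <= i < end else 0 for i in range(len(words))]  # 0-1区域标记
    return context, mark

words = ['小明', '昨天', '晚上', '在', '公园', '遇到', '了', '小红', '。']
context, mark = extract_predicate_features(words, words.index('遇到'), n=2)
print ' '.join(context)   # 在 公园 遇到 了 小红
print mark                # [0, 0, 0, 1, 1, 1, 1, 1, 0]
```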
修改后的模型如下(图6是一个深度为4的模型结构示意图):
1. 构造输入
- 输入1是句子序列,输入2是谓词序列,输入3是谓词上下文,从句子中抽取这个谓词前后各$n$个词,构成谓词上下文,用one-hot方式表示,输入4是谓词上下文区域标记,标记了句子中每一个词是否在谓词上下文中;
- 将输入2~3均扩展为和输入1一样长的序列;
2. 输入1~4均通过词表取词向量转换为实向量表示的词向量序列;其中输入1、3共享同一个词表,输入2和4各自独有词表;
3. 第2步的4个词向量序列作为双向LSTM模型的输入;LSTM模型学习输入序列的特征表示,得到新的特征表示序列;
4. CRF以第3步中LSTM学习到的特征为输入,以标记序列为监督信号,完成序列标注;
<div align="center">
<img src="image/db_lstm_network.png" width = "60%" align=center /><br>
图6. SRL任务上的深层双向LSTM模型
</div>
## 数据介绍
在此教程中,我们选用[CoNLL 2005](http://www.cs.upc.edu/~srlconll/)SRL任务开放出的数据集作为示例。需要特别说明的是,CoNLL 2005 SRL任务的训练集和开发集在比赛之后并非免费公开,目前能够获取到的只有测试集,包括Wall Street Journal的23节和Brown语料集中的3节。在本教程中,我们以测试集中的WSJ数据为训练集来讲解模型。但是,由于测试集中样本的数量远远不够,如果希望训练一个可用的神经网络SRL系统,请考虑付费获取全量数据。
原始数据中同时包括了词性标注、命名实体识别、语法解析树等多种信息。本教程中,我们使用test.wsj文件夹中的数据进行训练和测试,并只会用到words文件夹(文本序列)和props文件夹(标注结果)下的数据。本教程使用的数据目录如下:
```text
conll05st-release/
└── test.wsj
├── props # 标注结果
└── words # 输入文本序列
```
标注信息源自Penn TreeBank\[[7](#参考文献)\]和PropBank\[[8](#参考文献)\]的标注结果。PropBank标注结果的标签和我们在文章一开始示例中使用的标注结果标签不同,但原理是相同的,关于标注结果标签含义的说明,请参考论文\[[9](#参考文献)\]。
原始数据需要进行数据预处理才能被PaddlePaddle处理,预处理包括下面几个步骤:
1. 将文本序列和标记序列合并到一条记录中;
2. 一个句子如果含有$n$个谓词,这个句子会被处理$n$次,变成$n$条独立的训练样本,每个样本一个不同的谓词;
3. 抽取谓词上下文和构造谓词上下文区域标记;
4. 构造以BIO法表示的标记;
5. 依据词典获取词对应的整数索引。
```python
import paddle.v2.dataset.conll05 as conll05
# conll05.corpus_reader函数完成上面第1步和第2步.
# conll05.reader_creator函数完成上面第3步到第5步.
# conll05.test函数可以获取处理之后的每条样本来供PaddlePaddle训练.
```
预处理完成之后一条训练样本包含9个特征,分别是:句子序列、谓词、谓词上下文(占5列)、谓词上下文区域标记、标注序列。下表是一条训练样本的示例。
| 句子序列 | 谓词 | 谓词上下文(窗口 = 5) | 谓词上下文区域标记 | 标注序列 |
|---|---|---|---|---|
| A | set | n't been set . × | 0 | B-A1 |
| record | set | n't been set . × | 0 | I-A1 |
| date | set | n't been set . × | 0 | I-A1 |
| has | set | n't been set . × | 0 | O |
| n't | set | n't been set . × | 1 | B-AM-NEG |
| been | set | n't been set . × | 1 | O |
| set | set | n't been set . × | 1 | B-V |
| . | set | n't been set . × | 1 | O |
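如果想直观地查看预处理之后的样本,可以直接遍历 `conll05.test()` 返回的reader。下面是一段简单的示意代码,其中假设每条样本是包含9个整数ID序列的元组,顺序与后文 `reader_dict` 中各特征的下标一致(首次运行时会自动下载数据):

```python
import paddle.v2.dataset.conll05 as conll05

# conll05.test() 返回一个reader(可调用对象),调用它得到逐条样本的迭代器
for sample in conll05.test()():
    print len(sample)                        # 一条样本包含的特征个数,应为9
    print [len(field) for field in sample]   # 各特征序列的长度
    break
```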
除数据之外,我们同时提供了以下资源:
| 文件名称 | 说明 |
|---|---|
| word_dict | 输入句子的词典,共计44068个词 |
| label_dict | 标记的词典,共计106个标记 |
| predicate_dict | 谓词的词典,共计3162个词 |
| emb | 一个训练好的词向量表,每个词向量为32维 |
我们在英文维基百科上训练语言模型得到了一份词向量用来初始化SRL模型。在SRL模型训练过程中,词向量不再被更新。关于语言模型和词向量可以参考[词向量](https://github.com/PaddlePaddle/book/blob/develop/word2vec/README.md) 这篇教程。我们训练语言模型的语料共有995,000,000个token,词典大小控制为4,900,000词。CoNLL 2005训练语料中有5%的词不在这4,900,000个词中,我们将它们全部看作未登录词,用`<unk>`表示。
获取词典,打印词典大小:
```python
# math、numpy 在后文的模型配置和加载预训练参数时会用到
import math
import numpy as np
import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05

word_dict, verb_dict, label_dict = conll05.get_dict()
word_dict_len = len(word_dict)
label_dict_len = len(label_dict)
pred_len = len(verb_dict)

print word_dict_len
print label_dict_len
print pred_len
```
## 模型配置说明
1. 定义输入数据维度及模型超参数。
```python
mark_dict_len = 2    # 谓词上下文区域标记的维度,是一个0-1二值特征,因此维度为2
word_dim = 32        # 词向量维度
mark_dim = 5         # 谓词上下文区域标记通过词表被映射为一个实向量,这个是相应的维度
hidden_dim = 512 # LSTM隐层向量的维度 : 512 / 4
depth = 8 # 栈式LSTM的深度
# 一条样本总共9个特征,下面定义了9个data层,每个层类型为integer_value_sequence,表示整数ID的序列类型.
def d_type(size):
return paddle.data_type.integer_value_sequence(size)
# 句子序列
word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
# 谓词
predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
# 谓词上下文5个特征
ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
ctx_p2 = paddle.layer.data(name='ctx_p2_data', type=d_type(word_dict_len))
# 谓词上下文区域标记
mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
# 标注序列
target = paddle.layer.data(name='target', type=d_type(label_dict_len))
```
这里需要特别说明的是,hidden_dim = 512 实际指定了LSTM隐层向量的维度为 512/4=128 维,关于这一点请参考PaddlePaddle官方文档中[lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory)的说明。
2. 将句子序列、谓词、谓词上下文、谓词上下文区域标记通过词表,转换为实向量表示的词向量序列。
```python
# 在本教程中,我们加载了预训练的词向量,这里设置了:is_static=True
# is_static 为 True 时保证了在训练 SRL 模型过程中,词向量不再更新
emb_para = paddle.attr.Param(name='emb', initial_std=0., is_static=True)
# 设置超参数
default_std = 1 / math.sqrt(hidden_dim) / 3.0
std_default = paddle.attr.Param(initial_std=default_std)
std_0 = paddle.attr.Param(initial_std=0.)
predicate_embedding = paddle.layer.embedding(
size=word_dim,
input=predicate,
param_attr=paddle.attr.Param(
name='vemb', initial_std=default_std))
mark_embedding = paddle.layer.embedding(
size=mark_dim, input=mark, param_attr=std_0)
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [
paddle.layer.embedding(
size=word_dim, input=x, param_attr=emb_para) for x in word_input
]
emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding)
```
3. 8个LSTM单元以“正向/反向”的顺序对所有输入序列进行学习。
```python
hidden_0 = paddle.layer.mixed(
size=hidden_dim,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=emb, param_attr=std_default) for emb in emb_layers
])
mix_hidden_lr = 1e-3
lstm_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=1.0)
hidden_para_attr = paddle.attr.Param(
initial_std=default_std, learning_rate=mix_hidden_lr)
lstm_0 = paddle.layer.lstmemory(
input=hidden_0,
act=paddle.activation.Relu(),
gate_act=paddle.activation.Sigmoid(),
state_act=paddle.activation.Sigmoid(),
bias_attr=std_0,
param_attr=lstm_para_attr)
#stack L-LSTM and R-LSTM with direct edges
input_tmp = [hidden_0, lstm_0]
for i in range(1, depth):
mix_hidden = paddle.layer.mixed(
size=hidden_dim,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=input_tmp[0], param_attr=hidden_para_attr),
paddle.layer.full_matrix_projection(
input=input_tmp[1], param_attr=lstm_para_attr)
])
lstm = paddle.layer.lstmemory(
input=mix_hidden,
act=paddle.activation.Relu(),
gate_act=paddle.activation.Sigmoid(),
state_act=paddle.activation.Sigmoid(),
reverse=((i % 2) == 1),
bias_attr=std_0,
param_attr=lstm_para_attr)
input_tmp = [mix_hidden, lstm]
```
4. 取最后一个栈式LSTM的输出和这个LSTM单元的输入到隐层映射,经过一个全连接层映射到标记字典的维度,得到最终的特征向量表示。
```python
feature_out = paddle.layer.mixed(
size=label_dict_len,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=input_tmp[0], param_attr=hidden_para_attr),
paddle.layer.full_matrix_projection(
input=input_tmp[1], param_attr=lstm_para_attr)
], )
```
5. 网络的末端定义CRF层计算损失(cost),指定参数名字为 `crfw`,该层需要输入正确的数据标签(target)。
```python
crf_cost = paddle.layer.crf(
size=label_dict_len,
input=feature_out,
label=target,
param_attr=paddle.attr.Param(
name='crfw',
initial_std=default_std,
learning_rate=mix_hidden_lr))
```
6. CRF译码层和CRF层参数名字相同,即共享权重。如果输入了正确的数据标签(target),该层会统计错误标签的个数,可以用来评估模型;如果没有输入正确的数据标签,该层可以推导出最优的标注序列,可以用来做预测(预测时的写法可参考下面的示意代码)。
```python
crf_dec = paddle.layer.crf_decoding(
name='crf_dec_l',
size=label_dict_len,
input=feature_out,
label=target,
param_attr=paddle.attr.Param(name='crfw'))
```
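作为对照,预测(推断)时通常会另外定义一个不带label输入的解码层,直接输出维特比解码得到的标签序列。下面是一个示意写法,其中变量名 `predict` 是这里假设的名字:

```python
predict = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    param_attr=paddle.attr.Param(name='crfw'))
```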
## 训练模型
### 定义参数
首先依据模型配置中的`crf_cost`和`crf_dec`定义模型参数。
```python
# create parameters
parameters = paddle.parameters.create([crf_cost, crf_dec])
```
可以打印参数名字,如果在网络配置中没有指定名字,则默认生成。
```python
print parameters.keys()
```
如上文提到,我们用基于英文维基百科训练好的词向量来初始化序列输入、谓词上下文总共6个特征的embedding层参数,在训练中不更新。
```python
# 加载预训练词向量的二进制参数文件:文件前16个字节为头信息,其后为float32格式的参数值,读取时需要跳过文件头
def load_parameter(file_name, h, w):
with open(file_name, 'rb') as f:
f.read(16)
return np.fromfile(f, dtype=np.float32).reshape(h, w)
parameters.set('emb', load_parameter(conll05.get_embedding(), 44068, 32))
```
### 构造训练器(Trainer)
然后根据网络拓扑结构和模型参数来构造出trainer用来训练,在构造时还需指定优化方法,这里使用最基本的SGD方法(momentum设置为0),同时设定了学习率、正则等。
```python
# create optimizer
optimizer = paddle.optimizer.Momentum(
momentum=0,
learning_rate=2e-2,
regularization=paddle.optimizer.L2Regularization(rate=8e-4),
model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000), )
trainer = paddle.trainer.SGD(cost=crf_cost,
parameters=parameters,
update_equation=optimizer)
```
### 训练
数据介绍部分提到CoNLL 2005的训练集需要付费获取,这里我们使用测试集进行训练,供大家学习。`conll05.test()`每次产生一条样本,包含9个特征,shuffle和组完batch后作为训练的输入。
```python
reader = paddle.reader.batched(
paddle.reader.shuffle(
conll05.test(), buf_size=8192), batch_size=20)
```
通过`reader_dict`来指定每一个数据和data_layer的对应关系。例如,下面的`reader_dict`表示:`conll05.test()`产生数据的第0列对应`word_data`层的特征。
```python
reader_dict = {
'word_data': 0,
'ctx_n2_data': 1,
'ctx_n1_data': 2,
'ctx_0_data': 3,
'ctx_p1_data': 4,
'ctx_p2_data': 5,
'verb_data': 6,
'mark_data': 7,
'target': 8
}
```
可以使用`event_handler`回调函数来观察训练过程,或进行测试等。这里我们打印了训练过程的cost,该回调函数是在`trainer.train`函数里设定的。
```python
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f" % (
event.pass_id, event.batch_id, event.cost)
```
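如果还希望在训练过程中定期保存模型参数,可以在回调函数里同时处理 `EndPass` 事件。下面是一个参考写法,其中假设使用上文创建的 `parameters` 对象的 `to_tar` 接口进行序列化,具体接口请以所使用的PaddlePaddle版本为准:

```python
def event_handler_with_save(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f" % (
                event.pass_id, event.batch_id, event.cost)
    if isinstance(event, paddle.event.EndPass):
        # 每个pass结束时,把当前参数保存为一个tar文件(假设的用法示例)
        with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
            parameters.to_tar(f)
```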
通过`trainer.train`函数训练:
```python
trainer.train(
reader=reader,
event_handler=event_handler,
num_passes=10000,
reader_dict=reader_dict)
```
## 总结
语义角色标注是许多自然语言理解任务的重要中间步骤。这篇教程中我们以语义角色标注任务为例,介绍如何利用PaddlePaddle进行序列标注任务。教程中所介绍的模型来自我们发表的论文\[[10](#参考文献)\]。由于 CoNLL 2005 SRL任务的训练数据目前并非完全开放,教程中只使用测试数据作为示例。在这个过程中,我们希望减少对其它自然语言处理工具的依赖,利用神经网络数据驱动、端到端学习的能力,得到一个和传统方法可比、甚至更好的模型。在论文中我们证实了这种可能性。关于模型更多的信息和讨论可以在论文中找到。
## 参考文献
1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
4. Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[J]. arXiv preprint arXiv:1409.0473, 2014.
5. Lafferty J, McCallum A, Pereira F. [Conditional random fields: Probabilistic models for segmenting and labeling sequence data](http://www.jmlr.org/papers/volume15/doppa14a/source/biblio.bib.old)[C]//Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
6. 李航. 统计学习方法[J]. 清华大学出版社, 北京, 2012.
7. Marcus M P, Marcinkiewicz M A, Santorini B. [Building a large annotated corpus of English: The Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports)[J]. Computational linguistics, 1993, 19(2): 313-330.
8. Palmer M, Gildea D, Kingsbury P. [The proposition bank: An annotated corpus of semantic roles](http://www.mitpressjournals.org/doi/pdfplus/10.1162/0891201053630264)[J]. Computational linguistics, 2005, 31(1): 71-106.
9. Carreras X, Màrquez L. [Introduction to the CoNLL-2005 shared task: Semantic role labeling](http://www.cs.upc.edu/~srlconll/st05/papers/intro.pdf)[C]//Proceedings of the Ninth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2005: 152-164.
10. Zhou J, Xu W. [End-to-end learning of semantic role labeling using recurrent neural networks](http://www.aclweb.org/anthology/P/P15/P15-1109.pdf)[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="知识共享许可协议" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Text" property="dct:title" rel="dct:type">本教程</span><a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a> 创作,采用 <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">知识共享 署名-非商业性使用-相同方式共享 4.0 国际 许可协议</a>进行许可。
...@@ -41,24 +41,24 @@ ...@@ -41,24 +41,24 @@
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# Semantic Role Labeling # Semantic Role Labeling
Source code of this chpater is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/label_semantic_roles). Source code of this chapter is in [book/label_semantic_roles](https://github.com/PaddlePaddle/book/tree/develop/label_semantic_roles).
## Background ## Background
Natural Language Analysis contains three components: Lexical Analysis, Syntactic Analysis, and Semantic Analysis. Semantic Role Labelling (SRL) is one way for Shallow Semantic Analysis. A predicate of a sentence is seen as a property that a subject has or is characterized by, such as what it does, what it is or how it is, which mostly corresponds to the core of an event. The noun associated with predicate is called Arugment. Sementic roles express the abstract roles that arguments of a predicate can take in the event, such as Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal and Source etc. Natural Language Analysis contains three components: Lexical Analysis, Syntactic Analysis, and Semantic Analysis. Semantic Role Labelling (SRL) is one way for Shallow Semantic Analysis. A predicate of a sentence is a property that a subject possesses or is characterized, such as what it does, what it is or how it is, which mostly corresponds to the core of an event. The noun associated with a predicate is called Argument. Semantic roles express the abstract roles that arguments of a predicate can take in the event, such as Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal and Source, etc.
In the following example, “遇到” is Predicate (“Pred”),“小明” is Agent,“小红” is Patient,“昨天” means when the event occurs (Time), and “公园” means where the event occurs (Location). In the following example, “遇到” (encounters) is a Predicate (“Pred”),“小明” (Ming) is an Agent,“小红” (Hong) is a Patient,“昨天” (yesterday) indicates the Time, and “公园” (park) is the Location.
$$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mbox{Time}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$ $$\mbox{[小明]}_{\mbox{Agent}}\mbox{[昨天]}_{\mbox{Time}}\mbox{[晚上]}_\mbox{Time}\mbox{在[公园]}_{\mbox{Location}}\mbox{[遇到]}_{\mbox{Predicate}}\mbox{了[小红]}_{\mbox{Patient}}\mbox{。}$$
Instead of in-depth analysis on semantic information, the goal of Semantic Role Labeling is to identify the relation of predicate and other constituents, e.g., predicate-argument structure, as specific semantic roles, which is an important intermediate step in a wide range of natural language understanding tasks (Information Extraction, Discourse Analysis, DeepQA etc). Predicates are always assumed to be given, the only thing is to identify arguments and their semantic roles. Instead of in-depth analysis on semantic information, the goal of Semantic Role Labeling is to identify the relation of predicate and other constituents, e.g., predicate-argument structure, as specific semantic roles, which is an important intermediate step in a wide range of natural language understanding tasks (Information Extraction, Discourse Analysis, DeepQA etc). Predicates are always assumed to be given; the only thing is to identify arguments and their semantic roles.
Standard SRL system mostly build on top of Syntactic Analysis and contains 5 steps: Standard SRL system mostly builds on top of Syntactic Analysis and contains five steps:
1. Construct a syntactic parse tree, as shown in Fig. 1 1. Construct a syntactic parse tree, as shown in Fig. 1
2. Identity candidate arguments of given predicate from constructed syntactic parse tree. 2. Identity candidate arguments of given predicate from constructed syntactic parse tree.
3. Prune most unlikely candidate arguments. 3. Prune most unlikely candidate arguments.
4. Identify argument, which is usually solved as a binary classification problem. 4. Identify arguments, often by a binary classifier.
5. Multi-class semantic role labeling. Steps 2-3 usually introduce hand-designed features based on Syntactic Analysis (step 1). 5. Multi-class semantic role labeling. Steps 2-3 usually introduce hand-designed features based on Syntactic Analysis (step 1).
...@@ -77,7 +77,7 @@ Fig 1. Syntactic parse tree ...@@ -77,7 +77,7 @@ Fig 1. Syntactic parse tree
标点-> WP 标点-> WP
However, complete syntactic analysis requires to identify the relation among all constitutes and the performance of SRL is sensitive to the precision of syntactic analysis, which make SRL a very challenging task. In order to reduce the complexity and obtain some syntactic structure information, shallow syntactic analysis is proposed. Shallow Syntactic Analysis is also called partial parsing or chunking. Unlike complete syntactic analysis which requires constructing complete parsing tree, Shallow Syntactic Analysis only need to identify some idependent components with relatively simple structure, such as verb phrases (chunk). In order to avoid constructing syntactic tree with high accuracy, some work\[[1](#Reference)\] proposed semantic chunking based SRL methods, which convert SRL as a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using BIO representation. For syntactic chunks forming a chunk of type A, the first chunk receives the B-A tag (Begin), the remaining ones receive the tag I-A (Inside), and all chunks outside receive the tag O-A. However, complete syntactic analysis requires identifying the relation among all constitutes and the performance of SRL is sensitive to the precision of syntactic analysis, which makes SRL a very challenging task. To reduce the complexity and obtain some syntactic structure information, we often use shallow syntactic analysis. Shallow Syntactic Analysis is also called partial parsing or chunking. Unlike complete syntactic analysis which requires the construction of the complete parsing tree, Shallow Syntactic Analysis only need to identify some independent components with relatively simple structure, such as verb phrases (chunk). To avoid difficulties in constructing a syntactic tree with high accuracy, some work\[[1](#Reference)\] proposed semantic chunking based SRL methods, which convert SRL as a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using BIO representation. For syntactic chunks forming a chunk of type A, the first chunk receives the B-A tag (Begin), the remaining ones receive the tag I-A (Inside), and all chunks outside receive the tag O-A.
The BIO representation of above example is shown in Fig.1. The BIO representation of above example is shown in Fig.1.
...@@ -91,23 +91,23 @@ Fig 2. BIO represention ...@@ -91,23 +91,23 @@ Fig 2. BIO represention
标注序列-> label sequence 标注序列-> label sequence
角色-> role 角色-> role
This example illustrates the simplicity of sequence tagging because (1) shallow syntactic analysis reduces precision requirement of syntactic analysis; (2) pruning candidate arguments is removed; 3) argument identification and tagging are finished at the same time. Such unified methods simplify the precedure, reduce the risk of accumulating errors and boost the performance further. This example illustrates the simplicity of sequence tagging because (1) shallow syntactic analysis reduces the precision requirement of syntactic analysis; (2) pruning candidate arguments is removed; 3) argument identification and tagging are finished at the same time. Such unified methods simplify the procedure, reduce the risk of accumulating errors and boost the performance further.
In this tutorial, our SRL system is built as an end-to-end system via neural network. We take only text sequences, without using any syntactic parsing results or complex hand-designed features. We give public dataset [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) as an example to illustrate: given a sentence and it's predicates, identify the corresponding arguments and their semantic roles by sequence tagging method. In this tutorial, our SRL system is built as an end-to-end system via a neural network. We take only text sequences, without using any syntactic parsing results or complex hand-designed features. We give public dataset [CoNLL-2004 and CoNLL-2005 Shared Tasks](http://www.cs.upc.edu/~srlconll/) as an example to illustrate: given a sentence with predicates marked, identify the corresponding arguments and their semantic roles by sequence tagging method.
## Model ## Model
Recurrent Nerual Networks are important tools for sequence modeling and have been successfully used in some natural language processing tasks. Unlike Feed-forward neural netowrks, RNNs can model the dependency between elements of sequences. LSTMs as variants of RNNs aim to model long-term dependency in long sequences. We have introduced this in [understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment). In this chapter, we continue to use LSTMs to solve SRL problems. Recurrent Neural Networks are important tools for sequence modeling and have been successfully used in some natural language processing tasks. Unlike Feed-forward neural networks, RNNs can model the dependency between elements of sequences. LSTMs as variants of RNNs aim to model long-term dependency in long sequences. We have introduced this in [understand_sentiment](https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment). In this chapter, we continue to use LSTMs to solve SRL problems.
### Stacked Recurrent Neural Network ### Stacked Recurrent Neural Network
Deep Neural Networks allows to extract hierarchical represetations, higher layer can form more abstract/complex representations on top of lower layers. LSTMs when unfolded in time is deep, because a computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. However, the computation carried out at each time-step is only linear transformation, which makes LSTMs a shallow model. Deep LSTMs are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical nerual networks can be much efficient at representing some functions and modeling varying-length dependencies\[[2](#Reference)\]. Deep Neural Networks allows extracting hierarchical representations. Higher layers can form more abstract/complex representations on top of lower layers. LSTMs, when unfolded in time, is a deep feed-forward neural network, because a computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. However, the computation carried out at each time-step is only linear transformation, which makes LSTMs a shallow model. Deep LSTMs are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical neural networks can be much efficient at representing some functions and modeling varying-length dependencies\[[2](#Reference)\].
However, deep LSTMs increases the number of nonlinear steps the gradient has to traverse when propagated back in depth. For example, 4 layer LSTMs can be trained properly, but the performance becomes worse as the number of layers up to 4-8. Conventional LSTMs prevent backpropagated errors from vanishing and exploding by introduce shortcut connections to skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well. However, deep LSTMs increases the number of nonlinear steps the gradient has to traverse when propagated back in depth. For example, four layer LSTMs can be trained properly, but the performance becomes worse as the number of layers up to 4-8. Conventional LSTMs prevent backpropagated errors from vanishing and exploding by introducing shortcut connections to skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well.
The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map input $x$ to the input of forget gates, input gates, memory cells and output gates by linear transformation (i.e., matrix mapping); (2) hidden-to-hidden: calculate forget gates, input gates, output gates and update memory cell, this is the main part of LSTMs; (3)hidden-to-output: this part typically involves an activation operation on hidden states. Based on the above stacked LSTMs, we add a shortcut connection: take the input-to-hidden from previous layer as a new input and learn another linear transfermation. The operation of a single LSTM cell contain 3 parts: (1) input-to-hidden: map input $x$ to the input of the forget gates, input gates, memory cells and output gates by linear transformation (i.e., matrix mapping); (2) hidden-to-hidden: calculate forget gates, input gates, output gates and update memory cell, this is the main part of LSTMs; (3)hidden-to-output: this part typically involves an activation operation on hidden states. Based on the stacked LSTMs, we add a shortcut connection: take the input-to-hidden from the previous layer as a new input and learn another linear transformation.
Fig.3 illustrate the final stacked recurrent neural networks. Fig.3 illustrate the final stacked recurrent neural networks.
...@@ -121,7 +121,7 @@ Fig 3. Stacked Recurrent Neural Networks ...@@ -121,7 +121,7 @@ Fig 3. Stacked Recurrent Neural Networks
### Bidirectional Recurrent Neural Network ### Bidirectional Recurrent Neural Network
LSTMs can summarize the history of previous inputs seen up to now, but can not see the future. In most of natural language processing tasks, the entire sentences are ready to use. Therefore, sequencal learning might be much effecient if the future can be encoded as well like histories. LSTMs can summarize the history of previous inputs seen up to now, but can not see the future. In most of NLP (natural language processing) tasks, the entire sentences are ready to use. Therefore, sequential learning might be much efficient if the future can be encoded as well like histories.
To address the above drawbacks, we can design bidirectional recurrent neural networks by making a minor modification. Higher LSTM layers process the sequence in reversed direction with previous lower LSTM layers, i.e., Deep LSTMs operate from left-to-right, right-to-left, left-to-right,..., in depth. Therefore, LSTM layers at time-step $t$ can see both histories and the future since the second layer. Fig. 4 illustrates the bidirectional recurrent neural networks. To address the above drawbacks, we can design bidirectional recurrent neural networks by making a minor modification. Higher LSTM layers process the sequence in reversed direction with previous lower LSTM layers, i.e., Deep LSTMs operate from left-to-right, right-to-left, left-to-right,..., in depth. Therefore, LSTM layers at time-step $t$ can see both histories and the future since the second layer. Fig. 4 illustrates the bidirectional recurrent neural networks.
...@@ -133,10 +133,10 @@ Fig 4. Bidirectional LSTMs ...@@ -133,10 +133,10 @@ Fig 4. Bidirectional LSTMs
线性变换-> linear transformation 线性变换-> linear transformation
输入层到隐层-> input-to-hidden 输入层到隐层-> input-to-hidden
正向处理输出序列->process sequence in forward direction 正向处理输出序列->process sequence in the forward direction
反向处理上一层序列-> process sequence from previous layer in backward direction 反向处理上一层序列-> process sequence from the previous layer in backward direction
Note that, this bidirectional RNNs is different with the one proposed by Bengio et al in machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce another bidirectional RNNs in the following tasks[machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md) Note that, this bidirectional RNNs is different with the one proposed by Bengio et al. in machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce another bidirectional RNNs in the following tasks[machine translation](https://github.com/PaddlePaddle/book/blob/develop/machine_translation/README.md)
### Conditional Random Field ### Conditional Random Field
...@@ -145,7 +145,7 @@ The basic pipeline of Neural Networks solving problems is 1) all lower layers ai ...@@ -145,7 +145,7 @@ The basic pipeline of Neural Networks solving problems is 1) all lower layers ai
CRF is a probabilistic graph model (undirected) with nodes denoting random variables and edges denoting dependencies between nodes. To be simplicity, CRFs learn conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are sequences of input, $Y = (y_1, y_2, ... , y_n)$ are label sequences; Decoding is to search sequence $Y$ to maximize conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$。 CRF is a probabilistic graph model (undirected) with nodes denoting random variables and edges denoting dependencies between nodes. To be simplicity, CRFs learn conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are sequences of input, $Y = (y_1, y_2, ... , y_n)$ are label sequences; Decoding is to search sequence $Y$ to maximize conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$。
Sequence tagging tasks only consider input and output as linear sequences without extra dependent assumptions on graph model. Thus, the graph model of sequence tagging tasks is simple chain or line, which results in a Linear Chain Conditional Random Field, shown in Fig.5. Sequence tagging tasks only consider input and output as linear sequences without extra dependent assumptions on graph model. Thus, the graph model of sequence tagging tasks is simple chain or line, which results in a Linear-Chain Conditional Random Field, shown in Fig.5.
<p align="center"> <p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br> <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
...@@ -174,25 +174,25 @@ This objective function can be solved via back-propagation in an end-to-end mann ...@@ -174,25 +174,25 @@ This objective function can be solved via back-propagation in an end-to-end mann
Given the predicates of a sentence, an SRL task aims to identify the arguments of each given predicate and their semantic roles. If a sequence has $n$ predicates, we process the sequence $n$ times. A first model is as follows:

1. Construct inputs;
 - input 1: predicate, input 2: sentence
 - expand input 1 into a sequence of the same length as input 2, using a one-hot representation;
2. Convert the one-hot sequences from step 1 into vector sequences via a lookup table;
3. Learn the representation of the input sequence by taking the vector sequences from step 2 as inputs;
4. Take the representation from step 3 as input and the label sequence as the supervision signal, and perform the sequence tagging task.

We can try the above method. Here, we propose two simple but effective features that improve it:

- predicate context (ctx-p): A single predicate word cannot describe the predicate exactly, especially when the same word appears more than once in a sentence. With an expanded context, the ambiguity can be largely eliminated. Thus, we extract the $n$ words before and after the predicate to construct a context window, as sketched below.
- region mark ($m_r$): $m_r = 1$ denotes that the word at that position lies inside the predicate context region, and $m_r = 0$ that it does not.
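The two features can be made concrete with a small sketch. This is not the tutorial's preprocessing code; the helper below (its name, the window size $n=2$, and the `<pad>` placeholder are assumptions made only for illustration) shows how a 5-word predicate context and the corresponding region marks could be derived from a tokenized sentence:

```python
def predicate_context_and_mark(words, predicate_index, n=2):
    """Illustrative sketch: a 2n+1-word context window around the predicate
    and a 0/1 region mark for every word of the sentence."""
    start, end = predicate_index - n, predicate_index + n + 1
    # positions falling outside the sentence are padded with a placeholder
    ctx = [words[i] if 0 <= i < len(words) else '<pad>'
           for i in range(start, end)]
    mark = [1 if start <= i < end else 0 for i in range(len(words))]
    return ctx, mark

words = ['A', 'record', 'date', 'has', "n't", 'been', 'set', '.']
ctx, mark = predicate_context_and_mark(words, words.index('set'))
# ctx  -> ["n't", 'been', 'set', '.', '<pad>']
# mark -> [0, 0, 0, 0, 1, 1, 1, 1]
```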
After these modifications, the model is as follows:

1. Construct inputs
 - Input 1: word sequence. Input 2: predicate. Input 3: predicate context, i.e., the $n$ words before and after the predicate. Input 4: region mark sequence, whose element is 1 if the word lies in the predicate context region and 0 otherwise.
 - Expand inputs 2 and 3 into sequences of the same length as input 1 (see the sketch after this list);
2. Convert inputs 1 to 4 into vector sequences via lookup tables; inputs 1 and 3 share the same lookup table, while inputs 2 and 4 have their own lookup tables;
3. Take the four vector sequences from step 2 as inputs to bidirectional LSTMs and train the LSTMs to update the representations;
4. Take the representation from step 3 as the input of a CRF, with the label sequence as the supervision signal, and perform the sequence tagging task.
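Here is a minimal sketch of the expansion mentioned in step 1 (illustrative variable names, not code from this tutorial): the predicate and each context word are simply repeated so that every input has the same length as the word sequence and can be fed to the network as parallel sequences.

```python
words = ['A', 'record', 'date', 'has', "n't", 'been', 'set', '.']
predicate = 'set'
ctx = ["n't", 'been', 'set', '.', '<pad>']     # 5-word predicate context

sen_len = len(words)
predicate_seq = [predicate] * sen_len          # input 2 expanded to sentence length
ctx_seqs = [[w] * sen_len for w in ctx]        # input 3 expanded, one sequence per context word
```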
Fig 6. DB-LSTM for SRL tasks
原句 -> sentence
反向LSTM -> LSTM Reverse
## Data Preparation

In this tutorial, we use the open dataset of the [CoNLL 2005](http://www.cs.upc.edu/~srlconll/) SRL task as an example. Note that the training and development sets of the CoNLL 2005 SRL task are no longer free to download after the competition; currently, only the test set can be obtained, including 23 sections of the Wall Street Journal and three sections of the Brown corpus. In this tutorial, we use the WSJ data of the test set as our training set to explain the model. However, since this training set is rather small, if you want to train a usable neural network SRL system, please consider paying for the full corpus.

The original data also includes a variety of other information, such as POS tags, named entities, and parse trees. In this tutorial, we use only the data under the words folder (the text sequences) and the props folder (the labels) inside the test.wsj folder. The data directory used in this tutorial is as follows:
```text
conll05st-release/
└── test.wsj
    ├── props  # label results
    └── words  # input text sequences
```
The annotation information is derived from Penn TreeBank\[[7](#references)\] and PropBank\[[8](#references)\]. The labels of PropBank differ from the labels we used in the example at the beginning of this article, but the principle is the same; for a description of the labels, please refer to the paper \[[9](#references)\].
The raw data needs to be preprocessed before it can be used by PaddlePaddle. The preprocessing consists of the following steps:

1. Merge the text sequence and the tag sequence into one record;
2. If a sentence contains $n$ predicates, it is processed $n$ times into $n$ separate training samples, each with a different predicate;
3. Extract the predicate context and construct the predicate context region mark;
4. Construct the labels in BIO format;
5. Obtain the integer index of each word according to the dictionary.
```python
# import paddle.v2.dataset.conll05 as conll05
# conll05.corpus_reader does step 1 and 2 as mentioned above.
# conll05.reader_creator does step 3 to 5.
# conll05.test gets preprocessed training instances.
```
After preprocessing, each training sample contains nine features: the word sequence, the predicate, the predicate context (5 columns), the region mark sequence, and the label sequence. The following table shows an example training sample.
| word sequence | predicate | predicate context (window = 5) | region mark sequence | label sequence |
|---|---|---|---|---|
| A | set | n't been set . × | 0 | B-A1 |
| record | set | n't been set . × | 0 | I-A1 |
| set | set | n't been set . × | 1 | B-V |
| . | set | n't been set . × | 1 | O |
In addition to the data, we provide the following resources:

| filename | explanation |
|---|---|
| word_dict | dictionary of input sentences, 44068 words in total |
| label_dict | dictionary of labels, 106 labels in total |
| predicate_dict | dictionary of predicates, 3162 predicates in total |
| emb | a pre-trained word vector lookup table, 32-dimensional |

We trained a language model on English Wikipedia to obtain a word vector lookup table, which is used to initialize the SRL model. During SRL model training, the word vector lookup table is not updated. For more about the language model and the word vector lookup table, refer to the [word vector](https://github.com/PaddlePaddle/book/blob/develop/word2vec/README.md) tutorial. The training corpus of the language model has 995,000,000 tokens, and its dictionary size is limited to 4900,000 words. 5% of the words in the CoNLL 2005 training corpus are not among these 4900,000 words; we treat them all as unknown words, represented by `<unk>`.
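As a small sketch of how out-of-vocabulary words end up as `<unk>` when text is converted to indices (the `UNK_IDX` fallback below is an assumption for illustration; `word_dict` is the dictionary loaded in the next code block):

```python
UNK_IDX = word_dict.get('<unk>', 0)  # assumed index for unknown words
tokens = ['The', 'interest-only', 'securities', 'were', 'priced']
token_ids = [word_dict.get(w, UNK_IDX) for w in tokens]
```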
Get the dictionaries and print their sizes:

```python
import paddle.v2 as paddle
import paddle.v2.dataset.conll05 as conll05

word_dict, verb_dict, label_dict = conll05.get_dict()
word_dict_len = len(word_dict)
label_dict_len = len(label_dict)
pred_len = len(verb_dict)

print word_dict_len
print label_dict_len
print pred_len
```
## Model Configuration
1. Define input data dimensions and model hyperparameters.
```python
mark_dict_len = 2 # Value range of region mark. Region mark is either 0 or 1, so range is 2
word_dim = 32 # word vector dimension
mark_dim = 5 # dimension of the region-mark embedding
hidden_dim = 512 # the dimension of LSTM hidden layer vector is 128 (512/4)
depth = 8 # depth of stacked LSTM
# There are 9 features per sample, so we will define 9 data layers.
# The type of each layer is integer_value_sequence.
def d_type(value_range):
    return paddle.data_type.integer_value_sequence(value_range)
# word sequence
word = paddle.layer.data(name='word_data', type=d_type(word_dict_len))
# predicate
predicate = paddle.layer.data(name='verb_data', type=d_type(pred_len))
# 5 features for predicate context
ctx_n2 = paddle.layer.data(name='ctx_n2_data', type=d_type(word_dict_len))
ctx_n1 = paddle.layer.data(name='ctx_n1_data', type=d_type(word_dict_len))
ctx_0 = paddle.layer.data(name='ctx_0_data', type=d_type(word_dict_len))
ctx_p1 = paddle.layer.data(name='ctx_p1_data', type=d_type(word_dict_len))
ctx_p2 = paddle.layer.data(name='ctx_p2_data', type=d_type(word_dict_len))
# region marker sequence
mark = paddle.layer.data(name='mark_data', type=d_type(mark_dict_len))
# label sequence
target = paddle.layer.data(name='target', type=d_type(label_dict_len))
```
Special note: hidden_dim = 512 actually means that the LSTM hidden vector has 128 dimensions (512/4). Please refer to the PaddlePaddle official documentation for details: [lstmemory](http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#lstmemory).
2. The word sequence, predicate, predicate context, and region mark sequence are transformed into embedding vector sequences.
```python
# Since the word vector lookup table is pre-trained, we won't update it during training.
# is_static being True prevents updating the lookup table during training.
emb_para = paddle.attr.Param(name='emb', initial_std=0., is_static=True)
# hyperparameter configurations
default_std = 1 / math.sqrt(hidden_dim) / 3.0
std_default = paddle.attr.Param(initial_std=default_std)
std_0 = paddle.attr.Param(initial_std=0.)
predicate_embedding = paddle.layer.embedding(
size=word_dim,
input=predicate,
param_attr=paddle.attr.Param(
name='vemb', initial_std=default_std))
mark_embedding = paddle.layer.embedding(
size=mark_dim, input=mark, param_attr=std_0)
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [
paddle.layer.embedding(
size=word_dim, input=x, param_attr=emb_para) for x in word_input
]
emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding)
```
3. 8 LSTM units will be trained in "forward / backward" order.
```python
hidden_0 = paddle.layer.mixed(
size=hidden_dim,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=emb, param_attr=std_default) for emb in emb_layers
])
mix_hidden_lr = 1e-3
lstm_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=1.0)
hidden_para_attr = paddle.attr.Param(
initial_std=default_std, learning_rate=mix_hidden_lr)
lstm_0 = paddle.layer.lstmemory(
input=hidden_0,
act=paddle.activation.Relu(),
gate_act=paddle.activation.Sigmoid(),
state_act=paddle.activation.Sigmoid(),
bias_attr=std_0,
param_attr=lstm_para_attr)
# stack L-LSTM and R-LSTM with direct edges
input_tmp = [hidden_0, lstm_0]
for i in range(1, depth):
    mix_hidden = paddle.layer.mixed(
        size=hidden_dim,
        bias_attr=std_default,
        input=[
            paddle.layer.full_matrix_projection(
                input=input_tmp[0], param_attr=hidden_para_attr),
            paddle.layer.full_matrix_projection(
                input=input_tmp[1], param_attr=lstm_para_attr)
        ])
    lstm = paddle.layer.lstmemory(
        input=mix_hidden,
        act=paddle.activation.Relu(),
        gate_act=paddle.activation.Sigmoid(),
        state_act=paddle.activation.Sigmoid(),
        reverse=((i % 2) == 1),
        bias_attr=std_0,
        param_attr=lstm_para_attr)
    input_tmp = [mix_hidden, lstm]
```
4. We will concatenate the output of the top LSTM unit with its input, and project the result into a hidden layer. Then a fully connected layer is put on top of it to get the final vector representation.
```python
feature_out = paddle.layer.mixed(
size=label_dict_len,
bias_attr=std_default,
input=[
paddle.layer.full_matrix_projection(
input=input_tmp[0], param_attr=hidden_para_attr),
paddle.layer.full_matrix_projection(
input=input_tmp[1], param_attr=lstm_para_attr)
], )
```
5. We use a CRF as the cost function; the parameter of the CRF cost layer is named `crfw`.
```python
crf_cost = paddle.layer.crf(
size=label_dict_len,
input=feature_out,
label=target,
param_attr=paddle.attr.Param(
name='crfw',
initial_std=default_std,
learning_rate=mix_hidden_lr))
```
6. The CRF decoding layer is used for evaluation and inference. It shares its parameters with the CRF layer; parameter sharing among layers is specified by giving them the same parameter name.
```python
crf_dec = paddle.layer.crf_decoding(
name='crf_dec_l',
size=label_dict_len,
input=feature_out,
label=target,
param_attr=paddle.attr.Param(name='crfw'))
```
## Train model
### Create Parameters
All the necessary parameters will be traced and created, given the output layers that we need to use.
```python
parameters = paddle.parameters.create([crf_cost, crf_dec])
```
We can print out the parameter names; a name will be generated automatically if it is not specified.

```python
print parameters.keys()
```

Now we load the pre-trained word lookup table:

```python
import numpy as np

def load_parameter(file_name, h, w):
    with open(file_name, 'rb') as f:
        f.read(16)  # skip the 16-byte header
        return np.fromfile(f, dtype=np.float32).reshape(h, w)

parameters.set('emb', load_parameter(conll05.get_embedding(), 44068, 32))
```
### Create Trainer
We will create a trainer given the model topology, the parameters, and the optimization method. We use the most basic SGD method (a momentum optimizer with momentum 0), and set the learning rate and regularization at the same time.
```python
optimizer = paddle.optimizer.Momentum(
momentum=0,
learning_rate=2e-2,
regularization=paddle.optimizer.L2Regularization(rate=8e-4),
model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000), )
trainer = paddle.trainer.SGD(cost=crf_cost,
parameters=parameters,
update_equation=optimizer)
```
### Trainer
As mentioned in the Data Preparation section, we use the CoNLL 2005 test corpus as the training data set. `conll05.test()` outputs one training instance at a time. The instances are shuffled and batched into mini-batches, which serve as the input.
```python
reader = paddle.reader.batched(
    paddle.reader.shuffle(
        conll05.test(), buf_size=8192), batch_size=20)
```
`reader_dict` specifies the relationship between data instances and data layers. For example, according to the following `reader_dict`, the 0th column of a data instance produced by `conll05.test()` corresponds to the data layer named `word_data`.
```python
reader_dict = {
    'word_data': 0,
    'ctx_n2_data': 1,
    'ctx_n1_data': 2,
    'ctx_0_data': 3,
    'ctx_p1_data': 4,
    'ctx_p2_data': 5,
    'verb_data': 6,
    'mark_data': 7,
    'target': 8
}
```
`event_handler` can be used as a callback for training events; it is passed as an argument to `train`. The following `event_handler` prints the cost during training.
```python
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f" % (
                event.pass_id, event.batch_id, event.cost)
```
`trainer.train` will train the model.
```python
trainer.train(
    reader=reader,
    event_handler=event_handler,
    num_passes=10000,
    reader_dict=reader_dict)
```
## Conclusion
Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we used SRL as an example to introduce how to use PaddlePaddle for sequence tagging tasks. The proposed models come from our published paper \[[10](#Reference)\]. We only use the test data for illustration, since the training data of the CoNLL 2005 dataset is not completely public. We aim at an end-to-end neural network model with fewer dependencies on natural language processing tools that is comparable to, or even better than, traditional models. Please check out our paper for more information and discussion.
## Reference
1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
$$ crossentropy(label, y) = -\sum_i label_i \log(y_i) $$
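To make the loss concrete, consider a toy example (made-up numbers, not data from this chapter): with a one-hot label selecting the second class and a prediction $y = (0.2, 0.7, 0.1)$, only the probability assigned to the true class contributes:

$$crossentropy(label, y) = -(0 \cdot \log 0.2 + 1 \cdot \log 0.7 + 0 \cdot \log 0.1) = -\log 0.7 \approx 0.357$$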
Fig. 2 shows a softmax regression network, with weights in blue and bias in red. +1 indicates that the bias is 1.
<p align="center">
<img src="image/softmax_regression_en.png" width=400><br/>
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, after the output layer, we get $Y=softmax(W_3H_2 + b_3)$, the final classification result vector.

Fig. 3 is a Multilayer Perceptron network, with weights in blue and bias in red. +1 indicates that the bias is 1.
<p align="center">
<img src="image/mlp_en.png" width=500><br/>
PaddlePaddle provides a special layer `layer.data` for reading data. Let us create a data layer for reading images and connect it to a classification network created using one of the above three functions. We also need a cost layer for training the model.
```python
paddle.init(use_gpu=False, trainer_count=1)

images = paddle.layer.data(
    name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
    name='label', type=paddle.data_type.integer_value(10))

predict = softmax_regression(images)
#predict = multilayer_perceptron(images) # uncomment for MLP
#predict = convolutional_neural_network(images) # uncomment for LeNet5

cost = paddle.layer.classification_cost(input=predict, label=label)
```
Now, it is time to specify training parameters. The number 0.9 in the following `Momentum` optimizer means that 90% of the current momentum comes from the momentum of the previous iteration.
```python
parameters = paddle.parameters.create(cost)

optimizer = paddle.optimizer.Momentum(
    learning_rate=0.1 / 128.0,
    momentum=0.9,
    regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))

trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=optimizer)
```
Then we specify the training data `paddle.dataset.mnist.train()` and the testing data `paddle.dataset.mnist.test()`. These two functions are *reader creators*; once called, they return a *reader*. A reader is a Python function which, once called, returns a Python generator that yields instances of data.
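As a minimal sketch of this reader-creator convention (a toy example, not PaddlePaddle library code):

```python
def toy_reader_creator(n):
    """Returns a reader over n fake instances (illustration only)."""
    def reader():
        for i in range(n):
            yield [float(i)] * 784, i % 10   # (feature vector, label)
    return reader

reader = toy_reader_creator(5)    # like paddle.dataset.mnist.train(), this is a reader
for features, label in reader():  # calling the reader yields data instances
    pass
```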
`batch` is a special decorator whose input is a reader and whose output is a *batch reader*, which doesn't yield one instance at a time, but a minibatch.
```python
lists = []

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(reader=paddle.reader.batched(
            paddle.dataset.mnist.test(), batch_size=128))
        print "Test with Pass %d, Cost %f, %s\n" % (
            event.pass_id, result.cost, result.metrics)
        lists.append((event.pass_id, result.cost,
                      result.metrics['classification_error_evaluator']))

trainer.train(
    reader=paddle.reader.batched(
        paddle.reader.shuffle(
            paddle.dataset.mnist.train(), buf_size=8192),
        batch_size=128),
    event_handler=event_handler,
    num_passes=100)
```
During training, `trainer.train` invokes `event_handler` for certain events. This gives us a chance to print the training progress.
```
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
```
After the training, we can check the model's prediction accuracy.
```python
# find the best pass
best = sorted(lists, key=lambda list: float(list[1]))[0]
print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
```
Usually, with MNIST data, the softmax regression model achieves an accuracy around 92.34%, the MLP about 97.66%, and the convolutional network up to around 99.20%. Convolutional layers are widely considered a great invention for image processing.
$$ crossentropy(label, y) = -\sum_i label_i \log(y_i) $$

Fig. 2 shows the softmax regression network, where the weights are drawn as blue lines, the bias as red lines, and +1 indicates that the bias coefficient is 1.
<p align="center">
<img src="image/softmax_regression.png" width=400><br/>
3. Finally, after the output layer, we get $Y=softmax(W_3H_2 + b_3)$, which is the final classification result vector.

Fig. 3 shows the network structure of the multilayer perceptron, where the weights are drawn as blue lines, the bias as red lines, and +1 indicates that the bias coefficient is 1.
<p align="center">
<img src="image/mlp.png" width=500><br/>
</p>
### Convolutional Neural Network (CNN)
In the multilayer perceptron model, the image is flattened into a one-dimensional vector before being fed into the network, which ignores the position and structure information of the image. A convolutional neural network can make better use of the image structure. [LeNet-5](http://yann.lecun.com/exdb/lenet/) is a relatively simple convolutional neural network. Fig. 6 shows its structure: the two-dimensional input image first goes through two rounds of convolutional and pooling layers, then through a fully connected layer, and finally a softmax classifier is used as the output layer. Below we mainly introduce the convolutional layer and the pooling layer.

<p align="center">
<img src="image/cnn.png"><br/>
Fig. 6. LeNet-5 convolutional neural network structure<br/>
</p>

#### Convolutional layer

The convolutional layer is the core building block of a convolutional neural network. The convolution we refer to in image recognition is a two-dimensional convolution, i.e., a discrete two-dimensional filter (also called a convolution kernel) is convolved with a two-dimensional image. Simply put, the filter slides over every position of the image and, at each position, takes the inner product with that pixel and its neighborhood. Convolution is widely used in image processing: different kernels extract different features, such as edges, lines, and corners. In a deep convolutional neural network, convolution extracts features of an image from low-level ones to increasingly complex ones.

<p align="center">
<img src="image/conv_layer.png"><br/>
Fig. 4. Convolutional layer<br/>
</p>

Fig. 4 shows an example of the convolution computation. The input image has size $H=5, W=5, D=3$, i.e., a $5 \times 5$ color image with 3 channels (RGB, also called depth). The example contains two (denoted by $K$) groups of kernels, i.e., $Filter W_0$ and $Filter W_1$ in the figure. In a convolution, different kernels are usually applied to different input channels; in this example, each group contains 3 (i.e., $D$) kernels of size $3 \times 3$ (denoted by $F \times F$). In addition, the stride of the kernels in the horizontal ($W$) and vertical ($H$) directions is 2 (denoted by $S$), and the input image is padded with 1 zero (denoted by $P$) on each border: the blue part of the input layer in the figure is the original data and the gray part is the zero padding of size 1. The convolution produces an output feature map of size $3 \times 3 \times 2$ (denoted by $H_{o} \times W_{o} \times K$), i.e., a 2-channel feature map of size $3 \times 3$, where $H_o$ is computed as $H_o = (H - F + 2P)/S + 1$, and $W_o$ likewise. Each pixel of an output feature map is the sum of the inner products of one group of filters with the corresponding input feature maps, plus a bias ($b_o$), which is usually shared within each output feature map. For example, the value $-9$ in the output feature map `o[:,:,0]` in the figure is computed as follows:

$$-9 = \sum x[4:6,4:6,0] * W[:,:,0] + \sum x[4:6,4:6,1] * W[:,:,1] + \sum x[4:6,4:6,2] * W[:,:,2] + b_0\\
\sum x[4:6,4:6,0] * W[:,:,0] = 2*1 + 2*(-1) + 0*1 + 0*0 + 2*(-1) + 0*1 + 0*0 + 0*0 + 0*0 = -2 \\
\sum x[4:6,4:6,1] * W[:,:,1] = 2*(-1) + 2*(-1) + 0*0 + 2*0 + 2*(-1) + 0*(-1) + 0*0 + 0*1 + 0*1 = -6 \\
\sum x[4:6,4:6,2] * W[:,:,2] = 0*0 + 0*1 + 0*1 + 2*(-1) + 1*0 + 0*1 + 0*1 + 0*0 + 0*1 = -2$$

The kernels in a convolution are learnable parameters. As the example shows, each convolutional layer has $D \times F \times F \times K$ weight parameters. In the multilayer perceptron, neurons are fully connected and the number of parameters is large; a convolutional layer has far fewer parameters, which follows from its two main properties, local connectivity and weight sharing.

- Local connectivity: each neuron is connected only to a local region of the input, called its receptive field. In image convolution, a neuron is locally connected in the spatial dimensions (the plane spanned by $H$ and $W$ in the figure above) but fully connected in depth. For a two-dimensional image, nearby pixels are also more strongly correlated. This local connectivity ensures that the learned filters respond most strongly to local input patterns. The idea is also inspired by the biological visual system, where neurons in the visual cortex receive information locally.
- Weight sharing: the filter used to compute the neurons of the same depth slice is shared. For example, in Fig. 4 every neuron of $o[:,:,0]$ is computed with the same filter $W_0$, which greatly reduces the number of parameters. Sharing weights is meaningful to some extent, e.g., low-level edge features of an image are independent of where they appear in the image. It can be less meaningful in some scenarios; for example, if the input images are faces, the eyes and hair appear at different positions and we may want to learn different features at different positions (see the [Stanford open course](http://cs231n.github.io/convolutional-networks/)). Note that weights are shared only among neurons of the same depth slice; a convolutional layer usually uses multiple groups of kernels to extract different features, corresponding to different depth slices, and neurons of different depth slices do not share weights. In addition, the bias is shared by all neurons of the same depth slice.

From this description of the convolution computation and its properties, we can see that convolution is a linear operation and is shift-invariant, i.e., the same operation is performed at every position of the image. The local connectivity and weight sharing of the convolutional layer greatly reduce the number of parameters to learn, which also makes it easier to train large convolutional neural networks.
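The output size and parameter count of the example above can be verified with a few lines (a quick sanity check, not code from this tutorial):

```python
H = W = 5; D = 3          # input height/width and number of channels
F, S, P, K = 3, 2, 1, 2   # filter size, stride, padding, number of filter groups

H_o = (H - F + 2 * P) / S + 1    # 3
W_o = (W - F + 2 * P) / S + 1    # 3
weights = D * F * F * K          # 54 weights per convolutional layer (plus K biases)
print H_o, W_o, K                # the output feature map is 3 x 3 x 2
```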
#### Pooling layer
<p align="center">
Fig. 5. Pooling layer<br/>
</p>
Pooling is a form of nonlinear down-sampling. Its main purpose is to reduce the amount of computation by reducing the number of network parameters, and it also controls overfitting to some extent. A pooling layer is usually added after a convolutional layer. Pooling includes max pooling, average pooling, and so on. Max pooling partitions the input into non-overlapping rectangular regions and outputs the maximum value of each region, as shown in Fig. 5.
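A tiny illustration of max pooling with non-overlapping 2×2 windows (toy numbers, only to make the operation concrete):

```python
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 5, 6, 2],
               [1, 2, 3, 4]]

pooled = [[max(feature_map[r][c], feature_map[r][c + 1],
               feature_map[r + 1][c], feature_map[r + 1][c + 1])
           for c in range(0, 4, 2)]
          for r in range(0, 4, 2)]
# pooled == [[4, 2], [5, 6]]
```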
For more details about convolutional neural networks, please refer to the [Stanford open course](http://cs231n.github.io/convolutional-networks/) and the [Image Classification](https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md) tutorial.
### Common activation functions
Next, data is fed in through `layer.data` calls, and a classifier (we provide three different classifiers here) produces the classification result. During training, a loss function is computed on this result; for classification problems, the cross-entropy loss is commonly used.
```python
# This model runs on a single CPU.
paddle.init(use_gpu=False, trainer_count=1)

images = paddle.layer.data(
    name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
    name='label', type=paddle.data_type.integer_value(10))

predict = softmax_regression(images)               # softmax regression
#predict = multilayer_perceptron(images)           # multilayer perceptron
#predict = convolutional_neural_network(images)    # LeNet-5 convolutional neural network

cost = paddle.layer.classification_cost(input=predict, label=label)
```
Then, specify the training-related parameters.
- regularization: a way to prevent the network from overfitting; here L2 regularization is used.
```python
parameters = paddle.parameters.create(cost)

optimizer = paddle.optimizer.Momentum(
    learning_rate=0.1 / 128.0,
    momentum=0.9,
    regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))

trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=optimizer)
```
Next, we start the training process. `paddle.dataset.mnist.train()` and `paddle.dataset.mnist.test()` serve as the training and test data sets, respectively. Each of these two functions returns a reader: in PaddlePaddle, a reader is a Python function that returns a Python yield generator each time it is called.

`shuffle` below is a reader decorator. It takes a reader A and returns another reader B, which reads `buffer_size` training instances into a buffer, shuffles them, and then yields them one by one.
`batch` is a special decorator whose input is a reader and whose output is a batched reader. In PaddlePaddle, a reader yields one training instance at a time, while a batched reader yields a minibatch at a time.
```python
lists = []

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(reader=paddle.reader.batched(
            paddle.dataset.mnist.test(), batch_size=128))
        print "Test with Pass %d, Cost %f, %s\n" % (
            event.pass_id, result.cost, result.metrics)
        lists.append((event.pass_id, result.cost,
                      result.metrics['classification_error_evaluator']))

trainer.train(
    reader=paddle.reader.batched(
        paddle.reader.shuffle(
            paddle.dataset.mnist.train(), buf_size=8192),
        batch_size=128),
    event_handler=event_handler,
    num_passes=100)
```
The training process is fully automatic; the log printed by `event_handler` looks like the following:
```
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
```
After training, check the model's prediction accuracy. With MNIST, the softmax regression model usually reaches a classification accuracy of about 92.34%, the multilayer perceptron 97.66%, and the convolutional neural network up to 99.20%.
$$ crossentropy(label, y) = -\sum_i label_i \log(y_i) $$
Fig. 2 shows a softmax regression network, with weights in blue and bias in red. +1 indicates that the bias is 1.
<p align="center">
<img src="image/softmax_regression_en.png" width=400><br/>
2. After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.
3. Finally, after the output layer, we get $Y=softmax(W_3H_2 + b_3)$, the final classification result vector.

Fig. 3 is a Multilayer Perceptron network, with weights in blue and bias in red. +1 indicates that the bias is 1.
<p align="center">
<img src="image/mlp_en.png" width=500><br/>
## Data Preparation
PaddlePaddle provides a Python module, `paddle.dataset.mnist`, which downloads and caches the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). The cache is under `/home/username/.cache/paddle/dataset/mnist`:
| File name | Description |
|----------------------|-------------------------|
|t10k-images-idx3-ubyte | Evaluation images, 10,000 |
|t10k-labels-idx1-ubyte | Evaluation labels, 10,000 |
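As a quick check that the module behaves as described (the variable names below are just for illustration), a reader can be created and a single instance pulled from it:

```python
import paddle.v2 as paddle

train_reader = paddle.dataset.mnist.train()  # a reader: calling it returns a generator
image, label = next(train_reader())          # a 784-dim normalized image and its int label
```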
## Model Configuration

A PaddlePaddle program starts from importing the API package:
```python
import paddle.v2 as paddle
```
We want to use this program to demonstrate multiple kinds of models. Let us define each of them as a Python function:

- Softmax regression: the network has a fully-connected output layer with softmax activation:
```python
def softmax_regression(img):
    predict = paddle.layer.fc(input=img,
                              size=10,
                              act=paddle.activation.Softmax())
    return predict
```
- Multilayer perceptron: this network has two hidden fully-connected layers with ReLU activation, and a fully-connected output layer with softmax activation:
```python
def multilayer_perceptron(img):
    hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu())
    hidden2 = paddle.layer.fc(input=hidden1,
                              size=64,
                              act=paddle.activation.Relu())
    predict = paddle.layer.fc(input=hidden2,
                              size=10,
                              act=paddle.activation.Softmax())
    return predict
```
- Convolutional network LeNet-5: the input image is fed through two convolution-pooling layers, a fully-connected layer, and a softmax output layer:
```python ```python
def convolutional_neural_network(img): def convolutional_neural_network(img):
# First convolutional layer - pooling layer
conv_pool_1 = simple_img_conv_pool( conv_pool_1 = paddle.networks.simple_img_conv_pool(
input=img, input=img,
filter_size=5, filter_size=5,
num_filters=20, num_filters=20,
num_channel=1, num_channel=1,
pool_size=2, pool_size=2,
pool_stride=2, pool_stride=2,
act=TanhActivation()) act=paddle.activation.Tanh())
# Second convolutional layer - pooling layer
conv_pool_2 = simple_img_conv_pool( conv_pool_2 = paddle.networks.simple_img_conv_pool(
input=conv_pool_1, input=conv_pool_1,
filter_size=5, filter_size=5,
num_filters=50, num_filters=50,
num_channel=20, num_channel=20,
pool_size=2, pool_size=2,
pool_stride=2, pool_stride=2,
act=TanhActivation()) act=paddle.activation.Tanh())
# Fully connected layer
fc1 = fc_layer(input=conv_pool_2, size=128, act=TanhActivation())
# Output layer as fully connected layer and softmax activation. The size must be 10.
predict = fc_layer(input=fc1, size=10, act=SoftmaxActivation())
return predict
```
## Training Model
### Training Commands and Logs
1.Configure `train.sh` to execute training:
```bash fc1 = paddle.layer.fc(input=conv_pool_2,
config=mnist_model.py # Select network in mnist_model.py size=128,
output=./softmax_mnist_model act=paddle.activation.Tanh())
log=softmax_train.log
paddle train \ predict = paddle.layer.fc(input=fc1,
--config=$config \ # Scripts for network configuration. size=10,
--dot_period=10 \ # After `dot_period` steps, print one `.` act=paddle.activation.Softmax())
--log_period=100 \ # Print a log every batchs return predict
--test_all_data_in_one_period=1 \ # Whether to use all data in every test
--use_gpu=0 \ # Whether to use GPU
--trainer_count=1 \ # Number of CPU or GPU
--num_passes=100 \ # Passes for training (One pass uses all data.)
--save_dir=$output \ # Path to saved model
2>&1 | tee $log
python -m paddle.utils.plotcurve -i $log > plot.png
``` ```
PaddlePaddle provides a special layer `layer.data` for reading data. Let us create a data layer for reading images and connect it to a classification network created with one of the above three functions. We also need a cost layer for training the model.
```python
paddle.init(use_gpu=False, trainer_count=1)

images = paddle.layer.data(
    name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
    name='label', type=paddle.data_type.integer_value(10))
predict = softmax_regression(images)
#predict = multilayer_perceptron(images) # uncomment for MLP
#predict = convolutional_neural_network(images) # uncomment for LeNet5
cost = paddle.layer.classification_cost(input=predict, label=label)
```
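To make the data shapes concrete, here is a minimal NumPy-only sketch of what one instance fed to these two layers looks like: a flattened 28x28 grayscale image for the `pixel` layer and a single integer in `[0, 10)` for the `label` layer. The normalization range and the sample values are assumptions for illustration, not taken from the MNIST reader itself.
```python
import numpy as np

image = np.random.rand(28, 28)      # stand-in for one grayscale digit image
pixel_vector = image.reshape(784)   # shape expected by paddle.data_type.dense_vector(784)
label = 7                           # value expected by paddle.data_type.integer_value(10)
print(pixel_vector.shape)           # (784,)
```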
Now it is time to specify the training parameters. The number 0.9 in the `Momentum` optimizer below means that 90% of the current momentum comes from the momentum of the previous iteration.
```python
parameters = paddle.parameters.create(cost)

optimizer = paddle.optimizer.Momentum(
    learning_rate=0.1 / 128.0,
    momentum=0.9,
    regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))

trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=optimizer)
```
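For intuition, here is a NumPy sketch of the standard momentum update (the velocity keeps 90% of its previous value); this is the textbook formulation under our own simplifications, not PaddlePaddle's internal implementation, which also folds in the L2 regularization configured above.
```python
import numpy as np

def momentum_step(w, velocity, grad, learning_rate=0.1 / 128.0, momentum=0.9):
    # standard momentum update: keep 90% of the previous velocity, add a gradient step
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

w = np.zeros(3)
v = np.zeros(3)
for grad in [np.array([1.0, -2.0, 0.5])] * 3:   # pretend the same gradient arrives 3 times
    w, v = momentum_step(w, v, grad)
print(w)
```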
Then we specify the training data `paddle.dataset.mnist.train()` and testing data `paddle.dataset.mnist.test()`. These two functions are *reader creators*: once called, each returns a *reader*. A reader is a Python function which, once called, returns a Python generator that yields instances of data.

Here `shuffle` is a reader decorator. It takes a reader A as its parameter and returns a new reader B; B calls A to read `buffer_size` data instances at a time into a buffer, then shuffles and yields the instances in the buffer. If you want more thoroughly shuffled data, try a larger buffer size.

`batch` is a special decorator whose input is a reader and whose output is a *batch reader*, which yields a minibatch at a time instead of a single instance.
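The reader and batch-reader concepts are plain Python; the following framework-free sketch illustrates them. The names `toy_reader` and `make_batches` are our own and are not PaddlePaddle APIs.
```python
def toy_reader():
    # a reader: a function that, when called, returns a generator over single instances
    for i in range(10):
        yield ([float(i)] * 784, i % 10)   # (features, label), shaped like one MNIST instance

def make_batches(reader, batch_size):
    # a batch reader: groups instances from `reader` into minibatches
    def batch_reader():
        batch = []
        for instance in reader():
            batch.append(instance)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    return batch_reader

for minibatch in make_batches(toy_reader, 4)():
    print(len(minibatch))   # 4, 4, 2
```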
```python
lists = []

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(reader=paddle.reader.batched(
            paddle.dataset.mnist.test(), batch_size=128))
        print "Test with Pass %d, Cost %f, %s\n" % (
            event.pass_id, result.cost, result.metrics)
        lists.append((event.pass_id, result.cost,
                      result.metrics['classification_error_evaluator']))

trainer.train(
    reader=paddle.reader.batched(
        paddle.reader.shuffle(
            paddle.dataset.mnist.train(), buf_size=8192),
        batch_size=128),
    event_handler=event_handler,
    num_passes=100)
```
During training, `trainer.train` invokes `event_handler` for certain events. This gives us a chance to print the training progress.
```
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
```
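The event handler can do more than print. As a sketch, assuming the same `paddle` and `trainer` objects defined above, one might also collect costs in a plain Python list for plotting later; the `costs` list and the handler name here are our own additions, not part of the tutorial code.
```python
costs = []   # our own container for plotting later

def logging_event_handler(event):
    # record the cost of every 100th batch and report test error after each pass
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            costs.append((event.pass_id, event.batch_id, event.cost))
    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(reader=paddle.reader.batched(
            paddle.dataset.mnist.test(), batch_size=128))
        print("Pass %d test metrics: %s" % (event.pass_id, result.metrics))
```
Passing `event_handler=logging_event_handler` to `trainer.train` would then populate `costs` as training proceeds.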
After the training, we can check the model's prediction accuracy.
```python
# find the best pass
best = sorted(lists, key=lambda list: float(list[1]))[0]
print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
```
Usually, with MNIST data, the softmax regression model reaches an accuracy around 92.34%, the MLP about 97.66%, and the convolutional network up to around 99.20%. Convolution layers are widely considered a great invention for image processing.
## Conclusion
This tutorial describes a few basic Deep Learning models: softmax regression, the Multilayer Perceptron, and the Convolutional Neural Network. Subsequent tutorials will derive more sophisticated models from these, so it is crucial to understand them before moving on. As our model evolved from simple softmax regression to the slightly more complex Convolutional Neural Network, recognition accuracy on the MNIST dataset improved substantially, thanks to the convolutional layers' local connections and parameter sharing. While learning new models later, we encourage readers to understand the key ideas that let a new model improve on an old one. Moreover, this tutorial introduced the basic flow of PaddlePaddle model design, from the data provider and model layer construction to training and prediction. Readers can take the flow used in this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their own choice.
@@ -83,7 +83,7 @@ $$ y_i = softmax(\sum_j W_{i,j}x_j + b_i) $$
$$ crossentropy(label, y) = -\sum_i label_i \log(y_i) $$
Figure 2 shows the softmax regression network, where weights are drawn as lines, biases as red lines, and +1 indicates that the bias coefficient is 1.
<p align="center">
<img src="image/softmax_regression.png" width=400><br/>
@@ -99,7 +99,7 @@ Softmax回归模型采用了最简单的两层神经网络,即只有输入层
3. Finally, the output layer produces $Y=softmax(W_3H_2 + b_3)$, which is the final classification result vector.
Figure 3 shows the multilayer perceptron network, where weights are drawn as lines, biases as red lines, and +1 indicates that the bias coefficient is 1.
<p align="center">
<img src="image/mlp.png" width=500><br/>
@@ -236,20 +236,19 @@ def convolutional_neural_network(img):
Next, we read the data through a `layer.data` call and feed it to a classifier (three different classifiers are provided here) to obtain the classification result. During training we compute a loss function on this result; for classification problems, the cross-entropy loss is the usual choice.
```python
# This model runs on a single CPU.
paddle.init(use_gpu=False, trainer_count=1)

images = paddle.layer.data(
    name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
    name='label', type=paddle.data_type.integer_value(10))

predict = softmax_regression(images) # softmax regression
#predict = multilayer_perceptron(images) # multilayer perceptron
#predict = convolutional_neural_network(images) # LeNet-5 convolutional neural network

cost = paddle.layer.classification_cost(input=predict, label=label)
```
Next, specify the training-related parameters.
@@ -258,84 +257,61 @@ def main():
- Regularization: a technique to prevent overfitting; here L2 regularization is used.
```python
parameters = paddle.parameters.create(cost)

optimizer = paddle.optimizer.Momentum(
    learning_rate=0.1 / 128.0,
    momentum=0.9,
    regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))

trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=optimizer)
```
Next, we start the training process. `paddle.dataset.mnist.train()` and `paddle.dataset.mnist.test()` provide the training and test datasets. Each of these functions returns a reader: in PaddlePaddle, a reader is a Python function that returns a Python generator each time it is called.

`shuffle` below is a reader decorator. It takes a reader A and returns another reader B; B reads `buffer_size` training instances from A into a buffer at a time, shuffles their order, and then yields them one by one.

`batch` is a special decorator whose input is a reader and whose output is a batched reader: in PaddlePaddle, a reader yields one training instance at a time, while a batched reader yields a minibatch at a time.
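The buffered shuffling described above can be sketched in a few lines of plain Python. The function below is an illustration of the idea only, not the PaddlePaddle implementation, and the names `shuffled` and `tiny_reader` are our own.
```python
import random

def shuffled(reader, buffer_size):
    # fill a buffer with up to `buffer_size` instances, shuffle it, then yield one by one
    def new_reader():
        buffer = []
        for instance in reader():
            buffer.append(instance)
            if len(buffer) >= buffer_size:
                random.shuffle(buffer)
                for item in buffer:
                    yield item
                buffer = []
        random.shuffle(buffer)
        for item in buffer:
            yield item
    return new_reader

def tiny_reader():
    for i in range(10):
        yield i

print(list(shuffled(tiny_reader, 4)()))   # the same 10 numbers, shuffled within each buffer
```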
```python
lists = []

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(reader=paddle.reader.batched(
            paddle.dataset.mnist.test(), batch_size=128))
        print "Test with Pass %d, Cost %f, %s\n" % (
            event.pass_id, result.cost, result.metrics)
        lists.append((event.pass_id, result.cost,
                      result.metrics['classification_error_evaluator']))

trainer.train(
    reader=paddle.reader.batched(
        paddle.reader.shuffle(
            paddle.dataset.mnist.train(), buf_size=8192),
        batch_size=128),
    event_handler=event_handler,
    num_passes=100)
```
The training process is fully automatic. The log printed in `event_handler` looks like the following:
```
# Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
```
After training, check the model's prediction accuracy. When trained on MNIST, the softmax regression model typically reaches a classification accuracy of about 92.34%, the multilayer perceptron about 97.66%, and the convolutional neural network up to about 99.20%.
## Summary
import paddle.v2 as paddle
def softmax_regression(img):
predict = paddle.layer.fc(input=img,
size=10,
act=paddle.activation.Softmax())
return predict
def multilayer_perceptron(img):
# The first fully-connected layer
hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu())
# The second fully-connected layer with the corresponding activation function
hidden2 = paddle.layer.fc(input=hidden1,
size=64,
act=paddle.activation.Relu())
# The third fully-connected layer, note that the hidden size should be 10,
# which is the number of unique digits
predict = paddle.layer.fc(input=hidden2,
size=10,
act=paddle.activation.Softmax())
return predict
def convolutional_neural_network(img):
# first conv layer
conv_pool_1 = paddle.networks.simple_img_conv_pool(
input=img,
filter_size=5,
num_filters=20,
num_channel=1,
pool_size=2,
pool_stride=2,
act=paddle.activation.Tanh())
# second conv layer
conv_pool_2 = paddle.networks.simple_img_conv_pool(
input=conv_pool_1,
filter_size=5,
num_filters=50,
num_channel=20,
pool_size=2,
pool_stride=2,
act=paddle.activation.Tanh())
# The first fully-connected layer
fc1 = paddle.layer.fc(input=conv_pool_2,
size=128,
act=paddle.activation.Tanh())
# The softmax layer, note that the hidden size should be 10,
# which is the number of unique digits
predict = paddle.layer.fc(input=fc1,
size=10,
act=paddle.activation.Softmax())
return predict
paddle.init(use_gpu=False, trainer_count=1)
# define network topology
images = paddle.layer.data(
name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(name='label', type=paddle.data_type.integer_value(10))
# Here we can build the prediction network in different ways. Please
# choose one by uncommenting the corresponding line.
predict = softmax_regression(images)
#predict = multilayer_perceptron(images)
#predict = convolutional_neural_network(images)
cost = paddle.layer.classification_cost(input=predict, label=label)
parameters = paddle.parameters.create(cost)
optimizer = paddle.optimizer.Momentum(
learning_rate=0.1 / 128.0,
momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
lists = []
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "Pass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=paddle.reader.batched(
paddle.dataset.mnist.test(), batch_size=128))
print "Test with Pass %d, Cost %f, %s\n" % (event.pass_id, result.cost,
result.metrics)
lists.append((event.pass_id, result.cost,
result.metrics['classification_error_evaluator']))
trainer.train(
reader=paddle.reader.batched(
paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=8192),
batch_size=128),
event_handler=event_handler,
num_passes=100)
# find the best pass
best = sorted(lists, key=lambda list: float(list[1]))[0]
print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
@@ -111,7 +111,7 @@ Given the feature vectors of users and movies, we compute the relevance using co
<p align="center">
<img src="image/rec_regression_network_en.png" width="90%" ><br/>
Figure 3. A hybrid recommendation model.
</p>
@@ -93,372 +93,210 @@ $$ h_t=Recurrent(x_t,h_{t-1})$$
<img src="image/stacked_lstm.jpg" width=450><br/>
Figure 4. Stacked bidirectional LSTM for text classification
</p>
## Example Program
### Dataset Introduction
We use the [IMDB sentiment analysis dataset](http://ai.stanford.edu/%7Eamaas/data/sentiment/) as an example. The training and test sets of the IMDB dataset each contain 25,000 labeled movie reviews; negative reviews score 4 or less and positive reviews score 7 or more, out of 10.

Paddle implements automatic downloading and reading of the IMDB dataset in `dataset/imdb.py`, and provides APIs for reading the dictionary, the training data, the test data, and so on.
```
import sys
import paddle.trainer_config_helpers.attrs as attrs
from paddle.trainer_config_helpers.poolings import MaxPooling
import paddle.v2 as paddle
```
## Model Configuration
In this example we implement two text classification algorithms, based on the [text convolutional neural network](#文本卷积神经网络(CNN)) and the [stacked bidirectional LSTM](#栈式双向LSTM(Stacked Bidirectional LSTM)) described above.
### Text Convolutional Neural Network
```
def convolution_net(input_dim,
                    class_dim=2,
                    emb_dim=128,
                    hid_dim=128):
    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    emb = paddle.layer.embedding(input=data, size=emb_dim)
    conv_3 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=3, hidden_size=hid_dim)
    conv_4 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=4, hidden_size=hid_dim)
    output = paddle.layer.fc(input=[conv_3, conv_4],
                             size=class_dim,
                             act=paddle.activation.Softmax())
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
    cost = paddle.layer.classification_cost(input=output, label=lbl)
    return cost
```
The network input `input_dim` is the size of the word dictionary and `class_dim` is the number of categories. Here the [`sequence_conv_pool`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py) API implements both the convolution and the pooling operation.
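For intuition, the core computation of a sequence convolution followed by max pooling over time can be sketched with NumPy alone. This is only the idea (sliding windows of `context_len` word vectors, a linear filter bank, a nonlinearity, then a max over positions); the real `sequence_conv_pool` network adds projections, biases and its own activation, and the function below is our own illustration, not its implementation.
```python
import numpy as np

def sequence_conv_max_pool(emb, context_len, hidden_size):
    # emb: (seq_len, emb_dim) word embeddings of one sentence
    seq_len, emb_dim = emb.shape
    rng = np.random.RandomState(0)
    W = rng.randn(context_len * emb_dim, hidden_size) * 0.1    # one bank of convolution filters
    windows = np.array([emb[i:i + context_len].reshape(-1)     # every window of context_len words
                        for i in range(seq_len - context_len + 1)])
    conv = np.tanh(windows.dot(W))                             # (num_windows, hidden_size)
    return conv.max(axis=0)                                    # max pooling over time

sentence_emb = np.random.RandomState(1).randn(7, 128)          # 7 words, 128-dim embeddings
print(sequence_conv_max_pool(sentence_emb, context_len=3, hidden_size=128).shape)  # (128,)
```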
### Stacked Bidirectional LSTM
```
def stacked_lstm_net(input_dim,
                     class_dim=2,
                     emb_dim=128,
                     hid_dim=512,
                     stacked_num=3):
    """
    A wrapper for the sentiment classification task.
    This network uses a bi-directional recurrent network consisting of
    three LSTM layers. The configuration follows the paper below, but
    uses fewer layers.
        http://www.aclweb.org/anthology/P15-1109
    input_dim: word dictionary dimension.
    class_dim: number of categories.
    emb_dim: dimension of the word embedding.
    hid_dim: dimension of the hidden layer.
    stacked_num: number of stacked lstm-hidden layers.
    """
    assert stacked_num % 2 == 1

    layer_attr = attrs.ExtraLayerAttribute(drop_rate=0.5)
    fc_para_attr = attrs.ParameterAttribute(learning_rate=1e-3)
    lstm_para_attr = attrs.ParameterAttribute(initial_std=0., learning_rate=1.)
    para_attr = [fc_para_attr, lstm_para_attr]
    bias_attr = attrs.ParameterAttribute(initial_std=0., l2_rate=0.)
    relu = paddle.activation.Relu()
    linear = paddle.activation.Linear()

    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    emb = paddle.layer.embedding(input=data, size=emb_dim)

    fc1 = paddle.layer.fc(input=emb,
                          size=hid_dim,
                          act=linear,
                          bias_attr=bias_attr)
    lstm1 = paddle.layer.lstmemory(
        input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr)

    inputs = [fc1, lstm1]
    for i in range(2, stacked_num + 1):
        fc = paddle.layer.fc(input=inputs,
                             size=hid_dim,
                             act=linear,
                             param_attr=para_attr,
                             bias_attr=bias_attr)
        lstm = paddle.layer.lstmemory(
            input=fc,
            reverse=(i % 2) == 0,
            act=relu,
            bias_attr=bias_attr,
            layer_attr=layer_attr)
        inputs = [fc, lstm]

    fc_last = paddle.layer.pooling(input=inputs[0], pooling_type=MaxPooling())
    lstm_last = paddle.layer.pooling(input=inputs[1], pooling_type=MaxPooling())
    output = paddle.layer.fc(input=[fc_last, lstm_last],
                             size=class_dim,
                             act=paddle.activation.Softmax(),
                             bias_attr=bias_attr,
                             param_attr=para_attr)

    lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
    cost = paddle.layer.classification_cost(input=output, label=lbl)
    return cost
```
The network parameter `stacked_num` is the number of LSTM layers; it must be odd so that the topmost LSTM layer runs in the forward direction. In Paddle, an LSTM-based recurrent layer is built from an fc layer plus an lstmemory layer.
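The `reverse=(i % 2) == 0` flag in the loop above alternates the direction of successive LSTM layers. A quick standalone check of the resulting pattern for `stacked_num=3` (the default used here):
```python
stacked_num = 3
directions = ['forward']                      # lstm1 runs forward
for i in range(2, stacked_num + 1):
    directions.append('backward' if i % 2 == 0 else 'forward')
print(directions)   # ['forward', 'backward', 'forward'] -- the top layer is forward
```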
## Training the Model
```
if __name__ == '__main__':
    # init
    paddle.init(use_gpu=False)
```
This starts the Paddle program; `use_gpu=False` means training on the CPU. If your system supports GPUs, you can set it to True to train on a GPU.

### Training Data
Use the APIs in Paddle's `dataset.imdb` package to read the training data.
```
print 'load dictionary...'
word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict)
class_dim = 2
```
Load the data dictionary; the `word_dict()` API builds the dictionary directly. `class_dim` is the number of sample categories; in this example there are only two, positive and negative.
```
train_reader = paddle.batch(
    paddle.reader.shuffle(
        lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
    batch_size=100)
test_reader = paddle.batch(
    lambda: paddle.dataset.imdb.test(word_dict),
    batch_size=100)
```
Here `dataset.imdb.train()` and `dataset.imdb.test()` are the training-data and test-data APIs in `dataset.imdb`. `train_reader` is used during training: it shuffles the data it reads and assembles it into batches. Likewise, `test_reader` is used during testing and assembles the test data into batches.
```
reader_dict={'word': 0, 'label': 1}
```
`reader_dict` specifies how the data returned by `train_reader` and `test_reader` correspond to the data layers in the model configuration: column 0 of the data returned by a reader feeds the `word` layer, and column 1 feeds the `label` layer.
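To make the mapping concrete, each instance yielded by the IMDB readers can be thought of as a `(word_ids, label)` tuple; column 0 then feeds the `word` data layer and column 1 the `label` layer. The sample values below are made up for illustration and are not taken from the dataset.
```python
# one hypothetical instance as yielded by the imdb readers: (word ids, label)
sample = ([102, 7, 963, 5], 1)

reader_dict = {'word': 0, 'label': 1}
word_input = sample[reader_dict['word']]    # -> [102, 7, 963, 5], goes to the "word" layer
label_input = sample[reader_dict['label']]  # -> 1, goes to the "label" layer
print(word_input)
print(label_input)
```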
### Building the Model
```
# Please choose the way to build the network
# by uncommenting the corresponding line.
cost = convolution_net(dict_dim, class_dim=class_dim)
# cost = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3)
```
The example uses the `convolution_net` network by default; to use `stacked_lstm_net` instead, uncomment the corresponding line. Here `cost` is the optimization target of the network, and it also carries the topology information of the whole network.
### Network Parameters
```
# create parameters
parameters = paddle.parameters.create(cost)
```
Create the network parameters from the network topology. Here `parameters` is the parameter collection of the whole network.
### Optimization Algorithm
```
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
    learning_rate=2e-3,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4),
    model_average=paddle.optimizer.ModelAverage(average_window=0.5))
```
Paddle provides a series of optimizer APIs; here we use the Adam optimizer.
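For intuition, a NumPy sketch of the standard Adam update rule follows (first and second moment estimates with bias correction). This is the textbook formulation only; PaddlePaddle's optimizer additionally applies the L2 regularization and model averaging configured above, and the helper name `adam_step` is our own.
```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # update biased first/second moment estimates, then correct the bias
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 4):                      # three updates with the same toy gradient
    w, m, v = adam_step(w, np.array([0.3, -0.1]), m, v, t)
print(w)
```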
### Training
A trainer can be constructed with `paddle.trainer.SGD`, and `trainer.train` is then called to train the model.
```
# End batch and end pass event handler
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "\nPass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)
        else:
            sys.stdout.write('.')
            sys.stdout.flush()
    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(reader=test_reader, reader_dict=reader_dict)
        print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
```
You can pass an `event_handler` to the train function to observe the state at the end of each batch and each pass. The handler above, for instance, prints the cost and error every 100 batches, and at the end of each pass calls `trainer.test` to run through the test set and obtain the current model's error on it.
```
# create trainer
trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=adam_optimizer)

trainer.train(
    reader=train_reader,
    event_handler=event_handler,
    reader_dict=reader_dict,
    num_passes=2)
```
The output of the program looks like the following.
```
Pass 0, Batch 0, Cost 0.693721, {'classification_error_evaluator': 0.5546875}
...................................................................................................
Pass 0, Batch 100, Cost 0.294321, {'classification_error_evaluator': 0.1015625}
...............................................................................................
Test with Pass 0, {'classification_error_evaluator': 0.11432000249624252}
```
## Summary
In this chapter we used sentiment analysis as an example to introduce end-to-end short-text classification with deep learning, and carried out all the experiments with PaddlePaddle. We also briefly introduced two text-processing models: the convolutional neural network and the recurrent neural network. In later chapters we will see these two basic deep learning models applied to other tasks.
## References
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import paddle.trainer_config_helpers.attrs as attrs
from paddle.trainer_config_helpers.poolings import MaxPooling
import paddle.v2 as paddle
def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(input_dim))
emb = paddle.layer.embedding(input=data, size=emb_dim)
conv_3 = paddle.networks.sequence_conv_pool(
input=emb, context_len=3, hidden_size=hid_dim)
conv_4 = paddle.networks.sequence_conv_pool(
input=emb, context_len=4, hidden_size=hid_dim)
output = paddle.layer.fc(input=[conv_3, conv_4],
size=class_dim,
act=paddle.activation.Softmax())
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
def stacked_lstm_net(input_dim,
class_dim=2,
emb_dim=128,
hid_dim=512,
stacked_num=3):
"""
A wrapper for the sentiment classification task.
This network uses a bi-directional recurrent network consisting of
three LSTM layers. The configuration follows the paper below, but
uses fewer layers.
http://www.aclweb.org/anthology/P15-1109
input_dim: here is word dictionary dimension.
class_dim: number of categories.
emb_dim: dimension of word embedding.
hid_dim: dimension of hidden layer.
stacked_num: number of stacked lstm-hidden layer.
"""
assert stacked_num % 2 == 1
layer_attr = attrs.ExtraLayerAttribute(drop_rate=0.5)
fc_para_attr = attrs.ParameterAttribute(learning_rate=1e-3)
lstm_para_attr = attrs.ParameterAttribute(initial_std=0., learning_rate=1.)
para_attr = [fc_para_attr, lstm_para_attr]
bias_attr = attrs.ParameterAttribute(initial_std=0., l2_rate=0.)
relu = paddle.activation.Relu()
linear = paddle.activation.Linear()
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(input_dim))
emb = paddle.layer.embedding(input=data, size=emb_dim)
fc1 = paddle.layer.fc(input=emb,
size=hid_dim,
act=linear,
bias_attr=bias_attr)
lstm1 = paddle.layer.lstmemory(
input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr)
inputs = [fc1, lstm1]
for i in range(2, stacked_num + 1):
fc = paddle.layer.fc(input=inputs,
size=hid_dim,
act=linear,
param_attr=para_attr,
bias_attr=bias_attr)
lstm = paddle.layer.lstmemory(
input=fc,
reverse=(i % 2) == 0,
act=relu,
bias_attr=bias_attr,
layer_attr=layer_attr)
inputs = [fc, lstm]
fc_last = paddle.layer.pooling(input=inputs[0], pooling_type=MaxPooling())
lstm_last = paddle.layer.pooling(input=inputs[1], pooling_type=MaxPooling())
output = paddle.layer.fc(input=[fc_last, lstm_last],
size=class_dim,
act=paddle.activation.Softmax(),
bias_attr=bias_attr,
param_attr=para_attr)
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
if __name__ == '__main__':
# init
paddle.init(use_gpu=False)
#data
print 'load dictionary...'
word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict)
class_dim = 2
train_reader = paddle.batch(
paddle.reader.shuffle(
lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
batch_size=100)
test_reader = paddle.batch(
lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
reader_dict = {'word': 0, 'label': 1}
# network config
# Please choose the way to build the network
# by uncommenting the corresponding line.
cost = convolution_net(dict_dim, class_dim=class_dim)
# cost = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3)
# create parameters
parameters = paddle.parameters.create(cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=2e-3,
regularization=paddle.optimizer.L2Regularization(rate=8e-4),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
# End batch and end pass event handler
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, reader_dict=reader_dict)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
# create trainer
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=adam_optimizer)
trainer.train(
reader=train_reader,
event_handler=event_handler,
reader_dict=reader_dict,
num_passes=2)
@@ -194,7 +194,7 @@ As illustrated in the figure above, skip-gram model maps the word embedding of t
## Model Configuration
<p align="center">
<img src="image/ngram.en.png" width=400><br/>
Figure 5. N-gram neural network model in model configuration
</p>
@@ -182,7 +182,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
## Data Preparation
### Dataset Introduction
This tutorial uses the Penn Tree Bank (PTB) dataset. PTB is small and fast to train, and it is the dataset used in Mikolov's public language-model training tool \[[2](#参考文献)\]. Its statistics are as follows:
@@ -206,109 +206,24 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
</table>
</p>
### Data Preprocessing
This chapter trains a 5-gram model, meaning that during PaddlePaddle training, the first 4 words of each data instance are used to predict the 5th word. PaddlePaddle provides the Python package `paddle.dataset.imikolov` for the PTB dataset, which downloads and preprocesses the data automatically.

Preprocessing adds the begin symbol `<s>` and the end symbol `<e>` to the front and back of every sentence in the dataset. Then, according to the window size (5 in this tutorial), a window slides from left to right one step at a time, and each window position generates one data instance.

For example, the sentence "I have a dream that one day" yields 5 instances:
```text
<s> I have a dream
I have a dream that
have a dream that one
a dream that one day
dream that one day <e>
```
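A pure-Python sketch of this padding-plus-sliding-window step is shown below. It mirrors what the preprocessing is described as doing, but it is our own illustration, not the actual `paddle.dataset.imikolov` implementation.
```python
N = 5

def ngrams(sentence, n=N):
    words = ['<s>'] + sentence.split() + ['<e>']
    # slide a window of size n from left to right, one step at a time
    return [words[i:i + n] for i in range(len(words) - n + 1)]

for gram in ngrams("I have a dream that one day"):
    print(' '.join(gram))
# <s> I have a dream
# ...
# dream that one day <e>
```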
## Implementation
The model structure of this configuration is shown in the figure below:
@@ -317,94 +232,132 @@ settings(
Figure 5. The N-gram neural network model in the model configuration
</p>
First, load the required packages:
```python
import math
import paddle.v2 as paddle
```
Then, define the parameters:
```python
embsize = 32 # word embedding dimension
hiddensize = 256 # hidden layer dimension
N = 5 # train a 5-gram model
```
Next, define the network structure:

- Map the $n-1$ words $w_{t-n+1},...w_{t-1}$ before $w_t$ to D-dimensional word vectors (here D=32) through a $|V|\times D$ matrix.

```python
def wordemb(inlayer):
    wordemb = paddle.layer.table_projection(
        input=inlayer,
        size=embsize,
        param_attr=paddle.attr.Param(
            name="_proj",
            initial_std=0.001,
            learning_rate=1,
            l2_rate=0, ))
    return wordemb
```

- Define the names and data types accepted by the input layers.

```python
def main():
    paddle.init(use_gpu=False, trainer_count=1) # initialize PaddlePaddle
    word_dict = paddle.dataset.imikolov.build_dict()
    dict_size = len(word_dict)
    # each input layer accepts integer data in the range [0, dict_size)
    firstword = paddle.layer.data(
        name="firstw", type=paddle.data_type.integer_value(dict_size))
    secondword = paddle.layer.data(
        name="secondw", type=paddle.data_type.integer_value(dict_size))
    thirdword = paddle.layer.data(
        name="thirdw", type=paddle.data_type.integer_value(dict_size))
    fourthword = paddle.layer.data(
        name="fourthw", type=paddle.data_type.integer_value(dict_size))
    nextword = paddle.layer.data(
        name="fifthw", type=paddle.data_type.integer_value(dict_size))

    Efirst = wordemb(firstword)
    Esecond = wordemb(secondword)
    Ethird = wordemb(thirdword)
    Efourth = wordemb(fourthword)
```
- Concatenate these n-1 word vectors into one large vector with concat_layer, as the feature of the history text.

```python
contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
```

- Pass the history-text feature through a fully-connected layer to obtain the hidden text feature.

```python
hidden1 = paddle.layer.fc(input=contextemb,
                          size=hiddensize,
                          act=paddle.activation.Sigmoid(),
                          layer_attr=paddle.attr.Extra(drop_rate=0.5),
                          bias_attr=paddle.attr.Param(learning_rate=2),
                          param_attr=paddle.attr.Param(
                              initial_std=1. / math.sqrt(embsize * 8),
                              learning_rate=1))
```

- Pass the hidden text feature through another fully-connected layer, mapping it to a $|V|$-dimensional vector, and apply softmax normalization to obtain the generation probabilities of the $|V|$ words.

```python
predictword = paddle.layer.fc(input=hidden1,
                              size=dict_size,
                              bias_attr=paddle.attr.Param(learning_rate=2),
                              act=paddle.activation.Softmax())
```
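To make the shapes concrete, here is a NumPy-only sketch of the forward pass that the layers above define (embedding lookup, concatenation, a sigmoid hidden layer, and a softmax over the vocabulary). The weights, the vocabulary size of 2000, and the word ids are made up for illustration; this is a sketch of the structure, not PaddlePaddle's computation.
```python
import numpy as np

rng = np.random.RandomState(0)
dict_size, embsize, hiddensize = 2000, 32, 256

E = rng.randn(dict_size, embsize) * 0.001          # shared word embedding table
W1 = rng.randn(4 * embsize, hiddensize) * 0.01     # context -> hidden
W2 = rng.randn(hiddensize, dict_size) * 0.01       # hidden -> vocabulary scores

context_ids = [12, 7, 430, 1999]                    # the 4 history words (made-up ids)
contextemb = np.concatenate([E[i] for i in context_ids])        # (128,)
hidden1 = 1.0 / (1.0 + np.exp(-contextemb.dot(W1)))             # sigmoid, (256,)
scores = hidden1.dot(W2)
probs = np.exp(scores - scores.max())
probs /= probs.sum()                                            # softmax over |V| words
print(probs.shape)                                              # (2000,)
```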
- The loss function of the network is the multi-class cross-entropy, available directly through the `classification_cost` function.

```python
cost = paddle.layer.classification_cost(input=predictword, label=nextword)
```

Next, specify the training-related parameters:

- Optimizer: the algorithm used to update the weights during training; this tutorial uses the Adam optimizer.
- Learning rate (learning_rate): the step size of each iteration, related to how fast the network converges.
- Regularization: a technique to prevent overfitting; here L2 regularization is used.

```python
parameters = paddle.parameters.create(cost)
adam_optimizer = paddle.optimizer.Adam(
    learning_rate=3e-3,
    regularization=paddle.optimizer.L2Regularization(8e-4))
trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
```
Next, we start the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide the training and test datasets. Each of these functions returns a reader: in PaddlePaddle, a reader is a Python function that returns a Python generator each time it is called.

`paddle.batch` takes a reader as input and returns a batched reader: in PaddlePaddle, a reader yields one training instance at a time, while a batched reader yields a minibatch at a time.

```python
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            result = trainer.test(
                paddle.batch(
                    paddle.dataset.imikolov.test(word_dict, N), 32))
            print "Pass %d, Batch %d, Cost %f, %s, Testing metrics %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics,
                result.metrics)

trainer.train(
    paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
    num_passes=30,
    event_handler=event_handler)
```
The training process is fully automatic. The log printed in event_handler looks like the following:
```text
.............................
```