提交 f88033ef 编写于 作者: R root

for code review

上级 0a8f16a1
......@@ -4,91 +4,29 @@
```text
.
├── data # 存储运行本例所依赖的数据
│   ├── download.sh
├── data # 存储运行本例所依赖的数据,从外部获取
├── network_conf.py # 模型定义
├── reader.py # 数据读取接口
├── reader.py # 数据读取接口, 从外部获取
├── README.md # 文档
├── train.py # 训练脚本
├── infer.py # 预测脚本
└── utils.py # 定义同样的函数
└── utils.py # 定义通用的函数, 从外部获取
└── utils_extend.py # 对utils.py的拓展
```
## 简介
## 简介,模型详解与数据说明
命名实体识别(Named Entity Recognition,NER)又称作“专名识别”,是指识别文本中具有特定意义的实体,主要包括人名、地名、机构名、专有名词等,是自然语言处理研究的一个基础问题。NER任务通常包括实体边界识别、确定实体类别两部分,可以将其作为序列标注问题解决。
参考https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md
在模型上,我们使用LSTM代替原始的RNN。
序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO标注方法](http://book.paddlepaddle.org/07.label_semantic_roles/)定义的标签集。
## 数据获取
根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中
参照https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md中的数据获取方式,将获取的data目录复制到本目录下
由于序列标注问题的广泛性,产生了[CRF](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等经典的序列模型,这些模型大多只能使用局部信息或需要人工设计特征。随着深度学习研究的发展,循环神经网络(Recurrent Neural Network,RNN等 序列模型能够处理序列元素之间前后关联问题,能够从原始输入文本中学习特征表示,而更加适合序列标注任务,更多相关知识可参考PaddleBook中[语义角色标注](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.cn.md)一课。
## 通用脚本获取
## 模型详解
NER任务的输入是"一句话",目标是识别句子中的实体边界及类别,我们参照论文\[[2](#参考文献)\]仅对原始句子进行了一些简单的预处理工作:将每个词转换为小写,并将原词是否大写另作为一个特征,共同作为模型的输入。工作流程如下:
1. 构造输入
- 输入1是句子序列,采用one-hot方式表示
- 输入2是大写标记序列,标记了句子中每一个词是否是大写,采用one-hot方式表示;
2. one-hot方式的句子序列和大写标记序列通过词表,转换为实向量表示的词向量序列;
3. 将步骤2中的2个词向量序列作为双向LSTM的输入,学习输入序列的特征表示,得到新的特性表示序列;
4. CRF以步骤3中模型学习到的特征为输入,以标记序列为监督信号,实现序列标注。
## 数据说明
在本例中,我们以 [CoNLL 2003 NER任务](http://www.clips.uantwerpen.be/conll2003/ner/)为例,原始Reuters数据由于版权原因需另外申请免费下载,请大家按照原网站说明获取。
+ 我们仅在`data`目录下的`train``test`文件中放置少数样本用以示例输入数据格式。
+ 本例依赖数据还包括
1. 输入文本的词典
2. 为词典中的词语提供预训练好的词向量
2. 标记标签的词典
标记标签词典已附在`data`目录中,对应于`data/target.txt`文件。输入文本的词典以及词典中词语的预训练的词向量来自:[Stanford CS224d](http://cs224d.stanford.edu/)课程作业。**为运行本例,请首先在`data`目录下运行`download.sh`脚本下载输入文本的词典和预训练的词向量。** 完成后会将这两个文件一并放入`data`目录下,输入文本的词典和预训练的词向量分别对应:`data/vocab.txt``data/wordVectors.txt`这两个文件。
CoNLL 2003原始数据格式如下:
```
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
```
- 第一列为原始句子序列
- 第二、三列分别为词性标签和句法分析中的语块标签,本例不使用
- 第四列为采用了 I-TYPE 方式表示的NER标签
- I-TYPE 和 BIO 方式的主要区别在于语块开始标记的使用上,I-TYPE只有在出现相邻的同类别实体时对后者使用B标记,其他均使用I标记),句子之间以空行分隔。
我们在`reader.py`脚本中完成对原始数据的处理以及读取,主要包括下面几个步骤:
1. 从原始数据文件中抽取出句子和标签,构造句子序列和标签序列;
2. 将 I-TYPE 表示的标签转换为 BIO 方式表示的标签;
3. 将句子序列中的单词转换为小写,并构造大写标记序列;
4. 依据词典获取词对应的整数索引。
预处理完成后,一条训练样本包含3个部分作为神经网络的输入信息用于训练:(1)句子序列;(2)首字母大写标记序列;(3)标注序列,下表是一条训练样本的示例:
| 句子序列 | 大写标记序列 | 标注序列 |
| -------- | ------------ | -------- |
| u.n. | 1 | B-ORG |
| official | 0 | O |
| ekeus | 1 | B-PER |
| heads | 0 | O |
| for | 0 | O |
| baghdad | 1 | B-LOC |
| . | 0 | O |
## 运行
### 编写数据读取接口
自定义数据读取接口只需编写一个 Python 生成器实现从原始输入文本中解析一条训练样本的逻辑。[reader.py](./reader.py) 中的`data_reader`函数实现了读取原始数据返回类型为: `paddle.data_type.integer_value_sequence`的 3 个输入(分别对应:词语在字典的序号、是否为大写、标注结果在字典中的序号)给`network_conf.ner_net`中定义的 3 个 `data_layer` 的功能。
本例需要使用https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/reader.py以及https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/utils.py,请将这两个文件复制到本目录下。
### 训练
......@@ -145,44 +83,38 @@ Baghdad NNP I-NP I-LOC
2. 在终端运行 `python infer.py`,开始测试,会看到如下预测结果(以下为训练70个pass所得模型的部分预测结果):
```
leicestershire B-ORG B-LOC
extended O O
their O O
first O O
innings O O
by O O
DGDG O O
runs O O
before O O
being O O
bowled O O
out O O
for O O
296 O O
with O O
england B-LOC B-LOC
discard O O
andy B-PER B-PER
caddick I-PER I-PER
taking O O
three O O
for O O
DGDG O O
. O O
```
```text
leicestershire B-ORG B-LOC
extended O O
their O O
first O O
innings O O
by O O
DGDG O O
runs O O
before O O
being O O
bowled O O
out O O
for O O
296 O O
with O O
england B-LOC B-LOC
discard O O
andy B-PER B-PER
caddick I-PER I-PER
taking O O
three O O
for O O
DGDG O O
. O O
```
输出分为三列,以“\t” 分隔,第一列是输入的词语,第二列是标准结果,第三列为生成的标记结果。多条输入序列之间以空行分隔。
## 真实结果示例
## 结果示例
<p align="center">
<img src="imgs/convergent_curve.png" width="80%" align="center"/><br/>
图1. Fluid下实验结果示例
<img src="imgs/convergence_curve.png" width="80%" align="center"/><br/>
图1. Paddle下实验结果示例, 横轴表示训练轮数,纵轴表示F1值
</p>
## 参考文献
1. Graves A. [Supervised Sequence Labelling with Recurrent Neural Networks](http://www.cs.toronto.edu/~graves/preprint.pdf)[J]. Studies in Computational Intelligence, 2013, 385.
2. Collobert R, Weston J, Bottou L, et al. [Natural Language Processing (Almost) from Scratch](http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf)[J]. Journal of Machine Learning Research, 2011, 12(1):2493-2537.
if [ -f assignment2.zip ]; then
echo "data exist"
else
wget http://cs224d.stanford.edu/assignment2/assignment2.zip
fi
if [ $? -eq 0 ];then
unzip assignment2.zip
cp assignment2_release/data/ner/wordVectors.txt ./data
cp assignment2_release/data/ner/vocab.txt ./data
rm -rf assignment2.zip assignment2_release
else
echo "download data error!" >> /dev/stderr
exit 1
fi
B-LOC
I-LOC
B-MISC
I-MISC
B-ORG
I-ORG
B-PER
I-PER
O
CRICKET NNP I-NP O
- : O O
LEICESTERSHIRE NNP I-NP I-ORG
TAKE NNP I-NP O
OVER IN I-PP O
AT NNP I-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NN I-NP O
. . O O
LONDON NNP I-NP I-LOC
1996-08-30 CD I-NP O
West NNP I-NP I-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP I-PER
Simmons NNP I-NP I-PER
took VBD I-VP O
four CD I-NP O
for IN I-PP O
38 CD I-NP O
on IN I-PP O
Friday NNP I-NP O
as IN I-PP O
Leicestershire NNP I-NP I-ORG
beat VBD I-VP O
Somerset NNP I-NP I-ORG
by IN I-PP O
an DT I-NP O
innings NN I-NP O
and CC O O
39 CD I-NP O
runs NNS I-NP O
in IN I-PP O
two CD I-NP O
days NNS I-NP O
to TO I-VP O
take VB I-VP O
over IN I-PP O
at IN B-PP O
the DT I-NP O
head NN I-NP O
of IN I-PP O
the DT I-NP O
county NN I-NP O
championship NN I-NP O
. . O O
Their PRP$ I-NP O
stay NN I-NP O
on IN I-PP O
top NN I-NP O
, , O O
though RB I-ADVP O
, , O O
may MD I-VP O
be VB I-VP O
short-lived JJ I-ADJP O
as IN I-PP O
title NN I-NP O
rivals NNS I-NP O
Essex NNP I-NP I-ORG
, , O O
Derbyshire NNP I-NP I-ORG
and CC I-NP O
Surrey NNP I-NP I-ORG
all DT O O
closed VBD I-VP O
in RP I-PRT O
on IN I-PP O
victory NN I-NP O
while IN I-SBAR O
Kent NNP I-NP I-ORG
made VBD I-VP O
up RP I-PRT O
for IN I-PP O
lost VBN I-NP O
time NN I-NP O
in IN I-PP O
their PRP$ I-NP O
rain-affected JJ I-NP O
match NN I-NP O
against IN I-PP O
Nottinghamshire NNP I-NP I-ORG
. . O O
After IN I-PP O
bowling VBG I-NP O
Somerset NNP I-NP I-ORG
out RP I-PRT O
for IN I-PP O
83 CD I-NP O
on IN I-PP O
the DT I-NP O
opening NN I-NP O
morning NN I-NP O
at IN I-PP O
Grace NNP I-NP I-LOC
Road NNP I-NP I-LOC
, , O O
Leicestershire NNP I-NP I-ORG
extended VBD I-VP O
their PRP$ I-NP O
first JJ I-NP O
innings NN I-NP O
by IN I-PP O
94 CD I-NP O
runs VBZ I-VP O
before IN I-PP O
being VBG I-VP O
bowled VBD I-VP O
out RP I-PRT O
for IN I-PP O
296 CD I-NP O
with IN I-PP O
England NNP I-NP I-LOC
discard VBP I-VP O
Andy NNP I-NP I-PER
Caddick NNP I-NP I-PER
taking VBG I-VP O
three CD I-NP O
for IN I-PP O
83 CD I-NP O
. . O O
EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O
Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O
The DT I-NP O
European NNP I-NP I-ORG
Commission NNP I-NP I-ORG
said VBD I-VP O
on IN I-PP O
Thursday NNP I-NP O
it PRP B-NP O
disagreed VBD I-VP O
with IN I-PP O
German JJ I-NP I-MISC
advice NN I-NP O
to TO I-PP O
consumers NNS I-NP O
to TO I-VP O
shun VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
until IN I-SBAR O
scientists NNS I-NP O
determine VBP I-VP O
whether IN I-SBAR O
mad JJ I-NP O
cow NN I-NP O
disease NN I-NP O
can MD I-VP O
be VB I-VP O
transmitted VBN I-VP O
to TO I-PP O
sheep NN I-NP O
. . O O
Germany NNP I-NP I-LOC
's POS B-NP O
representative NN I-NP O
to TO I-PP O
the DT I-NP O
European NNP I-NP I-ORG
Union NNP I-NP I-ORG
's POS B-NP O
veterinary JJ I-NP O
committee NN I-NP O
Werner NNP I-NP I-PER
Zwingmann NNP I-NP I-PER
said VBD I-VP O
on IN I-PP O
Wednesday NNP I-NP O
consumers NNS I-NP O
should MD I-VP O
buy VB I-VP O
sheepmeat NN I-NP O
from IN I-PP O
countries NNS I-NP O
other JJ I-ADJP O
than IN I-PP O
Britain NNP I-NP I-LOC
until IN I-SBAR O
the DT I-NP O
scientific JJ I-NP O
advice NN I-NP O
was VBD I-VP O
clearer JJR I-ADJP O
. . O O
" " O O
We PRP I-NP O
do VBP I-VP O
n't RB I-VP O
support VB I-VP O
any DT I-NP O
such JJ I-NP O
recommendation NN I-NP O
because IN I-SBAR O
we PRP I-NP O
do VBP I-VP O
n't RB I-VP O
see VB I-VP O
any DT I-NP O
grounds NNS I-NP O
for IN I-PP O
it PRP I-NP O
, , O O
" " O O
the DT I-NP O
Commission NNP I-NP I-ORG
's POS B-NP O
chief JJ I-NP O
spokesman NN I-NP O
Nikolaus NNP I-NP I-PER
van NNP I-NP I-PER
der FW I-NP I-PER
Pas NNP I-NP I-PER
told VBD I-VP O
a DT I-NP O
news NN I-NP O
briefing NN I-NP O
. . O O
He PRP I-NP O
said VBD I-VP O
further JJ I-NP O
scientific JJ I-NP O
study NN I-NP O
was VBD I-VP O
required VBN I-VP O
and CC O O
if IN I-SBAR O
it PRP I-NP O
was VBD I-VP O
found VBN I-VP O
that IN I-SBAR O
action NN I-NP O
was VBD I-VP O
needed VBN I-VP O
it PRP I-NP O
should MD I-VP O
be VB I-VP O
taken VBN I-VP O
by IN I-PP O
the DT I-NP O
European NNP I-NP I-ORG
Union NNP I-NP I-ORG
. . O O
import numpy as np
import paddle.fluid as fluid
import paddle.v2 as paddle
from network_conf import ner_net
import reader
from utils import load_dict, load_reverse_dict, to_lodtensor
from utils import load_dict, load_reverse_dict
from utils_extend import to_lodtensor
def infer(model_path, batch_size, test_data_file, vocab_file, target_file,
use_gpu):
"""
use the model under model_path to predict the test data, the result will be printed on the screen
return nothing
"""
word_dict = load_dict(vocab_file)
word_reverse_dict = load_reverse_dict(vocab_file)
......
"""
Conll03 dataset.
"""
import re
__all__ = ["data_reader"]
def canonicalize_digits(word):
if any([c.isalpha() for c in word]): return word
word = re.sub("\d", "DG", word)
if word.startswith("DG"):
word = word.replace(",", "") # remove thousands separator
return word
def canonicalize_word(word, wordset=None, digits=True):
word = word.lower()
if digits:
if (wordset != None) and (word in wordset): return word
word = canonicalize_digits(word) # try to canonicalize numbers
if (wordset == None) or (word in wordset): return word
else: return "UUUNKKK" # unknown token
def data_reader(data_file, word_dict, label_dict):
"""
The dataset can be obtained according to http://www.clips.uantwerpen.be/conll2003/ner/.
It returns a reader creator, each sample in the reader includes:
word id sequence, label id sequence and raw sentence.
:return: reader creator
:rtype: callable
"""
def reader():
UNK_IDX = word_dict["UUUNKKK"]
sentence = []
labels = []
with open(data_file, "r") as f:
for line in f:
if len(line.strip()) == 0:
if len(sentence) > 0:
word_idx = [
word_dict.get(
canonicalize_word(w, word_dict), UNK_IDX)
for w in sentence
]
mark = [1 if w[0].isupper() else 0 for w in sentence]
label_idx = [label_dict[l] for l in labels]
yield word_idx, mark, label_idx
sentence = []
labels = []
else:
segs = line.strip().split()
sentence.append(segs[0])
# transform I-TYPE to BIO schema
if segs[-1] != "O" and (len(labels) == 0 or
labels[-1][1:] != segs[-1][1:]):
labels.append("B" + segs[-1][1:])
else:
labels.append(segs[-1])
return reader
import os
import math
import numpy as np
import paddle.v2 as paddle
import paddle.fluid as fluid
import reader
from network_conf import ner_net
from utils import logger, load_dict, get_embedding, to_lodtensor
from utils import logger, load_dict
from utils_extend import to_lodtensor, get_embedding
def test(exe, chunk_evaluator, inference_program, test_data, place):
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import paddle.fluid as fluid
import numpy as np
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
import paddle.fluid as fluid
def get_embedding(emb_file='data/wordVectors.txt'):
......@@ -17,36 +9,10 @@ def get_embedding(emb_file='data/wordVectors.txt'):
return np.loadtxt(emb_file, dtype='float32')
def load_dict(dict_path):
"""
Load the word dictionary from the given file.
Each line of the given file is a word, which can include multiple columns
seperated by tab.
This function takes the first column (columns in a line are seperated by
tab) as key and takes line number of a line as the key (index of the word
in the dictionary).
"""
return dict((line.strip().split("\t")[0], idx)
for idx, line in enumerate(open(dict_path, "r").readlines()))
def load_reverse_dict(dict_path):
def to_lodtensor(data, place):
"""
Load the word dictionary from the given file.
Each line of the given file is a word, which can include multiple columns
seperated by tab.
This function takes line number of a line as the key (index of the word in
the dictionary) and the first column (columns in a line are seperated by
tab) as the value.
convert data to lodtensor
"""
return dict((idx, line.strip().split("\t")[0])
for idx, line in enumerate(open(dict_path, "r").readlines()))
def to_lodtensor(data, place):
seq_lens = [len(seq) for seq in data]
cur_len = 0
lod = [cur_len]
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册