Commit 228db9c2, author: guosheng

change according to comments on commit e43f1e2

Parent cee7f2b9
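# Named Entity Recognition
## Background
Named Entity Recognition (NER), also known as "proper name recognition", is the task of identifying entities with a specific meaning in text, mainly person names, location names, organization names, and other proper nouns. It is a fundamental problem in natural language processing research. An NER task usually consists of two parts: detecting entity boundaries and determining entity types. It can be treated as a sequence labeling problem, because both the entity boundaries and the entity types can be read directly from the labeled sequence.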
## Data
In this example we use the dataset released for the CoNLL 2003 NER task. For copyright reasons we do not provide a download of this dataset for now; it can be obtained free of charge by following the instructions on [this page](http://www.clips.uantwerpen.be/conll2003/ner/). The training and test data in this dataset are formatted as follows:
```
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
```
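The first column is the original sentence sequence (the second and third columns are the part-of-speech tags and the syntactic chunk tags, which are not used here), and the fourth column is the NER label in the I-TYPE scheme. The main difference between I-TYPE and the [BIO scheme](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles) lies in how the chunk-start marker is used: I-TYPE uses a B tag only for the second of two adjacent entities of the same type, and an I tag everywhere else. This example uses a BIO label set instead; the conversion between the two schemes is done in the provided `conll03.py` file. We also include three files for use: a word dictionary, a label dictionary, and pre-trained word vectors (the word dictionary and word vectors come from the [Stanford cs224d](http://cs224d.stanford.edu/) course assignments).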
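As an illustration of the I-TYPE to BIO conversion described above, here is a minimal standalone sketch. The function name `itype_to_bio` is ours, chosen for illustration, and is not part of `conll03.py`; it assumes the tags of one sentence are given as a list of strings.

```python
def itype_to_bio(tags):
    """Convert the I-TYPE tags of one sentence to BIO tags."""
    bio = []
    for tag in tags:
        # An entity starts when the tag is not 'O' and the previous
        # (already converted) tag is 'O' or belongs to another type.
        starts_entity = tag != 'O' and (not bio or bio[-1][1:] != tag[1:])
        bio.append('B' + tag[1:] if starts_entity else tag)
    return bio


# Fourth column of the sample sentence shown above.
itype = ['I-ORG', 'O', 'I-PER', 'O', 'O', 'I-LOC', 'O']
print(itype_to_bio(itype))
# ['B-ORG', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']
```

Every entity in the sample sentence is a single word, so each `I-` tag becomes a `B-` tag. A `B-` tag that already appears in the I-TYPE data (the second of two adjacent entities of the same type) is kept unchanged.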
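## Model
The model structure used in this example is shown in Figure 1. Its input is a sentence sequence. After the words are looked up and converted into a sequence of word vectors, features are extracted by several groups of fully connected layers and bidirectional RNNs. Finally, a CRF layer takes the learned features as input and the label sequence as the supervision signal to perform sequence labeling. More about RNNs and their variants can be found on [this page](http://book.paddlepaddle.org/06.understand_sentiment/).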
<div align="center">
<img src="image/ner_network.png" width = "40%" align=center /><br>
Figure 1. NER model network structure
</div>
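The CRF layer at the top of the network scores whole label sequences rather than individual labels. The following is a minimal pure-Python sketch of Viterbi decoding, the standard way to pick the highest-scoring sequence under a linear-chain CRF. It is not PaddlePaddle's implementation: the function name, the plain score lists, the toy scores, and the omission of start/end transition scores are all simplifying assumptions.

```python
def viterbi_decode(emissions, transitions):
    """Return the tag index sequence with the highest total score.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    score = list(emissions[0])  # best score of a path ending in each tag
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(num_tags):
            # Best previous tag i for the current tag j.
            best_i = max(range(num_tags),
                         key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backptrs.append(ptrs)
    # Follow the back-pointers from the best final tag.
    path = [max(range(num_tags), key=lambda j: score[j])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]


tags = ['O', 'B-PER', 'I-PER']
emissions = [[2.0, 1.0, 0.0], [0.0, 0.5, 1.0]]  # toy scores for two words
transitions = [[0.0, 0.0, -10.0],               # penalize O -> I-PER
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
print([tags[i] for i in viterbi_decode(emissions, transitions)])
# ['O', 'B-PER']
```

Without the transition penalty, the second word would be labeled `I-PER` directly after `O`, which is not a valid BIO sequence. Learned transition scores let the CRF avoid such outputs.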
## How to Run
### Data settings
To run `ner.py`, the data settings section needs to be changed: set the variables in the following code to the correct file paths.
```python
# init dataset
train_data_file = 'data/train'  # training data file
test_data_file = 'data/test'  # test data file
vocab_file = 'data/vocab.txt'  # word_dict file
target_file = 'data/target.txt'  # label_dict file
emb_file = 'data/wordVectors.txt'  # word vector file
```
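Before starting a long training run, it can help to confirm that these paths point at real files. The helper below is a small sketch of such a check; `missing_files` is a hypothetical name and is not part of `ner.py`, and the paths listed are simply the default values shown above.

```python
import os


def missing_files(paths):
    """Return the entries of `paths` that are not existing files."""
    return [p for p in paths if not os.path.isfile(p)]


data_files = [
    'data/train',
    'data/test',
    'data/vocab.txt',
    'data/target.txt',
    'data/wordVectors.txt',
]
missing = missing_files(data_files)
if missing:
    print('Missing data files: ' + ', '.join(missing))
```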
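### Training and prediction
`ner.py` provides the following two interfaces for model training and prediction:
1. `ner_net_train(data_reader, num_passes)` trains the model. The `data_reader` parameter is the iterator over the training data (the default value can be used), and `num_passes` is the number of training passes. During training, model training information is printed every 100 iterations, the model is saved after every pass (as `params_pass_<pass_id>.tar.gz`), and the final model is saved as `ner_model.tar.gz`.
2. `ner_net_infer(data_reader, model_file)` runs prediction. The `data_reader` parameter is the iterator over the test data (the default value can be used), and `model_file` is the model file saved locally. The prediction results are printed in the following format: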
```
U.N. B-ORG
official O
Ekeus B-PER
heads O
for O
Baghdad B-LOC
. O
```
The first column is the original sentence sequence, and the second column is the NER label in the BIO scheme.
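As noted in the background section, entity boundaries and types can be read directly from such a labeled sequence. The following sketch shows one way to do that for output in the format above. The function `bio_to_spans` is a hypothetical helper, not part of `ner.py`, and treating a stray `I-` tag as the start of a new entity is our own assumption.

```python
def bio_to_spans(words, tags):
    """Group BIO-tagged words into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for word, tag in zip(words, tags):
        # The word continues the open entity only if it is an I- tag
        # of the same type.
        inside = tag.startswith('I-') and tag[2:] == current_type
        if not inside and current:
            spans.append((' '.join(current), current_type))
            current, current_type = [], None
        if tag.startswith('B-') or (tag.startswith('I-') and not inside):
            # Start a new entity (a stray I- tag is treated like B-).
            current, current_type = [word], tag[2:]
        elif inside:
            current.append(word)
    if current:
        spans.append((' '.join(current), current_type))
    return spans


words = ['U.N.', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']
tags = ['B-ORG', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']
print(bio_to_spans(words, tags))
# [('U.N.', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```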
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 """
 Conll03 dataset.
 """
@@ -18,6 +5,7 @@ Conll03 dataset.
 import tarfile
 import gzip
 import itertools
+import collections
 import re
 import numpy as np
@@ -43,6 +31,32 @@ def canonicalize_word(word, wordset=None, digits=True):
     else: return "UUUNKKK" # unknown token
+def corpus_reader(filename='data/train'):
+    def reader():
+        sentence = []
+        labels = []
+        with open(filename) as f:
+            for line in f:
+                if re.match(r"-DOCSTART-.+", line) or (len(line.strip()) == 0):
+                    if len(sentence) > 0:
+                        yield sentence, labels
+                    sentence = []
+                    labels = []
+                else:
+                    segs = line.strip().split()
+                    sentence.append(segs[0])
+                    # transform from I-TYPE to BIO schema
+                    if segs[-1] != 'O' and (len(labels) == 0 or
+                                            labels[-1][1:] != segs[-1][1:]):
+                        labels.append('B' + segs[-1][1:])
+                    else:
+                        labels.append(segs[-1])
+        f.close()
+    return reader
 def load_dict(filename):
     d = dict()
     with open(filename, 'r') as f:
@@ -98,7 +112,7 @@ def reader_creator(corpus_reader, word_dict, label_dict):
     Conll03 train set creator.
     The dataset can be obtained according to http://www.clips.uantwerpen.be/conll2003/ner/.
-    It returns a reader creator, each sample in the reader includes sentence sequence and tagged sequence.
+    It returns a reader creator, each sample in the reader includes word id sequence, label id sequence and raw sentence for purpose of print.
     :return: Training reader creator
     :rtype: callable
@@ -111,7 +125,7 @@ def reader_creator(corpus_reader, word_dict, label_dict):
                 for w in sentence
             ]
             label_idx = [label_dict.get(w) for w in labels]
-            yield word_idx, label_idx
+            yield word_idx, label_idx, sentence
     return reader
......
@@ -49,6 +49,7 @@ def ner_net(is_train):
     hidden_1 = paddle.layer.mixed(
         name='hidden1',
         size=hidden_dim,
+        act=paddle.activation.Tanh(),
         bias_attr=std_default,
         input=[
             paddle.layer.full_matrix_projection(
@@ -74,8 +75,10 @@ def ner_net(is_train):
         param_attr=rnn_para_attr)
     hidden_2_1 = paddle.layer.mixed(
+        name='hidden2-1',
         size=hidden_dim,
         bias_attr=std_default,
+        act=paddle.activation.STanh(),
         input=[
             paddle.layer.full_matrix_projection(
                 input=hidden_1, param_attr=hidden_para_attr),
@@ -83,8 +86,10 @@ def ner_net(is_train):
                 input=rnn_1_1, param_attr=rnn_para_attr)
         ])
     hidden_2_2 = paddle.layer.mixed(
+        name='hidden2-2',
         size=hidden_dim,
         bias_attr=std_default,
+        act=paddle.activation.STanh(),
         input=[
             paddle.layer.full_matrix_projection(
                 input=hidden_1, param_attr=hidden_para_attr),
@@ -110,6 +115,7 @@ def ner_net(is_train):
         name='hidden3',
         size=hidden_dim,
         bias_attr=std_default,
+        act=paddle.activation.STanh(),
         input=[
             paddle.layer.full_matrix_projection(
                 input=hidden_2_1, param_attr=hidden_para_attr),
@@ -158,7 +164,7 @@ def ner_net(is_train):
         return predict
-def ner_net_train(data_reader, num_passes=1):
+def ner_net_train(data_reader=train_data_reader, num_passes=1):
     # define network topology
     crf_cost, crf_dec, target = ner_net(is_train=True)
     evaluator.sum(name='error', input=crf_dec)
@@ -201,7 +207,9 @@ def ner_net_train(data_reader, num_passes=1):
             # save parameters
             with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
                 parameters.to_tar(f)
+            if event.pass_id == num_passes - 1:
+                with gzip.open('ner_model.tar.gz', 'w') as f:
+                    parameters.to_tar(f)
             result = trainer.test(reader=reader, feeding=feeding)
             print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
@@ -214,11 +222,13 @@ def ner_net_train(data_reader, num_passes=1):
     return parameters
-def ner_net_infer(data_reader, parameters):
+def ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz'):
     test_creator = data_reader
     test_data = []
+    test_sentences = []
     for item in test_creator():
         test_data.append([item[0]])
+        test_sentences.append(item[-1])
         if len(test_data) == 10:
             break
@@ -226,18 +236,25 @@ def ner_net_infer(data_reader, parameters):
     lab_ids = paddle.infer(
         output_layer=predict,
-        parameters=parameters,
+        parameters=paddle.parameters.Parameters.from_tar(gzip.open(model_file)),
         input=test_data,
         field='id')
+    '''words_reverse = {}
+    for (k, v) in word_dict.items():
+        words_reverse[v] = k
+    flat_data = [words_reverse[word_id] for word_id in itertools.chain.from_iterable(itertools.chain.from_iterable(test_data))]'''
+    flat_data = [word for word in itertools.chain.from_iterable(test_sentences)]
     labels_reverse = {}
     for (k, v) in label_dict.items():
         labels_reverse[v] = k
     pre_lab = [labels_reverse[lab_id] for lab_id in lab_ids]
-    print pre_lab
+    for word, label in zip(flat_data, pre_lab):
+        print word, label
 if __name__ == '__main__':
     paddle.init(use_gpu=False, trainer_count=1)
-    parameters = ner_net_train(train_data_reader, 1)
-    ner_net_infer(test_data_reader, parameters)
+    ner_net_train(num_passes=1)
+    ner_net_infer()