diff --git a/sequence_tagging_for_ner/README.md b/sequence_tagging_for_ner/README.md
index a0990367ef8b03c70c29d285e22ef85907e1d0b7..e0c488c9aa5298d9d316d45f121c68dc41be73b2 100644
--- a/sequence_tagging_for_ner/README.md
+++ b/sequence_tagging_for_ner/README.md
@@ -1 +1,251 @@
-TBD
+# Named Entity Recognition
+
+## Background
+
+Named Entity Recognition (NER), also known as "proper name recognition", is the task of identifying entities with specific meaning in text, mainly person names, place names, organization names, and other proper nouns. It is a fundamental problem in natural language processing. NER usually involves two parts, detecting entity boundaries and determining entity categories, and can be solved as a sequence tagging problem.
+
+Sequence tagging can be divided into three classes: Sequence Classification, Segment Classification, and Temporal Classification [[1](#references)]. Here we restrict sequence tagging to Segment Classification, i.e., assigning a label in the output sequence to each element of the input sequence. Since NER needs to mark entity boundaries, the label set is usually defined with the [BIO scheme](http://book.paddlepaddle.org/07.label_semantic_roles/). Below is an example of an NER tagging result:
+
+<div align="center">
+<img src="image/ner_label_ins.png"/><br/>
+Figure 1. An example of BIO tagging
+</div>
+
+
+Entity boundaries and entity categories can be read off directly from the tagging result. Similarly, word segmentation, part-of-speech tagging, chunking, and [semantic role labeling](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html) can all be treated as sequence tagging problems.
+
+Because sequence tagging is so broadly applicable, classic sequence models such as the [CRF](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html) were developed; most of them can only use local information or require hand-crafted features. In the deep learning era, various network architectures can perform complex feature extraction. Recurrent Neural Networks (RNNs; see the [Semantic Role Labeling](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html) chapter of the PaddleBook for more background) handle the dependencies between the elements of an input sequence and are therefore well suited to sequence data. The usual way to solve a problem with a neural network is: the earlier layers learn a feature representation of the input, and the last layer performs the final task on top of those features. For sequence tagging, the common practice is to learn features with an RNN-based network and feed the learned features into a CRF that does the tagging. This effectively replaces the linear model of the traditional CRF with a nonlinear neural network; the CRF is kept because it uses sentence-level likelihood and thus better avoids the label bias problem [[2](#references)]. The model in this example follows this idea. Although we work on the NER task here, the model applies equally to other sequence tagging tasks.
+
+## Model
+
+In NER the input is a sentence and the goal is to recognize the entity boundaries and categories in it. Following \[[2](#references)\], we apply only light preprocessing to the raw sentence: every word is lowercased, and whether the original word was capitalized is kept as an extra feature that is fed to the model together with the words. Following the approach to sequence tagging described above, we build a model with the following structure (Figure 2 shows the network structure; a minimal code sketch follows the figure):
+
+1. Construct the input
+   - input 1 is the word sequence, in one-hot representation
+   - input 2 is the capitalization mark sequence, which marks whether each word in the sentence is capitalized, also in one-hot representation;
+2. the one-hot word sequence and capitalization mark sequence are converted, via lookup tables, into sequences of real-valued embedding vectors;
+3. the two embedding sequences from step 2 are fed into a bidirectional RNN that learns a feature representation of the input sequence, yielding a new feature sequence;
+4. a CRF takes the features learned in step 3 as input and the tag sequence as the supervision signal, and performs the sequence tagging.
+
+<div align="center">
+<img src="image/ner_network.png"/><br/>
+Figure 2. NER model network structure
+</div>
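+
+The sketch below illustrates steps 1 to 4 in stripped-down form, written against the `paddle.v2` API that `ner.py` uses. It is only an illustrative sketch, not the exact network defined in `ner.py` (which stacks two bidirectional RNN layers and sets per-layer initialization and learning-rate attributes); `simple_ner_net` is a made-up helper name, and the stock `paddle.layer.embedding`/`paddle.layer.fc` wrappers are used instead of the explicit `mixed`/`table_projection` calls in `ner.py`. The layer sizes mirror the defaults in `ner.py`.
+
+```python
+import paddle.v2 as paddle
+
+def simple_ner_net(word_dict_len, label_dict_len,
+                   word_dim=50, mark_dim=5, hidden_dim=300):
+    seq = paddle.data_type.integer_value_sequence
+    # Step 1: two integer-id input sequences (word ids, capitalization marks).
+    word = paddle.layer.data(name='word', type=seq(word_dict_len))
+    mark = paddle.layer.data(name='mark', type=seq(2))
+    # Step 2: lookup tables map the ids to dense embedding sequences.
+    word_emb = paddle.layer.embedding(input=word, size=word_dim)
+    mark_emb = paddle.layer.embedding(input=mark, size=mark_dim)
+    feature = paddle.layer.concat(input=[word_emb, mark_emb])
+    # Step 3: one forward and one backward recurrent layer form a bidirectional RNN.
+    hidden = paddle.layer.fc(
+        input=feature, size=hidden_dim, act=paddle.activation.Tanh())
+    rnn_fwd = paddle.layer.recurrent(input=hidden)
+    rnn_bwd = paddle.layer.recurrent(input=hidden, reverse=True)
+    output = paddle.layer.fc(
+        input=[rnn_fwd, rnn_bwd], size=label_dict_len,
+        act=paddle.activation.Linear())
+    # Step 4: a sentence-level CRF scores whole tag sequences on top of the features.
+    target = paddle.layer.data(name='target', type=seq(label_dict_len))
+    return paddle.layer.crf(size=label_dict_len, input=output, label=target)
+```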
+
+
+## Data
+
+This example uses the dataset released for the CoNLL 2003 NER shared task. The task (see [this page](http://www.clips.uantwerpen.be/conll2003/ner/)) only provides the annotation tools for download; the raw Reuters corpus must be requested separately (free of charge) for copyright reasons. Once you have the raw data, follow the README of the annotation tools to generate the required data files, which consist of the following three:
+
+| File name | Description |
+|---|---|
+| eng.train | training data |
+| eng.testa | validation data, can be used for hyperparameter tuning |
+| eng.testb | evaluation data, used for the final evaluation |
+
+To keep this example self-contained, we extracted a small number of samples into `data/train` and `data/test` for demonstration; for copyright reasons, please obtain the full data yourself. The three files share the following format:
+
+```
+U.N. NNP I-NP I-ORG
+official NN I-NP O
+Ekeus NNP I-NP I-PER
+heads VBZ I-VP O
+for IN I-PP O
+Baghdad NNP I-NP I-LOC
+. . O O
+```
+
+The first column is the raw sentence (the second and third columns are the part-of-speech tags and the chunk tags from syntactic analysis, which are not used here), and the fourth column is the NER tag in the I-TYPE scheme (I-TYPE differs from BIO mainly in when the chunk-initial B tag is used: I-TYPE uses B only for the latter of two adjacent entities of the same category, and I everywhere else). Sentences are separated by blank lines.
+
+The raw data needs preprocessing before PaddlePaddle can consume it, consisting mainly of the following steps:
+
+1. extract the sentences and tags from the raw data files, and build the word sequences and tag sequences;
+2. convert the I-TYPE tags into BIO tags;
+3. lowercase the words in each sentence and build the capitalization mark sequence;
+4. look up each word's integer index in the dictionary.
+
+We do all of the above preprocessing in `conll03.py` (usage is shown below):
+
+```python
+# import conll03
+# conll03.corpus_reader performs steps 1 and 2 above.
+# conll03.reader_creator performs steps 3 and 4 above.
+# conll03.train and conll03.test yield the preprocessed samples for PaddlePaddle training and testing.
+```
+
+After preprocessing, each training sample consists of three parts: the word sequence, the capitalization mark sequence, and the tag sequence. The table below shows one training sample.
+
+| Word sequence | Mark sequence | Tag sequence |
+|---|---|---|
+| u.n. | 1 | B-ORG |
+| official | 0 | O |
+| ekeus | 1 | B-PER |
+| heads | 0 | O |
+| for | 0 | O |
+| baghdad | 1 | B-LOC |
+| . | 0 | O |
+
+In addition, this example relies on three more data files: the word dictionary, the label dictionary, and the pretrained word vectors. The label dictionary ships in the `data` directory as `data/target.txt`; the word dictionary and the pretrained word vectors come from the [Stanford CS224d](http://cs224d.stanford.edu/) course assignment. Run the `data/download.sh` script from this example's directory first; it downloads the two files and puts them into `data` as `data/vocab.txt` and `data/wordVectors.txt`, respectively.
+
+## Usage
+
+The two Python scripts in this example, `conll03.py` and `ner.py`, provide the data-side and model-side interfaces, respectively.
+
+### Data interface
+
+`conll03.py` provides the interface to the CoNLL 2003 data; its main functions were described in the data section above. With the interfaces and files we provide, the CoNLL 2003 data can be used as follows:
+
+1. set the paths of the data files, the dictionary files, and the word vector file;
+2. call the `conll03.train` and `conll03.test` interfaces.
+
+In code:
+
+```python
+import conll03
+
+# Change the following variables to your file paths.
+train_data_file = 'data/train'    # path of the training data file
+test_data_file = 'data/test'      # path of the test data file
+vocab_file = 'data/vocab.txt'     # path of the dictionary for the input sentences
+target_file = 'data/target.txt'   # path of the dictionary for the labels
+emb_file = 'data/wordVectors.txt' # path of the pretrained word vector parameters
+
+# returns a generator over the training data
+train_data_reader = conll03.train(train_data_file, vocab_file, target_file)
+# returns a generator over the test data
+test_data_reader = conll03.test(test_data_file, vocab_file, target_file)
+```
+
+### Model interface
+
+`ner.py` provides the following two interfaces for model training and inference:
+
+1. `ner_net_train(data_reader, num_passes)` trains the model; `data_reader` is the iterator over the training data and `num_passes` the number of training passes. Training information is printed every 100 iterations. The model configuration also includes a chunk evaluator, which reports the Precision, Recall, and F1 of the current model's chunk recognition; see the [documentation](http://www.paddlepaddle.org/develop/doc/api/v2/config/evaluators.html#chunk) for details on the chunk evaluator. After each pass the model is saved as `params_pass_***.tar.gz` (`***` is the pass id).
+
+2. `ner_net_infer(data_reader, model_file)` runs inference; `data_reader` is the iterator over the test data and `model_file` the model file saved locally. Predictions are printed in the following format:
+
+    ```
+    U.N. B-ORG
+    official O
+    Ekeus B-PER
+    heads O
+    for O
+    Baghdad B-LOC
+    . O
+    ```
+    The first column is the raw sentence and the second column the NER tag in the BIO scheme.
+
+### Running the program
+
+`ner.py` also provides the complete pipeline, including the use of the data interfaces as well as model training and inference. Following the interface usage described above, change the variables in the data setup part of `ner.py` to your file paths:
+
+```python
+# Change the following variables to your file paths.
+train_data_file = 'data/train'    # path of the training data file
+test_data_file = 'data/test'      # path of the test data file
+vocab_file = 'data/vocab.txt'     # path of the dictionary for the input sentences
+target_file = 'data/target.txt'   # path of the dictionary for the labels
+emb_file = 'data/wordVectors.txt' # path of the pretrained word vector parameters
+```
+
+The interface calls are already provided in `ner.py`:
+
+```python
+# generator over the training data
+train_data_reader = conll03.train(train_data_file, vocab_file, target_file)
+# generator over the test data
+test_data_reader = conll03.test(test_data_file, vocab_file, target_file)
+
+# model training
+ner_net_train(data_reader=train_data_reader, num_passes=1)
+# inference
+ner_net_infer(data_reader=test_data_reader, model_file='params_pass_0.tar.gz')
+```
+
+Apart from adjusting the two parameters `num_passes` and `model_file` as needed, no further changes are required (you may also call the individual interfaces yourself, e.g., to use only inference). After making the changes, simply run `python ner.py` from the directory containing `ner.py`. The program will read the data, train and save the model, load the model again, and predict on new samples.
+
+### Custom data and tasks
+
+As mentioned above, the model in this example also applies to other sequence tagging tasks. Taking part-of-speech tagging as an example, this section shows how to use other data and apply the model to another task.
+
+Assume raw data in the following format:
+
+```
+U.N. NNP
+official NN
+Ekeus NNP
+heads VBZ
+for IN
+Baghdad NNP
+. .
+```
+
+The first column is the raw sentence and the second column the part-of-speech tag sequence; the columns are separated by spaces and sentences by blank lines.
+
+To use PaddlePaddle and the model in this example, you can define your own data interfaces by analogy with `conll03.py`, as follows (a short sanity-check snippet for these two readers is given after the references):
+
+1. Following `corpus_reader` in `conll03.py`, define an interface that returns a generator over sentences and tag sequences;
+
+    ```python
+    # Extract sentences and their tags. Takes the data file path and
+    # returns a generator over (sentence, labels) pairs.
+    def corpus_reader(filename):
+        def reader():
+            sentence = []
+            labels = []
+            with open(filename) as f:
+                for line in f:
+                    if len(line.strip()) == 0:
+                        if len(sentence) > 0:
+                            yield sentence, labels
+                        sentence = []
+                        labels = []
+                    else:
+                        segs = line.strip().split()
+                        sentence.append(segs[0])
+                        labels.append(segs[-1])
+
+        return reader
+    ```
+
+2. Following `reader_creator` in `conll03.py`, define an interface that returns a generator over the id-mapped sentences and tag sequences.
+
+    ```python
+    # Takes the generator returned by corpus_reader plus a word dictionary and a
+    # label dictionary (both dicts), and returns a generator over id-mapped
+    # sentences and tag sequences.
+    def reader_creator(corpus_reader, word_dict, label_dict):
+        def reader():
+            for sentence, labels in corpus_reader():
+                word_idx = [
+                    word_dict.get(w, UNK_IDX)  # use w.lower() for lowercased words
+                    for w in sentence
+                ]
+                # To use the capitalization mark feature, uncomment the lines
+                # below and add mark after word_idx in the yield statement.
+                # mark = [
+                #     1 if w[0].isupper() else 0
+                #     for w in sentence
+                # ]
+                label_idx = [label_dict.get(w) for w in labels]
+                yield word_idx, label_idx, sentence  # sentence is kept for printing at inference time
+        return reader
+    ```
+
+With the custom data interfaces defined, to use the model in this example you only need to pass the generators returned by `reader_creator` when calling the training and inference interfaces `ner_net_train` and `ner_net_infer`. Note that the data interfaces given here drop some of the preprocessing in `conll03.py` (they use the raw sentence rather than lowercased words plus capitalization marks), so the model-side interfaces in `ner.py` need a few adjustments as well:
+
+1. in the network definition interface `ner_net`, remove the parts related to the capitalization mark:
+
+    delete the two variables `mark` and `mark_embedding`;
+
+2. in the training interface `ner_net_train`, adjust the parts related to the capitalization mark:
+
+    change the `feeding` definition to `feeding = {'word': 0, 'target': 1}`;
+
+3. in the inference interface `ner_net_infer`, adjust the parts related to the capitalization mark:
+
+    change `test_data.append([item[0], item[1]])` to `test_data.append([item[0]])`.
+
+If you want to keep the feature preprocessing of the NER setup (lowercased words plus capitalization marks), modify the `reader_creator` snippet above as indicated in its comments; in that case the model-side interfaces in `ner.py` need no changes.
+
+## References
+
+1. Graves A. [Supervised Sequence Labelling with Recurrent Neural Networks](http://www.cs.toronto.edu/~graves/preprint.pdf)[J]. Studies in Computational Intelligence, 2013, 385.
+2. Collobert R, Weston J, Bottou L, et al. [Natural Language Processing (Almost) from Scratch](http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf)[J]. Journal of Machine Learning Research, 2011, 12(1):2493-2537.
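+
+As a quick sanity check (referenced from the custom data section above), the two custom readers can be exercised on their own before being wired into `ner.py`. The dictionaries and the data path below are made-up placeholders for illustration only:
+
+```python
+# Assumes corpus_reader, reader_creator, and UNK_IDX as defined above; the
+# dictionaries and the file path are hypothetical stand-ins for real ones.
+UNK_IDX = 0
+word_dict = {'u.n.': 1, 'official': 2, 'heads': 3}           # toy word dictionary
+label_dict = {'NNP': 0, 'NN': 1, 'VBZ': 2, 'IN': 3, '.': 4}  # toy label dictionary
+
+reader = reader_creator(corpus_reader('data/pos_train'), word_dict, label_dict)
+for word_idx, label_idx, sentence in reader():
+    print word_idx, label_idx, sentence
+```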
diff --git a/sequence_tagging_for_ner/conll03.py b/sequence_tagging_for_ner/conll03.py
new file mode 100644
index 0000000000000000000000000000000000000000..d74ae48de30aeda9c83fb4a0675ebf204f919e17
--- /dev/null
+++ b/sequence_tagging_for_ner/conll03.py
@@ -0,0 +1,123 @@
+"""
+Conll03 dataset.
+"""
+
+import re
+import numpy as np
+
+__all__ = ['train', 'test', 'get_dict', 'get_embedding']
+
+UNK_IDX = 0  # index reserved for out-of-vocabulary words
+
+
+def canonicalize_digits(word):
+    if any([c.isalpha() for c in word]): return word
+    word = re.sub(r"\d", "DG", word)  # replace every digit with "DG"
+    if word.startswith("DG"):
+        word = word.replace(",", "")  # remove thousands separator
+    return word
+
+
+def canonicalize_word(word, wordset=None, digits=True):
+    word = word.lower()
+    if digits:
+        if (wordset is not None) and (word in wordset): return word
+        word = canonicalize_digits(word)  # try to canonicalize numbers
+    if (wordset is None) or (word in wordset): return word
+    else: return "UUUNKKK"  # unknown token
+
+
+def load_dict(filename):
+    d = dict()
+    with open(filename, 'r') as f:
+        for i, line in enumerate(f):
+            d[line.strip()] = i
+    return d
+
+
+def get_dict(vocab_file='data/vocab.txt', target_file='data/target.txt'):
+    """
+    Get the word and label dictionary.
+    """
+    word_dict = load_dict(vocab_file)
+    label_dict = load_dict(target_file)
+    return word_dict, label_dict
+
+
+def get_embedding(emb_file='data/wordVectors.txt'):
+    """
+    Get the pretrained word vectors.
+    """
+    return np.loadtxt(emb_file, dtype=float)
+
+
+def corpus_reader(filename='data/train'):
+    def reader():
+        sentence = []
+        labels = []
+        with open(filename) as f:
+            for line in f:
+                if re.match(r"-DOCSTART-.+", line) or (len(line.strip()) == 0):
+                    if len(sentence) > 0:
+                        yield sentence, labels
+                    sentence = []
+                    labels = []
+                else:
+                    segs = line.strip().split()
+                    sentence.append(segs[0])
+                    # transform from I-TYPE to the BIO scheme: a non-O tag
+                    # opens a new chunk (gets a B) unless it continues a
+                    # chunk of the same type
+                    if segs[-1] != 'O' and (len(labels) == 0 or
+                                            labels[-1][1:] != segs[-1][1:]):
+                        labels.append('B' + segs[-1][1:])
+                    else:
+                        labels.append(segs[-1])
+
+    return reader
+
+
+def reader_creator(corpus_reader, word_dict, label_dict):
+    """
+    Conll03 dataset reader creator.
+
+    The dataset can be obtained at http://www.clips.uantwerpen.be/conll2003/ner/.
+    It returns a reader creator; each sample in the reader includes the word id
+    sequence, the capitalization mark sequence, the label id sequence, and the
+    raw sentence for printing.
+ + :return: Training reader creator + :rtype: callable + """ + + def reader(): + for sentence, labels in corpus_reader(): + word_idx = [ + word_dict.get(canonicalize_word(w, word_dict), UNK_IDX) + for w in sentence + ] + mark = [1 if w[0].isupper() else 0 for w in sentence] + label_idx = [label_dict.get(w) for w in labels] + yield word_idx, mark, label_idx, sentence + + return reader + + +def train(data_file='data/train', + vocab_file='data/vocab.txt', + target_file='data/target.txt'): + return reader_creator( + corpus_reader(data_file), + word_dict=load_dict(vocab_file), + label_dict=load_dict(target_file)) + + +def test(data_file='data/test', + vocab_file='data/vocab.txt', + target_file='data/target.txt'): + return reader_creator( + corpus_reader(data_file), + word_dict=load_dict(vocab_file), + label_dict=load_dict(target_file)) diff --git a/sequence_tagging_for_ner/data/download.sh b/sequence_tagging_for_ner/data/download.sh new file mode 100644 index 0000000000000000000000000000000000000000..cdc10f66f2428ccccd02b788a3b9e6ad98b5654b --- /dev/null +++ b/sequence_tagging_for_ner/data/download.sh @@ -0,0 +1,6 @@ +wget http://cs224d.stanford.edu/assignment2/assignment2.zip +unzip assignment2.zip +cp assignment2_release/data/ner/wordVectors.txt data/ +cp assignment2_release/data/ner/vocab.txt data/ +rm -rf assignment2.zip assignment2_release + diff --git a/sequence_tagging_for_ner/data/target.txt b/sequence_tagging_for_ner/data/target.txt new file mode 100644 index 0000000000000000000000000000000000000000..e0fa4d8f6654be07b4d1188750abb861d7c6f264 --- /dev/null +++ b/sequence_tagging_for_ner/data/target.txt @@ -0,0 +1,9 @@ +B-LOC +I-LOC +B-MISC +I-MISC +B-ORG +I-ORG +B-PER +I-PER +O diff --git a/sequence_tagging_for_ner/data/test b/sequence_tagging_for_ner/data/test new file mode 100644 index 0000000000000000000000000000000000000000..ca2f212896527a5ff2520b62bff46dbb9de9f291 --- /dev/null +++ b/sequence_tagging_for_ner/data/test @@ -0,0 +1,130 @@ +-DOCSTART- -X- O O + +CRICKET NNP I-NP O +- : O O +LEICESTERSHIRE NNP I-NP I-ORG +TAKE NNP I-NP O +OVER IN I-PP O +AT NNP I-NP O +TOP NNP I-NP O +AFTER NNP I-NP O +INNINGS NNP I-NP O +VICTORY NN I-NP O +. . O O + +LONDON NNP I-NP I-LOC +1996-08-30 CD I-NP O + +West NNP I-NP I-MISC +Indian NNP I-NP I-MISC +all-rounder NN I-NP O +Phil NNP I-NP I-PER +Simmons NNP I-NP I-PER +took VBD I-VP O +four CD I-NP O +for IN I-PP O +38 CD I-NP O +on IN I-PP O +Friday NNP I-NP O +as IN I-PP O +Leicestershire NNP I-NP I-ORG +beat VBD I-VP O +Somerset NNP I-NP I-ORG +by IN I-PP O +an DT I-NP O +innings NN I-NP O +and CC O O +39 CD I-NP O +runs NNS I-NP O +in IN I-PP O +two CD I-NP O +days NNS I-NP O +to TO I-VP O +take VB I-VP O +over IN I-PP O +at IN B-PP O +the DT I-NP O +head NN I-NP O +of IN I-PP O +the DT I-NP O +county NN I-NP O +championship NN I-NP O +. . O O + +Their PRP$ I-NP O +stay NN I-NP O +on IN I-PP O +top NN I-NP O +, , O O +though RB I-ADVP O +, , O O +may MD I-VP O +be VB I-VP O +short-lived JJ I-ADJP O +as IN I-PP O +title NN I-NP O +rivals NNS I-NP O +Essex NNP I-NP I-ORG +, , O O +Derbyshire NNP I-NP I-ORG +and CC I-NP O +Surrey NNP I-NP I-ORG +all DT O O +closed VBD I-VP O +in RP I-PRT O +on IN I-PP O +victory NN I-NP O +while IN I-SBAR O +Kent NNP I-NP I-ORG +made VBD I-VP O +up RP I-PRT O +for IN I-PP O +lost VBN I-NP O +time NN I-NP O +in IN I-PP O +their PRP$ I-NP O +rain-affected JJ I-NP O +match NN I-NP O +against IN I-PP O +Nottinghamshire NNP I-NP I-ORG +. . 
O O + +After IN I-PP O +bowling VBG I-NP O +Somerset NNP I-NP I-ORG +out RP I-PRT O +for IN I-PP O +83 CD I-NP O +on IN I-PP O +the DT I-NP O +opening NN I-NP O +morning NN I-NP O +at IN I-PP O +Grace NNP I-NP I-LOC +Road NNP I-NP I-LOC +, , O O +Leicestershire NNP I-NP I-ORG +extended VBD I-VP O +their PRP$ I-NP O +first JJ I-NP O +innings NN I-NP O +by IN I-PP O +94 CD I-NP O +runs VBZ I-VP O +before IN I-PP O +being VBG I-VP O +bowled VBD I-VP O +out RP I-PRT O +for IN I-PP O +296 CD I-NP O +with IN I-PP O +England NNP I-NP I-LOC +discard VBP I-VP O +Andy NNP I-NP I-PER +Caddick NNP I-NP I-PER +taking VBG I-VP O +three CD I-NP O +for IN I-PP O +83 CD I-NP O +. . O O + diff --git a/sequence_tagging_for_ner/data/train b/sequence_tagging_for_ner/data/train new file mode 100644 index 0000000000000000000000000000000000000000..ffe131a2d2ea06bab806247c2f472f733c3750a1 --- /dev/null +++ b/sequence_tagging_for_ner/data/train @@ -0,0 +1,141 @@ +-DOCSTART- -X- O O + +EU NNP I-NP I-ORG +rejects VBZ I-VP O +German JJ I-NP I-MISC +call NN I-NP O +to TO I-VP O +boycott VB I-VP O +British JJ I-NP I-MISC +lamb NN I-NP O +. . O O + +Peter NNP I-NP I-PER +Blackburn NNP I-NP I-PER + +BRUSSELS NNP I-NP I-LOC +1996-08-22 CD I-NP O + +The DT I-NP O +European NNP I-NP I-ORG +Commission NNP I-NP I-ORG +said VBD I-VP O +on IN I-PP O +Thursday NNP I-NP O +it PRP B-NP O +disagreed VBD I-VP O +with IN I-PP O +German JJ I-NP I-MISC +advice NN I-NP O +to TO I-PP O +consumers NNS I-NP O +to TO I-VP O +shun VB I-VP O +British JJ I-NP I-MISC +lamb NN I-NP O +until IN I-SBAR O +scientists NNS I-NP O +determine VBP I-VP O +whether IN I-SBAR O +mad JJ I-NP O +cow NN I-NP O +disease NN I-NP O +can MD I-VP O +be VB I-VP O +transmitted VBN I-VP O +to TO I-PP O +sheep NN I-NP O +. . O O + +Germany NNP I-NP I-LOC +'s POS B-NP O +representative NN I-NP O +to TO I-PP O +the DT I-NP O +European NNP I-NP I-ORG +Union NNP I-NP I-ORG +'s POS B-NP O +veterinary JJ I-NP O +committee NN I-NP O +Werner NNP I-NP I-PER +Zwingmann NNP I-NP I-PER +said VBD I-VP O +on IN I-PP O +Wednesday NNP I-NP O +consumers NNS I-NP O +should MD I-VP O +buy VB I-VP O +sheepmeat NN I-NP O +from IN I-PP O +countries NNS I-NP O +other JJ I-ADJP O +than IN I-PP O +Britain NNP I-NP I-LOC +until IN I-SBAR O +the DT I-NP O +scientific JJ I-NP O +advice NN I-NP O +was VBD I-VP O +clearer JJR I-ADJP O +. . O O + +" " O O +We PRP I-NP O +do VBP I-VP O +n't RB I-VP O +support VB I-VP O +any DT I-NP O +such JJ I-NP O +recommendation NN I-NP O +because IN I-SBAR O +we PRP I-NP O +do VBP I-VP O +n't RB I-VP O +see VB I-VP O +any DT I-NP O +grounds NNS I-NP O +for IN I-PP O +it PRP I-NP O +, , O O +" " O O +the DT I-NP O +Commission NNP I-NP I-ORG +'s POS B-NP O +chief JJ I-NP O +spokesman NN I-NP O +Nikolaus NNP I-NP I-PER +van NNP I-NP I-PER +der FW I-NP I-PER +Pas NNP I-NP I-PER +told VBD I-VP O +a DT I-NP O +news NN I-NP O +briefing NN I-NP O +. . O O + +He PRP I-NP O +said VBD I-VP O +further JJ I-NP O +scientific JJ I-NP O +study NN I-NP O +was VBD I-VP O +required VBN I-VP O +and CC O O +if IN I-SBAR O +it PRP I-NP O +was VBD I-VP O +found VBN I-VP O +that IN I-SBAR O +action NN I-NP O +was VBD I-VP O +needed VBN I-VP O +it PRP I-NP O +should MD I-VP O +be VB I-VP O +taken VBN I-VP O +by IN I-PP O +the DT I-NP O +European NNP I-NP I-ORG +Union NNP I-NP I-ORG +. . 
O O + diff --git a/sequence_tagging_for_ner/image/ner_label_ins.png b/sequence_tagging_for_ner/image/ner_label_ins.png new file mode 100644 index 0000000000000000000000000000000000000000..a3667c82e0bd012eea50b1d451a7a4063d26aa54 Binary files /dev/null and b/sequence_tagging_for_ner/image/ner_label_ins.png differ diff --git a/sequence_tagging_for_ner/image/ner_network.png b/sequence_tagging_for_ner/image/ner_network.png new file mode 100644 index 0000000000000000000000000000000000000000..e9c07e34ac287ed04301bdede87dcc53377881b7 Binary files /dev/null and b/sequence_tagging_for_ner/image/ner_network.png differ diff --git a/sequence_tagging_for_ner/ner.py b/sequence_tagging_for_ner/ner.py new file mode 100644 index 0000000000000000000000000000000000000000..b219aeb3de5e4cafe6d22aa90720b1fa13cef4db --- /dev/null +++ b/sequence_tagging_for_ner/ner.py @@ -0,0 +1,267 @@ +import math +import gzip +import paddle.v2 as paddle +import paddle.v2.evaluator as evaluator +import conll03 +import itertools + +# init dataset +train_data_file = 'data/train' +test_data_file = 'data/test' +vocab_file = 'data/vocab.txt' +target_file = 'data/target.txt' +emb_file = 'data/wordVectors.txt' + +train_data_reader = conll03.train(train_data_file, vocab_file, target_file) +test_data_reader = conll03.test(test_data_file, vocab_file, target_file) +word_dict, label_dict = conll03.get_dict(vocab_file, target_file) +word_vector_values = conll03.get_embedding(emb_file) + +# init hyper-params +word_dict_len = len(word_dict) +label_dict_len = len(label_dict) +mark_dict_len = 2 +word_dim = 50 +mark_dim = 5 +hidden_dim = 300 + +mix_hidden_lr = 1e-3 +default_std = 1 / math.sqrt(hidden_dim) / 3.0 +emb_para = paddle.attr.Param( + name='emb', initial_std=math.sqrt(1. / word_dim), is_static=True) +std_0 = paddle.attr.Param(initial_std=0.) 
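+# std_default: shared initialization attribute reused by most of the hidden and
+# projection layers defined in ner_net below.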
+std_default = paddle.attr.Param(initial_std=default_std) + + +def d_type(size): + return paddle.data_type.integer_value_sequence(size) + + +def ner_net(is_train): + word = paddle.layer.data(name='word', type=d_type(word_dict_len)) + mark = paddle.layer.data(name='mark', type=d_type(mark_dict_len)) + + word_embedding = paddle.layer.mixed( + name='word_embedding', + size=word_dim, + input=paddle.layer.table_projection(input=word, param_attr=emb_para)) + mark_embedding = paddle.layer.mixed( + name='mark_embedding', + size=mark_dim, + input=paddle.layer.table_projection(input=mark, param_attr=std_0)) + emb_layers = [word_embedding, mark_embedding] + + word_caps_vector = paddle.layer.concat( + name='word_caps_vector', input=emb_layers) + hidden_1 = paddle.layer.mixed( + name='hidden1', + size=hidden_dim, + act=paddle.activation.Tanh(), + bias_attr=std_default, + input=[ + paddle.layer.full_matrix_projection( + input=word_caps_vector, param_attr=std_default) + ]) + + rnn_para_attr = paddle.attr.Param(initial_std=0.0, learning_rate=0.1) + hidden_para_attr = paddle.attr.Param( + initial_std=default_std, learning_rate=mix_hidden_lr) + + rnn_1_1 = paddle.layer.recurrent( + name='rnn1-1', + input=hidden_1, + act=paddle.activation.Relu(), + bias_attr=std_0, + param_attr=rnn_para_attr) + rnn_1_2 = paddle.layer.recurrent( + name='rnn1-2', + input=hidden_1, + act=paddle.activation.Relu(), + reverse=1, + bias_attr=std_0, + param_attr=rnn_para_attr) + + hidden_2_1 = paddle.layer.mixed( + name='hidden2-1', + size=hidden_dim, + bias_attr=std_default, + act=paddle.activation.STanh(), + input=[ + paddle.layer.full_matrix_projection( + input=hidden_1, param_attr=hidden_para_attr), + paddle.layer.full_matrix_projection( + input=rnn_1_1, param_attr=rnn_para_attr) + ]) + hidden_2_2 = paddle.layer.mixed( + name='hidden2-2', + size=hidden_dim, + bias_attr=std_default, + act=paddle.activation.STanh(), + input=[ + paddle.layer.full_matrix_projection( + input=hidden_1, param_attr=hidden_para_attr), + paddle.layer.full_matrix_projection( + input=rnn_1_2, param_attr=rnn_para_attr) + ]) + + rnn_2_1 = paddle.layer.recurrent( + name='rnn2-1', + input=hidden_2_1, + act=paddle.activation.Relu(), + reverse=1, + bias_attr=std_0, + param_attr=rnn_para_attr) + rnn_2_2 = paddle.layer.recurrent( + name='rnn2-2', + input=hidden_2_2, + act=paddle.activation.Relu(), + bias_attr=std_0, + param_attr=rnn_para_attr) + + hidden_3 = paddle.layer.mixed( + name='hidden3', + size=hidden_dim, + bias_attr=std_default, + act=paddle.activation.STanh(), + input=[ + paddle.layer.full_matrix_projection( + input=hidden_2_1, param_attr=hidden_para_attr), + paddle.layer.full_matrix_projection( + input=rnn_2_1, + param_attr=rnn_para_attr), paddle.layer.full_matrix_projection( + input=hidden_2_2, param_attr=hidden_para_attr), + paddle.layer.full_matrix_projection( + input=rnn_2_2, param_attr=rnn_para_attr) + ]) + + output = paddle.layer.mixed( + name='output', + size=label_dict_len, + bias_attr=False, + input=[ + paddle.layer.full_matrix_projection( + input=hidden_3, param_attr=std_default) + ]) + + if is_train: + target = paddle.layer.data(name='target', type=d_type(label_dict_len)) + + crf_cost = paddle.layer.crf( + size=label_dict_len, + input=output, + label=target, + param_attr=paddle.attr.Param( + name='crfw', + initial_std=default_std, + learning_rate=mix_hidden_lr)) + + crf_dec = paddle.layer.crf_decoding( + size=label_dict_len, + input=output, + label=target, + param_attr=paddle.attr.Param(name='crfw')) + + return crf_cost, crf_dec, target + 
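+    # In inference mode there is no label input; crf_decoding below finds the
+    # most likely tag path using the trained CRF weights ('crfw').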
else: + predict = paddle.layer.crf_decoding( + size=label_dict_len, + input=output, + param_attr=paddle.attr.Param(name='crfw')) + + return predict + + +def ner_net_train(data_reader=train_data_reader, num_passes=1): + # define network topology + crf_cost, crf_dec, target = ner_net(is_train=True) + evaluator.sum(name='error', input=crf_dec) + evaluator.chunk( + name='ner_chunk', + input=crf_dec, + label=target, + chunk_scheme='IOB', + num_chunk_types=(label_dict_len - 1) / 2) + + # create parameters + parameters = paddle.parameters.create(crf_cost) + parameters.set('emb', word_vector_values) + + # create optimizer + optimizer = paddle.optimizer.Momentum( + momentum=0, + learning_rate=2e-4, + regularization=paddle.optimizer.L2Regularization(rate=8e-4), + gradient_clipping_threshold=25, + model_average=paddle.optimizer.ModelAverage( + average_window=0.5, max_average_window=10000), ) + + trainer = paddle.trainer.SGD( + cost=crf_cost, + parameters=parameters, + update_equation=optimizer, + extra_layers=crf_dec) + + reader = paddle.batch( + paddle.reader.shuffle(data_reader, buf_size=8192), batch_size=64) + + feeding = {'word': 0, 'mark': 1, 'target': 2} + + def event_handler(event): + if isinstance(event, paddle.event.EndIteration): + if event.batch_id % 100 == 0: + print "Pass %d, Batch %d, Cost %f, %s" % ( + event.pass_id, event.batch_id, event.cost, event.metrics) + if event.batch_id % 1000 == 0: + result = trainer.test(reader=reader, feeding=feeding) + print "\nTest with Pass %d, Batch %d, %s" % ( + event.pass_id, event.batch_id, result.metrics) + + if isinstance(event, paddle.event.EndPass): + # save parameters + with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f: + parameters.to_tar(f) + result = trainer.test(reader=reader, feeding=feeding) + print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) + + trainer.train( + reader=reader, + event_handler=event_handler, + num_passes=num_passes, + feeding=feeding) + + return parameters + + +def ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz'): + test_data = [] + test_sentences = [] + for item in data_reader(): + test_data.append([item[0], item[1]]) + test_sentences.append(item[-1]) + if len(test_data) == 10: + break + + predict = ner_net(is_train=False) + + lab_ids = paddle.infer( + output_layer=predict, + parameters=paddle.parameters.Parameters.from_tar(gzip.open(model_file)), + input=test_data, + field='id') + + flat_data = [word for word in itertools.chain.from_iterable(test_sentences)] + + labels_reverse = {} + for (k, v) in label_dict.items(): + labels_reverse[v] = k + pre_lab = [labels_reverse[lab_id] for lab_id in lab_ids] + + for word, label in zip(flat_data, pre_lab): + print word, label + + +if __name__ == '__main__': + paddle.init(use_gpu=False, trainer_count=1) + ner_net_train(data_reader=train_data_reader, num_passes=1) + ner_net_infer( + data_reader=test_data_reader, model_file='params_pass_0.tar.gz')