diff --git a/fluid/sequence_tagging_for_ner/README.md b/fluid/sequence_tagging_for_ner/README.md index e5d2edc4f78718872a666176c845f346cd1a7a49..fb9214b28e25ef22f1966f84f666b52fec837584 100644 --- a/fluid/sequence_tagging_for_ner/README.md +++ b/fluid/sequence_tagging_for_ner/README.md @@ -4,91 +4,29 @@ ```text . -├── data # 存储运行本例所依赖的数据 -│   ├── download.sh +├── data # 存储运行本例所依赖的数据,从外部获取 ├── network_conf.py # 模型定义 -├── reader.py # 数据读取接口 +├── reader.py # 数据读取接口, 从外部获取 ├── README.md # 文档 ├── train.py # 训练脚本 ├── infer.py # 预测脚本 -└── utils.py # 定义同样的函数 +└── utils.py # 定义通用的函数, 从外部获取 +└── utils_extend.py # 对utils.py的拓展 ``` -## 简介 +## 简介,模型详解与数据说明 -命名实体识别(Named Entity Recognition,NER)又称作“专名识别”,是指识别文本中具有特定意义的实体,主要包括人名、地名、机构名、专有名词等,是自然语言处理研究的一个基础问题。NER任务通常包括实体边界识别、确定实体类别两部分,可以将其作为序列标注问题解决。 +参考https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md +在模型上,我们使用LSTM代替原始的RNN。 -序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO标注方法](http://book.paddlepaddle.org/07.label_semantic_roles/)定义的标签集。 +## 数据获取 -根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中。 +参照https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md中的数据获取方式,将获取的data目录复制到本目录下。 -由于序列标注问题的广泛性,产生了[CRF](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等经典的序列模型,这些模型大多只能使用局部信息或需要人工设计特征。随着深度学习研究的发展,循环神经网络(Recurrent Neural Network,RNN等 序列模型能够处理序列元素之间前后关联问题,能够从原始输入文本中学习特征表示,而更加适合序列标注任务,更多相关知识可参考PaddleBook中[语义角色标注](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.cn.md)一课。 +## 通用脚本获取 
-## 模型详解 - -NER任务的输入是"一句话",目标是识别句子中的实体边界及类别,我们参照论文\[[2](#参考文献)\]仅对原始句子进行了一些简单的预处理工作:将每个词转换为小写,并将原词是否大写另作为一个特征,共同作为模型的输入。工作流程如下: - -1. 构造输入 - - 输入1是句子序列,采用one-hot方式表示 - - 输入2是大写标记序列,标记了句子中每一个词是否是大写,采用one-hot方式表示; -2. one-hot方式的句子序列和大写标记序列通过词表,转换为实向量表示的词向量序列; -3. 将步骤2中的2个词向量序列作为双向LSTM的输入,学习输入序列的特征表示,得到新的特性表示序列; -4. CRF以步骤3中模型学习到的特征为输入,以标记序列为监督信号,实现序列标注。 - - -## 数据说明 - -在本例中,我们以 [CoNLL 2003 NER任务](http://www.clips.uantwerpen.be/conll2003/ner/)为例,原始Reuters数据由于版权原因需另外申请免费下载,请大家按照原网站说明获取。 - -+ 我们仅在`data`目录下的`train`和`test`文件中放置少数样本用以示例输入数据格式。 -+ 本例依赖数据还包括 - 1. 输入文本的词典 - 2. 为词典中的词语提供预训练好的词向量 - 2. 标记标签的词典 - 标记标签词典已附在`data`目录中,对应于`data/target.txt`文件。输入文本的词典以及词典中词语的预训练的词向量来自:[Stanford CS224d](http://cs224d.stanford.edu/)课程作业。**为运行本例,请首先在`data`目录下运行`download.sh`脚本下载输入文本的词典和预训练的词向量。** 完成后会将这两个文件一并放入`data`目录下,输入文本的词典和预训练的词向量分别对应:`data/vocab.txt`和`data/wordVectors.txt`这两个文件。 - -CoNLL 2003原始数据格式如下: - -``` -U.N. NNP I-NP I-ORG -official NN I-NP O -Ekeus NNP I-NP I-PER -heads VBZ I-VP O -for IN I-PP O -Baghdad NNP I-NP I-LOC -. . O O -``` - -- 第一列为原始句子序列 -- 第二、三列分别为词性标签和句法分析中的语块标签,本例不使用 -- 第四列为采用了 I-TYPE 方式表示的NER标签 - - I-TYPE 和 BIO 方式的主要区别在于语块开始标记的使用上,I-TYPE只有在出现相邻的同类别实体时对后者使用B标记,其他均使用I标记),句子之间以空行分隔。 - -我们在`reader.py`脚本中完成对原始数据的处理以及读取,主要包括下面几个步骤: - -1. 从原始数据文件中抽取出句子和标签,构造句子序列和标签序列; -2. 将 I-TYPE 表示的标签转换为 BIO 方式表示的标签; -3. 将句子序列中的单词转换为小写,并构造大写标记序列; -4. 依据词典获取词对应的整数索引。 - - -预处理完成后,一条训练样本包含3个部分作为神经网络的输入信息用于训练:(1)句子序列;(2)首字母大写标记序列;(3)标注序列,下表是一条训练样本的示例: - -| 句子序列 | 大写标记序列 | 标注序列 | -| -------- | ------------ | -------- | -| u.n. | 1 | B-ORG | -| official | 0 | O | -| ekeus | 1 | B-PER | -| heads | 0 | O | -| for | 0 | O | -| baghdad | 1 | B-LOC | -| . 
| 0 | O | - -## 运行 -### 编写数据读取接口 - -自定义数据读取接口只需编写一个 Python 生成器实现从原始输入文本中解析一条训练样本的逻辑。[reader.py](./reader.py) 中的`data_reader`函数实现了读取原始数据返回类型为: `paddle.data_type.integer_value_sequence`的 3 个输入(分别对应:词语在字典的序号、是否为大写、标注结果在字典中的序号)给`network_conf.ner_net`中定义的 3 个 `data_layer` 的功能。 +本例需要使用https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/reader.py以及https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/utils.py,请将这两个文件复制到本目录下。 ### 训练 @@ -145,44 +83,38 @@ Baghdad NNP I-NP I-LOC 2. 在终端运行 `python infer.py`,开始测试,会看到如下预测结果(以下为训练70个pass所得模型的部分预测结果): -``` -leicestershire B-ORG B-LOC -extended O O -their O O -first O O -innings O O -by O O -DGDG O O -runs O O -before O O -being O O -bowled O O -out O O -for O O -296 O O -with O O -england B-LOC B-LOC -discard O O -andy B-PER B-PER -caddick I-PER I-PER -taking O O -three O O -for O O -DGDG O O -. O O -``` + ```text + leicestershire B-ORG B-LOC + extended O O + their O O + first O O + innings O O + by O O + DGDG O O + runs O O + before O O + being O O + bowled O O + out O O + for O O + 296 O O + with O O + england B-LOC B-LOC + discard O O + andy B-PER B-PER + caddick I-PER I-PER + taking O O + three O O + for O O + DGDG O O + . O O + ``` 输出分为三列,以“\t” 分隔,第一列是输入的词语,第二列是标准结果,第三列为生成的标记结果。多条输入序列之间以空行分隔。 -## 真实结果示例 +## 结果示例

-
-Figure 1. Example experimental results under Fluid
+图1. Paddle下实验结果示例, 横轴表示训练轮数,纵轴表示F1值

- - -## 参考文献 - -1. Graves A. [Supervised Sequence Labelling with Recurrent Neural Networks](http://www.cs.toronto.edu/~graves/preprint.pdf)[J]. Studies in Computational Intelligence, 2013, 385. -2. Collobert R, Weston J, Bottou L, et al. [Natural Language Processing (Almost) from Scratch](http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf)[J]. Journal of Machine Learning Research, 2011, 12(1):2493-2537. diff --git a/fluid/sequence_tagging_for_ner/data/download.sh b/fluid/sequence_tagging_for_ner/data/download.sh deleted file mode 100644 index 99d81c1e0949e47187cd082947117eb4e6bd888d..0000000000000000000000000000000000000000 --- a/fluid/sequence_tagging_for_ner/data/download.sh +++ /dev/null @@ -1,16 +0,0 @@ -if [ -f assignment2.zip ]; then - echo "data exist" -else - wget http://cs224d.stanford.edu/assignment2/assignment2.zip -fi - -if [ $? -eq 0 ];then - unzip assignment2.zip - cp assignment2_release/data/ner/wordVectors.txt ./data - cp assignment2_release/data/ner/vocab.txt ./data - rm -rf assignment2.zip assignment2_release -else - echo "download data error!" >> /dev/stderr - exit 1 -fi - diff --git a/fluid/sequence_tagging_for_ner/data/target.txt b/fluid/sequence_tagging_for_ner/data/target.txt deleted file mode 100644 index e0fa4d8f6654be07b4d1188750abb861d7c6f264..0000000000000000000000000000000000000000 --- a/fluid/sequence_tagging_for_ner/data/target.txt +++ /dev/null @@ -1,9 +0,0 @@ -B-LOC -I-LOC -B-MISC -I-MISC -B-ORG -I-ORG -B-PER -I-PER -O diff --git a/fluid/sequence_tagging_for_ner/data/test b/fluid/sequence_tagging_for_ner/data/test deleted file mode 100644 index 66163e1a869d57303117dd94d59ff01be05de8f7..0000000000000000000000000000000000000000 --- a/fluid/sequence_tagging_for_ner/data/test +++ /dev/null @@ -1,128 +0,0 @@ -CRICKET NNP I-NP O -- : O O -LEICESTERSHIRE NNP I-NP I-ORG -TAKE NNP I-NP O -OVER IN I-PP O -AT NNP I-NP O -TOP NNP I-NP O -AFTER NNP I-NP O -INNINGS NNP I-NP O -VICTORY NN I-NP O -. . 
O O - -LONDON NNP I-NP I-LOC -1996-08-30 CD I-NP O - -West NNP I-NP I-MISC -Indian NNP I-NP I-MISC -all-rounder NN I-NP O -Phil NNP I-NP I-PER -Simmons NNP I-NP I-PER -took VBD I-VP O -four CD I-NP O -for IN I-PP O -38 CD I-NP O -on IN I-PP O -Friday NNP I-NP O -as IN I-PP O -Leicestershire NNP I-NP I-ORG -beat VBD I-VP O -Somerset NNP I-NP I-ORG -by IN I-PP O -an DT I-NP O -innings NN I-NP O -and CC O O -39 CD I-NP O -runs NNS I-NP O -in IN I-PP O -two CD I-NP O -days NNS I-NP O -to TO I-VP O -take VB I-VP O -over IN I-PP O -at IN B-PP O -the DT I-NP O -head NN I-NP O -of IN I-PP O -the DT I-NP O -county NN I-NP O -championship NN I-NP O -. . O O - -Their PRP$ I-NP O -stay NN I-NP O -on IN I-PP O -top NN I-NP O -, , O O -though RB I-ADVP O -, , O O -may MD I-VP O -be VB I-VP O -short-lived JJ I-ADJP O -as IN I-PP O -title NN I-NP O -rivals NNS I-NP O -Essex NNP I-NP I-ORG -, , O O -Derbyshire NNP I-NP I-ORG -and CC I-NP O -Surrey NNP I-NP I-ORG -all DT O O -closed VBD I-VP O -in RP I-PRT O -on IN I-PP O -victory NN I-NP O -while IN I-SBAR O -Kent NNP I-NP I-ORG -made VBD I-VP O -up RP I-PRT O -for IN I-PP O -lost VBN I-NP O -time NN I-NP O -in IN I-PP O -their PRP$ I-NP O -rain-affected JJ I-NP O -match NN I-NP O -against IN I-PP O -Nottinghamshire NNP I-NP I-ORG -. . O O - -After IN I-PP O -bowling VBG I-NP O -Somerset NNP I-NP I-ORG -out RP I-PRT O -for IN I-PP O -83 CD I-NP O -on IN I-PP O -the DT I-NP O -opening NN I-NP O -morning NN I-NP O -at IN I-PP O -Grace NNP I-NP I-LOC -Road NNP I-NP I-LOC -, , O O -Leicestershire NNP I-NP I-ORG -extended VBD I-VP O -their PRP$ I-NP O -first JJ I-NP O -innings NN I-NP O -by IN I-PP O -94 CD I-NP O -runs VBZ I-VP O -before IN I-PP O -being VBG I-VP O -bowled VBD I-VP O -out RP I-PRT O -for IN I-PP O -296 CD I-NP O -with IN I-PP O -England NNP I-NP I-LOC -discard VBP I-VP O -Andy NNP I-NP I-PER -Caddick NNP I-NP I-PER -taking VBG I-VP O -three CD I-NP O -for IN I-PP O -83 CD I-NP O -. . 
O O - diff --git a/fluid/sequence_tagging_for_ner/data/train b/fluid/sequence_tagging_for_ner/data/train deleted file mode 100644 index cbf3e678c555a3b6db26fd14e38889f040f048ca..0000000000000000000000000000000000000000 --- a/fluid/sequence_tagging_for_ner/data/train +++ /dev/null @@ -1,139 +0,0 @@ -EU NNP I-NP I-ORG -rejects VBZ I-VP O -German JJ I-NP I-MISC -call NN I-NP O -to TO I-VP O -boycott VB I-VP O -British JJ I-NP I-MISC -lamb NN I-NP O -. . O O - -Peter NNP I-NP I-PER -Blackburn NNP I-NP I-PER - -BRUSSELS NNP I-NP I-LOC -1996-08-22 CD I-NP O - -The DT I-NP O -European NNP I-NP I-ORG -Commission NNP I-NP I-ORG -said VBD I-VP O -on IN I-PP O -Thursday NNP I-NP O -it PRP B-NP O -disagreed VBD I-VP O -with IN I-PP O -German JJ I-NP I-MISC -advice NN I-NP O -to TO I-PP O -consumers NNS I-NP O -to TO I-VP O -shun VB I-VP O -British JJ I-NP I-MISC -lamb NN I-NP O -until IN I-SBAR O -scientists NNS I-NP O -determine VBP I-VP O -whether IN I-SBAR O -mad JJ I-NP O -cow NN I-NP O -disease NN I-NP O -can MD I-VP O -be VB I-VP O -transmitted VBN I-VP O -to TO I-PP O -sheep NN I-NP O -. . O O - -Germany NNP I-NP I-LOC -'s POS B-NP O -representative NN I-NP O -to TO I-PP O -the DT I-NP O -European NNP I-NP I-ORG -Union NNP I-NP I-ORG -'s POS B-NP O -veterinary JJ I-NP O -committee NN I-NP O -Werner NNP I-NP I-PER -Zwingmann NNP I-NP I-PER -said VBD I-VP O -on IN I-PP O -Wednesday NNP I-NP O -consumers NNS I-NP O -should MD I-VP O -buy VB I-VP O -sheepmeat NN I-NP O -from IN I-PP O -countries NNS I-NP O -other JJ I-ADJP O -than IN I-PP O -Britain NNP I-NP I-LOC -until IN I-SBAR O -the DT I-NP O -scientific JJ I-NP O -advice NN I-NP O -was VBD I-VP O -clearer JJR I-ADJP O -. . 
O O - -" " O O -We PRP I-NP O -do VBP I-VP O -n't RB I-VP O -support VB I-VP O -any DT I-NP O -such JJ I-NP O -recommendation NN I-NP O -because IN I-SBAR O -we PRP I-NP O -do VBP I-VP O -n't RB I-VP O -see VB I-VP O -any DT I-NP O -grounds NNS I-NP O -for IN I-PP O -it PRP I-NP O -, , O O -" " O O -the DT I-NP O -Commission NNP I-NP I-ORG -'s POS B-NP O -chief JJ I-NP O -spokesman NN I-NP O -Nikolaus NNP I-NP I-PER -van NNP I-NP I-PER -der FW I-NP I-PER -Pas NNP I-NP I-PER -told VBD I-VP O -a DT I-NP O -news NN I-NP O -briefing NN I-NP O -. . O O - -He PRP I-NP O -said VBD I-VP O -further JJ I-NP O -scientific JJ I-NP O -study NN I-NP O -was VBD I-VP O -required VBN I-VP O -and CC O O -if IN I-SBAR O -it PRP I-NP O -was VBD I-VP O -found VBN I-VP O -that IN I-SBAR O -action NN I-NP O -was VBD I-VP O -needed VBN I-VP O -it PRP I-NP O -should MD I-VP O -be VB I-VP O -taken VBN I-VP O -by IN I-PP O -the DT I-NP O -European NNP I-NP I-ORG -Union NNP I-NP I-ORG -. . O O - diff --git a/fluid/sequence_tagging_for_ner/imgs/convergence_curve.png b/fluid/sequence_tagging_for_ner/imgs/convergence_curve.png new file mode 100644 index 0000000000000000000000000000000000000000..bebdf7fd213a054246b9fea9957d405cf116bc55 Binary files /dev/null and b/fluid/sequence_tagging_for_ner/imgs/convergence_curve.png differ diff --git a/fluid/sequence_tagging_for_ner/imgs/convergent_curve.png b/fluid/sequence_tagging_for_ner/imgs/convergent_curve.png deleted file mode 100644 index 491b2895c24accafc1bfca3131292a2070ac1400..0000000000000000000000000000000000000000 Binary files a/fluid/sequence_tagging_for_ner/imgs/convergent_curve.png and /dev/null differ diff --git a/fluid/sequence_tagging_for_ner/infer.py b/fluid/sequence_tagging_for_ner/infer.py index 0e04e8797877bf9cc5dc77e44892deee412507aa..2d0bd9496ed2ec1db019a0124905093e0b12531a 100644 --- a/fluid/sequence_tagging_for_ner/infer.py +++ b/fluid/sequence_tagging_for_ner/infer.py @@ -1,14 +1,21 @@ import numpy as np + import paddle.fluid as 
fluid import paddle.v2 as paddle from network_conf import ner_net import reader -from utils import load_dict, load_reverse_dict, to_lodtensor +from utils import load_dict, load_reverse_dict +from utils_extend import to_lodtensor def infer(model_path, batch_size, test_data_file, vocab_file, target_file, use_gpu): + """ + use the model under model_path to predict the test data, the result will be printed on the screen + + return nothing + """ word_dict = load_dict(vocab_file) word_reverse_dict = load_reverse_dict(vocab_file) diff --git a/fluid/sequence_tagging_for_ner/reader.py b/fluid/sequence_tagging_for_ner/reader.py deleted file mode 100644 index a817dd199987ae0050014595296fe4717ab198e4..0000000000000000000000000000000000000000 --- a/fluid/sequence_tagging_for_ner/reader.py +++ /dev/null @@ -1,65 +0,0 @@ -""" -Conll03 dataset. -""" -import re - -__all__ = ["data_reader"] - - -def canonicalize_digits(word): - if any([c.isalpha() for c in word]): return word - word = re.sub("\d", "DG", word) - if word.startswith("DG"): - word = word.replace(",", "") # remove thousands separator - return word - - -def canonicalize_word(word, wordset=None, digits=True): - word = word.lower() - if digits: - if (wordset != None) and (word in wordset): return word - word = canonicalize_digits(word) # try to canonicalize numbers - if (wordset == None) or (word in wordset): return word - else: return "UUUNKKK" # unknown token - - -def data_reader(data_file, word_dict, label_dict): - """ - The dataset can be obtained according to http://www.clips.uantwerpen.be/conll2003/ner/. - It returns a reader creator, each sample in the reader includes: - word id sequence, label id sequence and raw sentence. 
- - :return: reader creator - :rtype: callable - """ - - def reader(): - UNK_IDX = word_dict["UUUNKKK"] - - sentence = [] - labels = [] - with open(data_file, "r") as f: - for line in f: - if len(line.strip()) == 0: - if len(sentence) > 0: - word_idx = [ - word_dict.get( - canonicalize_word(w, word_dict), UNK_IDX) - for w in sentence - ] - mark = [1 if w[0].isupper() else 0 for w in sentence] - label_idx = [label_dict[l] for l in labels] - yield word_idx, mark, label_idx - sentence = [] - labels = [] - else: - segs = line.strip().split() - sentence.append(segs[0]) - # transform I-TYPE to BIO schema - if segs[-1] != "O" and (len(labels) == 0 or - labels[-1][1:] != segs[-1][1:]): - labels.append("B" + segs[-1][1:]) - else: - labels.append(segs[-1]) - - return reader diff --git a/fluid/sequence_tagging_for_ner/train.py b/fluid/sequence_tagging_for_ner/train.py index 6073514d375802c550c807e431ffcd801294fb31..6ed77cd5ca1d504a8b79b4f87349242b5051c539 100644 --- a/fluid/sequence_tagging_for_ner/train.py +++ b/fluid/sequence_tagging_for_ner/train.py @@ -1,13 +1,14 @@ import os import math - import numpy as np + import paddle.v2 as paddle import paddle.fluid as fluid import reader from network_conf import ner_net -from utils import logger, load_dict, get_embedding, to_lodtensor +from utils import logger, load_dict +from utils_extend import to_lodtensor, get_embedding def test(exe, chunk_evaluator, inference_program, test_data, place): diff --git a/fluid/sequence_tagging_for_ner/utils.py b/fluid/sequence_tagging_for_ner/utils.py deleted file mode 100644 index 09b8e5fdbceb369cb4f4bf2c2a1e95723e2edbac..0000000000000000000000000000000000000000 --- a/fluid/sequence_tagging_for_ner/utils.py +++ /dev/null @@ -1,61 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- -import logging - -import paddle.fluid as fluid - -import numpy as np - -logger = logging.getLogger("paddle") -logger.setLevel(logging.INFO) - - -def get_embedding(emb_file='data/wordVectors.txt'): - """ - Get the 
trained word vector. - """ - return np.loadtxt(emb_file, dtype='float32') - - -def load_dict(dict_path): - """ - Load the word dictionary from the given file. - Each line of the given file is a word, which can include multiple columns - seperated by tab. - - This function takes the first column (columns in a line are seperated by - tab) as key and takes line number of a line as the key (index of the word - in the dictionary). - """ - - return dict((line.strip().split("\t")[0], idx) - for idx, line in enumerate(open(dict_path, "r").readlines())) - - -def load_reverse_dict(dict_path): - """ - Load the word dictionary from the given file. - Each line of the given file is a word, which can include multiple columns - seperated by tab. - - This function takes line number of a line as the key (index of the word in - the dictionary) and the first column (columns in a line are seperated by - tab) as the value. - """ - return dict((idx, line.strip().split("\t")[0]) - for idx, line in enumerate(open(dict_path, "r").readlines())) - - -def to_lodtensor(data, place): - seq_lens = [len(seq) for seq in data] - cur_len = 0 - lod = [cur_len] - for l in seq_lens: - cur_len += l - lod.append(cur_len) - flattened_data = np.concatenate(data, axis=0).astype("int64") - flattened_data = flattened_data.reshape([len(flattened_data), 1]) - res = fluid.LoDTensor() - res.set(flattened_data, place) - res.set_lod([lod]) - return res diff --git a/fluid/sequence_tagging_for_ner/utils_extend.py b/fluid/sequence_tagging_for_ner/utils_extend.py new file mode 100644 index 0000000000000000000000000000000000000000..c069140334c1f2c65ee6bec4ae31b8d8b4bc0e4d --- /dev/null +++ b/fluid/sequence_tagging_for_ner/utils_extend.py @@ -0,0 +1,27 @@ +import numpy as np +import paddle.fluid as fluid + + +def get_embedding(emb_file='data/wordVectors.txt'): + """ + Get the trained word vector. 
+ """ + return np.loadtxt(emb_file, dtype='float32') + + +def to_lodtensor(data, place): + """ + convert data to lodtensor + """ + seq_lens = [len(seq) for seq in data] + cur_len = 0 + lod = [cur_len] + for l in seq_lens: + cur_len += l + lod.append(cur_len) + flattened_data = np.concatenate(data, axis=0).astype("int64") + flattened_data = flattened_data.reshape([len(flattened_data), 1]) + res = fluid.LoDTensor() + res.set(flattened_data, place) + res.set_lod([lod]) + return res
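
Note on `to_lodtensor` as moved into `utils_extend.py`: the function packs a batch of variable-length id sequences into one flat column and records where each sequence starts and ends as cumulative offsets. The sketch below restates only that arithmetic with NumPy so it can be run without Paddle installed. `build_lod_offsets` is a hypothetical name introduced here for illustration and is not part of the patch; the real function goes on to create a `fluid.LoDTensor` and call `set` and `set_lod` on it.

```python
import numpy as np


def build_lod_offsets(data):
    """Return (offsets, flattened) for a batch of variable-length id sequences.

    Hypothetical helper, not part of the patch: it mirrors the offset
    arithmetic in to_lodtensor but stops before building a fluid.LoDTensor.
    """
    offsets = [0]
    for seq in data:
        # Each entry is the running total of the sequence lengths seen so far.
        offsets.append(offsets[-1] + len(seq))
    # Concatenate all sequences into a single int64 column of shape [N, 1].
    flattened = np.concatenate(data, axis=0).astype("int64").reshape([-1, 1])
    return offsets, flattened


batch = [[3, 7], [1, 4, 9]]  # two sentences with 2 and 3 word ids
offsets, flattened = build_lod_offsets(batch)
print(offsets)          # [0, 2, 5]
print(flattened.shape)  # (5, 1)
```

With these offsets, sequence `i` occupies rows `offsets[i]:offsets[i + 1]` of the flattened column, which is how the per-sentence boundaries are recovered from a single tensor.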