for code review

f88033ef · root · 0a8f16a1 · f88033ef · 0a8f16a1 · 0a8f16a1
11 changed files
--- a/fluid/sequence_tagging_for_ner/README.md
+++ b/fluid/sequence_tagging_for_ner/README.md
@@ -4,91 +4,29 @@

 ```text
 .
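-├── data                 # data required to run this example
-│   ├── download.sh
+├── data                 # data required to run this example, obtained externally
 ├── network_conf.py      # model definition
-├── reader.py            # data reading interface
+├── reader.py            # data reading interface, obtained externally
 ├── README.md            # documentation
 ├── train.py             # training script
 ├── infer.py             # inference script
-└── utils.py             # defines the same functions
+└── utils.py             # defines common utility functions, obtained externally
+└── utils_extend.py      # extensions to utils.py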
 ```


-## Introduction
+## Introduction, Model Details, and Data Description

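-Named Entity Recognition (NER), also called "proper name recognition", is the task of identifying entities with specific meaning in text, mainly person names, place names, organization names, and other proper nouns. It is a fundamental problem in natural language processing research. An NER task usually has two parts, identifying entity boundaries and determining entity types, and can be solved as a sequence labeling problem.
+See https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md
+For the model, we use an LSTM in place of the original RNN.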

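-Sequence labeling can be divided into three categories: Sequence Classification, Segment Classification, and Temporal Classification [[1](#参考文献)]. This example considers only Segment Classification, that is, assigning a label in the output sequence to each element of the input sequence. For NER, because boundaries must be marked, the label set defined by the [BIO tagging scheme](http://book.paddlepaddle.org/07.label_semantic_roles/) is generally used.
+## Obtaining the Data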

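-Entity boundaries and entity types can be read directly from the sequence labeling result. Similarly, word segmentation, part-of-speech tagging, chunking, and [semantic role labeling](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html) can all be solved by sequence labeling. The usual approach with neural network models is that the earlier layers learn a feature representation of the input and the last layer performs the final task on top of those features. For sequence labeling, the usual approach is to learn features with an RNN-based network and feed the learned features into a CRF to perform the labeling. In effect, the linear model in a traditional CRF is replaced by a nonlinear neural network. The motivation for keeping the CRF is that it uses sentence-level likelihood and therefore handles the label bias problem better [[2](#参考文献)]. This example builds its model on the same idea. Although NER is used as the example here, the model can be applied to various other sequence labeling tasks.
+Follow the data acquisition instructions in https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md, then copy the resulting data directory into this directory.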

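-Because sequence labeling problems are so widespread, classic sequence models such as [CRF](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html) have emerged. Most of these models can use only local information or require hand-designed features. With the development of deep learning research, sequence models such as the Recurrent Neural Network (RNN) can handle dependencies between the elements of a sequence and learn feature representations from raw input text, which makes them better suited to sequence labeling. For more background, see the [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.cn.md) chapter of PaddleBook.
+## Obtaining the Common Scripts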

-## Model Details
-
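-The input to the NER task is a single sentence, and the goal is to identify the boundaries and types of the entities in it. Following the paper \[[2](#参考文献)\], we apply only some simple preprocessing to the raw sentence: each word is converted to lowercase, and whether the original word was capitalized is kept as a separate feature. Both serve as model input. The workflow is as follows:
-
-1. Construct the inputs
- - Input 1 is the sentence sequence, represented as one-hot vectors
- - Input 2 is the capitalization mark sequence, which marks whether each word in the sentence is capitalized, represented as one-hot vectors;
-2. The one-hot sentence sequence and capitalization mark sequence are converted through the vocabulary into sequences of real-valued word vectors;
-3. The two word-vector sequences from step 2 are fed to a bidirectional LSTM, which learns a feature representation of the input sequence and produces a new feature sequence;
-4. The CRF takes the features learned in step 3 as input and the label sequence as the supervision signal to perform sequence labeling.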
-
-
-## Data Description
-
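-In this example we use the [CoNLL 2003 NER task](http://www.clips.uantwerpen.be/conll2003/ner/). For copyright reasons, the original Reuters data must be requested separately (free of charge); please obtain it by following the instructions on the original website.
-
-+ We place only a few samples in the `train` and `test` files under the `data` directory to illustrate the input data format.
-+ This example also depends on the following data:
-    1. A dictionary of the input text
-    2. Pretrained word vectors for the words in the dictionary
-    3. A dictionary of the labels
-   The label dictionary is included in the `data` directory as `data/target.txt`. The input text dictionary and the pretrained word vectors for its words come from the [Stanford CS224d](http://cs224d.stanford.edu/) course assignment. **To run this example, first run the `download.sh` script in the `data` directory to download the input text dictionary and the pretrained word vectors.** When it finishes, both files are placed in the `data` directory as `data/vocab.txt` (input text dictionary) and `data/wordVectors.txt` (pretrained word vectors).
-
-The original CoNLL 2003 data format is as follows: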
-
-```
-U.N.         NNP  I-NP  I-ORG
-official     NN   I-NP  O
-Ekeus        NNP  I-NP  I-PER
-heads        VBZ  I-VP  O
-for          IN   I-PP  O
-Baghdad      NNP  I-NP  I-LOC
-.            .    O     O
-```
-
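- The first column is the original sentence sequence
- The second and third columns are the part-of-speech tags and the syntactic chunk tags, which this example does not use
- The fourth column is the NER tag in the I-TYPE scheme
-    - The main difference between I-TYPE and BIO lies in the use of the chunk-begin tag: I-TYPE uses the B tag only for the second of two adjacent entities of the same type and uses the I tag everywhere else. Sentences are separated by blank lines.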
-
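-The `reader.py` script processes and reads the raw data, in the following main steps:
-
-1. Extract sentences and labels from the raw data file to build sentence sequences and label sequences;
-2. Convert the I-TYPE labels to BIO labels;
-3. Convert the words in each sentence sequence to lowercase and build the capitalization mark sequence;
-4. Look up each word's integer index in the dictionary.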
-
-
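-After preprocessing, a training sample consists of three parts that are fed to the neural network for training: (1) the sentence sequence; (2) the initial-capital mark sequence; (3) the label sequence. The table below shows one training sample:
-
-| Sentence sequence | Capitalization mark sequence | Label sequence |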
-| -------- | ------------ | -------- |
-| u.n.     | 1            | B-ORG    |
-| official | 0            | O        |
-| ekeus    | 1            | B-PER    |
-| heads    | 0            | O        |
-| for      | 0            | O        |
-| baghdad  | 1            | B-LOC    |
-| .        | 0            | O        |
-
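-## Running
-### Writing the Data Reading Interface
-
-A custom data reading interface only requires writing a Python generator that parses one training sample from the raw input text. The `data_reader` function in [reader.py](./reader.py) reads the raw data and returns three inputs of type `paddle.data_type.integer_value_sequence` (the word's index in the dictionary, whether the word is capitalized, and the label's index in the dictionary) for the three `data_layer`s defined in `network_conf.ner_net`.
+This example requires https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/reader.py and https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/utils.py; please copy these two files into this directory.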

 ### Training

@@ -145,44 +83,38 @@ Baghdad      NNP  I-NP  I-LOC

 2. Run `python infer.py` in a terminal to start testing. You will see prediction results like the following (partial predictions from a model trained for 70 passes):

-```
-leicestershire    B-ORG    B-LOC
-extended    O    O
-their    O    O
-first    O    O
-innings    O    O
-by    O    O
-DGDG    O    O
-runs    O    O
-before    O    O
-being    O    O
-bowled    O    O
-out    O    O
-for    O    O
-296    O    O
-with    O    O
-england    B-LOC    B-LOC
-discard    O    O
-andy    B-PER    B-PER
-caddick    I-PER    I-PER
-taking    O    O
-three    O    O
-for    O    O
-DGDG    O    O
-.    O    O
-```
+    ```text
+    leicestershire    B-ORG    B-LOC
+    extended    O    O
+    their    O    O
+    first    O    O
+    innings    O    O
+    by    O    O
+    DGDG    O    O
+    runs    O    O
+    before    O    O
+    being    O    O
+    bowled    O    O
+    out    O    O
+    for    O    O
+    296    O    O
+    with    O    O
+    england    B-LOC    B-LOC
+    discard    O    O
+    andy    B-PER    B-PER
+    caddick    I-PER    I-PER
+    taking    O    O
+    three    O    O
+    for    O    O
+    DGDG    O    O
+    .    O    O
+    ```

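    The output has three columns separated by "\t": the first column is the input word, the second is the gold label, and the third is the predicted label. Input sequences are separated by blank lines.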

-## Real Result Example
+## Result Example

 <p align="center">
-<img src="imgs/convergent_curve.png" width="80%" align="center"/><br/>
-Figure 1. Example experimental results with Fluid
+<img src="imgs/convergence_curve.png" width="80%" align="center"/><br/>
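+Figure 1. Example experimental results with Paddle; the horizontal axis is the number of training passes and the vertical axis is the F1 score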
 </p>
-
-
-## References
-
-1. Graves A. [Supervised Sequence Labelling with Recurrent Neural Networks](http://www.cs.toronto.edu/~graves/preprint.pdf)[J]. Studies in Computational Intelligence, 2013, 385.
-2. Collobert R, Weston J, Bottou L, et al. [Natural Language Processing (Almost) from Scratch](http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf)[J]. Journal of Machine Learning Research, 2011, 12(1):2493-2537.
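Review note: the `DGDG` tokens in the sample prediction output of the README stand in for two-digit numbers such as `94` and `83` in `data/test`. That appears to follow from the digit canonicalization in `reader.py`, which this change removes from the directory and expects to be copied in from the upstream repository. The sketch below restates that logic in self-contained Python 3 form. The `demo_vocab` set is a made-up assumption for illustration only; the real `data/vocab.txt` is not part of this diff.

```python
import re


def canonicalize_digits(word):
    """Replace every digit with "DG" unless the word contains a letter."""
    if any(c.isalpha() for c in word):
        return word
    word = re.sub(r"\d", "DG", word)
    if word.startswith("DG"):
        word = word.replace(",", "")  # remove thousands separator
    return word


def canonicalize_word(word, wordset=None, digits=True):
    """Lowercase a word, then map it to a vocabulary entry or "UUUNKKK"."""
    word = word.lower()
    if digits:
        # A word already in the vocabulary is returned unchanged.
        if wordset is not None and word in wordset:
            return word
        word = canonicalize_digits(word)
    if wordset is None or word in wordset:
        return word
    return "UUUNKKK"  # unknown token


# Hypothetical vocabulary, chosen only to exercise each branch.
demo_vocab = {"leicestershire", "DGDG", "296"}

print(canonicalize_word("94", demo_vocab))
print(canonicalize_word("296", demo_vocab))
print(canonicalize_word("Leicestershire", demo_vocab))
print(canonicalize_word("Zwingmann", demo_vocab))
```

Under this assumed vocabulary, `94` maps to `DGDG` while `296` is kept, which would match the sample output only if `296` is itself an entry in the real vocabulary. The diff does not confirm that.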
--- a/fluid/sequence_tagging_for_ner/data/download.sh
+++ b/fluid/sequence_tagging_for_ner/data/download.sh
-if [ -f assignment2.zip ]; then
-    echo "data exist"
-else
-    wget http://cs224d.stanford.edu/assignment2/assignment2.zip
-fi
-
-if [ $? -eq 0  ];then
-    unzip assignment2.zip
-    cp assignment2_release/data/ner/wordVectors.txt ./data
-    cp assignment2_release/data/ner/vocab.txt ./data
-    rm -rf assignment2.zip assignment2_release
-else
-  echo "download data error!" >> /dev/stderr
-  exit 1
-fi
-
--- a/fluid/sequence_tagging_for_ner/data/target.txt
+++ b/fluid/sequence_tagging_for_ner/data/target.txt
-B-LOC
-I-LOC
-B-MISC
-I-MISC
-B-ORG
-I-ORG
-B-PER
-I-PER
-O
--- a/fluid/sequence_tagging_for_ner/data/test
+++ b/fluid/sequence_tagging_for_ner/data/test
-CRICKET NNP I-NP O
- : O O
-LEICESTERSHIRE NNP I-NP I-ORG
-TAKE NNP I-NP O
-OVER IN I-PP O
-AT NNP I-NP O
-TOP NNP I-NP O
-AFTER NNP I-NP O
-INNINGS NNP I-NP O
-VICTORY NN I-NP O
-. . O O
-
-LONDON NNP I-NP I-LOC
-1996-08-30 CD I-NP O
-
-West NNP I-NP I-MISC
-Indian NNP I-NP I-MISC
-all-rounder NN I-NP O
-Phil NNP I-NP I-PER
-Simmons NNP I-NP I-PER
-took VBD I-VP O
-four CD I-NP O
-for IN I-PP O
-38 CD I-NP O
-on IN I-PP O
-Friday NNP I-NP O
-as IN I-PP O
-Leicestershire NNP I-NP I-ORG
-beat VBD I-VP O
-Somerset NNP I-NP I-ORG
-by IN I-PP O
-an DT I-NP O
-innings NN I-NP O
-and CC O O
-39 CD I-NP O
-runs NNS I-NP O
-in IN I-PP O
-two CD I-NP O
-days NNS I-NP O
-to TO I-VP O
-take VB I-VP O
-over IN I-PP O
-at IN B-PP O
-the DT I-NP O
-head NN I-NP O
-of IN I-PP O
-the DT I-NP O
-county NN I-NP O
-championship NN I-NP O
-. . O O
-
-Their PRP$ I-NP O
-stay NN I-NP O
-on IN I-PP O
-top NN I-NP O
-, , O O
-though RB I-ADVP O
-, , O O
-may MD I-VP O
-be VB I-VP O
-short-lived JJ I-ADJP O
-as IN I-PP O
-title NN I-NP O
-rivals NNS I-NP O
-Essex NNP I-NP I-ORG
-, , O O
-Derbyshire NNP I-NP I-ORG
-and CC I-NP O
-Surrey NNP I-NP I-ORG
-all DT O O
-closed VBD I-VP O
-in RP I-PRT O
-on IN I-PP O
-victory NN I-NP O
-while IN I-SBAR O
-Kent NNP I-NP I-ORG
-made VBD I-VP O
-up RP I-PRT O
-for IN I-PP O
-lost VBN I-NP O
-time NN I-NP O
-in IN I-PP O
-their PRP$ I-NP O
-rain-affected JJ I-NP O
-match NN I-NP O
-against IN I-PP O
-Nottinghamshire NNP I-NP I-ORG
-. . O O
-
-After IN I-PP O
-bowling VBG I-NP O
-Somerset NNP I-NP I-ORG
-out RP I-PRT O
-for IN I-PP O
-83 CD I-NP O
-on IN I-PP O
-the DT I-NP O
-opening NN I-NP O
-morning NN I-NP O
-at IN I-PP O
-Grace NNP I-NP I-LOC
-Road NNP I-NP I-LOC
-, , O O
-Leicestershire NNP I-NP I-ORG
-extended VBD I-VP O
-their PRP$ I-NP O
-first JJ I-NP O
-innings NN I-NP O
-by IN I-PP O
-94 CD I-NP O
-runs VBZ I-VP O
-before IN I-PP O
-being VBG I-VP O
-bowled VBD I-VP O
-out RP I-PRT O
-for IN I-PP O
-296 CD I-NP O
-with IN I-PP O
-England NNP I-NP I-LOC
-discard VBP I-VP O
-Andy NNP I-NP I-PER
-Caddick NNP I-NP I-PER
-taking VBG I-VP O
-three CD I-NP O
-for IN I-PP O
-83 CD I-NP O
-. . O O
-
--- a/fluid/sequence_tagging_for_ner/data/train
+++ b/fluid/sequence_tagging_for_ner/data/train
-EU NNP I-NP I-ORG
-rejects VBZ I-VP O
-German JJ I-NP I-MISC
-call NN I-NP O
-to TO I-VP O
-boycott VB I-VP O
-British JJ I-NP I-MISC
-lamb NN I-NP O
-. . O O
-
-Peter NNP I-NP I-PER
-Blackburn NNP I-NP I-PER
-
-BRUSSELS NNP I-NP I-LOC
-1996-08-22 CD I-NP O
-
-The DT I-NP O
-European NNP I-NP I-ORG
-Commission NNP I-NP I-ORG
-said VBD I-VP O
-on IN I-PP O
-Thursday NNP I-NP O
-it PRP B-NP O
-disagreed VBD I-VP O
-with IN I-PP O
-German JJ I-NP I-MISC
-advice NN I-NP O
-to TO I-PP O
-consumers NNS I-NP O
-to TO I-VP O
-shun VB I-VP O
-British JJ I-NP I-MISC
-lamb NN I-NP O
-until IN I-SBAR O
-scientists NNS I-NP O
-determine VBP I-VP O
-whether IN I-SBAR O
-mad JJ I-NP O
-cow NN I-NP O
-disease NN I-NP O
-can MD I-VP O
-be VB I-VP O
-transmitted VBN I-VP O
-to TO I-PP O
-sheep NN I-NP O
-. . O O
-
-Germany NNP I-NP I-LOC
-'s POS B-NP O
-representative NN I-NP O
-to TO I-PP O
-the DT I-NP O
-European NNP I-NP I-ORG
-Union NNP I-NP I-ORG
-'s POS B-NP O
-veterinary JJ I-NP O
-committee NN I-NP O
-Werner NNP I-NP I-PER
-Zwingmann NNP I-NP I-PER
-said VBD I-VP O
-on IN I-PP O
-Wednesday NNP I-NP O
-consumers NNS I-NP O
-should MD I-VP O
-buy VB I-VP O
-sheepmeat NN I-NP O
-from IN I-PP O
-countries NNS I-NP O
-other JJ I-ADJP O
-than IN I-PP O
-Britain NNP I-NP I-LOC
-until IN I-SBAR O
-the DT I-NP O
-scientific JJ I-NP O
-advice NN I-NP O
-was VBD I-VP O
-clearer JJR I-ADJP O
-. . O O
-
-" " O O
-We PRP I-NP O
-do VBP I-VP O
-n't RB I-VP O
-support VB I-VP O
-any DT I-NP O
-such JJ I-NP O
-recommendation NN I-NP O
-because IN I-SBAR O
-we PRP I-NP O
-do VBP I-VP O
-n't RB I-VP O
-see VB I-VP O
-any DT I-NP O
-grounds NNS I-NP O
-for IN I-PP O
-it PRP I-NP O
-, , O O
-" " O O
-the DT I-NP O
-Commission NNP I-NP I-ORG
-'s POS B-NP O
-chief JJ I-NP O
-spokesman NN I-NP O
-Nikolaus NNP I-NP I-PER
-van NNP I-NP I-PER
-der FW I-NP I-PER
-Pas NNP I-NP I-PER
-told VBD I-VP O
-a DT I-NP O
-news NN I-NP O
-briefing NN I-NP O
-. . O O
-
-He PRP I-NP O
-said VBD I-VP O
-further JJ I-NP O
-scientific JJ I-NP O
-study NN I-NP O
-was VBD I-VP O
-required VBN I-VP O
-and CC O O
-if IN I-SBAR O
-it PRP I-NP O
-was VBD I-VP O
-found VBN I-VP O
-that IN I-SBAR O
-action NN I-NP O
-was VBD I-VP O
-needed VBN I-VP O
-it PRP I-NP O
-should MD I-VP O
-be VB I-VP O
-taken VBN I-VP O
-by IN I-PP O
-the DT I-NP O
-European NNP I-NP I-ORG
-Union NNP I-NP I-ORG
-. . O O
-
--- a/fluid/sequence_tagging_for_ner/imgs/convergence_curve.png
+++ b/fluid/sequence_tagging_for_ner/imgs/convergence_curve.png
--- a/fluid/sequence_tagging_for_ner/imgs/convergent_curve.png
+++ b/fluid/sequence_tagging_for_ner/imgs/convergent_curve.png
--- a/fluid/sequence_tagging_for_ner/infer.py
+++ b/fluid/sequence_tagging_for_ner/infer.py
 import numpy as np
+
 import paddle.fluid as fluid
 import paddle.v2 as paddle

 from network_conf import ner_net
 import reader
-from utils import load_dict, load_reverse_dict, to_lodtensor
+from utils import load_dict, load_reverse_dict
+from utils_extend import to_lodtensor


 def infer(model_path, batch_size, test_data_file, vocab_file, target_file,
          use_gpu):
+    """
+    use the model under model_path to predict the test data, the result will be printed on the screen
+
+    return nothing
+    """
    word_dict = load_dict(vocab_file)
    word_reverse_dict = load_reverse_dict(vocab_file)


--- a/fluid/sequence_tagging_for_ner/reader.py
+++ b/fluid/sequence_tagging_for_ner/reader.py
-"""
-Conll03 dataset.
-"""
-import re
-
-__all__ = ["data_reader"]
-
-
-def canonicalize_digits(word):
-    if any([c.isalpha() for c in word]): return word
-    word = re.sub("\d", "DG", word)
-    if word.startswith("DG"):
-        word = word.replace(",", "")  # remove thousands separator
-    return word
-
-
-def canonicalize_word(word, wordset=None, digits=True):
-    word = word.lower()
-    if digits:
-        if (wordset != None) and (word in wordset): return word
-        word = canonicalize_digits(word)  # try to canonicalize numbers
-    if (wordset == None) or (word in wordset): return word
-    else: return "UUUNKKK"  # unknown token
-
-
-def data_reader(data_file, word_dict, label_dict):
-    """
-    The dataset can be obtained according to http://www.clips.uantwerpen.be/conll2003/ner/.
-    It returns a reader creator, each sample in the reader includes:
-    word id sequence, label id sequence and raw sentence.
-
-    :return: reader creator
-    :rtype: callable
-    """
-
-    def reader():
-        UNK_IDX = word_dict["UUUNKKK"]
-
-        sentence = []
-        labels = []
-        with open(data_file, "r") as f:
-            for line in f:
-                if len(line.strip()) == 0:
-                    if len(sentence) > 0:
-                        word_idx = [
-                            word_dict.get(
-                                canonicalize_word(w, word_dict), UNK_IDX)
-                            for w in sentence
-                        ]
-                        mark = [1 if w[0].isupper() else 0 for w in sentence]
-                        label_idx = [label_dict[l] for l in labels]
-                        yield word_idx, mark, label_idx
-                    sentence = []
-                    labels = []
-                else:
-                    segs = line.strip().split()
-                    sentence.append(segs[0])
-                    # transform I-TYPE to BIO schema
-                    if segs[-1] != "O" and (len(labels) == 0 or
-                                            labels[-1][1:] != segs[-1][1:]):
-                        labels.append("B" + segs[-1][1:])
-                    else:
-                        labels.append(segs[-1])
-
-    return reader
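Review note: since `reader.py` is no longer kept in this directory, the I-TYPE to BIO conversion that the removed `data_reader` performed inline is easy to lose track of. The sketch below isolates that rule as a standalone function. The name `itype_to_bio` is hypothetical and does not exist in the repository; the condition mirrors the one in the removed code.

```python
def itype_to_bio(tags):
    """Convert one sentence's I-TYPE tags to BIO tags."""
    bio = []
    for tag in tags:
        # A non-"O" tag opens a new entity when it starts the sentence or
        # when its type suffix differs from the previous tag's suffix.
        if tag != "O" and (not bio or bio[-1][1:] != tag[1:]):
            bio.append("B" + tag[1:])
        else:
            bio.append(tag)
    return bio


# Tags of the sentence "U.N. official Ekeus heads for Baghdad ." from the README.
print(itype_to_bio(["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"]))
```

For the README sentence this yields the `B-ORG`, `B-PER`, and `B-LOC` labels shown in the training sample table. An explicit `B-` tag in the input (two adjacent entities of the same type) is passed through unchanged.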
--- a/fluid/sequence_tagging_for_ner/train.py
+++ b/fluid/sequence_tagging_for_ner/train.py
 import os
 import math
-
 import numpy as np
+
 import paddle.v2 as paddle
 import paddle.fluid as fluid

 import reader
 from network_conf import ner_net
-from utils import logger, load_dict, get_embedding, to_lodtensor
+from utils import logger, load_dict
+from utils_extend import to_lodtensor, get_embedding


 def test(exe, chunk_evaluator, inference_program, test_data, place):

--- a/fluid/sequence_tagging_for_ner/utils.py
+++ b/fluid/sequence_tagging_for_ner/utils.py
-#!/usr/bin/env python
-# -*- coding: utf-8 -*-
-import logging
-
-import paddle.fluid as fluid
-
 import numpy as np
-
-logger = logging.getLogger("paddle")
-logger.setLevel(logging.INFO)
+import paddle.fluid as fluid


 def get_embedding(emb_file='data/wordVectors.txt'):
@@ -17,36 +9,10 @@ def get_embedding(emb_file='data/wordVectors.txt'):
    return np.loadtxt(emb_file, dtype='float32')


-def load_dict(dict_path):
-    """
-    Load the word dictionary from the given file.
-    Each line of the given file is a word, which can include multiple columns
-    seperated by tab.
-
-    This function takes the first column (columns in a line are seperated by
-    tab) as key and takes line number of a line as the key (index of the word
-    in the dictionary).
-    """
-
-    return dict((line.strip().split("\t")[0], idx)
-                for idx, line in enumerate(open(dict_path, "r").readlines()))
-
-
-def load_reverse_dict(dict_path):
+def to_lodtensor(data, place):
    """
-    Load the word dictionary from the given file.
-    Each line of the given file is a word, which can include multiple columns
-    seperated by tab.
-
-    This function takes line number of a line as the key (index of the word in
-    the dictionary) and the first column (columns in a line are seperated by
-    tab) as the value.
+    convert data to lodtensor
    """
-    return dict((idx, line.strip().split("\t")[0])
-                for idx, line in enumerate(open(dict_path, "r").readlines()))
-
-
-def to_lodtensor(data, place):
    seq_lens = [len(seq) for seq in data]
    cur_len = 0
    lod = [cur_len]