Commit ad9b6c43 authored by: Z zkqym

add cnn_text

Parent: e15da197
Running the example programs in this directory requires PaddlePaddle v0.10.0. If your installed PaddlePaddle version is lower than this, please update it following the instructions in the [installation guide](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html).
---
# Convolutional Neural Networks for Sentence Classification: Model Reproduction
The files in this example directory, and what each contains:
```
.
├── images            # figures used in this document
│   ├── intro.png
│   └── structure.png
├── infer.py          # inference script
├── network_conf.py   # network structures used by the model in the paper
├── reader.py         # data-reading interface
├── README.md         # this document
├── train.py          # training script
└── utils.py          # common utilities: logging, command-line argument parsing, building and loading dictionaries, etc.
```
## Introduction
Traditional sentence classifiers typically rely on SVMs or Naive Bayes, and most traditional approaches represent text as a "bag of words": only the frequency of each word is considered, and word-order information is discarded. Order can be forced back in with N-grams, but this causes severe sparsity and gains little; see the sketch below.
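A minimal illustration (plain Python, not part of this example) of how a bag of words discards word order:

```
from collections import Counter

# word order is lost: these two sentences have opposite meanings
# but produce exactly the same bag of words
print(Counter("the dog bit the man".split()))
print(Counter("the man bit the dog".split()))
# both print the same counts, e.g. Counter({'the': 2, 'dog': 1, 'bit': 1, 'man': 1})
```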
**CNNs (convolutional neural networks)** originated in image processing, but their underlying idea carries over to NLP. The term **"convolution"** comes from signal processing: as a series of input signals arrives, the system produces a series of outputs, and the output at a given moment does not correspond only to the input at that moment; depending on the system's own characteristics, **each output depends on the preceding inputs as well**. If text is the input series, we naturally want to capture the sequential relationship between words. Take "Tom's phone": with convolution, the system learns that the phone is *Tom's*, rather than merely seeing an isolated "phone".
<p align="center">
<img src="images/structure.png" width = "90%" align="center"/><br/>
Figure 1. A typical CNN model for text classification
</p>
To put it more intuitively: in a CNN, convolution means sliding a **kernel** across the image, extracting one feature at each position; the extracted features form a feature map. This sliding feature-extraction process is the convolution.
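Here is a small NumPy sketch (illustrative only, independent of PaddlePaddle) of the same idea applied to text: a kernel spanning 3 words slides over the rows of an embedding matrix, producing one feature per window position:

```
import numpy as np

# toy setup: a 5-word sentence with 4-dimensional word embeddings
emb = np.random.rand(5, 4)         # one row per word
kernel = np.random.rand(3, 4)      # covers a context window of 3 words

# slide the window down the sentence; one feature per position
features = np.array([
    np.sum(emb[i:i + 3] * kernel)  # dot product over the whole window
    for i in range(emb.shape[0] - 2)
])
print(features.shape)              # (3,): one value per window position
```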
This example implements the model from Yoon Kim's paper [Convolutional Neural Networks for Sentence Classification (EMNLP 2014)](http://arxiv.org/abs/1408.5882).
## Model Details
`network_conf.py` contains the following model:
`cnn_network`: a shallow CNN. It is a basic sequence model that accepts variable-length input sequences and extracts features within local regions.
**The CNN model structure is shown in the figure below:**
<p align="center">
<img src="images/intro.png" width = "90%" align="center"/><br/>
Figure 2. The CNN text-classification model from the paper
</p>
The PaddlePaddle implementation of this CNN structure is the `cnn_network` function in `network_conf.py`. The model consists of the following main parts:
- **Data preprocessing**: preprocess the raw data so that it meets the model's input requirements.
```
data = paddle.layer.data("word",
                         paddle.data_type.integer_value_sequence(dict_dim))
lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
```
- **Word embedding layer**: maps each word to a fixed-dimensional vector; distances between vectors express the degree of semantic relatedness between words.
```
emb = paddle.layer.embedding(input=data, size=emb_dim)
```
- **Convolution layer**: in text classification, the convolution runs along the time dimension: the kernel is as wide as the matrix produced by the embedding layer, and the convolution slides along the height of that matrix.
```
conv_3 = paddle.networks.sequence_conv_pool(
    input=emb, context_len=3, hidden_size=hid_dim)
```
- **Max-pooling layer**: apply max pooling to each feature map produced by the convolution. Since each feature map is already a vector, max pooling here simply selects the largest element of each vector. Note that `paddle.networks.sequence_conv_pool` fuses the convolution and the max pooling into one call; the snippet below is therefore a second convolution-and-pooling branch, with a context window of 4.
```
conv_4 = paddle.networks.sequence_conv_pool(
    input=emb, context_len=4, hidden_size=hid_dim)
```
- **Fully connected and output layer**: pass the max-pooling results through a fully connected layer to produce the output:
```
prob = paddle.layer.fc(input=[conv_3, conv_4],
                       size=class_dim,
                       act=paddle.activation.Softmax())
```
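Putting the pieces together: `cnn_network` (defined in `network_conf.py`, shown in full below) returns the cost, probability, and label layers for training, and only the probability layer when `is_infer=True`. A minimal sketch of calling it, assuming a hypothetical 10000-word vocabulary:

```
# sketch: build the topology for a 10000-word vocabulary and 2 classes
cost, prob, lbl = cnn_network(dict_dim=10000, class_dim=2)
prob_layer = cnn_network(dict_dim=10000, class_dim=2, is_infer=True)
```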
## Using the Model
### Training
Run `python train.py --nn_type=cnn --batch_size=64 --num_passes=20` in a terminal. This trains the example directly on `paddle.dataset.imdb`, PaddlePaddle's built-in sentiment-classification dataset, and produces output like the following:
```text
Pass 0, Batch 0, Cost 0.696031, {'__auc_evaluator_0__': 0.47360000014305115, 'classification_error_evaluator': 0.5}
Pass 0, Batch 100, Cost 0.544438, {'__auc_evaluator_0__': 0.839249312877655, 'classification_error_evaluator': 0.30000001192092896}
Pass 0, Batch 200, Cost 0.406581, {'__auc_evaluator_0__': 0.9030032753944397, 'classification_error_evaluator': 0.2199999988079071}
Test at Pass 0, {'__auc_evaluator_0__': 0.9289745092391968, 'classification_error_evaluator': 0.14927999675273895}
```
A log line is printed every 100 batches and contains:
(1) the pass number;
(2) the batch number;
(3) the current values of the evaluation metrics on that batch. The metrics are specified when the network topology is configured; the output above reports the AUC and the classification error rate on the training set.
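These metrics are attached when the trainer is constructed; the relevant excerpt from `train.py` is:

```
trainer = paddle.trainer.SGD(
    cost=cost,
    extra_layers=paddle.evaluator.auc(input=prob, label=label),
    parameters=parameters,
    update_equation=adam_optimizer)
```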
### Prediction
After training finishes, models are saved under the current working directory by default. Run `python infer.py` in a terminal; the prediction script loads a trained model and runs inference.
- By default, it loads the CNN model produced by training one pass on `paddle.dataset.imdb.train` and tests it on `paddle.dataset.imdb.test`.
You will see output like the following:
```
positive 0.9275 0.0725 previous reviewer <unk> <unk> gave a much better <unk> of the films plot details than i could what i recall mostly is that it was just so beautiful in every sense emotionally visually <unk> just <unk> br if you like movies that are wonderful to look at and also have emotional content to which that beauty is relevant i think you will be glad to have seen this extraordinary and unusual work of <unk> br on a scale of 1 to 10 id give it about an <unk> the only reason i shy away from 9 is that it is a mood piece if you are in the mood for a really artistic very romantic film then its a 10 i definitely think its a mustsee but none of us can be in that mood all the time so overall <unk>
negative 0.0300 0.9700 i love scifi and am willing to put up with a lot scifi <unk> are usually <unk> <unk> and <unk> i tried to like this i really did but it is to good tv scifi as <unk> 5 is to star trek the original silly <unk> cheap cardboard sets stilted dialogues cg that doesnt match the background and painfully onedimensional characters cannot be overcome with a scifi setting im sure there are those of you out there who think <unk> 5 is good scifi tv its not its clichéd and <unk> while us viewers might like emotion and character development scifi is a genre that does not take itself seriously <unk> star trek it may treat important issues yet not as a serious philosophy its really difficult to care about the characters here as they are not simply <unk> just missing a <unk> of life their actions and reactions are wooden and predictable often painful to watch the makers of earth know its rubbish as they have to always say gene <unk> earth otherwise people would not continue watching <unk> <unk> must be turning in their <unk> as this dull cheap poorly edited watching it without <unk> breaks really brings this home <unk> <unk> of a show <unk> into space spoiler so kill off a main character and then bring him back as another actor <unk> <unk> all over again
```
Each line of the output is the prediction for one sample, with 3 columns separated by `\t`:
(1) the predicted class label;
(2) the probabilities of the sample belonging to each class, separated by spaces;
(3) the input text.
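A minimal sketch (plain Python, not part of this example) of splitting one such output line back into its three fields:

```
# hypothetical example line in the format described above
line = "positive\t0.9275 0.0725\tprevious reviewer gave a much better review"
label, probs, text = line.split("\t")        # 3 tab-separated columns
probs = [float(p) for p in probs.split()]    # one probability per class
print("%s %s" % (label, probs))
```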
# ===== infer.py =====
import os
import gzip

import paddle.v2 as paddle

import reader
from network_conf import cnn_network
from utils import logger, load_dict, load_reverse_dict


def infer(topology, data_dir, model_path, word_dict_path, label_dict_path,
          batch_size):
    def _infer_a_batch(inferer, test_batch, ids_2_word, ids_2_label):
        probs = inferer.infer(input=test_batch, field=["value"])
        assert len(probs) == len(test_batch)
        for word_ids, prob in zip(test_batch, probs):
            word_text = " ".join([ids_2_word[id] for id in word_ids[0]])
            print("%s\t%s\t%s" % (ids_2_label[prob.argmax()],
                                  " ".join(["{:0.4f}".format(p)
                                            for p in prob]), word_text))

    logger.info("begin to predict...")

    use_default_data = (data_dir is None)
    if use_default_data:
        word_dict = paddle.dataset.imdb.word_dict()
        word_reverse_dict = dict((value, key)
                                 for key, value in word_dict.iteritems())
        label_reverse_dict = {0: "positive", 1: "negative"}
        test_reader = paddle.dataset.imdb.test(word_dict)()
    else:
        assert os.path.exists(
            word_dict_path), "the word dictionary file does not exist"
        assert os.path.exists(
            label_dict_path), "the label dictionary file does not exist"

        word_dict = load_dict(word_dict_path)
        word_reverse_dict = load_reverse_dict(word_dict_path)
        label_reverse_dict = load_reverse_dict(label_dict_path)

        test_reader = reader.test_reader(data_dir, word_dict)()

    dict_dim = len(word_dict)
    class_num = len(label_reverse_dict)
    prob_layer = topology(dict_dim, class_num, is_infer=True)

    # initialize PaddlePaddle
    paddle.init(use_gpu=False, trainer_count=1)

    # load the trained model
    parameters = paddle.parameters.Parameters.from_tar(
        gzip.open(model_path, "r"))
    inferer = paddle.inference.Inference(
        output_layer=prob_layer, parameters=parameters)

    # predict in batches; flush whatever remains at the end
    test_batch = []
    for idx, item in enumerate(test_reader):
        test_batch.append([item[0]])
        if len(test_batch) == batch_size:
            _infer_a_batch(inferer, test_batch, word_reverse_dict,
                           label_reverse_dict)
            test_batch = []
    if len(test_batch):
        _infer_a_batch(inferer, test_batch, word_reverse_dict,
                       label_reverse_dict)
        test_batch = []


if __name__ == "__main__":
    model_path = "models/cnn_params_pass_00000.tar.gz"
    assert os.path.exists(model_path), "the trained model does not exist."

    nn_type = "cnn"
    test_dir = None
    word_dict = None
    label_dict = None

    assert nn_type == "cnn", "wrong network type."
    infer(
        topology=cnn_network,  # `topology` was undefined here; use the imported network
        data_dir=test_dir,
        word_dict_path=word_dict,
        label_dict_path=label_dict,
        model_path=model_path,
        batch_size=10)
# ===== network_conf.py =====
import paddle.v2 as paddle

__all__ = ["cnn_network"]


def cnn_network(dict_dim,
                class_dim=2,
                emb_dim=28,
                hid_dim=128,
                is_infer=False):
    # input layers: a variable-length word-id sequence and an integer label
    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(dict_dim))
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))

    # word embedding
    emb = paddle.layer.embedding(input=data, size=emb_dim)

    # two convolution + max-pooling branches with context windows of 3 and 4
    conv_3 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=3, hidden_size=hid_dim)
    conv_4 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=4, hidden_size=hid_dim)

    # fully connected output layer with softmax over the classes
    prob = paddle.layer.fc(input=[conv_3, conv_4],
                           size=class_dim,
                           act=paddle.activation.Softmax())

    if is_infer:
        return prob
    else:
        cost = paddle.layer.classification_cost(input=prob, label=lbl)
        return cost, prob, lbl
# ===== reader.py =====
import os


def train_reader(data_dir, word_dict, label_dict):
    """Yield (word_ids, label_id) pairs from tab-separated training files."""

    def reader():
        UNK_ID = word_dict["<UNK>"]
        word_col = 1
        lbl_col = 0

        for file_name in os.listdir(data_dir):
            with open(os.path.join(data_dir, file_name), "r") as f:
                for line in f:
                    line_split = line.strip().split("\t")
                    word_ids = [
                        word_dict.get(w, UNK_ID)
                        for w in line_split[word_col].split()
                    ]
                    yield word_ids, label_dict[line_split[lbl_col]]

    return reader


def test_reader(data_dir, word_dict):
    """Yield (word_ids, raw_text) pairs from tab-separated test files."""

    def reader():
        UNK_ID = word_dict["<UNK>"]
        word_col = 1

        for file_name in os.listdir(data_dir):
            with open(os.path.join(data_dir, file_name), "r") as f:
                for line in f:
                    line_split = line.strip().split("\t")
                    # skip lines that do not have a text column
                    if len(line_split) <= word_col: continue
                    word_ids = [
                        word_dict.get(w, UNK_ID)
                        for w in line_split[word_col].split()
                    ]
                    yield word_ids, line_split[word_col]

    return reader
# ===== train.py =====
import os
import gzip

import paddle.v2 as paddle

import reader
from utils import logger, parse_train_cmd, build_dict, load_dict
from network_conf import cnn_network


def train(topology,
          train_data_dir=None,
          test_data_dir=None,
          word_dict_path=None,
          label_dict_path=None,
          model_save_dir="models",
          batch_size=32,
          num_passes=10):
    if not os.path.exists(model_save_dir):
        os.mkdir(model_save_dir)

    use_default_data = (train_data_dir is None)

    if use_default_data:
        logger.info(("No training data are provided, "
                     "use paddle.dataset.imdb to train the model."))
        logger.info("please wait to build the word dictionary ...")

        word_dict = paddle.dataset.imdb.word_dict()
        train_reader = paddle.batch(
            paddle.reader.shuffle(
                lambda: paddle.dataset.imdb.train(word_dict)(),
                buf_size=51200),
            batch_size=100)
        test_reader = paddle.batch(
            lambda: paddle.dataset.imdb.test(word_dict)(), batch_size=100)

        class_num = 2
    else:
        if word_dict_path is None or not os.path.exists(word_dict_path):
            logger.info(("word dictionary is not given, the dictionary "
                         "is automatically built from the training data."))
            build_dict(
                data_dir=train_data_dir,
                save_path=word_dict_path,
                use_col=1,
                cutoff_fre=5,
                insert_extra_words=["<UNK>"])

        if not os.path.exists(label_dict_path):
            logger.info(("label dictionary is not given, the dictionary "
                         "is automatically built from the training data."))
            # build the label dictionary to map the original string-typed
            # label into integer-typed index
            build_dict(
                data_dir=train_data_dir, save_path=label_dict_path, use_col=0)

        word_dict = load_dict(word_dict_path)
        lbl_dict = load_dict(label_dict_path)
        class_num = len(lbl_dict)
        logger.info("class number is : %d." % (len(lbl_dict)))

        train_reader = paddle.batch(
            paddle.reader.shuffle(
                reader.train_reader(train_data_dir, word_dict, lbl_dict),
                buf_size=51200),
            batch_size=batch_size)

        if test_data_dir is not None:
            # here, because training and testing data share a same format,
            # we still use the reader.train_reader to read the testing data.
            test_reader = paddle.batch(
                reader.train_reader(test_data_dir, word_dict, lbl_dict),
                batch_size=batch_size)
        else:
            test_reader = None

    dict_dim = len(word_dict)
    logger.info("length of word dictionary is : %d." % (dict_dim))

    paddle.init(use_gpu=False, trainer_count=1)

    # network config
    cost, prob, label = topology(dict_dim, class_num)

    # create parameters
    parameters = paddle.parameters.create(cost)

    # create optimizer
    adam_optimizer = paddle.optimizer.Adam(
        learning_rate=1e-3,
        regularization=paddle.optimizer.L2Regularization(rate=1e-3),
        model_average=paddle.optimizer.ModelAverage(average_window=0.5))

    # create trainer
    trainer = paddle.trainer.SGD(
        cost=cost,
        extra_layers=paddle.evaluator.auc(input=prob, label=label),
        parameters=parameters,
        update_equation=adam_optimizer)

    # begin training network
    feeding = {"word": 0, "label": 1}

    def _event_handler(event):
        """
        Define end batch and end pass event handler
        """
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 100 == 0:
                logger.info("Pass %d, Batch %d, Cost %f, %s\n" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics))

        if isinstance(event, paddle.event.EndPass):
            if test_reader is not None:
                result = trainer.test(reader=test_reader, feeding=feeding)
                logger.info("Test at Pass %d, %s \n" % (event.pass_id,
                                                        result.metrics))
            with gzip.open(
                    os.path.join(model_save_dir, "cnn_params_pass_%05d.tar.gz" %
                                 event.pass_id), "w") as f:
                trainer.save_parameter_to_tar(f)

    trainer.train(
        reader=train_reader,
        event_handler=_event_handler,
        feeding=feeding,
        num_passes=num_passes)

    logger.info("Training has finished.")


def main(args):
    if args.nn_type == "cnn":
        topology = cnn_network
    else:
        logger.error("wrong network type, check again.")
        return  # bail out instead of calling train() with an undefined topology

    train(
        topology=topology,
        train_data_dir=args.train_data_dir,
        test_data_dir=args.test_data_dir,
        word_dict_path=args.word_dict,
        label_dict_path=args.label_dict,
        batch_size=args.batch_size,
        num_passes=args.num_passes,
        model_save_dir=args.model_save_dir)


if __name__ == "__main__":
    args = parse_train_cmd()
    if args.train_data_dir is not None:
        assert args.word_dict and args.label_dict, (
            "the parameter train_data_dir, word_dict_path, and label_dict_path "
            "should be set at the same time.")
    main(args)
# ===== utils.py =====
import os
import logging
import argparse
from collections import defaultdict

logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)


def parse_train_cmd():
    parser = argparse.ArgumentParser(
        description="PaddlePaddle text classification example.")
    parser.add_argument(
        "--nn_type",
        type=str,
        help=("A flag that defines which type of network to use, "
              "available: [cnn]."),
        default="cnn")
    parser.add_argument(
        "--train_data_dir",
        type=str,
        required=False,
        help=("The path of training dataset (default: None). If this parameter "
              "is not set, paddle.dataset.imdb will be used."),
        default=None)
    parser.add_argument(
        "--test_data_dir",
        type=str,
        required=False,
        help=("The path of testing dataset (default: None). If this parameter "
              "is not set, paddle.dataset.imdb will be used."),
        default=None)
    parser.add_argument(
        "--word_dict",
        type=str,
        required=False,
        help=("The path of word dictionary (default: None). If this parameter "
              "is not set, paddle.dataset.imdb will be used. If this parameter "
              "is set, but the file does not exist, the word dictionary "
              "will be built from the training data automatically."),
        default=None)
    parser.add_argument(
        "--label_dict",
        type=str,
        required=False,
        help=("The path of label dictionary (default: None). If this parameter "
              "is not set, paddle.dataset.imdb will be used. If this parameter "
              "is set, but the file does not exist, the label dictionary "
              "will be built from the training data automatically."),
        default=None)
    parser.add_argument(
        "--batch_size",
        type=int,
        default=32,
        help="The number of training examples in one forward/backward pass.")
    parser.add_argument(
        "--num_passes",
        type=int,
        default=10,
        help="The number of passes to train the model.")
    parser.add_argument(
        "--model_save_dir",
        type=str,
        required=False,
        help=("The path to save the trained models."),
        default="models")
    return parser.parse_args()


def build_dict(data_dir,
               save_path,
               use_col=0,
               cutoff_fre=0,
               insert_extra_words=[]):
    """Build a frequency-sorted dictionary from column `use_col` of the data."""
    values = defaultdict(int)
    for file_name in os.listdir(data_dir):
        file_path = os.path.join(data_dir, file_name)
        if not os.path.isfile(file_path):
            continue
        with open(file_path, "r") as fdata:
            for line in fdata:
                line_splits = line.strip().split("\t")
                # skip lines that do not have the requested column
                if len(line_splits) <= use_col: continue
                for w in line_splits[use_col].split():
                    values[w] += 1

    with open(save_path, "w") as f:
        for w in insert_extra_words:
            f.write("%s\t-1\n" % (w))

        for v, count in sorted(
                values.iteritems(), key=lambda x: x[1], reverse=True):
            if count < cutoff_fre:
                break
            f.write("%s\t%d\n" % (v, count))


def load_dict(dict_path):
    """Map each token in the dictionary file to its line index."""
    return dict((line.strip().split("\t")[0], idx)
                for idx, line in enumerate(open(dict_path, "r").readlines()))


def load_reverse_dict(dict_path):
    """Map each line index back to its token."""
    return dict((idx, line.strip().split("\t")[0])
                for idx, line in enumerate(open(dict_path, "r").readlines()))