Unverified commit 297a5621, authored by Yibing Liu, committed by GitHub

Merge pull request #1342 from xuezhong/machine_reading_comprehesion

Machine reading comprehension
@@ -167,3 +167,12 @@ SimNet is a semantic matching framework developed in-house by Baidu's NLP department in 2013
- `SimNet in PaddlePaddle
  Fluid <https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md>`__
Machine Reading Comprehension
-----------------------------
Machine reading comprehension (MRC) is one of the core tasks in natural language processing (NLP): its ultimate goal is to have machines read text the way humans do, distill the information in it, and answer related questions. The wide adoption of deep learning in NLP has greatly improved machine reading comprehension in recent years. However, current MRC research relies on artificially constructed datasets and relatively simple questions, which still differ markedly from the data humans handle, so large-scale real-world training data is urgently needed to drive MRC further forward.
The Baidu reading comprehension dataset is a real-world dataset open-sourced by Baidu's NLP department. All questions and source documents come from real data (Baidu search engine data and the Baidu Zhidao Q&A community), and the answers were written by humans. Each question corresponds to multiple answers; with 200k questions, 1,000k documents, and 420k answers, it is the largest Chinese MRC dataset to date. Baidu has also open-sourced the corresponding reading comprehension model, DuReader. It adopts the layered network structure in common use today: a bidirectional attention mechanism captures the interaction between the question and the documents to produce a query-aware document representation, from which a pointer network predicts the answer span.
- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/fluid/machine_reading_comprehension/README.md>`__
@@ -136,3 +136,12 @@ AnyQ
SimNet is a semantic matching framework developed in-house by Baidu's NLP department in 2013 and widely used across Baidu products. It covers core network structures such as BOW, CNN, RNN, and MM-DNN, and also integrates mainstream academic semantic matching models such as MatchPyramid, MV-LSTM, and K-NRM. Models built with SimNet can be conveniently plugged into the AnyQ system to strengthen its semantic matching ability.
- [SimNet in PaddlePaddle Fluid](https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md)
Machine Reading Comprehension
----------
Machine reading comprehension (MRC) is one of the core tasks in natural language processing (NLP): its ultimate goal is to have machines read text the way humans do, distill the information in it, and answer related questions. The wide adoption of deep learning in NLP has greatly improved machine reading comprehension in recent years. However, current MRC research relies on artificially constructed datasets and relatively simple questions, which still differ markedly from the data humans handle, so large-scale real-world training data is urgently needed to drive MRC further forward.
The Baidu reading comprehension dataset is a real-world dataset open-sourced by Baidu's NLP department. All questions and source documents come from real data (Baidu search engine data and the Baidu Zhidao Q&A community), and the answers were written by humans. Each question corresponds to multiple answers; with 200k questions, 1,000k documents, and 420k answers, it is the largest Chinese MRC dataset to date. Baidu has also open-sourced the corresponding reading comprehension model, DuReader. It adopts the layered network structure in common use today: a bidirectional attention mechanism captures the interaction between the question and the documents to produce a query-aware document representation, from which a pointer network predicts the answer span.
- [DuReader in PaddlePaddle Fluid](https://github.com/PaddlePaddle/models/blob/develop/fluid/machine_reading_comprehension/README.md)
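To make the span-prediction step concrete, here is a minimal NumPy sketch (an illustration, not the DuReader code itself) of turning a pointer network's start/end probabilities into an answer span:
```
import numpy as np

def best_span(start_probs, end_probs, max_span_len=50):
    # Pick (start, end) maximizing P(start) * P(end), subject to
    # start <= end and a bounded span length.
    best, best_score = (0, 0), 0.0
    for s, p_s in enumerate(start_probs):
        for e in range(s, min(s + max_span_len, len(end_probs))):
            score = p_s * end_probs[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

# toy probabilities over a 6-token passage
start = np.array([0.1, 0.6, 0.1, 0.1, 0.05, 0.05])
end = np.array([0.05, 0.1, 0.5, 0.2, 0.1, 0.05])
print(best_span(start, end))  # ((1, 2), 0.3)
```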
# Abstract
DuReader is an end-to-end neural network model for machine reading comprehension style question answering, which aims to answer questions from given passages. We first match the question and passages with a bidirectional attention flow network to obtain the question-aware passage representations. Then we employ a pointer network to locate the positions of answers in the passages. Our experimental evaluations show that the DuReader model achieves state-of-the-art results on the DuReader dataset.
# Dataset
The DuReader dataset is a new large-scale, real-world, human-sourced MRC dataset in Chinese. DuReader focuses on real-world open-domain question answering. Its advantages over existing datasets are as follows:
- Real question
@@ -9,7 +9,7 @@ DuReader Dataset is a new large-scale real-world and human sourced MRC dataset i
- Rich annotation
# Network
The DuReader model is inspired by three classic reading comprehension models ([BiDAF](https://arxiv.org/abs/1611.01603), [Match-LSTM](https://arxiv.org/abs/1608.07905), [R-NET](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf)).
The DuReader model is a hierarchical multi-stage process and consists of five layers.
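For intuition, a toy NumPy sketch of the bidirectional (context-to-query and query-to-context) attention idea with made-up dimensions; the actual model uses learned similarity scoring and recurrent encoders:
```
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, J, d = 5, 3, 4          # passage length, question length, hidden size
H = np.random.randn(T, d)  # encoded passage
U = np.random.randn(J, d)  # encoded question

S = H @ U.T                     # similarity matrix, shape (T, J)
c2q = softmax(S, axis=1) @ U    # context-to-query attention, (T, d)
b = softmax(S.max(axis=1))      # attention weights over passage tokens
q2c = np.tile(b @ H, (T, 1))    # query-to-context vector, tiled to (T, d)

# query-aware passage representation (BiDAF-style feature concatenation)
G = np.concatenate([H, c2q, H * c2q, H * q2c], axis=1)  # (T, 4d)
print(G.shape)  # (5, 16)
```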
@@ -63,7 +63,7 @@ sh run.sh --evaluate --load_dir models/1
You can also predict answers for the samples in a given file using the following command:
```
sh run.sh --predict --load_dir models/1 --testset ../data/preprocessed/testset/search.dev.json
```
By default, the results are saved in the `../data/results/` folder. You can change this by specifying `--result_dir DIR_PATH`.
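The result file contains one JSON object per line (see the writer loop in the validation code further down). A small inspection sketch, with the path and keys assumed from the DuReader result format rather than taken from this patch:
```
import json

# hypothetical path under the default --result_dir; keys assumed
with open('../data/results/test_result.json') as fin:
    for line in fin:
        pred = json.loads(line)
        print(pred['question_id'], pred['answers'][:1])
```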
@@ -58,6 +58,11 @@ def parse_args():
        type=float,
        default=0.001,
        help="Learning rate used to train the model. (default: %(default)f)")
    parser.add_argument(
        "--weight_decay",
        type=float,
        default=0.0001,
        help="Weight decay. (default: %(default)f)")
    parser.add_argument(
        "--use_gpu",
        type=distutils.util.strtobool,
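For intuition about the new flag: L2 weight decay augments each parameter's gradient with a term proportional to the parameter itself, shrinking weights every step. A toy illustration (not the fluid internals; conventions differ by constant factors across frameworks):
```
# With plain SGD and L2 decay coefficient c, each update is:
#   w <- w - lr * (grad + c * w)
lr, c = 0.001, 0.0001
w, grad = 0.5, 0.2
w -= lr * (grad + c * w)
print(w)  # 0.49979995
```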
......
@@ -21,6 +21,7 @@ import time
import os
import random
import json
import six
import paddle
import paddle.fluid as fluid
@@ -209,6 +210,8 @@ def validation(inference_program, avg_cost, s_probs, e_probs, feed_order, place,
            'yesno_answers': []
        })
    if args.result_dir is not None and args.result_name is not None:
        if not os.path.exists(args.result_dir):
            os.makedirs(args.result_dir)
        result_file = os.path.join(args.result_dir, args.result_name + '.json')
        with open(result_file, 'w') as fout:
            for pred_answer in pred_answers:
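A note on the directory check above: between `os.path.exists` and `os.makedirs`, another process could create the directory, making `makedirs` raise. A py2/py3-safe sketch that tolerates that race (a hypothetical helper, not part of this patch):
```
import errno
import os

def ensure_dir(path):
    # Tolerate the directory already existing (including the race where
    # another process creates it first); re-raise real errors.
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
```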
@@ -235,7 +238,10 @@ def validation(inference_program, avg_cost, s_probs, e_probs, feed_order, place,
def train(logger, args):
    logger.info('Load data_set and vocab...')
    with open(os.path.join(args.vocab_dir, 'vocab.data'), 'rb') as fin:
        if six.PY2:
            vocab = pickle.load(fin)
        else:
            vocab = pickle.load(fin, encoding='bytes')
    logger.info('vocab size is {} and embed dim is {}'.format(vocab.size(
    ), vocab.embed_dim))
    brc_data = BRCDataset(args.max_p_num, args.max_p_len, args.max_q_len,
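The branch above exists because `vocab.data` was pickled under Python 2, and Python 3's `pickle.load` needs `encoding='bytes'` to read such files. A standalone sketch of the same pattern (file name hypothetical):
```
import pickle
import six

def load_py2_pickle(path):
    # Python 2's pickle.load has no `encoding` kwarg; Python 3 needs
    # encoding='bytes' to read py2-era pickles containing byte strings.
    with open(path, 'rb') as fin:
        if six.PY2:
            return pickle.load(fin)
        return pickle.load(fin, encoding='bytes')

# vocab = load_py2_pickle('vocab.data')  # hypothetical usage
```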
@@ -259,13 +265,20 @@ def train(logger, args):
    # build optimizer
    if args.optim == 'sgd':
        optimizer = fluid.optimizer.SGD(
            learning_rate=args.learning_rate,
            regularization=fluid.regularizer.L2DecayRegularizer(
                regularization_coeff=args.weight_decay))
    elif args.optim == 'adam':
        optimizer = fluid.optimizer.Adam(
            learning_rate=args.learning_rate,
            regularization=fluid.regularizer.L2DecayRegularizer(
                regularization_coeff=args.weight_decay))
    elif args.optim == 'rprop':
        optimizer = fluid.optimizer.RMSPropOptimizer(
            learning_rate=args.learning_rate,
            regularization=fluid.regularizer.L2DecayRegularizer(
                regularization_coeff=args.weight_decay))
    else:
        logger.error('Unsupported optimizer: {}'.format(args.optim))
        exit(-1)
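The three branches above repeat the regularizer construction. A possible refactor (a sketch against the same fluid API, not part of this patch) builds it once and selects the optimizer class from a table:
```
import paddle.fluid as fluid

def build_optimizer(optim, learning_rate, weight_decay):
    # Shared L2 regularizer, constructed once instead of per branch.
    reg = fluid.regularizer.L2DecayRegularizer(
        regularization_coeff=weight_decay)
    classes = {
        'sgd': fluid.optimizer.SGD,
        'adam': fluid.optimizer.Adam,
        'rprop': fluid.optimizer.RMSPropOptimizer,
    }
    if optim not in classes:
        raise ValueError('Unsupported optimizer: {}'.format(optim))
    return classes[optim](learning_rate=learning_rate, regularization=reg)
```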
......
@@ -528,10 +528,9 @@ def main(args):
    except AssertionError as ae:
        err = ae
    print(json.dumps(
        format_metrics(metrics, args.task, err), ensure_ascii=False).encode(
            'utf8'))
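One caveat: under Python 3, printing the result of `.encode('utf8')` displays a bytes literal like `b'...'`. A py2/py3-safe variant (a sketch that assumes `six` is importable in this module; names taken from the surrounding code):
```
import six

out = json.dumps(format_metrics(metrics, args.task, err), ensure_ascii=False)
if six.PY2:
    out = out.encode('utf8')  # py2 consoles need bytes for non-ASCII output
print(out)
```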
if __name__ == '__main__':
......