diff --git a/fluid/README.cn.rst b/fluid/README.cn.rst
index bca8679df1eb35d4ac164f53c0ed3fde9f23dbf5..e83669e9f6d69187196bc3b2727ce04379db7f5e 100644
--- a/fluid/README.cn.rst
+++ b/fluid/README.cn.rst
@@ -167,3 +167,12 @@ SimNet is a semantic matching framework developed in-house by Baidu's Natural Language Processing Department in 2013
 - `SimNet in PaddlePaddle Fluid <https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md>`__
+
+Machine Reading Comprehension
+-----------------------------
+
+Machine reading comprehension (MRC) is one of the core tasks in natural language processing (NLP): the ultimate goal is to let machines read text the way humans do, distill the information in it, and answer related questions. The wide adoption of deep learning in NLP has greatly improved machine reading comprehension in recent years, but current MRC research still relies on artificially constructed datasets and relatively simple questions, which is still clearly short of the data humans actually handle. Large-scale, real-world training data is therefore urgently needed to push MRC further forward.
+
+The Baidu reading comprehension dataset is a real-world dataset open-sourced by Baidu's Natural Language Processing Department. All questions and passages come from real data (the Baidu search engine and the Baidu Zhidao Q&A community), and the answers are written by humans. Each question is paired with multiple answers; the dataset contains 200k questions, 1000k passages, and 420k answers, making it the largest Chinese MRC dataset to date. Baidu has also open-sourced the corresponding reading comprehension model, DuReader. It follows the layered network structure that is common today: a bidirectional attention mechanism captures the interaction between the question and the passages to build query-aware passage representations, from which a pointer network predicts the answer span.
+
+- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/fluid/machine_reading_comprehension/README.md>`__
diff --git a/fluid/README.md b/fluid/README.md
index dc20ddb14c644c1b1ec0737e8caf7404896c36dd..0df29b9ac734030bf16a765c8d85b1f911aaaddb 100644
--- a/fluid/README.md
+++ b/fluid/README.md
@@ -136,3 +136,12 @@ AnyQ
 SimNet is a semantic matching framework developed in-house by Baidu's Natural Language Processing Department in 2013 and is widely used across Baidu products. It provides core network structures such as BOW, CNN, RNN, and MM-DNN, and also integrates mainstream academic semantic matching models such as MatchPyramid, MV-LSTM, and K-NRM. Models built with SimNet can easily be added to the AnyQ system to strengthen its semantic matching capability.
 
 - [SimNet in PaddlePaddle Fluid](https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md)
+
+Machine Reading Comprehension
+----------
+
+Machine reading comprehension (MRC) is one of the core tasks in natural language processing (NLP): the ultimate goal is to let machines read text the way humans do, distill the information in it, and answer related questions. The wide adoption of deep learning in NLP has greatly improved machine reading comprehension in recent years, but current MRC research still relies on artificially constructed datasets and relatively simple questions, which is still clearly short of the data humans actually handle. Large-scale, real-world training data is therefore urgently needed to push MRC further forward.
+
+The Baidu reading comprehension dataset is a real-world dataset open-sourced by Baidu's Natural Language Processing Department. All questions and passages come from real data (the Baidu search engine and the Baidu Zhidao Q&A community), and the answers are written by humans. Each question is paired with multiple answers; the dataset contains 200k questions, 1000k passages, and 420k answers, making it the largest Chinese MRC dataset to date. Baidu has also open-sourced the corresponding reading comprehension model, DuReader. It follows the layered network structure that is common today: a bidirectional attention mechanism captures the interaction between the question and the passages to build query-aware passage representations, from which a pointer network predicts the answer span.
+
+- [DuReader in PaddlePaddle Fluid](https://github.com/PaddlePaddle/models/blob/develop/fluid/machine_reading_comprehension/README.md)
diff --git a/fluid/machine_reading_comprehesion/README.md b/fluid/machine_reading_comprehension/README.md
similarity index 81%
rename from fluid/machine_reading_comprehesion/README.md
rename to fluid/machine_reading_comprehension/README.md
index b46d54cf41df66fc26e0f1c597e5cfb7b32e11cd..884c15058e9b5601c7754e27d1b106fd41e2ac27 100644
--- a/fluid/machine_reading_comprehesion/README.md
+++ b/fluid/machine_reading_comprehension/README.md
@@ -1,5 +1,5 @@
 # Abstract
-Dureader is an end-to-end neural network model for machine reading comprehesion style question answering, which aims to anser questions from given passages. We first match the question and passage with a bidireactional attention flow network to obtrain the question-aware passages represenation. Then we employ a pointer network to locate the positions of answers from passages. Our experimental evalutions show that DuReader model achieves the state-of-the-art results in DuReader Dadaset.
+DuReader is an end-to-end neural network model for machine reading comprehension style question answering, which aims to answer questions from given passages. We first match the question and passages with a bidirectional attention flow network to obtain question-aware passage representations. Then we employ a pointer network to locate the positions of the answers in the passages. Our experimental evaluations show that the DuReader model achieves state-of-the-art results on the DuReader dataset.
 # Dataset
 DuReader Dataset is a new large-scale real-world and human sourced MRC dataset in Chinese. DuReader focuses on real-world open-domain question answering. The advantages of DuReader over existing datasets are concluded as follows:
 - Real question
@@ -9,7 +9,7 @@ DuReader Dataset is a new large-scale real-world and human sourced MRC dataset i
 - Rich annotation
 
 # Network
-DuReader is inspired by 3 classic reading comprehension models([BiDAF](https://arxiv.org/abs/1611.01603), [Match-LSTM](https://arxiv.org/abs/1608.07905), [R-NET](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf)).
+The DuReader model is inspired by three classic reading comprehension models ([BiDAF](https://arxiv.org/abs/1611.01603), [Match-LSTM](https://arxiv.org/abs/1608.07905), [R-NET](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf)).
 
 DuReader model is a hierarchical multi-stage process and consists of five layers
@@ -63,7 +63,7 @@ sh run.sh --evaluate --load_dir models/1
 You can also predict answers for the samples in some files using the following command:
 ```
-sh run.sh --predict --load_dir models/1 --testset ../data/demo/devset/search.dev.json
+sh run.sh --predict --load_dir models/1 --testset ../data/preprocessed/testset/search.dev.json
 ```
 By default, the results are saved at `../data/results/` folder. You can change this by specifying `--result_dir DIR_PATH`.
diff --git a/fluid/machine_reading_comprehesion/args.py b/fluid/machine_reading_comprehension/args.py
similarity index 96%
rename from fluid/machine_reading_comprehesion/args.py
rename to fluid/machine_reading_comprehension/args.py
index 228375584eec4d9602bb77a853cfd61c4016e909..e37aad9a929ba2aaa122d328206a040779b4527f 100644
--- a/fluid/machine_reading_comprehesion/args.py
+++ b/fluid/machine_reading_comprehension/args.py
@@ -58,6 +58,11 @@ def parse_args():
         type=float,
         default=0.001,
         help="Learning rate used to train the model. (default: %(default)f)")
+    parser.add_argument(
+        "--weight_decay",
+        type=float,
+        default=0.0001,
+        help="Weight decay. (default: %(default)f)")
(default: %(default)f)") parser.add_argument( "--use_gpu", type=distutils.util.strtobool, diff --git a/fluid/machine_reading_comprehesion/data/download.sh b/fluid/machine_reading_comprehension/data/download.sh similarity index 100% rename from fluid/machine_reading_comprehesion/data/download.sh rename to fluid/machine_reading_comprehension/data/download.sh diff --git a/fluid/machine_reading_comprehesion/data/md5sum.txt b/fluid/machine_reading_comprehension/data/md5sum.txt similarity index 100% rename from fluid/machine_reading_comprehesion/data/md5sum.txt rename to fluid/machine_reading_comprehension/data/md5sum.txt diff --git a/fluid/machine_reading_comprehesion/dataset.py b/fluid/machine_reading_comprehension/dataset.py similarity index 100% rename from fluid/machine_reading_comprehesion/dataset.py rename to fluid/machine_reading_comprehension/dataset.py diff --git a/fluid/machine_reading_comprehesion/rc_model.py b/fluid/machine_reading_comprehension/rc_model.py similarity index 100% rename from fluid/machine_reading_comprehesion/rc_model.py rename to fluid/machine_reading_comprehension/rc_model.py diff --git a/fluid/machine_reading_comprehesion/run.py b/fluid/machine_reading_comprehension/run.py similarity index 96% rename from fluid/machine_reading_comprehesion/run.py rename to fluid/machine_reading_comprehension/run.py index bae54d42856787ef2c17481281ac6d14cb074812..1b68d79f284104097da6b208aee4d763c4f3dcbc 100644 --- a/fluid/machine_reading_comprehesion/run.py +++ b/fluid/machine_reading_comprehension/run.py @@ -21,6 +21,7 @@ import time import os import random import json +import six import paddle import paddle.fluid as fluid @@ -209,6 +210,8 @@ def validation(inference_program, avg_cost, s_probs, e_probs, feed_order, place, 'yesno_answers': [] }) if args.result_dir is not None and args.result_name is not None: + if not os.path.exists(args.result_dir): + os.makedirs(args.result_dir) result_file = os.path.join(args.result_dir, args.result_name + '.json') with open(result_file, 'w') as fout: for pred_answer in pred_answers: @@ -235,7 +238,10 @@ def validation(inference_program, avg_cost, s_probs, e_probs, feed_order, place, def train(logger, args): logger.info('Load data_set and vocab...') with open(os.path.join(args.vocab_dir, 'vocab.data'), 'rb') as fin: - vocab = pickle.load(fin) + if six.PY2: + vocab = pickle.load(fin) + else: + vocab = pickle.load(fin, encoding='bytes') logger.info('vocab size is {} and embed dim is {}'.format(vocab.size( ), vocab.embed_dim)) brc_data = BRCDataset(args.max_p_num, args.max_p_len, args.max_q_len, @@ -259,13 +265,20 @@ def train(logger, args): # build optimizer if args.optim == 'sgd': optimizer = fluid.optimizer.SGD( - learning_rate=args.learning_rate) + learning_rate=args.learning_rate, + regularization=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=args.weight_decay)) elif args.optim == 'adam': optimizer = fluid.optimizer.Adam( - learning_rate=args.learning_rate) + learning_rate=args.learning_rate, + regularization=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=args.weight_decay)) + elif args.optim == 'rprop': optimizer = fluid.optimizer.RMSPropOptimizer( - learning_rate=args.learning_rate) + learning_rate=args.learning_rate, + regularization=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=args.weight_decay)) else: logger.error('Unsupported optimizer: {}'.format(args.optim)) exit(-1) diff --git a/fluid/machine_reading_comprehesion/run.sh b/fluid/machine_reading_comprehension/run.sh similarity index 100% 
rename from fluid/machine_reading_comprehesion/run.sh
rename to fluid/machine_reading_comprehension/run.sh
diff --git a/fluid/machine_reading_comprehesion/utils/__init__.py b/fluid/machine_reading_comprehension/utils/__init__.py
similarity index 100%
rename from fluid/machine_reading_comprehesion/utils/__init__.py
rename to fluid/machine_reading_comprehension/utils/__init__.py
diff --git a/fluid/machine_reading_comprehesion/utils/download_thirdparty.sh b/fluid/machine_reading_comprehension/utils/download_thirdparty.sh
similarity index 100%
rename from fluid/machine_reading_comprehesion/utils/download_thirdparty.sh
rename to fluid/machine_reading_comprehension/utils/download_thirdparty.sh
diff --git a/fluid/machine_reading_comprehesion/utils/dureader_eval.py b/fluid/machine_reading_comprehension/utils/dureader_eval.py
similarity index 99%
rename from fluid/machine_reading_comprehesion/utils/dureader_eval.py
rename to fluid/machine_reading_comprehension/utils/dureader_eval.py
index d60988871a63ce304fc1afbf0af7b1c1801e2161..fe43989ca955a01d6cb8b791a085ee33b8fff6c5 100644
--- a/fluid/machine_reading_comprehesion/utils/dureader_eval.py
+++ b/fluid/machine_reading_comprehension/utils/dureader_eval.py
@@ -528,10 +528,9 @@ def main(args):
     except AssertionError as ae:
         err = ae
-    print(
-        json.dumps(
-            format_metrics(metrics, args.task, err), ensure_ascii=False).encode(
-                'utf8'))
+    print(json.dumps(
+        format_metrics(metrics, args.task, err), ensure_ascii=False).encode(
+            'utf8'))
 
 
 if __name__ == '__main__':
diff --git a/fluid/machine_reading_comprehesion/utils/get_vocab.py b/fluid/machine_reading_comprehension/utils/get_vocab.py
similarity index 100%
rename from fluid/machine_reading_comprehesion/utils/get_vocab.py
rename to fluid/machine_reading_comprehension/utils/get_vocab.py
diff --git a/fluid/machine_reading_comprehesion/utils/marco_tokenize_data.py b/fluid/machine_reading_comprehension/utils/marco_tokenize_data.py
similarity index 100%
rename from fluid/machine_reading_comprehesion/utils/marco_tokenize_data.py
rename to fluid/machine_reading_comprehension/utils/marco_tokenize_data.py
diff --git a/fluid/machine_reading_comprehesion/utils/marcov1_to_dureader.py b/fluid/machine_reading_comprehension/utils/marcov1_to_dureader.py
similarity index 100%
rename from fluid/machine_reading_comprehesion/utils/marcov1_to_dureader.py
rename to fluid/machine_reading_comprehension/utils/marcov1_to_dureader.py
diff --git a/fluid/machine_reading_comprehesion/utils/marcov2_to_v1_tojsonl.py b/fluid/machine_reading_comprehension/utils/marcov2_to_v1_tojsonl.py
similarity index 100%
rename from fluid/machine_reading_comprehesion/utils/marcov2_to_v1_tojsonl.py
rename to fluid/machine_reading_comprehension/utils/marcov2_to_v1_tojsonl.py
diff --git a/fluid/machine_reading_comprehesion/utils/preprocess.py b/fluid/machine_reading_comprehension/utils/preprocess.py
similarity index 100%
rename from fluid/machine_reading_comprehesion/utils/preprocess.py
rename to fluid/machine_reading_comprehension/utils/preprocess.py
diff --git a/fluid/machine_reading_comprehesion/utils/run_marco2dureader_preprocess.sh b/fluid/machine_reading_comprehension/utils/run_marco2dureader_preprocess.sh
similarity index 100%
rename from fluid/machine_reading_comprehesion/utils/run_marco2dureader_preprocess.sh
rename to fluid/machine_reading_comprehension/utils/run_marco2dureader_preprocess.sh
diff --git a/fluid/machine_reading_comprehesion/vocab.py b/fluid/machine_reading_comprehension/vocab.py
similarity index 100%
rename from fluid/machine_reading_comprehesion/vocab.py
rename to fluid/machine_reading_comprehension/vocab.py
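
A note on the `--weight_decay` flag that args.py gains in this patch: run.py threads it into every supported Fluid optimizer through `fluid.regularizer.L2DecayRegularizer`. The sketch below isolates that pattern so it can be tried outside the training script; the helper name `build_optimizer` is illustrative rather than part of this repo, and it assumes the PaddlePaddle Fluid 1.x API used elsewhere in the patch.
```
import paddle.fluid as fluid


def build_optimizer(optim, learning_rate, weight_decay):
    # Same wiring as the run.py hunk above: each optimizer gets an L2
    # weight-decay regularizer built from the --weight_decay coefficient.
    regularizer = fluid.regularizer.L2DecayRegularizer(
        regularization_coeff=weight_decay)
    if optim == 'sgd':
        return fluid.optimizer.SGD(
            learning_rate=learning_rate, regularization=regularizer)
    elif optim == 'adam':
        return fluid.optimizer.Adam(
            learning_rate=learning_rate, regularization=regularizer)
    elif optim == 'rprop':
        return fluid.optimizer.RMSPropOptimizer(
            learning_rate=learning_rate, regularization=regularizer)
    raise ValueError('Unsupported optimizer: {}'.format(optim))


# The defaults from args.py: learning_rate=0.001, weight_decay=0.0001.
optimizer = build_optimizer('adam', 0.001, 0.0001)
```
Sharing one regularizer instance keeps the three branches identical apart from the optimizer class, which is the only thing the original `if/elif` chain varies.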
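
Both README additions describe DuReader as building query-aware passage representations and then using a pointer network to predict the answer span. The final decoding step can be illustrated on its own: given per-token start and end probabilities, pick the `(start, end)` pair with the highest joint score subject to `start <= end`. This is a generic sketch of that idea in plain Python, not code from this repo and not necessarily the exact decoding rule the DuReader model uses.
```
def best_span(start_probs, end_probs, max_span_len=None):
    """Return (start, end, score) maximising start_probs[s] * end_probs[e] with s <= e."""
    best = (0, 0, 0.0)
    for s, p_start in enumerate(start_probs):
        # Optionally cap the span length, as span-prediction models often do.
        last = len(end_probs) if max_span_len is None else min(len(end_probs), s + max_span_len)
        for e in range(s, last):
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy passage of five tokens where tokens 2..3 form the most likely answer span.
start = [0.05, 0.10, 0.60, 0.15, 0.10]
end = [0.05, 0.05, 0.20, 0.55, 0.15]
print(best_span(start, end))  # highest-scoring span is (2, 3)
```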