Unverified commit 8dc94b83 authored by xuezhong, committed by GitHub

Merge pull request #1989 from xuezhong/dureader_v2.0

dureader update to 2.0
# Abstract
DuReader is an end-to-end neural network model for machine reading comprehension style question answering: given documents and a question, it locates the answer to the question within the documents. We first match the question and passages with a bidirectional attention flow network to obtain question-aware passage representations in a shared vector space, then employ a pointer network to locate the positions of answers in the passages. Our experimental evaluations show that the DuReader model achieves state-of-the-art results on the DuReader dataset.
# Dataset
DuReader Dataset is a new large-scale, real-world, human-sourced MRC dataset in Chinese. DuReader focuses on real-world open-domain question answering. Its advantages over existing datasets are as follows:
- Real question
- Real article
- Real answer
- Real application scenario
- Rich annotation
# Network
The DuReader model is inspired by three classic reading comprehension models ([BiDAF](https://arxiv.org/abs/1611.01603), [Match-LSTM](https://arxiv.org/abs/1608.07905), [R-NET](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf)) and mainly implements the model structures of [BiDAF](https://arxiv.org/abs/1611.01603) and [Match-LSTM](https://arxiv.org/abs/1608.07905).
The DuReader model is a hierarchical multi-stage process and consists of five layers:
- **Word Embedding Layer** maps each word to a vector using a pre-trained word embedding model.
- **Encoding Layer** extracts context information for each position in the question and passages with a bi-directional LSTM network.
- **Attention Flow Layer** couples the question and context vectors and produces a set of question-aware feature vectors for each word in the context (a simplified sketch of this computation is given after this list). Please refer to [BiDAF](https://arxiv.org/abs/1611.01603) for more details.
- **Fusion Layer** employs a bi-directional LSTM to capture the interaction among context words independent of the question.
- **Decode Layer** employs an answer pointer network with attention pooling of the question to locate the positions of answers in the passages. Please refer to [Match-LSTM](https://arxiv.org/abs/1608.07905) and [R-NET](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf) for more details.
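The following is a minimal NumPy sketch of the Attention Flow Layer (context-to-query and query-to-context attention). It is illustrative only and is not the PaddlePaddle implementation in `rc_model.py`; the plain dot-product similarity, the shapes and the function names are simplifying assumptions.
```
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_flow(H, U):
    """Simplified BiDAF-style attention flow.
    H: context (passage) encodings, shape [T, d]
    U: question encodings, shape [J, d]
    Returns a question-aware context representation G of shape [T, 4d].
    """
    # similarity between every context position and every question position
    # (a plain dot product here; the paper uses a trainable similarity function)
    S = H @ U.T                                   # [T, J]
    # context-to-query attention: each context word attends over the question
    a = softmax(S, axis=1)                        # [T, J]
    U_tilde = a @ U                               # [T, d]
    # query-to-context attention: weight context words by their best question match
    b = softmax(S.max(axis=1), axis=0)            # [T]
    h_tilde = (b[:, None] * H).sum(axis=0)        # [d]
    H_tilde = np.tile(h_tilde, (H.shape[0], 1))   # [T, d]
    # merge into question-aware context features
    return np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=1)

# toy usage with random encodings
T, J, d = 6, 4, 8
G = attention_flow(np.random.rand(T, d), np.random.rand(J, d))
print(G.shape)  # (6, 32)
```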
## Data Preparation
### Download the Dataset
To download the DuReader dataset, run:
```
cd data && bash download.sh
```
The model uses the DuReader dataset by default, a real-world Chinese reading comprehension dataset open-sourced by Baidu. For more details, see the [DuReader Dataset Homepage](https://ai.baidu.com//broad/subordinate?dataset=dureader).
### Download Third-party Dependencies
We use BLEU and ROUGE as evaluation metrics. The calculation of these metrics relies on the scoring scripts under [coco-caption](https://github.com/tylin/coco-caption), which can be downloaded with:
```
cd utils && bash download_thirdparty.sh
```
### Environment Requirements
The model has been tested with PaddlePaddle 1.2, so we recommend running it on version 1.2. For PaddlePaddle installation instructions, see the [PaddlePaddle Homepage](http://paddlepaddle.org).
## Model Training
### Paragraph Extraction
In the paragraph extraction stage, the document content is refined using document relevance scores, and the extracted results are written to the `data/extracted/` directory. If you are testing with the demo data, you can skip this step. If you use the DuReader dataset, specify the datasets to extract from as follows:
```
sh run.sh --para_extraction --trainset data/preprocessed/trainset/zhidao.train.json data/preprocessed/trainset/search.train.json --devset data/preprocessed/devset/zhidao.dev.json data/preprocessed/devset/search.dev.json --testset data/preprocessed/testset/zhidao.test.json data/preprocessed/testset/search.test.json
```
The `trainset`/`devset`/`testset` arguments specify the training, validation and test datasets, respectively (the same applies below). A minimal sketch of the paragraph-scoring idea follows.
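As implemented in `paragraph_extraction.py` (included later in this commit), the relevance score is essentially a token-overlap F1 between each paragraph and the question, and the highest-scoring paragraphs are kept. The sketch below is illustrative only; the helper names `f1` and `select_paragraphs` are hypothetical and not part of the repository.
```
from collections import Counter

def f1(para_tokens, question_tokens):
    """Token-overlap F1 between one paragraph and the question."""
    common = Counter(para_tokens) & Counter(question_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    p = num_same / len(para_tokens)
    r = num_same / len(question_tokens)
    return 2 * p * r / (p + r)

def select_paragraphs(paragraphs, question_tokens, top_n=3):
    """Keep the top_n paragraphs most relevant to the question."""
    ranked = sorted(paragraphs, key=lambda para: -f1(para, question_tokens))
    return ranked[:top_n]

# toy usage with pre-tokenized text
question = ["机器", "阅读", "理解"]
paras = [["天气", "不错"], ["机器", "阅读", "理解", "模型"], ["阅读", "新闻"]]
print(select_paragraphs(paras, question, top_n=2))
```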
### Vocabulary Preparation
Before training the model, make sure the data is ready. In this preparation step, a vocabulary is built from all data files; it is used in later training and prediction. You can generate the vocabulary with:
```
sh run.sh --prepare
```
The command above uses the demo data by default. To use the DuReader dataset, specify the data files as follows:
```
sh run.sh --prepare --trainset data/extracted/trainset/zhidao.train.json data/extracted/trainset/search.train.json --devset data/extracted/devset/zhidao.dev.json data/extracted/devset/search.dev.json --testset data/extracted/testset/zhidao.test.json data/extracted/testset/search.test.json
```
The `trainset`/`devset`/`testset` arguments specify the training, validation and test datasets, respectively.
### Training
To start training, run:
```
sh run.sh --train
```
You can change the training configuration through hyper-parameters, e.g. set the learning rate with `--learning_rate NUM` or the number of training passes with `--pass_num NUM`.
During training, performance on the dev set is evaluated every fixed number of iterations; the interval is set with `--dev_interval NUM`.
### Evaluation
After training finishes, you can evaluate a trained model on the dev set and obtain the evaluation metrics with:
```
sh run.sh --evaluate --load_dir data/models/1
```
where `--load_dir data/models/1` is the checkpoint directory of the model.
### Prediction
To predict answers for question-document data directly with a trained model, run:
```
sh run.sh --predict --load_dir data/models/1 --testset data/extracted/testset/search.dev.json
```
where `--testset` specifies the dataset used for prediction. The generated answers are written to the `data/results/` directory by default; this can be changed with `--result_dir DIR_PATH`.
### Results
Dev set ROUGE-L: 47.65; test set ROUGE-L: 54.58.
These numbers were obtained on P40 GPUs with 4 cards and a batch size of 4*32. With a single card the metrics may drop slightly, but the dev-set ROUGE-L should still be no lower than 47.
## References
[Machine Comprehension Using Match-LSTM and Answer Pointer](https://arxiv.org/abs/1608.07905)
[Bidirectional Attention Flow for Machine Comprehension](https://arxiv.org/abs/1611.01603)
...@@ -27,102 +27,65 @@ def parse_args():
        action='store_true',
        help='create the directories, prepare the vocabulary and embeddings')
    parser.add_argument('--train', action='store_true', help='train the model')
    parser.add_argument('--evaluate', action='store_true', help='evaluate the model on dev set')
    parser.add_argument('--predict', action='store_true',
                        help='predict the answers for test set with trained model')

    parser.add_argument("--embed_size", type=int, default=300,
                        help="The dimension of embedding table. (default: %(default)d)")
    parser.add_argument("--hidden_size", type=int, default=150,
                        help="The size of rnn hidden unit. (default: %(default)d)")
    parser.add_argument("--learning_rate", type=float, default=0.001,
                        help="Learning rate used to train the model. (default: %(default)f)")
    parser.add_argument('--optim', default='adam', help='optimizer type')
    parser.add_argument("--weight_decay", type=float, default=0.0001,
                        help="Weight decay. (default: %(default)f)")
    parser.add_argument('--drop_rate', type=float, default=0.0, help="Dropout probability")
    parser.add_argument('--random_seed', type=int, default=123)
    parser.add_argument("--batch_size", type=int, default=32,
                        help="The sequence number of a mini-batch data. (default: %(default)d)")
    parser.add_argument("--pass_num", type=int, default=5,
                        help="The number of epochs to train. (default: %(default)d)")
    parser.add_argument("--use_gpu", type=distutils.util.strtobool, default=True,
                        help="Whether to use gpu. (default: %(default)d)")
    parser.add_argument("--log_interval", type=int, default=50,
                        help="log the train loss every n batches. (default: %(default)d)")

    parser.add_argument('--max_p_num', type=int, default=5)
    parser.add_argument('--max_a_len', type=int, default=200)
    parser.add_argument('--max_p_len', type=int, default=500)
    parser.add_argument('--max_q_len', type=int, default=60)
    parser.add_argument('--doc_num', type=int, default=5)

    parser.add_argument('--vocab_dir', default='data/vocab', help='vocabulary')
    parser.add_argument("--save_dir", type=str, default="data/models",
                        help="Specify the path to save trained models.")
    parser.add_argument("--save_interval", type=int, default=1,
                        help="Save the trained model every n passes. (default: %(default)d)")
    parser.add_argument("--load_dir", type=str, default="",
                        help="Specify the path to load trained models.")
    parser.add_argument('--log_path',
                        help='path of the log file. If not set, logs are printed to console')
    parser.add_argument('--result_dir', default='data/results/',
                        help='the dir to output the results')
    parser.add_argument('--result_name', default='test_result',
                        help='the file name of the predicted results')

    parser.add_argument('--trainset', nargs='+',
                        default=['data/demo/trainset/search.train.json'],
                        help='train dataset')
    parser.add_argument('--devset', nargs='+',
                        default=['data/demo/devset/search.dev.json'],
                        help='dev dataset')
    parser.add_argument('--testset', nargs='+',
                        default=['data/demo/testset/search.test.json'],
                        help='test dataset')
    parser.add_argument("--enable_ce", action='store_true',
                        help="If set, run the task with continuous evaluation logs.")
    parser.add_argument('--para_print', action='store_true', help="Print debug info")
    parser.add_argument("--dev_interval", type=int, default=-1,
                        help="evaluate on dev set loss every n batches. (default: %(default)d)")

    args = parser.parse_args()
    return args
...@@ -20,11 +20,14 @@ if [[ -d preprocessed ]] && [[ -d raw ]]; then
    echo "data exist"
    exit 0
else
    wget -c http://dureader.gz.bcebos.com/demo.zip
    wget -c https://aipedataset.cdn.bcebos.com/dureader/dureader_raw.zip
    wget -c https://aipedataset.cdn.bcebos.com/dureader/dureader_preprocessed.zip
fi

if md5sum --status -c md5sum.txt; then
    unzip demo.zip
    unzip dureader_raw.zip
    unzip dureader_preprocessed.zip
else
    echo "download data error!" >> /dev/stderr
......
0ca0510fa625d35d902b73033c4ba9d8 demo.zip
dc7658b8cdf4f94b8714d130b7d15196 dureader_raw.zip
3db9a32e5a7c5375a604a70687b45479 dureader_preprocessed.zip
...@@ -157,7 +157,8 @@ class BRCDataset(object):
            passade_idx_offset = sum(batch_data['passage_num'])
            batch_data['passage_num'].append(count)
            gold_passage_offset = 0
            if 'answer_passages' in sample and len(sample['answer_passages']) and \
                    sample['answer_passages'][0] < len(sample['documents']):
                for i in range(sample['answer_passages'][0]):
                    gold_passage_offset += len(batch_data['passage_token_ids'][
                        passade_idx_offset + i])
......
#!/usr/bin/python
#-*- coding:utf-8 -*-
import sys
if sys.version[0] == '2':
reload(sys)
sys.setdefaultencoding("utf-8")
import json
import copy
from preprocess import metric_max_over_ground_truths, f1_score
def compute_paragraph_score(sample):
"""
For each paragraph, compute the f1 score compared with the question
Args:
sample: a sample in the dataset.
Returns:
None
Raises:
None
"""
question = sample["segmented_question"]
for doc in sample['documents']:
doc['segmented_paragraphs_scores'] = []
for p_idx, para_tokens in enumerate(doc['segmented_paragraphs']):
if len(question) > 0:
related_score = metric_max_over_ground_truths(f1_score,
para_tokens,
question)
else:
related_score = 0.0
doc['segmented_paragraphs_scores'].append(related_score)
def dup_remove(doc):
"""
For each document, remove the duplicated paragraphs
Args:
doc: a doc in the sample
Returns:
bool
Raises:
None
"""
paragraphs_his = {}
del_ids = []
para_id = None
if 'most_related_para' in doc:
para_id = doc['most_related_para']
doc['paragraphs_length'] = []
for p_idx, (segmented_paragraph, paragraph_score) in \
enumerate(zip(doc["segmented_paragraphs"], doc["segmented_paragraphs_scores"])):
doc['paragraphs_length'].append(len(segmented_paragraph))
paragraph = ''.join(segmented_paragraph)
if paragraph in paragraphs_his:
del_ids.append(p_idx)
if p_idx == para_id:
para_id = paragraphs_his[paragraph]
continue
paragraphs_his[paragraph] = p_idx
# delete
prev_del_num = 0
del_num = 0
for p_idx in del_ids:
if p_idx < para_id:
prev_del_num += 1
del doc["segmented_paragraphs"][p_idx - del_num]
del doc["segmented_paragraphs_scores"][p_idx - del_num]
del doc['paragraphs_length'][p_idx - del_num]
del_num += 1
if len(del_ids) != 0:
if 'most_related_para' in doc:
doc['most_related_para'] = para_id - prev_del_num
doc['paragraphs'] = []
for segmented_para in doc["segmented_paragraphs"]:
paragraph = ''.join(segmented_para)
doc['paragraphs'].append(paragraph)
return True
else:
return False
def paragraph_selection(sample, mode):
"""
For each document, select paragraphs that includes as much information as possible
Args:
sample: a sample in the dataset.
mode: string of ("train", "dev", "test"), indicate the type of dataset to process.
Returns:
None
Raises:
None
"""
# predefined maximum length of paragraph
MAX_P_LEN = 500
# predefined splitter
splitter = u'<splitter>'
# topN of related paragraph to choose
topN = 3
doc_id = None
if 'answer_docs' in sample and len(sample['answer_docs']) > 0:
doc_id = sample['answer_docs'][0]
if doc_id >= len(sample['documents']):
# Data error, answer doc ID > number of documents, this sample
# will be filtered by dataset.py
return
for d_idx, doc in enumerate(sample['documents']):
if 'segmented_paragraphs_scores' not in doc:
continue
status = dup_remove(doc)
segmented_title = doc["segmented_title"]
title_len = len(segmented_title)
para_id = None
if doc_id is not None:
para_id = sample['documents'][doc_id]['most_related_para']
total_len = title_len + sum(doc['paragraphs_length'])
# add splitter
para_num = len(doc["segmented_paragraphs"])
total_len += para_num
if total_len <= MAX_P_LEN:
incre_len = title_len
total_segmented_content = copy.deepcopy(segmented_title)
for p_idx, segmented_para in enumerate(doc["segmented_paragraphs"]):
if doc_id == d_idx and para_id > p_idx:
incre_len += len([splitter] + segmented_para)
if doc_id == d_idx and para_id == p_idx:
incre_len += 1
total_segmented_content += [splitter] + segmented_para
if doc_id == d_idx:
answer_start = incre_len + sample['answer_spans'][0][0]
answer_end = incre_len + sample['answer_spans'][0][1]
sample['answer_spans'][0][0] = answer_start
sample['answer_spans'][0][1] = answer_end
doc["segmented_paragraphs"] = [total_segmented_content]
doc["segmented_paragraphs_scores"] = [1.0]
doc['paragraphs_length'] = [total_len]
doc['paragraphs'] = [''.join(total_segmented_content)]
doc['most_related_para'] = 0
continue
# find topN paragraph id
para_infos = []
for p_idx, (para_tokens, para_scores) in \
enumerate(zip(doc['segmented_paragraphs'], doc['segmented_paragraphs_scores'])):
para_infos.append((para_tokens, para_scores, len(para_tokens), p_idx))
para_infos.sort(key=lambda x: (-x[1], x[2]))
topN_idx = []
for para_info in para_infos[:topN]:
topN_idx.append(para_info[-1])
final_idx = []
total_len = title_len
if doc_id == d_idx:
if mode == "train":
final_idx.append(para_id)
total_len = title_len + 1 + doc['paragraphs_length'][para_id]
for id in topN_idx:
if total_len > MAX_P_LEN:
break
if doc_id == d_idx and id == para_id and mode == "train":
continue
total_len += 1 + doc['paragraphs_length'][id]
final_idx.append(id)
total_segmented_content = copy.deepcopy(segmented_title)
final_idx.sort()
incre_len = title_len
for id in final_idx:
if doc_id == d_idx and id < para_id:
incre_len += 1 + doc['paragraphs_length'][id]
if doc_id == d_idx and id == para_id:
incre_len += 1
total_segmented_content += [splitter] + doc['segmented_paragraphs'][id]
if doc_id == d_idx:
answer_start = incre_len + sample['answer_spans'][0][0]
answer_end = incre_len + sample['answer_spans'][0][1]
sample['answer_spans'][0][0] = answer_start
sample['answer_spans'][0][1] = answer_end
doc["segmented_paragraphs"] = [total_segmented_content]
doc["segmented_paragraphs_scores"] = [1.0]
doc['paragraphs_length'] = [total_len]
doc['paragraphs'] = [''.join(total_segmented_content)]
doc['most_related_para'] = 0
if __name__ == "__main__":
# mode="train"/"dev"/"test"
mode = sys.argv[1]
for line in sys.stdin:
line = line.strip()
if line == "":
continue
        try:
            sample = json.loads(line)
        except ValueError:
            sys.stderr.write(
                "Invalid input json format - '{}' will be ignored\n".format(line))
            continue
compute_paragraph_score(sample)
paragraph_selection(sample, mode)
        print(json.dumps(sample, ensure_ascii=False))
###############################################################################
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
This module finds the most related paragraph of each document according to recall.
"""
import sys
if sys.version[0] == '2':
reload(sys)
sys.setdefaultencoding("utf-8")
import json
from collections import Counter
def precision_recall_f1(prediction, ground_truth):
"""
This function calculates and returns the precision, recall and f1-score
Args:
prediction: prediction string or list to be matched
ground_truth: golden string or list reference
Returns:
floats of (p, r, f1)
Raises:
None
"""
if not isinstance(prediction, list):
prediction_tokens = prediction.split()
else:
prediction_tokens = prediction
if not isinstance(ground_truth, list):
ground_truth_tokens = ground_truth.split()
else:
ground_truth_tokens = ground_truth
common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
num_same = sum(common.values())
if num_same == 0:
return 0, 0, 0
p = 1.0 * num_same / len(prediction_tokens)
r = 1.0 * num_same / len(ground_truth_tokens)
f1 = (2 * p * r) / (p + r)
return p, r, f1
def recall(prediction, ground_truth):
"""
This function calculates and returns the recall
Args:
prediction: prediction string or list to be matched
ground_truth: golden string or list reference
Returns:
floats of recall
Raises:
None
"""
return precision_recall_f1(prediction, ground_truth)[1]
def f1_score(prediction, ground_truth):
"""
This function calculates and returns the f1-score
Args:
prediction: prediction string or list to be matched
ground_truth: golden string or list reference
Returns:
floats of f1
Raises:
None
"""
return precision_recall_f1(prediction, ground_truth)[2]
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
"""
This function calculates and returns the precision, recall and f1-score
Args:
metric_fn: metric function pointer which calculates scores according to corresponding logic.
prediction: prediction string or list to be matched
ground_truth: golden string or list reference
Returns:
floats of (p, r, f1)
Raises:
None
"""
scores_for_ground_truths = []
for ground_truth in ground_truths:
score = metric_fn(prediction, ground_truth)
scores_for_ground_truths.append(score)
return max(scores_for_ground_truths)
def find_best_question_match(doc, question, with_score=False):
"""
For each document, find the paragraph that matches best to the question.
Args:
doc: The document object.
question: The question tokens.
with_score: If True then the match score will be returned,
otherwise False.
Returns:
The index of the best match paragraph, if with_score=False,
otherwise returns a tuple of the index of the best match paragraph
and the match score of that paragraph.
"""
most_related_para = -1
max_related_score = 0
most_related_para_len = 0
for p_idx, para_tokens in enumerate(doc['segmented_paragraphs']):
if len(question) > 0:
related_score = metric_max_over_ground_truths(recall,
para_tokens,
question)
else:
related_score = 0
if related_score > max_related_score \
or (related_score == max_related_score \
and len(para_tokens) < most_related_para_len):
most_related_para = p_idx
max_related_score = related_score
most_related_para_len = len(para_tokens)
if most_related_para == -1:
most_related_para = 0
if with_score:
return most_related_para, max_related_score
return most_related_para
def find_fake_answer(sample):
"""
For each document, finds the most related paragraph based on recall,
then finds a span that maximize the f1_score compared with the gold answers
and uses this span as a fake answer span
Args:
sample: a sample in the dataset
Returns:
None
Raises:
None
"""
for doc in sample['documents']:
most_related_para = -1
most_related_para_len = 999999
max_related_score = 0
for p_idx, para_tokens in enumerate(doc['segmented_paragraphs']):
if len(sample['segmented_answers']) > 0:
related_score = metric_max_over_ground_truths(recall,
para_tokens,
sample['segmented_answers'])
else:
continue
if related_score > max_related_score \
or (related_score == max_related_score
and len(para_tokens) < most_related_para_len):
most_related_para = p_idx
most_related_para_len = len(para_tokens)
max_related_score = related_score
doc['most_related_para'] = most_related_para
sample['answer_docs'] = []
sample['answer_spans'] = []
sample['fake_answers'] = []
sample['match_scores'] = []
best_match_score = 0
best_match_d_idx, best_match_span = -1, [-1, -1]
best_fake_answer = None
answer_tokens = set()
for segmented_answer in sample['segmented_answers']:
answer_tokens = answer_tokens | set([token for token in segmented_answer])
for d_idx, doc in enumerate(sample['documents']):
if not doc['is_selected']:
continue
if doc['most_related_para'] == -1:
doc['most_related_para'] = 0
most_related_para_tokens = doc['segmented_paragraphs'][doc['most_related_para']][:1000]
for start_tidx in range(len(most_related_para_tokens)):
if most_related_para_tokens[start_tidx] not in answer_tokens:
continue
for end_tidx in range(len(most_related_para_tokens) - 1, start_tidx - 1, -1):
span_tokens = most_related_para_tokens[start_tidx: end_tidx + 1]
if len(sample['segmented_answers']) > 0:
match_score = metric_max_over_ground_truths(f1_score, span_tokens,
sample['segmented_answers'])
else:
match_score = 0
if match_score == 0:
break
if match_score > best_match_score:
best_match_d_idx = d_idx
best_match_span = [start_tidx, end_tidx]
best_match_score = match_score
best_fake_answer = ''.join(span_tokens)
if best_match_score > 0:
sample['answer_docs'].append(best_match_d_idx)
sample['answer_spans'].append(best_match_span)
sample['fake_answers'].append(best_fake_answer)
sample['match_scores'].append(best_match_score)
if __name__ == '__main__':
for line in sys.stdin:
sample = json.loads(line)
find_fake_answer(sample)
        print(json.dumps(sample, ensure_ascii=False))
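To make the behaviour of `find_fake_answer` above concrete, here is a small, hypothetical driver script. It assumes `preprocess.py` (the file above) is importable from the working directory; the tiny sample is made up for illustration, whereas real samples come from the DuReader JSON files.
```
from preprocess import find_fake_answer

# a minimal, made-up sample with one document and one gold answer
sample = {
    'segmented_answers': [['百度', '是', '一家', '搜索', '公司']],
    'documents': [{
        'is_selected': True,
        'segmented_paragraphs': [
            ['今天', '天气', '不错'],
            ['百度', '是', '一家', '中国', '的', '搜索', '公司'],
        ],
    }],
}

find_fake_answer(sample)
# the chosen document index, the fake answer span inside its best paragraph,
# and the recovered span text
print(sample['answer_docs'], sample['answer_spans'], sample['fake_answers'])
```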
...@@ -22,6 +22,7 @@ import numpy as np
def dropout(input, args):
    """Dropout function"""
    if args.drop_rate:
        return layers.dropout(
            input,
...@@ -33,10 +34,12 @@ def dropout(input, args):
def bi_lstm_encoder(input_seq, gate_size, para_name, args):
    """
    A bi-directional lstm encoder implementation.
    Linear transformation part for input gate, output gate, forget gate
    and cell activation vectors need be done outside of dynamic_lstm.
    So the output size is 4 times of gate_size.
    """
    input_forward_proj = layers.fc(
        input=input_seq,
...@@ -75,6 +78,7 @@ def get_data(input_name, lod_level, args):
def embedding(input_ids, shape, args):
    """Embedding layer"""
    input_embedding = layers.embedding(
        input=input_ids,
        size=shape,
...@@ -85,6 +89,7 @@ def embedding(input_ids, shape, args):
def encoder(input_embedding, para_name, hidden_size, args):
    """Encoding layer"""
    encoder_out = bi_lstm_encoder(
        input_seq=input_embedding,
        gate_size=hidden_size,
...@@ -94,6 +99,7 @@ def encoder(input_embedding, para_name, hidden_size, args):
def attn_flow(q_enc, p_enc, p_ids_name, args):
    """Bidirectional Attention layer"""
    tag = p_ids_name + "::"
    drnn = layers.DynamicRNN()
    with drnn.block():
...@@ -123,7 +129,15 @@ def attn_flow(q_enc, p_enc, p_ids_name, args):
    return dropout(g, args)


def fusion(g, args):
    """Fusion layer"""
    m = bi_lstm_encoder(
        input_seq=g, gate_size=args.hidden_size, para_name='fusion', args=args)
    return dropout(m, args)


def lstm_step(x_t, hidden_t_prev, cell_t_prev, size, para_name, args):
    """Util function for pointer network"""
    def linear(inputs, para_name, args):
        return layers.fc(input=inputs,
                         size=size,
...@@ -150,8 +164,8 @@ def lstm_step(x_t, hidden_t_prev, cell_t_prev, size, para_name, args):
    return hidden_t, cell_t


def point_network_decoder(p_vec, q_vec, hidden_size, args):
    """Output layer - pointer network"""
    tag = 'pn_decoder:'
    init_random = fluid.initializer.Normal(loc=0.0, scale=1.0)
...@@ -258,20 +272,15 @@ def point_network_decoder(p_vec, q_vec, hidden_size, args):
    return start_prob, end_prob


def rc_model(hidden_size, vocab, args):
    """This function builds the whole BiDAF network"""
    emb_shape = [vocab.size(), vocab.embed_dim]
    start_labels = layers.data(
        name="start_lables", shape=[1], dtype='float32', lod_level=1)
    end_labels = layers.data(
        name="end_lables", shape=[1], dtype='float32', lod_level=1)

    # stage 1: setup input data, embedding table & encode
    q_id0 = get_data('q_id0', 1, args)
    q_ids = get_data('q_ids', 2, args)
...@@ -302,6 +311,7 @@ def rc_model(hidden_size, vocab, args):
    start_probs, end_probs = point_network_decoder(
        p_vec=p_vec, q_vec=q_vec, hidden_size=hidden_size, args=args)

    # calculate model loss
    cost0 = layers.sequence_pool(
        layers.cross_entropy(
            input=start_probs, label=start_labels, soft_label=True),
......
...@@ -133,6 +133,7 @@ def LodTensor_Array(lod_tensor):
def print_para(train_prog, train_exe, logger, args):
    """Print para info for debug purpose"""
    if args.para_print:
        param_list = train_prog.block(0).all_parameters()
        param_name_list = [p.name for p in param_list]
...@@ -171,7 +172,8 @@ def find_best_answer_for_passage(start_probs, end_probs, passage_len):
    return (best_start, best_end), max_prob


def find_best_answer_for_inst(sample, start_prob, end_prob, inst_lod,
                              para_prior_scores=(0.44, 0.23, 0.15, 0.09, 0.07)):
    """
    Finds the best answer for a sample given start_prob and end_prob for each position.
    This will call find_best_answer_for_passage because there are multiple passages in a sample
...@@ -190,6 +192,10 @@ def find_best_answer_for_inst(sample, start_prob, end_prob, inst_lod):
        answer_span, score = find_best_answer_for_passage(
            start_prob[passage_start:passage_end],
            end_prob[passage_start:passage_end], passage_len)
        if para_prior_scores is not None:
            # the Nth prior score = the number of training samples whose gold answer comes
            # from the Nth paragraph / the number of the training samples
            score *= para_prior_scores[p_idx]
        if score > best_score:
            best_score = score
            best_p_idx = p_idx
...@@ -205,16 +211,12 @@ def find_best_answer_for_inst(sample, start_prob, end_prob, inst_lod):
def validation(inference_program, avg_cost, s_probs, e_probs, match, feed_order,
               place, dev_count, vocab, brc_data, logger, args):
    """
    do inference with given inference_program
    """
    parallel_executor = fluid.ParallelExecutor(
        main_program=inference_program,
        use_cuda=bool(args.use_gpu),
        loss_name=avg_cost.name)
    print_para(inference_program, parallel_executor, logger, args)

    # Use test set as validation each pass
...@@ -277,7 +279,7 @@ def validation(inference_program, avg_cost, s_probs, e_probs, match, feed_order,
                'question_type': sample['question_type'],
                'answers': [best_answer],
                'entity_answers': [[]],
                'yesno_answers': []
            }
            pred_answers.append(pred)
            if 'answers' in sample:
...@@ -296,7 +298,7 @@ def validation(inference_program, avg_cost, s_probs, e_probs, match, feed_order,
    if result_dir is not None and result_prefix is not None:
        if not os.path.exists(args.result_dir):
            os.makedirs(args.result_dir)
        result_file = os.path.join(result_dir, result_prefix + '.json')
        with open(result_file, 'w') as fout:
            for pred_answer in pred_answers:
                fout.write(json.dumps(pred_answer, ensure_ascii=False) + '\n')
...@@ -328,6 +330,7 @@ def l2_loss(train_prog):
def train(logger, args):
    """train a model"""
    logger.info('Load data_set and vocab...')
    with open(os.path.join(args.vocab_dir, 'vocab.data'), 'rb') as fin:
        if six.PY2:
...@@ -489,6 +492,7 @@ def train(logger, args):
def evaluate(logger, args):
    """evaluate a specific model using devset"""
    logger.info('Load data_set and vocab...')
    with open(os.path.join(args.vocab_dir, 'vocab.data'), 'rb') as fin:
        vocab = pickle.load(fin)
...@@ -527,8 +531,8 @@ def evaluate(logger, args):
    inference_program = main_program.clone(for_test=True)
    eval_loss, bleu_rouge = validation(
        inference_program, avg_cost, s_probs, e_probs, match, feed_order,
        place, dev_count, vocab, brc_data, logger, args)
    logger.info('Dev eval loss {}'.format(eval_loss))
    logger.info('Dev eval result: {}'.format(bleu_rouge))
    logger.info('Predicted answers are saved to {}'.format(
...@@ -536,6 +540,7 @@ def evaluate(logger, args):
def predict(logger, args):
    """do inference on the test dataset"""
    logger.info('Load data_set and vocab...')
    with open(os.path.join(args.vocab_dir, 'vocab.data'), 'rb') as fin:
        vocab = pickle.load(fin)
......
#!/bin/bash
export CUDA_VISIBLE_DEVICES=1

paragraph_extraction ()
{
    SOURCE_DIR=$1
    TARGET_DIR=$2
    echo "Start paragraph extraction, this may take a few hours"
    echo "Source dir: $SOURCE_DIR"
    echo "Target dir: $TARGET_DIR"
    mkdir -p $TARGET_DIR/trainset
    mkdir -p $TARGET_DIR/devset
    mkdir -p $TARGET_DIR/testset

    echo "Processing trainset"
    cat $SOURCE_DIR/trainset/search.train.json | python paragraph_extraction.py train \
        > $TARGET_DIR/trainset/search.train.json
    cat $SOURCE_DIR/trainset/zhidao.train.json | python paragraph_extraction.py train \
        > $TARGET_DIR/trainset/zhidao.train.json

    echo "Processing devset"
    cat $SOURCE_DIR/devset/search.dev.json | python paragraph_extraction.py dev \
        > $TARGET_DIR/devset/search.dev.json
    cat $SOURCE_DIR/devset/zhidao.dev.json | python paragraph_extraction.py dev \
        > $TARGET_DIR/devset/zhidao.dev.json

    echo "Processing testset"
    cat $SOURCE_DIR/testset/search.test.json | python paragraph_extraction.py test \
        > $TARGET_DIR/testset/search.test.json
    cat $SOURCE_DIR/testset/zhidao.test.json | python paragraph_extraction.py test \
        > $TARGET_DIR/testset/zhidao.test.json
    echo "Paragraph extraction done!"
}

PROCESS_NAME="$1"
case $PROCESS_NAME in
    --para_extraction)
        # Start paragraph extraction
        if [ ! -d data/preprocessed ]; then
            echo "Please download the preprocessed data first (See README - Preprocess)"
            exit 1
        fi
        paragraph_extraction data/preprocessed data/extracted
        ;;
    --prepare|--train|--evaluate|--predict)
        # Start Paddle baseline
        python run.py $@
        ;;
    *)
        echo $"Usage: $0 {--para_extraction|--prepare|--train|--evaluate|--predict}"
esac
...@@ -37,9 +37,10 @@ class Vocab(object):
        self.pad_token = '<blank>'
        self.unk_token = '<unk>'
        self.split_token = '<splitter>'

        self.initial_tokens = initial_tokens if initial_tokens is not None else []
        self.initial_tokens.extend([self.pad_token, self.unk_token, self.split_token])

        for token in self.initial_tokens:
            self.add(token)
...@@ -137,7 +138,7 @@ class Vocab(object):
        """
        self.embed_dim = embed_dim
        self.embeddings = np.random.rand(self.size(), embed_dim)
        for token in [self.pad_token, self.unk_token, self.split_token]:
            self.embeddings[self.get_id(token)] = np.zeros([self.embed_dim])

    def load_pretrained_embeddings(self, embedding_path):
......