Unverified commit 73ee4d16, authored by guru4elephant, committed by GitHub

Merge pull request #1 from PaddlePaddle/develop

sync with paddlepaddle/models
[submodule "fluid/SimNet"]
path = fluid/SimNet
url = https://github.com/baidu/AnyQ.git
[submodule "fluid/LAC"]
path = fluid/LAC
url = https://github.com/baidu/lac
[submodule "fluid/Senta"]
path = fluid/Senta
url = https://github.com/baidu/Senta
......@@ -8,7 +8,7 @@ PaddlePaddle provides a rich set of computational units to enable users to adopt
- [fluid models](fluid): use PaddlePaddle's Fluid APIs. We especially recommend that users use Fluid models.
- [v2 models](v2): use PaddlePaddle's v2 APIs.
- [legacy models](legacy): use PaddlePaddle's v2 APIs.
## License
......
Subproject commit 66660503bb6e8f34adc4715ccf42cad77ed46ded
......@@ -49,14 +49,46 @@ Network,ICNet)进行语义分割,相比其他分割算法,ICNet兼顾了准
- `ICNet <https://github.com/PaddlePaddle/models/tree/develop/fluid/icnet>`__
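Image Generation
----------------
Image generation produces a target image from an input vector, which can be random noise or a user-specified condition vector. Typical applications include handwriting generation, face synthesis, style transfer, and image inpainting. Current image generation tasks rely mainly on generative adversarial networks (GANs).
A GAN consists of two sub-networks: a generator and a discriminator. The generator takes random noise or a condition vector as input and outputs a target image. The discriminator is a classifier that takes an image as input and outputs whether that image is real. During training, the two networks improve by continually competing against each other.
For the image generation task, we show how to generate handwritten digits with DCGAN and ConditionalGAN, and we also introduce CycleGAN for style transfer.
- `DCGAN & ConditionalGAN <https://github.com/PaddlePaddle/models/tree/develop/fluid/gan/c_gan>`__
- `CycleGAN <https://github.com/PaddlePaddle/models/tree/develop/fluid/gan/cycle_gan>`__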
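Scene Text Recognition
----------------------
Many scene images contain rich text that plays an important role in understanding the image and greatly helps people recognize and interpret its content. Scene text recognition converts image information into a text sequence under conditions such as complex backgrounds, low resolution, varied fonts, and arbitrary layout. It can be viewed as a special kind of translation: from image input to natural-language output. Progress in scene text recognition has also enabled new applications, such as automatically reading street signs to help street-view applications obtain more accurate address information.
For the scene text recognition task, we show how to combine CNN-based image feature extraction with RNN-based sequence translation, removing the need for hand-crafted features and character segmentation, and using automatically learned image features to perform end-to-end unconstrained character localization and recognition. The CRNN-CTC model is currently covered; an attention-based sequence-to-sequence model will be added later.
For the scene text recognition task, we show how to combine CNN-based image feature extraction with RNN-based sequence translation, removing the need for hand-crafted features and character segmentation, and using automatically learned image features to perform character recognition. The CRNN-CTC model and an attention-based sequence-to-sequence model are currently covered.
- `CRNN-CTC model <https://github.com/PaddlePaddle/models/tree/develop/fluid/ocr_recognition>`__
- `Attention model <https://github.com/PaddlePaddle/models/tree/develop/fluid/ocr_recognition>`__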
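Metric Learning
---------------
Metric learning, also known as distance metric learning or similarity learning, learns distances between objects so that relationships and comparisons between them can be analyzed. It is widely used in practice, for example to assist classification and clustering, and in image retrieval and face recognition. Previously, each task required choosing suitable features and hand-crafting a distance function; metric learning instead learns a task-specific distance function from the data. Combined with deep learning, metric learning performs well in face recognition and verification, person re-identification (Re-ID), and image retrieval. For this task we mainly introduce Fluid-based deep metric learning models, including triplet and quadruplet loss functions.
- `Metric Learning <https://github.com/PaddlePaddle/models/tree/develop/fluid/metric_learning>`__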
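Video Classification
--------------------
Video classification is the foundation of video understanding. Unlike image classification, the object to classify is no longer a still image but a video made up of many frames, with audio and motion information. Understanding a video therefore requires more context: not only what each frame is and contains, but also how different frames relate to one another. Video classification methods are mainly based on convolutional neural networks, recurrent neural networks, or a combination of the two. For this task we introduce Fluid-based video classification models, currently the Temporal Segment Network (TSN), with more models to be added over time.
- `TSN <https://github.com/PaddlePaddle/models/tree/develop/fluid/video_classification>`__
Speech Recognition
------------------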
......@@ -124,6 +156,15 @@ DQN 及其变体,并测试了它们在 Atari 游戏中的表现。
- `Senta <https://github.com/baidu/Senta/blob/master/README.md>`__
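Semantic Matching
-----------------
In many natural language processing scenarios, we need to measure the semantic similarity of two texts; such tasks are usually called semantic matching. Examples include ranking search results by the similarity between a query and candidate documents, computing text-to-text similarity for deduplication, and matching candidate answers to questions in automatic question answering.
The DAM (Deep Attention Matching Network) released in this example is work by Baidu's Natural Language Processing Department published at ACL 2018, used for response selection in multi-turn conversations for retrieval-based chatbots. Inspired by the Transformer, DAM is built entirely on attention: stacked self-attention learns semantic representations of the response and the context at different granularities, and cross-attention then captures the relevance between response and context. It outperforms other models on two large-scale multi-turn dialogue datasets.
- `Deep Attention Matching Network <https://github.com/PaddlePaddle/models/tree/develop/fluid/deep_attention_matching_net>`__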
AnyQ
----
......@@ -135,3 +176,12 @@ SimNet是百度自然语言处理部于2013年自主研发的语义匹配框架
- `SimNet in PaddlePaddle
Fluid <https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md>`__
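Machine Reading Comprehension
-----------------------------
Machine reading comprehension (MRC) is one of the core tasks in natural language processing (NLP). The ultimate goal is for machines to read text as humans do, distill its information, and answer related questions. The wide adoption of deep learning in NLP in recent years has greatly improved machine reading comprehension, but current research uses artificially constructed datasets and answers relatively simple questions, which differs markedly from the data humans handle. Large-scale real-world training data is therefore urgently needed to advance MRC.
The Baidu reading comprehension dataset is a real-world dataset open-sourced by Baidu's Natural Language Processing Department. All questions and source passages come from real data (Baidu search engine data and the Baidu Zhidao Q&A community), and the answers are written by humans. Each question has multiple answers; the dataset contains 200k questions, 1,000k passages, and 420k answers, making it the largest Chinese MRC dataset to date. Baidu also open-sourced the corresponding reading comprehension model, called DuReader. It uses a common layered network architecture, captures the interaction between question and passage with a bidirectional attention mechanism to produce a query-aware passage representation, and finally predicts the answer span from that representation with a pointer network.
- `DuReader in PaddlePaddle Fluid <https://github.com/PaddlePaddle/models/blob/develop/fluid/machine_reading_comprehension/README.md>`__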
......@@ -28,8 +28,11 @@ Fluid模型配置和参数文件的工具。
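Detecting faces in open environments, especially small, blurred, and partially occluded faces, is also a challenging task. We also show how to train Baidu's in-house PyramidBox face detection model on [WIDER FACE](http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace) data; in March 2018 the algorithm ranked [first](http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/WiderFace_Results.html) in several WIDER FACE evaluations.
Faster RCNN is a typical two-stage object detector. Compared with traditional region proposal methods, the RPN in Faster RCNN greatly improves proposal efficiency by sharing convolutional layer parameters, and produces high-quality region proposals.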
- [Single Shot MultiBox Detector](https://github.com/PaddlePaddle/models/blob/develop/fluid/object_detection/README_cn.md)
- [Face Detector: PyramidBox](https://github.com/PaddlePaddle/models/tree/develop/fluid/face_detection/README_cn.md)
- [Faster RCNN](https://github.com/PaddlePaddle/models/tree/develop/fluid/faster_rcnn/README_cn.md)
Image Semantic Segmentation
---------------------------
......@@ -41,14 +44,45 @@ Network,ICNet)进行语义分割,相比其他分割算法,ICNet兼顾了准
- [ICNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/icnet)
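Image Generation
----------------
Image generation produces a target image from an input vector, which can be random noise or a user-specified condition vector. Typical applications include handwriting generation, face synthesis, style transfer, and image inpainting. Current image generation tasks rely mainly on generative adversarial networks (GANs).
A GAN consists of two sub-networks: a generator and a discriminator. The generator takes random noise or a condition vector as input and outputs a target image. The discriminator is a classifier that takes an image as input and outputs whether that image is real. During training, the two networks improve by continually competing against each other.
For the image generation task, we show how to generate handwritten digits with DCGAN and ConditionalGAN, and we also introduce CycleGAN for style transfer.
- [DCGAN & ConditionalGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/gan/c_gan)
- [CycleGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/gan/cycle_gan)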
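Scene Text Recognition
----------------------
Many scene images contain rich text that plays an important role in understanding the image and greatly helps people recognize and interpret its content. Scene text recognition converts image information into a text sequence under conditions such as complex backgrounds, low resolution, varied fonts, and arbitrary layout. It can be viewed as a special kind of translation: from image input to natural-language output. Progress in scene text recognition has also enabled new applications, such as automatically reading street signs to help street-view applications obtain more accurate address information.
For the scene text recognition task, we show how to combine CNN-based image feature extraction with RNN-based sequence translation, removing the need for hand-crafted features and character segmentation, and using automatically learned image features to perform end-to-end unconstrained character localization and recognition. The CRNN-CTC model is currently covered; an attention-based sequence-to-sequence model will be added later.
For the scene text recognition task, we show how to combine CNN-based image feature extraction with RNN-based sequence translation, removing the need for hand-crafted features and character segmentation, and using automatically learned image features to perform character recognition. The CRNN-CTC model and an attention-based sequence-to-sequence model are currently covered.
- [CRNN-CTC model](https://github.com/PaddlePaddle/models/tree/develop/fluid/ocr_recognition)
- [Attention model](https://github.com/PaddlePaddle/models/tree/develop/fluid/ocr_recognition)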
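Metric Learning
---------------
Metric learning, also known as distance metric learning or similarity learning, learns distances between objects so that relationships and comparisons between them can be analyzed. It is widely used in practice, for example to assist classification and clustering, and in image retrieval and face recognition. Previously, each task required choosing suitable features and hand-crafting a distance function; metric learning instead learns a task-specific distance function from the data. Combined with deep learning, metric learning performs well in face recognition and verification, person re-identification (Re-ID), and image retrieval. For this task we mainly introduce Fluid-based deep metric learning models, including triplet and quadruplet loss functions.
- [Metric Learning](https://github.com/PaddlePaddle/models/tree/develop/fluid/metric_learning)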
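To make the triplet loss mentioned above concrete, here is a minimal pure-Python sketch. The function names, the Euclidean distance, and the margin value are illustrative assumptions; they are not taken from the `metric_learning` model code, which may define its losses differently.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy embeddings, made up for illustration.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # same identity as the anchor
negative = [3.0, 4.0]  # different identity

# The negative is already much farther away than the positive: no loss.
easy = triplet_loss(anchor, positive, negative)
# Swapping the roles violates the margin, so the loss is positive.
hard = triplet_loss(anchor, negative, positive)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is what pushes embeddings of the same identity together.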
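Video Classification
--------------------
Video classification is the foundation of video understanding. Unlike image classification, the object to classify is no longer a still image but a video made up of many frames, with audio and motion information. Understanding a video therefore requires more context: not only what each frame is and contains, but also how different frames relate to one another. Video classification methods are mainly based on convolutional neural networks, recurrent neural networks, or a combination of the two. For this task we introduce Fluid-based video classification models, currently the Temporal Segment Network (TSN), with more models to be added over time.
- [TSN](https://github.com/PaddlePaddle/models/tree/develop/fluid/video_classification)
- [CRNN-CTC模](https://github.com/PaddlePaddle/models/tree/develop/fluid/ocr_recognition)
Speech Recognition
------------------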
......@@ -94,6 +128,15 @@ Machine Translation, NMT)等阶段。在 NMT 成熟后,机器翻译才真正
- [Senta](https://github.com/baidu/Senta/blob/master/README.md)
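Semantic Matching
-----------------
In many natural language processing scenarios, we need to measure the semantic similarity of two texts; such tasks are usually called semantic matching. Examples include ranking search results by the similarity between a query and candidate documents, computing text-to-text similarity for deduplication, and matching candidate answers to questions in automatic question answering.
The DAM (Deep Attention Matching Network) released in this example is work by Baidu's Natural Language Processing Department published at ACL 2018, used for response selection in multi-turn conversations for retrieval-based chatbots. Inspired by the Transformer, DAM is built entirely on attention: stacked self-attention learns semantic representations of the response and the context at different granularities, and cross-attention then captures the relevance between response and context. It outperforms other models on two large-scale multi-turn dialogue datasets.
- [Deep Attention Matching Network](https://github.com/PaddlePaddle/models/tree/develop/fluid/deep_attention_matching_net)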
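The difference between self-attention and cross-attention can be sketched with plain scaled dot-product attention. This is a simplified illustration in pure Python: the helper names and toy vectors are assumptions, and DAM's actual `layers.block` adds further components that are not shown here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy word embeddings, made up for illustration.
response = [[1.0, 0.0], [0.0, 1.0]]
context = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Self-attention: the response attends to itself.
self_attended = attention(response, response, response)
# Cross-attention: the response attends to the context turn.
cross_attended = attention(response, context, context)
```

In both cases the output has one vector per query word; only the source of the keys and values changes.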
AnyQ
----
......@@ -102,3 +145,12 @@ AnyQ
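SimNet is a semantic matching framework developed in-house by Baidu's Natural Language Processing Department in 2013 and widely used across Baidu products. It mainly includes core network structures such as BOW, CNN, RNN, and MM-DNN, and also integrates mainstream academic semantic matching models such as MatchPyramid, MV-LSTM, and K-NRM. Models built with SimNet can be easily plugged into AnyQ to strengthen its semantic matching capability.
- [SimNet in PaddlePaddle Fluid](https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md)
Machine Reading Comprehension
-----------------------------
Machine reading comprehension (MRC) is one of the core tasks in natural language processing (NLP). The ultimate goal is for machines to read text as humans do, distill its information, and answer related questions. The wide adoption of deep learning in NLP in recent years has greatly improved machine reading comprehension, but current research uses artificially constructed datasets and answers relatively simple questions, which differs markedly from the data humans handle. Large-scale real-world training data is therefore urgently needed to advance MRC.
The Baidu reading comprehension dataset is a real-world dataset open-sourced by Baidu's Natural Language Processing Department. All questions and source passages come from real data (Baidu search engine data and the Baidu Zhidao Q&A community), and the answers are written by humans. Each question has multiple answers; the dataset contains 200k questions, 1,000k passages, and 420k answers, making it the largest Chinese MRC dataset to date. Baidu also open-sourced the corresponding reading comprehension model, called DuReader. It uses a common layered network architecture, captures the interaction between question and passage with a bidirectional attention mechanism to produce a query-aware passage representation, and finally predicts the answer span from that representation with a pointer network.
- [DuReader in PaddlePaddle Fluid](https://github.com/PaddlePaddle/models/blob/develop/fluid/machine_reading_comprehension/README.md)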
Subproject commit 870651e257750f2c237f0b0bc9a27e5d062d1909
Subproject commit 4dbe7f7b0e76c188eb7f448d104f0165f0a12229
import cPickle as pickle
import six
import numpy as np
import paddle.fluid as fluid
import utils.layers as layers
......@@ -29,7 +29,7 @@ class Net(object):
mask_cache = dict() if self.use_mask_cache else None
turns_data = []
for i in xrange(self._max_turn_num):
for i in six.moves.xrange(self._max_turn_num):
turn = fluid.layers.data(
name="turn_%d" % i,
shape=[self._max_turn_len, 1],
......@@ -37,7 +37,7 @@ class Net(object):
turns_data.append(turn)
turns_mask = []
for i in xrange(self._max_turn_num):
for i in six.moves.xrange(self._max_turn_num):
turn_mask = fluid.layers.data(
name="turn_mask_%d" % i,
shape=[self._max_turn_len, 1],
......@@ -64,7 +64,7 @@ class Net(object):
Hr = response_emb
Hr_stack = [Hr]
for index in range(self._stack_num):
for index in six.moves.xrange(self._stack_num):
Hr = layers.block(
name="response_self_stack" + str(index),
query=Hr,
......@@ -78,7 +78,7 @@ class Net(object):
# context part
sim_turns = []
for t in xrange(self._max_turn_num):
for t in six.moves.xrange(self._max_turn_num):
Hu = fluid.layers.embedding(
input=turns_data[t],
size=[self._vocab_size + 1, self._emb_size],
......@@ -88,7 +88,7 @@ class Net(object):
initializer=fluid.initializer.Normal(scale=0.1)))
Hu_stack = [Hu]
for index in range(self._stack_num):
for index in six.moves.xrange(self._stack_num):
# share parameters
Hu = layers.block(
name="turn_self_stack" + str(index),
......@@ -104,7 +104,7 @@ class Net(object):
# cross attention
r_a_t_stack = []
t_a_r_stack = []
for index in range(self._stack_num + 1):
for index in six.moves.xrange(self._stack_num + 1):
t_a_r = layers.block(
name="t_attend_r_" + str(index),
query=Hu_stack[index],
......@@ -134,7 +134,7 @@ class Net(object):
t_a_r = fluid.layers.stack(t_a_r_stack, axis=1)
r_a_t = fluid.layers.stack(r_a_t_stack, axis=1)
else:
for index in xrange(len(t_a_r_stack)):
for index in six.moves.xrange(len(t_a_r_stack)):
t_a_r_stack[index] = fluid.layers.unsqueeze(
input=t_a_r_stack[index], axes=[1])
r_a_t_stack[index] = fluid.layers.unsqueeze(
......@@ -151,7 +151,7 @@ class Net(object):
if self.use_stack_op:
sim = fluid.layers.stack(sim_turns, axis=2)
else:
for index in xrange(len(sim_turns)):
for index in six.moves.xrange(len(sim_turns)):
sim_turns[index] = fluid.layers.unsqueeze(
input=sim_turns[index], axes=[2])
# sim shape: [batch_size, 2*(stack_num+1), max_turn_num, max_turn_len, max_turn_len]
......
import os
import six
import numpy as np
import time
import argparse
......@@ -6,8 +7,12 @@ import multiprocessing
import paddle
import paddle.fluid as fluid
import utils.reader as reader
import cPickle as pickle
from utils.util import print_arguments
from utils.util import print_arguments, mkdir
try:
import cPickle as pickle #python 2
except ImportError as e:
import pickle #python 3
from model import Net
......@@ -107,7 +112,7 @@ def parse_args():
def test(args):
if not os.path.exists(args.save_path):
raise ValueError("Invalid save path %s" % args.save_path)
mkdir(args.save_path)
if not os.path.exists(args.model_path):
raise ValueError("Invalid model init path %s" % args.model_path)
# data data_config
......@@ -158,7 +163,11 @@ def test(args):
use_cuda=args.use_cuda, main_program=test_program)
print("start loading data ...")
train_data, val_data, test_data = pickle.load(open(args.data_path, 'rb'))
with open(args.data_path, 'rb') as f:
if six.PY2:
train_data, val_data, test_data = pickle.load(f)
else:
train_data, val_data, test_data = pickle.load(f, encoding="bytes")
print("finish loading data ...")
if args.ext_eval:
......@@ -178,9 +187,9 @@ def test(args):
score_path = os.path.join(args.save_path, 'score.txt')
score_file = open(score_path, 'w')
for it in xrange(test_batch_num // dev_count):
for it in six.moves.xrange(test_batch_num // dev_count):
feed_list = []
for dev in xrange(dev_count):
for dev in six.moves.xrange(dev_count):
index = it * dev_count + dev
feed_dict = reader.make_one_batch_input(test_batches, index)
feed_list.append(feed_dict)
......@@ -190,9 +199,9 @@ def test(args):
scores = np.array(predicts[0])
print("step = %d" % it)
for dev in xrange(dev_count):
for dev in six.moves.xrange(dev_count):
index = it * dev_count + dev
for i in xrange(args.batch_size):
for i in six.moves.xrange(args.batch_size):
score_file.write(
str(scores[args.batch_size * dev + i][0]) + '\t' + str(
test_batches["label"][index][i]) + '\n')
......
import os
import six
import numpy as np
import time
import argparse
......@@ -6,9 +7,13 @@ import multiprocessing
import paddle
import paddle.fluid as fluid
import utils.reader as reader
import cPickle as pickle
from utils.util import print_arguments
try:
import cPickle as pickle #python 2
except ImportError as e:
import pickle #python 3
from model import Net
......@@ -164,35 +169,45 @@ def train(args):
if args.word_emb_init is not None:
print("start loading word embedding init ...")
word_emb = np.array(pickle.load(open(args.word_emb_init, 'rb'))).astype(
if six.PY2:
word_emb = np.array(pickle.load(open(args.word_emb_init,
'rb'))).astype('float32')
else:
word_emb = np.array(
pickle.load(
open(args.word_emb_init, 'rb'), encoding="bytes")).astype(
'float32')
dam.set_word_embedding(word_emb, place)
print("finish init word embedding ...")
print("start loading data ...")
train_data, val_data, test_data = pickle.load(open(args.data_path, 'rb'))
with open(args.data_path, 'rb') as f:
if six.PY2:
train_data, val_data, test_data = pickle.load(f)
else:
train_data, val_data, test_data = pickle.load(f, encoding="bytes")
print("finish loading data ...")
val_batches = reader.build_batches(val_data, data_conf)
batch_num = len(train_data['y']) / args.batch_size
batch_num = len(train_data[six.b('y')]) // args.batch_size
val_batch_num = len(val_batches["response"])
print_step = max(1, batch_num / (dev_count * 100))
save_step = max(1, batch_num / (dev_count * 10))
print_step = max(1, batch_num // (dev_count * 100))
save_step = max(1, batch_num // (dev_count * 10))
print("begin model training ...")
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
step = 0
for epoch in xrange(args.num_scan_data):
for epoch in six.moves.xrange(args.num_scan_data):
shuffle_train = reader.unison_shuffle(train_data)
train_batches = reader.build_batches(shuffle_train, data_conf)
ave_cost = 0.0
for it in xrange(batch_num // dev_count):
for it in six.moves.xrange(batch_num // dev_count):
feed_list = []
for dev in xrange(dev_count):
for dev in six.moves.xrange(dev_count):
index = it * dev_count + dev
feed_dict = reader.make_one_batch_input(train_batches, index)
feed_list.append(feed_dict)
......@@ -215,9 +230,9 @@ def train(args):
score_path = os.path.join(args.save_path, 'score.' + str(step))
score_file = open(score_path, 'w')
for it in xrange(val_batch_num // dev_count):
for it in six.moves.xrange(val_batch_num // dev_count):
feed_list = []
for dev in xrange(dev_count):
for dev in six.moves.xrange(dev_count):
val_index = it * dev_count + dev
feed_dict = reader.make_one_batch_input(val_batches,
val_index)
......@@ -227,9 +242,9 @@ def train(args):
fetch_list=[logits.name])
scores = np.array(predicts[0])
for dev in xrange(dev_count):
for dev in six.moves.xrange(dev_count):
val_index = it * dev_count + dev
for i in xrange(args.batch_size):
for i in six.moves.xrange(args.batch_size):
score_file.write(
str(scores[args.batch_size * dev + i][0]) + '\t'
+ str(val_batches["label"][val_index][
......
import sys
import six
import numpy as np
from sklearn.metrics import average_precision_score
......@@ -7,7 +8,7 @@ def mean_average_precision(sort_data):
#to do
count_1 = 0
sum_precision = 0
for index in range(len(sort_data)):
for index in six.moves.xrange(len(sort_data)):
if sort_data[index][1] == 1:
count_1 += 1
sum_precision += 1.0 * count_1 / (index + 1)
......
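The loop in this hunk accumulates precision at each relevant rank. A self-contained sketch of the same idea is shown below; the function name, the `(score, label)` tuple layout, and the toy ranking are assumptions for illustration, not the file's actual interface.

```python
def average_precision(sorted_items):
    """Average precision for one ranked list.

    sorted_items: (score, label) pairs sorted by score, highest first;
    label 1 marks a relevant item.
    """
    hits = 0
    precision_sum = 0.0
    for rank, (_score, label) in enumerate(sorted_items, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / float(rank)
    return precision_sum / hits if hits else 0.0


# Relevant items sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2.
ranked = [(0.9, 1), (0.8, 0), (0.7, 1), (0.1, 0)]
ap = average_precision(ranked)
```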
import sys
import six
def get_p_at_n_in_m(data, n, m, ind):
......@@ -30,9 +31,9 @@ def evaluate(file_path):
p_at_2_in_10 = 0.0
p_at_5_in_10 = 0.0
length = len(data) / 10
length = len(data) // 10
for i in xrange(0, length):
for i in six.moves.xrange(0, length):
ind = i * 10
assert data[ind][1] == 1
......
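The switch from `/` to `//` matters because the group count is used as a loop bound, and Python 3's `/` returns a float. The sketch below illustrates a recall-at-n-in-m style metric with integer division; the helper name, data layout, and group size are assumptions and may differ from the real `get_p_at_n_in_m`.

```python
def recall_at_n_in_m(scored, n, m, start):
    """Return 1 if the positive ranks within the top n of its group of m.

    scored[start] is assumed to be the positive candidate, followed by
    m - 1 negatives; each entry is a (score, label) pair.
    """
    group = scored[start:start + m]
    positive_score = group[0][0]
    ranked = sorted(group, key=lambda item: item[0], reverse=True)
    return 1 if ranked[n - 1][0] <= positive_score else 0


data = [(0.9, 1), (0.2, 0), (0.4, 0), (0.3, 1), (0.8, 0), (0.5, 0)]
group_size = 3
# Floor division keeps the count an int on both Python 2 and Python 3.
num_groups = len(data) // group_size

r_at_1 = sum(
    recall_at_n_in_m(data, 1, group_size, g * group_size)
    for g in range(num_groups)
) / float(num_groups)
```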
import cPickle as pickle
import six
import numpy as np
try:
import cPickle as pickle #python 2
except ImportError as e:
import pickle #python 3
def unison_shuffle(data, seed=None):
if seed is not None:
np.random.seed(seed)
y = np.array(data['y'])
c = np.array(data['c'])
r = np.array(data['r'])
y = np.array(data[six.b('y')])
c = np.array(data[six.b('c')])
r = np.array(data[six.b('r')])
assert len(y) == len(c) == len(r)
p = np.random.permutation(len(y))
shuffle_data = {'y': y[p], 'c': c[p], 'r': r[p]}
shuffle_data = {six.b('y'): y[p], six.b('c'): c[p], six.b('r'): r[p]}
return shuffle_data
......@@ -65,9 +70,9 @@ def produce_one_sample(data,
max_turn_len=50
return y, nor_turns_nor_c, nor_r, turn_len, term_len, r_len
'''
c = data['c'][index]
r = data['r'][index][:]
y = data['y'][index]
c = data[six.b('c')][index]
r = data[six.b('r')][index][:]
y = data[six.b('y')][index]
turns = split_c(c, split_id)
#normalize turns_c length, nor_turns length is max_turn_num
......@@ -101,7 +106,7 @@ def build_one_batch(data,
_label = []
for i in range(conf['batch_size']):
for i in six.moves.xrange(conf['batch_size']):
index = batch_index * conf['batch_size'] + i
y, nor_turns_nor_c, nor_r, turn_len, term_len, r_len = produce_one_sample(
data, index, conf['_EOS_'], conf['max_turn_num'],
......@@ -145,8 +150,8 @@ def build_batches(data, conf, turn_cut_type='tail', term_cut_type='tail'):
_label_batches = []
batch_len = len(data['y']) / conf['batch_size']
for batch_index in range(batch_len):
batch_len = len(data[six.b('y')]) // conf['batch_size']
for batch_index in six.moves.range(batch_len):
_turns, _tt_turns_len, _every_turn_len, _response, _response_len, _label = build_one_batch(
data, batch_index, conf, turn_cut_type='tail', term_cut_type='tail')
......@@ -192,8 +197,10 @@ def make_one_batch_input(data_batches, index):
max_turn_num = turns.shape[1]
max_turn_len = turns.shape[2]
turns_list = [turns[:, i, :] for i in xrange(max_turn_num)]
every_turn_len_list = [every_turn_len[:, i] for i in xrange(max_turn_num)]
turns_list = [turns[:, i, :] for i in six.moves.xrange(max_turn_num)]
every_turn_len_list = [
every_turn_len[:, i] for i in six.moves.xrange(max_turn_num)
]
feed_dict = {}
for i, turn in enumerate(turns_list):
......@@ -204,7 +211,7 @@ def make_one_batch_input(data_batches, index):
for i, turn_len in enumerate(every_turn_len_list):
feed_dict["turn_mask_%d" % i] = np.ones(
(batch_size, max_turn_len, 1)).astype("float32")
for row in xrange(batch_size):
for row in six.moves.xrange(batch_size):
feed_dict["turn_mask_%d" % i][row, turn_len[row]:, 0] = 0
feed_dict["response"] = response
......@@ -212,7 +219,7 @@ def make_one_batch_input(data_batches, index):
feed_dict["response_mask"] = np.ones(
(batch_size, max_turn_len, 1)).astype("float32")
for row in xrange(batch_size):
for row in six.moves.xrange(batch_size):
feed_dict["response_mask"][row, response_len[row]:, 0] = 0
feed_dict["label"] = np.array([data_batches["label"][index]]).reshape(
......@@ -228,14 +235,14 @@ if __name__ == '__main__':
"max_turn_len": 50,
"_EOS_": 28270,
}
train, val, test = pickle.load(open('../data/ubuntu/data_small.pkl', 'rb'))
with open('../ubuntu/data/data_small.pkl', 'rb') as f:
if six.PY2:
train, val, test = pickle.load(f)
else:
train, val, test = pickle.load(f, encoding="bytes")
print('load data success')
train_batches = build_batches(train, conf)
val_batches = build_batches(val, conf)
test_batches = build_batches(test, conf)
print('build batches success')
pickle.dump([train_batches, val_batches, test_batches],
open('../data/ubuntu/data_small_xxx.pkl', 'wb'))
print('dump success')
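The `six.b('y')` lookups in this file follow from loading a Python 2 pickle with `encoding="bytes"`: under Python 3 the old `str` keys come back as `bytes`. The sketch below shows this with a hand-written protocol-0 payload that stands in for a file written by Python 2; the helper name and the payload are illustrative assumptions.

```python
import io
import pickle
import sys


def load_py2_pickle(fileobj):
    """Load a pickle written by Python 2 from either interpreter."""
    if sys.version_info[0] == 2:
        return pickle.load(fileobj)
    # On Python 3, Python 2 str objects are returned as bytes.
    return pickle.load(fileobj, encoding="bytes")


# Protocol-0 bytes equivalent to Python 2's pickle.dumps({'y': [1, 0]}).
PY2_PICKLE = b"(dp0\nS'y'\np1\n(lp2\nI1\naI0\nas."

data = load_py2_pickle(io.BytesIO(PY2_PICKLE))
# The key is b"y" on Python 3; on Python 2, b"y" and "y" are the same object type.
labels = data[b"y"]
```

This is why code that indexes the loaded dictionary with a plain `'y'` would raise `KeyError` on Python 3.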
import six
import os
def print_arguments(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).iteritems()):
for arg, value in sorted(six.iteritems(vars(args))):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def mkdir(path):
if not os.path.isdir(path):
mkdir(os.path.split(path)[0])
else:
return
os.mkdir(path)
def pos_encoding_init():
pass
......
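The recursive `mkdir` helper above appears to recurse without end for a bare relative name such as `output`, because `os.path.split` then yields an empty parent that is never a directory. A hedged alternative that works on both Python 2 and Python 3 is sketched below; `mkdir_p` is a hypothetical name, not part of the commit.

```python
import errno
import os


def mkdir_p(path):
    """Create `path` and any missing parents; succeed if it already exists."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # Ignore "already exists" only when the existing entry is a directory.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

On Python 3 only, `os.makedirs(path, exist_ok=True)` gives the same behavior in a single call.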
deeplabv3plus_xception65_initialize.params
deeplabv3plus.params
deeplabv3plus.tar.gz
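Running the DeepLab example programs in this directory requires the latest develop version of PaddlePaddle. If your installed PaddlePaddle version is older, please update it by following the [installation guide](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html).
Running the DeepLab example programs in this directory requires PaddlePaddle Fluid v1.0.0 or later. If your installed PaddlePaddle version is older, please update it by following the installation guide. If you use a GPU, the program requires cuDNN v7.
## Code Structure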
......@@ -41,10 +41,12 @@ data/cityscape/
To train the model from scratch, download our initialization model:
```
wget http://paddlemodels.cdn.bcebos.com/deeplab/deeplabv3plus_xception65_initialize.tar.gz
tar -xf deeplabv3plus_xception65_initialize.tar.gz && rm deeplabv3plus_xception65_initialize.tar.gz
```
To fine-tune from the final trained model or use it directly for prediction, download our final model:
```
wget http://paddlemodels.cdn.bcebos.com/deeplab/deeplabv3plus.tar.gz
tar -xf deeplabv3plus.tar.gz && rm deeplabv3plus.tar.gz
```
......@@ -70,11 +72,11 @@ python train.py --help
```
python ./train.py \
--batch_size=8 \
--parallel=true
--parallel=true \
--train_crop_size=769 \
--total_step=90000 \
--init_weights_path=$INIT_WEIGHTS_PATH \
--save_weights_path=$SAVE_WEIGHTS_PATH \
--init_weights_path=deeplabv3plus_xception65_initialize.params \
--save_weights_path=output \
--dataset_path=$DATASET_PATH
```
......@@ -82,11 +84,10 @@ python ./train.py \
Run the following command to evaluate on the `Cityscape` test dataset:
```
python ./eval.py \
--init_weights_path=$INIT_WEIGHTS_PATH \
--init_weights=deeplabv3plus.params \
--dataset_path=$DATASET_PATH
```
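The model file must be specified with the `--model_path` option.
The evaluation metric output by the test script is [mean IoU]().
The model file must be specified with the `--model_path` option. The evaluation metric output by the test script is mean IoU.
## Experimental Results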
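For readers unfamiliar with the metric, mean IoU averages the per-class intersection over union. The sketch below uses the standard definition TP / (TP + FP + FN) on flat lists of class ids; the function name and toy labels are assumptions, and `eval.py` may accumulate its counts differently.

```python
def mean_iou(predictions, labels, num_classes):
    """Mean IoU over the classes that occur in predictions or labels."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred, true in zip(predictions, labels):
        if pred == true:
            intersection[pred] += 1
            union[pred] += 1
        else:
            # A mismatch is a false positive for `pred`
            # and a false negative for `true`.
            union[pred] += 1
            union[true] += 1
    ious = [i / float(u) for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]
true = [0, 1, 1, 1, 2, 0]
# Per-class IoU is 1/3, 2/3 and 1/2 for classes 0, 1 and 2.
miou = mean_iou(pred, true, num_classes=3)
```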
......
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
os.environ['FLAGS_fraction_of_gpu_memory_to_use'] = '0.98'
......@@ -91,7 +94,7 @@ exe = fluid.Executor(place)
exe.run(sp)
if args.init_weights_path:
print "load from:", args.init_weights_path
print("load from:", args.init_weights_path)
load_model()
dataset = CityscapeDataset(args.dataset_path, 'val')
......@@ -118,7 +121,7 @@ for i, imgs, labels, names in batches:
mp = (wrong + right) != 0
miou2 = np.mean((right[mp] * 1.0 / (right[mp] + wrong[mp])))
if args.verbose:
print 'step: %s, mIoU: %s' % (i + 1, miou2)
print('step: %s, mIoU: %s' % (i + 1, miou2))
else:
print '\rstep: %s, mIoU: %s' % (i + 1, miou2),
print('\rstep: %s, mIoU: %s' % (i + 1, miou2))
sys.stdout.flush()
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle
import paddle.fluid as fluid
......@@ -50,7 +53,7 @@ def append_op_result(result, name):
def conv(*args, **kargs):
kargs['param_attr'] = name_scope + 'weights'
if kargs.has_key('bias_attr') and kargs['bias_attr']:
if 'bias_attr' in kargs and kargs['bias_attr']:
kargs['bias_attr'] = name_scope + 'biases'
else:
kargs['bias_attr'] = False
......@@ -62,7 +65,7 @@ def group_norm(input, G, eps=1e-5, param_attr=None, bias_attr=None):
N, C, H, W = input.shape
if C % G != 0:
print "group can not divide channle:", C, G
print("group can not divide channle:", C, G)
for d in range(10):
for t in [d, -d]:
if G + t <= 0: continue
......@@ -70,7 +73,7 @@ def group_norm(input, G, eps=1e-5, param_attr=None, bias_attr=None):
G = G + t
break
if C % G == 0:
print "use group size:", G
print("use group size:", G)
break
assert C % G == 0
param_shape = (G, )
......@@ -139,7 +142,7 @@ def seq_conv(input, channel, stride, filter, dilation=1, act=None):
filter,
stride,
groups=input.shape[1],
padding=(filter / 2) * dilation,
padding=(filter // 2) * dilation,
dilation=dilation)
input = bn(input)
if act: input = act(input)
......
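Once `from __future__ import division` is in effect, `filter / 2` yields a float such as `1.5`, which is not a valid padding value, hence the change to `//`. The sketch below shows the padding arithmetic for a dilated convolution; both helper names are hypothetical and the formula is the usual one for stride-1 convolutions with odd filter sizes.

```python
def same_padding(filter_size, dilation=1):
    """Padding that preserves spatial size for an odd filter at stride 1."""
    return (filter_size // 2) * dilation


def conv_output_size(size, filter_size, padding, stride=1, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    effective_filter = dilation * (filter_size - 1) + 1
    return (size + 2 * padding - effective_filter) // stride + 1


pad = same_padding(3, dilation=2)
out = conv_output_size(65, 3, pad, stride=1, dilation=2)
```

With a 3-wide filter and dilation 2, the padding is 2 and a 65-wide input stays 65 wide.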
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import cv2
import numpy as np
import os
import six
default_config = {
"shuffle": True,
......@@ -30,7 +35,7 @@ def slice_with_pad(a, s, value=0):
pr = 0
pads.append([pl, pr])
slices.append([l, r])
slices = map(lambda x: slice(x[0], x[1], 1), slices)
slices = list(map(lambda x: slice(x[0], x[1], 1), slices))
a = a[slices]
a = np.pad(a, pad_width=pads, mode='constant', constant_values=value)
return a
......@@ -38,11 +43,17 @@ def slice_with_pad(a, s, value=0):
class CityscapeDataset:
def __init__(self, dataset_dir, subset='train', config=default_config):
label_dirname = os.path.join(dataset_dir, 'gtFine/' + subset)
if six.PY2:
import commands
label_dirname = dataset_dir + 'gtFine/' + subset
label_files = commands.getoutput(
"find %s -type f | grep labelTrainIds | sort" %
label_dirname).splitlines()
else:
import subprocess
label_files = subprocess.getstatusoutput(
"find %s -type f | grep labelTrainIds | sort" %
label_dirname)[-1].splitlines()
self.label_files = label_files
self.label_dirname = label_dirname
self.index = 0
......@@ -50,7 +61,7 @@ class CityscapeDataset:
self.dataset_dir = dataset_dir
self.config = config
self.reset()
print "total number", len(label_files)
print("total number", len(label_files))
def reset(self, shuffle=False):
self.index = 0
......@@ -66,13 +77,14 @@ class CityscapeDataset:
shape = self.config["crop_size"]
while True:
ln = self.label_files[self.index]
img_name = self.dataset_dir + 'leftImg8bit/' + self.subset + ln[len(
self.label_dirname):]
img_name = os.path.join(
self.dataset_dir,
'leftImg8bit/' + self.subset + ln[len(self.label_dirname):])
img_name = img_name.replace('gtFine_labelTrainIds', 'leftImg8bit')
label = cv2.imread(ln)
img = cv2.imread(img_name)
if img is None:
print "load img failed:", img_name
print("load img failed:", img_name)
self.next_img()
else:
break
......@@ -128,5 +140,7 @@ class CityscapeDataset:
from prefetch_generator import BackgroundGenerator
batches = BackgroundGenerator(batches, 100)
except:
print "You can install 'prefetch_generator' for acceleration of data reading."
print(
"You can install 'prefetch_generator' for acceleration of data reading."
)
return batches
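The dataset class shells out to `find | grep | sort` through `commands` on Python 2 and `subprocess` on Python 3. A possible alternative that avoids the shell on both versions is sketched below with `os.walk`; `find_label_files` is a hypothetical helper, and Python's `sorted` may order paths differently from a locale-aware `sort`.

```python
import os


def find_label_files(root, marker="labelTrainIds"):
    """Return sorted paths of files under `root` whose path contains `marker`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # grep matched against the full path, so do the same here.
            if marker in path:
                matches.append(path)
    return sorted(matches)
```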
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
os.environ['FLAGS_fraction_of_gpu_memory_to_use'] = '0.98'
......@@ -126,13 +129,12 @@ exe = fluid.Executor(place)
exe.run(sp)
if args.init_weights_path:
print "load from:", args.init_weights_path
print("load from:", args.init_weights_path)
load_model()
dataset = CityscapeDataset(args.dataset_path, 'train')
if args.parallel:
print "Using ParallelExecutor."
exe_p = fluid.ParallelExecutor(
use_cuda=True, loss_name=loss_mean.name, main_program=tp)
......@@ -149,9 +151,9 @@ for i, imgs, labels, names in batches:
'label': labels},
fetch_list=[pred, loss_mean])
if i % 100 == 0:
print "Model is saved to", args.save_weights_path
print("Model is saved to", args.save_weights_path)
save_model()
print "step %s, loss: %s" % (i, np.mean(retv[1]))
print("step %s, loss: %s" % (i, np.mean(retv[1])))
print "Training done. Model is saved to", args.save_weights_path
print("Training done. Model is saved to", args.save_weights_path)
save_model()
......@@ -10,3 +10,4 @@ output*
pred
eval_tools
box*
PyramidBox_WiderFace*
......@@ -9,6 +9,7 @@ import time
import numpy as np
import threading
import multiprocessing
import traceback
try:
import queue
except ImportError:
......@@ -71,6 +72,7 @@ class GeneratorEnqueuer(object):
try:
task()
except Exception:
traceback.print_exc()
self._stop_event.set()
break
else:
......@@ -78,6 +80,7 @@ class GeneratorEnqueuer(object):
try:
task()
except Exception:
traceback.print_exc()
self._stop_event.set()
break
......
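Without the added `traceback.print_exc()` call, an exception inside a worker simply stops the enqueuer with no hint of the cause. The simplified sketch below shows the pattern; `run_tasks`, `failing_task`, and the `errors` list are illustrative assumptions, and it collects the traceback text with `traceback.format_exc()` instead of printing it so the result can be inspected.

```python
import threading
import traceback


def run_tasks(tasks, stop_event, errors):
    """Run tasks until one fails; record the traceback and signal a stop."""
    for task in tasks:
        if stop_event.is_set():
            break
        try:
            task()
        except Exception:
            # Keep the full traceback so the failure is not silently lost.
            errors.append(traceback.format_exc())
            stop_event.set()
            break


def failing_task():
    raise ValueError("bad sample")


stop_event = threading.Event()
errors = []
worker = threading.Thread(
    target=run_tasks, args=([failing_task], stop_event, errors))
worker.start()
worker.join()
```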
......@@ -427,6 +427,7 @@ class PyramidBox(object):
overlap_threshold=0.35,
neg_overlap=0.35)
loss = fluid.layers.reduce_sum(loss)
loss.persistable = True
return loss
def train(self):
......
......@@ -250,6 +250,10 @@ def train_generator(settings, file_list, batch_size, shuffle=True):
ymin = float(temp_info_box[1])
w = float(temp_info_box[2])
h = float(temp_info_box[3])
# Filter out wrong labels
if w < 0 or h < 0:
continue
xmax = xmin + w
ymax = ymin + h
......@@ -294,7 +298,7 @@ def train(settings,
generator_output = enqueuer.queue.get()
break
else:
time.sleep(0.02)
time.sleep(0.01)
yield generator_output
generator_output = None
finally:
......
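The new check skips annotations whose width or height is negative before the corner coordinates are computed. A standalone sketch of that conversion follows; the function name and the sample boxes are made up for illustration.

```python
def to_corner_boxes(raw_boxes):
    """Convert (xmin, ymin, w, h) boxes to (xmin, ymin, xmax, ymax).

    Boxes with a negative width or height are treated as wrong labels
    and dropped.
    """
    boxes = []
    for xmin, ymin, w, h in raw_boxes:
        if w < 0 or h < 0:
            continue
        boxes.append((xmin, ymin, xmin + w, ymin + h))
    return boxes


boxes = to_corner_boxes([(10.0, 20.0, 30.0, 40.0), (5.0, 5.0, -1.0, 8.0)])
```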
......@@ -167,7 +167,7 @@ def train(args, config, train_params, train_file_list):
shutil.rmtree(model_path)
print('save models to %s' % (model_path))
fluid.io.save_persistables(exe, model_path)
fluid.io.save_persistables(exe, model_path, main_program=program)
train_py_reader.start()
try:
......@@ -189,13 +189,13 @@ def train(args, config, train_params, train_file_list):
fetch_vars = [np.mean(np.array(v)) for v in fetch_vars]
if batch_id % 10 == 0:
if not args.use_pyramidbox:
print("Pass {0}, batch {1}, loss {2}, time {3}".format(
print("Pass {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
pass_id, batch_id, fetch_vars[0],
start_time - prev_start_time))
else:
print("Pass {0}, batch {1}, face loss {2}, " \
"head loss {3}, " \
"time {4}".format(pass_id,
print("Pass {:d}, batch {:d}, face loss {:.6f}, " \
"head loss {:.6f}, " \
"time {:.5f}".format(pass_id,
batch_id, fetch_vars[0], fetch_vars[1],
start_time - prev_start_time))
if pass_id % 1 == 0 or pass_id == epoc_num - 1:
......
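The new format specifiers fix the number of decimals in the log line. The values in the short sketch below are made up; note that `{:d}` accepts only integers, so it suits `pass_id` and `batch_id` but would raise `ValueError` for a float.

```python
line = "Pass {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
    0, 10, 1.23456789, 0.123456789)

try:
    "{:d}".format(1.0)
    float_accepted = True
except ValueError:
    # The 'd' format code is not defined for float values.
    float_accepted = False
```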
......@@ -82,9 +82,6 @@ def save_widerface_bboxes(image_path, bboxes_scores, output_dir):
image_name = image_path.split('/')[-1]
image_class = image_path.split('/')[-2]
image_name = image_name.encode('utf-8')
image_class = image_class.encode('utf-8')
odir = os.path.join(output_dir, image_class)
if not os.path.exists(odir):
os.makedirs(odir)
......
......@@ -43,7 +43,7 @@ After data preparation, one can start the training step by:
python train.py \
--max_size=1333 \
--scales=800 \
--scales=[800] \
--batch_size=8 \
--model_save_dir=output/
......@@ -57,6 +57,22 @@ After data preparation, one can start the training step by:
sh ./pretrained/download.sh
Set `pretrained_model` to load a pre-trained model. This parameter is also used to load a trained model when fine-tuning.
Please make sure the pre-trained model is downloaded and loaded correctly; otherwise, the loss may become NaN during training.
**Install the [cocoapi](https://github.com/cocodataset/cocoapi):**
To train the model, [cocoapi](https://github.com/cocodataset/cocoapi) is needed. Install the cocoapi:
# COCOAPI=/path/to/clone/cocoapi
git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
cd $COCOAPI/PythonAPI
# if cython is not installed
pip install Cython
# Install into global site-packages
make install
# Alternatively, if you do not have permissions or prefer
# not to install the COCO API into global site-packages
python2 setup.py install --user
**data reader introduction:**
......@@ -103,18 +119,7 @@ Finetuning is to finetune model weights in a specific task by loading pretrained
## Evaluation
Evaluation is to evaluate the performance of a trained model. This sample provides `eval_coco_map.py` which uses a COCO-specific mAP metric defined by [COCO committee](http://cocodataset.org/#detections-eval). To use `eval_coco_map.py` , [cocoapi](https://github.com/cocodataset/cocoapi) is needed. Install the cocoapi:
# COCOAPI=/path/to/clone/cocoapi
git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
cd $COCOAPI/PythonAPI
# if cython is not installed
pip install Cython
# Install into global site-packages
make install
# Alternatively, if you do not have permissions or prefer
# not to install the COCO API into global site-packages
python2 setup.py install --user
Evaluation measures the performance of a trained model. This sample provides `eval_coco_map.py`, which uses the COCO-specific mAP metric defined by the [COCO committee](http://cocodataset.org/#detections-eval).
`eval_coco_map.py` is the main executor for evaluation; one can start the evaluation step by:
......@@ -136,7 +141,7 @@ Faster RCNN mAP
| Detectron | 8 | 180000 | 0.315 |
| Fluid minibatch padding | 8 | 180000 | 0.314 |
| Fluid all padding | 8 | 180000 | 0.308 |
| Fluid no padding |6 | 240000 | 0.317 |
| Fluid no padding |8 | 180000 | 0.316 |
* Fluid all padding: Each image is padded to 1333\*1333.
* Fluid minibatch padding: Images in one batch are padded to the same size. This is the same approach as Detectron.
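The two padding strategies in the list above can be sketched with NumPy as follows. This is only an illustration: the helper name `pad_batch`, the CHW layout and zero-filling are assumptions, not the reader's actual implementation.

```python
import numpy as np


def pad_batch(images, fixed_size=None):
    """Zero-pad CHW images to a common height and width.

    fixed_size=None -> "minibatch padding": pad to the largest H and W in the batch.
    fixed_size=1333 -> "all padding": pad every image to fixed_size x fixed_size.
    """
    if fixed_size is None:
        max_h = max(im.shape[1] for im in images)
        max_w = max(im.shape[2] for im in images)
    else:
        max_h = max_w = fixed_size
    channels = images[0].shape[0]
    padded = np.zeros((len(images), channels, max_h, max_w), dtype=np.float32)
    for i, im in enumerate(images):
        _, h, w = im.shape
        # Copy the image into the top-left corner; the rest stays zero.
        padded[i, :, :h, :w] = im
    return padded


batch = [np.ones((3, 800, 1200), dtype=np.float32),
         np.ones((3, 1000, 800), dtype=np.float32)]
print(pad_batch(batch).shape)        # minibatch padding
print(pad_batch(batch, 1333).shape)  # all padding
```

Minibatch padding wastes less memory than padding everything to 1333\*1333, since the padded size follows the largest image actually present in the batch.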
......
......@@ -42,7 +42,7 @@ Faster RCNN object detection model
python train.py \
--max_size=1333 \
--scales=800 \
--scales=[800] \
--batch_size=8 \
--model_save_dir=output/ \
--pretrained_model=${path_to_pretrain_model}
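......@@ -57,6 +57,22 @@ Faster RCNN object detection model
sh ./pretrained/download.sh
Set `pretrained_model` to load the pre-trained model. The same setting is also used to load a trained model when finetuning.
Before training, make sure the pre-trained model has been downloaded and loaded correctly; otherwise the loss may become NaN during training.
**Install [cocoapi](https://github.com/cocodataset/cocoapi):**
[cocoapi](https://github.com/cocodataset/cocoapi) must be downloaded and installed before training: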
# COCOAPI=/path/to/clone/cocoapi
git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
cd $COCOAPI/PythonAPI
# if cython is not installed
pip install Cython
# Install into global site-packages
make install
# Alternatively, if you do not have permissions or prefer
# not to install the COCO API into global site-packages
python2 setup.py install --user
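**Data reader:** The data reader is defined in `reader.py`. Every image is resized, preserving its aspect ratio, so that its short side equals `scales`; if the long side then exceeds `max_size`, the image is resized again so that the long side equals `max_size`. During training, images are flipped horizontally. Images within the same batch can be padded to the same size.
......@@ -87,18 +103,7 @@ Faster RCNN training loss
## Model evaluation
Model evaluation measures the performance metrics of a trained model. This sample uses the [official COCO evaluation](http://cocodataset.org/#detections-eval), which requires [cocoapi](https://github.com/cocodataset/cocoapi) to be downloaded first: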
# COCOAPI=/path/to/clone/cocoapi
git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
cd $COCOAPI/PythonAPI
# if cython is not installed
pip install Cython
# Install into global site-packages
make install
# Alternatively, if you do not have permissions or prefer
# not to install the COCO API into global site-packages
python2 setup.py install --user
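Model evaluation measures the performance metrics of a trained model. This sample uses the [official COCO evaluation](http://cocodataset.org/#detections-eval).
`eval_coco_map.py` is the main executor of the evaluation module. An example invocation: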
......@@ -120,7 +125,7 @@ Faster RCNN mAP
| Detectron | 8 | 180000 | 0.315 |
| Fluid minibatch padding | 8 | 180000 | 0.314 |
| Fluid all padding | 8 | 180000 | 0.308 |
| Fluid no padding |6 | 240000 | 0.317 |
| Fluid no padding |8 | 180000 | 0.316 |
* Fluid all padding: Each image is padded to 1333\*1333.
* Fluid minibatch padding: Images in the same batch are padded to the same size. This is the same approach as Detectron.
......
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from edict import AttrDict
import six
import numpy as np
_C = AttrDict()
cfg = _C
#
# Training options
#
_C.TRAIN = AttrDict()
# scales an image's shortest side
_C.TRAIN.scales = [800]
# max size of longest side
_C.TRAIN.max_size = 1333
# images per GPU in minibatch
_C.TRAIN.im_per_batch = 1
# roi minibatch size per image
_C.TRAIN.batch_size_per_im = 512
# target fraction of foreground roi minibatch
_C.TRAIN.fg_fractrion = 0.25
# overlap threshold for a foreground roi
_C.TRAIN.fg_thresh = 0.5
# overlap threshold for a background roi
_C.TRAIN.bg_thresh_hi = 0.5
_C.TRAIN.bg_thresh_lo = 0.0
# If False, only resize image and not pad, image shape is different between
# GPUs in one mini-batch. If True, image shape is the same in one mini-batch.
_C.TRAIN.padding_minibatch = False
# Snapshot period
_C.TRAIN.snapshot_iter = 10000
# number of RPN proposals to keep before NMS
_C.TRAIN.rpn_pre_nms_top_n = 12000
# number of RPN proposals to keep after NMS
_C.TRAIN.rpn_post_nms_top_n = 2000
# NMS threshold used on RPN proposals
_C.TRAIN.rpn_nms_thresh = 0.7
# min size in RPN proposals
_C.TRAIN.rpn_min_size = 0.0
# eta for adaptive NMS in RPN
_C.TRAIN.rpn_eta = 1.0
# number of RPN examples per image
_C.TRAIN.rpn_batch_size_per_im = 256
# remove anchors out of the image
_C.TRAIN.rpn_straddle_thresh = 0.
# target fraction of foreground examples per RPN minibatch
_C.TRAIN.rpn_fg_fraction = 0.5
# min overlap between anchor and gt box to be a positive example
_C.TRAIN.rpn_positive_overlap = 0.7
# max overlap between anchor and gt box to be a negative example
_C.TRAIN.rpn_negative_overlap = 0.3
# stopgrad at a specified stage
_C.TRAIN.freeze_at = 2
# min area of ground truth box
_C.TRAIN.gt_min_area = -1
#
# Inference options
#
_C.TEST = AttrDict()
# scales an image's shortest side
_C.TEST.scales = [800]
# max size of longest side
_C.TEST.max_size = 1333
# eta for adaptive NMS in RPN
_C.TEST.rpn_eta = 1.0
# min score threshold to infer
_C.TEST.score_thresh = 0.05
# overlap threshold used for NMS
_C.TEST.nms_thresh = 0.5
# number of RPN proposals to keep before NMS
_C.TEST.rpn_pre_nms_top_n = 6000
# number of RPN proposals to keep after NMS
_C.TEST.rpn_post_nms_top_n = 1000
# min size in RPN proposals
_C.TEST.rpn_min_size = 0.0
# max number of detections
_C.TEST.detectiions_per_im = 100
# NMS threshold used on RPN proposals
_C.TEST.rpn_nms_thresh = 0.7
#
# Model options
#
# weight for bbox regression targets
_C.bbox_reg_weights = [0.1, 0.1, 0.2, 0.2]
# RPN anchor sizes
_C.anchor_sizes = [32, 64, 128, 256, 512]
# RPN anchor ratio
_C.aspect_ratio = [0.5, 1, 2]
# variance of anchors
_C.variances = [1., 1., 1., 1.]
# stride of feature map
_C.rpn_stride = [16.0, 16.0]
# Use roi pool or roi align, 'RoIPool' or 'RoIAlign'
_C.roi_func = 'RoIAlign'
# sampling ratio for roi align
_C.sampling_ratio = 0
# pooled width and pooled height
_C.roi_resolution = 14
# spatial scale
_C.spatial_scale = 1. / 16.
#
# SOLVER options
#
# base learning rate, from which the final learning rate is derived
_C.learning_rate = 0.01
# maximum number of iterations
_C.max_iter = 180000
# warm up to learning rate
_C.warm_up_iter = 500
_C.warm_up_factor = 1. / 3.
# lr steps_with_decay
_C.lr_steps = [120000, 160000]
_C.lr_gamma = 0.1
# L2 regularization hyperparameter
_C.weight_decay = 0.0001
# momentum with SGD
_C.momentum = 0.9
#
# ENV options
#
# support both CPU and GPU
_C.use_gpu = True
# Whether to use parallel execution
_C.parallel = True
# Class number
_C.class_num = 81
# support pyreader
_C.use_pyreader = True
# pixel mean values
_C.pixel_means = [102.9801, 115.9465, 122.7717]
# clip box to prevent overflowing
_C.bbox_clip = np.log(1000. / 16.)
# dataset path
_C.train_file_list = 'annotations/instances_train2017.json'
_C.train_data_dir = 'train2017'
_C.val_file_list = 'annotations/instances_val2017.json'
_C.val_data_dir = 'val2017'
def merge_cfg_from_args(args, mode):
    """Merge config keys, values in args into the global config."""
    if mode == 'train':
        sub_d = _C.TRAIN
    else:
        sub_d = _C.TEST
    for k, v in sorted(six.iteritems(vars(args))):
        d = _C
        try:
            value = eval(v)
        except:
            value = v
        if k in sub_d:
            sub_d[k] = value
        else:
            d[k] = value
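The merge logic can be tried in isolation with the sketch below. It is not the repository's code: `merge_args` is a hypothetical name, the small `AttrDict` is a stand-in for the one in `edict.py`, and `ast.literal_eval` is swapped in for `eval` as an assumption about what a stricter variant might look like (it only parses Python literals).

```python
import argparse
import ast


class AttrDict(dict):
    """Stand-in for the repo's AttrDict: dict keys readable as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value


def merge_args(cfg, args, mode):
    """Copy parsed arguments into cfg, routing known keys to cfg.TRAIN or cfg.TEST."""
    sub_cfg = cfg.TRAIN if mode == 'train' else cfg.TEST
    for key, raw in sorted(vars(args).items()):
        try:
            # Turn strings such as '[600, 800]' or '1000' into Python values.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError, TypeError):
            value = raw  # keep plain strings such as paths unchanged
        if key in sub_cfg:
            sub_cfg[key] = value  # mode-specific option
        else:
            cfg[key] = value  # global option
    return cfg


cfg = AttrDict(TRAIN=AttrDict(scales=[800], max_size=1333), TEST=AttrDict())
args = argparse.Namespace(scales='[600, 800]', max_size='1000', data_dir='data/coco')
merge_args(cfg, args, 'train')
print(cfg.TRAIN.scales, cfg.TRAIN.max_size, cfg.data_dir)
```

Keys that already exist under the mode-specific section are overwritten there; everything else lands at the top level of the config.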
......@@ -27,21 +27,27 @@ from __future__ import unicode_literals
import cv2
import numpy as np
from config import cfg
def get_image_blob(roidb, settings):
def get_image_blob(roidb, mode):
"""Builds an input blob from the images in the roidb at the specified
scales.
"""
scale_ind = np.random.randint(0, high=len(settings.scales))
if mode == 'train':
scales = cfg.TRAIN.scales
scale_ind = np.random.randint(0, high=len(scales))
target_size = scales[scale_ind]
max_size = cfg.TRAIN.max_size
else:
target_size = cfg.TEST.scales[0]
max_size = cfg.TEST.max_size
im = cv2.imread(roidb['image'])
assert im is not None, \
'Failed to read image \'{}\''.format(roidb['image'])
if roidb['flipped']:
im = im[:, ::-1, :]
target_size = settings.scales[scale_ind]
im, im_scale = prep_im_for_blob(im, settings.mean_value, target_size,
settings.max_size)
im, im_scale = prep_im_for_blob(im, cfg.pixel_means, target_size, max_size)
return im, im_scale
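`prep_im_for_blob` is not shown in this diff. Assuming it follows the resizing rule described in the README (short side scaled to the target size, long side capped at `max_size`), the scale factor could be computed as in this sketch; `compute_im_scale` is a hypothetical helper name.

```python
def compute_im_scale(height, width, target_size=800, max_size=1333):
    """Return the resize factor for an image of the given height and width."""
    short_side = min(height, width)
    long_side = max(height, width)
    # First scale so that the short side equals target_size.
    scale = float(target_size) / short_side
    # If that pushes the long side past max_size, scale by the long side instead.
    if round(scale * long_side) > max_size:
        scale = float(max_size) / long_side
    return scale


print(compute_im_scale(480, 640))   # short side governs
print(compute_im_scale(400, 1000))  # long side is capped at max_size
```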
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)

    def __getattr__(self, name):
        if name in self.__dict__:
            return self.__dict__[name]
        elif name in self:
            return self[name]
        else:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        if name in self.__dict__:
            self.__dict__[name] = value
        else:
            self[name] = value
......@@ -29,18 +29,20 @@ import models.resnet as resnet
import json
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval, Params
from config import cfg
def eval(cfg):
def eval():
if '2014' in cfg.dataset:
test_list = 'annotations/instances_val2014.json'
elif '2017' in cfg.dataset:
test_list = 'annotations/instances_val2017.json'
image_shape = [3, cfg.max_size, cfg.max_size]
image_shape = [3, cfg.TEST.max_size, cfg.TEST.max_size]
class_nums = cfg.class_num
batch_size = cfg.batch_size
devices = os.getenv("CUDA_VISIBLE_DEVICES") or ""
devices_num = len(devices.split(","))
total_batch_size = devices_num * cfg.TRAIN.im_per_batch
cocoGt = COCO(os.path.join(cfg.data_dir, test_list))
numId_to_catId_map = {i + 1: v for i, v in enumerate(cocoGt.getCatIds())}
category_ids = cocoGt.getCatIds()
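The `numId_to_catId_map` above is needed because COCO category ids are not contiguous, while the model predicts contiguous class indices with 0 reserved for background. A small sketch without `pycocotools`, using a hand-written excerpt of ids as an assumption (the real list comes from `cocoGt.getCatIds()`):

```python
# Hypothetical excerpt of COCO category ids; note the gap at id 12.
cat_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14]

# Class index 0 is background, so predicted class indices start at 1.
num_id_to_cat_id = {i + 1: cat_id for i, cat_id in enumerate(cat_ids)}
# Inverse mapping, useful when converting annotations to class indices.
cat_id_to_num_id = {cat_id: num_id for num_id, cat_id in num_id_to_cat_id.items()}

print(num_id_to_cat_id[12])  # class index 12 maps to COCO category id 13
print(cat_id_to_num_id[14])
```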
......@@ -51,7 +53,6 @@ def eval(cfg):
label_list[0] = ['background']
model = model_builder.FasterRCNN(
cfg=cfg,
add_conv_body_func=resnet.add_ResNet50_conv4_body,
add_roi_box_head_func=resnet.add_ResNet_roi_conv5_head,
use_pyreader=False,
......@@ -66,7 +67,7 @@ def eval(cfg):
return os.path.exists(os.path.join(cfg.pretrained_model, var.name))
fluid.io.load_vars(exe, cfg.pretrained_model, predicate=if_exist)
# yapf: enable
test_reader = reader.test(cfg, batch_size)
test_reader = reader.test(total_batch_size)
feeder = fluid.DataFeeder(place=place, feed_list=model.feeds())
dts_res = []
......@@ -80,11 +81,11 @@ def eval(cfg):
fetch_list=[v.name for v in fetch_list],
feed=feeder.feed(batch_data),
return_numpy=False)
new_lod, nmsed_out = get_nmsed_box(cfg, rpn_rois_v, confs_v, locs_v,
new_lod, nmsed_out = get_nmsed_box(rpn_rois_v, confs_v, locs_v,
class_nums, im_info,
numId_to_catId_map)
dts_res += get_dt_res(batch_size, new_lod, nmsed_out, batch_data)
dts_res += get_dt_res(total_batch_size, new_lod, nmsed_out, batch_data)
end = time.time()
print('batch id: {}, time: {}'.format(batch_id, end - start))
with open("detection_result.json", 'w') as outfile:
......@@ -100,6 +101,4 @@ def eval(cfg):
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
data_args = reader.Settings(args)
eval(data_args)
eval()
......@@ -20,6 +20,7 @@ import box_utils
from PIL import Image
from PIL import ImageDraw
from PIL import ImageFont
from config import cfg
def box_decoder(target_box, prior_box, prior_box_var):
......@@ -31,10 +32,8 @@ def box_decoder(target_box, prior_box, prior_box_var):
prior_box_loc[:, 3] = (prior_box[:, 3] + prior_box[:, 1]) / 2
pred_bbox = np.zeros_like(target_box, dtype=np.float32)
for i in range(prior_box.shape[0]):
dw = np.minimum(prior_box_var[2] * target_box[i, 2::4],
np.log(1000. / 16.))
dh = np.minimum(prior_box_var[3] * target_box[i, 3::4],
np.log(1000. / 16.))
dw = np.minimum(prior_box_var[2] * target_box[i, 2::4], cfg.bbox_clip)
dh = np.minimum(prior_box_var[3] * target_box[i, 3::4], cfg.bbox_clip)
pred_bbox[i, 0::4] = prior_box_var[0] * target_box[
i, 0::4] * prior_box_loc[i, 0] + prior_box_loc[i, 2]
pred_bbox[i, 1::4] = prior_box_var[1] * target_box[
......@@ -67,11 +66,11 @@ def clip_tiled_boxes(boxes, im_shape):
return boxes
def get_nmsed_box(args, rpn_rois, confs, locs, class_nums, im_info,
def get_nmsed_box(rpn_rois, confs, locs, class_nums, im_info,
numId_to_catId_map):
lod = rpn_rois.lod()[0]
rpn_rois_v = np.array(rpn_rois)
variance_v = np.array([0.1, 0.1, 0.2, 0.2])
variance_v = np.array(cfg.bbox_reg_weights)
confs_v = np.array(confs)
locs_v = np.array(locs)
rois = box_decoder(locs_v, rpn_rois_v, variance_v)
......@@ -89,12 +88,12 @@ def get_nmsed_box(args, rpn_rois, confs, locs, class_nums, im_info,
cls_boxes = [[] for _ in range(class_nums)]
scores_n = confs_v[start:end, :]
for j in range(1, class_nums):
inds = np.where(scores_n[:, j] > args.score_threshold)[0]
inds = np.where(scores_n[:, j] > cfg.TEST.score_thresh)[0]
scores_j = scores_n[inds, j]
rois_j = rois_n[inds, j * 4:(j + 1) * 4]
dets_j = np.hstack((rois_j, scores_j[:, np.newaxis])).astype(
np.float32, copy=False)
keep = box_utils.nms(dets_j, args.nms_threshold)
keep = box_utils.nms(dets_j, cfg.TEST.nms_thresh)
nms_dets = dets_j[keep, :]
#add labels
cat_id = numId_to_catId_map[j]
......@@ -105,8 +104,8 @@ def get_nmsed_box(args, rpn_rois, confs, locs, class_nums, im_info,
# Limit to max_per_image detections **over all classes**
image_scores = np.hstack(
[cls_boxes[j][:, -2] for j in range(1, class_nums)])
if len(image_scores) > 100:
image_thresh = np.sort(image_scores)[-100]
if len(image_scores) > cfg.TEST.detectiions_per_im:
image_thresh = np.sort(image_scores)[-cfg.TEST.detectiions_per_im]
for j in range(1, class_nums):
keep = np.where(cls_boxes[j][:, -2] >= image_thresh)[0]
cls_boxes[j] = cls_boxes[j][keep, :]
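The per-image cap on detections can be illustrated on its own. This sketch assumes each per-class array has the layout `[x1, y1, x2, y2, score]`, so the score is the last column (in the code above a label column follows the score, hence the index `-2`); `keep_top_detections` is a hypothetical helper name.

```python
import numpy as np


def keep_top_detections(cls_boxes, max_per_image=100, score_col=-1):
    """Keep at most max_per_image detections across all classes, ranked by score."""
    scores = np.hstack([boxes[:, score_col] for boxes in cls_boxes])
    if len(scores) <= max_per_image:
        return cls_boxes
    # The max_per_image-th highest score becomes the cut-off.
    thresh = np.sort(scores)[-max_per_image]
    return [boxes[boxes[:, score_col] >= thresh] for boxes in cls_boxes]


dets = [np.array([[0, 0, 10, 10, 0.9], [0, 0, 5, 5, 0.2]]),
        np.array([[1, 1, 8, 8, 0.6]])]
kept = keep_top_detections(dets, max_per_image=2)
print([len(k) for k in kept])
```

Because the filter uses `>=`, tied scores at the cut-off can leave slightly more than `max_per_image` detections.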
......
fluid/faster_rcnn/image/mAP.jpg (image changed; sizes shown: 80.0 KB, then 41.0 KB)
import os
import time
import numpy as np
from eval_helper import get_nmsed_box
from eval_helper import get_dt_res
from eval_helper import draw_bounding_box_on_image
import paddle
import paddle.fluid as fluid
import reader
from utility import print_arguments, parse_args
import models.model_builder as model_builder
import models.resnet as resnet
import json
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval, Params
from config import cfg
def infer():
    if '2014' in cfg.dataset:
        test_list = 'annotations/instances_val2014.json'
    elif '2017' in cfg.dataset:
        test_list = 'annotations/instances_val2017.json'

    cocoGt = COCO(os.path.join(cfg.data_dir, test_list))
    numId_to_catId_map = {i + 1: v for i, v in enumerate(cocoGt.getCatIds())}
    category_ids = cocoGt.getCatIds()
    label_list = {
        item['id']: item['name']
        for item in cocoGt.loadCats(category_ids)
    }
    label_list[0] = ['background']
    image_shape = [3, cfg.TEST.max_size, cfg.TEST.max_size]
    class_nums = cfg.class_num

    model = model_builder.FasterRCNN(
        add_conv_body_func=resnet.add_ResNet50_conv4_body,
        add_roi_box_head_func=resnet.add_ResNet_roi_conv5_head,
        use_pyreader=False,
        is_train=False)
    model.build_model(image_shape)
    rpn_rois, confs, locs = model.eval_out()
    place = fluid.CUDAPlace(0) if cfg.use_gpu else fluid.CPUPlace()
    exe = fluid.Executor(place)
    # yapf: disable
    if cfg.pretrained_model:
        def if_exist(var):
            return os.path.exists(os.path.join(cfg.pretrained_model, var.name))
        fluid.io.load_vars(exe, cfg.pretrained_model, predicate=if_exist)
    # yapf: enable
    infer_reader = reader.infer()
    feeder = fluid.DataFeeder(place=place, feed_list=model.feeds())

    dts_res = []
    fetch_list = [rpn_rois, confs, locs]
    data = next(infer_reader())
    im_info = [data[0][1]]
    rpn_rois_v, confs_v, locs_v = exe.run(
        fetch_list=[v.name for v in fetch_list],
        feed=feeder.feed(data),
        return_numpy=False)
    new_lod, nmsed_out = get_nmsed_box(rpn_rois_v, confs_v, locs_v, class_nums,
                                       im_info, numId_to_catId_map)
    path = os.path.join(cfg.image_path, cfg.image_name)
    draw_bounding_box_on_image(path, nmsed_out, cfg.draw_threshold, label_list)


if __name__ == '__main__':
    args = parse_args()
    print_arguments(args)
    infer()
......@@ -17,11 +17,11 @@ from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.initializer import Constant
from paddle.fluid.initializer import Normal
from paddle.fluid.regularizer import L2Decay
from config import cfg
class FasterRCNN(object):
def __init__(self,
cfg=None,
add_conv_body_func=None,
add_roi_box_head_func=None,
is_train=True,
......@@ -29,7 +29,6 @@ class FasterRCNN(object):
use_random=True):
self.add_conv_body_func = add_conv_body_func
self.add_roi_box_head_func = add_roi_box_head_func
self.cfg = cfg
self.is_train = is_train
self.use_pyreader = use_pyreader
self.use_random = use_random
......@@ -111,10 +110,10 @@ class FasterRCNN(object):
name="conv_rpn_b", learning_rate=2., regularizer=L2Decay(0.)))
self.anchor, self.var = fluid.layers.anchor_generator(
input=rpn_conv,
anchor_sizes=self.cfg.anchor_sizes,
aspect_ratios=self.cfg.aspect_ratios,
variance=self.cfg.variance,
stride=[16.0, 16.0])
anchor_sizes=cfg.anchor_sizes,
aspect_ratios=cfg.aspect_ratio,
variance=cfg.variances,
stride=cfg.rpn_stride)
num_anchor = self.anchor.shape[2]
# Proposal classification scores
self.rpn_cls_score = fluid.layers.conv2d(
......@@ -152,8 +151,12 @@ class FasterRCNN(object):
rpn_cls_score_prob = fluid.layers.sigmoid(
self.rpn_cls_score, name='rpn_cls_score_prob')
pre_nms_top_n = 12000 if self.is_train else 6000
post_nms_top_n = 2000 if self.is_train else 1000
param_obj = cfg.TRAIN if self.is_train else cfg.TEST
pre_nms_top_n = param_obj.rpn_pre_nms_top_n
post_nms_top_n = param_obj.rpn_post_nms_top_n
nms_thresh = param_obj.rpn_nms_thresh
min_size = param_obj.rpn_min_size
eta = param_obj.rpn_eta
rpn_rois, rpn_roi_probs = fluid.layers.generate_proposals(
scores=rpn_cls_score_prob,
bbox_deltas=self.rpn_bbox_pred,
......@@ -162,9 +165,9 @@ class FasterRCNN(object):
variances=self.var,
pre_nms_top_n=pre_nms_top_n,
post_nms_top_n=post_nms_top_n,
nms_thresh=0.7,
min_size=0.0,
eta=1.0)
nms_thresh=nms_thresh,
min_size=min_size,
eta=eta)
self.rpn_rois = rpn_rois
if self.is_train:
outs = fluid.layers.generate_proposal_labels(
......@@ -173,13 +176,13 @@ class FasterRCNN(object):
is_crowd=self.is_crowd,
gt_boxes=self.gt_box,
im_info=self.im_info,
batch_size_per_im=self.cfg.batch_size_per_im,
fg_fraction=0.25,
fg_thresh=0.5,
bg_thresh_hi=0.5,
bg_thresh_lo=0.0,
bbox_reg_weights=[0.1, 0.1, 0.2, 0.2],
class_nums=self.cfg.class_num,
batch_size_per_im=cfg.TRAIN.batch_size_per_im,
fg_fraction=cfg.TRAIN.fg_fractrion,
fg_thresh=cfg.TRAIN.fg_thresh,
bg_thresh_hi=cfg.TRAIN.bg_thresh_hi,
bg_thresh_lo=cfg.TRAIN.bg_thresh_lo,
bbox_reg_weights=cfg.bbox_reg_weights,
class_nums=cfg.class_num,
use_random=self.use_random)
self.rois = outs[0]
......@@ -193,15 +196,24 @@ class FasterRCNN(object):
pool_rois = self.rois
else:
pool_rois = self.rpn_rois
if cfg.roi_func == 'RoIPool':
pool = fluid.layers.roi_pool(
input=roi_input,
rois=pool_rois,
pooled_height=14,
pooled_width=14,
spatial_scale=0.0625)
pooled_height=cfg.roi_resolution,
pooled_width=cfg.roi_resolution,
spatial_scale=cfg.spatial_scale)
elif cfg.roi_func == 'RoIAlign':
pool = fluid.layers.roi_align(
input=roi_input,
rois=pool_rois,
pooled_height=cfg.roi_resolution,
pooled_width=cfg.roi_resolution,
spatial_scale=cfg.spatial_scale,
sampling_ratio=cfg.sampling_ratio)
rcnn_out = self.add_roi_box_head_func(pool)
self.cls_score = fluid.layers.fc(input=rcnn_out,
size=self.cfg.class_num,
size=cfg.class_num,
act=None,
name='cls_score',
param_attr=ParamAttr(
......@@ -213,7 +225,7 @@ class FasterRCNN(object):
learning_rate=2.,
regularizer=L2Decay(0.)))
self.bbox_pred = fluid.layers.fc(input=rcnn_out,
size=4 * self.cfg.class_num,
size=4 * cfg.class_num,
act=None,
name='bbox_pred',
param_attr=ParamAttr(
......@@ -257,7 +269,6 @@ class FasterRCNN(object):
x=rpn_cls_score_reshape, shape=(0, -1, 1))
rpn_bbox_pred_reshape = fluid.layers.reshape(
x=rpn_bbox_pred_reshape, shape=(0, -1, 4))
score_pred, loc_pred, score_tgt, loc_tgt = \
fluid.layers.rpn_target_assign(
bbox_pred=rpn_bbox_pred_reshape,
......@@ -267,11 +278,11 @@ class FasterRCNN(object):
gt_boxes=self.gt_box,
is_crowd=self.is_crowd,
im_info=self.im_info,
rpn_batch_size_per_im=256,
rpn_straddle_thresh=0.0,
rpn_fg_fraction=0.5,
rpn_positive_overlap=0.7,
rpn_negative_overlap=0.3,
rpn_batch_size_per_im=cfg.TRAIN.rpn_batch_size_per_im,
rpn_straddle_thresh=cfg.TRAIN.rpn_straddle_thresh,
rpn_fg_fraction=cfg.TRAIN.rpn_fg_fraction,
rpn_positive_overlap=cfg.TRAIN.rpn_positive_overlap,
rpn_negative_overlap=cfg.TRAIN.rpn_negative_overlap,
use_random=self.use_random)
score_tgt = fluid.layers.cast(x=score_tgt, dtype='float32')
rpn_cls_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
......
......@@ -16,6 +16,7 @@ import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.initializer import Constant
from paddle.fluid.regularizer import L2Decay
from config import cfg
def conv_bn_layer(input,
......@@ -88,8 +89,7 @@ def conv_affine_layer(input,
default_initializer=Constant(0.))
bias.stop_gradient = True
elt_mul = fluid.layers.elementwise_mul(x=conv, y=scale, axis=1)
out = fluid.layers.elementwise_add(x=elt_mul, y=bias, axis=1)
out = fluid.layers.affine_channel(x=conv, scale=scale, bias=bias)
if act == 'relu':
out = fluid.layers.relu(x=out)
return out
......@@ -137,7 +137,7 @@ ResNet_cfg = {
}
def add_ResNet50_conv4_body(body_input, freeze_at=2):
def add_ResNet50_conv4_body(body_input):
stages, block_func = ResNet_cfg[50]
stages = stages[0:3]
conv1 = conv_affine_layer(
......@@ -149,13 +149,13 @@ def add_ResNet50_conv4_body(body_input, freeze_at=2):
pool_stride=2,
pool_padding=1)
res2 = layer_warp(block_func, pool1, 64, stages[0], 1, name="res2")
if freeze_at == 2:
if cfg.TRAIN.freeze_at == 2:
res2.stop_gradient = True
res3 = layer_warp(block_func, res2, 128, stages[1], 2, name="res3")
if freeze_at == 3:
if cfg.TRAIN.freeze_at == 3:
res3.stop_gradient = True
res4 = layer_warp(block_func, res3, 256, stages[2], 2, name="res4")
if freeze_at == 4:
if cfg.TRAIN.freeze_at == 4:
res4.stop_gradient = True
return res4
......
DIR="$( cd "$(dirname "$0")" ; pwd -P )"
DIR="$(dirname "$PWD -P")"
cd "$DIR"
# Download the data.
......
......@@ -26,19 +26,18 @@ import paddle.fluid.profiler as profiler
import models.model_builder as model_builder
import models.resnet as resnet
from learning_rate import exponential_with_warmup_decay
from config import cfg
def train(cfg):
batch_size = cfg.batch_size
def train():
learning_rate = cfg.learning_rate
image_shape = [3, cfg.max_size, cfg.max_size]
image_shape = [3, cfg.TRAIN.max_size, cfg.TRAIN.max_size]
num_iterations = cfg.max_iter
devices = os.getenv("CUDA_VISIBLE_DEVICES") or ""
devices_num = len(devices.split(","))
total_batch_size = devices_num * cfg.TRAIN.im_per_batch
model = model_builder.FasterRCNN(
cfg=cfg,
add_conv_body_func=resnet.add_ResNet50_conv4_body,
add_roi_box_head_func=resnet.add_ResNet_roi_conv5_head,
use_pyreader=cfg.use_pyreader,
......@@ -51,8 +50,10 @@ def train(cfg):
rpn_reg_loss.persistable = True
loss = loss_cls + loss_bbox + rpn_cls_loss + rpn_reg_loss
boundaries = [120000, 160000]
values = [learning_rate, learning_rate * 0.1, learning_rate * 0.01]
boundaries = cfg.lr_steps
gamma = cfg.lr_gamma
step_num = len(boundaries)
values = [learning_rate * (gamma**i) for i in range(step_num + 1)]
optimizer = fluid.optimizer.Momentum(
learning_rate=exponential_with_warmup_decay(
......@@ -82,22 +83,16 @@ def train(cfg):
train_exe = fluid.ParallelExecutor(
use_cuda=bool(cfg.use_gpu), loss_name=loss.name)
assert cfg.batch_size % devices_num == 0, \
"batch_size = %d, devices_num = %d" %(cfg.batch_size, devices_num)
batch_size_per_dev = cfg.batch_size / devices_num
if cfg.use_pyreader:
train_reader = reader.train(
cfg,
batch_size=batch_size_per_dev,
total_batch_size=cfg.batch_size,
padding_total=cfg.padding_minibatch,
batch_size=cfg.TRAIN.im_per_batch,
total_batch_size=total_batch_size,
padding_total=cfg.TRAIN.padding_minibatch,
shuffle=False)
py_reader = model.py_reader
py_reader.decorate_paddle_reader(train_reader)
else:
train_reader = reader.train(
cfg, batch_size=cfg.batch_size, shuffle=False)
train_reader = reader.train(batch_size=total_batch_size, shuffle=False)
feeder = fluid.DataFeeder(place=place, feed_list=model.feeds())
fetch_list = [loss, loss_cls, loss_bbox, rpn_cls_loss, rpn_reg_loss]
......@@ -109,7 +104,7 @@ def train(cfg):
for batch_id in range(iterations):
start_time = time.time()
data = train_reader().next()
data = next(train_reader())
end_time = time.time()
reader_time.append(end_time - start_time)
start_time = time.time()
......@@ -163,8 +158,7 @@ def train(cfg):
run_func(2)
# profiling
start = time.time()
use_profile = False
if use_profile:
if cfg.use_profile:
with profiler.profiler('GPU', 'total', '/tmp/profile_file'):
reader_time, run_time, total_images = run_func(num_iterations)
else:
......@@ -181,6 +175,4 @@ def train(cfg):
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
data_args = reader.Settings(args)
train(data_args)
train()
......@@ -26,58 +26,45 @@ from collections import deque
from roidbs import JsonDataset
import data_utils
from config import cfg
class Settings(object):
def __init__(self, args=None):
for arg, value in sorted(six.iteritems(vars(args))):
setattr(self, arg, value)
if 'coco2014' in args.dataset:
self.class_nums = 81
self.train_file_list = 'annotations/instances_train2014.json'
self.train_data_dir = 'train2014'
self.val_file_list = 'annotations/instances_val2014.json'
self.val_data_dir = 'val2014'
elif 'coco2017' in args.dataset:
self.class_nums = 81
self.train_file_list = 'annotations/instances_train2017.json'
self.train_data_dir = 'train2017'
self.val_file_list = 'annotations/instances_val2017.json'
self.val_data_dir = 'val2017'
else:
raise NotImplementedError('Dataset {} not supported'.format(
self.dataset))
self.mean_value = np.array(self.mean_value)[
np.newaxis, np.newaxis, :].astype('float32')
def coco(settings,
mode,
def coco(mode,
batch_size=None,
total_batch_size=None,
padding_total=False,
shuffle=False):
if 'coco2014' in cfg.dataset:
cfg.train_file_list = 'annotations/instances_train2014.json'
cfg.train_data_dir = 'train2014'
cfg.val_file_list = 'annotations/instances_val2014.json'
cfg.val_data_dir = 'val2014'
elif 'coco2017' in cfg.dataset:
cfg.train_file_list = 'annotations/instances_train2017.json'
cfg.train_data_dir = 'train2017'
cfg.val_file_list = 'annotations/instances_val2017.json'
cfg.val_data_dir = 'val2017'
else:
raise NotImplementedError('Dataset {} not supported'.format(
cfg.dataset))
cfg.mean_value = np.array(cfg.pixel_means)[np.newaxis,
np.newaxis, :].astype('float32')
total_batch_size = total_batch_size if total_batch_size else batch_size
if mode != 'infer':
assert total_batch_size % batch_size == 0
if mode == 'train':
settings.train_file_list = os.path.join(settings.data_dir,
settings.train_file_list)
settings.train_data_dir = os.path.join(settings.data_dir,
settings.train_data_dir)
cfg.train_file_list = os.path.join(cfg.data_dir, cfg.train_file_list)
cfg.train_data_dir = os.path.join(cfg.data_dir, cfg.train_data_dir)
elif mode == 'test' or mode == 'infer':
settings.val_file_list = os.path.join(settings.data_dir,
settings.val_file_list)
settings.val_data_dir = os.path.join(settings.data_dir,
settings.val_data_dir)
json_dataset = JsonDataset(settings, train=(mode == 'train'))
cfg.val_file_list = os.path.join(cfg.data_dir, cfg.val_file_list)
cfg.val_data_dir = os.path.join(cfg.data_dir, cfg.val_data_dir)
json_dataset = JsonDataset(train=(mode == 'train'))
roidbs = json_dataset.get_roidb()
print("{} on {} with {} roidbs".format(mode, settings.dataset, len(roidbs)))
print("{} on {} with {} roidbs".format(mode, cfg.dataset, len(roidbs)))
def roidb_reader(roidb, mode):
im, im_scales = data_utils.get_image_blob(roidb, settings)
im, im_scales = data_utils.get_image_blob(roidb, mode)
im_id = roidb['id']
im_height = np.round(roidb['height'] * im_scales)
im_width = np.round(roidb['width'] * im_scales)
......@@ -150,7 +137,7 @@ def coco(settings,
else:
for roidb in roidbs:
if settings.image_name not in roidb['image']:
if cfg.image_name not in roidb['image']:
continue
im, im_info, im_id = roidb_reader(roidb, mode)
batch_out = [(im, im_info, im_id)]
......@@ -159,23 +146,14 @@ def coco(settings,
return reader
def train(settings,
batch_size,
total_batch_size=None,
padding_total=False,
shuffle=True):
def train(batch_size, total_batch_size=None, padding_total=False, shuffle=True):
return coco(
settings,
'train',
batch_size,
total_batch_size,
padding_total,
shuffle=shuffle)
'train', batch_size, total_batch_size, padding_total, shuffle=shuffle)
def test(settings, batch_size, total_batch_size=None, padding_total=False):
return coco(settings, 'test', batch_size, total_batch_size, shuffle=False)
def test(batch_size, total_batch_size=None, padding_total=False):
return coco('test', batch_size, total_batch_size, shuffle=False)
def infer(settings):
return coco(settings, 'infer')
def infer():
return coco('infer')
......@@ -26,7 +26,6 @@ from __future__ import print_function
from __future__ import unicode_literals
import copy
import cPickle as pickle
import logging
import numpy as np
import os
......@@ -37,6 +36,7 @@ import matplotlib
matplotlib.use('Agg')
from pycocotools.coco import COCO
import box_utils
from config import cfg
logger = logging.getLogger(__name__)
......@@ -44,16 +44,16 @@ logger = logging.getLogger(__name__)
class JsonDataset(object):
"""A class representing a COCO json dataset."""
def __init__(self, args, train=False):
print('Creating: {}'.format(args.dataset))
self.name = args.dataset
def __init__(self, train=False):
print('Creating: {}'.format(cfg.dataset))
self.name = cfg.dataset
self.is_train = train
if self.is_train:
data_dir = args.train_data_dir
file_list = args.train_file_list
data_dir = cfg.train_data_dir
file_list = cfg.train_file_list
else:
data_dir = args.val_data_dir
file_list = args.val_file_list
data_dir = cfg.val_data_dir
file_list = cfg.val_file_list
self.image_directory = data_dir
self.COCO = COCO(file_list)
# Set up dataset classes
......@@ -91,7 +91,6 @@ class JsonDataset(object):
end_time = time.time()
print('_add_gt_annotations took {:.3f}s'.format(end_time -
start_time))
print('Appending horizontally-flipped training examples...')
self._extend_with_flipped_entries(roidb)
print('Loaded dataset: {:s}'.format(self.name))
......@@ -130,7 +129,7 @@ class JsonDataset(object):
width = entry['width']
height = entry['height']
for obj in objs:
if obj['area'] < -1: #cfg.TRAIN.GT_MIN_AREA:
if obj['area'] < cfg.TRAIN.gt_min_area:
continue
if 'ignore' in obj and obj['ignore'] == 1:
continue
......
......@@ -28,11 +28,12 @@ import reader
import models.model_builder as model_builder
import models.resnet as resnet
from learning_rate import exponential_with_warmup_decay
from config import cfg
def train(cfg):
def train():
learning_rate = cfg.learning_rate
image_shape = [3, cfg.max_size, cfg.max_size]
image_shape = [3, cfg.TRAIN.max_size, cfg.TRAIN.max_size]
if cfg.debug:
fluid.default_startup_program().random_seed = 1000
......@@ -43,9 +44,9 @@ def train(cfg):
devices = os.getenv("CUDA_VISIBLE_DEVICES") or ""
devices_num = len(devices.split(","))
total_batch_size = devices_num * cfg.TRAIN.im_per_batch
model = model_builder.FasterRCNN(
cfg=cfg,
add_conv_body_func=resnet.add_ResNet50_conv4_body,
add_roi_box_head_func=resnet.add_ResNet_roi_conv5_head,
use_pyreader=cfg.use_pyreader,
......@@ -58,18 +59,20 @@ def train(cfg):
rpn_reg_loss.persistable = True
loss = loss_cls + loss_bbox + rpn_cls_loss + rpn_reg_loss
boundaries = [120000, 160000]
values = [learning_rate, learning_rate * 0.1, learning_rate * 0.01]
boundaries = cfg.lr_steps
gamma = cfg.lr_gamma
step_num = len(cfg.lr_steps)
values = [learning_rate * (gamma**i) for i in range(step_num + 1)]
optimizer = fluid.optimizer.Momentum(
learning_rate=exponential_with_warmup_decay(
learning_rate=learning_rate,
boundaries=boundaries,
values=values,
warmup_iter=500,
warmup_factor=1.0 / 3.0),
regularization=fluid.regularizer.L2Decay(0.0001),
momentum=0.9)
warmup_iter=cfg.warm_up_iter,
warmup_factor=cfg.warm_up_factor),
regularization=fluid.regularizer.L2Decay(cfg.weight_decay),
momentum=cfg.momentum)
optimizer.minimize(loss)
fluid.memory_optimize(fluid.default_main_program())
......@@ -89,20 +92,16 @@ def train(cfg):
train_exe = fluid.ParallelExecutor(
use_cuda=bool(cfg.use_gpu), loss_name=loss.name)
assert cfg.batch_size % devices_num == 0
batch_size_per_dev = cfg.batch_size / devices_num
if cfg.use_pyreader:
train_reader = reader.train(
cfg,
batch_size=batch_size_per_dev,
total_batch_size=cfg.batch_size,
padding_total=cfg.padding_minibatch,
batch_size=cfg.TRAIN.im_per_batch,
total_batch_size=total_batch_size,
padding_total=cfg.TRAIN.padding_minibatch,
shuffle=True)
py_reader = model.py_reader
py_reader.decorate_paddle_reader(train_reader)
else:
train_reader = reader.train(
cfg, batch_size=cfg.batch_size, shuffle=True)
train_reader = reader.train(batch_size=total_batch_size, shuffle=True)
feeder = fluid.DataFeeder(place=place, feed_list=model.feeds())
def save_model(postfix):
......@@ -133,7 +132,7 @@ def train(cfg):
smoothed_loss.get_median_value(
), start_time - prev_start_time))
sys.stdout.flush()
if (iter_id + 1) % cfg.snapshot_stride == 0:
if (iter_id + 1) % cfg.TRAIN.snapshot_iter == 0:
save_model("model_iter{}".format(iter_id))
except fluid.core.EOFException:
py_reader.reset()
......@@ -159,7 +158,7 @@ def train(cfg):
iter_id, lr[0],
smoothed_loss.get_median_value(), start_time - prev_start_time))
sys.stdout.flush()
if (iter_id + 1) % cfg.snapshot_stride == 0:
if (iter_id + 1) % cfg.TRAIN.snapshot_iter == 0:
save_model("model_iter{}".format(iter_id))
if (iter_id + 1) == cfg.max_iter:
break
......@@ -175,6 +174,4 @@ def train(cfg):
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
data_args = reader.Settings(args)
train(data_args)
train()
......@@ -18,7 +18,7 @@ Contains common utility functions.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import distutils.util
import numpy as np
import six
......@@ -26,6 +26,7 @@ from collections import deque
from paddle.fluid import core
import argparse
import functools
from config import *
def print_arguments(args):
......@@ -96,31 +97,33 @@ def parse_args():
add_arg('model_save_dir', str, 'output', "The path to save model.")
add_arg('pretrained_model', str, 'imagenet_resnet50_fusebn', "The init model path.")
add_arg('dataset', str, 'coco2017', "coco2014, coco2017.")
add_arg('data_dir', str, 'data/COCO17', "The data root path.")
add_arg('class_num', int, 81, "Class number.")
add_arg('data_dir', str, 'data/COCO17', "The data root path.")
add_arg('use_pyreader', bool, True, "Use pyreader.")
add_arg('use_profile', bool, False, "Whether use profiler.")
add_arg('padding_minibatch',bool, False,
"If False, only resize image and not pad, image shape is different between"
" GPUs in one mini-batch. If True, image shape is the same in one mini-batch.")
#SOLVER
add_arg('learning_rate', float, 0.01, "Learning rate.")
add_arg('max_iter', int, 180000, "Iter number.")
add_arg('log_window', int, 1, "Log smooth window, set 1 for debug, set 20 for train.")
add_arg('snapshot_stride', int, 10000, "save model every snapshot stride.")
add_arg('log_window', int, 20, "Log smooth window, set 1 for debug, set 20 for train.")
# FAST RCNN
# RPN
add_arg('anchor_sizes', int, [32,64,128,256,512], "The size of anchors.")
add_arg('aspect_ratios', float, [0.5,1.0,2.0], "The ratio of anchors.")
add_arg('variance', float, [1.,1.,1.,1.], "The variance of anchors.")
add_arg('rpn_stride', float, 16., "Stride of the feature map that RPN is attached.")
# FAST RCNN
add_arg('rpn_stride', float, [16.,16.], "Stride of the feature map that RPN is attached.")
add_arg('rpn_nms_thresh', float, 0.7, "NMS threshold used on RPN proposals")
# TRAIN TEST INFER
add_arg('batch_size', int, 1, "Minibatch size.")
add_arg('im_per_batch', int, 1, "Minibatch size.")
add_arg('max_size', int, 1333, "The resized image height.")
add_arg('scales', int, [800], "The resized image height.")
add_arg('batch_size_per_im',int, 512, "fast rcnn head batch size")
add_arg('mean_value', float, [102.9801, 115.9465, 122.7717], "pixel mean")
add_arg('nms_threshold', float, 0.5, "NMS threshold.")
add_arg('score_threshold', float, 0.05, "score threshold for NMS.")
add_arg('pixel_means', float, [102.9801, 115.9465, 122.7717], "pixel mean")
add_arg('nms_thresh', float, 0.5, "NMS threshold.")
add_arg('score_thresh', float, 0.05, "score threshold for NMS.")
add_arg('snapshot_stride', int, 10000, "save model every snapshot stride.")
add_arg('debug', bool, False, "Debug mode")
# SINGLE EVAL AND DRAW
add_arg('draw_threshold', float, 0.8, "Confidence threshold to draw bbox.")
......@@ -128,4 +131,9 @@ def parse_args():
add_arg('image_name', str, '', "The single image used to inference and visualize.")
# yapf: enable
args = parser.parse_args()
file_name = sys.argv[0]
if 'train' in file_name or 'profile' in file_name:
merge_cfg_from_args(args, 'train')
else:
merge_cfg_from_args(args, 'test')
return args
......@@ -163,8 +163,9 @@ def train(args):
total_images = np.concatenate([real_image, generated_images])
fig = plot(total_images)
msg = "Epoch ID={0}\n Batch ID={1}\n D-Loss={2}\n DG-Loss={3}\n gen={4}".format(
pass_id, batch_id, np.mean(d_loss_np), dg_loss_np,
check(generated_images))
pass_id, batch_id,
np.sum(d_loss_np),
np.sum(dg_loss_np), check(generated_images))
print(msg)
plt.title(msg)
plt.savefig(
......
......@@ -150,7 +150,8 @@ def train(args):
fig = plot(total_images)
msg = "Epoch ID={0} Batch ID={1} D-Loss={2} DG-Loss={3}\n gen={4}".format(
pass_id, batch_id,
np.mean(d_loss_np), dg_loss_np, check(generated_images))
np.sum(d_loss_np),
np.sum(dg_loss_np), check(generated_images))
print(msg)
plt.title(msg)
plt.savefig(
......
......@@ -101,7 +101,7 @@ def D_cond(image, y):
h2 = bn(fc(h1, dfc_dim), act='leaky_relu')
h2 = fluid.layers.concat([h2, y], 1)
h3 = fc(h2, 1)
h3 = fc(h2, 1, act='sigmoid')
return h3
......@@ -131,7 +131,7 @@ def D(x):
x = conv(x, df_dim, act='leaky_relu')
x = bn(conv(x, df_dim * 2), act='leaky_relu')
x = bn(fc(x, dfc_dim), act='leaky_relu')
x = fc(x, 1, act=None)
x = fc(x, 1, act='sigmoid')
return x
......
......@@ -17,6 +17,7 @@ from paddle.fluid.initializer import init_on_cpu
if 'ce_mode' in os.environ:
np.random.seed(10)
fluid.default_startup_program().random_seed = 90
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
......@@ -91,9 +92,6 @@ def train(args):
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
if 'ce_mode' in os.environ:
fluid.default_startup_program().random_seed = 90
exe.run(fluid.default_startup_program())
if args.init_model is not None:
......@@ -126,8 +124,9 @@ def train(args):
sub124_loss += results[3]
# training log
if iter_id % LOG_PERIOD == 0:
print("Iter[%d]; train loss: %.3f; sub4_loss: %.3f; sub24_loss: %.3f; sub124_loss: %.3f" % (
iter_id, t_loss / LOG_PERIOD, sub4_loss / LOG_PERIOD,
print(
"Iter[%d]; train loss: %.3f; sub4_loss: %.3f; sub24_loss: %.3f; sub124_loss: %.3f"
% (iter_id, t_loss / LOG_PERIOD, sub4_loss / LOG_PERIOD,
sub24_loss / LOG_PERIOD, sub124_loss / LOG_PERIOD))
print("kpis train_cost %f" % (t_loss / LOG_PERIOD))
......
......@@ -14,7 +14,7 @@
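## Installation
Running the sample code in this directory requires PaddlePaddle Fluid v0.13.0 or later. If the PaddlePaddle version in your environment is older, update it by following the [installation guide](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html).
Running the sample code in this directory requires PaddlePaddle Fluid v0.13.0 or later. If the PaddlePaddle version in your environment is older, update it by following the installation guide.
## Data preparation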
......
......@@ -19,7 +19,7 @@ test_acc_top1_kpi = AccKpi(
test_acc_top5_kpi = AccKpi(
'test_acc_top5', 0.02, 0, actived=True, desc='TOP5 ACC')
test_cost_kpi = CostKpi('test_cost', 0.02, 0, actived=True, desc='train cost')
train_speed_kpi = AccKpi(
train_speed_kpi = DurationKpi(
'train_speed',
0.05,
0,
......@@ -38,7 +38,7 @@ test_acc_top5_card4_kpi = AccKpi(
'test_acc_top5_card4', 0.02, 0, actived=True, desc='TOP5 ACC')
test_cost_card4_kpi = CostKpi(
'test_cost_card4', 0.02, 0, actived=True, desc='train cost')
train_speed_card4_kpi = AccKpi(
train_speed_card4_kpi = DurationKpi(
'train_speed_card4',
0.05,
0,
......
......@@ -19,7 +19,7 @@ This tool is used to convert a Caffe model to a Fluid model
- Download one from github directly
```
cd proto/ && wget https://github.com/ethereon/caffe-tensorflow/blob/master/kaffe/caffe/caffepb.py
cd proto/ && wget https://raw.githubusercontent.com/ethereon/caffe-tensorflow/master/kaffe/caffe/caffepb.py
```
2. Convert the Caffe model to Fluid model
......
......@@ -21,6 +21,11 @@ def parse_args():
action='store_true',
help='If set, run \
the task with continuous evaluation logs.')
parser.add_argument(
'--num_devices',
type=int,
default=1,
help='Number of GPU devices')
args = parser.parse_args()
return args
......@@ -165,13 +170,13 @@ def train(train_reader,
print("finish training")
def get_cards(enable_ce):
if enable_ce:
def get_cards(args):
if args.enable_ce:
cards = os.environ.get('CUDA_VISIBLE_DEVICES')
num = len(cards.split(","))
return num
else:
return fluid.core.get_cuda_device_count()
return args.num_devices
def train_net():
......@@ -179,7 +184,7 @@ def train_net():
batch_size = 20
args = parse_args()
vocab, train_reader, test_reader = utils.prepare_data(
batch_size=batch_size * get_cards(args.enable_ce), buffer_size=1000, \
batch_size=batch_size * get_cards(args), buffer_size=1000, \
word_freq_threshold=0, enable_ce = args.enable_ce)
train(
train_reader=train_reader,
......
# Abstract
DuReader is an end-to-end neural network model for machine reading comprehension style question answering, which aims to answer questions from given passages. We first match the question and passages with a bidirectional attention flow network to obtain the question-aware passage representation. Then we employ a pointer network to locate the positions of answers in the passages. Our experimental evaluations show that the DuReader model achieves state-of-the-art results on the DuReader dataset.
# Dataset
DuReader Dataset is a new large-scale, real-world, human-sourced MRC dataset in Chinese. DuReader focuses on real-world open-domain question answering. The advantages of DuReader over existing datasets are summarized as follows:
- Real question
- Real article
- Real answer
- Real application scenario
- Rich annotation
# Network
DuReader model is inspired by three classic reading comprehension models ([BiDAF](https://arxiv.org/abs/1611.01603), [Match-LSTM](https://arxiv.org/abs/1608.07905), [R-NET](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf)).
The DuReader model is a hierarchical multi-stage process and consists of five layers:
- **Word Embedding Layer** maps each word to a vector using a pre-trained word embedding model.
- **Encoding Layer** extracts context information for each position in question and passages with a bi-directional LSTM network.
- **Attention Flow Layer** couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context. Please refer to [BiDAF](https://arxiv.org/abs/1611.01603) for more details.
- **Fusion Layer** employs a layer of bi-directional LSTM to capture the interaction among context words independent of the query.
- **Decode Layer** employs an answer pointer network with attention pooling of the question to locate the positions of answers in the passages. Please refer to [Match-LSTM](https://arxiv.org/abs/1608.07905) and [R-NET](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf) for more details.
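
As a rough illustration of what the Attention Flow Layer computes, the sketch below builds query-aware passage features with NumPy. It is not the Fluid implementation: the function name `attention_flow`, the plain dot-product similarity, and the toy shapes are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def attention_flow(passage, question):
    """Hypothetical helper: fuse passage (T, d) and question (J, d) encodings.

    Returns a (T, 4 * d) matrix of query-aware passage features.
    """
    # Similarity between every passage word and every question word.
    sim = passage @ question.T                      # (T, J)
    # Context-to-query: each passage word attends over the question words.
    c2q = softmax(sim, axis=1) @ question           # (T, d)
    # Query-to-context: weight passage words by their best question match.
    weights = softmax(sim.max(axis=1), axis=0)      # (T,)
    q2c = np.tile(weights @ passage, (passage.shape[0], 1))  # (T, d)
    # Concatenate the original encoding with the attended views.
    return np.concatenate(
        [passage, c2q, passage * c2q, passage * q2c], axis=1)


rng = np.random.default_rng(0)
features = attention_flow(rng.normal(size=(6, 4)), rng.normal(size=(3, 4)))
print(features.shape)
```

Each passage position ends up with four concatenated views of size `d`, which is why a later layer sees an input four times as wide as the encoder output.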
## How to Run
### Download the Dataset
To download the DuReader dataset:
```
cd data && bash download.sh
```
For more details about the DuReader dataset, please refer to the [DuReader Dataset Homepage](https://ai.baidu.com//broad/subordinate?dataset=dureader).
### Download Third-Party Dependencies
We use Bleu and Rouge as evaluation metrics. Their calculation relies on the scoring scripts under [coco-caption](https://github.com/tylin/coco-caption). To download them, run:
```
cd utils && bash download_thirdparty.sh
```
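
To give a feel for what a Rouge score measures, here is a minimal sketch of ROUGE-L computed from the longest common subsequence of two token lists. It is an illustration only, not the coco-caption scoring script; the helper names and the `beta=1.2` weighting are assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """Hypothetical helper: ROUGE-L F-score between two token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = float(lcs) / len(candidate)
    recall = float(lcs) / len(reference)
    return ((1 + beta ** 2) * precision * recall /
            (recall + beta ** 2 * precision))


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(lcs_length(candidate, reference))          # 5
print(round(rouge_l(candidate, reference), 3))   # 0.833
```

Because precision and recall are equal in this toy pair, the F-score equals both of them regardless of `beta`.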
### Environment Requirements
So far we have only tested on PaddlePaddle v1.0. To install PaddlePaddle, and for more details about it, see the [PaddlePaddle Homepage](http://paddlepaddle.org).
### Preparation
Before training the model, we have to make sure that the data is ready. For preparation, we will check the data files, make directories and extract a vocabulary for later use. You can run the following command to do this with a specified task name:
```
sh run.sh --prepare
```
You can specify the files for train/dev/test by setting the `trainset`/`devset`/`testset`.
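
The vocabulary extraction mentioned above can be pictured with the short sketch below. It is not the code that `run.sh --prepare` executes: `build_vocab`, the special tokens `<pad>` and `<unk>`, and the `min_count` filter are hypothetical names chosen for this example.

```python
from collections import Counter


def build_vocab(token_iter, min_count=1, specials=("<pad>", "<unk>")):
    """Hypothetical helper: map tokens to integer ids, most frequent first."""
    counts = Counter(token_iter)
    # Reserve the lowest ids for the special tokens.
    token2id = {tok: idx for idx, tok in enumerate(specials)}
    for tok, cnt in counts.most_common():
        if cnt >= min_count and tok not in token2id:
            token2id[tok] = len(token2id)
    return token2id


def convert_to_ids(tokens, token2id, unk="<unk>"):
    # Unknown tokens fall back to the id of the <unk> entry.
    return [token2id.get(tok, token2id[unk]) for tok in tokens]


samples = [["what", "is", "paddle"], ["paddle", "is", "a", "framework"]]
vocab = build_vocab(tok for sample in samples for tok in sample)
print(convert_to_ids(["paddle", "is", "fast"], vocab))
```

Keeping `<pad>` at a fixed id makes it easy to pass the same value as the padding id when batches are built.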
### Training
To train the model, run `run.sh` with `--train`. You can also set hyper-parameters such as the learning rate with `--learning_rate NUM`. For example, to train the model for 10 passes, you can run:
```
sh run.sh --train --pass_num 10
```
The training process includes an evaluation on the dev set after each training epoch. By default, the model with the least Bleu-4 score on the dev set will be saved.
### Evaluation
To conduct a single evaluation on the dev set with the model already trained, you can run the following command:
```
sh run.sh --evaluate --load_dir models/1
```
### Prediction
You can also predict answers for the samples in some files using the following command:
```
sh run.sh --predict --load_dir models/1 --testset ../data/preprocessed/testset/search.dev.json
```
By default, the results are saved in the `../data/results/` folder. You can change this by specifying `--result_dir DIR_PATH`.
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import distutils.util
def parse_args():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        '--prepare',
        action='store_true',
        help='create the directories, prepare the vocabulary and embeddings')
    parser.add_argument('--train', action='store_true', help='train the model')
    parser.add_argument(
        '--evaluate', action='store_true', help='evaluate the model on dev set')
    parser.add_argument(
        '--predict',
        action='store_true',
        help='predict the answers for test set with trained model')
    parser.add_argument(
        "--embed_size",
        type=int,
        default=300,
        help="The dimension of embedding table. (default: %(default)d)")
    parser.add_argument(
        "--hidden_size",
        type=int,
        default=300,
        help="The size of rnn hidden unit. (default: %(default)d)")
    parser.add_argument(
        "--batch_size",
        type=int,
        default=32,
        help="The sequence number of a mini-batch data. (default: %(default)d)")
    parser.add_argument(
        "--pass_num",
        type=int,
        default=5,
        help="The pass number to train. (default: %(default)d)")
    parser.add_argument(
        "--learning_rate",
        type=float,
        default=0.001,
        help="Learning rate used to train the model. (default: %(default)f)")
    parser.add_argument(
        "--weight_decay",
        type=float,
        default=0.0001,
        help="Weight decay. (default: %(default)f)")
    parser.add_argument(
        "--use_gpu",
        type=distutils.util.strtobool,
        default=True,
        help="Whether to use gpu. (default: %(default)d)")
    parser.add_argument(
        "--save_dir",
        type=str,
        default="model",
        help="Specify the path to save trained models.")
    parser.add_argument(
        "--load_dir",
        type=str,
        default="",
        help="Specify the path to load trained models.")
    parser.add_argument(
        "--save_interval",
        type=int,
        default=1,
        help="Save the trained model every n passes."
        "(default: %(default)d)")
    parser.add_argument(
        "--log_interval",
        type=int,
        default=50,
        help="log the train loss every n batches."
        "(default: %(default)d)")
    parser.add_argument(
        "--dev_interval",
        type=int,
        default=1000,
        help="cal dev loss every n batches."
        "(default: %(default)d)")
    parser.add_argument('--optim', default='adam', help='optimizer type')
    parser.add_argument('--trainset', nargs='+', help='train dataset')
    parser.add_argument('--devset', nargs='+', help='dev dataset')
    parser.add_argument('--testset', nargs='+', help='test dataset')
    parser.add_argument('--vocab_dir', help='dict')
    parser.add_argument('--max_p_num', type=int, default=5)
    parser.add_argument('--max_a_len', type=int, default=200)
    parser.add_argument('--max_p_len', type=int, default=500)
    parser.add_argument('--max_q_len', type=int, default=9)
    parser.add_argument('--doc_num', type=int, default=5)
    parser.add_argument('--para_print', action='store_true')
    parser.add_argument('--drop_rate', type=float, default=0.0)
    parser.add_argument('--random_seed', type=int, default=123)
    parser.add_argument(
        '--log_path',
        help='path of the log file. If not set, logs are printed to console')
    parser.add_argument(
        '--result_dir',
        default='../data/results/',
        help='the dir to output the results')
    parser.add_argument(
        '--result_name',
        default='test_result',
        help='the file name of the results')
    args = parser.parse_args()
    return args
#!/bin/bash
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
if [[ -d preprocessed ]] && [[ -d raw ]]; then
    echo "data exist"
    exit 0
else
    wget -c --no-check-certificate http://dureader.gz.bcebos.com/dureader_preprocessed.zip
fi
if md5sum --status -c md5sum.txt; then
    unzip dureader_preprocessed.zip
else
    echo "download data error!" >> /dev/stderr
    exit 1
fi
7a4c28026f7dc94e8135d17203c63664 dureader_preprocessed.zip
# -*- coding:utf8 -*-
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
This module implements data process strategies.
"""
import os
import json
import logging
import numpy as np
from collections import Counter
class BRCDataset(object):
    """
    This module implements the APIs for loading and using baidu reading comprehension dataset
    """
    def __init__(self,
                 max_p_num,
                 max_p_len,
                 max_q_len,
                 train_files=[],
                 dev_files=[],
                 test_files=[]):
        self.logger = logging.getLogger("brc")
        self.max_p_num = max_p_num
        self.max_p_len = max_p_len
        self.max_q_len = max_q_len
        self.train_set, self.dev_set, self.test_set = [], [], []
        if train_files:
            for train_file in train_files:
                self.train_set += self._load_dataset(train_file, train=True)
            self.logger.info('Train set size: {} questions.'.format(
                len(self.train_set)))
        if dev_files:
            for dev_file in dev_files:
                self.dev_set += self._load_dataset(dev_file)
            self.logger.info('Dev set size: {} questions.'.format(
                len(self.dev_set)))
        if test_files:
            for test_file in test_files:
                self.test_set += self._load_dataset(test_file)
            self.logger.info('Test set size: {} questions.'.format(
                len(self.test_set)))
    def _load_dataset(self, data_path, train=False):
        """
        Loads the dataset
        Args:
            data_path: the data file to load
        """
        with open(data_path) as fin:
            data_set = []
            for lidx, line in enumerate(fin):
                sample = json.loads(line.strip())
                if train:
                    if len(sample['answer_spans']) == 0:
                        continue
                    if sample['answer_spans'][0][1] >= self.max_p_len:
                        continue
                if 'answer_docs' in sample:
                    sample['answer_passages'] = sample['answer_docs']
                sample['question_tokens'] = sample['segmented_question']
                sample['passages'] = []
                for d_idx, doc in enumerate(sample['documents']):
                    if train:
                        most_related_para = doc['most_related_para']
                        sample['passages'].append({
                            'passage_tokens':
                            doc['segmented_paragraphs'][most_related_para],
                            'is_selected': doc['is_selected']
                        })
                    else:
                        para_infos = []
                        for para_tokens in doc['segmented_paragraphs']:
                            question_tokens = sample['segmented_question']
                            common_with_question = Counter(
                                para_tokens) & Counter(question_tokens)
                            correct_preds = sum(common_with_question.values())
                            if correct_preds == 0:
                                recall_wrt_question = 0
                            else:
                                recall_wrt_question = float(
                                    correct_preds) / len(question_tokens)
                            para_infos.append((para_tokens, recall_wrt_question,
                                               len(para_tokens)))
                        para_infos.sort(key=lambda x: (-x[1], x[2]))
                        fake_passage_tokens = []
                        for para_info in para_infos[:1]:
                            fake_passage_tokens += para_info[0]
                        sample['passages'].append({
                            'passage_tokens': fake_passage_tokens
                        })
                data_set.append(sample)
        return data_set
    def _one_mini_batch(self, data, indices, pad_id):
        """
        Get one mini batch
        Args:
            data: all data
            indices: the indices of the samples to be selected
            pad_id:
        Returns:
            one batch of data
        """
        batch_data = {
            'raw_data': [data[i] for i in indices],
            'question_token_ids': [],
            'question_length': [],
            'passage_token_ids': [],
            'passage_length': [],
            'start_id': [],
            'end_id': []
        }
        max_passage_num = max(
            [len(sample['passages']) for sample in batch_data['raw_data']])
        #max_passage_num = min(self.max_p_num, max_passage_num)
        max_passage_num = self.max_p_num
        for sidx, sample in enumerate(batch_data['raw_data']):
            for pidx in range(max_passage_num):
                if pidx < len(sample['passages']):
                    batch_data['question_token_ids'].append(sample[
                        'question_token_ids'])
                    batch_data['question_length'].append(
                        len(sample['question_token_ids']))
                    passage_token_ids = sample['passages'][pidx][
                        'passage_token_ids']
                    batch_data['passage_token_ids'].append(passage_token_ids)
                    batch_data['passage_length'].append(
                        min(len(passage_token_ids), self.max_p_len))
                else:
                    batch_data['question_token_ids'].append([])
                    batch_data['question_length'].append(0)
                    batch_data['passage_token_ids'].append([])
                    batch_data['passage_length'].append(0)
        batch_data, padded_p_len, padded_q_len = self._dynamic_padding(
            batch_data, pad_id)
        for sample in batch_data['raw_data']:
            if 'answer_passages' in sample and len(sample['answer_passages']):
                gold_passage_offset = padded_p_len * sample['answer_passages'][
                    0]
                batch_data['start_id'].append(gold_passage_offset + sample[
                    'answer_spans'][0][0])
                batch_data['end_id'].append(gold_passage_offset + sample[
                    'answer_spans'][0][1])
            else:
                # fake span for some samples, only valid for testing
                batch_data['start_id'].append(0)
                batch_data['end_id'].append(0)
        return batch_data
    def _dynamic_padding(self, batch_data, pad_id):
        """
        Dynamically pads the batch_data with pad_id
        """
        pad_p_len = min(self.max_p_len, max(batch_data['passage_length']))
        pad_q_len = min(self.max_q_len, max(batch_data['question_length']))
        batch_data['passage_token_ids'] = [
            (ids + [pad_id] * (pad_p_len - len(ids)))[:pad_p_len]
            for ids in batch_data['passage_token_ids']
        ]
        batch_data['question_token_ids'] = [
            (ids + [pad_id] * (pad_q_len - len(ids)))[:pad_q_len]
            for ids in batch_data['question_token_ids']
        ]
        return batch_data, pad_p_len, pad_q_len
    def word_iter(self, set_name=None):
        """
        Iterates over all the words in the dataset
        Args:
            set_name: if it is set, then the specific set will be used
        Returns:
            a generator
        """
        if set_name is None:
            data_set = self.train_set + self.dev_set + self.test_set
        elif set_name == 'train':
            data_set = self.train_set
        elif set_name == 'dev':
            data_set = self.dev_set
        elif set_name == 'test':
            data_set = self.test_set
        else:
            raise NotImplementedError('No data set named as {}'.format(
                set_name))
        if data_set is not None:
            for sample in data_set:
                for token in sample['question_tokens']:
                    yield token
                for passage in sample['passages']:
                    for token in passage['passage_tokens']:
                        yield token
    def convert_to_ids(self, vocab):
        """
        Convert the question and passage in the original dataset to ids
        Args:
            vocab: the vocabulary on this dataset
        """
        for data_set in [self.train_set, self.dev_set, self.test_set]:
            if data_set is None:
                continue
            for sample in data_set:
                sample['question_token_ids'] = vocab.convert_to_ids(sample[
                    'question_tokens'])
                for passage in sample['passages']:
                    passage['passage_token_ids'] = vocab.convert_to_ids(passage[
                        'passage_tokens'])
    def gen_mini_batches(self, set_name, batch_size, pad_id, shuffle=True):
        """
        Generate data batches for a specific dataset (train/dev/test)
        Args:
            set_name: train/dev/test to indicate the set
            batch_size: number of samples in one batch
            pad_id: pad id
            shuffle: if set to be true, the data is shuffled.
        Returns:
            a generator for all batches
        """
        if set_name == 'train':
            data = self.train_set
        elif set_name == 'dev':
            data = self.dev_set
        elif set_name == 'test':
            data = self.test_set
        else:
            raise NotImplementedError('No data set named as {}'.format(
                set_name))
        data_size = len(data)
        indices = np.arange(data_size)
        if shuffle:
            np.random.shuffle(indices)
        for batch_start in np.arange(0, data_size, batch_size):
            batch_indices = indices[batch_start:batch_start + batch_size]
            yield self._one_mini_batch(data, batch_indices, pad_id)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.fluid.layers as layers
import paddle.fluid as fluid
import numpy as np
def dropout(input, args):
    if args.drop_rate:
        return layers.dropout(
            input,
            dropout_prob=args.drop_rate,
            seed=args.random_seed,
            is_test=False)
    else:
        return input
def bi_lstm_encoder(input_seq, gate_size, para_name, args):
    # A bi-directional lstm encoder implementation.
    # Linear transformation part for input gate, output gate, forget gate
    # and cell activation vectors need be done outside of dynamic_lstm.
    # So the output size is 4 times of gate_size.
    input_forward_proj = layers.fc(
        input=input_seq,
        param_attr=fluid.ParamAttr(name=para_name + '_fw_gate_w'),
        size=gate_size * 4,
        act=None,
        bias_attr=False)
    input_reversed_proj = layers.fc(
        input=input_seq,
        param_attr=fluid.ParamAttr(name=para_name + '_bw_gate_w'),
        size=gate_size * 4,
        act=None,
        bias_attr=False)
    forward, _ = layers.dynamic_lstm(
        input=input_forward_proj,
        size=gate_size * 4,
        use_peepholes=False,
        param_attr=fluid.ParamAttr(name=para_name + '_fw_lstm_w'),
        bias_attr=fluid.ParamAttr(name=para_name + '_fw_lstm_b'))
    reversed, _ = layers.dynamic_lstm(
        input=input_reversed_proj,
        param_attr=fluid.ParamAttr(name=para_name + '_bw_lstm_w'),
        bias_attr=fluid.ParamAttr(name=para_name + '_bw_lstm_b'),
        size=gate_size * 4,
        is_reverse=True,
        use_peepholes=False)
    encoder_out = layers.concat(input=[forward, reversed], axis=1)
    return encoder_out
def encoder(input_name, para_name, shape, hidden_size, args):
    input_ids = layers.data(
        name=input_name, shape=[1], dtype='int64', lod_level=1)
    input_embedding = layers.embedding(
        input=input_ids,
        size=shape,
        dtype='float32',
        is_sparse=True,
        param_attr=fluid.ParamAttr(name='embedding_para'))
    encoder_out = bi_lstm_encoder(
        input_seq=input_embedding,
        gate_size=hidden_size,
        para_name=para_name,
        args=args)
    return dropout(encoder_out, args)
def attn_flow(q_enc, p_enc, p_ids_name, args):
    tag = p_ids_name + "::"
    drnn = layers.DynamicRNN()
    with drnn.block():
        h_cur = drnn.step_input(p_enc)
        u_all = drnn.static_input(q_enc)
        h_expd = layers.sequence_expand(x=h_cur, y=u_all)
        s_t_mul = layers.elementwise_mul(x=u_all, y=h_expd, axis=0)
        s_t_sum = layers.reduce_sum(input=s_t_mul, dim=1, keep_dim=True)
        s_t_re = layers.reshape(s_t_sum, shape=[-1, 0])
        s_t = layers.sequence_softmax(input=s_t_re)
        u_expr = layers.elementwise_mul(x=u_all, y=s_t, axis=0)
        u_expr = layers.sequence_pool(input=u_expr, pool_type='sum')
        b_t = layers.sequence_pool(input=s_t_sum, pool_type='max')
        drnn.output(u_expr, b_t)
    U_expr, b = drnn()
    b_norm = layers.sequence_softmax(input=b)
    h_expr = layers.elementwise_mul(x=p_enc, y=b_norm, axis=0)
    h_expr = layers.sequence_pool(input=h_expr, pool_type='sum')
    H_expr = layers.sequence_expand(x=h_expr, y=p_enc)
    H_expr = layers.lod_reset(x=H_expr, y=p_enc)
    h_u = layers.elementwise_mul(x=p_enc, y=U_expr, axis=0)
    h_h = layers.elementwise_mul(x=p_enc, y=H_expr, axis=0)
    g = layers.concat(input=[p_enc, U_expr, h_u, h_h], axis=1)
    return dropout(g, args)
def lstm_step(x_t, hidden_t_prev, cell_t_prev, size, para_name, args):
    def linear(inputs, para_name, args):
        return layers.fc(input=inputs,
                         size=size,
                         param_attr=fluid.ParamAttr(name=para_name + '_w'),
                         bias_attr=fluid.ParamAttr(name=para_name + '_b'))
    input_cat = layers.concat([hidden_t_prev, x_t], axis=1)
    forget_gate = layers.sigmoid(x=linear(input_cat, para_name + '_lstm_f',
                                          args))
    input_gate = layers.sigmoid(x=linear(input_cat, para_name + '_lstm_i',
                                         args))
    output_gate = layers.sigmoid(x=linear(input_cat, para_name + '_lstm_o',
                                          args))
    cell_tilde = layers.tanh(x=linear(input_cat, para_name + '_lstm_c', args))
    cell_t = layers.sums(input=[
        layers.elementwise_mul(
            x=forget_gate, y=cell_t_prev), layers.elementwise_mul(
                x=input_gate, y=cell_tilde)
    ])
    hidden_t = layers.elementwise_mul(x=output_gate, y=layers.tanh(x=cell_t))
    return hidden_t, cell_t
#point network
def point_network_decoder(p_vec, q_vec, hidden_size, args):
tag = 'pn_decoder:'
init_random = fluid.initializer.Normal(loc=0.0, scale=1.0)
random_attn = layers.create_parameter(
shape=[1, hidden_size],
dtype='float32',
default_initializer=init_random)
random_attn = layers.fc(
input=random_attn,
size=hidden_size,
act=None,
param_attr=fluid.ParamAttr(name=tag + 'random_attn_fc_w'),
bias_attr=fluid.ParamAttr(name=tag + 'random_attn_fc_b'))
random_attn = layers.reshape(random_attn, shape=[-1])
U = layers.fc(input=q_vec,
param_attr=fluid.ParamAttr(name=tag + 'q_vec_fc_w'),
bias_attr=False,
size=hidden_size,
act=None) + random_attn
U = layers.tanh(U)
logits = layers.fc(input=U,
param_attr=fluid.ParamAttr(name=tag + 'logits_fc_w'),
bias_attr=fluid.ParamAttr(name=tag + 'logits_fc_b'),
size=1,
act=None)
scores = layers.sequence_softmax(input=logits)
pooled_vec = layers.elementwise_mul(x=q_vec, y=scores, axis=0)
pooled_vec = layers.sequence_pool(input=pooled_vec, pool_type='sum')
init_state = layers.fc(
input=pooled_vec,
param_attr=fluid.ParamAttr(name=tag + 'init_state_fc_w'),
bias_attr=fluid.ParamAttr(name=tag + 'init_state_fc_b'),
size=hidden_size,
act=None)
def custom_dynamic_rnn(p_vec, init_state, hidden_size, para_name, args):
tag = para_name + "custom_dynamic_rnn:"
def static_rnn(step,
p_vec=p_vec,
init_state=None,
para_name='',
args=args):
tag = para_name + "static_rnn:"
ctx = layers.fc(
input=p_vec,
param_attr=fluid.ParamAttr(name=tag + 'context_fc_w'),
bias_attr=fluid.ParamAttr(name=tag + 'context_fc_b'),
size=hidden_size,
act=None)
beta = []
c_prev = init_state
m_prev = init_state
for i in range(step):
m_prev0 = layers.fc(
input=m_prev,
size=hidden_size,
act=None,
param_attr=fluid.ParamAttr(name=tag + 'm_prev0_fc_w'),
bias_attr=fluid.ParamAttr(name=tag + 'm_prev0_fc_b'))
m_prev1 = layers.sequence_expand(x=m_prev0, y=ctx)
Fk = ctx + m_prev1
Fk = layers.tanh(Fk)
logits = layers.fc(
input=Fk,
size=1,
act=None,
param_attr=fluid.ParamAttr(name=tag + 'logits_fc_w'),
bias_attr=fluid.ParamAttr(name=tag + 'logits_fc_b'))
scores = layers.sequence_softmax(input=logits)
attn_ctx = layers.elementwise_mul(x=p_vec, y=scores, axis=0)
attn_ctx = layers.sequence_pool(input=attn_ctx, pool_type='sum')
hidden_t, cell_t = lstm_step(
attn_ctx,
hidden_t_prev=m_prev,
cell_t_prev=c_prev,
size=hidden_size,
para_name=tag,
args=args)
m_prev = hidden_t
c_prev = cell_t
beta.append(scores)
return beta
return static_rnn(
2, p_vec=p_vec, init_state=init_state, para_name=para_name)
fw_outputs = custom_dynamic_rnn(p_vec, init_state, hidden_size, tag + "fw:",
args)
bw_outputs = custom_dynamic_rnn(p_vec, init_state, hidden_size, tag + "bw:",
args)
start_prob = layers.elementwise_add(
x=fw_outputs[0], y=bw_outputs[1], axis=0) / 2
end_prob = layers.elementwise_add(
x=fw_outputs[1], y=bw_outputs[0], axis=0) / 2
return start_prob, end_prob
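# The decoder above repeatedly turns per-token logits into a distribution with
# sequence_softmax, then pools token vectors with elementwise_mul followed by a
# 'sum' sequence_pool. A minimal plain-Python sketch of that pattern for one
# sequence follows; softmax and attention_pool are hypothetical names and the
# vectors are plain lists, so this only illustrates the arithmetic.

```python
import math


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, logits):
    # Weight every token vector by its softmax score and sum over tokens.
    weights = softmax(logits)
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(dim)]


pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

# Equal logits give equal weights, so the pooled vector is the plain average.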
def fusion(g, args):
m = bi_lstm_encoder(
input_seq=g, gate_size=args.hidden_size, para_name='fusion', args=args)
return dropout(m, args)
def rc_model(hidden_size, vocab, args):
emb_shape = [vocab.size(), vocab.embed_dim]
# stage 1:encode
p_ids_names = []
q_ids_names = []
ms = []
gs = []
qs = []
for i in range(args.doc_num):
p_ids_name = "pids_%d" % i
p_ids_names.append(p_ids_name)
p_enc_i = encoder(p_ids_name, 'p_enc', emb_shape, hidden_size, args)
q_ids_name = "qids_%d" % i
q_ids_names.append(q_ids_name)
q_enc_i = encoder(q_ids_name, 'q_enc', emb_shape, hidden_size, args)
# stage 2:match
g_i = attn_flow(q_enc_i, p_enc_i, p_ids_name, args)
# stage 3:fusion
m_i = fusion(g_i, args)
ms.append(m_i)
gs.append(g_i)
qs.append(q_enc_i)
m = layers.sequence_concat(input=ms)
g = layers.sequence_concat(input=gs)
q_vec = layers.sequence_concat(input=qs)
# stage 4:decode
start_probs, end_probs = point_network_decoder(
p_vec=m, q_vec=q_vec, hidden_size=hidden_size, args=args)
start_labels = layers.data(
name="start_lables", shape=[1], dtype='float32', lod_level=1)
end_labels = layers.data(
name="end_lables", shape=[1], dtype='float32', lod_level=1)
cost0 = layers.sequence_pool(
layers.cross_entropy(
input=start_probs, label=start_labels, soft_label=True),
'sum')
cost1 = layers.sequence_pool(
layers.cross_entropy(
input=end_probs, label=end_labels, soft_label=True),
'sum')
cost0 = layers.mean(cost0)
cost1 = layers.mean(cost1)
cost = cost0 + cost1
cost.persistable = True
feeding_list = q_ids_names + ["start_lables", "end_lables"] + p_ids_names
return cost, start_probs, end_probs, feeding_list
export CUDA_VISIBLE_DEVICES=1
python run.py \
--trainset 'data/preprocessed/trainset/search.train.json' \
'data/preprocessed/trainset/zhidao.train.json' \
--devset 'data/preprocessed/devset/search.dev.json' \
'data/preprocessed/devset/zhidao.dev.json' \
--testset 'data/preprocessed/testset/search.test.json' \
'data/preprocessed/testset/zhidao.test.json' \
--vocab_dir 'data/vocab' \
--use_gpu true \
--save_dir ./models \
--pass_num 10 \
--learning_rate 0.001 \
--batch_size 8 \
--embed_size 300 \
--hidden_size 150 \
--max_p_num 5 \
--max_p_len 500 \
--max_q_len 60 \
--max_a_len 200 \
--drop_rate 0.2 "$@"
# coding:utf8
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
This package implements utility functions shared by the PaddlePaddle
and TensorFlow model implementations.
Authors: liuyuan(liuyuan04@baidu.com)
Date: 2017/10/06 18:23:06
"""
from .dureader_eval import compute_bleu_rouge
from .dureader_eval import normalize
from .preprocess import find_fake_answer
from .preprocess import find_best_question_match
__all__ = [
'compute_bleu_rouge',
'normalize',
'find_fake_answer',
'find_best_question_match',
]
#!/bin/bash
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
# We use Bleu and Rouge as evaluation metrics, the calculation of these metrics
# relies on the scoring scripts under "https://github.com/tylin/coco-caption"
bleu_base_url='https://raw.githubusercontent.com/tylin/coco-caption/master/pycocoevalcap/bleu'
bleu_files=("LICENSE" "__init__.py" "bleu.py" "bleu_scorer.py")
rouge_base_url="https://raw.githubusercontent.com/tylin/coco-caption/master/pycocoevalcap/rouge"
rouge_files=("__init__.py" "rouge.py")
download() {
local metric=$1; shift;
local base_url=$1; shift;
local fnames=("$@");
mkdir -p "${metric}"
for fname in "${fnames[@]}";
do
printf "downloading: %s\n" "${base_url}/${fname}"
wget --no-check-certificate "${base_url}/${fname}" -O "${metric}/${fname}"
done
}
# prepare rouge
download "rouge_metric" ${rouge_base_url} ${rouge_files[@]}
# prepare bleu
download "bleu_metric" ${bleu_base_url} ${bleu_files[@]}
# convert python 2.x source code to python 3.x
2to3 -w "../utils/bleu_metric/bleu_scorer.py"
2to3 -w "../utils/bleu_metric/bleu.py"
# -*- coding:utf8 -*-
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
This module computes evaluation metrics for DuReader dataset.
"""
import argparse
import json
import sys
import zipfile
from collections import Counter
from .bleu_metric.bleu import Bleu
from .rouge_metric.rouge import Rouge
EMPTY = ''
YESNO_LABELS = set(['Yes', 'No', 'Depends'])
def normalize(s):
"""
Normalize each string into space-joined characters, dropping whitespace.
Args:
s: a list of strings.
Returns:
A list of normalized strings.
"""
if not s:
return s
normalized = []
for ss in s:
tokens = [c for c in list(ss) if len(c.strip()) != 0]
normalized.append(' '.join(tokens))
return normalized
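# normalize splits every string into single characters before scoring. A short
# sketch of that step for one string; to_char_tokens is a hypothetical helper,
# and the remark that this suits text without whitespace word boundaries (such
# as Chinese) is an assumption rather than something stated in this file.

```python
def to_char_tokens(text):
    # Keep every non-whitespace character and join them with single spaces.
    return ' '.join(ch for ch in text if ch.strip())


example = to_char_tokens(u'百度 AI')
```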
def data_check(obj, task):
"""
Check that a result object has the required fields.
Raises:
AssertionError: if a required field is missing or malformed.
"""
assert 'question_id' in obj, "Missing 'question_id' field."
assert 'question_type' in obj, \
"Missing 'question_type' field. question_id: {}".format(obj['question_id'])
assert 'yesno_answers' in obj, \
"Missing 'yesno_answers' field. question_id: {}".format(obj['question_id'])
assert isinstance(obj['yesno_answers'], list), \
r"""'yesno_answers' field must be a list, if the 'question_type' is not
'YES_NO', then this field should be an empty list.
question_id: {}""".format(obj['question_id'])
assert 'entity_answers' in obj, \
"Missing 'entity_answers' field. question_id: {}".format(obj['question_id'])
assert isinstance(obj['entity_answers'], list) \
and len(obj['entity_answers']) > 0, \
r"""'entity_answers' field must be a list with at least one element,
which can be an empty list. question_id: {}""".format(obj['question_id'])
def read_file(file_name, task, is_ref=False):
"""
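Read predicted answers or reference answers from a file.
Args:
file_name: the name of the file containing the predicted results or the
reference results. A '.zip' archive is read member by member.
task: the task name, passed through to data_check.
is_ref: if True, the 'source' field is kept in addition to the keys below.
Returns:
A dictionary mapping question_id to the result information. The result
information is itself a dictionary with four keys: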
- question_type: type of the query.
- yesno_answers: A list of yesno answers corresponding to 'answers'.
- answers: A list of predicted answers.
- entity_answers: A list, each element is also a list containing the entities
tagged out from the corresponding answer string.
"""
def _open(file_name, mode, zip_obj=None):
if zip_obj is not None:
return zip_obj.open(file_name, mode)
return open(file_name, mode)
results = {}
keys = ['answers', 'yesno_answers', 'entity_answers', 'question_type']
if is_ref:
keys += ['source']
zf = zipfile.ZipFile(file_name, 'r') if file_name.endswith('.zip') else None
file_list = [file_name] if zf is None else zf.namelist()
for fn in file_list:
for line in _open(fn, 'r', zip_obj=zf):
try:
obj = json.loads(line.strip())
except ValueError:
raise ValueError("Every line of data should be valid JSON")
data_check(obj, task)
qid = obj['question_id']
assert qid not in results, "Duplicate question_id: {}".format(qid)
results[qid] = {}
for k in keys:
results[qid][k] = obj[k]
return results
def compute_bleu_rouge(pred_dict, ref_dict, bleu_order=4):
"""
Compute bleu and rouge scores.
"""
assert set(pred_dict.keys()) == set(ref_dict.keys()), \
"missing keys: {}".format(set(ref_dict.keys()) - set(pred_dict.keys()))
scores = {}
bleu_scores, _ = Bleu(bleu_order).compute_score(ref_dict, pred_dict)
for i, bleu_score in enumerate(bleu_scores):
scores['Bleu-%d' % (i + 1)] = bleu_score
rouge_score, _ = Rouge().compute_score(ref_dict, pred_dict)
scores['Rouge-L'] = rouge_score
return scores
def local_prf(pred_list, ref_list):
"""
Compute local precision recall and f1-score,
given only one prediction list and one reference list
"""
common = Counter(pred_list) & Counter(ref_list)
num_same = sum(common.values())
if num_same == 0:
return 0, 0, 0
p = 1.0 * num_same / len(pred_list)
r = 1.0 * num_same / len(ref_list)
f1 = (2 * p * r) / (p + r)
return p, r, f1
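# A worked example of the multiset overlap that local_prf relies on. The
# intersection of two Counters keeps the minimum count per token, so repeated
# tokens are only credited as often as they occur on both sides. overlap_prf
# is a hypothetical standalone copy of the same arithmetic for illustration.

```python
from collections import Counter


def overlap_prf(pred_tokens, ref_tokens):
    # Counter & Counter keeps min(count_in_pred, count_in_ref) for each token.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    p = num_same / float(len(pred_tokens))
    r = num_same / float(len(ref_tokens))
    f1 = 2 * p * r / (p + r)
    return p, r, f1


# 'b' occurs twice on both sides, so two tokens overlap:
# precision = 2/4, recall = 2/3, f1 = 4/7.
p, r, f1 = overlap_prf(['a', 'b', 'b', 'c'], ['b', 'b', 'd'])
```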
def compute_prf(pred_dict, ref_dict):
"""
Compute precision recall and f1-score.
"""
pred_question_ids = set(pred_dict.keys())
ref_question_ids = set(ref_dict.keys())
correct_preds, total_correct, total_preds = 0, 0, 0
for question_id in ref_question_ids:
pred_entity_list = pred_dict.get(question_id, [[]])
assert len(pred_entity_list) == 1, \
'the number of entity list for question_id {} is not 1.'.format(question_id)
pred_entity_list = pred_entity_list[0]
all_ref_entity_lists = ref_dict[question_id]
best_local_f1 = 0
best_ref_entity_list = None
for ref_entity_list in all_ref_entity_lists:
local_f1 = local_prf(pred_entity_list, ref_entity_list)[2]
if local_f1 > best_local_f1:
best_ref_entity_list = ref_entity_list
best_local_f1 = local_f1
if best_ref_entity_list is None:
if len(all_ref_entity_lists) > 0:
best_ref_entity_list = sorted(
all_ref_entity_lists, key=lambda x: len(x))[0]
else:
best_ref_entity_list = []
gold_entities = set(best_ref_entity_list)
pred_entities = set(pred_entity_list)
correct_preds += len(gold_entities & pred_entities)
total_preds += len(pred_entities)
total_correct += len(gold_entities)
p = float(correct_preds) / total_preds if correct_preds > 0 else 0
r = float(correct_preds) / total_correct if correct_preds > 0 else 0
f1 = 2 * p * r / (p + r) if correct_preds > 0 else 0
return {'Precision': p, 'Recall': r, 'F1': f1}
def prepare_prf(pred_dict, ref_dict):
"""
Prepares data for calculation of prf scores.
"""
preds = {k: v['entity_answers'] for k, v in pred_dict.items()}
refs = {k: v['entity_answers'] for k, v in ref_dict.items()}
return preds, refs
def filter_dict(result_dict, key_tag):
"""
Return the subset of result_dict whose keys end with 'key_tag'.
"""
filtered = {}
for k, v in result_dict.items():
if k.endswith(key_tag):
filtered[k] = v
return filtered
def get_metrics(pred_result, ref_result, task, source):
"""
Computes metrics.
"""
metrics = {}
ref_result_filtered = {}
pred_result_filtered = {}
if source == 'both':
ref_result_filtered = ref_result
pred_result_filtered = pred_result
else:
for question_id, info in ref_result.items():
if info['source'] == source:
ref_result_filtered[question_id] = info
if question_id in pred_result:
pred_result_filtered[question_id] = pred_result[question_id]
if task == 'main' or task == 'all' \
or task == 'description':
pred_dict, ref_dict = prepare_bleu(pred_result_filtered,
ref_result_filtered, task)
metrics = compute_bleu_rouge(pred_dict, ref_dict)
elif task == 'yesno':
pred_dict, ref_dict = prepare_bleu(pred_result_filtered,
ref_result_filtered, task)
keys = ['Yes', 'No', 'Depends']
preds = [filter_dict(pred_dict, k) for k in keys]
refs = [filter_dict(ref_dict, k) for k in keys]
metrics = compute_bleu_rouge(pred_dict, ref_dict)
for k, pred, ref in zip(keys, preds, refs):
m = compute_bleu_rouge(pred, ref)
k_metric = [(k + '|' + key, v) for key, v in m.items()]
metrics.update(k_metric)
elif task == 'entity':
pred_dict, ref_dict = prepare_prf(pred_result_filtered,
ref_result_filtered)
pred_dict_bleu, ref_dict_bleu = prepare_bleu(pred_result_filtered,
ref_result_filtered, task)
metrics = compute_prf(pred_dict, ref_dict)
metrics.update(compute_bleu_rouge(pred_dict_bleu, ref_dict_bleu))
else:
raise ValueError("Illegal task name: {}".format(task))
return metrics
def prepare_bleu(pred_result, ref_result, task):
"""
Prepares data for calculation of bleu and rouge scores.
"""
pred_list, ref_list = [], []
qids = ref_result.keys()
for qid in qids:
if task == 'main':
pred, ref = get_main_result(qid, pred_result, ref_result)
elif task == 'yesno':
pred, ref = get_yesno_result(qid, pred_result, ref_result)
elif task == 'all':
pred, ref = get_all_result(qid, pred_result, ref_result)
elif task == 'entity':
pred, ref = get_entity_result(qid, pred_result, ref_result)
elif task == 'description':
pred, ref = get_desc_result(qid, pred_result, ref_result)
else:
raise ValueError("Illegal task name: {}".format(task))
if pred and ref:
pred_list += pred
ref_list += ref
pred_dict = dict(pred_list)
ref_dict = dict(ref_list)
# Iterate over a snapshot, because entries are deleted inside the loop.
for qid, ans in list(ref_dict.items()):
ref_dict[qid] = normalize(ref_dict[qid])
pred_dict[qid] = normalize(pred_dict.get(qid, [EMPTY]))
if not ans or ans == [EMPTY]:
del ref_dict[qid]
del pred_dict[qid]
for k, v in pred_dict.items():
assert len(v) == 1, \
"There should be only one predict answer. question_id: {}".format(k)
return pred_dict, ref_dict
def get_main_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'main'.
Args:
qid: question_id.
pred_result: A dict that includes the result information of every
question_id, read from args.pred_file.
ref_result: A dict that includes the result information of every
question_id, read from args.ref_file.
Returns:
Two lists: the first contains the predicted result and the second
contains the reference result for the same question_id. Each list holds
tuples of (question_id, answers), where 'answers' is a list of strings.
"""
ref_ans = ref_result[qid]['answers']
if not ref_ans:
ref_ans = [EMPTY]
pred_ans = pred_result.get(qid, {}).get('answers', [])[:1]
if not pred_ans:
pred_ans = [EMPTY]
return [(qid, pred_ans)], [(qid, ref_ans)]
def get_entity_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'entity'.
Args:
qid: question_id.
pred_result: A dict that includes the result information of every
question_id, read from args.pred_file.
ref_result: A dict that includes the result information of every
question_id, read from args.ref_file.
Returns:
Two lists: the first contains the predicted result and the second
contains the reference result for the same question_id. Each list holds
tuples of (question_id, answers), where 'answers' is a list of strings.
Returns (None, None) if the question type is not 'ENTITY'.
"""
if ref_result[qid]['question_type'] != 'ENTITY':
return None, None
return get_main_result(qid, pred_result, ref_result)
def get_desc_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'description'.
Args:
qid: question_id.
pred_result: A dict that includes the result information of every
question_id, read from args.pred_file.
ref_result: A dict that includes the result information of every
question_id, read from args.ref_file.
Returns:
Two lists: the first contains the predicted result and the second
contains the reference result for the same question_id. Each list holds
tuples of (question_id, answers), where 'answers' is a list of strings.
Returns (None, None) if the question type is not 'DESCRIPTION'.
"""
if ref_result[qid]['question_type'] != 'DESCRIPTION':
return None, None
return get_main_result(qid, pred_result, ref_result)
def get_yesno_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'yesno'.
Args:
qid: question_id.
pred_result: A dict that includes the result information of every
question_id, read from args.pred_file.
ref_result: A dict that includes the result information of every
question_id, read from args.ref_file.
Returns:
Two lists: the first contains the predicted result and the second
contains the reference result for the same question_id. Each list holds
tuples of (question_id + '_' + label, answers), one per yes/no label, where
'answers' is a list of strings. Returns (None, None) if the question type
is not 'YES_NO'.
"""
def _uniq(li, is_ref):
uniq_li = []
left = []
keys = set()
for k, v in li:
if k not in keys:
uniq_li.append((k, v))
keys.add(k)
else:
left.append((k, v))
if is_ref:
dict_li = dict(uniq_li)
for k, v in left:
dict_li[k] += v
uniq_li = [(k, v) for k, v in dict_li.items()]
return uniq_li
def _expand_result(uniq_li):
expanded = uniq_li[:]
keys = set([x[0] for x in uniq_li])
for k in YESNO_LABELS - keys:
expanded.append((k, [EMPTY]))
return expanded
def _get_yesno_ans(qid, result_dict, is_ref=False):
if qid not in result_dict:
return [(str(qid) + '_' + k, v) for k, v in _expand_result([])]
yesno_answers = result_dict[qid]['yesno_answers']
answers = result_dict[qid]['answers']
lbl_ans = _uniq([(k, [v]) for k, v in zip(yesno_answers, answers)],
is_ref)
ret = [(str(qid) + '_' + k, v) for k, v in _expand_result(lbl_ans)]
return ret
if ref_result[qid]['question_type'] != 'YES_NO':
return None, None
ref_ans = _get_yesno_ans(qid, ref_result, is_ref=True)
pred_ans = _get_yesno_ans(qid, pred_result)
return pred_ans, ref_ans
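# A small sketch of the per-label expansion performed for 'YES_NO' references:
# answers that share a label are merged, and labels with no answer get a single
# empty string, so every question yields one entry per label. expand_yesno is a
# hypothetical helper. It mirrors only the reference branch (is_ref=True); for
# predictions the function above keeps just the first answer of each label.

```python
YESNO = ('Yes', 'No', 'Depends')


def expand_yesno(qid, labels, answers):
    # Group answers by their yes/no label, preserving input order.
    by_label = {}
    for label, answer in zip(labels, answers):
        by_label.setdefault(label, []).append(answer)
    # Emit one key per label; labels without answers map to [''].
    return {'{}_{}'.format(qid, k): by_label.get(k, ['']) for k in YESNO}


expanded = expand_yesno(7, ['Yes', 'Yes', 'No'], ['a', 'b', 'c'])
```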
def get_all_result(qid, pred_result, ref_result):
"""
Prepare answers for task 'all'.
Args:
qid: question_id.
pred_result: A dict that includes the result information of every
question_id, read from args.pred_file.
ref_result: A dict that includes the result information of every
question_id, read from args.ref_file.
Returns:
Two lists: the first contains the predicted result and the second
contains the reference result for the same question_id. Each list holds
tuples of (key, answers), where the key is the question_id (or
question_id + '_' + label for 'YES_NO' questions) and 'answers' is a list
of strings.
"""
if ref_result[qid]['question_type'] == 'YES_NO':
return get_yesno_result(qid, pred_result, ref_result)
return get_main_result(qid, pred_result, ref_result)
def format_metrics(metrics, task, err_msg):
"""
Format metrics. The 'errorMsg' field reports any error that occurred during evaluation.
Args:
metrics: A dict object containing metrics for different tasks.
task: Task name.
err_msg: Exception raised during evaluation.
Returns:
Formatted result.
"""
result = {}
sources = ["both", "search", "zhidao"]
if err_msg is not None:
return {'errorMsg': str(err_msg), 'errorCode': 1, 'data': []}
data = []
if task != 'all' and task != 'main':
sources = ["both"]
if task == 'entity':
metric_names = ["Bleu-4", "Rouge-L"]
metric_names_prf = ["F1", "Precision", "Recall"]
for name in metric_names + metric_names_prf:
for src in sources:
obj = {
"name": name,
"value": round(metrics[src].get(name, 0) * 100, 2),
"type": src,
}
data.append(obj)
elif task == 'yesno':
metric_names = ["Bleu-4", "Rouge-L"]
details = ["Yes", "No", "Depends"]
src = sources[0]
for name in metric_names:
obj = {
"name": name,
"value": round(metrics[src].get(name, 0) * 100, 2),
"type": 'All',
}
data.append(obj)
for d in details:
obj = {
"name": name,
"value": \
round(metrics[src].get(d + '|' + name, 0) * 100, 2),
"type": d,
}
data.append(obj)
else:
metric_names = ["Bleu-4", "Rouge-L"]
for name in metric_names:
for src in sources:
obj = {
"name": name,
"value": \
round(metrics[src].get(name, 0) * 100, 2),
"type": src,
}
data.append(obj)
result["data"] = data
result["errorCode"] = 0
result["errorMsg"] = "success"
return result
def main(args):
"""
Do evaluation.
"""
err = None
metrics = {}
try:
pred_result = read_file(args.pred_file, args.task)
ref_result = read_file(args.ref_file, args.task, is_ref=True)
sources = ['both', 'search', 'zhidao']
if args.task not in set(['main', 'all']):
sources = sources[:1]
for source in sources:
metrics[source] = get_metrics(pred_result, ref_result, args.task,
source)
except ValueError as ve:
err = ve
except AssertionError as ae:
err = ae
print(json.dumps(
format_metrics(metrics, args.task, err), ensure_ascii=False).encode(
'utf8'))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('pred_file', help='predict file')
parser.add_argument('ref_file', help='reference file')
parser.add_argument(
'task', help='task name: Main|Yes_No|All|Entity|Description')
args = parser.parse_args()
args.task = args.task.lower().replace('_', '')
main(args)
# -*- coding:utf8 -*-
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
Utility function to generate vocabulary file.
"""
import argparse
import sys
import json
from itertools import chain
def get_vocab(files, vocab_file):
"""
Builds vocabulary file from field 'segmented_paragraphs'
and 'segmented_question'.
Args:
files: A list of file names.
vocab_file: The file that stores the vocabulary.
"""
vocab = {}
for f in files:
with open(f, 'r') as fin:
for line in fin:
obj = json.loads(line.strip())
paras = [
chain(*d['segmented_paragraphs']) for d in obj['documents']
]
doc_tokens = chain(*paras)
question_tokens = obj['segmented_question']
for t in list(doc_tokens) + question_tokens:
vocab[t] = vocab.get(t, 0) + 1
# output
sorted_vocab = sorted(
[(v, c) for v, c in vocab.items()], key=lambda x: x[1], reverse=True)
# Write UTF-8 bytes explicitly so the script runs under Python 2 and Python 3.
with open(vocab_file, 'wb') as outf:
for w, c in sorted_vocab:
outf.write(u'{}\t{}\n'.format(w, c).encode('utf8'))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--files',
nargs='+',
required=True,
help='file list to count vocab from.')
parser.add_argument(
'--vocab', required=True, help='file to store counted vocab.')
args = parser.parse_args()
get_vocab(args.files, args.vocab)
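# The counting loop in get_vocab can also be expressed with collections.Counter.
# This is a sketch of an alternative, not the code this script uses. It works on
# already-parsed samples (dicts shaped like the JSON lines read above), and
# count_tokens is a hypothetical name.

```python
from collections import Counter
from itertools import chain


def count_tokens(samples):
    counts = Counter()
    for obj in samples:
        for doc in obj['documents']:
            # segmented_paragraphs is a list of token lists; flatten it.
            counts.update(chain.from_iterable(doc['segmented_paragraphs']))
        counts.update(obj['segmented_question'])
    # most_common() returns (token, count) pairs sorted by descending count.
    return counts.most_common()


sample = {
    'documents': [{'segmented_paragraphs': [['a', 'b'], ['a']]}],
    'segmented_question': ['a', 'c', 'b'],
}
vocab = count_tokens([sample])
```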
#coding=utf8
import os, sys, json
import nltk
def _nltk_tokenize(sequence):
tokens = nltk.word_tokenize(sequence)
cur_char_offset = 0
token_offsets = []
token_words = []
for token in tokens:
cur_char_offset = sequence.find(token, cur_char_offset)
token_offsets.append(
[cur_char_offset, cur_char_offset + len(token) - 1])
token_words.append(token)
return token_offsets, token_words
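# _nltk_tokenize keeps the search cursor at the start of the token it just
# found, so two identical adjacent tokens resolve to the same offsets. The
# sketch below shows one possible variation that moves the cursor past each
# match and records None when a token cannot be located in the text (for
# example when a tokenizer rewrites quote characters). find_token_offsets is a
# hypothetical helper and takes the tokens as an argument instead of calling NLTK.

```python
def find_token_offsets(sequence, tokens):
    offsets = []
    cursor = 0
    for token in tokens:
        start = sequence.find(token, cursor)
        if start < 0:
            # The token text does not occur after the cursor; leave a gap.
            offsets.append(None)
            continue
        # Inclusive [start, end] character offsets, as in _nltk_tokenize.
        offsets.append([start, start + len(token) - 1])
        cursor = start + len(token)
    return offsets


offsets = find_token_offsets('the the cat', ['the', 'the', 'cat'])
```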
def segment(input_js):
_, input_js['segmented_question'] = _nltk_tokenize(input_js['question'])
for doc_id, doc in enumerate(input_js['documents']):
doc['segmented_title'] = []
doc['segmented_paragraphs'] = []
for para_id, para in enumerate(doc['paragraphs']):
_, seg_para = _nltk_tokenize(para)
doc['segmented_paragraphs'].append(seg_para)
if 'answers' in input_js:
input_js['segmented_answers'] = []
for answer_id, answer in enumerate(input_js['answers']):
_, seg_answer = _nltk_tokenize(answer)
input_js['segmented_answers'].append(seg_answer)
if __name__ == '__main__':
if len(sys.argv) != 2:
print('Usage: tokenize_data.py <input_path>')
sys.exit(1)
nltk.download('punkt')
for line in open(sys.argv[1]):
dureader_js = json.loads(line.strip())
segment(dureader_js)
print(json.dumps(dureader_js))
#coding=utf8
import sys
import json
import pandas as pd
def trans(input_js):
output_js = {}
output_js['question'] = input_js['query']
output_js['question_type'] = input_js['query_type']
output_js['question_id'] = input_js['query_id']
output_js['fact_or_opinion'] = ""
output_js['documents'] = []
for para_id, para in enumerate(input_js['passages']):
doc = {}
doc['title'] = ""
if 'is_selected' in para:
doc['is_selected'] = para['is_selected'] != 0
doc['paragraphs'] = [para['passage_text']]
output_js['documents'].append(doc)
if 'answers' in input_js:
output_js['answers'] = input_js['answers']
return output_js
if __name__ == '__main__':
if len(sys.argv) != 2:
print('Usage: marcov1_to_dureader.py <input_path>')
sys.exit(1)
df = pd.read_json(sys.argv[1])
for row in df.iterrows():
marco_js = json.loads(row[1].to_json())
dureader_js = trans(marco_js)
print(json.dumps(dureader_js))
import sys
import json
import pandas as pd
if __name__ == '__main__':
if len(sys.argv) != 3:
print('Usage: tojson.py <input_path> <output_path>')
sys.exit(1)
infile = sys.argv[1]
outfile = sys.argv[2]
df = pd.read_json(infile)
with open(outfile, 'w') as f:
for row in df.iterrows():
f.write(row[1].to_json() + '\n')
###############################################################################
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
This module finds the most related paragraph of each document according to recall.
"""
import sys
if sys.version[0] == '2':
reload(sys)
sys.setdefaultencoding("utf-8")
import json
from collections import Counter
def precision_recall_f1(prediction, ground_truth):
"""
This function calculates and returns the precision, recall and f1-score
Args:
prediction: prediction string or list to be matched
ground_truth: golden string or list reference
Returns:
floats of (p, r, f1)
Raises:
None
"""
if not isinstance(prediction, list):
prediction_tokens = prediction.split()
else:
prediction_tokens = prediction
if not isinstance(ground_truth, list):
ground_truth_tokens = ground_truth.split()
else:
ground_truth_tokens = ground_truth
common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
num_same = sum(common.values())
if num_same == 0:
return 0, 0, 0
p = 1.0 * num_same / len(prediction_tokens)
r = 1.0 * num_same / len(ground_truth_tokens)
f1 = (2 * p * r) / (p + r)
return p, r, f1
def recall(prediction, ground_truth):
"""
This function calculates and returns the recall
Args:
prediction: prediction string or list to be matched
ground_truth: golden string or list reference
Returns:
floats of recall
Raises:
None
"""
return precision_recall_f1(prediction, ground_truth)[1]
def f1_score(prediction, ground_truth):
"""
This function calculates and returns the f1-score
Args:
prediction: prediction string or list to be matched
ground_truth: golden string or list reference
Returns:
floats of f1
Raises:
None
"""
return precision_recall_f1(prediction, ground_truth)[2]
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
"""
This function scores the prediction against every ground truth
and returns the best score
Args:
metric_fn: metric function that scores a prediction against one ground truth.
prediction: prediction string or list to be matched
ground_truths: list of golden strings or token lists to compare against
Returns:
the maximum score over all ground truths
Raises:
None
"""
scores_for_ground_truths = []
for ground_truth in ground_truths:
score = metric_fn(prediction, ground_truth)
scores_for_ground_truths.append(score)
return max(scores_for_ground_truths)
def find_best_question_match(doc, question, with_score=False):
"""
For each document, find the paragraph that best matches the question.
Args:
doc: The document object.
question: The question tokens.
with_score: If True, the match score is returned together with the index;
otherwise only the index is returned.
Returns:
The index of the best match paragraph, if with_score=False,
otherwise returns a tuple of the index of the best match paragraph
and the match score of that paragraph.
"""
most_related_para = -1
max_related_score = 0
most_related_para_len = 0
for p_idx, para_tokens in enumerate(doc['segmented_paragraphs']):
if len(question) > 0:
related_score = metric_max_over_ground_truths(recall, para_tokens,
question)
else:
related_score = 0
if related_score > max_related_score \
or (related_score == max_related_score \
and len(para_tokens) < most_related_para_len):
most_related_para = p_idx
max_related_score = related_score
most_related_para_len = len(para_tokens)
if most_related_para == -1:
most_related_para = 0
if with_score:
return most_related_para, max_related_score
return most_related_para
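# A compact sketch of the selection rule used in this module: score every
# paragraph by its best token recall against a set of references, prefer the
# higher score, and break ties in favour of the shorter paragraph.
# token_recall and pick_paragraph are hypothetical names, and the initial
# "best length" of infinity is an assumption made for this sketch.

```python
from collections import Counter


def token_recall(candidate, reference):
    # Fraction of reference tokens that also appear in the candidate.
    if not reference:
        return 0.0
    common = Counter(candidate) & Counter(reference)
    return sum(common.values()) / float(len(reference))


def pick_paragraph(paragraphs, references):
    best_idx, best_score, best_len = -1, 0.0, float('inf')
    for idx, para in enumerate(paragraphs):
        score = max(token_recall(para, ref) for ref in references)
        if score > best_score or (score == best_score and len(para) < best_len):
            best_idx, best_score, best_len = idx, score, len(para)
    return best_idx, best_score


paragraphs = [['a', 'b', 'c', 'd'], ['a', 'b'], ['z']]
choice = pick_paragraph(paragraphs, [['a', 'b']])
```

# Paragraphs 0 and 1 both reach recall 1.0; the shorter one (index 1) wins.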
def find_fake_answer(sample):
"""
For each document, finds the most related paragraph based on recall,
then finds the span that maximizes the f1_score against the gold answers
and stores this span in the sample as a fake answer span
Args:
sample: a sample in the dataset
Returns:
None
Raises:
None
"""
for doc in sample['documents']:
most_related_para = -1
most_related_para_len = 999999
max_related_score = 0
for p_idx, para_tokens in enumerate(doc['segmented_paragraphs']):
if len(sample['segmented_answers']) > 0:
related_score = metric_max_over_ground_truths(
recall, para_tokens, sample['segmented_answers'])
else:
continue
if related_score > max_related_score \
or (related_score == max_related_score
and len(para_tokens) < most_related_para_len):
most_related_para = p_idx
most_related_para_len = len(para_tokens)
max_related_score = related_score
doc['most_related_para'] = most_related_para
sample['answer_docs'] = []
sample['answer_spans'] = []
sample['fake_answers'] = []
sample['match_scores'] = []
best_match_score = 0
best_match_d_idx, best_match_span = -1, [-1, -1]
best_fake_answer = None
answer_tokens = set()
for segmented_answer in sample['segmented_answers']:
answer_tokens = answer_tokens | set(
[token for token in segmented_answer])
for d_idx, doc in enumerate(sample['documents']):
if not doc['is_selected']:
continue
if doc['most_related_para'] == -1:
doc['most_related_para'] = 0
most_related_para_tokens = doc['segmented_paragraphs'][doc[
'most_related_para']][:1000]
for start_tidx in range(len(most_related_para_tokens)):
if most_related_para_tokens[start_tidx] not in answer_tokens:
continue
for end_tidx in range(
len(most_related_para_tokens) - 1, start_tidx - 1, -1):
span_tokens = most_related_para_tokens[start_tidx:end_tidx + 1]
if len(sample['segmented_answers']) > 0:
match_score = metric_max_over_ground_truths(
f1_score, span_tokens, sample['segmented_answers'])
else:
match_score = 0
if match_score == 0:
break
if match_score > best_match_score:
best_match_d_idx = d_idx
best_match_span = [start_tidx, end_tidx]
best_match_score = match_score
best_fake_answer = ''.join(span_tokens)
if best_match_score > 0:
sample['answer_docs'].append(best_match_d_idx)
sample['answer_spans'].append(best_match_span)
sample['fake_answers'].append(best_fake_answer)
sample['match_scores'].append(best_match_score)
if __name__ == '__main__':
for line in sys.stdin:
sample = json.loads(line)
find_fake_answer(sample)
print(json.dumps(sample, encoding='utf8', ensure_ascii=False))
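The span search in `find_fake_answer` can be illustrated in isolation. The following is a minimal Python 3 sketch, not the repository's implementation: the helper names `precision_recall_f1` and `best_f1_span` are hypothetical, and the token-overlap F1 is an assumed stand-in for the `f1_score` function imported by the script.

```python
from collections import Counter


def precision_recall_f1(prediction, ground_truth):
    """Token-overlap precision, recall and F1 (assumed metric definition)."""
    common = Counter(prediction) & Counter(ground_truth)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    p = num_same / len(prediction)
    r = num_same / len(ground_truth)
    return p, r, 2 * p * r / (p + r)


def best_f1_span(para_tokens, answers):
    """Return ([start, end], score) of the span with the highest F1.

    Mirrors the loop structure above: only start positions whose token
    occurs in some answer are tried, and end positions are scanned from
    the end of the paragraph back towards the start.
    """
    best_span, best_score = [-1, -1], 0.0
    if not answers:
        return best_span, best_score
    answer_tokens = {token for answer in answers for token in answer}
    for start in range(len(para_tokens)):
        if para_tokens[start] not in answer_tokens:
            continue
        for end in range(len(para_tokens) - 1, start - 1, -1):
            span = para_tokens[start:end + 1]
            score = max(precision_recall_f1(span, a)[2] for a in answers)
            if score == 0:
                break
            if score > best_score:
                best_span, best_score = [start, end], score
    return best_span, best_score


para = ["the", "cat", "sat", "on", "the", "mat"]
span, score = best_f1_span(para, [["cat", "sat"]])
print(span, score)
```

The search is quadratic in the paragraph length, which is presumably why the script truncates the paragraph to its first 1000 tokens before searching.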
#!/bin/bash
input_file=$1
output_file=$2
# convert the data from MARCO V2 (json) format to MARCO V1 (jsonl) format.
# the script was forked from MARCO repo.
# the MARCO V1 format is much easier to explore.
python3 marcov2_to_v1_tojsonl.py $input_file $input_file.marcov1
# convert the data from MARCO V1 format to DuReader format.
python3 marcov1_to_dureader.py $input_file.marcov1 >$input_file.dureader_raw
# tokenize the data.
python3 marco_tokenize_data.py $input_file.dureader_raw >$input_file.segmented
# find fake answers (indicating the start and end positions of answers in the document) for train and dev sets.
# note that this should not be applied for test set, since there is no ground truth in test set.
python preprocess.py $input_file.segmented >$output_file
# remove the temporary data files.
rm -rf $input_file.dureader_raw $input_file.segmented
# -*- coding:utf8 -*-
# ==============================================================================
# Copyright 2017 Baidu.com, Inc. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
This module implements the Vocab class for converting string to id and back
"""
import numpy as np
class Vocab(object):
"""
Implements a vocabulary to store the tokens in the data, with their corresponding embeddings.
"""
def __init__(self, filename=None, initial_tokens=None, lower=False):
self.id2token = {}
self.token2id = {}
self.token_cnt = {}
self.lower = lower
self.embed_dim = None
self.embeddings = None
self.pad_token = '<blank>'
self.unk_token = '<unk>'
self.initial_tokens = initial_tokens if initial_tokens is not None else []
self.initial_tokens.extend([self.pad_token, self.unk_token])
for token in self.initial_tokens:
self.add(token)
if filename is not None:
self.load_from_file(filename)
def size(self):
"""
get the size of vocabulary
Returns:
an integer indicating the size
"""
return len(self.id2token)
def load_from_file(self, file_path):
"""
loads the vocab from file_path
Args:
file_path: a file with a word in each line
"""
for line in open(file_path, 'r'):
token = line.rstrip('\n')
self.add(token)
def get_id(self, token):
"""
gets the id of a token, returns the id of unk token if token is not in vocab
Args:
token: a string indicating the word
Returns:
an integer
"""
token = token.lower() if self.lower else token
try:
return self.token2id[token]
except KeyError:
return self.token2id[self.unk_token]
def get_token(self, idx):
"""
gets the token corresponding to idx, returns unk token if idx is not in vocab
Args:
idx: an integer
Returns:
a token string
"""
try:
return self.id2token[idx]
except KeyError:
return self.unk_token
def add(self, token, cnt=1):
"""
adds the token to vocab
Args:
token: a string
cnt: a num indicating the count of the token to add, default is 1
"""
token = token.lower() if self.lower else token
if token in self.token2id:
idx = self.token2id[token]
else:
idx = len(self.id2token)
self.id2token[idx] = token
self.token2id[token] = idx
if cnt > 0:
if token in self.token_cnt:
self.token_cnt[token] += cnt
else:
self.token_cnt[token] = cnt
return idx
def filter_tokens_by_cnt(self, min_cnt):
"""
filter the tokens in vocab by their count
Args:
min_cnt: tokens with frequency less than min_cnt are filtered out
"""
filtered_tokens = [
token for token in self.token2id if self.token_cnt[token] >= min_cnt
]
# rebuild the token x id map
self.token2id = {}
self.id2token = {}
for token in self.initial_tokens:
self.add(token, cnt=0)
for token in filtered_tokens:
self.add(token, cnt=0)
def randomly_init_embeddings(self, embed_dim):
"""
randomly initializes the embeddings for each token
Args:
embed_dim: the size of the embedding for each token
"""
self.embed_dim = embed_dim
self.embeddings = np.random.rand(self.size(), embed_dim)
for token in [self.pad_token, self.unk_token]:
self.embeddings[self.get_id(token)] = np.zeros([self.embed_dim])
def load_pretrained_embeddings(self, embedding_path):
"""
loads the pretrained embeddings from embedding_path,
tokens not in pretrained embeddings will be filtered
Args:
embedding_path: the path of the pretrained embedding file
"""
trained_embeddings = {}
with open(embedding_path, 'r') as fin:
for line in fin:
contents = line.strip().split()
token = contents[0].decode('utf8')
if token not in self.token2id:
continue
trained_embeddings[token] = list(map(float, contents[1:]))
if self.embed_dim is None:
self.embed_dim = len(contents) - 1
filtered_tokens = trained_embeddings.keys()
# rebuild the token x id map
self.token2id = {}
self.id2token = {}
for token in self.initial_tokens:
self.add(token, cnt=0)
for token in filtered_tokens:
self.add(token, cnt=0)
# load embeddings
self.embeddings = np.zeros([self.size(), self.embed_dim])
for token in self.token2id.keys():
if token in trained_embeddings:
self.embeddings[self.get_id(token)] = trained_embeddings[token]
def convert_to_ids(self, tokens):
"""
Convert a list of tokens to ids, use unk_token if the token is not in vocab.
Args:
tokens: a list of tokens
Returns:
a list of ids
"""
vec = [self.get_id(label) for label in tokens]
return vec
def recover_from_ids(self, ids, stop_id=None):
"""
Convert a list of ids to tokens, stop converting if the stop_id is encountered
Args:
ids: a list of ids to convert
stop_id: the stop id, default is None
Returns:
a list of tokens
"""
tokens = []
for i in ids:
tokens += [self.get_token(i)]
if stop_id is not None and i == stop_id:
break
return tokens
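To make the id assignment and count-based filtering concrete, here is a reduced, self-contained sketch. `MiniVocab` is a hypothetical stand-in written for illustration; it copies only the `add`, `get_id`, `convert_to_ids` and `filter_tokens_by_cnt` logic from the class above and omits lowercasing, file loading and embeddings.

```python
class MiniVocab:
    """Hypothetical, trimmed-down version of the Vocab class above."""

    def __init__(self):
        self.pad_token, self.unk_token = '<blank>', '<unk>'
        self.token2id, self.id2token, self.token_cnt = {}, {}, {}
        for token in (self.pad_token, self.unk_token):
            self.add(token)

    def add(self, token, cnt=1):
        # New tokens receive the next free id; known tokens keep theirs.
        if token not in self.token2id:
            idx = len(self.id2token)
            self.token2id[token] = idx
            self.id2token[idx] = token
        if cnt > 0:
            self.token_cnt[token] = self.token_cnt.get(token, 0) + cnt
        return self.token2id[token]

    def get_id(self, token):
        # Unknown tokens fall back to the id of the unk token.
        return self.token2id.get(token, self.token2id[self.unk_token])

    def convert_to_ids(self, tokens):
        return [self.get_id(token) for token in tokens]

    def filter_tokens_by_cnt(self, min_cnt):
        kept = [t for t in self.token2id if self.token_cnt[t] >= min_cnt]
        # Rebuild both maps so that ids stay contiguous after filtering.
        self.token2id, self.id2token = {}, {}
        for token in (self.pad_token, self.unk_token):
            self.add(token, cnt=0)
        for token in kept:
            self.add(token, cnt=0)


vocab = MiniVocab()
for word in "a b a c a b".split():
    vocab.add(word)
print(vocab.convert_to_ids(["a", "z"]))
vocab.filter_tokens_by_cnt(2)
print(vocab.convert_to_ids(["a", "b", "c"]))
```

Because the maps are rebuilt from scratch, ids assigned before filtering are not guaranteed to survive it, so any id sequences should be produced after the vocabulary has been filtered.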
import os
import math
import random
import cPickle
import functools
import numpy as np
import paddle
......@@ -44,9 +43,9 @@ for i, item in enumerate(test_list):
test_data[label] = []
test_data[label].append(path)
print "train_data size:", len(train_data)
print "test_data size:", len(test_data)
print "test_data image number:", len(test_image_list)
print("train_data size:", len(train_data))
print("test_data size:", len(test_data))
print("test_data image number:", len(test_image_list))
random.shuffle(test_image_list)
......@@ -213,11 +212,11 @@ def eml_iterator(data,
color_jitter=False,
rotate=False):
def reader():
labs = data.keys()
labs = list(data.keys())
lab_num = len(labs)
ind = range(0, lab_num)
ind = list(range(0, lab_num))
assert batch_size % samples_each_class == 0, "batch_size % samples_each_class != 0"
num_class = batch_size/samples_each_class
num_class = batch_size // samples_each_class
for i in range(iter_size):
random.shuffle(ind)
for n in range(num_class):
......@@ -244,9 +243,9 @@ def quadruplet_iterator(data,
color_jitter=False,
rotate=False):
def reader():
labs = data.keys()
labs = list(data.keys())
lab_num = len(labs)
ind = range(0, lab_num)
ind = list(range(0, lab_num))
for i in range(iter_size):
random.shuffle(ind)
ind_sample = ind[:class_num]
......@@ -254,7 +253,7 @@ def quadruplet_iterator(data,
for ind_i in ind_sample:
lab = labs[ind_i]
data_list = data[lab]
data_ind = range(0, len(data_list))
data_ind = list(range(0, len(data_list)))
random.shuffle(data_ind)
anchor_ind = data_ind[:samples_each_class]
......@@ -276,15 +275,15 @@ def triplet_iterator(data,
color_jitter=False,
rotate=False):
def reader():
labs = data.keys()
labs = list(data.keys())
lab_num = len(labs)
ind = range(0, lab_num)
ind = list(range(0, lab_num))
for i in range(iter_size):
random.shuffle(ind)
ind_pos, ind_neg = ind[:2]
lab_pos = labs[ind_pos]
pos_data_list = data[lab_pos]
data_ind = range(0, len(pos_data_list))
data_ind = list(range(0, len(pos_data_list)))
random.shuffle(data_ind)
anchor_ind, pos_ind = data_ind[:2]
......@@ -345,7 +344,7 @@ def quadruplet_train(class_num, samples_each_class):
def triplet_train(batch_size):
assert(batch_size % 3 == 0)
return triplet_iterator(train_data, 'train', batch_size, iter_size = batch_size/3 * 100, \
return triplet_iterator(train_data, 'train', batch_size, iter_size = batch_size//3 * 100, \
shuffle=True, color_jitter=False, rotate=False)
def test():
......
import datareader as reader
import math
import numpy as np
import paddle.fluid as fluid
from metrics import calculate_order_dist_matrix
from metrics import get_gpu_num
from . import datareader as reader
from .metrics import calculate_order_dist_matrix
from .metrics import get_gpu_num
class emlloss():
def __init__(self, train_batch_size = 40, samples_each_class=2):
......@@ -11,9 +11,9 @@ class emlloss():
self.samples_each_class = samples_each_class
self.train_batch_size = train_batch_size
assert(train_batch_size % num_gpus == 0)
self.cal_loss_batch_size = train_batch_size / num_gpus
self.cal_loss_batch_size = train_batch_size // num_gpus
assert(self.cal_loss_batch_size % samples_each_class == 0)
class_num = train_batch_size / samples_each_class
class_num = train_batch_size // samples_each_class
self.train_reader = reader.eml_train(train_batch_size, samples_each_class)
self.test_reader = reader.test()
......
......@@ -20,12 +20,14 @@ def recall_topk(fea, lab, k = 1):
import subprocess
import os
def get_gpu_num():
visibledevice = os.getenv('CUDA_VISIBLE_DEVICES')
if visibledevice:
devicenum = len(visibledevice.split(','))
else:
devicenum = subprocess.check_output(['nvidia-smi', '-L']).count('\n')
devicenum = subprocess.check_output(
[str.encode('nvidia-smi'), str.encode('-L')]).decode('utf-8').count('\n')
return devicenum
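The same device count can be written for Python 3 with plain string arguments, since `subprocess.check_output` returns bytes that must be decoded before counting newlines. This is a sketch under assumptions: the optional `environ` parameter is hypothetical, added only so the environment lookup can be exercised without a GPU.

```python
import os
import subprocess


def get_gpu_num(environ=None):
    """Count GPUs from CUDA_VISIBLE_DEVICES, else from `nvidia-smi -L`."""
    environ = os.environ if environ is None else environ
    visible = environ.get('CUDA_VISIBLE_DEVICES')
    if visible:
        return len(visible.split(','))
    # `nvidia-smi -L` prints one line per device; the output is bytes.
    output = subprocess.check_output(['nvidia-smi', '-L'])
    return output.decode('utf-8').count('\n')


print(get_gpu_num({'CUDA_VISIBLE_DEVICES': '0,1,3'}))
```

The fallback branch still requires `nvidia-smi` to be on the path and raises if it is not.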
import paddle as paddle
......
import numpy as np
import datareader as reader
import paddle.fluid as fluid
from metrics import calculate_order_dist_matrix
from metrics import get_gpu_num
from . import datareader as reader
from .metrics import calculate_order_dist_matrix
from .metrics import get_gpu_num
class quadrupletloss():
def __init__(self,
......@@ -14,9 +14,9 @@ class quadrupletloss():
self.samples_each_class = samples_each_class
self.train_batch_size = train_batch_size
assert(train_batch_size % num_gpus == 0)
self.cal_loss_batch_size = train_batch_size / num_gpus
self.cal_loss_batch_size = train_batch_size // num_gpus
assert(self.cal_loss_batch_size % samples_each_class == 0)
class_num = train_batch_size / samples_each_class
class_num = train_batch_size // samples_each_class
self.train_reader = reader.quadruplet_train(class_num, samples_each_class)
self.test_reader = reader.test()
......
import datareader as reader
from . import datareader as reader
import paddle.fluid as fluid
class tripletloss():
......
......@@ -75,7 +75,7 @@ class ResNet():
num_filters=num_filters,
filter_size=filter_size,
stride=stride,
padding=(filter_size - 1) / 2,
padding=(filter_size - 1) // 2,
groups=groups,
act=None,
bias_attr=False)
......
......@@ -127,7 +127,7 @@ class SE_ResNeXt():
num_filters=num_filters,
filter_size=filter_size,
stride=stride,
padding=(filter_size - 1) / 2,
padding=(filter_size - 1) // 2,
groups=groups,
act=None,
bias_attr=False)
......
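The `/` to `//` and `range` to `list(range(...))` changes in these hunks follow from Python 3 semantics: `/` always yields a float, and `range` is an immutable sequence that `random.shuffle` cannot reorder in place. A short sketch, where the helper name `same_padding` is hypothetical:

```python
import random


def same_padding(filter_size):
    """Padding that keeps spatial size at stride 1 for odd filter sizes."""
    # Floor division keeps the result an int under Python 3.
    return (filter_size - 1) // 2


# range() must be materialised as a list before it can be shuffled.
ind = list(range(0, 5))
random.shuffle(ind)

print(same_padding(3), same_padding(7), sorted(ind))
```

Under Python 3, `(filter_size - 1) / 2` would give `1.0` rather than `1`, which is not a valid value wherever an integer padding or batch size is expected.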
......@@ -93,7 +93,7 @@ def train(args):
elif loss_name == "emlloss":
metricloss = emlloss(
train_batch_size = args.train_batch_size,
samples_each_class=2
samples_each_class = args.samples_each_class
)
cost_metric = metricloss.loss(out[0])
avg_cost_metric = fluid.layers.mean(x=cost_metric)
......
......@@ -17,6 +17,7 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import distutils.util
import six
import numpy as np
from paddle.fluid import core
......@@ -37,7 +38,7 @@ def print_arguments(args):
:type args: argparse.Namespace
"""
print("----------- Configuration Arguments -----------")
for arg, value in sorted(vars(args).iteritems()):
for arg, value in sorted(six.iteritems(vars(args))):
print("%s: %s" % (arg, value))
print("------------------------------------------------")
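If Python 2 support is not needed, the `six` dependency can be avoided, because `dict.items()` exists in both major versions. The sketch below is an alternative, not the change made in this diff; `format_arguments` is a hypothetical helper that returns the lines instead of printing them so the output can be checked.

```python
import argparse


def format_arguments(args):
    """Return the configuration banner as a list of lines."""
    lines = ["----------- Configuration Arguments -----------"]
    # vars() exposes the Namespace attributes as a dict; sort by name.
    for arg, value in sorted(vars(args).items()):
        lines.append("%s: %s" % (arg, value))
    lines.append("------------------------------------------------")
    return lines


namespace = argparse.Namespace(pass_num=5, beam_size=3)
print("\n".join(format_arguments(namespace)))
```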
......
......@@ -7,9 +7,9 @@ from kpi import CostKpi, DurationKpi, AccKpi
#### NOTE kpi.py should shared in models in some way!!!!
train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True)
test_cost_kpi = CostKpi('test_cost', 0.005, 0, actived=True)
train_duration_kpi = DurationKpi('train_duration', 0.06, 0, actived=True)
train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=False)
test_cost_kpi = CostKpi('test_cost', 0.005, 0, actived=False)
train_duration_kpi = DurationKpi('train_duration', 0.06, 0, actived=False)
tracking_kpis = [
train_cost_kpi,
......
......@@ -52,7 +52,8 @@ def parse_args():
"--pass_num",
type=int,
default=5,
help="The pass number to train. (default: %(default)d)")
help="The pass number to train. In inference mode, load the saved model"
" at the end of given pass.(default: %(default)d)")
parser.add_argument(
"--learning_rate",
type=float,
......@@ -66,17 +67,17 @@ def parse_args():
"--beam_size",
type=int,
default=3,
help="The width for beam searching. (default: %(default)d)")
help="The width for beam search. (default: %(default)d)")
parser.add_argument(
"--use_gpu",
type=distutils.util.strtobool,
default=True,
help="Whether to use gpu. (default: %(default)d)")
help="Whether to use gpu or not. (default: %(default)d)")
parser.add_argument(
"--max_length",
type=int,
default=50,
help="The maximum length of sequence when doing generation. "
help="The maximum sequence length for translation result."
"(default: %(default)d)")
parser.add_argument(
"--save_dir",
......
......@@ -122,9 +122,8 @@ def seq_to_seq_net(embedding_dim, encoder_size, decoder_size, source_dict_dim,
decoder_state_expand = fluid.layers.sequence_expand(
x=decoder_state_proj, y=encoder_proj)
# concated lod should inherit from encoder_proj
concated = fluid.layers.concat(
input=[encoder_proj, decoder_state_expand], axis=1)
attention_weights = fluid.layers.fc(input=concated,
mixed_state = encoder_proj + decoder_state_expand
attention_weights = fluid.layers.fc(input=mixed_state,
size=1,
bias_attr=False)
attention_weights = fluid.layers.sequence_softmax(
......
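The hunk above replaces concatenation of the encoder projection and the expanded decoder state with an element-wise sum before the size-1 fully connected layer. A framework-free sketch of that scoring step for a single source sequence follows. The function name, the plain-list representation and the weight vector `w` are assumptions for illustration, and the sum requires both projections to have the same width.

```python
import math


def attention_weights(encoder_proj, decoder_state_proj, w):
    """Softmax over time steps of w . (encoder_proj[t] + decoder_state)."""
    scores = []
    for enc_t in encoder_proj:
        # Element-wise sum of the two projections (the "mixed state").
        mixed = [e + d for e, d in zip(enc_t, decoder_state_proj)]
        # Size-1 linear layer without bias: a dot product with w.
        scores.append(sum(m * wi for m, wi in zip(mixed, w)))
    # Numerically stable softmax across the sequence.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


weights = attention_weights([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                            [0.5, 0.5], [1.0, 1.0])
print(weights)
```

Compared with concatenation, the sum halves the input width of the scoring layer, at the cost of sharing one weight vector between the encoder and decoder terms.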