提交 4a34273e 编写于 作者: R root

Merge branch 'develop' of https://github.com/PaddlePaddle/models into dev_sc

......@@ -13,6 +13,57 @@ PaddlePaddle 提供了丰富的计算单元,使得用户可以采用模块化
- [fluid模型](fluid): 使用 PaddlePaddle Fluid版本的 APIs,我们特别推荐您使用Fluid模型。
## PaddleCV
模型|简介|模型优势|参考论文
--|:--:|:--:|:--:
[AlexNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|图像分类经典模型|首次在CNN中成功的应用了ReLU、Dropout和LRN,并使用GPU进行运算加速|[ImageNet Classification with Deep Convolutional Neural Networks](https://www.researchgate.net/publication/267960550_ImageNet_Classification_with_Deep_Convolutional_Neural_Networks)
[VGG](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|图像分类经典模型|在AlexNet的基础上使用3*3小卷积核,增加网络深度,具有很好的泛化能力|[Very Deep ConvNets for Large-Scale Inage Recognition](https://arxiv.org/pdf/1409.1556.pdf)
[GoogleNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|图像分类经典模型|在不增加计算负载的前提下增加了网络的深度和宽度,性能更加优越|[Going deeper with convolutions](https://ieeexplore.ieee.org/document/7298594)
[ResNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|残差网络|引入了新的残差结构,解决了随着网络加深,准确率下降的问题|[Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)
[Inception-v4](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|图像分类经典模型|更加deeper和wider的inception结构|[Inception-ResNet and the Impact of Residual Connections on Learning](http://arxiv.org/abs/1602.07261)
[MobileNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|轻量级网络模型|为移动和嵌入式设备提出的高效模型|[MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861)
[DPN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|图像分类模型|结合了DenseNet和ResNeXt的网络结构,对图像分类效果有所提升|[Dual Path Networks](https://arxiv.org/abs/1707.01629)
[SE-ResNeXt](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|图像分类模型|ResNeXt中加入了SE block,提高了模型准确率|[Squeeze-and-excitation networks](https://arxiv.org/abs/1709.01507)
[SSD](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleCV/object_detection/README_cn.md)|单阶段目标检测器|在不同尺度的特征图上检测对应尺度的目标,可以方便地插入到任何一种标准卷积网络中|[SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325)
[Face Detector: PyramidBox](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/face_detection/README_cn.md)|基于SSD的单阶段人脸检测器|利用上下文信息解决困难人脸的检测问题,网络表达能力高,鲁棒性强|[PyramidBox: A Context-assisted Single Shot Face Detector](https://arxiv.org/pdf/1803.07737.pdf)
[Faster RCNN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/faster_rcnn/README_cn.md)|典型的两阶段目标检测器|创造性地采用卷积网络自行产生建议框,并且和目标检测网络共享卷积网络,建议框数目减少,质量提高|[Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)
[ICNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/icnet)|图像实时语义分割模型|即考虑了速度,也考虑了准确性,在高分辨率图像的准确性和低复杂度网络的效率之间获得平衡|[ICNet for Real-Time Semantic Segmentation on High-Resolution Images](https://arxiv.org/abs/1704.08545)
[DCGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/gan/c_gan)|图像生成模型|深度卷积生成对抗网络,将GAN和卷积网络结合起来,以解决GAN训练不稳定的问题|[Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/pdf/1511.06434.pdf)
[ConditionalGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/gan/c_gan)|图像生成模型|条件生成对抗网络,一种带条件约束的GAN,使用额外信息对模型增加条件,可以指导数据生成过程|[Conditional Generative Adversarial Nets](https://arxiv.org/abs/1411.1784)
[CycleGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/gan/cycle_gan)|图片转化模型|自动将某一类图片转换成另外一类图片,可用于风格迁移|[Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks](https://arxiv.org/abs/1703.10593)
[CRNN-CTC模型](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/ocr_recognition)|场景文字识别模型|使用CTC model识别图片中单行英文字符|[Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks](https://www.researchgate.net/publication/221346365_Connectionist_temporal_classification_Labelling_unsegmented_sequence_data_with_recurrent_neural_'networks)
[Attention模型](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/ocr_recognition)|场景文字识别模型|使用attention 识别图片中单行英文字符|[Recurrent Models of Visual Attention](https://arxiv.org/abs/1406.6247)
[Metric Learning](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/metric_learning)|度量学习模型|能够用于分析对象时间的关联、比较关系,可应用于辅助分类、聚类问题,也广泛用于图像检索、人脸识别等领域|-
[TSN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/video_classification)|视频分类模型|基于长范围时间结构建模,结合了稀疏时间采样策略和视频级监督来保证使用整段视频时学习得有效和高效|[Temporal Segment Networks: Towards Good Practices for Deep Action Recognition](https://arxiv.org/abs/1608.00859)
[caffe2fluid](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/caffe2fluid)|将Caffe模型转换为Paddle Fluid配置和模型文件工具|-|-
## PaddleNLP
模型|简介|模型优势|参考论文
--|:--:|:--:|:--:
[Transformer](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleNLP/neural_machine_translation/transformer/README_cn.md)|机器翻译模型|基于self-attention,计算复杂度小,并行度高,容易学习长程依赖,翻译效果更好|[Attention Is All You Need](https://arxiv.org/abs/1706.03762)
[LAC](https://github.com/baidu/lac/blob/master/README.md)|联合的词法分析模型|能够整体性地完成中文分词、词性标注、专名识别任务|[Chinese Lexical Analysis with Deep Bi-GRU-CRF Network](https://arxiv.org/abs/1807.01882)
[Senta](https://github.com/baidu/Senta/blob/master/README.md)|情感倾向分析模型集|百度AI开放平台中情感倾向分析模型|-
[DAM](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleNLP/deep_attention_matching_net)|语义匹配模型|百度自然语言处理部发表于ACL-2018的工作,用于检索式聊天机器人多轮对话中应答的选择|[Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network](http://aclweb.org/anthology/P18-1103)
[SimNet](https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md)|语义匹配框架|使用SimNet构建出的模型可以便捷的加入AnyQ系统中,增强AnyQ系统的语义匹配能力|-
[DuReader](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleNLP/machine_reading_comprehension/README.md)|阅读理解模型|百度MRC数据集上的机器阅读理解模型|-
[Bi-GRU-CRF](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleNLP/sequence_tagging_for_ner/README.md)|命名实体识别|结合了CRF和双向GRU的命名实体识别模型|-
## PaddleRec
模型|简介|模型优势|参考论文
--|:--:|:--:|:--:
[TagSpace](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/tagspace)|文本及标签的embedding表示学习模型|应用于工业级的标签推荐,具体应用场景有feed新闻标签推荐等|[#TagSpace: Semantic embeddings from hashtags](https://www.bibsonomy.org/bibtex/0ed4314916f8e7c90d066db45c293462)
[GRU4Rec](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/gru4rec)|个性化推荐模型|首次将RNN(GRU)运用于session-based推荐,相比传统的KNN和矩阵分解,效果有明显的提升|[Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939)
[SSR](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/ssr)|序列语义检索推荐模型|使用参考论文中的思想,使用多种时间粒度进行用户行为预测|[Multi-Rate Deep Learning for Temporal Recommendation](https://dl.acm.org/citation.cfm?id=2914726)
[DeepCTR](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleRec/ctr/README.cn.md)|点击率预估模型|只实现了DeepFM论文中介绍的模型的DNN部分,DeepFM会在其他例子中给出|[DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/abs/1703.04247)
[Multiview-Simnet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/multiview_simnet)|个性化推荐模型|基于多元视图,将用户和项目的多个功能视图合并为一个统一模型|[A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](http://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf)
## Other Models
模型|简介|模型优势|参考论文
--|:--:|:--:|:--:
[DeepASR](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepASR/README_cn.md)|语音识别系统|利用Fluid框架完成语音识别中声学模型的配置和训练,并集成 Kaldi 的解码器|-
[DQN](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md)|深度Q网络|value based强化学习算法,第一个成功地将深度学习和强化学习结合起来的模型|[Human-level control through deep reinforcement learning](https://www.nature.com/articles/nature14236)
[DoubleDQN](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md)|DQN的变体|将Double Q的想法应用在DQN上,解决过优化问题|[Font Size: Deep Reinforcement Learning with Double Q-Learning](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12389)
[DuelingDQN](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md)|DQN的变体|改进了DQN模型,提高了模型的性能|[Dueling Network Architectures for Deep Reinforcement Learning](http://proceedings.mlr.press/v48/wangf16.html)
## License
This tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](LICENSE).
......
......@@ -21,6 +21,7 @@ import math
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.contrib.trainer import *
from paddle.fluid.layers.learning_rate_scheduler import _decay_step_counter
import reader
......@@ -104,7 +105,7 @@ class Model(object):
accs = []
def event_handler(event):
if isinstance(event, fluid.EndStepEvent):
if isinstance(event, EndStepEvent):
costs.append(event.metrics[0])
accs.append(event.metrics[1])
if event.step % 20 == 0:
......@@ -113,7 +114,7 @@ class Model(object):
del costs[:]
del accs[:]
if isinstance(event, fluid.EndEpochEvent):
if isinstance(event, EndEpochEvent):
if event.epoch % 3 == 0 or event.epoch == FLAGS.num_epochs - 1:
avg_cost, accuracy = trainer.test(
reader=test_reader, feed_order=['pixel', 'label'])
......@@ -126,7 +127,7 @@ class Model(object):
event_handler.best_acc = 0.0
place = fluid.CUDAPlace(0)
trainer = fluid.Trainer(
trainer = Trainer(
train_func=self.train_network,
optimizer_func=self.optimizer_program,
place=place)
......
......@@ -440,7 +440,8 @@ class Network(object):
if need_transpose:
order = range(dims)
order.remove(axis).append(axis)
order.remove(axis)
order.append(axis)
input = fluid.layers.transpose(
input,
perm=order,
......@@ -525,11 +526,21 @@ class Network(object):
scale_shape = input.shape[axis:axis + num_axes]
param_attr = fluid.ParamAttr(name=prefix + 'scale')
scale_param = fluid.layers.create_parameter(
shape=scale_shape, dtype=input.dtype, name=name, attr=param_attr)
shape=scale_shape,
dtype=input.dtype,
name=name,
attr=param_attr,
is_bias=True,
default_initializer=fluid.initializer.Constant(value=1.0))
offset_attr = fluid.ParamAttr(name=prefix + 'offset')
offset_param = fluid.layers.create_parameter(
shape=scale_shape, dtype=input.dtype, name=name, attr=offset_attr)
shape=scale_shape,
dtype=input.dtype,
name=name,
attr=offset_attr,
is_bias=True,
default_initializer=fluid.initializer.Constant(value=0.0))
output = fluid.layers.elementwise_mul(
input,
......
"""
This code is based on https://github.com/fchollet/keras/blob/master/keras/utils/data_utils.py
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import numpy as np
import threading
import multiprocessing
import traceback
try:
import queue
except ImportError:
import Queue as queue
class GeneratorEnqueuer(object):
"""
Builds a queue out of a data generator.
Args:
generator: a generator function which endlessly yields data
use_multiprocessing (bool): use multiprocessing if True,
otherwise use threading.
wait_time (float): time to sleep in-between calls to `put()`.
random_seed (int): Initial seed for workers,
will be incremented by one for each workers.
"""
def __init__(self,
generator,
use_multiprocessing=False,
wait_time=0.05,
random_seed=None):
self.wait_time = wait_time
self._generator = generator
self._use_multiprocessing = use_multiprocessing
self._threads = []
self._stop_event = None
self.queue = None
self._manager = None
self.seed = random_seed
def start(self, workers=1, max_queue_size=10):
"""
Start worker threads which add data from the generator into the queue.
Args:
workers (int): number of worker threads
max_queue_size (int): queue size
(when full, threads could block on `put()`)
"""
def data_generator_task():
"""
Data generator task.
"""
def task():
if (self.queue is not None and
self.queue.qsize() < max_queue_size):
generator_output = next(self._generator)
self.queue.put((generator_output))
else:
time.sleep(self.wait_time)
if not self._use_multiprocessing:
while not self._stop_event.is_set():
with self.genlock:
try:
task()
except Exception:
traceback.print_exc()
self._stop_event.set()
break
else:
while not self._stop_event.is_set():
try:
task()
except Exception:
traceback.print_exc()
self._stop_event.set()
break
try:
if self._use_multiprocessing:
self._manager = multiprocessing.Manager()
self.queue = self._manager.Queue(maxsize=max_queue_size)
self._stop_event = multiprocessing.Event()
else:
self.genlock = threading.Lock()
self.queue = queue.Queue()
self._stop_event = threading.Event()
for _ in range(workers):
if self._use_multiprocessing:
# Reset random seed else all children processes
# share the same seed
np.random.seed(self.seed)
thread = multiprocessing.Process(target=data_generator_task)
thread.daemon = True
if self.seed is not None:
self.seed += 1
else:
thread = threading.Thread(target=data_generator_task)
self._threads.append(thread)
thread.start()
except:
self.stop()
raise
def is_running(self):
"""
Returns:
bool: Whether the worker theads are running.
"""
return self._stop_event is not None and not self._stop_event.is_set()
def stop(self, timeout=None):
"""
Stops running threads and wait for them to exit, if necessary.
Should be called by the same thread which called `start()`.
Args:
timeout(int|None): maximum time to wait on `thread.join()`.
"""
if self.is_running():
self._stop_event.set()
for thread in self._threads:
if self._use_multiprocessing:
if thread.is_alive():
thread.terminate()
else:
thread.join(timeout)
if self._manager:
self._manager.shutdown()
self._threads = []
self._stop_event = None
self.queue = None
def get(self):
"""
Creates a generator to extract data from the queue.
Skip the data if it is `None`.
# Yields
tuple of data in the queue.
"""
while self.is_running():
if not self.queue.empty():
inputs = self.queue.get()
if inputs is not None:
yield inputs
else:
time.sleep(self.wait_time)
......@@ -16,8 +16,6 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import image_util
from paddle.utils.image_util import *
from PIL import Image
from PIL import ImageDraw
import numpy as np
......@@ -28,7 +26,10 @@ import copy
import random
import cv2
import six
from data_util import GeneratorEnqueuer
import math
from itertools import islice
import paddle
import image_util
class Settings(object):
......@@ -199,7 +200,7 @@ def load_file_list(input_txt):
else:
file_dict[num_class].append(line_txt)
return file_dict
return list(file_dict.values())
def expand_bboxes(bboxes,
......@@ -227,13 +228,12 @@ def expand_bboxes(bboxes,
def train_generator(settings, file_list, batch_size, shuffle=True):
file_dict = load_file_list(file_list)
while True:
def reader():
if shuffle:
np.random.shuffle(file_dict)
np.random.shuffle(file_list)
batch_out = []
for index_image in file_dict.keys():
image_name = file_dict[index_image][0]
for item in file_list:
image_name = item[0]
image_path = os.path.join(settings.data_dir, image_name)
im = Image.open(image_path)
if im.mode == 'L':
......@@ -242,10 +242,10 @@ def train_generator(settings, file_list, batch_size, shuffle=True):
# layout: label | xmin | ymin | xmax | ymax
bbox_labels = []
for index_box in range(len(file_dict[index_image])):
for index_box in range(len(item)):
if index_box >= 2:
bbox_sample = []
temp_info_box = file_dict[index_image][index_box].split(' ')
temp_info_box = item[index_box].split(' ')
xmin = float(temp_info_box[0])
ymin = float(temp_info_box[1])
w = float(temp_info_box[2])
......@@ -277,43 +277,25 @@ def train_generator(settings, file_list, batch_size, shuffle=True):
yield batch_out
batch_out = []
return reader
def train(settings,
file_list,
batch_size,
shuffle=True,
use_multiprocessing=True,
num_workers=8,
max_queue=24):
def reader():
try:
enqueuer = GeneratorEnqueuer(
train_generator(settings, file_list, batch_size, shuffle),
use_multiprocessing=use_multiprocessing)
enqueuer.start(max_queue_size=max_queue, workers=num_workers)
generator_output = None
while True:
while enqueuer.is_running():
if not enqueuer.queue.empty():
generator_output = enqueuer.queue.get()
break
else:
time.sleep(0.01)
yield generator_output
generator_output = None
finally:
if enqueuer is not None:
enqueuer.stop()
return reader
def train(settings, file_list, batch_size, shuffle=True, num_workers=8):
file_lists = load_file_list(file_list)
n = int(math.ceil(len(file_lists) // num_workers))
split_lists = [file_lists[i:i + n] for i in range(0, len(file_lists), n)]
readers = []
for iterm in split_lists:
readers.append(train_generator(settings, iterm, batch_size, shuffle))
return paddle.reader.multiprocess_reader(readers, False)
def test(settings, file_list):
file_dict = load_file_list(file_list)
file_lists = load_file_list(file_list)
def reader():
for index_image in file_dict.keys():
image_name = file_dict[index_image][0]
for image in file_lists:
image_name = image[0]
image_path = os.path.join(settings.data_dir, image_name)
im = Image.open(image_path)
if im.mode == 'L':
......
......@@ -163,9 +163,7 @@ def train(args, config, train_params, train_file_list):
train_file_list,
batch_size_per_device,
shuffle = is_shuffle,
use_multiprocessing=True,
num_workers = num_workers,
max_queue=24)
num_workers = num_workers)
train_py_reader.decorate_paddle_reader(train_reader)
if args.parallel:
......@@ -182,61 +180,59 @@ def train(args, config, train_params, train_file_list):
print('save models to %s' % (model_path))
fluid.io.save_persistables(exe, model_path, main_program=program)
train_py_reader.start()
try:
total_time = 0.0
epoch_idx = 0
face_loss = 0
head_loss = 0
for pass_id in range(start_epoc, epoc_num):
epoch_idx += 1
start_time = time.time()
prev_start_time = start_time
end_time = 0
batch_id = 0
for batch_id in range(iters_per_epoc):
total_time = 0.0
epoch_idx = 0
face_loss = 0
head_loss = 0
for pass_id in range(start_epoc, epoc_num):
epoch_idx += 1
start_time = time.time()
prev_start_time = start_time
end_time = 0
batch_id = 0
train_py_reader.start()
while True:
try:
prev_start_time = start_time
start_time = time.time()
if args.parallel:
fetch_vars = train_exe.run(fetch_list=
[v.name for v in fetches])
else:
fetch_vars = exe.run(train_prog,
fetch_list=fetches)
fetch_vars = exe.run(train_prog, fetch_list=fetches)
end_time = time.time()
fetch_vars = [np.mean(np.array(v)) for v in fetch_vars]
face_loss = fetch_vars[0]
head_loss = fetch_vars[1]
if batch_id % 10 == 0:
if not args.use_pyramidbox:
print("Pass {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
pass_id, batch_id, fetch_vars[0],
pass_id, batch_id, face_loss,
start_time - prev_start_time))
else:
print("Pass {:d}, batch {:d}, face loss {:.6f}, " \
"head loss {:.6f}, " \
"time {:.5f}".format(pass_id,
batch_id, fetch_vars[0], fetch_vars[1],
batch_id, face_loss, head_loss,
start_time - prev_start_time))
face_loss = fetch_vars[0]
head_loss = fetch_vars[1]
epoch_end_time = time.time()
total_time += epoch_end_time - start_time
if pass_id % 1 == 0 or pass_id == epoc_num - 1:
save_model(str(pass_id), train_prog)
# only for ce
if args.enable_ce:
gpu_num = get_cards(args)
print("kpis\teach_pass_duration_card%s\t%s" %
(gpu_num, total_time / epoch_idx))
print("kpis\ttrain_face_loss_card%s\t%s" %
(gpu_num, face_loss))
print("kpis\ttrain_head_loss_card%s\t%s" %
(gpu_num, head_loss))
except fluid.core.EOFException:
train_py_reader.reset()
except StopIteration:
train_py_reader.reset()
train_py_reader.reset()
batch_id += 1
except (fluid.core.EOFException, StopIteration):
train_py_reader.reset()
break
epoch_end_time = time.time()
total_time += epoch_end_time - start_time
save_model(str(pass_id), train_prog)
# only for ce
if args.enable_ce:
gpu_num = get_cards(args)
print("kpis\teach_pass_duration_card%s\t%s" %
(gpu_num, total_time / epoch_idx))
print("kpis\ttrain_face_loss_card%s\t%s" %
(gpu_num, face_loss))
print("kpis\ttrain_head_loss_card%s\t%s" %
(gpu_num, head_loss))
def get_cards(args):
......
......@@ -38,18 +38,6 @@ Train the model on [MS-COCO dataset](http://cocodataset.org/#download), download
## Training
After data preparation, one can start the training step by:
python train.py \
--model_save_dir=output/ \
--pretrained_model=${path_to_pretrain_model}
--data_dir=${path_to_data}
- Set ```export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7``` to specifiy 8 GPU to train.
- For more help on arguments:
python train.py --help
**download the pre-trained model:** This sample provides Resnet-50 pre-trained model which is converted from Caffe. The model fuses the parameters in batch normalization layer. One can download pre-trained model as:
sh ./pretrained/download.sh
......@@ -72,6 +60,18 @@ To train the model, [cocoapi](https://github.com/cocodataset/cocoapi) is needed.
# not to install the COCO API into global site-packages
python2 setup.py install --user
After data preparation, one can start the training step by:
python train.py \
--model_save_dir=output/ \
--pretrained_model=${path_to_pretrain_model}
--data_dir=${path_to_data}
- Set ```export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7``` to specifiy 8 GPU to train.
- For more help on arguments:
python train.py --help
**data reader introduction:**
* Data reader is defined in `reader.py`.
......
......@@ -37,18 +37,6 @@ Faster RCNN 目标检测模型
## 模型训练
数据准备完毕后,可以通过如下的方式启动训练:
python train.py \
--model_save_dir=output/ \
--pretrained_model=${path_to_pretrain_model}
--data_dir=${path_to_data}
- 通过设置export CUDA\_VISIBLE\_DEVICES=0,1,2,3,4,5,6,7指定8卡GPU训练。
- 可选参数见:
python train.py --help
**下载预训练模型:** 本示例提供Resnet-50预训练模型,该模性转换自Caffe,并对批标准化层(Batch Normalization Layer)进行参数融合。采用如下命令下载预训练模型:
sh ./pretrained/download.sh
......@@ -71,6 +59,18 @@ Faster RCNN 目标检测模型
# not to install the COCO API into global site-packages
python2 setup.py install --user
数据准备完毕后,可以通过如下的方式启动训练:
python train.py \
--model_save_dir=output/ \
--pretrained_model=${path_to_pretrain_model}
--data_dir=${path_to_data}
- 通过设置export CUDA\_VISIBLE\_DEVICES=0,1,2,3,4,5,6,7指定8卡GPU训练。
- 可选参数见:
python train.py --help
**数据读取器说明:** 数据读取器定义在reader.py中。所有图像将短边等比例缩放至`scales`,若长边大于`max_size`, 则再次将长边等比例缩放至`max_size`。在训练阶段,对图像采用水平翻转。支持将同一个batch内的图像padding为相同尺寸。
**模型设置:**
......
......@@ -28,6 +28,7 @@ from __future__ import unicode_literals
import cv2
import numpy as np
from config import cfg
import os
def get_image_blob(roidb, mode):
......@@ -43,8 +44,11 @@ def get_image_blob(roidb, mode):
target_size = cfg.TEST.scales[0]
max_size = cfg.TEST.max_size
im = cv2.imread(roidb['image'])
assert im is not None, \
'Failed to read image \'{}\''.format(roidb['image'])
try:
assert im is not None
except AssertionError as e:
print('Failed to read image \'{}\''.format(roidb['image']))
os._exit(0)
if roidb['flipped']:
im = im[:, ::-1, :]
im, im_scale = prep_im_for_blob(im, cfg.pixel_means, target_size, max_size)
......
......@@ -3,7 +3,7 @@
# This file is only used for continuous evaluation.
export FLAGS_cudnn_deterministic=True
export ce_mode=1
(CUDA_VISIBLE_DEVICES=6 python c_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True & \
CUDA_VISIBLE_DEVICES=7 python dc_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True) | python _ce.py
(CUDA_VISIBLE_DEVICES=2 python c_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True & \
CUDA_VISIBLE_DEVICES=3 python dc_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True) | python _ce.py
......@@ -7,13 +7,15 @@ large-scaled distributed training with two distributed mode: parameter server mo
Before getting started, please make sure you have go throught the imagenet [Data Preparation](../README.md#data-preparation).
1. The entrypoint file is `dist_train.py`, some important flags are as follows:
1. The entrypoint file is `dist_train.py`, the commandline arguments are almost the same as the original `train.py`, with the following arguments specific to distributed training.
- `model`, the model to run with, default is the fine tune model `DistResnet`.
- `batch_size`, the batch_size per device.
- `update_method`, specify the update method, can choose from local, pserver or nccl2.
- `device`, use CPU or GPU device.
- `gpus`, the GPU device count that the process used.
- `multi_batch_repeat`, set this greater than 1 to merge batches before pushing gradients to pservers.
- `start_test_pass`, when to start running tests.
- `num_threads`, how many threads will be used for ParallelExecutor.
- `split_var`, in pserver mode, whether to split one parameter to several pservers, default True.
- `async_mode`, do async training, defalt False.
- `reduce_strategy`, choose from "reduce", "allreduce".
you can check out more details of the flags by `python dist_train.py --help`.
......@@ -21,66 +23,27 @@ Before getting started, please make sure you have go throught the imagenet [Data
We use the environment variable to distinguish the different training role of a distributed training job.
- `PADDLE_TRAINING_ROLE`, the current training role, should be in [PSERVER, TRAINER].
- `PADDLE_TRAINERS`, the trainer count of a job.
- `PADDLE_CURRENT_IP`, the current instance IP.
- `PADDLE_PSERVER_IPS`, the parameter server IP list, separated by "," only be used with update_method is pserver.
- `PADDLE_TRAINER_ID`, the unique trainer ID of a job, the ranging is [0, PADDLE_TRAINERS).
- `PADDLE_PSERVER_PORT`, the port of the parameter pserver listened on.
- `PADDLE_TRAINER_IPS`, the trainer IP list, separated by ",", only be used with upadte_method is nccl2.
### Parameter Server Mode
In this example, we launched 4 parameter server instances and 4 trainer instances in the cluster:
1. launch parameter server process
``` bash
PADDLE_TRAINING_ROLE=PSERVER \
PADDLE_TRAINERS=4 \
PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
PADDLE_CURRENT_IP=192.168.0.100 \
PADDLE_PSERVER_PORT=7164 \
python dist_train.py \
--model=DistResnet \
--batch_size=32 \
--update_method=pserver \
--device=CPU \
--data_dir=../data/ILSVRC2012
```
1. launch trainer process
``` bash
PADDLE_TRAINING_ROLE=TRAINER \
PADDLE_TRAINERS=4 \
PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
PADDLE_TRAINER_ID=0 \
PADDLE_PSERVER_PORT=7164 \
python dist_train.py \
--model=DistResnet \
--batch_size=32 \
--update_method=pserver \
--device=GPU \
--data_dir=../data/ILSVRC2012
```
### NCCL2 Collective Mode
1. launch trainer process
``` bash
PADDLE_TRAINING_ROLE=TRAINER \
PADDLE_TRAINERS=4 \
PADDLE_TRAINER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
PADDLE_TRAINER_ID=0 \
python dist_train.py \
--model=DistResnet \
--batch_size=32 \
--update_method=nccl2 \
--device=GPU \
--data_dir=../data/ILSVRC2012
```
- General envs:
- `PADDLE_TRAINER_ID`, the unique trainer ID of a job, the ranging is [0, PADDLE_TRAINERS).
- `PADDLE_TRAINERS_NUM`, the trainer count of a distributed job.
- `PADDLE_CURRENT_ENDPOINT`, current process endpoint.
- Pserver mode:
- `PADDLE_TRAINING_ROLE`, the current training role, should be in [PSERVER, TRAINER].
- `PADDLE_PSERVER_ENDPOINTS`, the parameter server endpoint list, separated by ",".
- NCCL2 mode:
- `PADDLE_TRAINER_ENDPOINTS`, endpoint list for each worker, separated by ",".
### Try Out Different Distributed Training Modes
You can test if distributed training works on a single node before deploying to the "real" cluster.
***NOTE: for best performance, we recommend using multi-process mode, see No.3. And together with fp16.***
1. simply run `python dist_train.py` to start local training with default configuratioins.
2. for pserver mode, run `bash run_ps_mode.sh` to start 2 pservers and 2 trainers, these 2 trainers
will use GPU 0 and 1 to simulate 2 workers.
3. for nccl2 mode, run `bash run_nccl2_mode.sh` to start 2 workers.
4. for local/distributed multi-process mode, run `run_mp_mode.sh` (this test use 4 GPUs).
### Visualize the Training Process
......@@ -88,16 +51,10 @@ It's easy to draw the learning curve accroding to the training logs, for example
the logs of ResNet50 is as follows:
``` text
Pass 0, batch 0, loss 7.0336914, accucacys: [0.0, 0.00390625]
Pass 0, batch 1, loss 7.094781, accucacys: [0.0, 0.0]
Pass 0, batch 2, loss 7.007068, accucacys: [0.0, 0.0078125]
Pass 0, batch 3, loss 7.1056547, accucacys: [0.00390625, 0.00390625]
Pass 0, batch 4, loss 7.133543, accucacys: [0.0, 0.0078125]
Pass 0, batch 5, loss 7.3055463, accucacys: [0.0078125, 0.01171875]
Pass 0, batch 6, loss 7.341838, accucacys: [0.0078125, 0.01171875]
Pass 0, batch 7, loss 7.290557, accucacys: [0.0, 0.0]
Pass 0, batch 8, loss 7.264951, accucacys: [0.0, 0.00390625]
Pass 0, batch 9, loss 7.43522, accucacys: [0.00390625, 0.00390625]
Pass 0, batch 30, loss 7.569439, acc1: 0.0125, acc5: 0.0125, avg batch time 0.1720
Pass 0, batch 60, loss 7.027379, acc1: 0.0, acc5: 0.0, avg batch time 0.1551
Pass 0, batch 90, loss 6.819984, acc1: 0.0, acc5: 0.0125, avg batch time 0.1492
Pass 0, batch 120, loss 6.9076853, acc1: 0.0, acc5: 0.0125, avg batch time 0.1464
```
The below figure shows top 1 train accuracy for local training with 8 GPUs and distributed training
......
import paddle.fluid as fluid
def copyback_repeat_bn_params(main_prog):
repeat_vars = set()
for op in main_prog.global_block().ops:
if op.type == "batch_norm":
repeat_vars.add(op.input("Mean")[0])
repeat_vars.add(op.input("Variance")[0])
for vname in repeat_vars:
real_var = fluid.global_scope().find_var("%s.repeat.0" % vname).get_tensor()
orig_var = fluid.global_scope().find_var(vname).get_tensor()
orig_var.set(np.array(real_var), fluid.CUDAPlace(0)) # test on GPU0
def append_bn_repeat_init_op(main_prog, startup_prog, num_repeats):
repeat_vars = set()
for op in main_prog.global_block().ops:
if op.type == "batch_norm":
repeat_vars.add(op.input("Mean")[0])
repeat_vars.add(op.input("Variance")[0])
for i in range(num_repeats):
for op in startup_prog.global_block().ops:
if op.type == "fill_constant":
for oname in op.output_arg_names:
if oname in repeat_vars:
var = startup_prog.global_block().var(oname)
repeat_var_name = "%s.repeat.%d" % (oname, i)
repeat_var = startup_prog.global_block().create_var(
name=repeat_var_name,
type=var.type,
dtype=var.dtype,
shape=var.shape,
persistable=var.persistable
)
main_prog.global_block()._clone_variable(repeat_var)
startup_prog.global_block().append_op(
type="fill_constant",
inputs={},
outputs={"Out": repeat_var},
attrs=op.all_attrs()
)
import os
import paddle.fluid as fluid
def nccl2_prepare(args, startup_prog):
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
envs = args.dist_env
t.transpile(envs["trainer_id"],
trainers=','.join(envs["trainer_endpoints"]),
current_endpoint=envs["current_endpoint"],
startup_program=startup_prog)
def pserver_prepare(args, train_prog, startup_prog):
config = fluid.DistributeTranspilerConfig()
config.slice_var_up = args.split_var
t = fluid.DistributeTranspiler(config=config)
envs = args.dist_env
training_role = envs["training_role"]
t.transpile(
envs["trainer_id"],
program=train_prog,
pservers=envs["pserver_endpoints"],
trainers=envs["num_trainers"],
sync_mode=not args.async_mode,
startup_program=startup_prog)
if training_role == "PSERVER":
pserver_program = t.get_pserver_program(envs["current_endpoint"])
pserver_startup_program = t.get_startup_program(
envs["current_endpoint"], pserver_program, startup_program=startup_prog)
return pserver_program, pserver_startup_program
elif training_role == "TRAINER":
train_program = t.get_trainer_program()
return train_program, startup_prog
else:
raise ValueError(
'PADDLE_TRAINING_ROLE environment variable must be either TRAINER or PSERVER'
)
import os
def dist_env():
"""
Return a dict of all variable that distributed training may use.
NOTE: you may rewrite this function to suit your cluster environments.
"""
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
num_trainers = 1
training_role = os.getenv("PADDLE_TRAINING_ROLE", "TRAINER")
assert(training_role == "PSERVER" or training_role == "TRAINER")
# - PADDLE_TRAINER_ENDPOINTS means nccl2 mode.
# - PADDLE_PSERVER_ENDPOINTS means pserver mode.
# - PADDLE_CURRENT_ENDPOINT means current process endpoint.
trainer_endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS")
pserver_endpoints = os.getenv("PADDLE_PSERVER_ENDPOINTS")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
if trainer_endpoints:
trainer_endpoints = trainer_endpoints.split(",")
num_trainers = len(trainer_endpoints)
elif pserver_endpoints:
num_trainers = int(os.getenv("PADDLE_TRAINERS_NUM"))
return {
"trainer_id": trainer_id,
"num_trainers": num_trainers,
"current_endpoint": current_endpoint,
"training_role": training_role,
"pserver_endpoints": pserver_endpoints,
"trainer_endpoints": trainer_endpoints
}
#!/bin/bash
# Test using 4 GPUs
export CUDA_VISIBLE_DEVICES="0,1,2,3"
export MODEL="DistResNet"
export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161,127.0.0.1:7162,127.0.0.1:7163"
# PADDLE_TRAINERS_NUM is used only for reader when nccl2 mode
export PADDLE_TRAINERS_NUM="4"
mkdir -p logs
for i in {0..3}
do
PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:716${i}" \
PADDLE_TRAINER_ID="${i}" \
FLAGS_selected_gpus="${i}" \
python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 --fp16 1 --scale_loss 8 &> logs/tr$i.log &
done
#!/bin/bash
export MODEL="DistResNet"
export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
# PADDLE_TRAINERS_NUM is used only for reader when nccl2 mode
export PADDLE_TRAINERS_NUM="2"
mkdir -p logs
# NOTE: set NCCL_P2P_DISABLE so that can run nccl2 distribute train on one node.
PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
PADDLE_TRAINER_ID="0" \
CUDA_VISIBLE_DEVICES="0" \
NCCL_P2P_DISABLE="1" \
python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr0.log &
PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
PADDLE_TRAINER_ID="1" \
CUDA_VISIBLE_DEVICES="1" \
NCCL_P2P_DISABLE="1" \
python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr1.log &
#!/bin/bash
export MODEL="DistResNet"
export PADDLE_PSERVER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
export PADDLE_TRAINERS_NUM="2"
mkdir -p logs
PADDLE_TRAINING_ROLE="PSERVER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps0.log &
PADDLE_TRAINING_ROLE="PSERVER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps1.log &
PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
PADDLE_TRAINER_ID="0" \
CUDA_VISIBLE_DEVICES="0" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr0.log &
PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
PADDLE_TRAINER_ID="1" \
CUDA_VISIBLE_DEVICES="1" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr1.log &
......@@ -14,8 +14,9 @@ train_parameters = {
"learning_strategy": {
"name": "piecewise_decay",
"batch_size": 256,
"epochs": [30, 60, 90],
"steps": [0.1, 0.01, 0.001, 0.0001]
"epochs": [30, 60, 80],
"steps": [0.1, 0.01, 0.001, 0.0001],
"warmup_passes": 5
}
}
......@@ -118,3 +119,4 @@ class DistResNet():
short = self.shortcut(input, num_filters * 4, stride)
return fluid.layers.elementwise_add(x=short, y=conv2, act='relu')
......@@ -130,16 +130,19 @@ def _reader_creator(file_list,
shuffle=False,
color_jitter=False,
rotate=False,
data_dir=DATA_DIR):
data_dir=DATA_DIR,
pass_id_as_seed=0):
def reader():
with open(file_list) as flist:
full_lines = [line.strip() for line in flist]
if shuffle:
if pass_id_as_seed:
np.random.seed(pass_id_as_seed)
np.random.shuffle(full_lines)
if mode == 'train' and os.getenv('PADDLE_TRAINING_ROLE'):
# distributed mode if the env var `PADDLE_TRAINING_ROLE` exits
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
trainer_count = int(os.getenv("PADDLE_TRAINERS", "1"))
trainer_count = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
per_node_lines = len(full_lines) // trainer_count
lines = full_lines[trainer_id * per_node_lines:(trainer_id + 1)
* per_node_lines]
......@@ -166,7 +169,7 @@ def _reader_creator(file_list,
return paddle.reader.xmap_readers(mapper, reader, THREAD, BUF_SIZE)
def train(data_dir=DATA_DIR):
def train(data_dir=DATA_DIR, pass_id_as_seed=0):
file_list = os.path.join(data_dir, 'train_list.txt')
return _reader_creator(
file_list,
......@@ -174,7 +177,8 @@ def train(data_dir=DATA_DIR):
shuffle=True,
color_jitter=False,
rotate=False,
data_dir=data_dir)
data_dir=data_dir,
pass_id_as_seed=pass_id_as_seed)
def val(data_dir=DATA_DIR):
......
......@@ -21,9 +21,7 @@ SSD is readily pluggable into a wide variant standard convolutional network, suc
### Data Preparation
You can use [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) or [MS-COCO dataset](http://cocodataset.org/#download).
If you want to train a model on PASCAL VOC dataset, please download dataset at first, skip this step if you already have one.
Please download [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) at first, skip this step if you already have one.
```bash
cd data/pascalvoc
......@@ -32,30 +30,18 @@ cd data/pascalvoc
The command `download.sh` also will create training and testing file lists.
If you want to train a model on MS-COCO dataset, please download dataset at first, skip this step if you already have one.
```
cd data/coco
./download.sh
```
### Train
#### Download the Pre-trained Model.
We provide two pre-trained models. The one is MobileNet-v1 SSD trained on COCO dataset, but removed the convolutional predictors for COCO dataset. This model can be used to initialize the models when training other datasets, like PASCAL VOC. The other pre-trained model is MobileNet-v1 trained on ImageNet 2012 dataset but removed the last weights and bias in the Fully-Connected layer.
Declaration: the MobileNet-v1 SSD model is converted by [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md). The MobileNet-v1 model is converted from [Caffe](https://github.com/shicai/MobileNet-Caffe).
We will release the pre-trained models by ourself in the upcoming soon.
We provide two pre-trained models. The one is MobileNet-v1 SSD trained on COCO dataset, but removed the convolutional predictors for COCO dataset. This model can be used to initialize the models when training other datasets, like PASCAL VOC. The other pre-trained model is MobileNet-v1 trained on ImageNet 2012 dataset but removed the last weights and bias in the Fully-Connected layer. Download MobileNet-v1 SSD:
- Download MobileNet-v1 SSD:
```bash
./pretrained/download_coco.sh
```
- Download MobileNet-v1:
```bash
./pretrained/download_imagenet.sh
```
Declaration: the MobileNet-v1 SSD model is converted by [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md).
#### Train on PASCAL VOC
......@@ -64,7 +50,6 @@ We will release the pre-trained models by ourself in the upcoming soon.
python -u train.py --batch_size=64 --dataset='pascalvoc' --pretrained_model='pretrained/ssd_mobilenet_v1_coco/'
```
- Set ```export CUDA_VISIBLE_DEVICES=0,1``` to specifiy the number of GPU you want to use.
- Set ```--dataset='coco2014'``` or ```--dataset='coco2017'``` to train model on MS COCO dataset.
- For more help on arguments:
```bash
......@@ -88,19 +73,6 @@ You can evaluate your trained model in different metrics like 11point, integral
python eval.py --dataset='pascalvoc' --model_dir='train_pascal_model/best_model' --data_dir='data/pascalvoc' --test_list='test.txt' --ap_version='11point' --nms_threshold=0.45
```
You can set ```--dataset``` to ```coco2014``` or ```coco2017``` to evaluate COCO dataset. Moreover, we provide `eval_coco_map.py` which uses a COCO-specific mAP metric defined by [COCO committee](http://cocodataset.org/#detections-eval). To use this eval_coco_map.py, [cocoapi](https://github.com/cocodataset/cocoapi) is needed.
Install the cocoapi:
```
# COCOAPI=/path/to/clone/cocoapi
git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
cd $COCOAPI/PythonAPI
# Install into global site-packages
make install
# Alternatively, if you do not have permissions or prefer
# not to install the COCO API into global site-packages
python2 setup.py install --user
```
### Infer and Visualize
`infer.py` is the main caller of the inferring module. Examples of usage are shown below.
```bash
......
......@@ -21,9 +21,8 @@ SSD 可以方便地插入到任何一种标准卷积网络中,比如 VGG、Res
### 数据准备
你可以使用 [PASCAL VOC 数据集](http://host.robots.ox.ac.uk/pascal/VOC/) 或者 [MS-COCO 数据集](http://cocodataset.org/#download)
如果你想在 PASCAL VOC 数据集上进行训练,请先使用下面的命令下载数据集。
请先使用下面的命令下载 [PASCAL VOC 数据集](http://host.robots.ox.ac.uk/pascal/VOC/)
```bash
cd data/pascalvoc
......@@ -32,29 +31,19 @@ cd data/pascalvoc
`download.sh` 命令会自动创建训练和测试用的列表文件。
如果你想在 MS-COCO 数据集上进行训练,请先使用下面的命令下载数据集。
```
cd data/coco
./download.sh
```
### 模型训练
#### 下载预训练模型
我们提供了两个预训练模型。第一个模型是在 COCO 数据集上预训练的 MobileNet-v1 SSD,我们将它的预测头移除了以便在 COCO 以外的数据集上进行训练。第二个模型是在 ImageNet 2012 数据集上预训练的 MobileNet-v1,我们也将最后的全连接层移除以便进行目标检测训练。
声明:MobileNet-v1 SSD 模型转换自[TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md)。MobileNet-v1 模型转换自[Caffe](https://github.com/shicai/MobileNet-Caffe)。我们不久也会发布我们自己预训练的模型。
我们提供了两个预训练模型。第一个模型是在 COCO 数据集上预训练的 MobileNet-v1 SSD,我们将它的预测头移除了以便在 COCO 以外的数据集上进行训练。第二个模型是在 ImageNet 2012 数据集上预训练的 MobileNet-v1,我们也将最后的全连接层移除以便进行目标检测训练。下载 MobileNet-v1 SSD:
- 下载 MobileNet-v1 SSD:
```bash
./pretrained/download_coco.sh
```
- 下载 MobileNet-v1:
```bash
./pretrained/download_imagenet.sh
```
声明:MobileNet-v1 SSD 模型转换自[TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md)。MobileNet-v1 模型转换自[Caffe](https://github.com/shicai/MobileNet-Caffe)
#### 训练
......@@ -63,7 +52,6 @@ cd data/coco
python -u train.py --batch_size=64 --dataset='pascalvoc' --pretrained_model='pretrained/ssd_mobilenet_v1_coco/'
```
- 可以通过设置 ```export CUDA_VISIBLE_DEVICES=0,1``` 指定想要使用的GPU数量。
- 可以通过设置 ```--dataset='coco2014'``````--dataset='coco2017'``` 指定训练 MS-COCO数据集。
- 更多的可选参数见:
```bash
......@@ -80,25 +68,13 @@ cd data/coco
### 模型评估
你可以使用11point、integral等指标在PASCAL VOC 和 COCO 数据集上评估训练好的模型。不失一般性,我们采用相应数据集的测试列表作为样例代码的默认列表,你也可以通过设置```--test_list```来指定自己的测试样本列表。
你可以使用11point、integral等指标在PASCAL VOC 数据集上评估训练好的模型。不失一般性,我们采用相应数据集的测试列表作为样例代码的默认列表,你也可以通过设置```--test_list```来指定自己的测试样本列表。
`eval.py`是评估模块的主要执行程序,调用示例如下:
```bash
python eval.py --dataset='pascalvoc' --model_dir='train_pascal_model/best_model' --data_dir='data/pascalvoc' --test_list='test.txt' --ap_version='11point' --nms_threshold=0.45
```
你可以设置```--dataset``````coco2014``````coco2017```来评估 COCO 数据集。我们也提供了`eval_coco_map.py`以进行[COCO官方评估](http://cocodataset.org/#detections-eval)。若要使用 eval_coco_map.py, 需要首先下载[cocoapi](https://github.com/cocodataset/cocoapi)
```
# COCOAPI=/path/to/clone/cocoapi
git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
cd $COCOAPI/PythonAPI
# Install into global site-packages
make install
# Alternatively, if you do not have permissions or prefer
# not to install the COCO API into global site-packages
python2 setup.py install --user
```
### 模型预测以及可视化
`infer.py`是预测及可视化模块的主要执行程序,调用示例如下:
......
"""
This code is based on https://github.com/fchollet/keras/blob/master/keras/utils/data_utils.py
"""
import time
import numpy as np
import threading
import multiprocessing
try:
import queue
except ImportError:
import Queue as queue
class GeneratorEnqueuer(object):
"""
Builds a queue out of a data generator.
Args:
generator: a generator function which endlessly yields data
use_multiprocessing (bool): use multiprocessing if True,
otherwise use threading.
wait_time (float): time to sleep in-between calls to `put()`.
random_seed (int): Initial seed for workers,
will be incremented by one for each workers.
"""
def __init__(self,
generator,
use_multiprocessing=False,
wait_time=0.05,
random_seed=None):
self.wait_time = wait_time
self._generator = generator
self._use_multiprocessing = use_multiprocessing
self._threads = []
self._stop_event = None
self.queue = None
self._manager = None
self.seed = random_seed
def start(self, workers=1, max_queue_size=10):
"""
Start worker threads which add data from the generator into the queue.
Args:
workers (int): number of worker threads
max_queue_size (int): queue size
(when full, threads could block on `put()`)
"""
def data_generator_task():
"""
Data generator task.
"""
def task():
if (self.queue is not None and
self.queue.qsize() < max_queue_size):
generator_output = next(self._generator)
self.queue.put((generator_output))
else:
time.sleep(self.wait_time)
if not self._use_multiprocessing:
while not self._stop_event.is_set():
with self.genlock:
try:
task()
except Exception:
traceback.print_exc()
self._stop_event.set()
break
else:
while not self._stop_event.is_set():
try:
task()
except Exception:
traceback.print_exc()
self._stop_event.set()
break
try:
if self._use_multiprocessing:
self._manager = multiprocessing.Manager()
self.queue = self._manager.Queue(maxsize=max_queue_size)
self._stop_event = multiprocessing.Event()
else:
self.genlock = threading.Lock()
self.queue = queue.Queue()
self._stop_event = threading.Event()
for _ in range(workers):
if self._use_multiprocessing:
# Reset random seed else all children processes
# share the same seed
np.random.seed(self.seed)
thread = multiprocessing.Process(target=data_generator_task)
thread.daemon = True
if self.seed is not None:
self.seed += 1
else:
thread = threading.Thread(target=data_generator_task)
self._threads.append(thread)
thread.start()
except:
self.stop()
raise
def is_running(self):
"""
Returns:
bool: Whether the worker theads are running.
"""
return self._stop_event is not None and not self._stop_event.is_set()
def stop(self, timeout=None):
"""
Stops running threads and wait for them to exit, if necessary.
Should be called by the same thread which called `start()`.
Args:
timeout(int|None): maximum time to wait on `thread.join()`.
"""
if self.is_running():
self._stop_event.set()
for thread in self._threads:
if self._use_multiprocessing:
if thread.is_alive():
thread.terminate()
else:
thread.join(timeout)
if self._manager:
self._manager.shutdown()
self._threads = []
self._stop_event = None
self.queue = None
def get(self):
"""
Creates a generator to extract data from the queue.
Skip the data if it is `None`.
# Yields
tuple of data in the queue.
"""
while self.is_running():
if not self.queue.empty():
inputs = self.queue.get()
if inputs is not None:
yield inputs
else:
time.sleep(self.wait_time)
......@@ -52,7 +52,7 @@ def build_program(main_prog, startup_prog, args, data_args):
nmsed_out = fluid.layers.detection_output(
locs, confs, box, box_var, nms_threshold=args.nms_threshold)
with fluid.program_guard(main_prog):
map = fluid.evaluator.DetectionMAP(
map = fluid.metrics.DetectionMAP(
nmsed_out,
gt_label,
gt_box,
......
......@@ -47,7 +47,7 @@ def eval(args, data_args, test_list, batch_size, model_dir=None):
gt_iscrowd = fluid.layers.data(
name='gt_iscrowd', shape=[1], dtype='int32', lod_level=1)
gt_image_info = fluid.layers.data(
name='gt_image_id', shape=[3], dtype='int32', lod_level=1)
name='gt_image_id', shape=[3], dtype='int32')
locs, confs, box, box_var = mobile_net(num_classes, image, image_shape)
nmsed_out = fluid.layers.detection_output(
......@@ -57,14 +57,14 @@ def eval(args, data_args, test_list, batch_size, model_dir=None):
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
# yapf: disable
if model_dir:
def if_exist(var):
return os.path.exists(os.path.join(model_dir, var.name))
fluid.io.load_vars(exe, model_dir, predicate=if_exist)
# yapf: enable
test_reader = paddle.batch(
reader.test(data_args, test_list), batch_size=batch_size)
test_reader = reader.test(data_args, test_list, batch_size)
feeder = fluid.DataFeeder(
place=place,
feed_list=[image, gt_box, gt_label, gt_iscrowd, gt_image_info])
......@@ -146,8 +146,7 @@ if __name__ == '__main__':
mean_value=[args.mean_value_B, args.mean_value_G, args.mean_value_R],
apply_distort=False,
apply_expand=False,
ap_version=args.ap_version,
toy=0)
ap_version=args.ap_version)
eval(
args,
data_args=data_args,
......
......@@ -12,17 +12,17 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import image_util
from paddle.utils.image_util import *
from PIL import Image
from PIL import ImageDraw
import numpy as np
import xml.etree.ElementTree
import os
import time
import copy
import six
from data_util import GeneratorEnqueuer
import math
import numpy as np
from PIL import Image
from PIL import ImageDraw
import image_util
import paddle
class Settings(object):
......@@ -162,26 +162,14 @@ def preprocess(img, bbox_labels, mode, settings):
return img, sampled_labels
def coco(settings, file_list, mode, batch_size, shuffle):
# cocoapi
def coco(settings, coco_api, file_list, mode, batch_size, shuffle, data_dir):
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
coco = COCO(file_list)
image_ids = coco.getImgIds()
images = coco.loadImgs(image_ids)
print("{} on {} with {} images".format(mode, settings.dataset, len(images)))
def reader():
if mode == 'train' and shuffle:
np.random.shuffle(images)
np.random.shuffle(file_list)
batch_out = []
if '2014' in file_list:
sub_dir = "train2014" if model == "train" else "val2014"
elif '2017' in file_list:
sub_dir = "train2017" if mode == "train" else "val2017"
data_dir = os.path.join(settings.data_dir, sub_dir)
for image in images:
for image in file_list:
image_name = image['file_name']
image_path = os.path.join(data_dir, image_name)
if not os.path.exists(image_path):
......@@ -195,8 +183,8 @@ def coco(settings, file_list, mode, batch_size, shuffle):
# layout: category_id | xmin | ymin | xmax | ymax | iscrowd
bbox_labels = []
annIds = coco.getAnnIds(imgIds=image['id'])
anns = coco.loadAnns(annIds)
annIds = coco_api.getAnnIds(imgIds=image['id'])
anns = coco_api.loadAnns(annIds)
for ann in anns:
bbox_sample = []
# start from 1, leave 0 to background
......@@ -236,16 +224,12 @@ def coco(settings, file_list, mode, batch_size, shuffle):
def pascalvoc(settings, file_list, mode, batch_size, shuffle):
flist = open(file_list)
images = [line.strip() for line in flist]
print("{} on {} with {} images".format(mode, settings.dataset, len(images)))
def reader():
if mode == 'train' and shuffle:
np.random.shuffle(images)
np.random.shuffle(file_list)
batch_out = []
cnt = 0
for image in images:
for image in file_list:
image_path, label_path = image.split()
image_path = os.path.join(settings.data_dir, image_path)
label_path = os.path.join(settings.data_dir, label_path)
......@@ -299,52 +283,55 @@ def train(settings,
file_list,
batch_size,
shuffle=True,
use_multiprocessing=True,
num_workers=8,
max_queue=24,
enable_ce=False):
file_list = os.path.join(settings.data_dir, file_list)
file_path = os.path.join(settings.data_dir, file_list)
readers = []
if 'coco' in settings.dataset:
generator = coco(settings, file_list, "train", batch_size, shuffle)
else:
generator = pascalvoc(settings, file_list, "train", batch_size, shuffle)
# cocoapi
from pycocotools.coco import COCO
coco_api = COCO(file_path)
image_ids = coco_api.getImgIds()
images = coco_api.loadImgs(image_ids)
n = int(math.ceil(len(images) // num_workers))
image_lists = [images[i:i + n] for i in range(0, len(images), n)]
def infinite_reader():
while True:
for data in generator():
yield data
def reader():
try:
enqueuer = GeneratorEnqueuer(
infinite_reader(), use_multiprocessing=use_multiprocessing)
enqueuer.start(max_queue_size=max_queue, workers=num_workers)
generator_output = None
while True:
while enqueuer.is_running():
if not enqueuer.queue.empty():
generator_output = enqueuer.queue.get()
break
else:
time.sleep(0.02)
yield generator_output
generator_output = None
finally:
if enqueuer is not None:
enqueuer.stop()
if enable_ce:
return infinite_reader
if '2014' in file_list:
sub_dir = "train2014"
elif '2017' in file_list:
sub_dir = "train2017"
data_dir = os.path.join(settings.data_dir, sub_dir)
for l in image_lists:
readers.append(
coco(settings, coco_api, l, 'train', batch_size, shuffle,
data_dir))
else:
return reader
images = [line.strip() for line in open(file_path)]
n = int(math.ceil(len(images) // num_workers))
image_lists = [images[i:i + n] for i in range(0, len(images), n)]
for l in image_lists:
readers.append(pascalvoc(settings, l, 'train', batch_size, shuffle))
return paddle.reader.multiprocess_reader(readers, False)
def test(settings, file_list, batch_size):
file_list = os.path.join(settings.data_dir, file_list)
if 'coco' in settings.dataset:
return coco(settings, file_list, 'test', batch_size, False)
from pycocotools.coco import COCO
coco_api = COCO(file_list)
image_ids = coco_api.getImgIds()
images = coco_api.loadImgs(image_ids)
if '2014' in file_list:
sub_dir = "val2014"
elif '2017' in file_list:
sub_dir = "val2017"
data_dir = os.path.join(settings.data_dir, sub_dir)
return coco(settings, coco_api, images, 'test', batch_size, False,
data_dir)
else:
return pascalvoc(settings, file_list, 'test', batch_size, False)
image_list = [line.strip() for line in open(file_list)]
return pascalvoc(settings, image_list, 'test', batch_size, False)
def infer(settings, image_path):
......
......@@ -105,7 +105,7 @@ def build_program(main_prog, startup_prog, train_params, is_train):
with fluid.unique_name.guard("inference"):
nmsed_out = fluid.layers.detection_output(
locs, confs, box, box_var, nms_threshold=0.45)
map_eval = fluid.evaluator.DetectionMAP(
map_eval = fluid.metrics.DetectionMAP(
nmsed_out,
gt_label,
gt_box,
......@@ -156,6 +156,7 @@ def train(args,
startup_prog.random_seed = 111
train_prog.random_seed = 111
test_prog.random_seed = 111
num_workers = 1
train_py_reader, loss = build_program(
main_prog=train_prog,
......@@ -186,9 +187,7 @@ def train(args,
train_file_list,
batch_size_per_device,
shuffle=is_shuffle,
use_multiprocessing=True,
num_workers=num_workers,
max_queue=24,
enable_ce=enable_ce)
test_reader = reader.test(data_args, val_file_list, batch_size)
train_py_reader.decorate_paddle_reader(train_reader)
......@@ -205,7 +204,7 @@ def train(args,
def test(epoc_id, best_map):
_, accum_map = map_eval.get_map_var()
map_eval.reset(exe)
every_epoc_map=[]
every_epoc_map=[] # for CE
test_py_reader.start()
try:
batch_id = 0
......@@ -218,22 +217,23 @@ def train(args,
except fluid.core.EOFException:
test_py_reader.reset()
mean_map = np.mean(every_epoc_map)
print("Epoc {0}, test map {1}".format(epoc_id, test_map))
print("Epoc {0}, test map {1}".format(epoc_id, test_map[0]))
if test_map[0] > best_map:
best_map = test_map[0]
save_model('best_model', test_prog)
return best_map, mean_map
train_py_reader.start()
total_time = 0.0
try:
for epoc_id in range(epoc_num):
epoch_idx = epoc_id + 1
start_time = time.time()
prev_start_time = start_time
every_epoc_loss = []
for batch_id in range(iters_per_epoc):
for epoc_id in range(epoc_num):
epoch_idx = epoc_id + 1
start_time = time.time()
prev_start_time = start_time
every_epoc_loss = []
batch_id = 0
train_py_reader.start()
while True:
try:
prev_start_time = start_time
start_time = time.time()
if parallel:
......@@ -242,34 +242,35 @@ def train(args,
loss_v, = exe.run(train_prog, fetch_list=[loss])
loss_v = np.mean(np.array(loss_v))
every_epoc_loss.append(loss_v)
if batch_id % 20 == 0:
if batch_id % 10 == 0:
print("Epoc {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
epoc_id, batch_id, loss_v, start_time - prev_start_time))
end_time = time.time()
total_time += end_time - start_time
best_map, mean_map = test(epoc_id, best_map)
print("Best test map {0}".format(best_map))
if epoc_id % 10 == 0 or epoc_id == epoc_num - 1:
save_model(str(epoc_id), train_prog)
if enable_ce and epoc_id == epoc_num - 1:
train_avg_loss = np.mean(every_epoc_loss)
if devices_num == 1:
print("kpis train_cost %s" % train_avg_loss)
print("kpis test_acc %s" % mean_map)
print("kpis train_speed %s" % (total_time / epoch_idx))
else:
print("kpis train_cost_card%s %s" %
(devices_num, train_avg_loss))
print("kpis test_acc_card%s %s" %
(devices_num, mean_map))
print("kpis train_speed_card%s %f" %
(devices_num, total_time / epoch_idx))
except (fluid.core.EOFException, StopIteration):
train_reader().close()
train_py_reader.reset()
batch_id += 1
except (fluid.core.EOFException, StopIteration):
train_reader().close()
train_py_reader.reset()
break
end_time = time.time()
total_time += end_time - start_time
best_map, mean_map = test(epoc_id, best_map)
print("Best test map {0}".format(best_map))
if epoc_id % 10 == 0 or epoc_id == epoc_num - 1:
save_model(str(epoc_id), train_prog)
if enable_ce:
train_avg_loss = np.mean(every_epoc_loss)
if devices_num == 1:
print("kpis train_cost %s" % train_avg_loss)
print("kpis test_acc %s" % mean_map)
print("kpis train_speed %s" % (total_time / epoch_idx))
else:
print("kpis train_cost_card%s %s" %
(devices_num, train_avg_loss))
print("kpis test_acc_card%s %s" %
(devices_num, mean_map))
print("kpis train_speed_card%s %f" %
(devices_num, total_time / epoch_idx))
if __name__ == '__main__':
......
......@@ -15,7 +15,14 @@
在data目录下,有两个文件夹,train_files中保存的是训练数据,test_files中保存的是测试数据,作为示例,在目录下我们各放置了两个文件,实际训练时,根据自己的实际需要将数据放置在对应目录,并根据数据格式,修改reader.py中的数据读取函数。
## 训练
修改 [train.py](./train.py)`main` 函数,指定数据路径,运行`python train.py`开始训练。
通过运行
```
python train.py --help
```
来获取命令行参数的帮助,设置正确的数据路径等参数后,运行`train.py`开始训练。
训练记录形如
```txt
......@@ -31,7 +38,7 @@ pass_id:2, time_cost:0.740842103958s
```
## 预测
修改 [infer.py](./infer.py)`infer` 函数,指定:需要测试的模型的路径、测试数据、预测标记文件的路径,运行`python infer.py`开始预测。
类似于训练过程,预测时指定需要测试模型的路径、测试数据、预测标记文件的路径,运行`infer.py`开始预测。
预测结果如下
```txt
......
......@@ -12,7 +12,7 @@ import reader
def parse_args():
parser = argparse.ArgumentParser("Run inference.")
parser = argparse.ArgumentParser("Run training.")
parser.add_argument(
'--batch_size',
type=int,
......
......@@ -408,10 +408,19 @@ def test_context(exe, train_exe, dev_count):
test_data = prepare_data_generator(
args, is_test=True, count=dev_count, pyreader=pyreader)
exe.run(startup_prog)
exe.run(startup_prog) # to init pyreader for testing
if TrainTaskConfig.ckpt_path:
fluid.io.load_persistables(
exe, TrainTaskConfig.ckpt_path, main_program=test_prog)
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.use_experimental_executor = True
build_strategy = fluid.BuildStrategy()
test_exe = fluid.ParallelExecutor(
use_cuda=TrainTaskConfig.use_gpu,
main_program=test_prog,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
share_vars_from=train_exe)
def test(exe=test_exe, pyreader=pyreader):
......@@ -457,7 +466,11 @@ def train_loop(exe,
nccl2_trainer_id=0):
# Initialize the parameters.
if TrainTaskConfig.ckpt_path:
fluid.io.load_persistables(exe, TrainTaskConfig.ckpt_path)
exe.run(startup_prog) # to init pyreader for training
logging.info("load checkpoint from {}".format(
TrainTaskConfig.ckpt_path))
fluid.io.load_persistables(
exe, TrainTaskConfig.ckpt_path, main_program=train_prog)
else:
logging.info("init fluid.framework.default_startup_program")
exe.run(startup_prog)
......@@ -741,6 +754,7 @@ if __name__ == "__main__":
LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s"
logging.basicConfig(
stream=sys.stdout, level=logging.DEBUG, format=LOG_FORMAT)
logging.getLogger().setLevel(logging.INFO)
args = parse_args()
train(args)
......@@ -30,7 +30,9 @@ def test(exe, chunk_evaluator, inference_program, test_data, test_fetch_list,
num_infer = np.array(rets[0])
num_label = np.array(rets[1])
num_correct = np.array(rets[2])
chunk_evaluator.update(num_infer[0], num_label[0], num_correct[0])
chunk_evaluator.update(num_infer[0].astype('int64'),
num_label[0].astype('int64'),
num_correct[0].astype('int64'))
return chunk_evaluator.eval()
......@@ -65,11 +67,11 @@ def main(train_data_file,
input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
(precision, recall, f1_score, num_infer_chunks, num_label_chunks,
num_correct_chunks) = fluid.layers.chunk_eval(
input=crf_decode,
label=target,
chunk_scheme="IOB",
num_chunk_types=int(math.ceil((label_dict_len - 1) / 2.0)))
num_correct_chunks) = fluid.layers.chunk_eval(
input=crf_decode,
label=target,
chunk_scheme="IOB",
num_chunk_types=int(math.ceil((label_dict_len - 1) / 2.0)))
chunk_evaluator = fluid.metrics.ChunkEvaluator()
inference_program = fluid.default_main_program().clone(for_test=True)
......
# 文本分类
以下是本例的简要目录结构及说明:
```text
.
|-- README.md # README
|-- data_generator # IMDB数据集生成工具
| |-- IMDB.py # 在data_generator.py基础上扩展IMDB数据集处理逻辑
| |-- build_raw_data.py # IMDB数据预处理,其产出被splitfile.py读取。格式:word word ... | label
| |-- data_generator.py # 与AsyncExecutor配套的数据生成工具框架
| `-- splitfile.py # 将build_raw_data.py生成的文件切分,其产出被IMDB.py读取
|-- data_generator.sh # IMDB数据集生成工具入口
|-- data_reader.py # 预测脚本使用的数据读取工具
|-- infer.py # 预测脚本
`-- train.py # 训练脚本
```
## 简介
本目录包含用fluid.AsyncExecutor训练文本分类任务的脚本。网络模型定义沿用自父目录nets.py
## 训练
1. 运行命令 `sh data_generator.sh`,下载IMDB数据集,并转化成适合AsyncExecutor读取的训练数据
2. 运行命令 `python train.py bow` 开始训练模型。
```python
python train.py bow # bow指定网络结构,可替换成cnn, lstm, gru
```
3. (可选)想自定义网络结构,需在[nets.py](../nets.py)中自行添加,并设置[train.py](./train.py)中的相应参数。
```python
def train(train_reader, # 训练数据
word_dict, # 数据字典
network, # 模型配置
use_cuda, # 是否用GPU
parallel, # 是否并行
save_dirname, # 保存模型路径
lr=0.2, # 学习率大小
batch_size=128, # 每个batch的样本数
pass_num=30): # 训练的轮数
```
## 训练结果示例
```text
pass_id: 0 pass_time_cost 4.723438
pass_id: 1 pass_time_cost 3.867186
pass_id: 2 pass_time_cost 4.490111
pass_id: 3 pass_time_cost 4.573296
pass_id: 4 pass_time_cost 4.180547
pass_id: 5 pass_time_cost 4.214476
pass_id: 6 pass_time_cost 4.520387
pass_id: 7 pass_time_cost 4.149485
pass_id: 8 pass_time_cost 3.821354
pass_id: 9 pass_time_cost 5.136178
pass_id: 10 pass_time_cost 4.137318
pass_id: 11 pass_time_cost 3.943429
pass_id: 12 pass_time_cost 3.766478
pass_id: 13 pass_time_cost 4.235983
pass_id: 14 pass_time_cost 4.796462
pass_id: 15 pass_time_cost 4.668116
pass_id: 16 pass_time_cost 4.373798
pass_id: 17 pass_time_cost 4.298131
pass_id: 18 pass_time_cost 4.260021
pass_id: 19 pass_time_cost 4.244411
pass_id: 20 pass_time_cost 3.705138
pass_id: 21 pass_time_cost 3.728070
pass_id: 22 pass_time_cost 3.817919
pass_id: 23 pass_time_cost 4.698598
pass_id: 24 pass_time_cost 4.859262
pass_id: 25 pass_time_cost 5.725732
pass_id: 26 pass_time_cost 5.102599
pass_id: 27 pass_time_cost 3.876582
pass_id: 28 pass_time_cost 4.762538
pass_id: 29 pass_time_cost 3.797759
```
与fluid.Executor不同,AsyncExecutor在每个pass结束不会将accuracy打印出来。为了观察训练过程,可以将fluid.AsyncExecutor.run()方法的Debug参数设为True,这样每个pass结束会把参数指定的fetch variable打印出来:
```
async_executor.run(
main_program,
dataset,
filelist,
thread_num,
[acc],
debug=True)
```
## 预测
1. 运行命令 `python infer.py bow_model`, 开始预测。
```python
python infer.py bow_model # bow_model指定需要导入的模型
```
## 预测结果示例
```text
model_path: bow_model/epoch0.model, avg_acc: 0.882600
model_path: bow_model/epoch1.model, avg_acc: 0.887920
model_path: bow_model/epoch2.model, avg_acc: 0.886920
model_path: bow_model/epoch3.model, avg_acc: 0.884720
model_path: bow_model/epoch4.model, avg_acc: 0.879760
model_path: bow_model/epoch5.model, avg_acc: 0.876920
model_path: bow_model/epoch6.model, avg_acc: 0.874160
model_path: bow_model/epoch7.model, avg_acc: 0.872000
model_path: bow_model/epoch8.model, avg_acc: 0.870360
model_path: bow_model/epoch9.model, avg_acc: 0.868480
model_path: bow_model/epoch10.model, avg_acc: 0.867240
model_path: bow_model/epoch11.model, avg_acc: 0.866200
model_path: bow_model/epoch12.model, avg_acc: 0.865560
model_path: bow_model/epoch13.model, avg_acc: 0.865160
model_path: bow_model/epoch14.model, avg_acc: 0.864480
model_path: bow_model/epoch15.model, avg_acc: 0.864240
model_path: bow_model/epoch16.model, avg_acc: 0.863800
model_path: bow_model/epoch17.model, avg_acc: 0.863520
model_path: bow_model/epoch18.model, avg_acc: 0.862760
model_path: bow_model/epoch19.model, avg_acc: 0.862680
model_path: bow_model/epoch20.model, avg_acc: 0.862240
model_path: bow_model/epoch21.model, avg_acc: 0.862280
model_path: bow_model/epoch22.model, avg_acc: 0.862080
model_path: bow_model/epoch23.model, avg_acc: 0.861560
model_path: bow_model/epoch24.model, avg_acc: 0.861280
model_path: bow_model/epoch25.model, avg_acc: 0.861160
model_path: bow_model/epoch26.model, avg_acc: 0.861080
model_path: bow_model/epoch27.model, avg_acc: 0.860920
model_path: bow_model/epoch28.model, avg_acc: 0.860800
model_path: bow_model/epoch29.model, avg_acc: 0.860760
```
注:过拟合导致acc持续下降,请忽略
#!/usr/bin/env bash
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
pushd .
cd ./data_generator
# wget "http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz"
if [ ! -f aclImdb_v1.tar.gz ]; then
wget "http://10.64.74.104:8080/paddle/dataset/imdb/aclImdb_v1.tar.gz"
fi
tar zxvf aclImdb_v1.tar.gz
mkdir train_data
python build_raw_data.py train | python splitfile.py 12 train_data
mkdir test_data
python build_raw_data.py test | python splitfile.py 12 test_data
/opt/python27/bin/python IMDB.py train_data
/opt/python27/bin/python IMDB.py test_data
mv ./output_dataset/train_data ../
mv ./output_dataset/test_data ../
cp aclImdb/imdb.vocab ../
rm -rf ./output_dataset
rm -rf train_data
rm -rf test_data
rm -rf aclImdb
popd
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
import os, sys
sys.path.append(os.path.abspath(os.path.join('..')))
from data_generator import MultiSlotDataGenerator
class IMDbDataGenerator(MultiSlotDataGenerator):
def load_resource(self, dictfile):
self._vocab = {}
wid = 0
with open(dictfile) as f:
for line in f:
self._vocab[line.strip()] = wid
wid += 1
self._unk_id = len(self._vocab)
self._pattern = re.compile(r'(;|,|\.|\?|!|\s|\(|\))')
def process(self, line):
send = '|'.join(line.split('|')[:-1]).lower().replace("<br />",
" ").strip()
label = [int(line.split('|')[-1])]
words = [x for x in self._pattern.split(send) if x and x != " "]
feas = [
self._vocab[x] if x in self._vocab else self._unk_id for x in words
]
return ("words", feas), ("label", label)
imdb = IMDbDataGenerator()
imdb.load_resource("aclImdb/imdb.vocab")
# data from files
file_names = os.listdir(sys.argv[1])
filelist = []
for i in range(0, len(file_names)):
filelist.append(os.path.join(sys.argv[1], file_names[i]))
line_limit = 2500
process_num = 24
imdb.run_from_files(
filelist=filelist,
line_limit=line_limit,
process_num=process_num,
output_dir=('output_dataset/%s' % (sys.argv[1])))
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Split file into parts
"""
import sys
import os
block = int(sys.argv[1])
datadir = sys.argv[2]
file_list = []
for i in range(block):
file_list.append(open(datadir + "/part-" + str(i), "w"))
id_ = 0
for line in sys.stdin:
file_list[id_ % block].write(line)
id_ += 1
for f in file_list:
f.close()
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import os
import paddle
def parse_fields(fields):
words_width = int(fields[0])
words = fields[1:1 + words_width]
label = fields[-1]
return words, label
def imdb_data_feed_reader(data_dir, batch_size, buf_size):
"""
Data feed reader for IMDB dataset.
This data set has been converted from original format to a format suitable
for AsyncExecutor
See data.proto for data format
"""
def reader():
for file in os.listdir(data_dir):
if file.endswith('.proto'):
continue
with open(os.path.join(data_dir, file), 'r') as f:
for line in f:
fields = line.split(' ')
words, label = parse_fields(fields)
yield words, label
test_reader = paddle.batch(
paddle.reader.shuffle(
reader, buf_size=buf_size), batch_size=batch_size)
return test_reader
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import time
import unittest
import contextlib
import numpy as np
import paddle
import paddle.fluid as fluid
import data_reader
def infer(test_reader, use_cuda, model_path=None):
"""
inference function
"""
if model_path is None:
print(str(model_path) + " cannot be found")
return
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
inference_scope = fluid.core.Scope()
with fluid.scope_guard(inference_scope):
[inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(model_path, exe)
total_acc = 0.0
total_count = 0
for data in test_reader():
acc = exe.run(inference_program,
feed=utils.data2tensor(data, place),
fetch_list=fetch_targets,
return_numpy=True)
total_acc += acc[0] * len(data)
total_count += len(data)
avg_acc = total_acc / total_count
print("model_path: %s, avg_acc: %f" % (model_path, avg_acc))
if __name__ == "__main__":
if __package__ is None:
from os import sys, path
sys.path.append(
os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
import utils
batch_size = 128
model_path = sys.argv[1]
test_data_dirname = 'test_data'
if len(sys.argv) == 3:
test_data_dirname = sys.argv[2]
test_reader = data_reader.imdb_data_feed_reader(
'test_data', batch_size, buf_size=500000)
models = os.listdir(model_path)
for i in range(0, len(models)):
epoch_path = "epoch" + str(i) + ".model"
epoch_path = os.path.join(model_path, epoch_path)
infer(test_reader, use_cuda=False, model_path=epoch_path)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import time
import multiprocessing
import paddle
import paddle.fluid as fluid
def train(network, dict_dim, lr, save_dirname, training_data_dirname, pass_num,
thread_num, batch_size):
file_names = os.listdir(training_data_dirname)
filelist = []
for i in range(0, len(file_names)):
if file_names[i] == 'data_feed.proto':
continue
filelist.append(os.path.join(training_data_dirname, file_names[i]))
dataset = fluid.DataFeedDesc(
os.path.join(training_data_dirname, 'data_feed.proto'))
dataset.set_batch_size(
batch_size) # datafeed should be assigned a batch size
dataset.set_use_slots(['words', 'label'])
data = fluid.layers.data(
name="words", shape=[1], dtype="int64", lod_level=1)
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
avg_cost, acc, prediction = network(data, label, dict_dim)
optimizer = fluid.optimizer.Adagrad(learning_rate=lr)
opt_ops, weight_and_grad = optimizer.minimize(avg_cost)
startup_program = fluid.default_startup_program()
main_program = fluid.default_main_program()
place = fluid.CPUPlace()
executor = fluid.Executor(place)
executor.run(startup_program)
async_executor = fluid.AsyncExecutor(place)
for i in range(pass_num):
pass_start = time.time()
async_executor.run(main_program,
dataset,
filelist,
thread_num, [acc],
debug=False)
print('pass_id: %u pass_time_cost %f' % (i, time.time() - pass_start))
fluid.io.save_inference_model('%s/epoch%d.model' % (save_dirname, i),
[data.name, label.name], [acc], executor)
if __name__ == "__main__":
if __package__ is None:
from os import sys, path
sys.path.append(
os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
from nets import bow_net, cnn_net, lstm_net, gru_net
from utils import load_vocab
batch_size = 4
lr = 0.002
pass_num = 30
save_dirname = ""
thread_num = multiprocessing.cpu_count()
if sys.argv[1] == "bow":
network = bow_net
batch_size = 128
save_dirname = "bow_model"
elif sys.argv[1] == "cnn":
network = cnn_net
lr = 0.01
save_dirname = "cnn_model"
elif sys.argv[1] == "lstm":
network = lstm_net
lr = 0.05
save_dirname = "lstm_model"
elif sys.argv[1] == "gru":
network = gru_net
batch_size = 128
lr = 0.05
save_dirname = "gru_model"
training_data_dirname = 'train_data/'
if len(sys.argv) == 3:
training_data_dirname = sys.argv[2]
if len(sys.argv) == 4:
if thread_num >= int(sys.argv[3]):
thread_num = int(sys.argv[3])
vocab = load_vocab('imdb.vocab')
dict_dim = len(vocab)
train(network, dict_dim, lr, save_dirname, training_data_dirname, pass_num,
thread_num, batch_size)
......@@ -5,8 +5,10 @@
```text
.
├── README.md # 文档
├── train.py # 训练脚本
├── infer.py # 预测脚本
├── train.py # 训练脚本 全词表 cross-entropy
├── train_sample_neg.py # 训练脚本 sample负例 包含bpr loss 和cross-entropy
├── infer.py # 预测脚本 全词表
├── infer_sample_neg.py # 预测脚本 sample负例
├── net.py # 网络结构
├── text2paddle.py # 文本数据转paddle数据
├── cluster_train.py # 多机训练
......@@ -30,6 +32,17 @@ GRU4REC模型的介绍可以参阅论文[Session-based Recommendations with Recu
session-based推荐应用场景非常广泛,比如用户的商品浏览、新闻点击、地点签到等序列数据。
支持三种形式的损失函数, 分别是全词表的cross-entropy, 负采样的Bayesian Pairwise Ranking和负采样的Cross-entropy.
我们基本复现了论文效果,recall@20的效果分别为
全词表 cross entropy : 0.67
负采样 bpr : 0.606
负采样 cross entropy : 0.605
运行样例程序可跳过'RSC15 数据下载及预处理'部分
## RSC15 数据下载及预处理
......@@ -108,25 +121,42 @@ python text2paddle.py raw_train_data/ raw_test_data/ train_data test_data vocab.
```
## 训练
'--use_cuda 1' 表示使用gpu, 缺省表示使用cpu '--parallel 1' 表示使用多卡,缺省表示使用单卡
具体的参数配置可运行
具体的参数配置可运行
```
python train.py -h
```
全词表cross entropy 训练代码
gpu 单机单卡训练
``` bash
CUDA_VISIBLE_DEVICES=0 python train.py --train_dir train_data --use_cuda 1 --batch_size 50 --model_dir model_output
```
GPU 环境
运行命令开始训练模型。
cpu 单机训练
``` bash
python train.py --train_dir train_data --use_cuda 0 --batch_size 50 --model_dir model_output
```
CUDA_VISIBLE_DEVICES=0 python train.py --train_dir train_data/ --use_cuda 1
gpu 单机多卡训练
``` bash
CUDA_VISIBLE_DEVICES=0,1 python train.py --train_dir train_data --use_cuda 1 --parallel 1 --batch_size 50 --model_dir model_output --num_devices 2
```
CPU 环境
运行命令开始训练模型。
cpu 单机多卡训练
``` bash
CPU_NUM=10 python train.py --train_dir train_data --use_cuda 0 --parallel 1 --batch_size 50 --model_dir model_output --num_devices 10
```
python train.py --train_dir train_data/
负采样 bayesian pairwise ranking loss(bpr loss) 训练
```
CUDA_VISIBLE_DEVICES=0 python train_sample_neg.py --loss bpr --use_cuda 1
```
请注意CPU环境下运行单机多卡任务(--parallel 1)时,batch_size应大于cpu核数。
负采样 cross entropy 训练
```
CUDA_VISIBLE_DEVICES=0 python train_sample_neg.py --loss ce --use_cuda 1
```
## 自定义网络结构
......
import argparse
import sys
import time
import math
import unittest
import contextlib
import numpy as np
import six
import paddle.fluid as fluid
import paddle
import net
import utils
def parse_args():
parser = argparse.ArgumentParser("gru4rec benchmark.")
parser.add_argument(
'--test_dir', type=str, default='test_data', help='test file address')
parser.add_argument(
'--start_index', type=int, default='1', help='start index')
parser.add_argument(
'--last_index', type=int, default='3', help='last index')
parser.add_argument(
'--model_dir', type=str, default='model_bpr_recall20', help='model dir')
parser.add_argument(
'--use_cuda', type=int, default='0', help='whether use cuda')
parser.add_argument(
'--batch_size', type=int, default='5', help='batch_size')
parser.add_argument(
'--hid_size', type=int, default='100', help='batch_size')
parser.add_argument(
'--vocab_path', type=str, default='vocab.txt', help='vocab file')
args = parser.parse_args()
return args
def infer(args, vocab_size, test_reader, use_cuda):
""" inference function """
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
hid_size = args.hid_size
batch_size = args.batch_size
with fluid.scope_guard(fluid.core.Scope()):
main_program = fluid.Program()
with fluid.program_guard(main_program):
acc = net.infer_network(vocab_size, batch_size, hid_size)
for epoch in range(start_index, last_index + 1):
copy_program = main_program.clone()
model_path = model_dir + "/epoch_" + str(epoch)
fluid.io.load_params(
executor=exe, dirname=model_path, main_program=copy_program)
accum_num_recall = 0.0
accum_num_sum = 0.0
t0 = time.time()
step_id = 0
for data in test_reader():
step_id += 1
label_data = [dat[1] for dat in data]
ls, lp = utils.to_lodtensor_bpr_test(data, vocab_size,
place)
para = exe.run(
copy_program,
feed={
"src": ls,
"all_label":
np.arange(vocab_size).reshape(vocab_size, 1),
"pos_label": lp
},
fetch_list=[acc.name],
return_numpy=False)
acc_ = np.array(para[0])[0]
data_length = len(
np.concatenate(
label_data, axis=0).astype("int64"))
accum_num_sum += (data_length)
accum_num_recall += (data_length * acc_)
if step_id % 1 == 0:
print("step:%d " % (step_id),
accum_num_recall / accum_num_sum)
t1 = time.time()
print("model:%s recall@20:%.4f time_cost(s):%.2f" %
(model_path, accum_num_recall / accum_num_sum, t1 - t0))
if __name__ == "__main__":
args = parse_args()
start_index = args.start_index
last_index = args.last_index
test_dir = args.test_dir
model_dir = args.model_dir
batch_size = args.batch_size
vocab_path = args.vocab_path
use_cuda = True if args.use_cuda else False
print("start index: ", start_index, " last_index:", last_index)
vocab_size, test_reader = utils.prepare_data(
test_dir,
vocab_path,
batch_size=batch_size,
buffer_size=1000,
word_freq_threshold=0,
is_train=False)
infer(args, vocab_size, test_reader=test_reader, use_cuda=use_cuda)
import paddle.fluid as fluid
def network(vocab_size,
hid_size=100,
init_low_bound=-0.04,
init_high_bound=0.04):
def all_vocab_network(vocab_size,
hid_size=100,
init_low_bound=-0.04,
init_high_bound=0.04):
""" network definition """
emb_lr_x = 10.0
gru_lr_x = 1.0
......@@ -43,8 +44,173 @@ def network(vocab_size,
initializer=fluid.initializer.Uniform(
low=init_low_bound, high=init_high_bound),
learning_rate=fc_lr_x))
cost = fluid.layers.cross_entropy(input=fc, label=dst_wordseq)
acc = fluid.layers.accuracy(input=fc, label=dst_wordseq, k=20)
avg_cost = fluid.layers.mean(x=cost)
return src_wordseq, dst_wordseq, avg_cost, acc
def train_bpr_network(vocab_size, neg_size, hid_size, drop_out=0.2):
""" network definition """
emb_lr_x = 1.0
gru_lr_x = 1.0
fc_lr_x = 1.0
# Input data
src = fluid.layers.data(name="src", shape=[1], dtype="int64", lod_level=1)
pos_label = fluid.layers.data(
name="pos_label", shape=[1], dtype="int64", lod_level=1)
label = fluid.layers.data(
name="label", shape=[neg_size + 1], dtype="int64", lod_level=1)
emb_src = fluid.layers.embedding(
input=src,
size=[vocab_size, hid_size],
param_attr=fluid.ParamAttr(
name="emb",
initializer=fluid.initializer.XavierInitializer(),
learning_rate=emb_lr_x))
emb_src_drop = fluid.layers.dropout(emb_src, dropout_prob=drop_out)
fc0 = fluid.layers.fc(input=emb_src_drop,
size=hid_size * 3,
param_attr=fluid.ParamAttr(
name="gru_fc",
initializer=fluid.initializer.XavierInitializer(),
learning_rate=gru_lr_x),
bias_attr=False)
gru_h0 = fluid.layers.dynamic_gru(
input=fc0,
size=hid_size,
param_attr=fluid.ParamAttr(
name="dy_gru.param",
initializer=fluid.initializer.XavierInitializer(),
learning_rate=gru_lr_x),
bias_attr="dy_gru.bias")
gru_h0_drop = fluid.layers.dropout(gru_h0, dropout_prob=drop_out)
label_re = fluid.layers.sequence_reshape(input=label, new_dim=1)
emb_label = fluid.layers.embedding(
input=label_re,
size=[vocab_size, hid_size],
param_attr=fluid.ParamAttr(
name="emb",
initializer=fluid.initializer.XavierInitializer(),
learning_rate=emb_lr_x))
emb_label_drop = fluid.layers.dropout(emb_label, dropout_prob=drop_out)
gru_exp = fluid.layers.expand(
x=gru_h0_drop, expand_times=[1, (neg_size + 1)])
gru = fluid.layers.sequence_reshape(input=gru_exp, new_dim=hid_size)
ele_mul = fluid.layers.elementwise_mul(emb_label_drop, gru)
red_sum = fluid.layers.reduce_sum(input=ele_mul, dim=1, keep_dim=True)
pre = fluid.layers.sequence_reshape(input=red_sum, new_dim=(neg_size + 1))
cost = fluid.layers.bpr_loss(input=pre, label=pos_label)
cost_sum = fluid.layers.reduce_sum(input=cost)
return src, pos_label, label, cost_sum
def train_cross_entropy_network(vocab_size, neg_size, hid_size, drop_out=0.2):
""" network definition """
emb_lr_x = 1.0
gru_lr_x = 1.0
fc_lr_x = 1.0
# Input data
src = fluid.layers.data(name="src", shape=[1], dtype="int64", lod_level=1)
pos_label = fluid.layers.data(
name="pos_label", shape=[1], dtype="int64", lod_level=1)
label = fluid.layers.data(
name="label", shape=[neg_size + 1], dtype="int64", lod_level=1)
emb_src = fluid.layers.embedding(
input=src,
size=[vocab_size, hid_size],
param_attr=fluid.ParamAttr(
name="emb",
initializer=fluid.initializer.XavierInitializer(),
learning_rate=emb_lr_x))
emb_src_drop = fluid.layers.dropout(emb_src, dropout_prob=drop_out)
fc0 = fluid.layers.fc(input=emb_src_drop,
size=hid_size * 3,
param_attr=fluid.ParamAttr(
name="gru_fc",
initializer=fluid.initializer.XavierInitializer(),
learning_rate=gru_lr_x),
bias_attr=False)
gru_h0 = fluid.layers.dynamic_gru(
input=fc0,
size=hid_size,
param_attr=fluid.ParamAttr(
name="dy_gru.param",
initializer=fluid.initializer.XavierInitializer(),
learning_rate=gru_lr_x),
bias_attr="dy_gru.bias")
gru_h0_drop = fluid.layers.dropout(gru_h0, dropout_prob=drop_out)
label_re = fluid.layers.sequence_reshape(input=label, new_dim=1)
emb_label = fluid.layers.embedding(
input=label_re,
size=[vocab_size, hid_size],
param_attr=fluid.ParamAttr(
name="emb",
initializer=fluid.initializer.XavierInitializer(),
learning_rate=emb_lr_x))
emb_label_drop = fluid.layers.dropout(emb_label, dropout_prob=drop_out)
gru_exp = fluid.layers.expand(
x=gru_h0_drop, expand_times=[1, (neg_size + 1)])
gru = fluid.layers.sequence_reshape(input=gru_exp, new_dim=hid_size)
ele_mul = fluid.layers.elementwise_mul(emb_label_drop, gru)
red_sum = fluid.layers.reduce_sum(input=ele_mul, dim=1, keep_dim=True)
pre = fluid.layers.sequence_reshape(input=red_sum, new_dim=(neg_size + 1))
cost = fluid.layers.cross_entropy(input=pre, label=pos_label)
cost_sum = fluid.layers.reduce_sum(input=cost)
return src, pos_label, label, cost_sum
def infer_network(vocab_size, batch_size, hid_size, dropout=0.2):
src = fluid.layers.data(name="src", shape=[1], dtype="int64", lod_level=1)
emb_src = fluid.layers.embedding(
input=src, size=[vocab_size, hid_size], param_attr="emb")
emb_src_drop = fluid.layers.dropout(
emb_src, dropout_prob=dropout, is_test=True)
fc0 = fluid.layers.fc(input=emb_src_drop,
size=hid_size * 3,
param_attr="gru_fc",
bias_attr=False)
gru_h0 = fluid.layers.dynamic_gru(
input=fc0,
size=hid_size,
param_attr="dy_gru.param",
bias_attr="dy_gru.bias")
gru_h0_drop = fluid.layers.dropout(
gru_h0, dropout_prob=dropout, is_test=True)
all_label = fluid.layers.data(
name="all_label",
shape=[vocab_size, 1],
dtype="int64",
append_batch_size=False)
emb_all_label = fluid.layers.embedding(
input=all_label, size=[vocab_size, hid_size], param_attr="emb")
emb_all_label_drop = fluid.layers.dropout(
emb_all_label, dropout_prob=dropout, is_test=True)
all_pre = fluid.layers.matmul(
gru_h0_drop, emb_all_label_drop, transpose_y=True)
pos_label = fluid.layers.data(
name="pos_label", shape=[1], dtype="int64", lod_level=1)
acc = fluid.layers.accuracy(input=all_pre, label=pos_label, k=20)
return acc
......@@ -17,13 +17,13 @@ SEED = 102
def parse_args():
parser = argparse.ArgumentParser("gru4rec benchmark.")
parser.add_argument(
'--train_dir', type=str, default='train_data', help='train file address')
'--train_dir', type=str, default='train_data', help='train file')
parser.add_argument(
'--vocab_path', type=str, default='vocab.txt', help='vocab file address')
'--vocab_path', type=str, default='vocab.txt', help='vocab file')
parser.add_argument(
'--is_local', type=int, default=1, help='whether local')
'--is_local', type=int, default=1, help='whether is local')
parser.add_argument(
'--hid_size', type=int, default=100, help='hid size')
'--hid_size', type=int, default=100, help='hidden-dim size')
parser.add_argument(
'--model_dir', type=str, default='model_recall20', help='model dir')
parser.add_argument(
......@@ -31,7 +31,7 @@ def parse_args():
parser.add_argument(
'--print_batch', type=int, default=10, help='num of print batch')
parser.add_argument(
'--pass_num', type=int, default=10, help='num of epoch')
'--pass_num', type=int, default=10, help='number of epoch')
parser.add_argument(
'--use_cuda', type=int, default=0, help='whether use gpu')
parser.add_argument(
......@@ -43,9 +43,11 @@ def parse_args():
args = parser.parse_args()
return args
def get_cards(args):
return args.num_devices
def train():
""" do training """
args = parse_args()
......@@ -61,23 +63,23 @@ def train():
buffer_size=1000, word_freq_threshold=0, is_train=True)
# Train program
src_wordseq, dst_wordseq, avg_cost, acc = net.network(vocab_size=vocab_size, hid_size=hid_size)
src_wordseq, dst_wordseq, avg_cost, acc = net.all_vocab_network(
vocab_size=vocab_size, hid_size=hid_size)
# Optimization to minimize lost
sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=args.base_lr)
sgd_optimizer.minimize(avg_cost)
# Initialize executor
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
if parallel:
train_exe = fluid.ParallelExecutor(
use_cuda=use_cuda,
loss_name=avg_cost.name)
use_cuda=use_cuda, loss_name=avg_cost.name)
else:
train_exe = exe
pass_num = args.pass_num
model_dir = args.model_dir
fetch_list = [avg_cost.name]
......@@ -96,10 +98,11 @@ def train():
place)
lod_dst_wordseq = utils.to_lodtensor([dat[1] for dat in data],
place)
ret_avg_cost = train_exe.run(feed={
"src_wordseq": lod_src_wordseq,
"dst_wordseq": lod_dst_wordseq},
fetch_list=fetch_list)
ret_avg_cost = train_exe.run(feed={
"src_wordseq": lod_src_wordseq,
"dst_wordseq": lod_dst_wordseq
},
fetch_list=fetch_list)
avg_ppl = np.exp(ret_avg_cost[0])
newest_ppl = np.mean(avg_ppl)
if i % args.print_batch == 0:
......@@ -114,7 +117,6 @@ def train():
fetch_vars = [avg_cost, acc]
fluid.io.save_inference_model(save_dir, feed_var_names, fetch_vars, exe)
print("model saved in %s" % save_dir)
#exe.close()
print("finish training")
......
import os
import sys
import time
import six
import numpy as np
import math
import argparse
import paddle.fluid as fluid
import paddle
import time
import utils
import net
SEED = 102
def parse_args():
parser = argparse.ArgumentParser("gru4rec benchmark.")
parser.add_argument(
'--train_dir', type=str, default='train_data', help='train file')
parser.add_argument(
'--vocab_path', type=str, default='vocab.txt', help='vocab file')
parser.add_argument(
'--is_local', type=int, default=1, help='whether is local')
parser.add_argument(
'--hid_size', type=int, default=100, help='hidden-dim size')
parser.add_argument(
'--neg_size', type=int, default=10, help='neg item size')
parser.add_argument(
'--loss', type=str, default="bpr", help='loss: bpr/cross_entropy')
parser.add_argument(
'--model_dir', type=str, default='model_bpr_recall20', help='model dir')
parser.add_argument(
'--batch_size', type=int, default=5, help='num of batch size')
parser.add_argument(
'--print_batch', type=int, default=10, help='num of print batch')
parser.add_argument(
'--pass_num', type=int, default=10, help='number of epoch')
parser.add_argument(
'--use_cuda', type=int, default=0, help='whether use gpu')
parser.add_argument(
'--parallel', type=int, default=0, help='whether parallel')
parser.add_argument(
'--base_lr', type=float, default=0.01, help='learning rate')
parser.add_argument(
'--num_devices', type=int, default=1, help='Number of GPU devices')
args = parser.parse_args()
return args
def get_cards(args):
return args.num_devices
def train():
""" do training """
args = parse_args()
hid_size = args.hid_size
train_dir = args.train_dir
vocab_path = args.vocab_path
use_cuda = True if args.use_cuda else False
parallel = True if args.parallel else False
print("use_cuda:", use_cuda, "parallel:", parallel)
batch_size = args.batch_size
vocab_size, train_reader = utils.prepare_data(
train_dir, vocab_path, batch_size=batch_size * get_cards(args),\
buffer_size=1000, word_freq_threshold=0, is_train=True)
# Train program
if args.loss == 'bpr':
src, pos_label, label, avg_cost = net.train_bpr_network(
neg_size=args.neg_size, vocab_size=vocab_size, hid_size=hid_size)
else:
src, pos_label, label, avg_cost = net.train_cross_entropy_network(
neg_size=args.neg_size, vocab_size=vocab_size, hid_size=hid_size)
# Optimization to minimize lost
sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=args.base_lr)
sgd_optimizer.minimize(avg_cost)
# Initialize executor
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
if parallel:
train_exe = fluid.ParallelExecutor(
use_cuda=use_cuda, loss_name=avg_cost.name)
else:
train_exe = exe
pass_num = args.pass_num
model_dir = args.model_dir
fetch_list = [avg_cost.name]
total_time = 0.0
for pass_idx in six.moves.xrange(pass_num):
epoch_idx = pass_idx + 1
print("epoch_%d start" % epoch_idx)
t0 = time.time()
i = 0
newest_ppl = 0
for data in train_reader():
i += 1
ls, lp, ll = utils.to_lodtensor_bpr(data, args.neg_size, vocab_size,
place)
ret_avg_cost = train_exe.run(
feed={"src": ls,
"label": ll,
"pos_label": lp},
fetch_list=fetch_list)
avg_ppl = np.exp(ret_avg_cost[0])
newest_ppl = np.mean(avg_ppl)
if i % args.print_batch == 0:
print("step:%d ppl:%.3f" % (i, newest_ppl))
t1 = time.time()
total_time += t1 - t0
print("epoch:%d num_steps:%d time_cost(s):%f" %
(epoch_idx, i, total_time / epoch_idx))
save_dir = "%s/epoch_%d" % (model_dir, epoch_idx)
fluid.io.save_params(executor=exe, dirname=save_dir)
print("model saved in %s" % save_dir)
print("finish training")
if __name__ == "__main__":
train()
......@@ -7,6 +7,7 @@ import paddle.fluid as fluid
import paddle
import os
def to_lodtensor(data, place):
""" convert to LODtensor """
seq_lens = [len(seq) for seq in data]
......@@ -22,11 +23,74 @@ def to_lodtensor(data, place):
res.set_lod([lod])
return res
def to_lodtensor_bpr(raw_data, neg_size, vocab_size, place):
""" convert to LODtensor """
data = [dat[0] for dat in raw_data]
seq_lens = [len(seq) for seq in data]
cur_len = 0
lod = [cur_len]
for l in seq_lens:
cur_len += l
lod.append(cur_len)
flattened_data = np.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res = fluid.LoDTensor()
res.set(flattened_data, place)
res.set_lod([lod])
data = [dat[1] for dat in raw_data]
pos_data = np.concatenate(data, axis=0).astype("int64")
length = np.size(pos_data)
neg_data = np.tile(pos_data, neg_size)
np.random.shuffle(neg_data)
for ii in range(length * neg_size):
if neg_data[ii] == pos_data[ii / neg_size]:
neg_data[ii] = pos_data[length - 1 - ii / neg_size]
label_data = np.column_stack(
(pos_data.reshape(length, 1), neg_data.reshape(length, neg_size)))
res_label = fluid.LoDTensor()
res_label.set(label_data, place)
res_label.set_lod([lod])
res_pos = fluid.LoDTensor()
res_pos.set(np.zeros([len(flattened_data), 1]).astype("int64"), place)
res_pos.set_lod([lod])
return res, res_pos, res_label
def to_lodtensor_bpr_test(raw_data, vocab_size, place):
""" convert to LODtensor """
data = [dat[0] for dat in raw_data]
seq_lens = [len(seq) for seq in data]
cur_len = 0
lod = [cur_len]
for l in seq_lens:
cur_len += l
lod.append(cur_len)
flattened_data = np.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res = fluid.LoDTensor()
res.set(flattened_data, place)
res.set_lod([lod])
data = [dat[1] for dat in raw_data]
flattened_data = np.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res_pos = fluid.LoDTensor()
res_pos.set(flattened_data, place)
res_pos.set_lod([lod])
return res, res_pos
def get_vocab_size(vocab_path):
with open(vocab_path, "r") as rf:
line = rf.readline()
return int(line.strip())
def prepare_data(file_dir,
vocab_path,
batch_size,
......@@ -45,12 +109,10 @@ def prepare_data(file_dir,
batch_size,
batch_size * 20)
else:
reader = sort_batch(
vocab_size = get_vocab_size(vocab_path)
reader = paddle.batch(
test(
file_dir, buffer_size, data_type=DataType.SEQ),
batch_size,
batch_size * 20)
vocab_size = 0
file_dir, buffer_size, data_type=DataType.SEQ), batch_size)
return vocab_size, reader
......@@ -103,6 +165,7 @@ def sort_batch(reader, batch_size, sort_group_size, drop_last=False):
class DataType(object):
SEQ = 2
def reader_creator(file_dir, n, data_type):
def reader():
files = os.listdir(file_dir)
......@@ -118,10 +181,13 @@ def reader_creator(file_dir, n, data_type):
yield src_seq, trg_seq
else:
assert False, 'error data type'
return reader
def train(train_dir, n, data_type=DataType.SEQ):
return reader_creator(train_dir, n, data_type)
def test(test_dir, n, data_type=DataType.SEQ):
return reader_creator(test_dir, n, data_type)
......@@ -3,31 +3,47 @@
## Introduction
In news recommendation scenarios, different from traditional systems that recommend entertainment items such as movies or music, there are several new problems to solve.
- Very sparse user profile features exist that a user may login a news recommendation app anonymously and a user is likely to read a fresh news item.
- News are generated or disappeared very fast compare with movies or musics. Usually, there will be thousands of news generated in a news recommendation app. The Consumption of news is also fast since users care about newly happened things.
- News are generated or disappeared very fast compare with movies or musics. Usually, there will be thousands of news generated in a news recommendation app. The Consumption of news is also fast since users care about newly happened things.
- User interests may change frequently in the news recommendation setting. The content of news will affect users' reading behaviors a lot even the category of the news does not belong to users' long-term interest. In news recommendation, reading behaviors are determined by both short-term interest and long-term interest of users.
[GRU4Rec](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/gru4rec) models a user's short-term and long-term interest by applying a gated-recurrent-unit on the user's reading history. The generalization ability of recurrent neural network captures users' similarity of reading sequences that alleviates the user profile sparsity problem. However, the paper of GRU4Rec operates on close domain of items that the model predicts which item a user will be interested in through classification method. In news recommendation, news items are dynamic through time that GRU4Rec model can not predict items that do not exist in training dataset.
Sequence Semantic Retrieval(SSR) Model shares the similar idea with Multi-Rate Deep Learning for Temporal Recommendation, SIGIR 2016. Sequence Semantic Retrieval Model has two components, one is the matching model part, the other one is the retrieval part.
- The idea of SSR is to model a user's personalized interest of an item through matching model structure, and the representation of a news item can be computed online even the news item does not exist in training dataset.
- The idea of SSR is to model a user's personalized interest of an item through matching model structure, and the representation of a news item can be computed online even the news item does not exist in training dataset.
- With the representation of news items, we are able to build an vector indexing service online for news prediction and this is the retrieval part of SSR.
## Dataset
Dataset preprocessing follows the method of [GRU4Rec Project](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/gru4rec). Note that you should reuse scripts from GRU4Rec project for data preprocessing.
## Training
Before training, you should set PYTHONPATH environment
The command line options for training can be listed by `python train.py -h`
gpu 单机单卡训练
``` bash
CUDA_VISIBLE_DEVICES=0 python train.py --train_dir train_data --use_cuda 1 --batch_size 50 --model_dir model_output
```
export PYTHONPATH=./models/fluid:$PYTHONPATH
cpu 单机训练
``` bash
python train.py --train_dir train_data --use_cuda 0 --batch_size 50 --model_dir model_output
```
The command line options for training can be listed by `python train.py -h`
gpu 单机多卡训练
``` bash
python train.py --train_file rsc15_train_tr_paddle.txt
CUDA_VISIBLE_DEVICES=0,1 python train.py --train_dir train_data --use_cuda 1 --parallel 1 --batch_size 50 --model_dir model_output --num_devices 2
```
## Build Index
TBA
cpu 单机多卡训练
``` bash
CPU_NUM=10 python train.py --train_dir train_data --use_cuda 0 --parallel 1 --batch_size 50 --model_dir model_output --num_devices 10
```
多机训练 参考fluid/PaddleRec/gru4rec下的配置
## Retrieval
TBA
## Inference
gpu 预测
``` bash
CUDA_VISIBLE_DEVICES=0 python infer.py --test_dir test_data --use_cuda 1 --batch_size 50 --model_dir model_output
```
import sys
import argparse
import time
import math
import unittest
import contextlib
import numpy as np
import six
import paddle.fluid as fluid
import paddle
import utils
import nets as net
def parse_args():
parser = argparse.ArgumentParser("ssr benchmark.")
parser.add_argument(
'--test_dir', type=str, default='test_data', help='test file address')
parser.add_argument(
'--vocab_path', type=str, default='vocab.txt', help='vocab path')
parser.add_argument(
'--start_index', type=int, default='1', help='start index')
parser.add_argument(
'--last_index', type=int, default='10', help='end index')
parser.add_argument(
'--model_dir', type=str, default='model_output', help='model dir')
parser.add_argument(
'--use_cuda', type=int, default='0', help='whether use cuda')
parser.add_argument(
'--batch_size', type=int, default='50', help='batch_size')
parser.add_argument(
'--hid_size', type=int, default='128', help='hidden size')
parser.add_argument(
'--emb_size', type=int, default='128', help='embedding size')
args = parser.parse_args()
return args
def model(vocab_size, emb_size, hidden_size):
user_data = fluid.layers.data(
name="user", shape=[1], dtype="int64", lod_level=1)
all_item_data = fluid.layers.data(
name="all_item", shape=[vocab_size, 1], dtype="int64")
user_emb = fluid.layers.embedding(
input=user_data, size=[vocab_size, emb_size], param_attr="emb.item")
all_item_emb = fluid.layers.embedding(
input=all_item_data, size=[vocab_size, emb_size], param_attr="emb.item")
all_item_emb_re = fluid.layers.reshape(x=all_item_emb, shape=[-1, emb_size])
user_encoder = net.GrnnEncoder(hidden_size=hidden_size)
user_enc = user_encoder.forward(user_emb)
user_hid = fluid.layers.fc(input=user_enc,
size=hidden_size,
param_attr='user.w',
bias_attr="user.b")
user_exp = fluid.layers.expand(x=user_hid, expand_times=[1, vocab_size])
user_re = fluid.layers.reshape(x=user_exp, shape=[-1, hidden_size])
all_item_hid = fluid.layers.fc(input=all_item_emb_re,
size=hidden_size,
param_attr='item.w',
bias_attr="item.b")
cos_item = fluid.layers.cos_sim(X=all_item_hid, Y=user_re)
all_pre_ = fluid.layers.reshape(x=cos_item, shape=[-1, vocab_size])
pos_label = fluid.layers.data(name="pos_label", shape=[1], dtype="int64")
acc = fluid.layers.accuracy(input=all_pre_, label=pos_label, k=20)
return acc
def infer(args, vocab_size, test_reader):
""" inference function """
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
emb_size = args.emb_size
hid_size = args.hid_size
batch_size = args.batch_size
model_path = args.model_dir
with fluid.scope_guard(fluid.core.Scope()):
main_program = fluid.Program()
start_up_program = fluid.Program()
with fluid.program_guard(main_program, start_up_program):
acc = model(vocab_size, emb_size, hid_size)
for epoch in xrange(start_index, last_index + 1):
copy_program = main_program.clone()
model_path = model_dir + "/epoch_" + str(epoch)
fluid.io.load_params(
executor=exe, dirname=model_path, main_program=copy_program)
accum_num_recall = 0.0
accum_num_sum = 0.0
t0 = time.time()
step_id = 0
for data in test_reader():
step_id += 1
user_data, pos_label = utils.infer_data(data, place)
all_item_numpy = np.tile(
np.arange(vocab_size), len(pos_label)).reshape(
len(pos_label), vocab_size, 1)
para = exe.run(copy_program,
feed={
"user": user_data,
"all_item": all_item_numpy,
"pos_label": pos_label
},
fetch_list=[acc.name],
return_numpy=False)
acc_ = para[0]._get_float_element(0)
data_length = len(
np.concatenate(
pos_label, axis=0).astype("int64"))
accum_num_sum += (data_length)
accum_num_recall += (data_length * acc_)
if step_id % 1 == 0:
print("step:%d " % (step_id),
accum_num_recall / accum_num_sum)
t1 = time.time()
print("model:%s recall@20:%.3f time_cost(s):%.2f" %
(model_path, accum_num_recall / accum_num_sum, t1 - t0))
if __name__ == "__main__":
args = parse_args()
start_index = args.start_index
last_index = args.last_index
test_dir = args.test_dir
model_dir = args.model_dir
batch_size = args.batch_size
vocab_path = args.vocab_path
use_cuda = True if args.use_cuda else False
print("start index: ", start_index, " last_index:", last_index)
test_reader, vocab_size = utils.construct_test_data(
test_dir, vocab_path, batch_size=args.batch_size)
infer(args, vocab_size, test_reader=test_reader)
......@@ -17,35 +17,60 @@ import paddle.fluid.layers.nn as nn
import paddle.fluid.layers.tensor as tensor
import paddle.fluid.layers.control_flow as cf
import paddle.fluid.layers.io as io
from PaddleRec.multiview_simnet.nets import BowEncoder
from PaddleRec.multiview_simnet.nets import GrnnEncoder
class BowEncoder(object):
""" bow-encoder """
def __init__(self):
self.param_name = ""
def forward(self, emb):
return nn.sequence_pool(input=emb, pool_type='sum')
class GrnnEncoder(object):
""" grnn-encoder """
def __init__(self, param_name="grnn", hidden_size=128):
self.param_name = param_name
self.hidden_size = hidden_size
def forward(self, emb):
fc0 = nn.fc(input=emb,
size=self.hidden_size * 3,
param_attr=self.param_name + "_fc.w",
bias_attr=False)
gru_h = nn.dynamic_gru(
input=fc0,
size=self.hidden_size,
is_reverse=False,
param_attr=self.param_name + ".param",
bias_attr=self.param_name + ".bias")
return nn.sequence_pool(input=gru_h, pool_type='max')
class PairwiseHingeLoss(object):
def __init__(self, margin=0.8):
self.margin = margin
def forward(self, pos, neg):
loss_part1 = nn.elementwise_sub(
tensor.fill_constant_batch_size_like(
input=pos,
shape=[-1, 1],
value=self.margin,
dtype='float32'),
input=pos, shape=[-1, 1], value=self.margin, dtype='float32'),
pos)
loss_part2 = nn.elementwise_add(loss_part1, neg)
loss_part3 = nn.elementwise_max(
tensor.fill_constant_batch_size_like(
input=loss_part2,
shape=[-1, 1],
value=0.0,
dtype='float32'),
input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'),
loss_part2)
return loss_part3
class SequenceSemanticRetrieval(object):
""" sequence semantic retrieval model """
def __init__(self, embedding_size, embedding_dim, hidden_size):
self.embedding_size = embedding_size
self.embedding_dim = embedding_dim
......@@ -54,48 +79,44 @@ class SequenceSemanticRetrieval(object):
self.user_encoder = GrnnEncoder(hidden_size=hidden_size)
self.item_encoder = BowEncoder()
self.pairwise_hinge_loss = PairwiseHingeLoss()
def get_correct(self, x, y):
less = tensor.cast(cf.less_than(x, y), dtype='float32')
correct = nn.reduce_sum(less)
return correct
def train(self):
user_data = io.data(
name="user", shape=[1], dtype="int64", lod_level=1
)
user_data = io.data(name="user", shape=[1], dtype="int64", lod_level=1)
pos_item_data = io.data(
name="p_item", shape=[1], dtype="int64", lod_level=1
)
name="p_item", shape=[1], dtype="int64", lod_level=1)
neg_item_data = io.data(
name="n_item", shape=[1], dtype="int64", lod_level=1
)
name="n_item", shape=[1], dtype="int64", lod_level=1)
user_emb = nn.embedding(
input=user_data, size=self.emb_shape, param_attr="emb.item"
)
input=user_data, size=self.emb_shape, param_attr="emb.item")
pos_item_emb = nn.embedding(
input=pos_item_data, size=self.emb_shape, param_attr="emb.item"
)
input=pos_item_data, size=self.emb_shape, param_attr="emb.item")
neg_item_emb = nn.embedding(
input=neg_item_data, size=self.emb_shape, param_attr="emb.item"
)
input=neg_item_data, size=self.emb_shape, param_attr="emb.item")
user_enc = self.user_encoder.forward(user_emb)
pos_item_enc = self.item_encoder.forward(pos_item_emb)
neg_item_enc = self.item_encoder.forward(neg_item_emb)
user_hid = nn.fc(
input=user_enc, size=self.hidden_size, param_attr='user.w', bias_attr="user.b"
)
pos_item_hid = nn.fc(
input=pos_item_enc, size=self.hidden_size, param_attr='item.w', bias_attr="item.b"
)
neg_item_hid = nn.fc(
input=neg_item_enc, size=self.hidden_size, param_attr='item.w', bias_attr="item.b"
)
user_hid = nn.fc(input=user_enc,
size=self.hidden_size,
param_attr='user.w',
bias_attr="user.b")
pos_item_hid = nn.fc(input=pos_item_enc,
size=self.hidden_size,
param_attr='item.w',
bias_attr="item.b")
neg_item_hid = nn.fc(input=neg_item_enc,
size=self.hidden_size,
param_attr='item.w',
bias_attr="item.b")
cos_pos = nn.cos_sim(user_hid, pos_item_hid)
cos_neg = nn.cos_sim(user_hid, neg_item_hid)
hinge_loss = self.pairwise_hinge_loss.forward(cos_pos, cos_neg)
avg_cost = nn.mean(hinge_loss)
correct = self.get_correct(cos_neg, cos_pos)
return [user_data, pos_item_data, neg_item_data], \
pos_item_hid, neg_item_hid, avg_cost, correct
return [user_data, pos_item_data,
neg_item_data], cos_pos, avg_cost, correct
......@@ -14,19 +14,22 @@
import random
class Dataset:
def __init__(self):
pass
class Vocab:
def __init__(self):
pass
class YoochooseVocab(Vocab):
def __init__(self):
self.vocab = {}
self.word_array = []
def load(self, filelist):
idx = 0
for f in filelist:
......@@ -47,21 +50,16 @@ class YoochooseVocab(Vocab):
def _get_word_array(self):
return self.word_array
class YoochooseDataset(Dataset):
def __init__(self, y_vocab):
self.vocab_size = len(y_vocab.get_vocab())
self.word_array = y_vocab._get_word_array()
self.vocab = y_vocab.get_vocab()
def __init__(self, vocab_size):
self.vocab_size = vocab_size
def sample_neg(self):
return random.randint(0, self.vocab_size - 1)
def sample_neg_from_seq(self, seq):
return seq[random.randint(0, len(seq) - 1)]
# TODO(guru4elephant): wait memory, should be improved
def sample_from_word_freq(self):
return self.word_array[random.randint(0, len(self.word_array) - 1)]
def _reader_creator(self, filelist, is_train):
def reader():
......@@ -72,23 +70,20 @@ class YoochooseDataset(Dataset):
ids = line.strip().split()
if len(ids) <= 1:
continue
conv_ids = [self.vocab[i] if i in self.vocab else 0 for i in ids]
# random select an index as boundary
# make ids before boundary as sequence
# make id next to boundary right as target
boundary = random.randint(1, len(ids) - 1)
conv_ids = [i for i in ids]
boundary = len(ids) - 1
src = conv_ids[:boundary]
pos_tgt = [conv_ids[boundary]]
if is_train:
neg_tgt = [self.sample_from_word_freq()]
neg_tgt = [self.sample_neg()]
yield [src, pos_tgt, neg_tgt]
else:
yield [src, pos_tgt]
return reader
def train(self, file_list):
return self._reader_creator(file_list, True)
def test(self, file_list):
return self._reader_creator(file_list, False)
0 16
475 473 155
491 21
96 185 96
29 14 13
5 481 11 21 470
70 5 70 11
167 42 167 217
72 15 73 161 172
82 82
97 297 97
193 182 186 183 184 177 214
152 152
163 298 7
39 73 71
490 23 23 496 488 74 23 74 486 23 23 74
17 17
170 170 483 444 443 234
25 472
5 5 11 70 69
149 149 455
356 68 477 468 17 479 66
159 172 6 71 6 6 158 13 494 169
155 44 438 144 500
156 9 9
146 146
173 10 10 461
7 6 6
269 48 268
50 100
323 174 18
69 69 22 98
38 171
22 29 489 10
0 0
11 5
29 13 14 232 231 451 289 452 229
260 11 156
166 160 166 39
223 134 134 420
66 401 68 132 17 84 287 5
39 304
65 84 132
400 211
145 144
16 28 254 48 50 100 42 154 262 133 17
0 0
28 28
11 476 464
61 61 86 86
38 38
463 478
437 265
22 39 485 171 98
434 51 344
16 16
67 67 67 448
22 12 161
15 377 147 147 374
119 317 0
38 484
403 499
432 442
28 0 16 50 465 42
163 487 7 162
99 99 325 423 83 83
154 133
5 37 492 235 160 279
10 10 457 493 10 460
441 4 4 4 4 4 4 4
153 153
159 164 164
328 37
65 65 404 347 431 459
80 80 44 44
61 446
162 495 7 453
157 21 204 68 37 66 469 145
37 151 230 206 240 205 264 87 409 87 288 270 280 329 157 296 454 474
430 445 433
449 14
9 9 9 9
440 238 226
148 148
266 267 181
48 498
263 255 256
458 158 7
72 168 12 165 71 73 173 49
0 0
7 7 6
14 29 13 6 15 14 15 13
480 439 21
450 21 151
12 12 49 14 13 165 12 169 72 15 15
91 91
22 12 49 168
497 101 30 411 30 482 30 53 30 101 176 415 53 447
462 150 150
471 456 131 435 131 467 436 412 227 218 190 466 429 213 326
......@@ -13,87 +13,108 @@
# limitations under the License.
import os
import sys
import time
import argparse
import logging
import paddle.fluid as fluid
import paddle
import reader as reader
import utils
import numpy as np
from nets import SequenceSemanticRetrieval
logging.basicConfig(format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger("fluid")
logger.setLevel(logging.INFO)
def parse_args():
parser = argparse.ArgumentParser("sequence semantic retrieval")
parser.add_argument("--train_file", type=str, help="Training file")
parser.add_argument("--valid_file", type=str, help="Validation file")
parser.add_argument(
"--epochs", type=int, default=10, help="Number of epochs for training")
"--train_dir", type=str, default='train_data', help="Training file")
parser.add_argument(
"--base_lr", type=float, default=0.01, help="learning rate")
parser.add_argument(
'--vocab_path', type=str, default='vocab.txt', help='vocab file')
parser.add_argument(
"--epochs", type=int, default=10, help="Number of epochs")
parser.add_argument(
'--parallel', type=int, default=0, help='whether parallel')
parser.add_argument(
'--use_cuda', type=int, default=0, help='whether use gpu')
parser.add_argument(
'--print_batch', type=int, default=10, help='num of print batch')
parser.add_argument(
"--model_output_dir",
type=str,
default='model_output',
help="Model output folder")
'--model_dir', type=str, default='model_output', help='model dir')
parser.add_argument(
"--sequence_encode_dim",
type=int,
default=128,
help="Dimension of sequence encoder output")
"--hidden_size", type=int, default=128, help="hidden size")
parser.add_argument(
"--matching_dim",
type=int,
default=128,
help="Dimension of hidden layer")
"--batch_size", type=int, default=50, help="number of batch")
parser.add_argument(
"--batch_size", type=int, default=128, help="Batch size for training")
"--embedding_dim", type=int, default=128, help="embedding dim")
parser.add_argument(
"--embedding_dim",
type=int,
default=128,
help="Default Dimension of Embedding")
'--num_devices', type=int, default=1, help='Number of GPU devices')
return parser.parse_args()
def start_train(args):
y_vocab = reader.YoochooseVocab()
y_vocab.load([args.train_file])
logger.info("Load yoochoose vocabulary size: {}".format(len(y_vocab.get_vocab())))
y_data = reader.YoochooseDataset(y_vocab)
train_reader = paddle.batch(
paddle.reader.shuffle(
y_data.train([args.train_file]), buf_size=args.batch_size * 100),
batch_size=args.batch_size)
place = fluid.CPUPlace()
ssr = SequenceSemanticRetrieval(
len(y_vocab.get_vocab()), args.embedding_dim, args.matching_dim
)
input_data, user_rep, item_rep, avg_cost, acc = ssr.train()
optimizer = fluid.optimizer.Adam(learning_rate=1e-4)
def get_cards(args):
return args.num_devices
def train(args):
use_cuda = True if args.use_cuda else False
parallel = True if args.parallel else False
print("use_cuda:", use_cuda, "parallel:", parallel)
train_reader, vocab_size = utils.construct_train_data(
args.train_dir, args.vocab_path, args.batch_size * get_cards(args))
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
ssr = SequenceSemanticRetrieval(vocab_size, args.embedding_dim,
args.hidden_size)
# Train program
train_input_data, cos_pos, avg_cost, acc = ssr.train()
# Optimization to minimize lost
optimizer = fluid.optimizer.Adagrad(learning_rate=args.base_lr)
optimizer.minimize(avg_cost)
startup_program = fluid.default_startup_program()
loop_program = fluid.default_main_program()
data_list = [var.name for var in input_data]
data_list = [var.name for var in train_input_data]
feeder = fluid.DataFeeder(feed_list=data_list, place=place)
exe = fluid.Executor(place)
exe.run(startup_program)
exe.run(fluid.default_startup_program())
if parallel:
train_exe = fluid.ParallelExecutor(
use_cuda=use_cuda, loss_name=avg_cost.name)
else:
train_exe = exe
total_time = 0.0
for pass_id in range(args.epochs):
epoch_idx = pass_id + 1
print("epoch_%d start" % epoch_idx)
t0 = time.time()
i = 0
for batch_id, data in enumerate(train_reader()):
loss_val, correct_val = exe.run(loop_program,
feed=feeder.feed(data),
fetch_list=[avg_cost, acc])
logger.info("Train --> pass: {} batch_id: {} avg_cost: {}, acc: {}".
format(pass_id, batch_id, loss_val,
float(correct_val) / args.batch_size))
fluid.io.save_inference_model(args.model_output_dir,
[var.name for val in input_data],
[user_rep, item_rep, avg_cost, acc], exe)
i += 1
loss_val, correct_val = train_exe.run(
feed=feeder.feed(data), fetch_list=[avg_cost.name, acc.name])
if i % args.print_batch == 0:
logger.info(
"Train --> pass: {} batch_id: {} avg_cost: {}, acc: {}".
format(pass_id, batch_id,
np.mean(loss_val),
float(np.mean(correct_val)) / args.batch_size))
t1 = time.time()
total_time += t1 - t0
print("epoch:%d num_steps:%d time_cost(s):%f" %
(epoch_idx, i, total_time / epoch_idx))
save_dir = "%s/epoch_%d" % (args.model_dir, epoch_idx)
fluid.io.save_params(executor=exe, dirname=save_dir)
print("model saved in %s" % save_dir)
def main():
args = parse_args()
start_train(args)
train(args)
if __name__ == "__main__":
main()
197 196 198 236
93 93 384 362 363 43
336 364 407
421 322
314 388
128 58
138 138
46 46 46
34 34 57 57 57 342 228 321 346 357 59 376
110 110
135 94 135
27 250 27
129 118
18 18 18
81 81 89 89
27 27
20 20 20 20 20 212
33 33 33 33
62 62 62 63 63 55 248 124 381 428 383 382 43 43 261 63
90 90 78 78
399 397 202 141 104 104 245 192 191 271
239 332 283 88
187 313
136 136 324
41 41
352 128
413 414
410 45 45 45 1 1 1 1 1 1 1 1 31 31 31 31
92 334 92
95 285
215 249
390 41
116 116
300 252
2 2 2 2 2
8 8 8 8 8 8
53 241 259
118 129 126 94 137 208 216 299
209 368 139 418 419
311 180
303 302 203 284
369 32 32 32 32 337
207 47 47 47
106 107
143 143
179 178
109 109
405 79 79 371 246
251 417 427
333 88 387 358 123 348 394 360 36 365
3 3 3 3 3
189 188
398 425
107 406
281 201 141
2 2 2
359 54
395 385 293
60 60 60 121 121 233 58 58
24 199 175 24 24 24 351 386 106
115 294
122 122 127 127
35 35
282 393
277 140 140 343 225 123 36 36 36 221 114 114 59 59 117 117 247 367 219 258 222 301 375 350 353 111 111
275 272 273 274 331 330 305 108 76 76 108
26 26 26 408 26
290 18 210 291
372 139 424 113
341 340 335
120 370
224 200
426 416
137 319
402 55
54 54
327 119
125 125
391 396 354 355 389
142 142
295 320
113 366
253 85 85
56 56 310 309 308 307 278 25 25 19 19 3 312 19 19 19 3 25
220 338
34 130
130 120 380 315
339 422
379 378
95 56 392 115
55 124
126 34
349 373 361
195 194
75 75
64 64 64
35 35
40 40 40 242 77 244 77 243
257 316
103 306 102 51 52 103 105 52 52 292 318 112 286 345 237 276 112 51 102 105
import numpy as np
import reader as reader
import os
import logging
import paddle.fluid as fluid
import paddle
def get_vocab_size(vocab_path):
with open(vocab_path, "r") as rf:
line = rf.readline()
return int(line.strip())
def construct_train_data(file_dir, vocab_path, batch_size):
vocab_size = get_vocab_size(vocab_path)
files = [file_dir + '/' + f for f in os.listdir(file_dir)]
y_data = reader.YoochooseDataset(vocab_size)
train_reader = paddle.batch(
paddle.reader.shuffle(
y_data.train(files), buf_size=batch_size * 100),
batch_size=batch_size)
return train_reader, vocab_size
def construct_test_data(file_dir, vocab_path, batch_size):
vocab_size = get_vocab_size(vocab_path)
files = [file_dir + '/' + f for f in os.listdir(file_dir)]
y_data = reader.YoochooseDataset(vocab_size)
test_reader = paddle.batch(y_data.test(files), batch_size=batch_size)
return test_reader, vocab_size
def infer_data(raw_data, place):
data = [dat[0] for dat in raw_data]
seq_lens = [len(seq) for seq in data]
cur_len = 0
lod = [cur_len]
for l in seq_lens:
cur_len += l
lod.append(cur_len)
flattened_data = np.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res = fluid.LoDTensor()
res.set(flattened_data, place)
res.set_lod([lod])
p_label = [dat[1] for dat in raw_data]
pos_label = np.array(p_label).astype("int64").reshape(len(p_label), 1)
return res, pos_label
......@@ -10,12 +10,16 @@ import paddle.fluid as fluid
import paddle
import utils
def parse_args():
parser = argparse.ArgumentParser("gru4rec benchmark.")
parser = argparse.ArgumentParser("tagspace benchmark.")
parser.add_argument(
'--test_dir', type=str, default='test_data', help='test file address')
parser.add_argument(
'--vocab_tag_path', type=str, default='vocab_tag.txt', help='vocab path')
'--vocab_tag_path',
type=str,
default='vocab_tag.txt',
help='vocab path')
parser.add_argument(
'--start_index', type=int, default='1', help='start index')
parser.add_argument(
......@@ -29,6 +33,7 @@ def parse_args():
args = parser.parse_args()
return args
def infer(test_reader, vocab_tag, use_cuda, model_path):
""" inference function """
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
......@@ -39,7 +44,7 @@ def infer(test_reader, vocab_tag, use_cuda, model_path):
model_path, exe)
t0 = time.time()
step_id = 0
true_num = 0
true_num = 0
all_num = 0
size = vocab_tag
value = []
......@@ -48,13 +53,11 @@ def infer(test_reader, vocab_tag, use_cuda, model_path):
lod_text_seq = utils.to_lodtensor([dat[0] for dat in data], place)
lod_tag = utils.to_lodtensor([dat[1] for dat in data], place)
lod_pos_tag = utils.to_lodtensor([dat[2] for dat in data], place)
para = exe.run(
infer_program,
feed={
"text": lod_text_seq,
"pos_tag": lod_tag},
fetch_list=fetch_vars,
return_numpy=False)
para = exe.run(infer_program,
feed={"text": lod_text_seq,
"pos_tag": lod_tag},
fetch_list=fetch_vars,
return_numpy=False)
value.append(para[0]._get_float_element(0))
if step_id % size == 0 and step_id > 1:
all_num += 1
......@@ -66,6 +69,7 @@ def infer(test_reader, vocab_tag, use_cuda, model_path):
print(step_id, 1.0 * true_num / all_num)
t1 = time.time()
if __name__ == "__main__":
args = parse_args()
start_index = args.start_index
......@@ -75,11 +79,20 @@ if __name__ == "__main__":
batch_size = args.batch_size
vocab_tag_path = args.vocab_tag_path
use_cuda = True if args.use_cuda else False
print("start index: ", start_index, " last_index:" ,last_index)
print("start index: ", start_index, " last_index:", last_index)
vocab_text, vocab_tag, test_reader = utils.prepare_data(
test_dir, "", vocab_tag_path, batch_size=1,
neg_size=0, buffer_size=1000, is_train=False)
test_dir,
"",
vocab_tag_path,
batch_size=1,
neg_size=0,
buffer_size=1000,
is_train=False)
for epoch in range(start_index, last_index + 1):
epoch_path = model_dir + "/epoch_" + str(epoch)
infer(test_reader=test_reader, vocab_tag=vocab_tag, use_cuda=False, model_path=epoch_path)
infer(
test_reader=test_reader,
vocab_tag=vocab_tag,
use_cuda=False,
model_path=epoch_path)
此差异已折叠。
......@@ -25,7 +25,14 @@ cd data && ./download.sh && cd ..
```bash
python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled --dict_path data/1-billion_dict
```
如果您想使用我们支持的第三方词汇表,请将--other_dict_path设置为您存放将使用的词汇表的目录,并设置--with_other_dict使用它
如果您想使用自定义的词典形如:
```bash
<UNK>
a
b
c
```
请将--other_dict_path设置为您存放将使用的词典的目录,并设置--with_other_dict使用它
## 训练
训练的命令行选项可以通过`python train.py -h`列出。
......@@ -40,6 +47,14 @@ python train.py \
--with_hs --with_nce --is_local \
2>&1 | tee train.log
```
如果您想使用自定义的词典形如:
```bash
<UNK>
a
b
c
```
请将--other_dict_path设置为您存放将使用的词典的目录,并设置--with_other_dict使用它
### 分布式训练
......
......@@ -29,9 +29,16 @@ This model implement a skip-gram model of word2vector.
Preprocess the training data to generate a word dict.
```bash
python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled --is_local --dict_path data/1-billion_dict
python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled --dict_path data/1-billion_dict
```
if you would like to use our supported third party vocab, please set --other_dict_path as the directory of where you
if you would like to use your own vocab follow the format below:
```bash
<UNK>
a
b
c
```
Then, please set --other_dict_path as the directory of where you
save the vocab you will use and set --with_other_dict flag on to using it.
## Train
......@@ -47,7 +54,8 @@ python train.py \
--with_hs --with_nce --is_local \
2>&1 | tee train.log
```
if you would like to use our supported third party vocab, please set --other_dict_path as the directory of where you
save the vocab you will use and set --with_other_dict flag on to using it.
### Distributed Train
Run a 2 pserver 2 trainer distribute training on a single machine.
......
......@@ -105,7 +105,7 @@ class Word2VecReader(object):
return set(targets)
def train(self, with_hs):
def train(self, with_hs, with_other_dict):
def _reader():
for file in self.filelist:
with io.open(
......@@ -116,7 +116,11 @@ class Word2VecReader(object):
count = 1
for line in f:
if self.trainer_id == count % self.trainer_num:
line = preprocess.strip_lines(line, self.word_count)
if with_other_dict:
line = preprocess.strip_lines(line,
self.word_count)
else:
line = preprocess.text_strip(line)
word_ids = [
self.word_to_id_[word] for word in line.split()
if word in self.word_to_id_
......@@ -140,7 +144,11 @@ class Word2VecReader(object):
count = 1
for line in f:
if self.trainer_id == count % self.trainer_num:
line = preprocess.strip_lines(line, self.word_count)
if with_other_dict:
line = preprocess.strip_lines(line,
self.word_count)
else:
line = preprocess.text_strip(line)
word_ids = [
self.word_to_id_[word] for word in line.split()
if word in self.word_to_id_
......
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册