Commit 04d3498a authored by: G guosheng

Merge branch 'develop' of https://github.com/PaddlePaddle/models into add-inceptionresnetv2

#!/usr/bin/env bash
set -e
readonly VERSION="3.9"
readonly VERSION="3.8"
version=$(clang-format -version)
......
......@@ -16,11 +16,12 @@ addons:
- python
- python-pip
- python2.7-dev
- clang-format-3.8
ssh_known_hosts: 52.76.173.135
before_install:
- sudo pip install -U virtualenv pre-commit pip
- docker pull paddlepaddle/paddle:latest
- if [[ "$JOB" == "PRE_COMMIT" ]]; then sudo ln -s /usr/bin/clang-format-3.8 /usr/bin/clang-format; fi
- sudo pip install -U virtualenv pre-commit pip
- docker pull paddlepaddle/paddle:latest
script:
- exit_code=0
......
......@@ -31,7 +31,7 @@ PaddlePaddle provides rich operation units for building models in a modular way
For the click-through rate (CTR) prediction task, we first present the Wide & Deep model proposed by Google. It combines the strengths of a DNN, which is good at learning abstract features, and logistic regression, which is well suited to large-scale sparse features; it can be used as a relatively mature model framework and has seen some industrial adoption. We also provide a deep neural network model based on factorization machines, which combines factorization machines and deep neural networks to model low-order and high-order interactions between input attributes, respectively.
- 3.1 [Wide & Deep CTR prediction model](https://github.com/PaddlePaddle/models/tree/develop/ctr)
- 3.1 [Wide & Deep CTR prediction model](https://github.com/PaddlePaddle/models/tree/develop/ctr/README.cn.md)
- 3.2 [CTR prediction model based on deep factorization machines](https://github.com/PaddlePaddle/models/tree/develop/deep_fm)
## 4. Text Classification
......@@ -57,7 +57,7 @@ PaddlePaddle provides rich operation units for building models in a modular way
For the structured semantic model task, we show how to model the semantic similarity between two strings. The model supports different network structures such as DNN (fully connected feed-forward network), CNN (convolutional network), and RNN (recurrent neural network), as well as different loss functions for classification, regression, and ranking. This example uses the simplest text data as input; by substituting your own training and prediction data, it can be used in real-world scenarios.
- 6.1 [Deep Structured Semantic Model](https://github.com/PaddlePaddle/models/tree/develop/dssm)
- 6.1 [Deep Structured Semantic Model](https://github.com/PaddlePaddle/models/tree/develop/dssm/README.cn.md)
## 7. Named Entity Recognition
......@@ -73,7 +73,7 @@ PaddlePaddle provides rich operation units for building models in a modular way
For the sequence-to-sequence learning task, we first take machine translation as an example and provide several improved models, including: a sequence-to-sequence mapping model without an attention mechanism, which is the basis of all sequence-to-sequence learning models; Scheduled Sampling, which mitigates the error accumulation problem of RNN models in generation tasks; and neural machine translation with external memory, which enhances the network's memory capacity to complete complex sequence-to-sequence learning tasks. Beyond machine translation, we also provide a model that generates classical Chinese poetry with a deep LSTM network, an example of same-language generation.
- 8.1 [Neural machine translation without attention](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
- 8.1 [Neural machine translation without attention](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention/README.cn.md)
- 8.2 [Improving translation quality with Scheduled Sampling](https://github.com/PaddlePaddle/models/tree/develop/scheduled_sampling)
- 8.3 [Neural machine translation with external memory](https://github.com/PaddlePaddle/models/tree/develop/mt_with_external_memory)
- 8.4 [Generating classical Chinese poetry](https://github.com/PaddlePaddle/models/tree/develop/generate_chinese_poetry)
......@@ -111,7 +111,7 @@ PaddlePaddle provides rich operation units for building models in a modular way
For the object detection task, we introduce the SSD method. SSD (Single Shot MultiBox Detector) is one of the newer and better-performing algorithms in object detection, notable for both fast detection speed and high detection accuracy.
- 12.1 [Single Shot MultiBox Detector](https://github.com/PaddlePaddle/models/tree/develop/ssd)
- 12.1 [Single Shot MultiBox Detector](https://github.com/PaddlePaddle/models/tree/develop/ssd/README.cn.md)
## 13. Scene Text Recognition
......@@ -127,7 +127,6 @@ PaddlePaddle provides rich operation units for building models in a modular way
For the speech recognition task, we provide a complete pipeline based on the DeepSpeech2 model, including feature extraction, data augmentation, model training, language model, and decoding modules, together with a trained model and a hands-on demo, so you can experience speech recognition with your own voice.
- 14.1 [Speech recognition: DeepSpeech2](https://github.com/PaddlePaddle/models/tree/develop/deep_speech_2)
14.1 [Speech recognition: DeepSpeech2](https://github.com/PaddlePaddle/DeepSpeech)
This tutorial is created by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and is licensed under the [Apache-2.0](LICENSE) license.
......@@ -11,7 +11,6 @@ import reader
class BeamSearch(object):
"""
Generate sequence by beam search
NOTE: this class only implements generating one sentence at a time.
"""
def __init__(self,
......
# Click-Through Rate Prediction
The following files are included in this example, together with their descriptions:
```
├── README.md                # this tutorial's markdown document
├── dataset.md               # dataset processing tutorial
├── images                   # images for this tutorial
│   ├── lr_vs_dnn.jpg
│   └── wide_deep.png
├── infer.py                 # inference script
├── network_conf.py          # model network configuration
├── reader.py                # data reader
├── train.py                 # training script
├── utils.py                 # helper functions
└── avazu_data_processer.py  # demo data preprocessing script
```
## Background
CTR (Click-Through Rate)\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\]
prediction estimates the probability that a user clicks a specific link. It is an important step in online advertising, and accurate CTR prediction is essential for maximizing the revenue of an online advertising system.
When there are multiple ad slots, CTR prediction usually serves as the basis for ranking. For example, in a search engine's advertising system, when a user enters a commercially valuable query, the system roughly performs the following steps to display ads:
1. Retrieve the set of ads related to the user's query
2. Filter by business rules and relevance
3. Rank by auction mechanism and CTR
4. Display the ads
As you can see, CTR plays an important role in the final ranking.
### Stages of Development
In industry, CTR models have gone through the following stages:
- Logistic Regression (LR) / GBDT + feature engineering
- LR + DNN features
- DNN + feature engineering
In the early days LR dominated, but more recently DNN models, with their strong learning capacity and increasingly mature performance optimizations,
have gradually taken over the CTR prediction task.
### LR vs DNN
The figure below shows the structures of LR and a 3x2 DNN model:
<p align="center">
<img src="images/lr_vs_dnn.jpg" width="620" hspace='10'/> <br/>
Figure 1. Comparison of LR and DNN model structures
</p>
The blue arrows in the LR diagram map directly onto the corresponding structure in the DNN, so LR and DNN share some common ground (such as weighted sums),
but with the same input dimensionality the model complexity of LR can be much lower than that of the DNN (loosely speaking, the more complex a model, the more capacity it has to learn complex patterns);
for LR to match the learning capacity of a DNN, the input dimensionality, i.e., the number of features, must grow,
which is why LR is always tied to large-scale feature engineering.
The advantage of LR over DNN models is its capacity for large-scale sparse features, in terms of both memory and computation, for which industry has very mature optimizations;
a DNN model, on the other hand, can learn new features on its own, improving feature efficiency to some extent,
which makes it more likely to achieve better results with the same number of features.
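To make the complexity comparison concrete, here is a rough parameter-count sketch (plain Python; the input dimension and hidden sizes are made up for illustration):

```python
# Hypothetical sizes, for illustration only.
input_dim = 1000000     # a large sparse input
hidden_dims = [64, 32]  # a small DNN: input -> 64 -> 32 -> 1

# LR: one weight per input feature, plus a bias.
lr_params = input_dim + 1

# DNN: a weight matrix and a bias vector per layer.
dnn_params = 0
prev_dim = input_dim
for dim in hidden_dims + [1]:
    dnn_params += prev_dim * dim + dim
    prev_dim = dim

print("LR parameters:  %d" % lr_params)   # 1000001
print("DNN parameters: %d" % dnn_params)  # 64002177, roughly 64x more
```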
The following sections demonstrate how to write a model in PaddlePaddle that combines the advantages of both.
## Data and Task Abstraction
We can take `click` as the learning target; the task can be formulated in several ways:
1. Learn click directly as a binary classification problem (0/1)
2. Learning to rank, using pairwise rank (label 1 > 0) or listwise rank
3. Estimate each ad's CTR, pair up the ads under the same query so that the higher-CTR ad ranks above the lower one, and treat it as ranking or classification
We use the first formulation directly and treat it as a classification task.
We use the dataset of the Kaggle `Click-through rate prediction` competition\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model in this example.
See [data process](./dataset.md) for the details of feature processing.
The input format of the model demonstrated in this tutorial is as follows:
```
# <dnn input ids> \t <lr input sparse values> \t click
1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
23 231 \t 1230:0.12 13421:0.9 \t 1
```
The fields are described in detail below:
- `dnn input ids` is a one-hot representation; only the IDs whose value is 1 are listed (note that this is not a variable-length input)
- `lr input sparse values` uses the `ID:VALUE` representation; values should preferably be normalized to the range `[-1, 1]`
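As a quick sanity check on this format, a minimal line parser might look like the following (a sketch; it is not the example's actual `reader.py`):

```python
def parse_line(line):
    # Fields are tab-separated: dnn ids, lr sparse values, click label.
    dnn_part, lr_part, click = line.strip().split('\t')
    dnn_ids = [int(x) for x in dnn_part.split()]
    lr_values = [(int(k), float(v))
                 for k, v in (kv.split(':') for kv in lr_part.split())]
    return dnn_ids, lr_values, int(click)

# The first sample line from above:
print(parse_line("1 23 190\t230:0.12 3421:0.9 23451:0.12\t0"))
```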
In addition, training requires a file that describes the input dimensions of the dnn and lr submodels, in the following format:
```
dnn_input_dim: <int>
lr_input_dim: <int>
```
where `<int>` denotes an integer.
`avazu_data_processer.py` in this directory can process the downloaded demo dataset\[[2](#references)\]; its usage is described below:
```
usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
OUTPUT_DIR
[--num_lines_to_detect NUM_LINES_TO_DETECT]
[--test_set_size TEST_SET_SIZE]
[--train_size TRAIN_SIZE]
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--data_path DATA_PATH
path of the Avazu dataset
--output_dir OUTPUT_DIR
directory to output
--num_lines_to_detect NUM_LINES_TO_DETECT
number of records to detect dataset's meta info
--test_set_size TEST_SET_SIZE
size of the validation dataset(default: 10000)
--train_size TRAIN_SIZE
size of the trainset (default: 100000)
```
- `data_path` is the path of the data to be processed
- `output_dir` is the output directory for the generated data
- `num_lines_to_detect` is the number of lines scanned in advance to collect the feature IDs (i.e., the number of file lines to scan)
- `test_set_size` is the number of lines in the generated test set
- `train_size` is the number of lines in the generated training set
## Wide & Deep Learning Model
Google proposed the Wide & Deep Learning model framework in 2016 to combine the advantages of a DNN, which is good at learning abstract features, and LR, which suits large-scale sparse features.
### Model Overview
The Wide & Deep Learning Model\[[3](#references)\] can be used as a relatively mature model framework,
and it has seen some industrial adoption for CTR prediction, so this article demonstrates using it to complete the CTR prediction task.
The model structure is as follows:
<p align="center">
<img src="images/wide_deep.png" width="820" hspace='10'/> <br/>
Figure 2. Wide & Deep Model
</p>
The Wide part at the top of the model can accommodate large-scale sparse features and has some memorization ability for specific information (such as IDs);
the Deep part at the bottom can learn implicit relationships between features, giving it better learning and inference ability with the same number of features.
### Defining the Model Input
The model takes only 3 inputs:
- `dnn_input`, the input of the Deep part
- `lr_input`, the input of the Wide part
- `click`, whether the ad was clicked, used as the label for the binary classification model
```python
dnn_merged_input = layer.data(
name='dnn_input',
type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))
lr_merged_input = layer.data(
name='lr_input',
type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))
click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
```
### Building the Wide Part
The Wide part uses the LR model directly, except that the activation function is changed to `ReLU` to speed up training:
```python
def build_lr_submodel():
fc = layer.fc(
input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
return fc
```
### Building the Deep Part
The Deep part uses a standard multi-layer feed-forward DNN:
```python
def build_dnn_submodel(dnn_layer_dims):
dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
_input_layer = dnn_embedding
for i, dim in enumerate(dnn_layer_dims[1:]):
fc = layer.fc(
input=_input_layer,
size=dim,
act=paddle.activation.Relu(),
name='dnn-fc-%d' % i)
_input_layer = fc
return _input_layer
```
### Combining the Two
The top-level outputs of the two submodels are combined with a weighted sum; the output uses `sigmoid` as the activation function to produce a prediction in (0, 1)
that approximates the distribution of the binary labels in the training data, and this value is used as the final CTR estimate.
```python
# Combine the DNN and LR submodels.
def combine_submodels(dnn, lr):
merge_layer = layer.concat(input=[dnn, lr])
fc = layer.fc(
input=merge_layer,
size=1,
name='output',
        # Use the sigmoid function to approximate the CTR, which is a float value between 0 and 1.
act=paddle.activation.Sigmoid())
return fc
```
### Defining the Training Task
```python
dnn = build_dnn_submodel(dnn_layer_dims)
lr = build_lr_submodel()
output = combine_submodels(dnn, lr)
# ==============================================================================
# cost and train period
# ==============================================================================
classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
input=output, label=click)
paddle.init(use_gpu=False, trainer_count=11)
params = paddle.parameters.create(classification_cost)
optimizer = paddle.optimizer.Momentum(momentum=0)
trainer = paddle.trainer.SGD(
cost=classification_cost, parameters=params, update_equation=optimizer)
dataset = AvazuDataset(train_data_path, n_records_as_test=test_set_size)
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
logging.warning("Pass %d, Samples %d, Cost %f" % (
event.pass_id, event.batch_id * batch_size, event.cost))
if event.batch_id % 1000 == 0:
result = trainer.test(
reader=paddle.batch(dataset.test, batch_size=1000),
feeding=field_index)
logging.warning("Test %d-%d, Cost %f" % (event.pass_id, event.batch_id,
result.cost))
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(dataset.train, buf_size=500),
batch_size=batch_size),
feeding=field_index,
event_handler=event_handler,
num_passes=100)
```
## Running Training and Testing
Training the model requires the following steps:
1. Prepare the training data
    1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data)
    2. Decompress train.gz to get train.txt
    3. Run `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data
2. Run `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training
In step 2 above, command-line arguments can be passed to `train.py` to customize the training process; the arguments and their usage are listed below:
```
usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
[--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
[--num_passes NUM_PASSES]
[--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
DATA_META_FILE --model_type MODEL_TYPE
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--train_data_path TRAIN_DATA_PATH
path of training dataset
--test_data_path TEST_DATA_PATH
path of testing dataset
--batch_size BATCH_SIZE
size of mini-batch (default:10000)
--num_passes NUM_PASSES
number of passes to train
--model_output_prefix MODEL_OUTPUT_PREFIX
prefix of path for model to store (default:
./ctr_models)
--data_meta_file DATA_META_FILE
path of data meta info file
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
- `train_data_path` : path of the training set
- `test_data_path` : path of the test set
- `num_passes`: number of passes to train the model
- `data_meta_file`: see the description in [Data and Task Abstraction](#data-and-task-abstraction).
- `model_type`: classification or regression
## Making Predictions with the Trained Model
The trained model can be used to make predictions on new data; the prediction data format is
```
# <dnn input ids> \t <lr input sparse values>
1 23 190 \t 230:0.12 3421:0.9 23451:0.12
23 231 \t 1230:0.12 13421:0.9
```
The only difference from the training data format is the absence of the label, i.e., the value in the 3rd column `click` of the training data.
`infer.py` is used as follows:
```
usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
--prediction_output_path PREDICTION_OUTPUT_PATH
[--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--model_gz_path MODEL_GZ_PATH
path of model parameters gz file
--data_path DATA_PATH
path of the dataset to infer
--prediction_output_path PREDICTION_OUTPUT_PATH
path to output the prediction
--data_meta_path DATA_META_PATH
path of trainset's meta info, default is ./data.meta
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
- `model_gz_path`: path of the model compressed with `gz`
- `data_path` : path of the data to predict
- `prediction_output_path`: path of the prediction output
- `data_meta_path` : see the description in [Data and Task Abstraction](#data-and-task-abstraction).
- `model_type` : classification or regression
The demo data can be predicted with the following command:
```
python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt
```
The final predictions are written to `predictions.txt`.
## References
1. <https://en.wikipedia.org/wiki/Click-through_rate>
2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
3. Cheng H T, Koc L, Harmsen J, et al. [Wide & deep learning for recommender systems](https://arxiv.org/pdf/1606.07792.pdf)[C]//Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016: 7-10.
......@@ -120,7 +120,7 @@ The model structure is as follows:
Figure 2. Wide & Deep Model
</p>
The Wide part on the left side of the model can accommodate large-scale sparse features and has some memorization ability for specific information (such as IDs); the Deep part on the right side can learn implicit relationships between features.
The Wide part at the top of the model can accommodate large-scale sparse features and has some memorization ability for specific information (such as IDs); the Deep part at the bottom can learn implicit relationships between features.
### Model Input
......
ctr/images/wide_deep.png (image updated: 139.6 KB → 150.6 KB)
......@@ -131,7 +131,7 @@ def create_cnn(self, emb, prefix=''):
conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
return conv_3, conv_4
return paddle.layer.concat(input=[conv_3, conv_4])
```
The CNN takes a sequence of word embeddings, captures the key information of the original sentence through convolution and pooling, and finally outputs a semantic vector (which can be regarded as a sentence vector).
......@@ -216,54 +216,54 @@ Pairwise Rank reuses the DNN structure above: the same source is scored for similarity against two targets
### Regression Data Format
```
# 3 fields each line:
# - source's word ids
# - target's word ids
# - source word list
# - target word list
# - target
<ids> \t <ids> \t <float>
<word list> \t <word list> \t <float>
```
For example:
```
3 6 10 \t 6 8 33 \t 0.7
6 0 \t 6 9 330 \t 0.03
苹果 六 袋 苹果 6s 0.1
新手 汽车 驾驶 驾校 培训 0.9
```
### Classification Data Format
```
# 3 fields each line:
# - source's word ids
# - target's word ids
# - source word list
# - target word list
# - target
<ids> \t <ids> \t <label>
<word list> \t <word list> \t <label>
```
For example:
```
3 6 10 \t 6 8 33 \t 0
6 10 \t 8 3 1 \t 1
苹果 六 袋 苹果 6s 0
新手 汽车 驾驶 驾校 培训 1
```
### Ranking Data Format
```
# 4 fields each line:
# - source's word ids
# - target1's word ids
# - target2's word ids
# - source word list
# - target1 word list
# - target2 word list
# - label
<ids> \t <ids> \t <ids> \t <label>
<word list> \t <word list> \t <word list> \t <label>
```
For example:
```
7 2 4 \t 2 10 12 \t 9 2 7 10 23 \t 0
7 2 4 \t 10 12 \t 9 2 21 23 \t 1
苹果 六 袋 苹果 6s 新手 汽车 驾驶 1
新手 汽车 驾驶 驾校 培训 苹果 6s 1
```
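To illustrate, one way to read a pairwise ranking line (a sketch, assuming the four fields are tab-separated as specified above):

```python
def parse_rank_line(line):
    # 4 tab-separated fields: source, target1, target2, label.
    source, target1, target2, label = line.strip().split('\t')
    return source.split(), target1.split(), target2.split(), int(label)

# The label encodes the pairwise preference between target1 and target2.
print(parse_rank_line("7 2 4\t2 10 12\t9 2 7 10 23\t0"))
```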
## Running Training
You can run `python train.py -y 0 --model_arch 0` directly with the sample data in the `./data/classification` directory to check that training the classification FC model runs out of the box.
You can run `python train.py -y 0 --model_arch 0 --class_num 2` directly with the sample data in the `./data/classification` directory to check that training the classification FC model runs out of the box.
Other model structures can also be customized from the command line; run `python train.py --help` for the detailed command-line arguments.
......
......@@ -107,7 +107,7 @@ def create_cnn(self, emb, prefix=''):
conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
return conv_3, conv_4
return paddle.layer.concat(input=[conv_3, conv_4])
```
The CNN takes the sequence of word embeddings, processes the data with convolution and pooling, and finally outputs a semantic vector.
......@@ -190,62 +190,62 @@ Below is a simple example for the data in `./data`
### Regression data format
```
# 3 fields each line:
# - source's word ids
# - target's word ids
# - source word list
# - target word list
# - target
<ids> \t <ids> \t <float>
<word list> \t <word list> \t <float>
```
An example of this format is shown below.
```
3 6 10 \t 6 8 33 \t 0.7
6 0 \t 6 9 330 \t 0.03
Six bags of apples Apple 6s 0.1
The new driver The driving school 0.9
```
### Classification data format
```
# 3 fields each line:
# - source's word ids
# - target's word ids
# - source word list
# - target word list
# - target
<ids> \t <ids> \t <label>
<word list> \t <word list> \t <label>
```
An example of this format is shown below.
```
3 6 10 \t 6 8 33 \t 0
6 10 \t 8 3 1 \t 1
Six bags of apples Apple 6s 0
The new driver The driving school 1
```
### Ranking data format
```
# 4 fields each line:
# - source's word ids
# - target1's word ids
# - target2's word ids
# - source word list
# - target1 word list
# - target2 word list
# - label
<ids> \t <ids> \t <ids> \t <label>
<word list> \t <word list> \t <word list> \t <label>
```
An example of this format is shown below.
```
7 2 4 \t 2 10 12 \t 9 2 7 10 23 \t 0
7 2 4 \t 10 12 \t 9 2 21 23 \t 1
Six bags of apples Apple 6s The new driver 1
The new driver The driving school Apple 6s 1
```
## Training
We use `python train.py -y 0 --model_arch 0` with the data in `./data/classification` to train a DSSM model for classification. The parameters of the script `train.py` can be listed by running `python train.py --help`. Some important parameters are:
We use `python train.py -y 0 --model_arch 0 --class_num 2` with the data in `./data/classification` to train a DSSM model for classification. The parameters of the script `train.py` can be listed by running `python train.py --help`. Some important parameters are:
- `train_data_path` Training data path
- `test_data_path` Test data path, optional
- `source_dic_path` Source dictionary path
- `target_dic_path` Target dictionary path
- `model_type` The type of the model's loss function: classification 0, ranking 1, regression 2
- `model_arch` Model structure: FC 0,CNN 1, RNN 2
- `dnn_dims` The dimensions of each layer of the model; the default is `256,128,64,32`, i.e., 4 layers.
......
......@@ -146,12 +146,12 @@ class DSSM(object):
pool_bias_attr=ParamAttr(name=key + "_pool.b"))
return conv
logger.info("create a sequence_conv_pool which context width is 3")
logger.info("create a sequence_conv_pool whose context width is 3.")
conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
logger.info("create a sequence_conv_pool which context width is 4")
logger.info("create a sequence_conv_pool whose context width is 4.")
conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
return conv_3, conv_4
return paddle.layer.concat(input=[conv_3, conv_4])
def create_dnn(self, sent_vec, prefix):
    # if there are more than three layers, then a fc layer will be added.
......
......@@ -61,7 +61,7 @@ An RNN is a sequence model; the basic idea is that at time $t$, the hidden state of the previous time step $t-1$ is fed in together with
To run this example:
* 1. Run `python train.py` to start training the model (RNN by default) and wait for training to finish.
* 1. Run `python train.py` to start training the model (LSTM by default) and wait for training to finish.
* 2. Run `python generate.py` to generate text. (The input text defaults to `data/train_data_examples.txt`, and the generated text is saved to `data/gen_result.txt` by default.)
......
......@@ -85,9 +85,9 @@ def resnet_cifar10(input, class_dim, depth=32):
nStages = {16, 64, 128}
conv1 = conv_bn_layer(
input, ch_in=3, ch_out=16, filter_size=3, stride=1, padding=1)
res1 = layer_warp(basicblock, conv1, 16, 16, n, 1)
res2 = layer_warp(basicblock, res1, 16, 32, n, 2)
res3 = layer_warp(basicblock, res2, 32, 64, n, 2)
res1 = layer_warp(basicblock, conv1, 16, n, 1)
res2 = layer_warp(basicblock, res1, 32, n, 2)
res3 = layer_warp(basicblock, res2, 64, n, 2)
pool = paddle.layer.img_pool(
input=res3, pool_size=8, stride=1, pool_type=paddle.pooling.Avg())
out = paddle.layer.fc(
......
## Usage
The utility class `TFModelConverter` in the `tf2paddle.py` script converts a model trained with TensorFlow into a model file loadable by PaddlePaddle. It currently supports the layers commonly used in computer vision: convolution (`Convolution`) layers, `Batch Normalization` layers, and fully connected (`Full Connection`) layers. Since commonly used vision networks such as `ResNet` and `VGG` are built from these layers, `ResNet` and `VGG` models trained with TensorFlow can be converted into models loadable by PaddlePaddle and used further for pre-training or for building inference services.
The basic conversion workflow is:
1. Rewrite the TensorFlow model equivalently using the PaddlePaddle Python API.
1. In TensorFlow, learnable parameters are represented by `Variable`s; retrieve the network's Variables via TensorFlow's Python API.
1. Determine the correspondence between the `Variable`s in the TensorFlow model and the learnable parameters of the `paddle.layer`s in PaddlePaddle.
1. Adapt the TensorFlow `Variable`s as needed (see below), convert them into PaddlePaddle's parameter storage format, and serialize them to disk.
### Conventions to Follow
To map the `Variable`s in a TensorFlow model correctly onto the learnable parameters of the `paddle.layer`s, the current version imposes the following constraints:
1. Currently, only parameters trained by the three TensorFlow operators with learnable `Variable`s, `conv2d`, `batchnorm`, and `fc`, can be converted into PaddlePaddle model parameters.
1. In the TensorFlow network configuration, the `Variable`s within the same operator belong to the same scope; this is the basis for assigning `Variable`s to different `paddle.layer`s.
1. The scopes of `conv2d`, `batchnorm`, and `fc` must contain `conv`, `bn`, and `fc` respectively, which is how the type of the corresponding `paddle.layer` is determined. This constraint can be bypassed by passing a `layer_type_map` `dict` to `TFModelConverter`, mapping each scope to the type of the corresponding `paddle.layer`.
1. For `conv2d` and `fc`, the `Variable` order is the learnable `Weight` first, then `Bias`; for `batchnorm`, the `Variable` order is `scale`, `shift`, `mean`, `var`. Mind this storage order when mapping `Variable`s to the parameters at the corresponding positions of `paddle.layer.batch_norm`.
1. The topological order of the TensorFlow network must match that of the PaddlePaddle network, especially the order in which branches are defined when the network contains branch structures, such as the two branches in ResNet's bottleneck module. This applies when both the conversion and the PaddlePaddle network configuration use PaddlePaddle's default parameter naming, in which case parameters are named by topological order.
1. If the PaddlePaddle network configuration needs to set learnable parameter names explicitly via `param_attr=paddle.attr.Param(name="XX")`, you can pass `layer_name_map` or `param_name_map` dictionaries (of type Python `dict`) to `TFModelConverter`, which map the `Variable` names to the names of the learnable parameters in the corresponding `paddle.layer.XX` during conversion.
1. A `build_model` function must be provided to build the TensorFlow network from it, load the model, and return the session. It can be written following this example:
```python
def build_model():
build_graph()
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
sess.run(tf.tables_initializer())
saver = tf.train.Saver()
saver.restore(sess, 'model/model.ckpt')
return sess
```
### Usage Example
Once the rules above are followed, the `main` function of the `tf2paddle.py` script provides a calling example that converts a TensorFlow-trained `ResNet50` model into a PaddlePaddle-loadable model. To convert other custom models, modify the relevant variable values and run `python tf2paddle.py` in the terminal.
A simple calling example follows:
```python
# Define the relevant variables
tf_net = "TF_ResNet50"                     # name of the module that provides build_model
paddle_tar_name = "Paddle_ResNet50.tar.gz" # file name of the output Paddle model
# Initialize the converter and load the model
converter = TFModelConverter(tf_net=tf_net,
                             paddle_tar_name=paddle_tar_name)
# Run the conversion
converter.convert()
```
### Notes
1. Because TensorFlow's padding mechanism is rather particular, when writing the PaddlePaddle network configuration, layers that require padding such as `paddle.layer.conv` may need the size computed in advance and `paddle.layer.pad` applied outside `paddle.layer.conv`.
1. Unlike TensorFlow, which mostly organizes image input in the NHWC data layout, PaddlePaddle organizes image input data in the NCHW layout.
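As a minimal illustration of this layout difference (plain NumPy; the batch shape is made up), converting NHWC image data to NCHW is a single transpose:

```python
import numpy as np

# A hypothetical batch of 8 RGB images of size 224x224 in TensorFlow's NHWC layout.
batch_nhwc = np.random.rand(8, 224, 224, 3).astype(np.float32)

# PaddlePaddle expects NCHW: move the channel axis in front of height and width.
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))
assert batch_nchw.shape == (8, 3, 224, 224)
```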
import os
import re
import collections
import struct
import gzip
import tarfile
import cStringIO
import numpy as np
import tensorflow as tf
from paddle.proto.ParameterConfig_pb2 import ParameterConfig
from paddle.trainer_config_helpers.default_decorators import wrap_name_default
class ModelConverter(object):
def __init__(self,
paddle_tar_name,
param_name_map=None,
layer_name_map=None,
layer_type_map=None):
self.tar_name = paddle_tar_name
self.param_name_map = param_name_map
self.layer_name_map = layer_name_map
self.layer_type_map = layer_type_map
self.params = dict()
def convert(self):
layers_params = self.arrange_layer_params()
for layer_name in layers_params.keys():
layer_params, layer_params_names, layer_type = layers_params[
layer_name]
if len(layer_params) > 0:
if not layer_type:
assert layer_type_map and (
layer_type_map.get(layer_name) in ["conv", "bn", "fc"])
layer_type = layer_type_map[layer_name]
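                # Dispatch to convert_conv_layer / convert_fc_layer /
                # convert_bn_layer according to the detected layer type.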
self.pre_layer_name = getattr(
self, "convert_" + layer_type + "_layer")(
layer_params,
params_names=[
self.param_name_map.get(name)
if self.param_name_map else None
for name in layer_params_names
],
name=None if self.layer_name_map == None else
self.layer_name_map.get(layer_name))
with gzip.open(self.tar_name, 'w') as f:
self.to_tar(f)
return
def to_tar(self, f):
tar = tarfile.TarFile(fileobj=f, mode='w')
for param_name in self.params.keys():
param_conf, param_data = self.params[param_name]
confStr = param_conf.SerializeToString()
tarinfo = tarfile.TarInfo(name="%s.protobuf" % param_name)
tarinfo.size = len(confStr)
buf = cStringIO.StringIO(confStr)
buf.seek(0)
tar.addfile(tarinfo, fileobj=buf)
buf = cStringIO.StringIO()
self.serialize(param_data, buf)
tarinfo = tarfile.TarInfo(name=param_name)
buf.seek(0)
tarinfo.size = len(buf.getvalue())
tar.addfile(tarinfo, buf)
@staticmethod
def serialize(data, f):
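        # Header: version (0), bytes per value (4, i.e. float32) and the
        # number of values, followed by the raw parameter data.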
f.write(struct.pack("IIQ", 0, 4, data.size))
f.write(data.tobytes())
class TFModelConverter(ModelConverter):
def __init__(self,
tf_net,
paddle_tar_name,
param_name_map=None,
layer_name_map=None,
layer_type_map=None):
super(TFModelConverter, self).__init__(paddle_tar_name, param_name_map,
layer_name_map, layer_type_map)
self.sess = __import__(tf_net).build_model()
def arrange_layer_params(self):
all_vars = tf.global_variables()
layers_params = collections.OrderedDict()
for var in all_vars:
var_name = var.name
scope_pos = var_name.rfind('/')
if scope_pos != -1:
layer_scope = var_name[:scope_pos]
if layers_params.has_key(layer_scope):
layer_params, layer_params_names, layer_type = layers_params[
layer_scope]
layer_params.append(var.eval(self.sess))
layer_params_names.append(var_name)
else:
layer_type = re.search('conv|bn|fc', layer_scope)
layers_params[layer_scope] = ([var.eval(self.sess)],
[var_name], layer_type.group()
if layer_type else None)
return layers_params
@wrap_name_default("conv")
def convert_conv_layer(self, params, params_names=None, name=None):
for i in range(len(params)):
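            # TensorFlow stores 4-D conv kernels as HWIO (height, width, in
            # channels, out channels); transpose them to Paddle's OIHW layout.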
data = np.transpose(params[i], (
3, 2, 0, 1)) if len(params[i].shape) == 4 else params[i]
if len(params) == 2:
suffix = "0" if i == 0 else "bias"
file_name = "_%s.w%s" % (name, suffix) if not (
params_names and params_names[i]) else params_names[i]
else:
file_name = "_%s.w%s" % (name, str(i)) if not (
params_names and params_names[i]) else params_names[i]
param_conf = ParameterConfig()
param_conf.name = file_name
dims = list(data.shape)
if len(dims) == 1:
dims.insert(1, 1)
param_conf.dims.extend(dims)
param_conf.size = reduce(lambda a, b: a * b, data.shape)
self.params[file_name] = (param_conf, data.flatten())
@wrap_name_default("fc_layer")
def convert_fc_layer(self, params, params_names=None, name=None):
for i in range(len(params)):
data = params[i]
if len(params) == 2:
suffix = "0" if i == 0 else "bias"
file_name = "_%s.w%s" % (name, suffix) if not (
params_names and params_names[i]) else params_names[i]
else:
file_name = "_%s.w%s" % (name, str(i)) if not (
params_names and params_names[i]) else params_names[i]
param_conf = ParameterConfig()
param_conf.name = file_name
dims = list(data.shape)
if len(dims) < 2:
dims.insert(0, 1)
param_conf.size = reduce(lambda a, b: a * b, dims)
param_conf.dims.extend(dims)
self.params[file_name] = (param_conf, data.flatten())
return name
@wrap_name_default("batch_norm")
def convert_bn_layer(self, params, params_names=None, name=None):
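        # Reorder the TF batchnorm variables (scale, shift, mean, var) into
        # (scale, mean, var, shift) to match the parameter slots of
        # paddle.layer.batch_norm (w0, w1, w2, wbias).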
params = [params[i] for i in (0, 2, 3, 1)]
params_names = [params_names[i]
for i in (0, 2, 3, 1)] if params_names else params_names
for i in range(len(params)):
data = params[i]
file_name = "_%s.w%s" % (name, str(i)) if i < 3 else "_%s.w%s" % (
name, "bias")
file_name = file_name if not (params_names and
params_names[i]) else params_names[i]
param_conf = ParameterConfig()
param_conf.name = file_name
dims = list(data.shape)
assert len(dims) == 1
dims.insert(0, 1)
param_conf.size = reduce(lambda a, b: a * b, dims)
param_conf.dims.extend(dims)
self.params[file_name] = (param_conf, data.flatten())
return name
if __name__ == "__main__":
tf_net = "TF_ResNet"
paddle_tar_name = "Paddle_ResNet50.tar.gz"
converter = TFModelConverter(tf_net=tf_net, paddle_tar_name=paddle_tar_name)
converter.convert()
......@@ -96,7 +96,7 @@ $$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}} = \frac{1}{2}(1-S_{i,j})-\fra
To train the `RankNet` model, run on the command line:
```bash
python ranknet.py
python train.py --model_type ranknet
```
On first run, the data is downloaded automatically, the RankNet model is trained, and the model parameters of each pass are saved.
......@@ -104,9 +104,7 @@ python ranknet.py
To make predictions with the trained `RankNet` model, run on the command line:
```bash
python ranknet.py \
--run_type infer \
--test_model_path models/ranknet_params_0.tar.gz
python infer.py --model_type ranknet --test_model_path models/ranknet_params_0.tar.gz
```
This example provides both training and prediction for the RankNet model. A trained model consists of two parts: the topology (note that `rank_cost` is not part of the model topology) and the parameter files. In this example, prediction reuses `half_ranknet`, the model topology used in `ranknet` training, and loads the model parameters from external storage. The prediction input is the feature vector of a single document, and the model outputs a relevance score; sorting the predicted scores yields the final relevance ranking of the documents.
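Turning the per-document scores into the final ranking is then just a descending sort, e.g. (a sketch with made-up scores):

```python
docs = ["doc_a", "doc_b", "doc_c"]  # hypothetical document ids
scores = [0.3, 0.9, 0.5]            # hypothetical predicted relevance scores
ranking = [doc for _, doc in sorted(zip(scores, docs), reverse=True)]
print(ranking)  # ['doc_b', 'doc_c', 'doc_a']
```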
......@@ -193,7 +191,7 @@ $$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}}=-\frac{\sigma }{1+e^{\sigma (
To train the `LambdaRank` model, run on the command line:
```bash
python lambda_rank.py
python train.py --model_type lambdarank
```
On first run, the script automatically downloads the data, trains the LambdaRank model, and saves the model of each pass.
......@@ -203,9 +201,7 @@ The LambdaRank prediction process is the same as for RankNet. At prediction time, the model topology is reused
To make predictions with the trained `LambdaRank` model, run on the command line:
```bash
python lambda_rank.py \
--run_type infer \
--test_model_path models/lambda_rank_params_0.tar.gz
python infer.py --model_type lambdarank --test_model_path models/lambda_rank_params_0.tar.gz
```
## Custom LambdaRank Data
......
import os
import gzip
import functools
import argparse
import logging
import paddle.v2 as paddle
from ranknet import half_ranknet
from lambda_rank import lambda_rank
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
def ranknet_infer(input_dim, model_path):
"""
RankNet model inference interface.
"""
# we just need half_ranknet to predict a rank score,
# which can be used in sort documents
output = half_ranknet("right", input_dim)
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_path))
# load data of same query and relevance documents,
# need ranknet to rank these candidates
infer_query_id = []
infer_data = []
infer_doc_index = []
# convert to mq2007 built-in data format
# <query_id> <relevance_score> <feature_vector>
plain_txt_test = functools.partial(
paddle.dataset.mq2007.test, format="plain_txt")
for query_id, relevance_score, feature_vector in plain_txt_test():
infer_query_id.append(query_id)
infer_data.append([feature_vector])
    # Predict the scores of the infer_data documents, then re-sort the
    # documents by predicted score in descending order to build the ranking.
scores = paddle.infer(
output_layer=output, parameters=parameters, input=infer_data)
for query_id, score in zip(infer_query_id, scores):
print "query_id : ", query_id, " score : ", score
def lambda_rank_infer(input_dim, model_path):
"""
LambdaRank model inference interface.
"""
output = lambda_rank(input_dim, is_infer=True)
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_path))
infer_query_id = None
infer_data = []
infer_data_num = 1
fill_default_test = functools.partial(
paddle.dataset.mq2007.test, format="listwise")
for label, querylist in fill_default_test():
infer_data.append([querylist])
if len(infer_data) == infer_data_num:
break
    # Predict the scores of the infer_data documents, then re-sort the
    # documents by predicted score in descending order to build the ranking.
    predictions = paddle.infer(
        output_layer=output, parameters=parameters, input=infer_data)
    for i, score in enumerate(predictions):
print i, score
def parse_args():
parser = argparse.ArgumentParser(
description="PaddlePaddle learning to rank example.")
parser.add_argument(
"--model_type",
type=str,
help=("A flag indicating to run the RankNet or the LambdaRank model. "
"Available options are: ranknet or lambdarank."),
default="ranknet")
parser.add_argument(
"--use_gpu",
type=bool,
help="A flag indicating whether to use the GPU device in training.",
default=False)
parser.add_argument(
"--trainer_count",
type=int,
help="The thread number used in training.",
default=1)
parser.add_argument(
"--test_model_path",
type=str,
required=True,
help=("The path of a trained model."))
return parser.parse_args()
if __name__ == "__main__":
args = parse_args()
    assert os.path.exists(args.test_model_path), (
        "The trained model does not exist. Please set a correct path.")
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
# Training dataset: mq2007, input_dim = 46, dense format.
input_dim = 46
if args.model_type == "ranknet":
ranknet_infer(input_dim, args.test_model_path)
elif args.model_type == "lambdarank":
lambda_rank_infer(input_dim, args.test_model_path)
else:
logger.fatal(("A wrong value for parameter model type. "
"Available options are: ranknet or lambdarank."))
import os
import sys
import gzip
import functools
import argparse
import logging
import numpy as np
"""
LambdaRank is a listwise rank model.
https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf
"""
import paddle.v2 as paddle
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
def lambda_rank(input_dim, is_infer):
def lambda_rank(input_dim, is_infer=False):
"""
LambdaRank is a listwise rank model, the input data and label
must be sequences.
The input data and label for LambdaRank must be sequences.
https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf
parameters :
input_dim, one document's dense feature vector dimension
The format of the dense_vector_sequence is as follows:
[[f, ...], [f, ...], ...], f is a float or an int number
"""
if not is_infer:
label = paddle.layer.data("label",
paddle.data_type.dense_vector_sequence(1))
data = paddle.layer.data("data",
paddle.data_type.dense_vector_sequence(input_dim))
......@@ -49,134 +37,11 @@ def lambda_rank(input_dim, is_infer):
param_attr=paddle.attr.Param(initial_std=0.01))
if not is_infer:
# Define the cost layer.
label = paddle.layer.data("label",
paddle.data_type.dense_vector_sequence(1))
cost = paddle.layer.lambda_cost(
input=output, score=label, NDCG_num=6, max_sort_size=-1)
return cost, output
return output
def lambda_rank_train(num_passes, model_save_dir):
# The input for LambdaRank must be a sequence.
fill_default_train = functools.partial(
paddle.dataset.mq2007.train, format="listwise")
fill_default_test = functools.partial(
paddle.dataset.mq2007.test, format="listwise")
train_reader = paddle.batch(
paddle.reader.shuffle(fill_default_train, buf_size=100), batch_size=32)
test_reader = paddle.batch(fill_default_test, batch_size=32)
# Training dataset: mq2007, input_dim = 46, dense format.
input_dim = 46
cost, output = lambda_rank(input_dim, is_infer=False)
parameters = paddle.parameters.create(cost)
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
update_equation=paddle.optimizer.Adam(learning_rate=1e-4))
# Define end batch and end pass event handler.
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
logger.info("Pass %d Batch %d Cost %.9f" %
(event.pass_id, event.batch_id, event.cost))
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, feeding=feeding)
logger.info("\nTest with Pass %d, %s" %
(event.pass_id, result.metrics))
with gzip.open(
os.path.join(model_save_dir, "lambda_rank_params_%d.tar.gz"
% (event.pass_id)), "w") as f:
trainer.save_parameter_to_tar(f)
feeding = {"label": 0, "data": 1}
trainer.train(
reader=train_reader,
event_handler=event_handler,
feeding=feeding,
num_passes=num_passes)
def lambda_rank_infer(test_model_path):
"""LambdaRank model inference interface.
Parameters:
test_model_path : The path of the trained model.
"""
logger.info("Begin to Infer...")
input_dim = 46
output = lambda_rank(input_dim, is_infer=True)
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(test_model_path))
infer_query_id = None
infer_data = []
infer_data_num = 1
fill_default_test = functools.partial(
paddle.dataset.mq2007.test, format="listwise")
for label, querylist in fill_default_test():
infer_data.append([querylist])
if len(infer_data) == infer_data_num:
break
    # Predict the scores of the infer_data documents, then re-sort the
    # documents by predicted score in descending order to build the ranking.
    predictions = paddle.infer(
        output_layer=output, parameters=parameters, input=infer_data)
    for i, score in enumerate(predictions):
print i, score
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description="PaddlePaddle LambdaRank example.")
parser.add_argument(
"--run_type",
type=str,
help=("A flag indicating to run the training or the inferring task. "
"Available options are: train or infer."),
default="train")
parser.add_argument(
"--num_passes",
type=int,
help="The number of passes to train the model.",
default=10)
parser.add_argument(
"--use_gpu",
type=bool,
help="A flag indicating whether to use the GPU device in training.",
default=False)
parser.add_argument(
"--trainer_count",
type=int,
help="The thread number used in training.",
default=1)
parser.add_argument(
"--model_save_dir",
type=str,
required=False,
help=("The path to save the trained models."),
default="models")
parser.add_argument(
"--test_model_path",
type=str,
required=False,
help=("This parameter works only in inferring task to "
"specify path of a trained model."),
default="")
args = parser.parse_args()
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
if args.run_type == "train":
lambda_rank_train(args.num_passes, args.model_save_dir)
elif args.run_type == "infer":
        assert os.path.exists(args.test_model_path), (
            "The trained model does not exist. Please set a correct path.")
lambda_rank_infer(args.test_model_path)
return cost
else:
logger.fatal(("A wrong value for parameter run type. "
"Available options are: train or infer."))
return output
import numpy as np
import unittest
def ndcg(score_list):
"""
measure the ndcg score of order list
https://en.wikipedia.org/wiki/Discounted_cumulative_gain
parameter:
score_list: np.array, shape=(sample_num,1)
e.g. predict rank score list :
>>> scores = [3, 2, 3, 0, 1, 2]
>>> ndcg_score = ndcg(scores)
"""
def dcg(score_list):
n = len(score_list)
cost = .0
for i in range(n):
cost += float(np.power(2, score_list[i])) / np.log((i + 1) + 1)
return cost
dcg_cost = dcg(score_list)
score_ranking = sorted(score_list, reverse=True)
ideal_cost = dcg(score_ranking)
return dcg_cost / ideal_cost
class TestNDCG(unittest.TestCase):
def test_array(self):
a = [3, 2, 3, 0, 1, 2]
value = ndcg(a)
self.assertAlmostEqual(0.9583, value, places=3)
if __name__ == '__main__':
unittest.main()
import os
import sys
import gzip
import functools
import argparse
import logging
import numpy as np
"""
ranknet is the classic pairwise learning to rank algorithm
http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf
"""
import paddle.v2 as paddle
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
# ranknet is the classic pairwise learning to rank algorithm
# http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf
def score_diff(right_score, left_score):
return np.average(np.abs(right_score - left_score))
def half_ranknet(name_prefix, input_dim):
"""
......@@ -60,142 +46,3 @@ def ranknet(input_dim):
cost = paddle.layer.rank_cost(
name="cost", left=output_left, right=output_right, label=label)
return cost
def ranknet_train(num_passes, model_save_dir):
train_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.mq2007.train, buf_size=100),
batch_size=100)
test_reader = paddle.batch(paddle.dataset.mq2007.test, batch_size=100)
# mq2007 feature_dim = 46, dense format
# fc hidden_dim = 128
feature_dim = 46
cost = ranknet(feature_dim)
parameters = paddle.parameters.create(cost)
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
update_equation=paddle.optimizer.Adam(learning_rate=2e-4))
# Define the input data order
feeding = {"label": 0, "left_data": 1, "right_data": 2}
# Define end batch and end pass event handler
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 25 == 0:
diff = score_diff(
event.gm.getLayerOutputs("left_score")["left_score"][
"value"],
event.gm.getLayerOutputs("right_score")["right_score"][
"value"])
logger.info(("Pass %d Batch %d : Cost %.6f, "
"average absolute diff scores: %.6f") %
(event.pass_id, event.batch_id, event.cost, diff))
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, feeding=feeding)
logger.info("\nTest with Pass %d, %s" %
(event.pass_id, result.metrics))
with gzip.open(
os.path.join(model_save_dir, "ranknet_params_%d.tar.gz" %
(event.pass_id)), "w") as f:
trainer.save_parameter_to_tar(f)
trainer.train(
reader=train_reader,
event_handler=event_handler,
feeding=feeding,
num_passes=num_passes)
def ranknet_infer(model_path):
"""
load the trained model. And predict with plain txt input
"""
logger.info("Begin to Infer...")
feature_dim = 46
# we just need half_ranknet to predict a rank score,
# which can be used in sort documents
output = half_ranknet("right", feature_dim)
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_path))
# load data of same query and relevance documents,
# need ranknet to rank these candidates
infer_query_id = []
infer_data = []
infer_doc_index = []
# convert to mq2007 built-in data format
# <query_id> <relevance_score> <feature_vector>
plain_txt_test = functools.partial(
paddle.dataset.mq2007.test, format="plain_txt")
for query_id, relevance_score, feature_vector in plain_txt_test():
infer_query_id.append(query_id)
infer_data.append([feature_vector])
    # Predict the scores of the infer_data documents, then re-sort the
    # documents by predicted score in descending order to build the ranking.
scores = paddle.infer(
output_layer=output, parameters=parameters, input=infer_data)
for query_id, score in zip(infer_query_id, scores):
print "query_id : ", query_id, " score : ", score
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="PaddlePaddle RankNet example.")
parser.add_argument(
"--run_type",
type=str,
help=("A flag indicating to run the training or the inferring task. "
"Available options are: train or infer."),
default="train")
parser.add_argument(
"--num_passes",
type=int,
help="The number of passes to train the model.",
default=10)
parser.add_argument(
"--use_gpu",
type=bool,
help="A flag indicating whether to use the GPU device in training.",
default=False)
parser.add_argument(
"--trainer_count",
type=int,
help="The thread number used in training.",
default=1)
parser.add_argument(
"--model_save_dir",
type=str,
required=False,
help=("The path to save the trained models."),
default="models")
parser.add_argument(
"--test_model_path",
type=str,
required=False,
help=("This parameter works only in inferring task to "
"specify path of a trained model."),
default="")
args = parser.parse_args()
if not os.path.exists(args.model_save_dir): os.mkdir(args.model_save_dir)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
if args.run_type == "train":
ranknet_train(args.num_passes, args.model_save_dir)
elif args.run_type == "infer":
        assert os.path.exists(
            args.test_model_path), "The trained model does not exist."
ranknet_infer(args.test_model_path)
else:
logger.fatal(("A wrong value for parameter run type. "
"Available options are: train or infer."))
import os
import gzip
import functools
import argparse
import logging
import numpy as np
import paddle.v2 as paddle
from ranknet import ranknet
from lambda_rank import lambda_rank
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
def ranknet_train(input_dim, num_passes, model_save_dir):
train_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.mq2007.train, buf_size=100),
batch_size=100)
test_reader = paddle.batch(paddle.dataset.mq2007.test, batch_size=100)
cost = ranknet(input_dim)
parameters = paddle.parameters.create(cost)
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
update_equation=paddle.optimizer.Adam(learning_rate=2e-4))
feeding = {"label": 0, "left_data": 1, "right_data": 2}
def score_diff(right_score, left_score):
return np.average(np.abs(right_score - left_score))
# Define end batch and end pass event handler
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 25 == 0:
diff = score_diff(
event.gm.getLayerOutputs("left_score")["left_score"][
"value"],
event.gm.getLayerOutputs("right_score")["right_score"][
"value"])
logger.info(("Pass %d Batch %d : Cost %.6f, "
"average absolute diff scores: %.6f") %
(event.pass_id, event.batch_id, event.cost, diff))
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, feeding=feeding)
logger.info("\nTest with Pass %d, %s" %
(event.pass_id, result.metrics))
with gzip.open(
os.path.join(model_save_dir, "ranknet_params_%d.tar.gz" %
(event.pass_id)), "w") as f:
trainer.save_parameter_to_tar(f)
trainer.train(
reader=train_reader,
event_handler=event_handler,
feeding=feeding,
num_passes=num_passes)
def lambda_rank_train(input_dim, num_passes, model_save_dir):
# The input for LambdaRank must be a sequence.
fill_default_train = functools.partial(
paddle.dataset.mq2007.train, format="listwise")
fill_default_test = functools.partial(
paddle.dataset.mq2007.test, format="listwise")
train_reader = paddle.batch(
paddle.reader.shuffle(fill_default_train, buf_size=100), batch_size=32)
test_reader = paddle.batch(fill_default_test, batch_size=32)
cost = lambda_rank(input_dim)
parameters = paddle.parameters.create(cost)
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
update_equation=paddle.optimizer.Adam(learning_rate=1e-4))
feeding = {"label": 0, "data": 1}
# Define end batch and end pass event handler.
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
logger.info("Pass %d Batch %d Cost %.9f" %
(event.pass_id, event.batch_id, event.cost))
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, feeding=feeding)
logger.info("\nTest with Pass %d, %s" %
(event.pass_id, result.metrics))
with gzip.open(
os.path.join(model_save_dir, "lambda_rank_params_%d.tar.gz"
% (event.pass_id)), "w") as f:
trainer.save_parameter_to_tar(f)
trainer.train(
reader=train_reader,
event_handler=event_handler,
feeding=feeding,
num_passes=num_passes)
def parse_args():
parser = argparse.ArgumentParser(
description="PaddlePaddle learning to rank example.")
parser.add_argument(
"--model_type",
type=str,
help=("A flag indicating to run the RankNet or the LambdaRank model. "
"Available options are: ranknet or lambdarank."),
default="ranknet")
parser.add_argument(
"--num_passes",
type=int,
help="The number of passes to train the model.",
default=10)
parser.add_argument(
"--use_gpu",
type=bool,
help="A flag indicating whether to use the GPU device in training.",
default=False)
parser.add_argument(
"--trainer_count",
type=int,
help="The thread number used in training.",
default=1)
parser.add_argument(
"--model_save_dir",
type=str,
required=False,
help=("The path to save the trained models."),
default="models")
return parser.parse_args()
if __name__ == "__main__":
args = parse_args()
if not os.path.exists(args.model_save_dir): os.mkdir(args.model_save_dir)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
# Training dataset: mq2007, input_dim = 46, dense format.
input_dim = 46
if args.model_type == "ranknet":
ranknet_train(input_dim, args.num_passes, args.model_save_dir)
elif args.model_type == "lambdarank":
lambda_rank_train(input_dim, args.num_passes, args.model_save_dir)
else:
logger.fatal(("A wrong value for parameter model type. "
"Available options are: ranknet or lambdarank."))
......@@ -129,7 +129,7 @@ Some important parameters of the NCE layer are explained below:
size=dict_size,
input=paddle.layer.trans_full_matrix_projection(
hidden_layer, param_attr=paddle.attr.Param(name="nce_w")),
act=paddle.activation.Sigmoid(),
act=paddle.activation.Softmax(),
bias_attr=paddle.attr.Param(name="nce_b"))
```
The `paddle.layer.mixed` in the snippet above must take PaddlePaddle's `paddle.layer.×_projection` as input. `paddle.layer.mixed` sums the results of its `projection` inputs (there can be several) to produce its output. `paddle.layer.trans_full_matrix_projection` transposes the parameter $W$ when computing the matrix multiplication.
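As a plain-NumPy sketch (not Paddle API) of what the transposed projection computes, assuming the shared parameter `nce_w` is stored with shape (dict_size, hidden_dim):

```python
import numpy as np

hidden_dim, dict_size = 4, 5               # toy sizes for illustration
x = np.random.rand(2, hidden_dim)          # a batch of 2 hidden vectors
W = np.random.rand(dict_size, hidden_dim)  # "nce_w", shared with the NCE layer

# trans_full_matrix_projection multiplies by the transpose of W, so the
# projection can reuse the NCE layer's weight matrix unchanged.
y = x.dot(W.T)
assert y.shape == (2, dict_size)
```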
......
......@@ -15,11 +15,13 @@ def cnn_cov_group(group_input, hidden_size):
conv4 = paddle.networks.sequence_conv_pool(
input=group_input, context_len=4, hidden_size=hidden_size)
fc_param_attr = paddle.attr.ParamAttr(name='_cov_value_weight')
fc_bias_attr = paddle.attr.ParamAttr(name='_cov_value_bias')
linear_proj = paddle.layer.fc(
input=[conv3, conv4],
size=hidden_size,
param_attr=paddle.attr.ParamAttr(name='_cov_value_weight'),
bias_attr=paddle.attr.ParamAttr(name='_cov_value_bias'),
param_attr=[fc_param_attr, fc_param_attr],
bias_attr=fc_bias_attr,
act=paddle.activation.Linear())
return linear_proj
......
......@@ -84,7 +84,7 @@ After the training is completed, the model will input and decode the correspondi
**4**. Iterate until $ k $ complete sentences are obtained as candidates for the translation result.
For more information on beam search, refer to the [beam search] chapter in PaddleBook [machine translation] (https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md) (https://github.com/PaddlePaddle/book/blob/develop.org.machine_translation/README.cn.md# beam search algorithm) section.
For more information on beam search, refer to the [beam search](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md#beam-search-algorithm) section in PaddleBook [machine translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation) chapter.
### Decoder without Attention mechanism
......@@ -343,6 +343,6 @@ So far, we have implemented a basic machine translation model using PaddlePaddle
## References
[1] Sutskever I, Vinyals O, Le Q V. [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) [J]. 2014, 4: 3104-3112.
[2] Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation (http://www.aclweb.org/anthology/D/D14/D14-1179 .pdf) [C]. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1724-1734.
[2] Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://www.aclweb.org/anthology/D/D14/D14-1179.pdf) [C]. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1724-1734.
[3] Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473) [C]. Proceedings of ICLR 2015, 2015
......@@ -109,11 +109,12 @@ Baghdad NNP I-NP I-LOC
```python
main(
train_data_file='data/train',
test_data_file='data/test',
vocab_file='data/vocab.txt',
target_file='data/target.txt',
emb_file='data/wordVectors.txt')
train_data_file="data/train",
test_data_file="data/test",
vocab_file="data/vocab.txt",
target_file="data/target.txt",
emb_file="data/wordVectors.txt",
model_save_dir="models/")
```
3. Run the command `python train.py`. **Note: running it directly uses the sample data; please replace it with real labeled data.**
......
......@@ -108,4 +108,4 @@ if __name__ == "__main__":
vocab_file="data/vocab.txt",
target_file="data/target.txt",
emb_file="data/wordVectors.txt",
model_save_dir="model/")
model_save_dir="models/")