Commit 250c3949 authored by: W wangmeng28

Merge remote-tracking branch 'upstream/develop' into deep_fm

Learning to rank (LTR) \[[1](#参考文献1)\] is a family of machine learning methods for building ranking models, and it plays an important role in information retrieval, natural language processing, and data mining. Its goal is, for a given set of documents, to produce a relevance-based document ranking for an arbitrary query. In this example we use an annotated corpus to train two classic ranking models, RankNet \[[4](#参考文献4)\] and LambdaRank \[[6](#参考文献6)\]; each yields a ranking model that can order documents by relevance for any query.
## Background
Learning to rank has attracted growing attention with the rapid growth of the Internet and is one of the common tasks in machine learning. Hand-crafted ranking rules cannot cope with candidate sets at web scale, nor can they assign appropriate weights to candidates from different sources, so learned ranking models are applied very widely. The field originated in information retrieval and remains a core module in many retrieval scenarios, such as ranking search-engine results, ranking candidate sets in recommender systems, and ranking online ads. This example illustrates learning-to-rank models with a document retrieval task.
$$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}} = \frac{1}{2}(1-S_{i,j})-\frac{1}{1+e^{\sigma (s_{i}-s_{j})}}$$
- Fully connected layer: every node in the previous layer is connected to every node in the next layer. It is implemented with `paddle.layer.fc`; note that the fully connected layer fed into the RankCost layer must have dimension 1.
- RankCost layer: the core of the RankNet ranking network. It measures whether docA is more relevant than docB, produces a prediction, and compares it with the label. Cross entropy is used as the loss function, optimized with gradient descent. See [RankNet](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf)[4] for details.
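Conceptually, the cost computed by the RankCost layer is the pairwise cross entropy from the RankNet paper. The following is a minimal NumPy sketch for illustration only; the function and variable names and the `sigma` scale factor are assumptions, not PaddlePaddle API:

```python
import numpy as np

def rank_cost(score_a, score_b, label, sigma=1.0):
    """Pairwise cross-entropy cost from the RankNet paper.

    label is 1.0 if docA should rank above docB, 0.0 otherwise.
    P(A > B) is modeled as the sigmoid of the score difference.
    """
    p_ab = 1.0 / (1.0 + np.exp(-sigma * (score_a - score_b)))
    # Cross entropy between the target label and the predicted probability.
    return -label * np.log(p_ab) - (1.0 - label) * np.log(1.0 - p_ab)

# When docA scores higher and the label agrees, the cost is small.
low = rank_cost(2.0, -2.0, 1.0)
# When the label disagrees with the score order, the cost is large.
high = rank_cost(2.0, -2.0, 0.0)
```

Minimizing this cost pushes the score difference of each labeled pair in the direction the label indicates.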
Because the two sides of the pairwise network are symmetric, we can define one half of the network and let the other half share its parameters. PaddlePaddle allows shared connections in a network: parameters with the same name are shared. See the `half_ranknet` function in [ranknet.py](ranknet.py) for the example code that defines the RankNet network structure in PaddlePaddle.
```python
import paddle.v2 as paddle

def half_ranknet(name_prefix, input_dim):
    """
    Parameters with the same name are shared in the PaddlePaddle framework.
    RankNet uses this to share the weights of the left and right half
    networks, see
    https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/api.md
    """
    # data layer
    data = paddle.layer.data(name_prefix + "/data",
                             paddle.data_type.dense_vector(input_dim))
    # fully connected hidden layer
    hd1 = paddle.layer.fc(
        input=data,
        size=10,
        act=paddle.activation.Tanh(),
        param_attr=paddle.attr.Param(initial_std=0.01, name="hidden_w1"))
    # fully connected output layer
    output = paddle.layer.fc(
        input=hd1,
        size=1,
        act=paddle.activation.Linear(),
        param_attr=paddle.attr.Param(initial_std=0.01, name="output"))
    return output

def ranknet(input_dim):
    # label layer
    label = paddle.layer.data("label", paddle.data_type.integer_value(1))
    # reuse the parameters of half_ranknet on both sides
    output_left = half_ranknet("left", input_dim)
    output_right = half_ranknet("right", input_dim)
    # rankcost layer
    cost = paddle.layer.rank_cost(
        name="cost", left=output_left, right=output_right, label=label)
    return cost
```
The structure defined in `half_ranknet` is the same as the model in Figure 3: two hidden layers, a fully connected layer with `hidden_size=10` followed by one with `hidden_size=1`. Here `input_dim` is the feature dimension of a **single document**, and the label takes the values 1 or 0. Each input sample has the form `<label>, <docA, docB>`. Taking `docA` as an example, its `input_dim` document features are transformed into 10-dimensional and then 1-dimensional features, which are finally fed into the RankCost layer, where the outputs for docA and docB are compared to produce the prediction.
### RankNet model training
To train the `RankNet` model, run:
```bash
python ranknet.py
```
The first run automatically downloads the data, trains the RankNet model, and saves the model parameters of every pass.
### RankNet model prediction
To predict with a trained `RankNet` model, run:
```bash
python ranknet.py \
--run_type infer \
--test_model_path models/ranknet_params_0.tar.gz
```
This example covers both training and prediction for the RankNet model. A trained model consists of two parts: the topology (note that `rank_cost` is not part of the prediction topology) and the parameter files. Prediction reuses the `half_ranknet` topology from training and loads the parameters from disk. The prediction input is the feature vector of a single document, and the model outputs a relevance score; sorting the predicted scores yields the final relevance ranking of the documents.
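Sorting documents by the predicted score is plain post-processing. A small sketch, where the scores are made-up stand-ins for the model output:

```python
import numpy as np

# Hypothetical relevance scores produced by the model for five documents.
doc_ids = ["doc0", "doc1", "doc2", "doc3", "doc4"]
scores = np.array([0.3, 1.7, -0.2, 0.9, 1.1])

# Rank documents by descending score.
order = np.argsort(-scores)
ranking = [doc_ids[i] for i in order]
# ranking is ["doc1", "doc4", "doc3", "doc0", "doc2"]
```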
## Custom RankNet data
```python
feeding = {"label": 0,
           "left/data": 1,
           "right/data": 2}
```
## LambdaRank ranking model
[LambdaRank](https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf)\[[6](#参考文献6)\] is a listwise ranking method developed from RankNet by Burges et al. [6]. It optimizes the metric NDCG (Normalized Discounted Cumulative Gain) by constructing a lambda function (hence the name LambdaRank), and the whole result list returned for a query forms a single training sample. NDCG is one of the standard measures of ranking quality for a document list in information retrieval; the NDCG score of the top $K$ documents is written as

$$DCG@K=\sum_{i=1}^{K}\frac{2^{rel_{i}}-1}{\log_{2}(i+1)}, \qquad NDCG@K=\frac{DCG@K}{IDCG@K}$$

where $rel_{i}$ is the relevance grade of the document at position $i$ and $IDCG@K$ is the $DCG@K$ of the ideal ordering.
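The computation of NDCG can be sketched as below. This is consistent with the docstring of the `ndcg` helper in `metrics.py`, but the exact implementation there may differ:

```python
import math

def ndcg(score_list):
    """NDCG over a predicted ranking.

    score_list holds the relevance grades of the documents in the
    order the model ranked them, e.g. [3, 2, 3, 0, 1, 2].
    """
    def dcg(scores):
        # Gain (2^rel - 1) discounted by log2 of the 1-based position + 1.
        return sum((2 ** s - 1) / math.log(i + 2, 2)
                   for i, s in enumerate(scores))

    ideal = dcg(sorted(score_list, reverse=True))
    return dcg(score_list) / ideal if ideal > 0 else 0.0

score = ndcg([3, 2, 3, 0, 1, 2])
```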
The lambda used for each document pair is

$$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}}=-\frac{\sigma }{1+e^{\sigma (s_{i}-s_{j})}}\left| \Delta NDCG \right|$$
- LambdaCost layer: uses NDCG differences as the lambda function. `score` is a one-dimensional sequence; for a single training sample, the fully connected layer outputs a sequence of 1x1 values, and both sequences have length equal to the number of documents retrieved for that query. See [LambdaRank](https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf) for the detailed construction of the lambda function.
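The lambda described above combines the RankNet pairwise gradient with the NDCG change obtained by swapping the pair. A numeric sketch, where `sigma` is an assumed scale hyperparameter rather than a value taken from the source:

```python
import math

def lambda_ij(s_i, s_j, delta_ndcg, sigma=1.0):
    """Lambda (pairwise gradient) in the LambdaRank style:
    the RankNet gradient scaled by |delta NDCG| of swapping i and j."""
    return -sigma / (1.0 + math.exp(sigma * (s_i - s_j))) * abs(delta_ndcg)

# The larger the NDCG gain from fixing a mis-ordered pair,
# the stronger the push on that pair's scores.
weak = lambda_ij(0.5, 1.0, delta_ndcg=0.05)
strong = lambda_ij(0.5, 1.0, delta_ndcg=0.50)
```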
See the `lambda_rank` function in [lambda_rank.py](lambda_rank.py) for the example code that defines the LambdaRank network structure in PaddlePaddle.
```python
import paddle.v2 as paddle

def lambda_rank(input_dim):
    """
    LambdaRank is a listwise rank model: the input data and the label
    must both be sequences.
    https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf
    parameters:
        input_dim : one document's dense feature vector dimension
    dense_vector_sequence format:
        [[f, ...], [f, ...], ...], where f is a float or int number
    """
    label = paddle.layer.data("label",
                              paddle.data_type.dense_vector_sequence(1))
    data = paddle.layer.data("data",
                             paddle.data_type.dense_vector_sequence(input_dim))
    # hidden layer
    hd1 = paddle.layer.fc(
        input=data,
        size=10,
        act=paddle.activation.Tanh(),
        param_attr=paddle.attr.Param(initial_std=0.01))
    output = paddle.layer.fc(
        input=hd1,
        size=1,
        act=paddle.activation.Linear(),
        param_attr=paddle.attr.Param(initial_std=0.01))
    # cost layer
    cost = paddle.layer.lambda_cost(
        input=output, score=label, NDCG_num=6, max_sort_size=-1)
    return cost, output
```
This structure is the same as the model in Figure 3. Like RankNet, it uses two fully connected layers with `hidden_size=10` and `hidden_size=1`. Here `input_dim` is the feature dimension of a **single document**. Each input sample has the form label, \<docA, docB\>; taking docA as an example, its `input_dim` document features are transformed into 10-dimensional and then 1-dimensional features, which are finally fed into the LambdaCost layer. Note that here both label and data use the **dense_vector_sequence** type, representing a **sequence** of document scores or document features.
### LambdaRank model training
To train the `LambdaRank` model, run:
```bash
python lambda_rank.py
```
The first run automatically downloads the data, trains the LambdaRank model, and saves the model of every pass.
### LambdaRank model prediction
Prediction with LambdaRank works the same way as with RankNet: the model topology is reused from the model definition in the code, and the corresponding parameter file is loaded from disk. The prediction input is a list of documents, and the output is a relevance score for each document in the list; re-sorting the documents by these scores gives the final ranking.
To predict with a trained `LambdaRank` model, run:
```bash
python lambda_rank.py \
--run_type infer \
--test_model_path models/lambda_rank_params_0.tar.gz
```
## Custom LambdaRank data
The code above uses PaddlePaddle's built-in mq2007 data. To use data in a custom format, you can write a generator function, following the built-in `mq2007` dataset as a reference. For example, suppose the input data is in the following format and contains only three documents, doc0-doc2.
```python
feeding = {"label": 0,
           "data": 1}
```
## Printing custom evaluation metrics during training
Here we take `RankNet` as an example to show how to print custom evaluation metrics during training. The same method can be used to obtain the output matrix of any network layer during training.
The `RankNet` network learns a scoring function that scores the left and right inputs; the larger the difference between the two scores, the better the scoring function separates positive from negative examples, and the better the model generalizes. Suppose we want to report the mean absolute difference between the scores of the left and right inputs during training. Computing this custom metric requires the output matrices of the score layers (the layers in `ranknet` whose `name` is `left_score` or `right_score`) after each mini-batch. This can be done in two steps:
1. Handle the predefined `paddle.event.EndIteration` or `paddle.event.EndPass` events in the `event_handler`.
2. Call `event.gm.getLayerOutputs` with the name of the desired layer to obtain its output after the forward pass of a mini-batch.
Example code:
```python
import numpy as np

def score_diff(right_score, left_score):
    return np.average(np.abs(right_score - left_score))

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 25 == 0:
            diff = score_diff(
                event.gm.getLayerOutputs("right_score")["right_score"][
                    "value"],
                event.gm.getLayerOutputs("left_score")["left_score"][
                    "value"])
            logger.info(("Pass %d Batch %d : Cost %.6f, "
                         "average absolute diff scores: %.6f") %
                        (event.pass_id, event.batch_id, event.cost, diff))
```
## Summary
LTR is widely used in real-world applications. Ranking models are generally built with pointwise, pairwise, or listwise methods. Using the LETOR mq2007 data, this example presented RankNet, a classic pairwise method, and LambdaRank, a listwise method, showed how to build the corresponding ranking models with PaddlePaddle, and provided samples for custom data formats. PaddlePaddle offers a flexible programming interface, and the same code can run LTR tasks on a single machine with a single GPU or on multiple machines with distributed multi-GPU training.
## Notes
1. This example is a demonstration of LTR, and **the networks it uses are small**; in real applications, adjust the network complexity and scale to the task at hand.
2. The feature vectors in the experimental data are joint features of **query-document pairs**. When using independent query and document features, see [DSSM](https://github.com/PaddlePaddle/models/tree/develop/dssm) for building the network.
## References
......
@@ -3,10 +3,14 @@ import sys
import gzip
import functools
import argparse
import logging

import numpy as np
import paddle.v2 as paddle

logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)


def lambda_rank(input_dim, is_infer):
    """
@@ -26,7 +30,7 @@ def lambda_rank(input_dim, is_infer):
    data = paddle.layer.data("data",
                             paddle.data_type.dense_vector_sequence(input_dim))

    # Define the hidden layer.
    hd1 = paddle.layer.fc(
        input=data,
        size=128,
@@ -45,17 +49,15 @@ def lambda_rank(input_dim, is_infer):
        param_attr=paddle.attr.Param(initial_std=0.01))

    if not is_infer:
        # Define the cost layer.
        cost = paddle.layer.lambda_cost(
            input=output, score=label, NDCG_num=6, max_sort_size=-1)
        return cost, output
    return output


def lambda_rank_train(num_passes, model_save_dir):
    # The input for LambdaRank must be a sequence.
    fill_default_train = functools.partial(
        paddle.dataset.mq2007.train, format="listwise")
    fill_default_test = functools.partial(
@@ -78,13 +80,15 @@ def lambda_rank_train(num_passes, model_save_dir):
    # Define end batch and end pass event handler.
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            logger.info("Pass %d Batch %d Cost %.9f" %
                        (event.pass_id, event.batch_id, event.cost))

        if isinstance(event, paddle.event.EndPass):
            result = trainer.test(reader=test_reader, feeding=feeding)
            logger.info("\nTest with Pass %d, %s" %
                        (event.pass_id, result.metrics))
            with gzip.open(
                    os.path.join(model_save_dir, "lambda_rank_params_%d.tar.gz"
                                 % event.pass_id), "w") as f:
                trainer.save_parameter_to_tar(f)

    feeding = {"label": 0, "data": 1}
@@ -95,17 +99,17 @@ def lambda_rank_train(num_passes, model_save_dir):
        num_passes=num_passes)


def lambda_rank_infer(test_model_path):
    """LambdaRank model inference interface.

    Parameters:
        test_model_path : The path of the trained model.
    """
    logger.info("Begin to Infer...")
    input_dim = 46
    output = lambda_rank(input_dim, is_infer=True)
    parameters = paddle.parameters.Parameters.from_tar(
        gzip.open(test_model_path))

    infer_query_id = None
    infer_data = []
@@ -128,15 +132,51 @@ def lambda_rank_infer(test_model_path):
if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="PaddlePaddle LambdaRank example.")
    parser.add_argument(
        "--run_type",
        type=str,
        help=("A flag indicating to run the training or the inferring task. "
              "Available options are: train or infer."),
        default="train")
    parser.add_argument(
        "--num_passes",
        type=int,
        help="The number of passes to train the model.",
        default=10)
    parser.add_argument(
        "--use_gpu",
        type=bool,
        help="A flag indicating whether to use the GPU device in training.",
        default=False)
    parser.add_argument(
        "--trainer_count",
        type=int,
        help="The thread number used in training.",
        default=1)
    parser.add_argument(
        "--model_save_dir",
        type=str,
        required=False,
        help="The path to save the trained models.",
        default="models")
    parser.add_argument(
        "--test_model_path",
        type=str,
        required=False,
        help=("This parameter works only in the inferring task to "
              "specify the path of a trained model."),
        default="")
    args = parser.parse_args()

    paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
    if args.run_type == "train":
        lambda_rank_train(args.num_passes, args.model_save_dir)
    elif args.run_type == "infer":
        assert os.path.exists(args.test_model_path), (
            "The trained model does not exist. Please set a correct path.")
        lambda_rank_infer(args.test_model_path)
    else:
        logger.fatal(("A wrong value for parameter run type. "
                      "Available options are: train or infer."))
@@ -10,7 +10,7 @@ def ndcg(score_list):
        score_list: np.array, shape=(sample_num,1)
        e.g. predict rank score list :
        >>> scores = [3, 2, 3, 0, 1, 2]
        >>> ndcg_score = ndcg(scores)
    """
......
@@ -2,15 +2,23 @@ import os
import sys
import gzip
import functools
import argparse
import logging

import numpy as np
import paddle.v2 as paddle

logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)

# RankNet is the classic pairwise learning to rank algorithm.
# http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf


def score_diff(right_score, left_score):
    return np.average(np.abs(right_score - left_score))


def half_ranknet(name_prefix, input_dim):
    """
    Parameters with the same name will be shared in the paddle framework,
@@ -19,18 +27,21 @@ def half_ranknet(name_prefix, input_dim):
    https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/api.md
    """
    # data layer
    data = paddle.layer.data(name_prefix + "_data",
                             paddle.data_type.dense_vector(input_dim))
    # hidden layer
    hd1 = paddle.layer.fc(
        input=data,
        name=name_prefix + "_hidden",
        size=10,
        act=paddle.activation.Tanh(),
        param_attr=paddle.attr.Param(initial_std=0.01, name="hidden_w1"))

    # fully connected layer and output layer
    output = paddle.layer.fc(
        input=hd1,
        name=name_prefix + "_score",
        size=1,
        act=paddle.activation.Linear(),
        param_attr=paddle.attr.Param(initial_std=0.01, name="output"))
@@ -45,14 +56,13 @@ def ranknet(input_dim):
    output_left = half_ranknet("left", input_dim)
    output_right = half_ranknet("right", input_dim)

    # rankcost layer
    cost = paddle.layer.rank_cost(
        name="cost", left=output_left, right=output_right, label=label)
    return cost


def ranknet_train(num_passes, model_save_dir):
    train_reader = paddle.batch(
        paddle.reader.shuffle(paddle.dataset.mq2007.train, buf_size=100),
        batch_size=100)
@@ -70,22 +80,28 @@ def ranknet_train(num_passes, model_save_dir):
        update_equation=paddle.optimizer.Adam(learning_rate=2e-4))

    # Define the input data order
    feeding = {"label": 0, "left_data": 1, "right_data": 2}

    # Define end batch and end pass event handler
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 25 == 0:
                diff = score_diff(
                    event.gm.getLayerOutputs("left_score")["left_score"][
                        "value"],
                    event.gm.getLayerOutputs("right_score")["right_score"][
                        "value"])
                logger.info(("Pass %d Batch %d : Cost %.6f, "
                             "average absolute diff scores: %.6f") %
                            (event.pass_id, event.batch_id, event.cost, diff))

        if isinstance(event, paddle.event.EndPass):
            result = trainer.test(reader=test_reader, feeding=feeding)
            logger.info("\nTest with Pass %d, %s" %
                        (event.pass_id, result.metrics))
            with gzip.open(
                    os.path.join(model_save_dir, "ranknet_params_%d.tar.gz" %
                                 event.pass_id), "w") as f:
                trainer.save_parameter_to_tar(f)

    trainer.train(
@@ -95,18 +111,17 @@ def ranknet_train(num_passes, model_save_dir):
        num_passes=num_passes)


def ranknet_infer(model_path):
    """
    Load the trained model and predict with plain txt input.
    """
    logger.info("Begin to Infer...")
    feature_dim = 46

    # we just need half_ranknet to predict a rank score,
    # which can be used to sort documents
    output = half_ranknet("right", feature_dim)
    parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_path))

    # load data of the same query and its relevant documents,
    # ranknet is needed to rank these candidates
@@ -128,21 +143,59 @@ def ranknet_infer(model_path):
    # in descending order, then we build the ranking documents
    scores = paddle.infer(
        output_layer=output, parameters=parameters, input=infer_data)
    for query_id, score in zip(infer_query_id, scores):
        print "query_id : ", query_id, " score : ", score


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="PaddlePaddle RankNet example.")
    parser.add_argument(
        "--run_type",
        type=str,
        help=("A flag indicating to run the training or the inferring task. "
              "Available options are: train or infer."),
        default="train")
    parser.add_argument(
        "--num_passes",
        type=int,
        help="The number of passes to train the model.",
        default=10)
    parser.add_argument(
        "--use_gpu",
        type=bool,
        help="A flag indicating whether to use the GPU device in training.",
        default=False)
    parser.add_argument(
        "--trainer_count",
        type=int,
        help="The thread number used in training.",
        default=1)
    parser.add_argument(
        "--model_save_dir",
        type=str,
        required=False,
        help="The path to save the trained models.",
        default="models")
    parser.add_argument(
        "--test_model_path",
        type=str,
        required=False,
        help=("This parameter works only in the inferring task to "
              "specify the path of a trained model."),
        default="")
    args = parser.parse_args()

    if not os.path.exists(args.model_save_dir):
        os.mkdir(args.model_save_dir)
    paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
    if args.run_type == "train":
        ranknet_train(args.num_passes, args.model_save_dir)
    elif args.run_type == "infer":
        assert os.path.exists(
            args.test_model_path), "The trained model does not exist."
        ranknet_infer(args.test_model_path)
    else:
        logger.fatal(("A wrong value for parameter run type. "
                      "Available options are: train or infer."))
@@ -47,6 +47,7 @@ __C.NET.DETOUT.KEEP_TOP_K = 200
__C.NET.CONV4 = edict()
__C.NET.CONV4.PB = edict()
__C.NET.CONV4.PB.MIN_SIZE = [30]
__C.NET.CONV4.PB.MAX_SIZE = []
__C.NET.CONV4.PB.ASPECT_RATIO = [2.]
__C.NET.CONV4.PB.VARIANCE = [0.1, 0.1, 0.2, 0.2]
......
@@ -19,6 +19,13 @@ def net_conf(mode):
    return paddle.attr.ParamAttr(
        learning_rate=local_lr, l2_rate=regularization, is_static=is_static)

def get_loc_conf_filter_size(aspect_ratio_num, min_size_num, max_size_num):
    loc_filter_size = (
        aspect_ratio_num * 2 + min_size_num + max_size_num) * 4
    conf_filter_size = (
        aspect_ratio_num * 2 + min_size_num + max_size_num) * cfg.CLASS_NUM
    return loc_filter_size, conf_filter_size

def conv_group(stack_num, name_list, input, filter_size_list, num_channels,
               num_filters_list, stride_list, padding_list,
               common_bias_attr, common_param_attr, common_act):
@@ -102,9 +109,8 @@ def net_conf(mode):
        get_param_attr(1, default_l2regularization),
        paddle.activation.Relu())

    loc_filters, conf_filters = get_loc_conf_filter_size(
        len(aspect_ratio), len(min_size), len(max_size))
    mbox_loc, mbox_conf = mbox_block(conv2_name, conv2, num_filters2, 3,
                                     loc_filters, conf_filters)
    mbox_priorbox = paddle.layer.priorbox(
@@ -166,8 +172,13 @@ def net_conf(mode):
        input=conv4_3,
        param_attr=paddle.attr.ParamAttr(
            initial_mean=20, initial_std=0, is_static=False, learning_rate=1))

    CONV4_PB = cfg.NET.CONV4.PB
    loc_filter_size, conf_filter_size = get_loc_conf_filter_size(
        len(CONV4_PB.ASPECT_RATIO),
        len(CONV4_PB.MIN_SIZE), len(CONV4_PB.MAX_SIZE))
    conv4_3_norm_mbox_loc, conv4_3_norm_mbox_conf = \
        mbox_block("conv4_3_norm", conv4_3_norm, 512, 3,
                   loc_filter_size, conf_filter_size)

    conv5_3, pool5 = vgg_block("5", pool4, 512, 512, 3, 1, 1)
@@ -177,7 +188,11 @@ def net_conf(mode):
        get_param_attr(1, default_l2regularization),
        paddle.activation.Relu())

    FC7_PB = cfg.NET.FC7.PB
    loc_filter_size, conf_filter_size = get_loc_conf_filter_size(
        len(FC7_PB.ASPECT_RATIO), len(FC7_PB.MIN_SIZE), len(FC7_PB.MAX_SIZE))
    fc7_mbox_loc, fc7_mbox_conf = mbox_block("fc7", fc7, 1024, 3,
                                             loc_filter_size, conf_filter_size)
    fc7_mbox_priorbox = paddle.layer.priorbox(
        input=fc7,
        image=img,
@@ -212,8 +227,12 @@ def net_conf(mode):
        num_channels=256,
        stride=1,
        pool_type=paddle.pooling.Avg())

    POOL6_PB = cfg.NET.POOL6.PB
    loc_filter_size, conf_filter_size = get_loc_conf_filter_size(
        len(POOL6_PB.ASPECT_RATIO),
        len(POOL6_PB.MIN_SIZE), len(POOL6_PB.MAX_SIZE))
    pool6_mbox_loc, pool6_mbox_conf = mbox_block(
        "pool6", pool6, 256, 3, loc_filter_size, conf_filter_size)
    pool6_mbox_priorbox = paddle.layer.priorbox(
        input=pool6,
        image=img,
......
@@ -9,60 +9,60 @@ logger.setLevel(logging.INFO)
def parse_train_cmd():
    parser = argparse.ArgumentParser(
        description="PaddlePaddle text classification example.")
    parser.add_argument(
        "--nn_type",
        type=str,
        help=("A flag that defines which type of network to use, "
              "available: [dnn, cnn]."),
        default="dnn")
    parser.add_argument(
        "--train_data_dir",
        type=str,
        required=False,
        help=("The path of training dataset (default: None). If this parameter "
              "is not set, paddle.dataset.imdb will be used."),
        default=None)
    parser.add_argument(
        "--test_data_dir",
        type=str,
        required=False,
        help=("The path of testing dataset (default: None). If this parameter "
              "is not set, paddle.dataset.imdb will be used."),
        default=None)
    parser.add_argument(
        "--word_dict",
        type=str,
        required=False,
        help=("The path of word dictionary (default: None). If this parameter "
              "is not set, paddle.dataset.imdb will be used. If this parameter "
              "is set, but the file does not exist, the word dictionary "
              "will be built from the training data automatically."),
        default=None)
    parser.add_argument(
        "--label_dict",
        type=str,
        required=False,
        help=("The path of label dictionary (default: None). If this parameter "
              "is not set, paddle.dataset.imdb will be used. If this parameter "
              "is set, but the file does not exist, the label dictionary "
              "will be built from the training data automatically."),
        default=None)
    parser.add_argument(
        "--batch_size",
        type=int,
        default=32,
        help="The number of training examples in one forward/backward pass.")
    parser.add_argument(
        "--num_passes",
        type=int,
        default=10,
        help="The number of passes to train the model.")
    parser.add_argument(
        "--model_save_dir",
        type=str,
        required=False,
        help="The path to save the trained models.",
        default="models")
    return parser.parse_args()
......