diff --git a/ltr/README.md b/ltr/README.md index a0990367ef8b03c70c29d285e22ef85907e1d0b7..8d78c220a1b863534272006d821e74f28df2803b 100644 --- a/ltr/README.md +++ b/ltr/README.md @@ -1 +1,346 @@ -TBD +## 排序学习(LearningToRank) + +排序学习技术[1] 是构建排序模型的机器学习方法,在信息检索,自然语言处理,数据挖掘等机器学场景中具有重要作用。排序学习的主要目的是对给定一组文档,对任意查询请求给出反映相关性的文档排序。在本例子中,利用标注过的语料库训练两种经典排序模型RankNet[4]和LamdaRank[6],分别可以生成对应的排序模型,能够对任意查询请求,给出相关性文档排序。 + +用户可以使用 `python ranknet.py` 或者 `python lambdaRank.py`命令就完成排序模型的训练和预测,程序会自动下载内置数据集,无需手动下载。 + + + +## 背景介绍 + +排序学习技术随着互联网的快速增长而受到越来越多关注,是机器学习中的常见任务。一方面人工排序规则不能处理海量规模的候选数据,另一方面无法为不同渠道的候选数据给于合适的权重,因此排序学习在日常生活中应用非常广泛。排序学习起源于信息检索领域,目前仍然是许多信息检索场景中的核心模块,例如搜索引擎搜索结果排序,推荐系统候选集排序,在线广告排序等等。在本例子中,采用文档检索阐述排序学习模型。 + +

+ + + +
图1 排序模型在文档检索的典型应用搜索引擎中的作用

+ +假定有一组文档S,文档检索任务是依据和请求的相关性,给出文档排列顺序。查询引擎根据查询请求,排序模型会给每个文档打出分数,依据打分情况倒序排列文档,得到查询结果。在训练模型时,给定一条查询,并给出对应的文档最佳排序和得分。在预测时候,给出查询请求,排序模型生成文档排序。传统的排序学习方法划分为以下三类: + +- Pointwise 方法 + + Pointwise方法是通过近似为回归问题解决排序问题,输入的单条样本为得分-文档,将每个查询-文档对的相关性得分作为实数分数或者序数分数,使得单个查询-文档对作为样本点(Pointwise的由来),训练排序模型。预测时候对于指定输入,给出查询-文档对的相关性得分 + +- Pairwise方法 + + Pairwise方法是通过近似为分类问题解决排序问题,输入的单条样本为标签-文档对。对于一次查询的多个结果文档,组合任意两个文档形成文档对作为输入样本。即学习一个二分类器,对输入的一对文档对AB(Pairwise的由来),根据A相关性是否比B好,二分类器给出分类标签+1或-1。对所有文档对进行分类,就可以得到一组偏序关系,从而构造文档全集的排序关系。该类方法的原理是对给定的文档全集S,降低排序中的逆序文档对的个数来降低排序错误,从而达到优化排序结果的目的。 + +- Listwise方法 + + Listwise方法是直接优化排序列表,输入为单条样本为一个文档排列。通过构造合适的度量函数衡量当前文档排序和最优排序差值,优化度量函数得到排序模型。由于度量函数很多具有非连续性,优化困难。 + +

+ + + +
图2 排序模型三类构造方法

+ + ​ + +## 实验数据 + +本例子中的实验数据采用了排序学习中的基准数据[LETOR]([http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/MQ2007.rar](http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/MQ2007.rar))中语料库,部分来自于Gov2网站的查询请求结果,包含了约1700条查询请求结果文档列表,并对文档相关性做出了人工标注。其中,一条查询含有唯一的查询id,对应于多个具有相关性的文档,构成了一次查询请求结果文档列表。每个文档由一个一维数组的特征向量表示,并对应一个人工标注与查询的相关性分数。 + +[文档与查询相关性分数] :[查询id] : [文档的特征向量] + +score : query id : feature1, feature2, …., featureN. + +本样例在第一次运行的时候会自动下载LETOR MQ2007数据集并缓存,用户无需手动下载。 + +`mq2007`数据集分别提供了三种类型排序模型的生成格式,需要指定生成格式`format` + +例如 + +```python +pairwise_train_dataset = functools.partial(paddle.dataset.mq2007.train, format="pairwise") +for label, left_doc, right_doc in pairwise_train_dataset(): + ... +``` + + + + + +## 模型概览 + +对于排序模型,本样例中提供了PairWise方法的模型RankNet和ListWise方法的模型LambdaRank,分别代表了两类学习方法。PointWise方法的排序模型退化为回归问题,不予赘述。 + + + +## RankNet排序模型 + +[RankNet](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf)是一种经典的Pairwise的排序学习方法,是典型的前向神经网络排序模型。在文档集合S中的第i个文档记做`Ui`,它的文档特征向量记做`xi`,对于给定的一个文档对``,RankNet将输入的单个文档特征向量x映射到`f(x)`,得到`si=f(xi), sj=f(xj)`。将`Ui`相关性比Uj好的概率记做Pij,则 + +$$P_{i,j}=P(U_{i}>U_{j})=\frac{1}{1+e^{-\sigma (s_{i}-s_{j}))}}$$ + +由于排序度量函数大多数非连续,非光滑,因此RankNet需要一个可以优化的度量函数C。首先使用交叉熵作为度量函数衡量预测代价,将损失函数C记做 + +$$C_{i,j}=-\bar{P_{i,j}}logP_{i,j}-(1-\bar{P_{i,j}})log(1-P_{i,j})$$ + +其中代表真实概率的 + +$$\bar{P_{i,j}}=\frac{1}{2}(1+S_{i,j})$$ + +而Sij = {+1,-1},表示Ui和Uj组成的Pair的标签,即Ui相关性是否好于Uj。 + +最终得到了可求导的度量损失函数 + +$$C=\frac{1}{2}(1-S_{i,j})\sigma (s_{i}-s{j})+log(1+e^{-\sigma (s_{i}-s_{j})})$$ + +可以使用常规的梯度下降方法进行优化。细节见[RankNet](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf) + +同时,得到文档Ui在排序优化过程的梯度信息为 + +\lambda _{i,j}=\frac{\partial C}{\partial s_{i}} = \frac{1}{2}(1-S_{i,j})-\frac{1}{1+e^{\sigma (s_{i}-s_{j})}} + +表示的含义是本轮排序优化过程中上升或者下降量。 + + + +根据以上推论构造RankNet网络结构,由若干层隐藏层和全连接层构成,如图所示,将文档特征使用隐藏层,全连接层逐层变换,完成了底层特征空间到高层特征空间的变换。其中docA和docB结构对称,分别输入到最终的RankCost层中。 + +

+ + + +
图3 RankNet网络结构示意图

+ + + +- 全连接层(fully connected layer) : 指上一层中的每个节点都连接到下层网络。本例子中同样使用paddle.layer.fc实现,注意输入到RankCost层的全连接层输出为1x1的层结构 +- RankCost层: RankCost层是排序网络RankNet的核心,度量docA相关性是否比docB好,给出预测值并和label比较。使用了交叉熵(cross enctropy)作为度量损失函数,使用梯度下降方法进行优化。细节可见[RankNet](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf)[4] + +由于Pairwise中的网络结构是左右对称,可定义一半网络结构,另一半共享网络参数。使用PaddlePaddle实现RankNet排序模型,定义网络结构的示例代码如下: + +```python +import paddle.v2 as paddle + +def half_ranknet(name_prefix, input_dim): + """ + parameter in same name will be shared in paddle framework, + these parameters in ranknet can be used in shared state, e.g. left network and right network + shared parameters in detail + https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/api.md + """ + # data layer + data = paddle.layer.data(name_prefix+"/data", paddle.data_type.dense_vector(input_dim)) + + # fully connect layer + hd1 = paddle.layer.fc( + input=data, + size=10, + act=paddle.activation.Tanh(), + param_attr=paddle.attr.Param(initial_std=0.01, name="hidden_w1")) + # fully connect layer/ output layer + output = paddle.layer.fc( + input=hd1, + size=1, + act=paddle.activation.Linear(), + param_attr=paddle.attr.Param(initial_std=0.01, name="output")) + return output + +def ranknet(input_dim): + # label layer + label = paddle.layer.data("label", paddle.data_type.integer_value(1)) + + # reuse the parameter in half_ranknet + output_left = half_ranknet("left", input_dim) + output_right = half_ranknet("right", input_dim) + + # rankcost layer + cost = paddle.layer.rank_cost(name="cost", left=output_left, right=output_right, label=label) + return cost +``` + +上述结构中使用了和前述图表相同的模型结构,使用了两层隐藏层,分别使用了`hidden_size=10`的全连接层和`hidden_size=1`的全连接层。本例子中的input_dim指输入**单个文档**的特征dense_vector的维度,label取值为1,-1。每条输入样本为label,\的结构,以docA为例,输入input_dim的文档特征,依次变换成10维,1维特征,最终输入到RankCost层中,比较docA和docB在RankCost输出得到预测值。 + +用户运行`python ranknet.py`将会将每个轮次的模型存下来,并在测试数据上测试效果。 + + + +## 用户自定义RankNet数据 + +上面的代码使用了paddle内置的排序数据,如果希望使用自定义格式数据。可以参考Paddle内置的`mq2007`数据集,编写一个生成器函数。例如输入数据为如下格式,只包含doc0-doc2三个文档。 + +\ \ \的格式 + +``` +query_id : 1, relevance_score:1, feature_vector 0:0.1, 1:0.2, 2:0.4 #doc0 +query_id : 1, relevance_score:2, feature_vector 0:0.3, 1:0.1, 2:0.4 #doc1 +query_id : 1, relevance_score:0, feature_vector 0:0.2, 1:0.4, 2:0.1 #doc2 +query_id : 2, relevance_score:0, feature_vector 0:0.1, 1:0.4, 2:0.1 #doc0 +..... +``` + +需要将输入样本转换为PairWise的输入格式,例如组合生成格式与mq2007 PairWise格式相同的结构 + +\ \\ + +``` +1 doc1 doc0 +1 doc1 doc2 +1 doc0 doc2 +.... +``` + +注意,一般在PairWise格式的数据中,label=1表示docA和查询的相关性好于docB,事实上label信息隐含在docA和docB组合pair中。如果存在`-1 docA docB`,交换顺序构造`1 docB docA`即可。 + +另外组合所有的pair会有训练数据冗余,因为可以从部分偏序关系恢复文档集上的全序关系。相关研究见[PairWise approach](http://www.machinelearning.org/proceedings/icml2007/papers/139.pdf)[5],本例子不予赘述。 + +```python +# self define data generator +def gen_pairwise_data(text_line_of_data): + """ + return : + ------ + label : np.array, shape=(1) + docA_feature_vector : np.array, shape=(1, feature_dimension) + docA_feature_vector : np.array, shape=(1, feature_dimension) + """ + yield label, docA_feature_vector, docB_feature_vector +``` + +对应于paddle的输入中,`integer_value`为单个整数,`dense_vector`为实数一维向量,与生成器对应,需要在训练模型之前指明输入数据对应关系。 + +```python +# Define the input data order +feeding = {"label":0, + "left/data" :1, + "right/data":2} +``` + + + +## LambdaRank排序模型 + +[LambdaRank](https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf)[6]是ListWise的排序方法,是Bugers[6]等人从RankNet发展而来,使用构造lambda函数(LambdaRank名字的由来)的方法优化度量标准NDCG(Normalized Discounted Cumulative Gain),每个查询后得到的结果文档列表都单独作为一个训练样本。NDCG是信息论中很衡量文档列表排序质量的标准之一,前K个文档的NDCG得分记做 + +$$NDCG@K=Z_{k}\sum (2^{rel_{i}})1/log(k+1)$$ + +前文中RankNet中推导出,文档排序需要的是排序错误的梯度信息。NDCG度量函数是非光滑,非连续的,不能直接求得梯度信息,因此将|delta(NDCG)|=|NDCG(new) - NDCG(old)|引入,构造lambda函数为 + +$$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}}=-\frac{\sigma }{1+e^{\sigma (s_{i}-s{j})}}|\Delta NDCG|$$ + +替换RankNet中的梯度表示,得到的排序模型称为[LambdaRank](https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf) + +由以上推导可知,LambdaRank网络结构和RankNet结构非常相似。如图所示 + +

+ + + +
图4. LambdaRank的网络结构示意图

+ +一个查询得到的结果文档列表作为一条样本输入到网络中,替换RankCost为LambdaCost层,其他结构与RankNet相同。 + +- LambdaCost层 : LambdaCost层使用NDCG差值作为Lambda函数,score是一个一维的序列,对于单调训练样本全连接层输出的是1x1的序列,二者的序列长度都等于该条查询得到的文档数量。Lambda函数的构造详细见[LambdaRank](https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf) + +使用Paddle定义LambdaRank网络结构的示例代码如下: + +```python +import paddle.v2 as paddle +def lambdaRank(input_dim): + label = paddle.layer.data("label", paddle.data_type.dense_vector_sequence(1)) + data = paddle.layer.data("data", paddle.data_type.dense_vector_sequence(input_dim)) + + # hidden layer + hd1 = paddle.layer.fc( + input=data, + size=10, + act=paddle.activation.Tanh(), + param_attr=paddle.attr.Param(initial_std=0.01)) + output = paddle.layer.fc( input=hd1, + size=1, + act=paddle.activation.Linear(), + param_attr=paddle.attr.Param(initial_std=0.01)) + cost = paddle.layer.lambda_cost(input=output, + score=label, + NDCG_num=6, + max_sort_size = -1) + return cost, output +``` + +上述结构中使用了和前述图表相同的模型结构,和RankNet相似,分别使用了`hidden_size=10`的全连接层和`hidden_size=1`的全连接层。本例子中的input_dim指输入**单个文档**的特征dense_vector的维度,label取值为1,-1。每条输入样本为label,\的结构,以docA为例,输入input_dim的文档特征,依次变换成10维,1维特征,最终输入到LambdaCost层中。需要注意这里的label和data格式为**dense_vector_sequence**,表示一列文档得分或者文档特征组成的**序列**。 + +用户运行`python lambdaRank.py`将会把每个轮次的模型存下来,并在测试数据上测试效果。 + + + +## 自定义 LambdaRank数据 + +上面的代码使用了paddle内置的mq2007数据,如果希望使用自定义格式数据。可以参考Paddle内置的`mq2007`数据集,编写一个生成器函数。例如输入数据为如下格式,只包含doc0-doc2三个文档。 + +\ \ \的格式 + +``` +query_id : 1, relevance_score:1, feature_vector 0:0.1, 1:0.2, 2:0.4 #doc0 +query_id : 1, relevance_score:2, feature_vector 0:0.3, 1:0.1, 2:0.4 #doc1 +query_id : 1, relevance_score:0, feature_vector 0:0.2, 1:0.4, 2:0.1 #doc2 +query_id : 2, relevance_score:0, feature_vector 0:0.1, 1:0.4, 2:0.1 #doc0 +query_id : 2, relevance_score:2, feature_vector 0:0.1, 1:0.4, 2:0.1 #doc1 +..... +``` + +需要转换为ListWise格式,例如 + + + +```tex +1 1 0.1,0.2,0.4 +1 2 0.3,0.1,0.4 +1 0 0.2,0.4,0.1 + +2 0 0.1,0.4,0.1 +2 2 0.1,0.4,0.1 +...... +``` + +**数据格式注意** + +- 数据中每条样本对应的文档数量都必须大于Lambda_cost层的NDCG_num +- 单条样本对应的文档都为0,NDCG将会计算无效。文档相关性都为0,那么可以判定该query无效,可以过滤掉。 + + + +```python +# self define data generator +def gen_listwise_data(text_all_lines_of_data): + """ + return : + ------ + label : np.array, shape=(samples_num, ) + querylist : np.array, shape=(samples_num, feature_dimension) + """ + yield label_list, query_docs_feature_vector_matrix +``` + +对应于paddle的输入中,label的`dense_vector_sequence`为得分序列,data的`dense_vector_sequence`为特征向量的序列输入,input_dim为单个文档的一维特征向量维度,与生成器对应,需要在训练模型之前指明输入数据对应关系。 + +```python +# Define the input data order +feeding = {"label":0, + "data" : 1} +``` + + + +## 总结 + +LearningToRank是和业务场景结合非常紧密的常用机器学习方法,排序模型构造方法一般可划分为PointWise方法,PairWise方法,ListWise方法,本例子中以LETOR的mq2007数据为例,阐述了PairWise的经典方法RankNet和ListWise方法中的LambdaRank,展示如何使用Paddle框架构造对应的排序模型结构,并提供了自定义数据类型样例。Paddle提供了灵活的编程接口,并能在单机单GPU和多机分布式多GPU无缝运行,可以实现LearningToRank类型任务。 + + + +## 参考文献 + +[1] https://en.wikipedia.org/wiki/Learning_to_rank + +[2] T.Y. Liu, “Learning to rank for information retrieval,” Foundations and Trends in Information Retrieval, vol.3, no.3, pp.225–331, 2009. + +[3] H. Li, “Learning to rank for information retrieval and natural language processing,” Synthesis Lectures on Human Language Technologies, 2011, Morgan & Claypool Publishers. + +[4] Burges C, Shaked T, Renshaw E, et al. Learning to rank using gradient descent[C]// International Conference. DBLP, 2005:89-96. + +[5]Cao Z, Qin T, Liu T Y, et al. Learning to rank: from pairwise approach to listwise approach[C]// International Conference on Machine Learning. ACM, 2007:129-136. + +[6]Schölkopf B, Platt J, Hofmann T. Learning to Rank with Nonsmooth Cost Functions[C]// Conference on Advances in Neural Information Processing Systems. MIT Press, 2006:193-200. diff --git a/ltr/lambdaRank.py b/ltr/lambdaRank.py index f2301bf31d8586cde753f7436dfb9f821cd7d802..6d28be6249ea91dbad7940356a75511ece779bfb 100644 --- a/ltr/lambdaRank.py +++ b/ltr/lambdaRank.py @@ -1,99 +1,104 @@ import os, sys import gzip +import sqlite3 import paddle.v2 as paddle import numpy as np import functools -#lambdaRank is listwise learning to rank algorithm +#lambdaRank is listwise learning to rank model -def lambdaRank(feature_dim): - label = paddle.layer.data("label", paddle.data_type.integer_value_sequence(1)) - data = paddle.layer.data("data", paddle.data_type.dense_vector(feature_dim)) - # two hidden layers - hd1 = paddle.layer.fc( - name="/hidden_1", - input=data, - size=256, - act=paddle.activation.Tanh(), - param_attr=paddle.attr.Param(initial_std=0.01, name="hidden_w1")) - hd2 = paddle.layer.fc( - name="/hidden_2", - input=hd1, - size=256, - act=paddle.activation.Tanh(), - param_attr=paddle.attr.Param(initial_std=0.01, name="hidden_w2")) - output = paddle.layer.fc( - name="/output", - input=hd2, - size=1, - act=paddle.activation.Linear(), - param_attr=paddle.attr.Param(initial_std=0.01, name="output")) - cost = paddle.layer.lambda_cost(input=output, - score=label, - NDCG_num=10) - return cost, output - +def lambdaRank(input_dim): + label = paddle.layer.data("label", + paddle.data_type.dense_vector_sequence(1)) + data = paddle.layer.data("data", + paddle.data_type.dense_vector_sequence(input_dim)) + + # hidden layer + hd1 = paddle.layer.fc( + input=data, + size=10, + act=paddle.activation.Tanh(), + param_attr=paddle.attr.Param(initial_std=0.01)) + output = paddle.layer.fc( + input=hd1, + size=1, + act=paddle.activation.Linear(), + param_attr=paddle.attr.Param(initial_std=0.01)) + cost = paddle.layer.lambda_cost( + input=output, score=label, NDCG_num=6, max_sort_size=-1) + return cost, output + def train_lambdaRank(num_passes): - fill_default_train = functools.partial(paddle.dataset.mq2007.train, format="listwise") - fill_default_test = functools.partial(paddle.dataset.mq2007.test, format="listwise") - train_reader = paddle.batch( - paddle.reader.shuffle(fill_default_train, buf_size=1000), batch_size=1000) - test_reader = paddle.batch( - paddle.reader.shuffle(fill_default_test, buf_size=1000), batch_size=1000) + # listwise input sequence + fill_default_train = functools.partial( + paddle.dataset.mq2007.train, format="listwise") + fill_default_test = functools.partial( + paddle.dataset.mq2007.test, format="listwise") + train_reader = paddle.batch( + paddle.reader.shuffle(fill_default_train, buf_size=100), batch_size=32) + test_reader = paddle.batch( + paddle.reader.shuffle(fill_default_test, buf_size=100), batch_size=32) + + # mq2007 input_dim = 46, dense format + input_dim = 46 + cost, output = lambdaRank(input_dim) + parameters = paddle.parameters.create(cost) + + trainer = paddle.trainer.SGD( + cost=cost, + parameters=parameters, + update_equation=paddle.optimizer.Adam(learning_rate=1e-4)) - # mq2007 feature_dim = 46, dense format - # fc hidden_dim = 128 - feature_dim = 46 - cost, output = lambdaRank(feature_dim) - parameters = paddle.parameters.create(cost) - - trainer = paddle.trainer.SGD( - cost=cost, - parameters=parameters, - update_equation=paddle.optimizer.Adam(learning_rate=1e-4) - ) + # Define end batch and end pass event handler + def event_handler(event): + if isinstance(event, paddle.event.EndIteration): + print "Pass %d Batch %d Cost %.9f" % (event.pass_id, event.batch_id, + event.cost) + if isinstance(event, paddle.event.EndPass): + result = trainer.test(reader=test_reader, feeding=feeding) + print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) + with gzip.open("lambdaRank_params_%d.tar.gz" % (event.pass_id), + "w") as f: + parameters.to_tar(f) + + feeding = {"label": 0, "data": 1} + trainer.train( + reader=train_reader, + event_handler=event_handler, + feeding=feeding, + num_passes=num_passes) - # Define end batch and end pass event handler - def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - if event.batch_id % 100 == 0: - print "Pass %d Batch %d Cost %.9f" % ( - event.pass_id, event.batch_id, event.cost) - else: - sys.stdout.write(".") - sys.stdout.flush() - if isinstance(event, paddle.event.EndPass): - result = trainer.test(reader=test_reader, feeding=feeding) - print "\nTest with Pass %d, %s" %(event.pass_id, result.metrics) - with gzip.open("lambdaRank_params_%d.tar.gz" %(event.pass_id), "w") as f: - parameters.to_tar(f) - feeding = {"label":0, - "data": 1} - trainer.train(reader=train_reader, - event_handler=event_handler, - feeding=feeding, - num_passes=num_passes) def lambdaRank_infer(pass_id): - print "Begin to Infer..." - feature_dim = 46 - output = lambdaRnak(feature_dim) - parameters = paddle.parameters.Parameters.from_tar(gzip.open("lambdaRank_params_%d.tar.gz" %(pass_id-1))) - infer_data = [] - infer_data_num = 1000 - for label, left, right in paddle.dataset.mq2007.test(): - infer_data.append(left) - if len(infer_data) == infer_data_num: - break - predicitons = paddle.infer(output_layer=output, - parameters=parameters, - input=infer_data) - for i, score in enumerate(predicitons): - print score + """ + lambdaRank model inference interface + parameters: + pass_id : inference model in pass_id + """ + print "Begin to Infer..." + input_dim = 46 + output = lambdaRank(input_dim) + parameters = paddle.parameters.Parameters.from_tar( + gzip.open("lambdaRank_params_%d.tar.gz" % (pass_id - 1))) + + infer_query_id = None + infer_data = [] + infer_data_num = 1000 + fill_default_test = functools.partial( + paddle.dataset.mq2007.test, format="listwise") + for label, querylist in fill_default_test(): + infer_data.append(querylist) + if len(infer_data) == infer_data_num: + break + predicitons = paddle.infer( + output_layer=output, parameters=parameters, input=infer_data) + for i, score in enumerate(predicitons): + print score + if __name__ == '__main__': - paddle.init(use_gpu=False, trainer_count=4) - train_lambdaRank(2) - lambdaRank_infer(pass_id=2) + paddle.init(use_gpu=False, trainer_count=4) + train_lambdaRank(100) + lambdaRank_infer(pass_id=2) diff --git a/ltr/lambdarank.jpg b/ltr/lambdarank.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5a0b0da90eb077d63608c004ff346e7a49a467db Binary files /dev/null and b/ltr/lambdarank.jpg differ diff --git a/ltr/learningToRank.jpg b/ltr/learningToRank.jpg new file mode 100644 index 0000000000000000000000000000000000000000..54083d1143dff90a3c8ff06a77d3ea354fa621ef Binary files /dev/null and b/ltr/learningToRank.jpg differ diff --git a/ltr/metrics.py b/ltr/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..bcf6e939db4c5886856083b741be43b961f88944 --- /dev/null +++ b/ltr/metrics.py @@ -0,0 +1,30 @@ +import numpy as np +import unittest + + +def ndcg(score_list): + def dcg(score_list): + n = len(score_list) + cost = .0 + for i in range(n): + cost += float(score_list[i]) / np.log((i + 1) + 1) + return cost + + dcg_cost = dcg(score_list) + score_ranking = sorted(score_list, reverse=True) + ideal_cost = dcg(score_ranking) + return dcg_cost / ideal_cost + + +class NdcgTest(unittest.TestCase): + def __init__(self): + pass + + def runcase(self): + a = [3, 2, 3, 0, 1, 2] + value = ndcg(a) + self.assertAlmostEqual(0.961, value, places=3) + + +if __name__ == '__main__': + unittest.main() diff --git a/ltr/ranknet.jpg b/ltr/ranknet.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9767cec96cd2386218df261cf4d78e288fecbb3d Binary files /dev/null and b/ltr/ranknet.jpg differ diff --git a/ltr/ranknet.py b/ltr/ranknet.py index cf7593095ec8f6ebdb490c3e9771d87f9ffd6a66..e849d1281a72b0ecb4f782dcab6358f7ec39ece3 100644 --- a/ltr/ranknet.py +++ b/ltr/ranknet.py @@ -1,104 +1,135 @@ import os, sys import gzip +import functools import paddle.v2 as paddle import numpy as np +from metrics import ndcg # ranknet is the classic pairwise learning to rank algorithm # http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf + def half_ranknet(name_prefix, input_dim): + """ + parameter in same name will be shared in paddle framework, + these parameters in ranknet can be used in shared state, e.g. left network and right network + shared parameters in detail + https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/api.md + """ + # data layer + data = paddle.layer.data(name_prefix + "/data", + paddle.data_type.dense_vector(input_dim)) - data = paddle.layer.data(name_prefix+"/data", paddle.data_type.dense_vector(input_dim)) - - # two hidden layers - hd1 = paddle.layer.fc( - name=name_prefix+"/hidden_1", - input=data, - size=32, - act=paddle.activation.Tanh(), - param_attr=paddle.attr.Param(initial_std=0.01, name="hidden_w1")) - hd2 = paddle.layer.fc( - name=name_prefix+"/hidden_2", - input=hd1, - size=16, - act=paddle.activation.Tanh(), - param_attr=paddle.attr.Param(initial_std=0.01, name="hidden_w2")) - output = paddle.layer.fc( - name=name_prefix+"/output", - input=hd2, - size=1, - act=paddle.activation.Linear(), - param_attr=paddle.attr.Param(initial_std=0.01, name="output")) - return output + # hidden layer + hd1 = paddle.layer.fc( + input=data, + size=10, + act=paddle.activation.Tanh(), + param_attr=paddle.attr.Param(initial_std=0.01, name="hidden_w1")) + # fully connect layer/ output layer + output = paddle.layer.fc( + input=hd1, + size=1, + act=paddle.activation.Linear(), + param_attr=paddle.attr.Param(initial_std=0.01, name="output")) + return output def ranknet(input_dim): - label = paddle.layer.data("label", paddle.data_type.integer_value(1)) - output_left = half_ranknet("left", input_dim) - output_right = half_ranknet("right", input_dim) - cost = paddle.layer.rank_cost(name="cost", left=output_left, right=output_right, label=label) - return cost + # label layer + label = paddle.layer.data("label", paddle.data_type.integer_value(1)) + + # reuse the parameter in half_ranknet + output_left = half_ranknet("left", input_dim) + output_right = half_ranknet("right", input_dim) + + evaluator = paddle.evaluator.auc(input=output_left, label=label) + # rankcost layer + cost = paddle.layer.rank_cost( + name="cost", left=output_left, right=output_right, label=label) + return cost def train_ranknet(num_passes): - train_reader = paddle.batch( - paddle.reader.shuffle(paddle.dataset.mq2007.train, buf_size=1000), batch_size=1000) - test_reader = paddle.batch( - paddle.reader.shuffle(paddle.dataset.mq2007.test, buf_size=1000), batch_size=1000) - - # mq2007 feature_dim = 46, dense format - # fc hidden_dim = 128 - feature_dim = 46 - cost = ranknet(feature_dim) - parameters = paddle.parameters.create(cost) - - trainer = paddle.trainer.SGD( - cost=cost, - parameters=parameters, - update_equation=paddle.optimizer.Adam(learning_rate=2e-4) - ) - - # Define end batch and end pass event handler - def event_handler(event): - if isinstance(event, paddle.event.EndIteration): - if event.batch_id % 100 == 0: - print "Pass %d Batch %d Cost %.9f" % ( - event.pass_id, event.batch_id, event.cost) - else: - sys.stdout.write(".") - sys.stdout.flush() - if isinstance(event, paddle.event.EndPass): - result = trainer.test(reader=test_reader, feeding=feeding) - print "\nTest with Pass %d, %s" %(event.pass_id, result.metrics) - with gzip.open("ranknet_params_%d.tar.gz" %(event.pass_id), "w") as f: - parameters.to_tar(f) - feeding = {"label":0, - "left/data" :1, - "right/data":2} - trainer.train(reader=train_reader, - event_handler=event_handler, - feeding=feeding, - num_passes=num_passes) + train_reader = paddle.batch( + paddle.reader.shuffle(paddle.dataset.mq2007.train, buf_size=100), + batch_size=100) + test_reader = paddle.batch( + paddle.reader.buffered(paddle.dataset.mq2007.test, size=100), + batch_size=100) + + # mq2007 feature_dim = 46, dense format + # fc hidden_dim = 128 + feature_dim = 46 + cost = ranknet(feature_dim) + parameters = paddle.parameters.create(cost) + + trainer = paddle.trainer.SGD( + cost=cost, + parameters=parameters, + update_equation=paddle.optimizer.Adam(learning_rate=2e-4)) + + # Define the input data order + feeding = {"label": 0, "left/data": 1, "right/data": 2} + + # Define end batch and end pass event handler + def event_handler(event): + if isinstance(event, paddle.event.EndIteration): + if event.batch_id % 100 == 0: + print "Pass %d Batch %d Cost %.9f" % ( + event.pass_id, event.batch_id, event.cost) + else: + sys.stdout.write(".") + sys.stdout.flush() + if isinstance(event, paddle.event.EndPass): + result = trainer.test(reader=test_reader, feeding=feeding) + print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) + with gzip.open("ranknet_params_%d.tar.gz" % (event.pass_id), + "w") as f: + parameters.to_tar(f) + + trainer.train( + reader=train_reader, + event_handler=event_handler, + feeding=feeding, + num_passes=num_passes) + def ranknet_infer(pass_id): - print "Begin to Infer..." - feature_dim = 46 - output = half_ranknet("left", feature_dim) - parameters = paddle.parameters.Parameters.from_tar(gzip.open("ranknet_params_%d.tar.gz" %(pass_id-1))) - infer_data = [] - infer_data_num = 1000 - for label, left, right in paddle.dataset.mq2007.test(): - infer_data.append(left) - if len(infer_data) == infer_data_num: - break - predicitons = paddle.infer(output_layer=output, - parameters=parameters, - input=infer_data) - for i, score in enumerate(predicitons): - print score - + """ + load the trained model. And predict with plain txt input + """ + print "Begin to Infer..." + feature_dim = 46 + + # we just need half_ranknet to predict a rank score, which can be used in sort documents + output = half_ranknet("left", feature_dim) + parameters = paddle.parameters.Parameters.from_tar( + gzip.open("ranknet_params_%d.tar.gz" % (pass_id - 1))) + + # load data of same query and relevance documents, need ranknet to rank these candidates + infer_query_id = None + infer_data = [] + infer_score_list = [] + infer_data_num = 1000 + + # convert to mq2007 built-in data format + # + plain_txt_test = functools.partial( + paddle.dataset.mq2007.test, format="plain_txt") + + for query_id, relevance_score, feature_vector in plain_txt_test(): + if infer_query_id == None: + infer_query_id = query_id + elif infer_query_id != query_id: + break + infer_data.append(feature_vector) + predicitons = paddle.infer( + output_layer=output, parameters=parameters, input=infer_data) + if __name__ == '__main__': - paddle.init(use_gpu=False, trainer_count=4) - train_ranknet(2) - ranknet_infer(pass_id=2) + paddle.init(use_gpu=False, trainer_count=4) + pass_num = 10 + train_ranknet(pass_num) + ranknet_infer(pass_id=pass_num - 1) diff --git a/ltr/search-engine-example.png b/ltr/search-engine-example.png new file mode 100644 index 0000000000000000000000000000000000000000..36386085f45794f51d5cae79b3f9dd6ba004ded8 Binary files /dev/null and b/ltr/search-engine-example.png differ