add cross feature into model input

c126c520 · Superjom · b0e7d38f
Showing 2 changed files with 56 additions and 37 deletions

ctr/README.org +27 -26

ctr/data_provider.py +29 -11

--- a/ctr/README.org
+++ b/ctr/README.org
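@@ -4,18 +4,18 @@ CTR (click-through rate) is the probability that a user clicks a specific link
 and it is commonly used to measure the effectiveness of an online advertising system.
 When there are multiple ad slots, the CTR estimate generally serves as the basis for ranking.
-For example, in a search engine's advertising system, when a user enters a query with commercial value, the system roughly performs the following steps:
+For example, in a search engine's advertising system, when a user enters a query with commercial value, the system roughly performs the following steps to display ads:
 1. Recall the set of ads that match the query
 2. Filter by business rules and relevance
 3. Rank according to the auction mechanism and CTR
-4. Display
+4. Display the ads
 As shown, CTR plays an important role in the final ranking.
 In industry, CTR models have gone through the following stages of development:
- Logistic Regression(LR) + feature engineering
+- Logistic Regression(LR) / GBDT + feature engineering
 - LR + DNN features
 - DNN + feature engineering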
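@@ -25,42 +25,44 @@ CTR (click-through rate) is the probability that a user clicks a specific link
 ** LR vs DNN
 The figure below shows the structure of LR and of a \(3 \times 2\) NN model:
-[[./img/lr-vs-dnn.jpg]]
+[[./images/lr-vs-dnn.jpg]]
 The LR part and the blue-arrow part map directly onto structures in the NN, which shows that LR and NN have something in common (for example, weighted summation),
-but for the same input dimension the model complexity of the former can be much low than that of the latter (in a sense, the more complex a model is, the more potential it has to learn more complex information).
+but for the same input dimension the model complexity of the former can be much lower than that of the latter (in a sense, the more complex a model is, the more potential it has to learn more complex information).
-For LR to match the learning capacity of an NN, the input dimension must be increased, that is, the number of features (as input) must grow,
+For LR to match the learning capacity of an NN, the input dimension must be increased, that is, the number of features must grow,
 which is why LR is necessarily tied to large-scale feature engineering.
+The advantage of LR over NN models is its capacity to accommodate large-scale sparse features; in terms of memory, computation, and other aspects, industry has very mature optimization methods.
 NN models, in contrast, can learn new features on their own, which to some extent improves how efficiently features are used,
 so with the same scale of features an NN model is more likely to achieve better learning results.
-The advantage of LR over NN models is its capacity to accommodate large-scale sparse features; for memory, computation, and so on, industry has very mature optimization methods.
 Later sections of this document demonstrate how to use PaddlePaddle to write a model that combines the strengths of both.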
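 * Data and task abstraction
-We can take `click` as the learning target; the concrete task can be set up in the following ways:
+We can take ~click~ as the learning target; the concrete task can be set up in the following ways:
-1. Learn click directly, treating 0/1 as binary classification, or as pairwise rank (label 1>0)
+1. Learn click directly, treating 0/1 as binary classification
-2. Compute the click-through rate of each ad and pair up the ads under the same query, with higher CTR > lower CTR
+2. Learning to rank, specifically pairwise rank (label 1>0) or list rank
+3. Compute the click-through rate of each ad and pair up the ads under the same query, with higher CTR > lower CTR, then do ranking or classification
-Here, we directly use the first approach and treat it as a classification task.
+We directly use the first approach and treat it as a classification task.
-We use the dataset[1] of the `Click-through rate prediction` task on Kaggle to demonstrate the model.
+We use the dataset[3] of the ~Click-through rate prediction~ task on Kaggle to demonstrate the model.
 For the specific feature processing methods, see [[./dataset.md][data process]]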
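 * Wide & Deep Learning Model
-In 2016 Google proposed the Wide & Deep Learning model framework, which combines the strengths of two kinds of models: DNN, suited to learning abstract features, and LR, suited to large-scale coefficient features.
+In 2016 Google proposed the Wide & Deep Learning model framework, which combines the strengths of two kinds of models: DNN, suited to learning abstract features, and LR, suited to large-scale sparse features.
 ** Model overview
 The Wide & Deep Learning Model can be used as a relatively mature model framework,
 and it also sees some use in industry for CTR estimation, so this document demonstrates using it to complete the CTR estimation task.
 The model structure is as follows:
-[[./img/wide-deep.png]]
+[[./images/wide-deep.png]]
 The Wide part on the left of the model can accommodate large-scale sparse features and has some capacity to memorize specific information (such as IDs);
 the Deep part on the right can learn implicit relationships between features and, with the same number of features, has better learning and inference ability.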
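@@ -68,9 +70,9 @@ The advantage of LR over NN models is its capacity to accommodate large-scale sparse features, including
 The model accepts only 3 inputs:
- `dnn_input`, the input of the Deep part
+- ~dnn_input~, the input of the Deep part
- `lr_input`, the input of the Wide part
+- ~lr_input~, the input of the Wide part
- `click`, whether a click occurred, used as the label for the binary classification model
+- ~click~, whether a click occurred, used as the label for the binary classification model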
 #+BEGIN_SRC python
  dnn_merged_input = layer.data(
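@@ -86,6 +88,7 @@ The advantage of LR over NN models is its capacity to accommodate large-scale sparse features, including
 ** Writing the Wide part
+The Wide part directly uses an LR model, but the activation function is changed to ~RELU~ for speed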
   #+BEGIN_SRC python
     def build_lr_submodel():
         fc = layer.fc(
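@@ -94,7 +97,7 @@ The advantage of LR over NN models is its capacity to accommodate large-scale sparse features, including
   #+END_SRC
 ** Writing the Deep part
+The Deep part uses a standard multi-layer feed-forward NN model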
   #+BEGIN_SRC python
     def build_dnn_submodel(dnn_layer_dims):
         dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
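@@ -109,7 +112,8 @@ The advantage of LR over NN models is its capacity to accommodate large-scale sparse features, including
         return _input_layer
   #+END_SRC
 ** Combining the two
+The top-layer outputs of the two submodels are combined by a weighted sum to produce the output of the whole model; the output uses ~sigmoid~ as the activation function, giving a prediction in the interval \((0,1)\),
+which approximates the distribution of the binary classes in the training data and is ultimately used as the CTR estimate.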
   #+BEGIN_SRC python
     # combine DNN and LR submodels
     def combine_submodels(dnn, lr):
@@ -165,13 +169,10 @@ The advantage of LR over NN models is its capacity to accommodate large-scale sparse features, including
         feeding=field_index,
         event_handler=event_handler,
         num_passes=100)
   #+END_SRC
-* Closing remarks
+* References
 - [1] https://en.wikipedia.org/wiki/Click-through_rate
- [2] Strategies for Training Large Scale Neural Network Language Models
+- [2] Mikolov, Tomáš, et al. "Strategies for training large scale neural network language models." Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. IEEE, 2011.
- https://www.kaggle.com/c/avazu-ctr-prediction/data
+- [3] https://www.kaggle.com/c/avazu-ctr-prediction/data
+- [4] Cheng, Heng-Tze, et al. "Wide & deep learning for recommender systems." Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016.
-[1] https://www.kaggle.com/c/avazu-ctr-prediction/data
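Note on the README change above: the fusion step it describes (a weighted sum of the two submodels' top-layer outputs, followed by sigmoid) can be sketched in plain Python. This is only an illustration under stated assumptions, not the PaddlePaddle code from the commit; `combine_outputs`, the weights, and the sample values are all hypothetical.

```python
import math


def sigmoid(x):
    # Squash a real-valued score into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def combine_outputs(dnn_out, lr_out, w_dnn, w_lr, bias=0.0):
    # Hypothetical helper: weighted sum of the top-layer outputs of the
    # Deep (DNN) and Wide (LR) submodels, then a sigmoid activation.
    score = bias
    score += sum(w * v for w, v in zip(w_dnn, dnn_out))
    score += sum(w * v for w, v in zip(w_lr, lr_out))
    return sigmoid(score)


# Made-up outputs and weights, for illustration only.
ctr = combine_outputs([0.5, -1.0], [1.0], w_dnn=[0.3, 0.1], w_lr=[0.8])
print(ctr)
```

Because the final activation is a sigmoid, the result always lies strictly between 0 and 1 and can be read as a click probability.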
--- a/ctr/data_provider.py
+++ b/ctr/data_provider.py
@@ -47,7 +47,7 @@ feature_dims = {}
 categorial_features = ('C1 banner_pos site_category app_category ' +
                       'device_type device_conn_type').split()
-id_features = 'id site_id app_id device_id'.split()
+id_features = 'id site_id app_id device_id _device_id_cross_site_id'.split()
 def get_all_field_names(mode=0):
@@ -98,12 +98,14 @@ class CategoryFeatureGenerator(object):
 class IDfeatureGenerator(object):
-    def __init__(self, max_dim):
+    def __init__(self, max_dim, cross_fea0=None, cross_fea1=None):
        '''
        @max_dim: int
            Size of the id elements' space
        '''
        self.max_dim = max_dim
+        self.cross_fea0 = cross_fea0
+        self.cross_fea1 = cross_fea1
    def gen(self, key):
        '''
@@ -134,19 +136,27 @@ class ContinuousFeatureGenerator(object):
        return (val - self.min) / self.len_part
+# init all feature generators
 fields = {}
 for key in categorial_features:
    fields[key] = CategoryFeatureGenerator()
 for key in id_features:
-    fields[key] = IDfeatureGenerator(10000)
+    # for cross features
+    if 'cross' in key:
+        feas = key[1:].split('_cross_')
+        fields[key] = IDfeatureGenerator(10000000, *feas)
+    # for normal ID features
+    else:
+        fields[key] = IDfeatureGenerator(10000)
+# used as feed_dict in PaddlePaddle
 field_index = dict(
    (key, id) for id, key in enumerate(['dnn_input', 'lr_input', 'click']))
 def detect_dataset(path, topn, id_fea_space=10000):
    '''
-    Parse the first `topn` records to collect information of this dataset.
+    Parse the first `topn` records to collect meta information of this dataset.
    NOTE the records should be randomly shuffled first.
    '''
@@ -164,23 +174,23 @@ def detect_dataset(path, topn, id_fea_space=10000):
    for key, item in fields.items():
        feature_dims[key] = item.size()
-    for key in id_features:
+    #for key in id_features:
-        feature_dims[key] = id_fea_space
+        #feature_dims[key] = id_fea_space
    feature_dims['hour'] = 24
    feature_dims['click'] = 1
    feature_dims['dnn_input'] = np.sum(
-        feature_dims[key] for key in categorial_features + ['hour']) + 10
+        feature_dims[key] for key in categorial_features + ['hour']) + 1
    feature_dims['lr_input'] = np.sum(feature_dims[key]
-                                      for key in id_features) + 10
+                                      for key in id_features) + 1
    return feature_dims
 def concat_sparse_vectors(inputs, dims):
    '''
-    concatenate sparse vectors into one
+    Concatenate multiple sparse vectors into one.
    @inputs: list
        list of sparse vector
@@ -198,6 +208,9 @@ def concat_sparse_vectors(inputs, dims):
 class AvazuDataset(object):
+    '''
+    Load AVAZU dataset as train set.
+    '''
    TRAIN_MODE = 0
    TEST_MODE = 1
@@ -239,7 +252,12 @@ class AvazuDataset(object):
                record = []
                for key in id_features:
-                    record.append(fields[key].gen(row[key]))
+                    if 'cross' not in key:
+                        record.append(fields[key].gen(row[key]))
+                    else:
+                        fea0 = fields[key].cross_fea0
+                        fea1 = fields[key].cross_fea1
+                        record.append(fields[key].gen_cross_fea(row[fea0], row[fea1]))
                sparse_input = concat_sparse_vectors(record, id_dims)
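
Note: the hunks above call `gen_cross_fea` on `IDfeatureGenerator`, but its body is not shown in this diff. The sketch below is one possible way to map a pair of field values into a shared ID space. It is an assumption, not the commit's implementation: `parse_cross_key`, `stable_hash`, and `cross_feature_id` are hypothetical names, and MD5 is used here only to get a hash that is stable across Python processes.

```python
import hashlib


def parse_cross_key(key):
    # Same parsing as in the diff: drop the leading underscore and split
    # on '_cross_' to recover the two source field names.
    return key[1:].split('_cross_')


def stable_hash(text, max_dim):
    # Deterministic hash into [0, max_dim); the built-in hash() of str
    # is randomized per process in Python 3, so MD5 is used instead.
    digest = hashlib.md5(text.encode('utf-8')).hexdigest()
    return int(digest, 16) % max_dim


def cross_feature_id(val0, val1, max_dim=10000000):
    # Hypothetical stand-in for gen_cross_fea: join the two raw values
    # and hash the pair into the cross-feature ID space.
    return stable_hash('%s_x_%s' % (val0, val1), max_dim)


feas = parse_cross_key('_device_id_cross_site_id')
idx = cross_feature_id('device_a', 'site_b')
print(feas, idx)
```

The cross feature gets a much larger space in the diff (10000000 versus 10000 for plain ID features), presumably because the number of distinct value pairs is far larger than the number of distinct single values, so a bigger space keeps hash collisions down.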