diff --git a/ctr/README.md b/ctr/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1cd8c369857b096bb88f463201e4226d6dc16dbd --- /dev/null +++ b/ctr/README.md @@ -0,0 +1,236 @@ +# CTR预估 + +## 背景介绍 + +CTR(Click-Through Rate)\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\] 是用来表示用户点击一个特定链接的概率, +通常被用来衡量一个在线广告系统的有效性。 + +当有多个广告位时,CTR 预估一般会作为排序的基准。 +比如在搜索引擎的广告系统里,当用户输入一个带商业价值的搜索词(query)时,系统大体上会执行下列步骤来展示广告: + +1. 召回满足 query 的广告集合 +2. 业务规则和相关性过滤 +3. 根据拍卖机制和 CTR 排序 +4. 展出广告 + +可以看到,CTR 在最终排序中起到了很重要的作用。 + +### 发展阶段 +在业内,CTR 模型经历了如下的发展阶段: + +- Logistic Regression(LR) / GBDT + 特征工程 +- LR + DNN 特征 +- DNN + 特征工程 + +在发展早期时 LR 一统天下,但最近 DNN 模型由于其强大的学习能力和逐渐成熟的性能优化, +逐渐地接过 CTR 预估任务的大旗。 + + +### LR vs DNN + +下图展示了 LR 和一个 \(3x2\) 的 DNN 模型的结构: + +

+<p align="center">
+<img src="images/lr_vs_dnn.jpg">
+</p>
+<p align="center">
+Figure 1. Comparison of LR and DNN model structures
+</p>
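+
+The comparison above can be written down directly with the paddle.v2 layer API used throughout the rest of this document. The sketch below is illustrative only: the input dimension is a made-up number, and the two hidden layers of sizes 3 and 2 simply mirror the \(3x2\) structure in Figure 1.
+
+```python
+import paddle.v2 as paddle
+from paddle.v2 import layer
+
+paddle.init(use_gpu=False, trainer_count=1)
+
+# a hypothetical sparse binary input with 784 dimensions
+x = layer.data(name='x', type=paddle.data_type.sparse_binary_vector(784))
+
+# LR: a single weighted sum of the inputs squashed by sigmoid
+lr_out = layer.fc(input=x, size=1, act=paddle.activation.Sigmoid())
+
+# DNN: the same kind of output unit, but fed by two hidden layers
+h1 = layer.fc(input=x, size=3, act=paddle.activation.Relu())
+h2 = layer.fc(input=h1, size=2, act=paddle.activation.Relu())
+dnn_out = layer.fc(input=h2, size=1, act=paddle.activation.Sigmoid())
+```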

+ +LR 的蓝色箭头部分可以直接类比到 DNN 中对应的结构,可以看到 LR 和 DNN 有一些共通之处(比如权重累加), +但前者的模型复杂度在相同输入维度下比后者可能低很多(从某方面讲,模型越复杂,越有潜力学习到更复杂的信息)。 + +如果 LR 要达到匹敌 DNN 的学习能力,必须增加输入的维度,也就是增加特征的数量, +这也就是为何 LR 和大规模的特征工程必须绑定在一起的原因。 + +LR 对于 DNN 模型的优势是对大规模稀疏特征的容纳能力,包括内存和计算量等方面,工业界都有非常成熟的优化方法。 + +而 DNN 模型具有自己学习新特征的能力,一定程度上能够提升特征使用的效率, +这使得 DNN 模型在同样规模特征的情况下,更有可能达到更好的学习效果。 + +本文后面的章节会演示如何使用 PaddlePaddle 编写一个结合两者优点的模型。 + + +## 数据和任务抽象 + +我们可以将 `click` 作为学习目标,任务可以有以下几种方案: + +1. 直接学习 click,0,1 作二元分类 +2. Learning to rank, 具体用 pairwise rank(标签 1>0)或者 listwise rank +3. 统计每个广告的点击率,将同一个 query 下的广告两两组合,点击率高的>点击率低的,做 rank 或者分类 + +我们直接使用第一种方法做分类任务。 + +我们使用 Kaggle 上 `Click-through rate prediction` 任务的数据集\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] 来演示模型。 + +具体的特征处理方法参看 [data process](./dataset.md) + + +## Wide & Deep Learning Model + +谷歌在 16 年提出了 Wide & Deep Learning 的模型框架,用于融合适合学习抽象特征的 DNN 和 适用于大规模稀疏特征的 LR 两种模型的优点。 + + +### 模型简介 + +Wide & Deep Learning Model\[[3](#参考文献)\] 可以作为一种相对成熟的模型框架使用, +在 CTR 预估的任务中工业界也有一定的应用,因此本文将演示使用此模型来完成 CTR 预估的任务。 + +模型结构如下: + +

+<p align="center">
+<img src="images/wide_deep.png">
+</p>
+<p align="center">
+Figure 2. Wide & Deep Model
+</p>
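+
+Following the original paper \[[3](#参考文献)\], the wide part and the deep part are trained jointly and feed a single logistic output unit; for a binary click label the prediction takes roughly the form
+
+\[ P(click = 1 \mid x) = \sigma\left(w_{wide}^T x_{wide} + w_{deep}^T a^{(l_f)} + b\right) \]
+
+where \(x_{wide}\) is the sparse wide input, \(a^{(l_f)}\) is the activation of the last hidden layer of the DNN, \(b\) is the bias and \(\sigma\) is the sigmoid function. The code later in this document approximates this by concatenating the outputs of the two submodels and passing them through one more fully connected layer with a sigmoid activation.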

+ +模型左边的 Wide 部分,可以容纳大规模系数特征,并且对一些特定的信息(比如 ID)有一定的记忆能力; +而模型右边的 Deep 部分,能够学习特征间的隐含关系,在相同数量的特征下有更好的学习和推导能力。 + + +### 编写模型输入 + +模型只接受 3 个输入,分别是 + +- `dnn_input` ,也就是 Deep 部分的输入 +- `lr_input` ,也就是 Wide 部分的输入 +- `click` , 点击与否,作为二分类模型学习的标签 + +```python +dnn_merged_input = layer.data( + name='dnn_input', + type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input'])) + +lr_merged_input = layer.data( + name='lr_input', + type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input'])) + +click = paddle.layer.data(name='click', type=dtype.dense_vector(1)) +``` + +### 编写 Wide 部分 + +Wide 部分直接使用了 LR 模型,但激活函数改成了 `RELU` 来加速 + +```python +def build_lr_submodel(): + fc = layer.fc( + input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu()) + return fc +``` + +### 编写 Deep 部分 + +Deep 部分使用了标准的多层前向传导的 DNN 模型 + +```python +def build_dnn_submodel(dnn_layer_dims): + dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0]) + _input_layer = dnn_embedding + for i, dim in enumerate(dnn_layer_dims[1:]): + fc = layer.fc( + input=_input_layer, + size=dim, + act=paddle.activation.Relu(), + name='dnn-fc-%d' % i) + _input_layer = fc + return _input_layer +``` + +### 两者融合 + +两个 submodel 的最上层输出加权求和得到整个模型的输出,输出部分使用 `sigmoid` 作为激活函数,得到区间 (0,1) 的预测值, +来逼近训练数据中二元类别的分布,并最终作为 CTR 预估的值使用。 + +```python +# conbine DNN and LR submodels +def combine_submodels(dnn, lr): + merge_layer = layer.concat(input=[dnn, lr]) + fc = layer.fc( + input=merge_layer, + size=1, + name='output', + # use sigmoid function to approximate ctr, wihch is a float value between 0 and 1. + act=paddle.activation.Sigmoid()) + return fc +``` + +### 训练任务的定义 +```python +dnn = build_dnn_submodel(dnn_layer_dims) +lr = build_lr_submodel() +output = combine_submodels(dnn, lr) + +# ============================================================================== +# cost and train period +# ============================================================================== +classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost( + input=output, label=click) + + +paddle.init(use_gpu=False, trainer_count=11) + +params = paddle.parameters.create(classification_cost) + +optimizer = paddle.optimizer.Momentum(momentum=0) + +trainer = paddle.trainer.SGD( + cost=classification_cost, parameters=params, update_equation=optimizer) + +dataset = AvazuDataset(train_data_path, n_records_as_test=test_set_size) + +def event_handler(event): + if isinstance(event, paddle.event.EndIteration): + if event.batch_id % 100 == 0: + logging.warning("Pass %d, Samples %d, Cost %f" % ( + event.pass_id, event.batch_id * batch_size, event.cost)) + + if event.batch_id % 1000 == 0: + result = trainer.test( + reader=paddle.batch(dataset.test, batch_size=1000), + feeding=field_index) + logging.warning("Test %d-%d, Cost %f" % (event.pass_id, event.batch_id, + result.cost)) + + +trainer.train( + reader=paddle.batch( + paddle.reader.shuffle(dataset.train, buf_size=500), + batch_size=batch_size), + feeding=field_index, + event_handler=event_handler, + num_passes=100) +``` +## 运行训练和测试 +训练模型需要如下步骤: + +1. 下载训练数据,可以使用 Kaggle 上 CTR 比赛的数据\[[2](#参考文献)\] + 1. 从 [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data) 下载 train.gz + 2. 解压 train.gz 得到 train.txt +2. 
执行 `python train.py --train_data_path train.txt` ,开始训练 + +上面第2个步骤可以为 `train.py` 填充命令行参数来定制模型的训练过程,具体的命令行参数及用法如下 + +``` +usage: train.py [-h] --train_data_path TRAIN_DATA_PATH + [--batch_size BATCH_SIZE] [--test_set_size TEST_SET_SIZE] + [--num_passes NUM_PASSES] + [--num_lines_to_detact NUM_LINES_TO_DETACT] + +PaddlePaddle CTR example + +optional arguments: + -h, --help show this help message and exit + --train_data_path TRAIN_DATA_PATH + path of training dataset + --batch_size BATCH_SIZE + size of mini-batch (default:10000) + --test_set_size TEST_SET_SIZE + size of the validation dataset(default: 10000) + --num_passes NUM_PASSES + number of passes to train + --num_lines_to_detact NUM_LINES_TO_DETACT + number of records to detect dataset's meta info +``` + +## 参考文献 +1. +2. +3. Cheng H T, Koc L, Harmsen J, et al. [Wide & deep learning for recommender systems](https://arxiv.org/pdf/1606.07792.pdf)[C]//Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016: 7-10. diff --git a/ctr/data_provider.py b/ctr/data_provider.py new file mode 100644 index 0000000000000000000000000000000000000000..f02d3d33e75163cf772921ef54729a3fc8da022b --- /dev/null +++ b/ctr/data_provider.py @@ -0,0 +1,277 @@ +import sys +import csv +import numpy as np +''' +The fields of the dataset are: + + 0. id: ad identifier + 1. click: 0/1 for non-click/click + 2. hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC. + 3. C1 -- anonymized categorical variable + 4. banner_pos + 5. site_id + 6. site_domain + 7. site_category + 8. app_id + 9. app_domain + 10. app_category + 11. device_id + 12. device_ip + 13. device_model + 14. device_type + 15. device_conn_type + 16. C14-C21 -- anonymized categorical variables + +We will treat following fields as categorical features: + + - C1 + - banner_pos + - site_category + - app_category + - device_type + - device_conn_type + +and some other features as id features: + + - id + - site_id + - app_id + - device_id + +The `hour` field will be treated as a continuous feature and will be transformed +to one-hot representation which has 24 bits. +''' + +feature_dims = {} + +categorial_features = ('C1 banner_pos site_category app_category ' + + 'device_type device_conn_type').split() + +id_features = 'id site_id app_id device_id _device_id_cross_site_id'.split() + + +def get_all_field_names(mode=0): + ''' + @mode: int + 0 for train, 1 for test + @return: list of str + ''' + return categorial_features + ['hour'] + id_features + ['click'] \ + if mode == 0 else [] + + +class CategoryFeatureGenerator(object): + ''' + Generator category features. + + Register all records by calling `register` first, then call `gen` to generate + one-hot representation for a record. + ''' + + def __init__(self): + self.dic = {'unk': 0} + self.counter = 1 + + def register(self, key): + ''' + Register record. + ''' + if key not in self.dic: + self.dic[key] = self.counter + self.counter += 1 + + def size(self): + return len(self.dic) + + def gen(self, key): + ''' + Generate one-hot representation for a record. 
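+
+        Keys that were never registered fall back to the index reserved
+        for 'unk' (0), so unseen categories do not raise errors.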
+ ''' + if key not in self.dic: + res = self.dic['unk'] + else: + res = self.dic[key] + return [res] + + def __repr__(self): + return '' % len(self.dic) + + +class IDfeatureGenerator(object): + def __init__(self, max_dim, cross_fea0=None, cross_fea1=None): + ''' + @max_dim: int + Size of the id elements' space + ''' + self.max_dim = max_dim + self.cross_fea0 = cross_fea0 + self.cross_fea1 = cross_fea1 + + def gen(self, key): + ''' + Generate one-hot representation for records + ''' + return [hash(key) % self.max_dim] + + def gen_cross_fea(self, fea1, fea2): + key = str(fea1) + str(fea2) + return self.gen(key) + + def size(self): + return self.max_dim + + +class ContinuousFeatureGenerator(object): + def __init__(self, n_intervals): + self.min = sys.maxint + self.max = sys.minint + self.n_intervals = n_intervals + + def register(self, val): + self.min = min(self.minint, val) + self.max = max(self.maxint, val) + + def gen(self, val): + self.len_part = (self.max - self.min) / self.n_intervals + return (val - self.min) / self.len_part + + +# init all feature generators +fields = {} +for key in categorial_features: + fields[key] = CategoryFeatureGenerator() +for key in id_features: + # for cross features + if 'cross' in key: + feas = key[1:].split('_cross_') + fields[key] = IDfeatureGenerator(10000000, *feas) + # for normal ID features + else: + fields[key] = IDfeatureGenerator(10000) + +# used as feed_dict in PaddlePaddle +field_index = dict((key, id) + for id, key in enumerate(['dnn_input', 'lr_input', 'click'])) + + +def detect_dataset(path, topn, id_fea_space=10000): + ''' + Parse the first `topn` records to collect meta information of this dataset. + + NOTE the records should be randomly shuffled first. + ''' + # create categorical statis objects. + + with open(path, 'rb') as csvfile: + reader = csv.DictReader(csvfile) + for row_id, row in enumerate(reader): + if row_id > topn: + break + + for key in categorial_features: + fields[key].register(row[key]) + + for key, item in fields.items(): + feature_dims[key] = item.size() + + #for key in id_features: + #feature_dims[key] = id_fea_space + + feature_dims['hour'] = 24 + feature_dims['click'] = 1 + + feature_dims['dnn_input'] = np.sum( + feature_dims[key] for key in categorial_features + ['hour']) + 1 + feature_dims['lr_input'] = np.sum(feature_dims[key] + for key in id_features) + 1 + + return feature_dims + + +def concat_sparse_vectors(inputs, dims): + ''' + Concaterate more than one sparse vectors into one. + + @inputs: list + list of sparse vector + @dims: list of int + dimention of each sparse vector + ''' + res = [] + assert len(inputs) == len(dims) + start = 0 + for no, vec in enumerate(inputs): + for v in vec: + res.append(v + start) + start += dims[no] + return res + + +class AvazuDataset(object): + ''' + Load AVAZU dataset as train set. 
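+
+    If `n_records_as_test` is positive, the first `n_records_as_test` records
+    of the file are reserved as the test set and skipped by the train reader;
+    this assumes the records are already randomly shuffled.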
+ ''' + TRAIN_MODE = 0 + TEST_MODE = 1 + + def __init__(self, train_path, n_records_as_test=-1): + self.train_path = train_path + self.n_records_as_test = n_records_as_test + # task model: 0 train, 1 test + self.mode = 0 + + def train(self): + self.mode = self.TRAIN_MODE + return self._parse(self.train_path, skip_n_lines=self.n_records_as_test) + + def test(self): + self.mode = self.TEST_MODE + return self._parse(self.train_path, top_n_lines=self.n_records_as_test) + + def _parse(self, path, skip_n_lines=-1, top_n_lines=-1): + with open(path, 'rb') as csvfile: + reader = csv.DictReader(csvfile) + + categorial_dims = [ + feature_dims[key] for key in categorial_features + ['hour'] + ] + id_dims = [feature_dims[key] for key in id_features] + + for row_id, row in enumerate(reader): + if skip_n_lines > 0 and row_id < skip_n_lines: + continue + if top_n_lines > 0 and row_id > top_n_lines: + break + + record = [] + for key in categorial_features: + record.append(fields[key].gen(row[key])) + record.append([int(row['hour'][-2:])]) + dense_input = concat_sparse_vectors(record, categorial_dims) + + record = [] + for key in id_features: + if 'cross' not in key: + record.append(fields[key].gen(row[key])) + else: + fea0 = fields[key].cross_fea0 + fea1 = fields[key].cross_fea1 + record.append( + fields[key].gen_cross_fea(row[fea0], row[fea1])) + + sparse_input = concat_sparse_vectors(record, id_dims) + + record = [dense_input, sparse_input] + + record.append(list((int(row['click']), ))) + yield record + + +if __name__ == '__main__': + path = 'train.txt' + print detect_dataset(path, 400000) + + filereader = AvazuDataset(path) + for no, rcd in enumerate(filereader.train()): + print no, rcd + if no > 1000: break diff --git a/ctr/dataset.md b/ctr/dataset.md new file mode 100644 index 0000000000000000000000000000000000000000..dd6443d56adaf548d6c39458900c711c7f274def --- /dev/null +++ b/ctr/dataset.md @@ -0,0 +1,289 @@ +# 数据及处理 +## 数据集介绍 + +数据集使用 `csv` 格式存储,其中各个字段内容如下: + +- `id` : ad identifier +- `click` : 0/1 for non-click/click +- `hour` : format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC. +- `C1` : anonymized categorical variable +- `banner_pos` +- `site_id` +- `site_domain` +- `site_category` +- `app_id` +- `app_domain` +- `app_category` +- `device_id` +- `device_ip` +- `device_model` +- `device_type` +- `device_conn_type` +- `C14-C21` : anonymized categorical variables + + +## 特征提取 + +下面我们会简单演示几种特征的提取方式。 + +原始数据中的特征可以分为以下几类: + +1. ID 类特征(稀疏,数量多) +- `id` +- `site_id` +- `app_id` +- `device_id` + +2. 类别类特征(稀疏,但数量有限) + +- `C1` +- `site_category` +- `device_type` +- `C14-C21` + +3. 数值型特征转化为类别型特征 + +- hour (可以转化成数值,也可以按小时为单位转化为类别) + +### 类别类特征 + +类别类特征的提取方法有以下两种: + +1. One-hot 表示作为特征 +2. 类似词向量,用一个 Embedding 将每个类别映射到对应的向量 + + +### ID 类特征 + +ID 类特征的特点是稀疏数据,但量比较大,直接使用 One-hot 表示时维度过大。 + +一般会作如下处理: + +1. 确定表示的最大维度 N +2. newid = id % N +3. 用 newid 作为类别类特征使用 + +上面的方法尽管存在一定的碰撞概率,但能够处理任意数量的 ID 特征,并保留一定的效果\[[2](#参考文献)\]。 + +### 数值型特征 + +一般会做如下处理: + +- 归一化,直接作为特征输入模型 +- 用区间分割处理成类别类特征,稀疏化表示,模糊细微上的差别 + +## 特征处理 + + +### 类别型特征 + +类别型特征有有限多种值,在模型中,我们一般使用 Embedding将每种值映射为连续值的向量。 + +这种特征在输入到模型时,一般使用 One-hot 表示,相关处理方法如下: + +```python +class CategoryFeatureGenerator(object): + ''' + Generator category features. + + Register all records by calling ~register~ first, then call ~gen~ to generate + one-hot representation for a record. + ''' + + def __init__(self): + self.dic = {'unk': 0} + self.counter = 1 + + def register(self, key): + ''' + Register record. 
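+
+        Each previously unseen key is assigned the next integer id;
+        id 0 is reserved for the special 'unk' (unknown) category.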
+ ''' + if key not in self.dic: + self.dic[key] = self.counter + self.counter += 1 + + def size(self): + return len(self.dic) + + def gen(self, key): + ''' + Generate one-hot representation for a record. + ''' + if key not in self.dic: + res = self.dic['unk'] + else: + res = self.dic[key] + return [res] + + def __repr__(self): + return '' % len(self.dic) +``` + +`CategoryFeatureGenerator` 需要先扫描数据集,得到该类别对应的项集合,之后才能开始生成特征。 + +我们的实验数据集\[[3](https://www.kaggle.com/c/avazu-ctr-prediction/data)\]已经经过shuffle,可以扫描前面一定数目的记录来近似总的类别项集合(等价于随机抽样), +对于没有抽样上的低频类别项,可以用一个 UNK 的特殊值表示。 + +```python +fields = {} +for key in categorial_features: + fields[key] = CategoryFeatureGenerator() + +def detect_dataset(path, topn, id_fea_space=10000): + ''' + Parse the first `topn` records to collect meta information of this dataset. + + NOTE the records should be randomly shuffled first. + ''' + # create categorical statis objects. + + with open(path, 'rb') as csvfile: + reader = csv.DictReader(csvfile) + for row_id, row in enumerate(reader): + if row_id > topn: + break + + for key in categorial_features: + fields[key].register(row[key]) +``` + +`CategoryFeatureGenerator` 在注册得到数据集中对应类别信息后,可以对相应记录生成对应的特征表示: + +```python +record = [] +for key in categorial_features: + record.append(fields[key].gen(row[key])) +``` + +本任务中,类别类特征会输入到 DNN 中使用。 + +### ID 类特征 + +ID 类特征代稀疏值,且值的空间很大的情况,一般用模操作规约到一个有限空间, +之后可以当成类别类特征使用,这里我们会将 ID 类特征输入到 LR 模型中使用。 + +```python +class IDfeatureGenerator(object): + def __init__(self, max_dim): + ''' + @max_dim: int + Size of the id elements' space + ''' + self.max_dim = max_dim + + def gen(self, key): + ''' + Generate one-hot representation for records + ''' + return [hash(key) % self.max_dim] + + def size(self): + return self.max_dim +``` + +`IDfeatureGenerator` 不需要预先初始化,可以直接生成特征,比如 + +```python +record = [] +for key in id_features: + if 'cross' not in key: + record.append(fields[key].gen(row[key])) +``` + +### 交叉类特征 + +LR 模型作为 Wide & Deep model 的 `wide` 部分,可以输入很 wide 的数据(特征空间的维度很大), +为了充分利用这个优势,我们将演示交叉组合特征构建成更大维度特征的情况,之后塞入到模型中训练。 + +这里我们依旧使用模操作来约束最终组合出的特征空间的大小,具体实现是直接在 `IDfeatureGenerator` 中添加一个 `gen_cross_feature` 的方法: + +```python +def gen_cross_fea(self, fea1, fea2): + key = str(fea1) + str(fea2) + return self.gen(key) +``` + +比如,我们觉得原始数据中, `device_id` 和 `site_id` 有一些关联(比如某个 device 倾向于浏览特定 site), +我们通过组合出两者组合来捕捉这类信息。 + +```python +fea0 = fields[key].cross_fea0 +fea1 = fields[key].cross_fea1 +record.append( + fields[key].gen_cross_fea(row[fea0], row[fea1])) +``` + +### 特征维度 +#### Deep submodel(DNN)特征 +| feature | dimention | +|------------------|-----------| +| app_category | 21 | +| site_category | 22 | +| device_conn_type | 5 | +| hour | 24 | +| banner_pos | 7 | +| **Total** | 79 | + +#### Wide submodel(LR)特征 +| Feature | Dimention | +|---------------------|-----------| +| id | 10000 | +| site_id | 10000 | +| app_id | 10000 | +| device_id | 10000 | +| device_id X site_id | 1000000 | +| **Total** | 1,040,000 | + +## 输入到 PaddlePaddle 中 + +Deep 和 Wide 两部分均以 `sparse_binary_vector` 的格式 \[[1](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/api/v1/data_provider/pydataprovider2_en.rst)\] 输入,输入前需要将相关特征拼合,模型最终只接受 3 个 input, +分别是 + +1. `dnn input` ,DNN 的输入 +2. `lr input` , LR 的输入 +3. 
`click` , 标签 + +拼合特征的方法: + +```python +def concat_sparse_vectors(inputs, dims): + ''' + concaterate sparse vectors into one + + @inputs: list + list of sparse vector + @dims: list of int + dimention of each sparse vector + ''' + res = [] + assert len(inputs) == len(dims) + start = 0 + for no, vec in enumerate(inputs): + for v in vec: + res.append(v + start) + start += dims[no] + return res +``` + +生成最终特征的代码如下: + +```python +# dimentions of the features +categorial_dims = [ + feature_dims[key] for key in categorial_features + ['hour'] +] +id_dims = [feature_dims[key] for key in id_features] + +dense_input = concat_sparse_vectors(record, categorial_dims) +sparse_input = concat_sparse_vectors(record, id_dims) + +record = [dense_input, sparse_input] +record.append(list((int(row['click']), ))) +yield record +``` + +## 参考文献 + +1. +2. Mikolov T, Deoras A, Povey D, et al. [Strategies for training large scale neural network language models](https://www.researchgate.net/profile/Lukas_Burget/publication/241637478_Strategies_for_training_large_scale_neural_network_language_models/links/542c14960cf27e39fa922ed3.pdf)[C]//Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. IEEE, 2011: 196-201. +3. diff --git a/ctr/images/lr_vs_dnn.jpg b/ctr/images/lr_vs_dnn.jpg new file mode 100644 index 0000000000000000000000000000000000000000..50a0db583cd9b6e1a5bc0f83a28ab6e22d649931 Binary files /dev/null and b/ctr/images/lr_vs_dnn.jpg differ diff --git a/ctr/images/wide_deep.png b/ctr/images/wide_deep.png new file mode 100644 index 0000000000000000000000000000000000000000..03c4afcfc6cea0b5abf4c4554ecf9810843e75e2 Binary files /dev/null and b/ctr/images/wide_deep.png differ diff --git a/ctr/train.py b/ctr/train.py new file mode 100644 index 0000000000000000000000000000000000000000..da6dc9dd6d9e386a87693b5a5bc0cbf95da0b069 --- /dev/null +++ b/ctr/train.py @@ -0,0 +1,138 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import argparse +import logging +import paddle.v2 as paddle +from paddle.v2 import layer +from paddle.v2 import data_type as dtype +from data_provider import field_index, detect_dataset, AvazuDataset + +parser = argparse.ArgumentParser(description="PaddlePaddle CTR example") +parser.add_argument( + '--train_data_path', + type=str, + required=True, + help="path of training dataset") +parser.add_argument( + '--batch_size', + type=int, + default=10000, + help="size of mini-batch (default:10000)") +parser.add_argument( + '--test_set_size', + type=int, + default=10000, + help="size of the validation dataset(default: 10000)") +parser.add_argument( + '--num_passes', type=int, default=10, help="number of passes to train") +parser.add_argument( + '--num_lines_to_detact', + type=int, + default=500000, + help="number of records to detect dataset's meta info") + +args = parser.parse_args() + +dnn_layer_dims = [128, 64, 32, 1] +data_meta_info = detect_dataset(args.train_data_path, args.num_lines_to_detact) + +logging.warning('detect categorical fields in dataset %s' % + args.train_data_path) +for key, item in data_meta_info.items(): + logging.warning(' - {}\t{}'.format(key, item)) + +paddle.init(use_gpu=False, trainer_count=1) + +# ============================================================================== +# input layers +# ============================================================================== +dnn_merged_input = layer.data( + name='dnn_input', + type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input'])) + +lr_merged_input = layer.data( + name='lr_input', + 
type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input'])) + +click = paddle.layer.data(name='click', type=dtype.dense_vector(1)) + + +# ============================================================================== +# network structure +# ============================================================================== +def build_dnn_submodel(dnn_layer_dims): + dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0]) + _input_layer = dnn_embedding + for i, dim in enumerate(dnn_layer_dims[1:]): + fc = layer.fc( + input=_input_layer, + size=dim, + act=paddle.activation.Relu(), + name='dnn-fc-%d' % i) + _input_layer = fc + return _input_layer + + +# config LR submodel +def build_lr_submodel(): + fc = layer.fc( + input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu()) + return fc + + +# conbine DNN and LR submodels +def combine_submodels(dnn, lr): + merge_layer = layer.concat(input=[dnn, lr]) + fc = layer.fc( + input=merge_layer, + size=1, + name='output', + # use sigmoid function to approximate ctr rate, a float value between 0 and 1. + act=paddle.activation.Sigmoid()) + return fc + + +dnn = build_dnn_submodel(dnn_layer_dims) +lr = build_lr_submodel() +output = combine_submodels(dnn, lr) + +# ============================================================================== +# cost and train period +# ============================================================================== +classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost( + input=output, label=click) + +params = paddle.parameters.create(classification_cost) + +optimizer = paddle.optimizer.Momentum(momentum=0.01) + +trainer = paddle.trainer.SGD( + cost=classification_cost, parameters=params, update_equation=optimizer) + +dataset = AvazuDataset( + args.train_data_path, n_records_as_test=args.test_set_size) + + +def event_handler(event): + if isinstance(event, paddle.event.EndIteration): + num_samples = event.batch_id * args.batch_size + if event.batch_id % 100 == 0: + logging.warning("Pass %d, Samples %d, Cost %f" % + (event.pass_id, num_samples, event.cost)) + + if event.batch_id % 1000 == 0: + result = trainer.test( + reader=paddle.batch(dataset.test, batch_size=args.batch_size), + feeding=field_index) + logging.warning("Test %d-%d, Cost %f" % + (event.pass_id, event.batch_id, result.cost)) + + +trainer.train( + reader=paddle.batch( + paddle.reader.shuffle(dataset.train, buf_size=500), + batch_size=args.batch_size), + feeding=field_index, + event_handler=event_handler, + num_passes=args.num_passes)
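+
+# --- Optional inference sketch (not part of the original script) ------------
+# Assuming the paddle.v2 `paddle.infer` API, the trained `params` and the
+# `output` layer above can be reused to predict CTR values for a few records.
+# Each record yielded by `dataset.test()` is [dnn_input, lr_input, click];
+# only the first two fields are fed to the network.
+infer_samples = []
+for rcd in dataset.test():
+    infer_samples.append(rcd[:2])  # drop the click label
+    if len(infer_samples) >= 10:
+        break
+
+probs = paddle.infer(
+    output_layer=output,
+    parameters=params,
+    input=infer_samples,
+    feeding={'dnn_input': 0, 'lr_input': 1})
+
+for i, prob in enumerate(probs):
+    logging.warning("sample %d, predicted CTR %f" % (i, prob[0]))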