diff --git a/ctr/README.md b/ctr/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1cd8c369857b096bb88f463201e4226d6dc16dbd
--- /dev/null
+++ b/ctr/README.md
@@ -0,0 +1,236 @@
+# CTR Prediction
+
+## Background
+
+CTR (Click-Through Rate)\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\] is the probability that a user clicks on a specific link,
+and is commonly used to measure the effectiveness of an online advertising system.
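+In practice it is usually estimated over a time window as \(CTR = \frac{\#clicks}{\#impressions}\).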
+
+When there are multiple ad slots, the predicted CTR is usually used as the basis for ranking.
+For example, in a search-engine advertising system, when a user issues a commercially valuable query, the system roughly performs the following steps to show ads:
+
+1. Retrieve the set of ads that match the query
+2. Filter by business rules and relevance
+3. Rank by the auction mechanism and CTR
+4. Display the ads
+
+As we can see, CTR plays an important role in the final ranking.
+
+### Development stages
+In industry, CTR models have gone through the following stages:
+
+- Logistic Regression (LR) / GBDT + feature engineering
+- LR + DNN features
+- DNN + feature engineering
+
+In the early days LR dominated, but more recently DNN models, with their strong learning capacity and increasingly mature performance optimizations,
+have gradually taken over the CTR prediction task.
+
+
+### LR vs DNN
+
+The figure below compares the structure of LR with that of a \(3 \times 2\) DNN model:
+
+![LR vs DNN](images/lr_vs_dnn.jpg)
+
+Figure 1. Comparison of the LR and DNN model structures
+
+
+The blue arrows in the LR diagram map directly onto corresponding structures in the DNN, so LR and DNN share some common machinery (such as weighted accumulation),
+but at the same input dimensionality the former's model complexity can be much lower than the latter's (roughly speaking, a more complex model has more capacity to learn complex patterns).
+
+For LR to match the learning capacity of a DNN, it has to increase the input dimensionality, i.e. the number of features,
+which is why LR is always coupled with large-scale feature engineering.
+
+LR's advantage over DNN models is its capacity to accommodate large-scale sparse features: in terms of both memory and computation, industry already has very mature optimizations for it.
+
+A DNN model, on the other hand, can learn new features by itself, which improves feature efficiency to some extent,
+so with the same set of features a DNN is more likely to achieve a better result.
+
+The following sections demonstrate how to use PaddlePaddle to build a model that combines the strengths of both.
+
+
+## Data and task abstraction
+
+Taking `click` as the learning target, the task can be formulated in several ways:
+
+1. Learn `click` directly as a binary (0/1) classification problem
+2. Learning to rank, specifically pairwise rank (label 1 > 0) or listwise rank
+3. Estimate each ad's CTR, pair up ads under the same query with "higher CTR > lower CTR", and train a ranking or classification model on the pairs
+
+Here we simply take the first approach and treat it as a classification task.
+
+We use the dataset of the Kaggle `Click-through rate prediction` competition\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model.
+
+See [data process](./dataset.md) for the details of feature processing.
+
+
+## Wide & Deep Learning Model
+
+In 2016 Google proposed the Wide & Deep Learning framework, which combines the strengths of DNNs, good at learning abstract features, and LR, good at handling large-scale sparse features.
+
+
+### Model overview
+
+The Wide & Deep Learning Model\[[3](#references)\] is a relatively mature model framework that has also seen some industrial use for CTR prediction,
+so this article uses it to build the CTR prediction task.
+
+The model structure is shown below:
+
+![Wide & Deep Model](images/wide_deep.png)
+
+Figure 2. Wide & Deep Model
+
+
+The Wide part on the left can accommodate large-scale sparse features and has some ability to memorize specific signals (such as IDs);
+the Deep part on the right learns implicit interactions between features, giving it better learning and inference ability for the same number of features.
+
+
+### Defining the model inputs
+
+The model accepts only 3 inputs:
+
+- `dnn_input`, the input of the Deep part
+- `lr_input`, the input of the Wide part
+- `click`, whether the ad was clicked, used as the label for binary classification
+
+```python
+dnn_merged_input = layer.data(
+ name='dnn_input',
+ type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))
+
+lr_merged_input = layer.data(
+ name='lr_input',
+ type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))
+
+click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
+```
+
+### Building the Wide part
+
+The Wide part uses the LR model directly, except that the activation function is changed to `RELU` to speed up training.
+
+```python
+def build_lr_submodel():
+ fc = layer.fc(
+ input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
+ return fc
+```
+
+### Building the Deep part
+
+The Deep part is a standard multi-layer feed-forward DNN.
+
+```python
+def build_dnn_submodel(dnn_layer_dims):
+ dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
+ _input_layer = dnn_embedding
+ for i, dim in enumerate(dnn_layer_dims[1:]):
+ fc = layer.fc(
+ input=_input_layer,
+ size=dim,
+ act=paddle.activation.Relu(),
+ name='dnn-fc-%d' % i)
+ _input_layer = fc
+ return _input_layer
+```
+
+### Combining the two parts
+
+The top-level outputs of the two submodels are merged with a weighted sum; the output layer uses `sigmoid` as the activation function to produce a prediction in the interval (0,1),
+which approximates the distribution of the binary labels in the training data and is used as the final CTR estimate.
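+This corresponds to the combined form used in the Wide & Deep paper\[[3](#references)\], roughly \(P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}_{wide}^T[\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^T a^{(l_f)} + b)\), where \(\phi(\mathbf{x})\) denotes the cross features and \(a^{(l_f)}\) the last hidden layer of the Deep part.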
+
+```python
+# combine DNN and LR submodels
+def combine_submodels(dnn, lr):
+ merge_layer = layer.concat(input=[dnn, lr])
+ fc = layer.fc(
+ input=merge_layer,
+ size=1,
+ name='output',
+        # use the sigmoid function to approximate the CTR, which is a float value between 0 and 1.
+ act=paddle.activation.Sigmoid())
+ return fc
+```
+
+### Defining the training task
+```python
+dnn = build_dnn_submodel(dnn_layer_dims)
+lr = build_lr_submodel()
+output = combine_submodels(dnn, lr)
+
+# ==============================================================================
+# cost and training process
+# ==============================================================================
+classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
+ input=output, label=click)
+
+
+paddle.init(use_gpu=False, trainer_count=11)
+
+params = paddle.parameters.create(classification_cost)
+
+optimizer = paddle.optimizer.Momentum(momentum=0)
+
+trainer = paddle.trainer.SGD(
+ cost=classification_cost, parameters=params, update_equation=optimizer)
+
+dataset = AvazuDataset(train_data_path, n_records_as_test=test_set_size)
+
+def event_handler(event):
+ if isinstance(event, paddle.event.EndIteration):
+ if event.batch_id % 100 == 0:
+ logging.warning("Pass %d, Samples %d, Cost %f" % (
+ event.pass_id, event.batch_id * batch_size, event.cost))
+
+ if event.batch_id % 1000 == 0:
+ result = trainer.test(
+ reader=paddle.batch(dataset.test, batch_size=1000),
+ feeding=field_index)
+ logging.warning("Test %d-%d, Cost %f" % (event.pass_id, event.batch_id,
+ result.cost))
+
+
+trainer.train(
+ reader=paddle.batch(
+ paddle.reader.shuffle(dataset.train, buf_size=500),
+ batch_size=batch_size),
+ feeding=field_index,
+ event_handler=event_handler,
+ num_passes=100)
+```
+## Running training and testing
+Training the model takes the following steps:
+
+1. Download the training data, e.g. the data of the Kaggle CTR competition\[[2](#references)\]
+    1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data)
+    2. Decompress train.gz to get train.txt
+2. Run `python train.py --train_data_path train.txt` to start training
+
+In step 2, command-line arguments can be passed to `train.py` to customize the training process; the available arguments and their usage are listed below:
+
+```
+usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
+ [--batch_size BATCH_SIZE] [--test_set_size TEST_SET_SIZE]
+ [--num_passes NUM_PASSES]
+ [--num_lines_to_detact NUM_LINES_TO_DETACT]
+
+PaddlePaddle CTR example
+
+optional arguments:
+ -h, --help show this help message and exit
+ --train_data_path TRAIN_DATA_PATH
+ path of training dataset
+ --batch_size BATCH_SIZE
+ size of mini-batch (default:10000)
+ --test_set_size TEST_SET_SIZE
+ size of the validation dataset(default: 10000)
+ --num_passes NUM_PASSES
+ number of passes to train
+ --num_lines_to_detact NUM_LINES_TO_DETACT
+ number of records to detect dataset's meta info
+```
+
+## References
+1. <https://en.wikipedia.org/wiki/Click-through_rate>
+2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
+3. Cheng H T, Koc L, Harmsen J, et al. [Wide & deep learning for recommender systems](https://arxiv.org/pdf/1606.07792.pdf)[C]//Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016: 7-10.
diff --git a/ctr/data_provider.py b/ctr/data_provider.py
new file mode 100644
index 0000000000000000000000000000000000000000..f02d3d33e75163cf772921ef54729a3fc8da022b
--- /dev/null
+++ b/ctr/data_provider.py
@@ -0,0 +1,277 @@
+import sys
+import csv
+import numpy as np
+'''
+The fields of the dataset are:
+
+ 0. id: ad identifier
+ 1. click: 0/1 for non-click/click
+ 2. hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
+ 3. C1 -- anonymized categorical variable
+ 4. banner_pos
+ 5. site_id
+ 6. site_domain
+ 7. site_category
+ 8. app_id
+ 9. app_domain
+ 10. app_category
+ 11. device_id
+ 12. device_ip
+ 13. device_model
+ 14. device_type
+ 15. device_conn_type
+ 16. C14-C21 -- anonymized categorical variables
+
+We will treat following fields as categorical features:
+
+ - C1
+ - banner_pos
+ - site_category
+ - app_category
+ - device_type
+ - device_conn_type
+
+and some other features as id features:
+
+ - id
+ - site_id
+ - app_id
+ - device_id
+
+The `hour` field is reduced to the hour of the day and treated as a categorical
+feature with 24 possible values (i.e. a 24-bit one-hot representation).
+'''
+
+feature_dims = {}
+
+categorial_features = ('C1 banner_pos site_category app_category ' +
+ 'device_type device_conn_type').split()
+
+id_features = 'id site_id app_id device_id _device_id_cross_site_id'.split()
+
+
+def get_all_field_names(mode=0):
+ '''
+ @mode: int
+ 0 for train, 1 for test
+ @return: list of str
+ '''
+    return (categorial_features + ['hour'] + id_features + ['click']) \
+        if mode == 0 else []
+
+
+class CategoryFeatureGenerator(object):
+ '''
+    Generates category features.
+
+ Register all records by calling `register` first, then call `gen` to generate
+ one-hot representation for a record.
+ '''
+
+ def __init__(self):
+ self.dic = {'unk': 0}
+ self.counter = 1
+
+ def register(self, key):
+ '''
+ Register record.
+ '''
+ if key not in self.dic:
+ self.dic[key] = self.counter
+ self.counter += 1
+
+ def size(self):
+ return len(self.dic)
+
+ def gen(self, key):
+ '''
+ Generate one-hot representation for a record.
+ '''
+ if key not in self.dic:
+ res = self.dic['unk']
+ else:
+ res = self.dic[key]
+ return [res]
+
+ def __repr__(self):
+        return '<CategoryFeatureGenerator %d>' % len(self.dic)
+
+
+class IDfeatureGenerator(object):
+ def __init__(self, max_dim, cross_fea0=None, cross_fea1=None):
+ '''
+ @max_dim: int
+ Size of the id elements' space
+ '''
+ self.max_dim = max_dim
+ self.cross_fea0 = cross_fea0
+ self.cross_fea1 = cross_fea1
+
+ def gen(self, key):
+ '''
+ Generate one-hot representation for records
+ '''
+ return [hash(key) % self.max_dim]
+
+ def gen_cross_fea(self, fea1, fea2):
+ key = str(fea1) + str(fea2)
+ return self.gen(key)
+
+ def size(self):
+ return self.max_dim
+
+
+class ContinuousFeatureGenerator(object):
+    '''
+    Bucketize a continuous value into one of `n_intervals` intervals.
+
+    Call `register` over the dataset first to learn the value range, then call
+    `gen` to map a value to the id of the interval it falls into.
+    '''
+
+    def __init__(self, n_intervals):
+        self.min = sys.maxint
+        self.max = -sys.maxint - 1
+        self.n_intervals = n_intervals
+
+    def register(self, val):
+        self.min = min(self.min, val)
+        self.max = max(self.max, val)
+
+    def gen(self, val):
+        self.len_part = (self.max - self.min) / self.n_intervals
+        return (val - self.min) / self.len_part
+
+
+# init all feature generators
+fields = {}
+for key in categorial_features:
+ fields[key] = CategoryFeatureGenerator()
+for key in id_features:
+ # for cross features
+ if 'cross' in key:
+ feas = key[1:].split('_cross_')
+ fields[key] = IDfeatureGenerator(10000000, *feas)
+ # for normal ID features
+ else:
+ fields[key] = IDfeatureGenerator(10000)
+
+# used as feed_dict in PaddlePaddle
+field_index = dict((key, id)
+ for id, key in enumerate(['dnn_input', 'lr_input', 'click']))
+
+
+def detect_dataset(path, topn, id_fea_space=10000):
+ '''
+ Parse the first `topn` records to collect meta information of this dataset.
+
+ NOTE the records should be randomly shuffled first.
+ '''
+    # create categorical statistics objects.
+
+ with open(path, 'rb') as csvfile:
+ reader = csv.DictReader(csvfile)
+ for row_id, row in enumerate(reader):
+ if row_id > topn:
+ break
+
+ for key in categorial_features:
+ fields[key].register(row[key])
+
+ for key, item in fields.items():
+ feature_dims[key] = item.size()
+
+ #for key in id_features:
+ #feature_dims[key] = id_fea_space
+
+ feature_dims['hour'] = 24
+ feature_dims['click'] = 1
+
+ feature_dims['dnn_input'] = np.sum(
+ feature_dims[key] for key in categorial_features + ['hour']) + 1
+ feature_dims['lr_input'] = np.sum(feature_dims[key]
+ for key in id_features) + 1
+
+ return feature_dims
+
+
+def concat_sparse_vectors(inputs, dims):
+ '''
+    Concatenate more than one sparse vector into one.
+
+    @inputs: list
+        list of sparse vectors
+    @dims: list of int
+        dimension of each sparse vector
+ '''
+ res = []
+ assert len(inputs) == len(dims)
+ start = 0
+ for no, vec in enumerate(inputs):
+ for v in vec:
+ res.append(v + start)
+ start += dims[no]
+ return res
+
+
+class AvazuDataset(object):
+ '''
+ Load AVAZU dataset as train set.
+ '''
+ TRAIN_MODE = 0
+ TEST_MODE = 1
+
+ def __init__(self, train_path, n_records_as_test=-1):
+ self.train_path = train_path
+ self.n_records_as_test = n_records_as_test
+        # task mode: 0 for train, 1 for test
+ self.mode = 0
+
+ def train(self):
+ self.mode = self.TRAIN_MODE
+ return self._parse(self.train_path, skip_n_lines=self.n_records_as_test)
+
+ def test(self):
+ self.mode = self.TEST_MODE
+ return self._parse(self.train_path, top_n_lines=self.n_records_as_test)
+
+ def _parse(self, path, skip_n_lines=-1, top_n_lines=-1):
+ with open(path, 'rb') as csvfile:
+ reader = csv.DictReader(csvfile)
+
+ categorial_dims = [
+ feature_dims[key] for key in categorial_features + ['hour']
+ ]
+ id_dims = [feature_dims[key] for key in id_features]
+
+ for row_id, row in enumerate(reader):
+ if skip_n_lines > 0 and row_id < skip_n_lines:
+ continue
+ if top_n_lines > 0 and row_id > top_n_lines:
+ break
+
+ record = []
+ for key in categorial_features:
+ record.append(fields[key].gen(row[key]))
+ record.append([int(row['hour'][-2:])])
+ dense_input = concat_sparse_vectors(record, categorial_dims)
+
+ record = []
+ for key in id_features:
+ if 'cross' not in key:
+ record.append(fields[key].gen(row[key]))
+ else:
+ fea0 = fields[key].cross_fea0
+ fea1 = fields[key].cross_fea1
+ record.append(
+ fields[key].gen_cross_fea(row[fea0], row[fea1]))
+
+ sparse_input = concat_sparse_vectors(record, id_dims)
+
+ record = [dense_input, sparse_input]
+
+ record.append(list((int(row['click']), )))
+ yield record
+
+
+if __name__ == '__main__':
+ path = 'train.txt'
+ print detect_dataset(path, 400000)
+
+ filereader = AvazuDataset(path)
+ for no, rcd in enumerate(filereader.train()):
+ print no, rcd
+ if no > 1000: break
diff --git a/ctr/dataset.md b/ctr/dataset.md
new file mode 100644
index 0000000000000000000000000000000000000000..dd6443d56adaf548d6c39458900c711c7f274def
--- /dev/null
+++ b/ctr/dataset.md
@@ -0,0 +1,289 @@
+# Data and preprocessing
+## Dataset overview
+
+The dataset is stored in `csv` format; its fields are as follows:
+
+- `id` : ad identifier
+- `click` : 0/1 for non-click/click
+- `hour` : format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
+- `C1` : anonymized categorical variable
+- `banner_pos`
+- `site_id`
+- `site_domain`
+- `site_category`
+- `app_id`
+- `app_domain`
+- `app_category`
+- `device_id`
+- `device_ip`
+- `device_model`
+- `device_type`
+- `device_conn_type`
+- `C14-C21` : anonymized categorical variables
+
+
+## Feature extraction
+
+Below we briefly demonstrate several ways of extracting features.
+
+The features in the raw data fall into the following categories:
+
+1. ID features (sparse, very many distinct values)
+
+- `id`
+- `site_id`
+- `app_id`
+- `device_id`
+
+2. Categorical features (sparse, but with a limited number of values)
+
+- `C1`
+- `site_category`
+- `device_type`
+- `C14-C21`
+
+3. Numerical features converted to categorical features
+
+- hour (can be used as a numerical value, or converted to a categorical feature at hour granularity)
+
+### Categorical features
+
+Categorical features can be used in two ways (a small sketch of both follows the list):
+
+1. A one-hot representation used directly as the feature
+2. As with word vectors, an embedding that maps each category to a vector
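+
+As an illustration only (not part of the project code, with a made-up vocabulary and an untrained, randomly initialized embedding table), the two options look roughly like this:
+
+```python
+import numpy as np
+
+categories = ['pos_0', 'pos_1', 'pos_2']          # category vocabulary
+index = {c: i for i, c in enumerate(categories)}  # category -> integer id
+
+
+def one_hot(value):
+    # option 1: one-hot vector whose dimension is the number of categories
+    vec = np.zeros(len(categories))
+    vec[index[value]] = 1.0
+    return vec
+
+
+# option 2: an embedding lookup table; the embedding size (4 here) is a free choice
+embedding_table = np.random.rand(len(categories), 4)
+
+
+def embed(value):
+    return embedding_table[index[value]]
+
+
+print(one_hot('pos_1'))  # one-hot vector with a 1 at index 1
+print(embed('pos_1'))    # a 4-dimensional dense vector
+```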
+
+
+### ID features
+
+ID features are sparse, but there are so many distinct values that a direct one-hot representation would have far too many dimensions.
+
+They are usually handled as follows (see the sketch after the list):
+
+1. Choose a maximum dimensionality N for the representation
+2. newid = id % N
+3. Use newid as a categorical feature
+
+Although this introduces some probability of collisions, it can handle an arbitrary number of ID values and still preserve much of their predictive value\[[2](#references)\].
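+
+A minimal sketch of this trick (illustrative only; the project's `IDfeatureGenerator` shown below does the same thing with Python's built-in `hash`):
+
+```python
+N = 10000  # assumed maximum dimensionality of the ID space
+
+
+def reduce_id(raw_id, max_dim=N):
+    # map an arbitrary id string into [0, max_dim), accepting some collisions
+    return hash(raw_id) % max_dim
+
+
+print(reduce_id('some_device_id'))  # an index in [0, 10000)
+```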
+
+### Numerical features
+
+They are usually handled in one of two ways (a bucketization sketch follows the list):
+
+- Normalize them and feed them into the model directly as continuous features
+- Split the value range into intervals and use the interval id as a sparse categorical feature, which blurs out small differences
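+
+A minimal bucketization sketch (illustrative only; the interval count of 10 is an arbitrary assumption, and the project's `ContinuousFeatureGenerator` in `data_provider.py` follows the same idea):
+
+```python
+def bucketize(val, lo, hi, n_intervals=10):
+    # map a continuous value in [lo, hi] to an interval id in [0, n_intervals)
+    width = (hi - lo) / float(n_intervals)
+    idx = int((val - lo) / width)
+    return min(idx, n_intervals - 1)  # clamp the maximum value into the last bucket
+
+
+print(bucketize(14.5, 0.0, 23.0))  # an hour-like value mapped to bucket 6
+```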
+
+## Feature processing
+
+
+### Categorical features
+
+A categorical feature takes a limited set of values; in the model, we usually use an embedding to map each value to a vector of continuous values.
+
+When fed into the model, such features are generally represented in one-hot form. The related processing is as follows:
+
+```python
+class CategoryFeatureGenerator(object):
+ '''
+    Generates category features.
+
+    Register all records by calling `register` first, then call `gen` to generate
+    the one-hot representation for a record.
+ '''
+
+ def __init__(self):
+ self.dic = {'unk': 0}
+ self.counter = 1
+
+ def register(self, key):
+ '''
+ Register record.
+ '''
+ if key not in self.dic:
+ self.dic[key] = self.counter
+ self.counter += 1
+
+ def size(self):
+ return len(self.dic)
+
+ def gen(self, key):
+ '''
+ Generate one-hot representation for a record.
+ '''
+ if key not in self.dic:
+ res = self.dic['unk']
+ else:
+ res = self.dic[key]
+ return [res]
+
+ def __repr__(self):
+        return '<CategoryFeatureGenerator %d>' % len(self.dic)
+```
+
+`CategoryFeatureGenerator` needs to scan the dataset first to collect the set of values of each category before it can generate features.
+
+Our experimental dataset\[[3](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] has already been shuffled, so scanning a certain number of leading records approximates the full set of category values (equivalent to random sampling);
+low-frequency values that are not sampled are represented with a special UNK value.
+
+```python
+fields = {}
+for key in categorial_features:
+ fields[key] = CategoryFeatureGenerator()
+
+def detect_dataset(path, topn, id_fea_space=10000):
+ '''
+ Parse the first `topn` records to collect meta information of this dataset.
+
+ NOTE the records should be randomly shuffled first.
+ '''
+    # create categorical statistics objects.
+
+ with open(path, 'rb') as csvfile:
+ reader = csv.DictReader(csvfile)
+ for row_id, row in enumerate(reader):
+ if row_id > topn:
+ break
+
+ for key in categorial_features:
+ fields[key].register(row[key])
+```
+
+After registering the category information found in the dataset, `CategoryFeatureGenerator` can generate the corresponding feature representation for a record:
+
+```python
+record = []
+for key in categorial_features:
+ record.append(fields[key].gen(row[key]))
+```
+
+In this task, the categorical features are fed into the DNN part.
+
+### ID features
+
+ID features are sparse and their value space is very large. They are usually reduced to a bounded space with a modulo operation
+and then used like categorical features; here the ID features are fed into the LR model.
+
+```python
+class IDfeatureGenerator(object):
+ def __init__(self, max_dim):
+ '''
+ @max_dim: int
+ Size of the id elements' space
+ '''
+ self.max_dim = max_dim
+
+ def gen(self, key):
+ '''
+ Generate one-hot representation for records
+ '''
+ return [hash(key) % self.max_dim]
+
+ def size(self):
+ return self.max_dim
+```
+
+`IDfeatureGenerator` needs no prior initialization and can generate features directly, for example:
+
+```python
+record = []
+for key in id_features:
+ if 'cross' not in key:
+ record.append(fields[key].gen(row[key]))
+```
+
+### Cross features
+
+As the `wide` part of the Wide & Deep model, the LR model can take very wide input (a feature space with a huge number of dimensions).
+To make full use of this, we demonstrate crossing features into combined features of even higher dimensionality, which are then fed into the model for training.
+
+We again use a modulo operation to bound the size of the resulting combined feature space; the implementation simply adds a `gen_cross_fea` method to `IDfeatureGenerator`:
+
+```python
+def gen_cross_fea(self, fea1, fea2):
+ key = str(fea1) + str(fea2)
+ return self.gen(key)
+```
+
+For example, if we believe that `device_id` and `site_id` in the raw data are correlated (say, a device tends to visit particular sites),
+we can capture that information by crossing the two:
+
+```python
+fea0 = fields[key].cross_fea0
+fea1 = fields[key].cross_fea1
+record.append(
+ fields[key].gen_cross_fea(row[fea0], row[fea1]))
+```
+
+### Feature dimensions
+#### Deep submodel (DNN) features
+| Feature          | Dimension |
+|------------------|-----------|
+| app_category | 21 |
+| site_category | 22 |
+| device_conn_type | 5 |
+| hour | 24 |
+| banner_pos | 7 |
+| **Total** | 79 |
+
+#### Wide submodel (LR) features
+| Feature             | Dimension |
+|---------------------|-----------|
+| id | 10000 |
+| site_id | 10000 |
+| app_id | 10000 |
+| device_id | 10000 |
+| device_id X site_id | 1000000 |
+| **Total** | 1,040,000 |
+
+## Feeding data into PaddlePaddle
+
+Both the Deep and the Wide part are fed in the `sparse_binary_vector` format \[[1](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/api/v1/data_provider/pydataprovider2_en.rst)\]. The related features have to be concatenated before being fed in, and the model accepts only 3 inputs,
+namely
+
+1. `dnn input`, the input of the DNN part
+2. `lr input`, the input of the LR part
+3. `click`, the label
+
+The features are concatenated as follows:
+
+```python
+def concat_sparse_vectors(inputs, dims):
+ '''
+    Concatenate sparse vectors into one.
+
+    @inputs: list
+        list of sparse vectors
+    @dims: list of int
+        dimension of each sparse vector
+ '''
+ res = []
+ assert len(inputs) == len(dims)
+ start = 0
+ for no, vec in enumerate(inputs):
+ for v in vec:
+ res.append(v + start)
+ start += dims[no]
+ return res
+```
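+
+For example (a hypothetical toy call, not taken from the project code), concatenating three sparse vectors with dimensions 3, 2 and 4 shifts every index by the total dimension of the vectors before it:
+
+```python
+print(concat_sparse_vectors([[1], [0], [2]], [3, 2, 4]))
+# [1, 3, 7]: index 0 of the second vector becomes 0 + 3, index 2 of the third becomes 2 + 3 + 2
+```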
+
+The code that generates the final features is as follows:
+
+```python
+# dimensions of the features
+categorial_dims = [
+ feature_dims[key] for key in categorial_features + ['hour']
+]
+id_dims = [feature_dims[key] for key in id_features]
+
+dense_input = concat_sparse_vectors(record, categorial_dims)
+sparse_input = concat_sparse_vectors(record, id_dims)
+
+record = [dense_input, sparse_input]
+record.append(list((int(row['click']), )))
+yield record
+```
+
+## References
+
+1. <https://github.com/PaddlePaddle/Paddle/blob/develop/doc/api/v1/data_provider/pydataprovider2_en.rst>
+2. Mikolov T, Deoras A, Povey D, et al. [Strategies for training large scale neural network language models](https://www.researchgate.net/profile/Lukas_Burget/publication/241637478_Strategies_for_training_large_scale_neural_network_language_models/links/542c14960cf27e39fa922ed3.pdf)[C]//Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. IEEE, 2011: 196-201.
+3. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
diff --git a/ctr/images/lr_vs_dnn.jpg b/ctr/images/lr_vs_dnn.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..50a0db583cd9b6e1a5bc0f83a28ab6e22d649931
Binary files /dev/null and b/ctr/images/lr_vs_dnn.jpg differ
diff --git a/ctr/images/wide_deep.png b/ctr/images/wide_deep.png
new file mode 100644
index 0000000000000000000000000000000000000000..03c4afcfc6cea0b5abf4c4554ecf9810843e75e2
Binary files /dev/null and b/ctr/images/wide_deep.png differ
diff --git a/ctr/train.py b/ctr/train.py
new file mode 100644
index 0000000000000000000000000000000000000000..da6dc9dd6d9e386a87693b5a5bc0cbf95da0b069
--- /dev/null
+++ b/ctr/train.py
@@ -0,0 +1,138 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+import argparse
+import logging
+import paddle.v2 as paddle
+from paddle.v2 import layer
+from paddle.v2 import data_type as dtype
+from data_provider import field_index, detect_dataset, AvazuDataset
+
+parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
+parser.add_argument(
+ '--train_data_path',
+ type=str,
+ required=True,
+ help="path of training dataset")
+parser.add_argument(
+ '--batch_size',
+ type=int,
+ default=10000,
+ help="size of mini-batch (default:10000)")
+parser.add_argument(
+ '--test_set_size',
+ type=int,
+ default=10000,
+ help="size of the validation dataset(default: 10000)")
+parser.add_argument(
+ '--num_passes', type=int, default=10, help="number of passes to train")
+parser.add_argument(
+ '--num_lines_to_detact',
+ type=int,
+ default=500000,
+ help="number of records to detect dataset's meta info")
+
+args = parser.parse_args()
+
+dnn_layer_dims = [128, 64, 32, 1]
+data_meta_info = detect_dataset(args.train_data_path, args.num_lines_to_detact)
+
+logging.warning('detect categorical fields in dataset %s' %
+ args.train_data_path)
+for key, item in data_meta_info.items():
+ logging.warning(' - {}\t{}'.format(key, item))
+
+paddle.init(use_gpu=False, trainer_count=1)
+
+# ==============================================================================
+# input layers
+# ==============================================================================
+dnn_merged_input = layer.data(
+ name='dnn_input',
+ type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))
+
+lr_merged_input = layer.data(
+ name='lr_input',
+ type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))
+
+click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
+
+
+# ==============================================================================
+# network structure
+# ==============================================================================
+def build_dnn_submodel(dnn_layer_dims):
+ dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
+ _input_layer = dnn_embedding
+ for i, dim in enumerate(dnn_layer_dims[1:]):
+ fc = layer.fc(
+ input=_input_layer,
+ size=dim,
+ act=paddle.activation.Relu(),
+ name='dnn-fc-%d' % i)
+ _input_layer = fc
+ return _input_layer
+
+
+# config LR submodel
+def build_lr_submodel():
+ fc = layer.fc(
+ input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
+ return fc
+
+
+# combine DNN and LR submodels
+def combine_submodels(dnn, lr):
+ merge_layer = layer.concat(input=[dnn, lr])
+ fc = layer.fc(
+ input=merge_layer,
+ size=1,
+ name='output',
+        # use the sigmoid function to approximate the CTR, a float value between 0 and 1.
+ act=paddle.activation.Sigmoid())
+ return fc
+
+
+dnn = build_dnn_submodel(dnn_layer_dims)
+lr = build_lr_submodel()
+output = combine_submodels(dnn, lr)
+
+# ==============================================================================
+# cost and training process
+# ==============================================================================
+classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
+ input=output, label=click)
+
+params = paddle.parameters.create(classification_cost)
+
+optimizer = paddle.optimizer.Momentum(momentum=0.01)
+
+trainer = paddle.trainer.SGD(
+ cost=classification_cost, parameters=params, update_equation=optimizer)
+
+dataset = AvazuDataset(
+ args.train_data_path, n_records_as_test=args.test_set_size)
+
+
+def event_handler(event):
+ if isinstance(event, paddle.event.EndIteration):
+ num_samples = event.batch_id * args.batch_size
+ if event.batch_id % 100 == 0:
+ logging.warning("Pass %d, Samples %d, Cost %f" %
+ (event.pass_id, num_samples, event.cost))
+
+ if event.batch_id % 1000 == 0:
+ result = trainer.test(
+ reader=paddle.batch(dataset.test, batch_size=args.batch_size),
+ feeding=field_index)
+ logging.warning("Test %d-%d, Cost %f" %
+ (event.pass_id, event.batch_id, result.cost))
+
+
+trainer.train(
+ reader=paddle.batch(
+ paddle.reader.shuffle(dataset.train, buf_size=500),
+ batch_size=args.batch_size),
+ feeding=field_index,
+ event_handler=event_handler,
+ num_passes=args.num_passes)