# Click-Through Rate Prediction

The following files are included in this example's directory, with descriptions:

```
├── README.md               # This tutorial's markdown document
├── dataset.md              # Dataset processing tutorial
├── images                  # Images used in this tutorial
│   ├── lr_vs_dnn.jpg
│   └── wide_deep.png
├── infer.py                # Inference script
├── network_conf.py         # Model network configuration
├── reader.py               # Data reader
├── train.py                # Training script
├── utils.py                # Helper functions
└── avazu_data_processer.py # Example data preprocessing script
```

## Introduction

CTR (Click-Through Rate) \[[1](https://en.wikipedia.org/wiki/Click-through_rate)\] is the predicted probability that a user will click on an advertisement. Models of this kind are widely used in the advertising industry, and accurate click-rate estimates are important for maximizing online advertising revenue.

When there are multiple ad slots, CTR estimates are generally used as a baseline for ranking. For example, in a search engine's ad system, when the user enters a query, the system typically performs the following steps to show relevant ads.

1.  Get the ad collection associated with the user's search term.
2.  Filter the candidates by business rules and relevance.
3.  Rank the remaining ads by auction mechanism and CTR (see the sketch below).
4.  Show the ads.

Here, CTR plays a crucial role.
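
To make step 3 above concrete: a common ranking rule scores each candidate ad by its expected revenue, roughly `bid * CTR` (eCPM). The following sketch is purely illustrative, with made-up bids and CTRs, and is not part of this example's code:

```python
# Illustrative only: rank candidate ads by expected revenue (bid * CTR).
ads = [
    {'id': 'a', 'bid': 0.50, 'ctr': 0.02},
    {'id': 'b', 'bid': 0.30, 'ctr': 0.05},
    {'id': 'c', 'bid': 0.80, 'ctr': 0.01},
]
ranked = sorted(ads, key=lambda ad: ad['bid'] * ad['ctr'], reverse=True)
print([ad['id'] for ad in ranked])  # ['b', 'a', 'c']
```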

### Brief history
Historically, CTR prediction models have evolved as follows.

-   Logistic Regression (LR) / Gradient Boosting Decision Trees (GBDT) + feature engineering
-   LR + Deep Neural Network (DNN)
-   DNN + feature engineering

In the early stages of development, LR dominated; in recent years, DNN-based models have become mainstream.


### LR vs DNN

The following figure compares the structures of the LR and DNN models:

<p align="center">
<img src="images/lr_vs_dnn.jpg" width="620" hspace='10'/> <br/>
Figure 1.  LR and DNN model structure comparison
</p>

As the figure shows, LR and DNN share some common structure. However, by adding activation units and more layers, a DNN can model non-linear relations between inputs and outputs, which enables it to achieve better results in CTR estimation.

In the following sections, we demonstrate how to use PaddlePaddle to predict CTR.

## Data and Model Formulation

Here `click` is the learning objective. There are several ways to learn this objective:

1.  Learn `click` directly as a binary classification problem (labels 0 and 1).
2.  Use a learning-to-rank approach (pairwise or listwise ranking).
3.  Estimate the click rate of each ad, then rank the ads by the estimated rate.

In this example, we use the first method.

We use the dataset from the Kaggle `Click-through rate prediction` task \[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\].

Please see [data process](./dataset.md) for details on pre-processing the data.

The input data format for the demo model in this tutorial is as follows:

```
# <dnn input ids> \t <lr input sparse values> \t click
1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
23 231 \t 1230:0.12 13421:0.9 \t 1
```

Description:

- `dnn input ids`: one-hot encoded feature IDs.
- `lr input sparse values`: sparse features in `ID:VALUE` format; values are preferably scaled to the range `[-1, 1]`.

In addition, training requires a meta file that describes the input dimensions of the `dnn` and `lr` sub-models. The file format is as follows:

```
dnn_input_dim: <int>
lr_input_dim: <int>
```

`<int>` represents an integer value.
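
As a quick illustration, a record in this format could be parsed as follows. This is a minimal sketch; the helper `parse_record` is hypothetical and not part of the example's scripts:

```python
def parse_record(line):
    # Format: <dnn input ids> \t <lr input sparse values> \t click
    dnn_part, lr_part, click = line.strip().split('\t')
    # dnn input: space-separated one-hot feature IDs
    dnn_ids = [int(i) for i in dnn_part.split()]
    # lr input: space-separated sparse ID:VALUE pairs
    lr_input = [(int(i), float(v)) for i, v in
                (pair.split(':') for pair in lr_part.split())]
    return dnn_ids, lr_input, int(click)

print(parse_record('1 23 190\t230:0.12 3421:0.9 23451:0.12\t0'))
# ([1, 23, 190], [(230, 0.12), (3421, 0.9), (23451, 0.12)], 0)
```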

`avazu_data_processer.py` can be used to download the dataset \[[2](#references)\] and pre-process it.

```
usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
                               OUTPUT_DIR
                               [--num_lines_to_detect NUM_LINES_TO_DETECT]
                               [--test_set_size TEST_SET_SIZE]
                               [--train_size TRAIN_SIZE]

PaddlePaddle CTR example

optional arguments:
  -h, --help            show this help message and exit
  --data_path DATA_PATH
                        path of the Avazu dataset
  --output_dir OUTPUT_DIR
                        directory to output
  --num_lines_to_detect NUM_LINES_TO_DETECT
                        number of records to detect dataset's meta info
  --test_set_size TEST_SET_SIZE
                        size of the validation dataset(default: 10000)
  --train_size TRAIN_SIZE
                        size of the trainset (default: 100000)
```

- `data_path`: path of the raw dataset to process
- `output_dir`: directory for the processed output
- `num_lines_to_detect`: number of records scanned to detect the dataset's meta info
- `test_set_size`: number of rows in the test set
- `train_size`: number of rows in the training set

## Wide & Deep Learning Model

Google proposed the Wide & Deep Learning framework to integrate the advantages of both model families: DNNs are good at learning abstract features, while LR models handle large-scale sparse features well.


### Introduction to the model

The Wide & Deep Learning Model \[[3](#references)\] is a relatively mature model that is still widely used for CTR prediction. Here we demonstrate how to use it to complete the CTR prediction task.

The model structure is as follows:

<p align="center">
<img src="images/wide_deep.png" width="820" hspace='10'/> <br/>
Figure 2. Wide & Deep Model
</p>

The Wide part on the left can accommodate large-scale sparse features and memorize specific information (such as IDs), while the Deep part on the right can learn implicit relationships between features.


### Model Input

The model has three inputs as follows.

-   `dnn_input`: the input to the Deep part
-   `lr_input`: the input to the Wide part
-   `click`: whether the ad was clicked or not

```python
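# Assumed context for the snippets in this section (PaddlePaddle v2 API):
#   import paddle.v2 as paddle
#   from paddle.v2 import layer
#   from paddle.v2 import data_type as dtype
# `data_meta_info` is assumed to hold the two input dimensions parsed from
# the data meta file described above.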
dnn_merged_input = layer.data(
    name='dnn_input',
    type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))

lr_merged_input = layer.data(
    name='lr_input',
    type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))

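# label: 1 if the ad was clicked, 0 otherwise (a dense vector of size 1)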
click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
```

### Wide part

The Wide part uses the LR model, but the activation function is changed to `ReLU` for speed.

```python
def build_lr_submodel():
    fc = layer.fc(
        input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
    return fc
```

### Deep part

The Deep part uses a standard multi-layer DNN.

```python
def build_dnn_submodel(dnn_layer_dims):
    dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
    _input_layer = dnn_embedding
    for i, dim in enumerate(dnn_layer_dims[1:]):
        fc = layer.fc(
            input=_input_layer,
            size=dim,
            act=paddle.activation.Relu(),
            name='dnn-fc-%d' % i)
        _input_layer = fc
    return _input_layer
```

### Combine

The output section uses the `sigmoid` function to produce a prediction value in the range (0, 1).

```python
# combine DNN and LR submodels
def combine_submodels(dnn, lr):
    merge_layer = layer.concat(input=[dnn, lr])
    fc = layer.fc(
        input=merge_layer,
        size=1,
        name='output',
        # use sigmoid function to approximate ctr, which is a float value between 0 and 1.
        act=paddle.activation.Sigmoid())
    return fc
```

### Training
```python
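# The names below are assumed to be defined earlier in train.py; the values
# shown are illustrative, not prescriptive:
#   dnn_layer_dims = [128, 64, 32, 1]
#   batch_size = 10000
#   field_index maps 'dnn_input', 'lr_input' and 'click' to their column
#   positions in each sample produced by the reader
#   train_data_path and test_set_size are likewise assumed to be defined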
dnn = build_dnn_submodel(dnn_layer_dims)
lr = build_lr_submodel()
output = combine_submodels(dnn, lr)

# ==============================================================================
#                   cost and train period
# ==============================================================================
classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
    input=output, label=click)


paddle.init(use_gpu=False, trainer_count=11)

params = paddle.parameters.create(classification_cost)

optimizer = paddle.optimizer.Momentum(momentum=0)

trainer = paddle.trainer.SGD(
    cost=classification_cost, parameters=params, update_equation=optimizer)

dataset = AvazuDataset(train_data_path, n_records_as_test=test_set_size)

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            logging.warning("Pass %d, Samples %d, Cost %f" % (
                event.pass_id, event.batch_id * batch_size, event.cost))

        if event.batch_id % 1000 == 0:
            result = trainer.test(
                reader=paddle.batch(dataset.test, batch_size=1000),
                feeding=field_index)
            logging.warning("Test %d-%d, Cost %f" % (event.pass_id, event.batch_id,
                                           result.cost))


trainer.train(
    reader=paddle.batch(
        paddle.reader.shuffle(dataset.train, buf_size=500),
        batch_size=batch_size),
    feeding=field_index,
    event_handler=event_handler,
    num_passes=100)
```

## Run training and testing
The example goes through the following steps:

1. Prepare the training data
    1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data).
    2. Unzip train.gz to get train.txt.
    3. Run `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data.
2. Execute `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training.

The argument options for `train.py` are as follows.

```
usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
                [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
                [--num_passes NUM_PASSES]
                [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
                DATA_META_FILE --model_type MODEL_TYPE

PaddlePaddle CTR example

optional arguments:
  -h, --help            show this help message and exit
  --train_data_path TRAIN_DATA_PATH
                        path of training dataset
  --test_data_path TEST_DATA_PATH
                        path of testing dataset
  --batch_size BATCH_SIZE
                        size of mini-batch (default:10000)
  --num_passes NUM_PASSES
                        number of passes to train
  --model_output_prefix MODEL_OUTPUT_PREFIX
                        prefix of path for model to store (default:
                        ./ctr_models)
  --data_meta_file DATA_META_FILE
                        path of data meta info file
  --model_type MODEL_TYPE
                        model type, classification: 0, regression 1 (default
                        classification)
```

- `train_data_path`: path of the training set
- `test_data_path`: path of the testing set
- `num_passes`: number of passes to train the model
- `data_meta_file`: the meta file described in [Data and Model Formulation](#data-and-model-formulation)
- `model_type`: model type, classification (0) or regression (1)


## Use the trained model for prediction
The trained model can be used to predict new data. The format of the prediction data is as follows:


```
# <dnn input ids> \t <lr input sparse values>
1 23 190 \t 230:0.12 3421:0.9 23451:0.12
23 231 \t 1230:0.12 13421:0.9
```

The only difference from the training data is that there are no labels (i.e. no `click` values).

We can now use `infer.py` to perform inference.

```
usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
                --prediction_output_path PREDICTION_OUTPUT_PATH
                [--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE

PaddlePaddle CTR example

optional arguments:
  -h, --help            show this help message and exit
  --model_gz_path MODEL_GZ_PATH
                        path of model parameters gz file
  --data_path DATA_PATH
                        path of the dataset to infer
  --prediction_output_path PREDICTION_OUTPUT_PATH
                        path to output the prediction
  --data_meta_path DATA_META_PATH
                        path of trainset's meta info, default is ./data.meta
  --model_type MODEL_TYPE
                        model type, classification: 0, regression 1 (default
                        classification)
```

- `model_gz_path`: path of the `gz` compressed model parameters file
- `data_path`: path of the dataset to infer
- `prediction_output_path`: path to output the predictions
- `data_meta_path`: path of the meta file described in [Data and Model Formulation](#data-and-model-formulation)
- `model_type`: model type, classification (0) or regression (1)

The sample data can be predicted with the following command:

```
python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt
```

The final predictions are written to `predictions.txt`.

## References
1. <https://en.wikipedia.org/wiki/Click-through_rate>
2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
3. Cheng H T, Koc L, Harmsen J, et al. [Wide & deep learning for recommender systems](https://arxiv.org/pdf/1606.07792.pdf)[C]//Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016: 7-10.