Commit 3065a876 authored by: S Superjom

refactor ctr model

Parent 0a27ca9d
# Click-Through Rate Prediction
The files included in this example's directory, with descriptions:
```
├── README.md # this tutorial's markdown document
├── dataset.md # dataset processing tutorial
├── images # images used in this tutorial
│   ├── lr_vs_dnn.jpg
│   └── wide_deep.png
├── infer.py # inference script
├── network_conf.py # model network configuration
├── reader.py # data provider
├── train.py # training script
└── utils.py # helper functions
```
## Background

CTR (Click-Through Rate) [[1](https://en.wikipedia.org/wiki/Click-through_rate)] denotes the probability that a user clicks a particular link,

@@ -61,8 +76,40 @@ The advantage of LR over DNN models is its capacity for large-scale sparse features, including

We use the dataset of the Kaggle `Click-through rate prediction` task [[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)] to demonstrate the model.
For the detailed feature processing, see [data process](./dataset.md).
The input format of the model demonstrated in this tutorial is as follows:
```
# <dnn input ids> \t <lr input sparse values> \t click
1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
23 231 \t 1230:0.12 13421:0.9 \t 1
```
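A minimal parsing sketch for this format (`parse_demo_line` is a name introduced here only for illustration; it mirrors the `load_dnn_input_record` / `load_lr_input_record` helpers in `utils.py`, and treats the click label as optional so the same parser also handles unlabeled prediction lines):

```
def parse_demo_line(line):
    # <dnn input ids> \t <lr input sparse values> [\t click]
    fields = line.rstrip('\n').split('\t')
    dnn_ids = [int(x) for x in fields[0].split()]
    lr_pairs = [(int(k), float(v))
                for k, v in (kv.split(':') for kv in fields[1].split())]
    click = int(fields[2]) if len(fields) > 2 else None
    return dnn_ids, lr_pairs, click

print(parse_demo_line("1 23 190\t230:0.12 3421:0.9 23451:0.12\t0"))
# ([1, 23, 190], [(230, 0.12), (3421, 0.9), (23451, 0.12)], 0)
```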
The demo dataset [[2](#references)] can be generated with the `avazu_data_processer.py` script; its usage is shown below:
```
usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
OUTPUT_DIR
[--num_lines_to_detect NUM_LINES_TO_DETECT]
[--test_set_size TEST_SET_SIZE]
[--train_size TRAIN_SIZE]
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--data_path DATA_PATH
path of the Avazu dataset
--output_dir OUTPUT_DIR
directory to output
--num_lines_to_detect NUM_LINES_TO_DETECT
number of records to detect dataset's meta info
--test_set_size TEST_SET_SIZE
size of the validation dataset(default: 10000)
--train_size TRAIN_SIZE
size of the trainset (default: 100000)
```
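After the script finishes, `--output_dir` contains the demo training, test, and inference sets plus a small meta file (file names taken from the script's output logic; sizes depend on `--train_size` and `--test_set_size`):

```
output/
├── train.txt      # demo training set
├── test.txt       # demo test set
├── infer.txt      # demo inference set (no click label)
└── data.meta.txt  # dnn_input_dim / lr_input_dim consumed by train.py and infer.py
```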
## Wide & Deep Learning Model
@@ -204,15 +251,17 @@ trainer.train(
1. Download the training data; the data of the Kaggle CTR competition [[2](#references)] can be used
    1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data)
    2. Unpack train.gz to get train.txt
    3. Run `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data
2. Run `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training
In step 2 above, command-line arguments can be passed to `train.py` to customize the training process. The available arguments and their usage are as follows:
```
usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
                [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
                [--num_passes NUM_PASSES]
                [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
                DATA_META_FILE --model_type MODEL_TYPE

PaddlePaddle CTR example

optional arguments:
  -h, --help            show this help message and exit
  --train_data_path TRAIN_DATA_PATH
                        path of training dataset
  --test_data_path TEST_DATA_PATH
                        path of testing dataset
  --batch_size BATCH_SIZE
                        size of mini-batch (default:10000)
  --num_passes NUM_PASSES
                        number of passes to train
  --model_output_prefix MODEL_OUTPUT_PREFIX
                        prefix of path for model to store (default:
                        ./ctr_models)
  --data_meta_file DATA_META_FILE
                        path of data meta info file
  --model_type MODEL_TYPE
                        model type, classification: 0, regression 1 (default
                        classification)
```
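The `--data_meta_file` passed above is the `data.meta.txt` that the data processor writes into `--output_dir`. It holds exactly the two dimensions that `reader.load_data_meta` reads back; the values below are illustrative only, not from a real run:

```
dnn_input_dim: 15000
lr_input_dim: 30000
```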
## Predicting with the trained model

A trained model can be used to make predictions on new data. The prediction data format is:
```
# <dnn input ids> \t <lr input sparse values>
1 23 190 \t 230:0.12 3421:0.9 23451:0.12
23 231 \t 1230:0.12 13421:0.9
```
The usage of `infer.py` is as follows:
```
usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
--prediction_output_path PREDICTION_OUTPUT_PATH
[--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--model_gz_path MODEL_GZ_PATH
path of model parameters gz file
--data_path DATA_PATH
path of the dataset to infer
--prediction_output_path PREDICTION_OUTPUT_PATH
path to output the prediction
--data_meta_path DATA_META_PATH
path of trainset's meta info, default is ./data.meta
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
The demo data can be used for prediction with the following command:
```
python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path output/data.meta.txt
```
The final prediction results are written to `predictions.txt`.
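`predictions.txt` holds one score per line, in the same order as the lines of the input file, so scores can be paired back with `output/infer.txt`. A minimal sketch (for the classification model the score approximates the click probability):

```
# Line i of predictions.txt scores line i of output/infer.txt.
with open('output/infer.txt') as inputs, open('predictions.txt') as scores:
    for line, score in zip(inputs, scores):
        dnn_ids, lr_values = line.rstrip('\n').split('\t')
        print("p(click)=%.4f\tdnn_input=%s" % (float(score), dnn_ids))
```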
## References

1. <https://en.wikipedia.org/wiki/Click-through_rate>
2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
......
import os
import sys
import csv
import cPickle
import argparse
import numpy as np

from utils import logger, TaskMode
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--data_path', type=str, required=True, help="path of the Avazu dataset")
parser.add_argument(
'--output_dir', type=str, required=True, help="directory to output")
parser.add_argument(
'--num_lines_to_detect',
type=int,
default=500000,
help="number of records to detect dataset's meta info")
parser.add_argument(
'--test_set_size',
type=int,
default=10000,
help="size of the validation dataset(default: 10000)")
parser.add_argument(
'--train_size',
type=int,
default=100000,
help="size of the trainset (default: 100000)")
args = parser.parse_args()
'''
The fields of the dataset are:

@@ -40,6 +67,14 @@ and some other features as id features:

The `hour` field will be treated as a continuous feature and will be transformed
to one-hot representation which has 24 bits.

This script will output 3 files:

1. train.txt
2. test.txt
3. infer.txt

all the files are for demo.
'''

feature_dims = {}

@@ -161,6 +196,7 @@ def detect_dataset(path, topn, id_fea_space=10000):

    NOTE the records should be randomly shuffled first.
    '''
    # create categorical statistics objects.
    logger.warning('detecting dataset')
    with open(path, 'rb') as csvfile:
        reader = csv.DictReader(csvfile)

@@ -174,9 +210,6 @@
    for key, item in fields.items():
        feature_dims[key] = item.size()

    feature_dims['hour'] = 24
    feature_dims['click'] = 1

@@ -184,10 +217,20 @@
        feature_dims[key] for key in categorial_features + ['hour']) + 1
    feature_dims['lr_input'] = np.sum(feature_dims[key]
                                      for key in id_features) + 1
    # logger.warning("dump dataset's meta info to %s" % meta_out_path)
    # cPickle.dump([feature_dims, fields], open(meta_out_path, 'wb'))
    return feature_dims


def load_data_meta(meta_path):
    '''
    Load dataset's meta information.
    '''
    feature_dims, fields = cPickle.load(open(meta_path, 'rb'))
    return feature_dims, fields


def concat_sparse_vectors(inputs, dims):
    '''
    Concatenate more than one sparse vectors into one.

@@ -211,67 +254,162 @@ class AvazuDataset(object):
    '''
    Load AVAZU dataset as train set.
    '''

    def __init__(self,
                 train_path,
                 n_records_as_test=-1,
                 fields=None,
                 feature_dims=None):
        self.train_path = train_path
        self.n_records_as_test = n_records_as_test
        self.fields = fields
        # default is train mode.
        self.mode = TaskMode.create_train()

        self.categorial_dims = [
            feature_dims[key] for key in categorial_features + ['hour']
        ]
        self.id_dims = [feature_dims[key] for key in id_features]

    def train(self):
        '''
        Load trainset.
        '''
        logger.info("load trainset from %s" % self.train_path)
        self.mode = TaskMode.create_train()
        with open(self.train_path) as f:
            reader = csv.DictReader(f)

            for row_id, row in enumerate(reader):
                # skip the top n lines, which are reserved for the test set
                if self.n_records_as_test > 0 and row_id < self.n_records_as_test:
                    continue

                rcd = self._parse_record(row)
                if rcd:
                    yield rcd

    def test(self):
        '''
        Load testset.
        '''
        logger.info("load testset from %s" % self.train_path)
        self.mode = TaskMode.create_test()
        with open(self.train_path) as f:
            reader = csv.DictReader(f)
            for row_id, row in enumerate(reader):
                # take only the top n lines as the test set
                if self.n_records_as_test > 0 and row_id > self.n_records_as_test:
                    break
                rcd = self._parse_record(row)
                if rcd:
                    yield rcd

    def infer(self):
        '''
        Load inferset.
        '''
        logger.info("load inferset from %s" % self.train_path)
        self.mode = TaskMode.create_infer()
        with open(self.train_path) as f:
            reader = csv.DictReader(f)
            for row_id, row in enumerate(reader):
                rcd = self._parse_record(row)
                if rcd:
                    yield rcd

    def _parse_record(self, row):
        '''
        Parse a CSV row and get a record.
        '''
        record = []
        for key in categorial_features:
            record.append(self.fields[key].gen(row[key]))
        record.append([int(row['hour'][-2:])])
        dense_input = concat_sparse_vectors(record, self.categorial_dims)

        record = []
        for key in id_features:
            if 'cross' not in key:
                record.append(self.fields[key].gen(row[key]))
            else:
                fea0 = self.fields[key].cross_fea0
                fea1 = self.fields[key].cross_fea1
                record.append(
                    self.fields[key].gen_cross_fea(row[fea0], row[fea1]))

        sparse_input = concat_sparse_vectors(record, self.id_dims)

        record = [dense_input, sparse_input]
        if not self.mode.is_infer():
            record.append(list((int(row['click']), )))
        return record


def ids2dense(vec, dim):
    '''Pass the id vector through unchanged for the DNN (dense id list) input.'''
    return vec


def ids2sparse(vec):
    '''Format id features as `id:1` pairs for the LR (sparse) input.'''
    return ["%d:1" % x for x in vec]

detect_dataset(args.data_path, args.num_lines_to_detect)

dataset = AvazuDataset(
    args.data_path,
    args.test_set_size,
    fields=fields,
    feature_dims=feature_dims)

output_trainset_path = os.path.join(args.output_dir, 'train.txt')
output_testset_path = os.path.join(args.output_dir, 'test.txt')
output_infer_path = os.path.join(args.output_dir, 'infer.txt')
output_meta_path = os.path.join(args.output_dir, 'data.meta.txt')

with open(output_trainset_path, 'w') as f:
    for id, record in enumerate(dataset.train()):
        if id and id % 10000 == 0:
            logger.info("load %d records" % id)
        if id > args.train_size:
            break
        dnn_input, lr_input, click = record
        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
        lr_input = ids2sparse(lr_input)
        line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
                                 ' '.join(map(str, lr_input)), click[0])
        f.write(line)

logger.info('write to %s' % output_trainset_path)

with open(output_testset_path, 'w') as f:
    for id, record in enumerate(dataset.test()):
        dnn_input, lr_input, click = record
        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
        lr_input = ids2sparse(lr_input)
        line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
                                 ' '.join(map(str, lr_input)), click[0])
        f.write(line)

logger.info('write to %s' % output_testset_path)

with open(output_infer_path, 'w') as f:
    for id, record in enumerate(dataset.infer()):
        dnn_input, lr_input = record
        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
        lr_input = ids2sparse(lr_input)
        line = "%s\t%s\n" % (' '.join(map(str, dnn_input)),
                             ' '.join(map(str, lr_input)), )
        f.write(line)

        if id > args.test_set_size:
            break

logger.info('write to %s' % output_infer_path)

with open(output_meta_path, 'w') as f:
    lines = [
        "dnn_input_dim: %d" % feature_dims['dnn_input'],
        "lr_input_dim: %d" % feature_dims['lr_input']
    ]
    f.write('\n'.join(lines))

logger.info('write data meta into %s' % output_meta_path)
# Dataset and Preprocessing

## About the dataset

This tutorial demonstrates how to preprocess the dataset of the Kaggle CTR task [[3](#references)] into the format required by this model; see [README.md](./README.md) for the detailed data format.

The strength of the Wide && Deep Model [[2](#references)] is that it fuses dense features with large-scale sparse features,
so feature processing handles the dense and the sparse features separately:
for the Deep part, all dense values are converted into ID-type features and then mapped to dense vector inputs through embeddings;
the Wide part mainly raises the dimensionality by crossing IDs, as sketched below.
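The repository's `gen_cross_fea` implementation is elided from this diff; shown here only as an illustrative sketch (the function name and the `id_space` size are assumptions, not the project's code), a common way to realize such an ID cross is to hash the concatenation of the two raw values into a bounded ID space:

```
def cross_feature_id(val_a, val_b, id_space=10000):
    # hash the concatenated raw values of two categorical fields into a
    # bounded ID space; the resulting ID feeds the wide (LR) part.
    return hash('%s_%s' % (val_a, val_b)) % id_space

# e.g. crossing a device id with a site id for one record
print(cross_feature_id('a99f214a', '1fbe01fe'))
```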
The dataset is stored in `csv` format, with the following fields:

- `id` : ad identifier
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import gzip
import argparse
import itertools
import paddle.v2 as paddle
import network_conf
from train import dnn_layer_dims
import reader
from utils import logger, ModelType
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--model_gz_path',
type=str,
required=True,
help="path of model parameters gz file")
parser.add_argument(
'--data_path', type=str, required=True, help="path of the dataset to infer")
parser.add_argument(
'--prediction_output_path',
type=str,
required=True,
help="path to output the prediction")
parser.add_argument(
'--data_meta_path',
type=str,
default="./data.meta",
help="path of trainset's meta info, default is ./data.meta")
parser.add_argument(
'--model_type',
type=int,
required=True,
default=ModelType.CLASSIFICATION,
help='model type, classification: %d, regression %d (default classification)'
% (ModelType.CLASSIFICATION, ModelType.REGRESSION))
args = parser.parse_args()
paddle.init(use_gpu=False, trainer_count=1)
class CTRInferer(object):
    def __init__(self, param_path):
        logger.info("create CTR model")
        dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_path)
        # create the model; the raw int flag is wrapped into a ModelType so
        # the network config can query is_classification()/is_regression().
        self.ctr_model = network_conf.CTRmodel(
            dnn_layer_dims,
            dnn_input_dim,
            lr_input_dim,
            model_type=ModelType(args.model_type),
            is_infer=True)
        # load parameter
        logger.info("load model parameters from %s" % param_path)
        self.parameters = paddle.parameters.Parameters.from_tar(
            gzip.open(param_path, 'r'))
        self.inferer = paddle.inference.Inference(
            output_layer=self.ctr_model.model,
            parameters=self.parameters, )

    def infer(self, data_path):
        logger.info("infer data...")
        dataset = reader.Dataset()
        infer_reader = paddle.batch(
            dataset.infer(args.data_path), batch_size=1000)
        logger.warning('write predictions to %s' % args.prediction_output_path)
        output_f = open(args.prediction_output_path, 'w')

        for id, batch in enumerate(infer_reader()):
            res = self.inferer.infer(input=batch)
            predictions = [x for x in itertools.chain.from_iterable(res)]
            assert len(batch) == len(
                predictions), "predict error, %d inputs, but %d predictions" % (
                    len(batch), len(predictions))
            output_f.write('\n'.join(map(str, predictions)) + '\n')


if __name__ == '__main__':
    ctr_inferer = CTRInferer(args.model_gz_path)
    ctr_inferer.infer(args.data_path)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import paddle.v2 as paddle
from paddle.v2 import layer
from paddle.v2 import data_type as dtype
from utils import logger, ModelType
class CTRmodel(object):
    '''
    A CTR model which implements the wide && deep learning model.
    '''

    def __init__(self,
                 dnn_layer_dims,
                 dnn_input_dim,
                 lr_input_dim,
                 model_type=ModelType.create_classification(),
                 is_infer=False):
        '''
        @dnn_layer_dims: list of integer
            dims of each layer in dnn
        @dnn_input_dim: int
            size of dnn's input layer
        @lr_input_dim: int
            size of lr's input layer
        @is_infer: bool
            whether to build an infer model
        '''
        self.dnn_layer_dims = dnn_layer_dims
        self.dnn_input_dim = dnn_input_dim
        self.lr_input_dim = lr_input_dim
        self.model_type = model_type
        self.is_infer = is_infer

        self._declare_input_layers()

        self.dnn = self._build_dnn_submodel_(self.dnn_layer_dims)
        self.lr = self._build_lr_submodel_()

        # model's prediction
        # TODO(superjom) rename it to prediction
        if self.model_type.is_classification():
            self.model = self._build_classification_model(self.dnn, self.lr)
        if self.model_type.is_regression():
            self.model = self._build_regression_model(self.dnn, self.lr)

    def _declare_input_layers(self):
        self.dnn_merged_input = layer.data(
            name='dnn_input',
            type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))

        self.lr_merged_input = layer.data(
            name='lr_input',
            type=paddle.data_type.sparse_vector(self.lr_input_dim))

        # the click label is only needed for training/testing, not inference.
        if not self.is_infer:
            self.click = paddle.layer.data(
                name='click', type=dtype.dense_vector(1))

    def _build_dnn_submodel_(self, dnn_layer_dims):
        '''
        build DNN submodel.
        '''
        dnn_embedding = layer.fc(
            input=self.dnn_merged_input, size=dnn_layer_dims[0])
        _input_layer = dnn_embedding
        for i, dim in enumerate(dnn_layer_dims[1:]):
            fc = layer.fc(
                input=_input_layer,
                size=dim,
                act=paddle.activation.Relu(),
                name='dnn-fc-%d' % i)
            _input_layer = fc
        return _input_layer

    def _build_lr_submodel_(self):
        '''
        config LR submodel
        '''
        fc = layer.fc(
            input=self.lr_merged_input,
            size=1,
            name='lr',
            act=paddle.activation.Relu())
        return fc
    def _build_classification_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        self.output = layer.fc(
            input=merge_layer,
            size=1,
            name='output',
            # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
            act=paddle.activation.Sigmoid())

        # the cost needs the click label, which is only declared outside infer
        # mode (see _declare_input_layers).
        if not self.is_infer:
            self.train_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
                input=self.output, label=self.click)
        return self.output

    def _build_regression_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        self.output = layer.fc(
            input=merge_layer,
            size=1,
            name='output',
            act=paddle.activation.Sigmoid())
        if not self.is_infer:
            self.train_cost = paddle.layer.mse_cost(
                input=self.output, label=self.click)
        return self.output
from utils import logger, TaskMode, load_dnn_input_record, load_lr_input_record
feeding_index = {'dnn_input': 0, 'lr_input': 1, 'click': 2}
class Dataset(object):
    def __init__(self):
        self.mode = TaskMode.create_train()

    def train(self, path):
        '''
        Load trainset.
        '''
        logger.info("load trainset from %s" % path)
        self.mode = TaskMode.create_train()
        self.path = path
        return self._parse

    def test(self, path):
        '''
        Load testset.
        '''
        logger.info("load testset from %s" % path)
        self.path = path
        self.mode = TaskMode.create_test()
        return self._parse

    def infer(self, path):
        '''
        Load infer set.
        '''
        logger.info("load inferset from %s" % path)
        self.path = path
        self.mode = TaskMode.create_infer()
        return self._parse

    def _parse(self):
        '''
        Parse dataset.
        '''
        with open(self.path) as f:
            for line_id, line in enumerate(f):
                fs = line.strip().split('\t')
                dnn_input = load_dnn_input_record(fs[0])
                lr_input = load_lr_input_record(fs[1])
                if not self.mode.is_infer():
                    click = [int(fs[2])]
                    yield dnn_input, lr_input, click
                else:
                    yield dnn_input, lr_input


def load_data_meta(path):
    '''
    load data meta info from path, return (dnn_input_dim, lr_input_dim)
    '''
    with open(path) as f:
        lines = f.read().split('\n')
        err_info = "wrong meta format"
        assert len(lines) == 2, err_info
        assert 'dnn_input_dim:' in lines[0] and 'lr_input_dim:' in lines[
            1], err_info
        res = map(int, [_.split(':')[1] for _ in lines])
        logger.info('dnn input dim: %d' % res[0])
        logger.info('lr input dim: %d' % res[1])
        return res
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import gzip

import reader
import paddle.v2 as paddle
from utils import logger, ModelType
from network_conf import CTRmodel


def parse_args():
    parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
    parser.add_argument(
        '--train_data_path',
        type=str,
        required=True,
        help="path of training dataset")
    parser.add_argument(
        '--test_data_path', type=str, help='path of testing dataset')
    parser.add_argument(
        '--batch_size',
        type=int,
        default=10000,
        help="size of mini-batch (default:10000)")
    parser.add_argument(
        '--num_passes', type=int, default=10, help="number of passes to train")
    parser.add_argument(
        '--model_output_prefix',
        type=str,
        default='./ctr_models',
        help='prefix of path for model to store (default: ./ctr_models)')
    parser.add_argument(
        '--data_meta_file',
        type=str,
        required=True,
        help='path of data meta info file', )
    parser.add_argument(
        '--model_type',
        type=int,
        required=True,
        default=ModelType.CLASSIFICATION,
        help='model type, classification: %d, regression %d (default classification)'
        % (ModelType.CLASSIFICATION, ModelType.REGRESSION))

    return parser.parse_args()


dnn_layer_dims = [128, 64, 32, 1]

# ==============================================================================
# cost and train period
# ==============================================================================


def train():
    args = parse_args()
    args.model_type = ModelType(args.model_type)
    paddle.init(use_gpu=False, trainer_count=1)
    dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_file)

    # create ctr model.
    model = CTRmodel(
        dnn_layer_dims,
        dnn_input_dim,
        lr_input_dim,
        model_type=args.model_type,
        is_infer=False)

    params = paddle.parameters.create(model.train_cost)
    optimizer = paddle.optimizer.AdaGrad()

    trainer = paddle.trainer.SGD(
        cost=model.train_cost, parameters=params, update_equation=optimizer)

    # dataset = reader.AvazuDataset(
    #     args.train_data_path,
    #     n_records_as_test=args.test_set_size,
    #     fields=reader.fields,
    #     feature_dims=reader.feature_dims)
    dataset = reader.Dataset()

    def __event_handler__(event):
        if isinstance(event, paddle.event.EndIteration):
            num_samples = event.batch_id * args.batch_size
            if event.batch_id % 100 == 0:
                logger.warning("Pass %d, Samples %d, Cost %f, %s" % (
                    event.pass_id, num_samples, event.cost, event.metrics))

            if event.batch_id % 1000 == 0:
                if args.test_data_path:
                    result = trainer.test(
                        reader=paddle.batch(
                            dataset.test(args.test_data_path),
                            batch_size=args.batch_size),
                        feeding=reader.feeding_index)
                    logger.warning("Test %d-%d, Cost %f, %s" %
                                   (event.pass_id, event.batch_id, result.cost,
                                    result.metrics))
                    # save a checkpoint after each test, tagging the file with
                    # the test cost.
                    path = "{}-pass-{}-batch-{}-test-{}.tar.gz".format(
                        args.model_output_prefix, event.pass_id, event.batch_id,
                        result.cost)
                    with gzip.open(path, 'w') as f:
                        params.to_tar(f)

    trainer.train(
        reader=paddle.batch(
            paddle.reader.shuffle(
                dataset.train(args.train_data_path), buf_size=500),
            batch_size=args.batch_size),
        feeding=reader.feeding_index,
        event_handler=__event_handler__,
        num_passes=args.num_passes)


if __name__ == '__main__':
    train()
import logging
logging.basicConfig()
logger = logging.getLogger("logger")
logger.setLevel(logging.INFO)

class TaskMode:
    TRAIN_MODE = 0
    TEST_MODE = 1
    INFER_MODE = 2

    def __init__(self, mode):
        self.mode = mode

    def is_train(self):
        return self.mode == self.TRAIN_MODE

    def is_test(self):
        return self.mode == self.TEST_MODE

    def is_infer(self):
        return self.mode == self.INFER_MODE

    @staticmethod
    def create_train():
        return TaskMode(TaskMode.TRAIN_MODE)

    @staticmethod
    def create_test():
        return TaskMode(TaskMode.TEST_MODE)

    @staticmethod
    def create_infer():
        return TaskMode(TaskMode.INFER_MODE)


class ModelType:
    CLASSIFICATION = 0
    REGRESSION = 1

    def __init__(self, mode):
        self.mode = mode

    def is_classification(self):
        return self.mode == self.CLASSIFICATION

    def is_regression(self):
        return self.mode == self.REGRESSION

    @staticmethod
    def create_classification():
        return ModelType(ModelType.CLASSIFICATION)

    @staticmethod
    def create_regression():
        return ModelType(ModelType.REGRESSION)


def load_dnn_input_record(sent):
    return map(int, sent.split())


def load_lr_input_record(sent):
    res = []
    for _ in [x.split(':') for x in sent.split()]:
        res.append((int(_[0]), float(_[1]), ))
    return res