PaddlePaddle
models
Commit 7a2d0cae
Authored on July 18, 2017 by Yibing Liu
Merge branch 'develop' into mfcc_feat_dev
Parents: f1911b12, 63a80a32
Showing 10 changed files with 1142 additions and 166 deletions
ctr/README.md (+148 -19)
ctr/avazu_data_processer.py (+413 -0)
ctr/dataset.md (+7 -0)
ctr/index.html (+148 -19)
ctr/infer.py (+81 -0)
ctr/network_conf.py (+105 -0)
ctr/reader.py (+66 -0)
ctr/train.py (+102 -127)
ctr/utils.py (+68 -0)
nce_cost/train.py (+4 -1)
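ctr/README.md @ 7a2d0cae
 # Click-Through Rate Prediction
+The files in this example directory and their descriptions are as follows:
+```
+├── README.md               # this tutorial, in markdown
+├── dataset.md              # dataset processing tutorial
+├── images                  # images used in this tutorial
+│   ├── lr_vs_dnn.jpg
+│   └── wide_deep.png
+├── infer.py                # inference script
+├── network_conf.py         # model network configuration
+├── reader.py               # data reader
+├── train.py                # training script
+├── utils.py                # helper functions
+└── avazu_data_processer.py # preprocessing script for the demo data
+```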
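 ## Background
-CTR (Click-Through Rate) prediction\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\] expresses the probability that a user clicks a particular link,
-and is commonly used to measure the effectiveness of an online advertising system.
+CTR (Click-Through Rate) prediction\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\]
+estimates the probability that a user clicks a particular link, and is an important step in ad serving. Accurate CTR prediction matters greatly for maximizing the revenue of an online advertising system.
-When there are multiple ad slots, the predicted CTR is generally used as the basis for ranking.
-For example, in a search engine's advertising system, when a user enters a query with commercial value, the system roughly performs the following steps to display ads:
+When there are multiple ad slots, the predicted CTR is generally used as the basis for ranking. For example, in a search engine's advertising system, when a user enters a query with commercial value, the system roughly performs the following steps to display ads:
-1. Recall the set of ads that match the query
+1. Retrieve the set of ads relevant to the user's query
 2. Filter by business rules and relevance
 3. Rank according to the auction mechanism and CTR
 4. Display the ads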
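The ranking step in the list above can be sketched as follows. The README does not say how the auction mechanism combines bids with CTR, so scoring each ad by `bid * pctr` (expected revenue per impression) is an assumption made for illustration, and the field names `bid` and `pctr` are hypothetical.

```python
def rank_ads(candidates):
    """Sort candidate ads by bid * predicted CTR, highest first.

    Assumption: expected revenue per impression is the ranking score.
    A real auction mechanism may use a different rule.
    """
    return sorted(
        candidates, key=lambda ad: ad["bid"] * ad["pctr"], reverse=True)


# Hypothetical candidates that survived retrieval and filtering.
ads = [
    {"id": "a", "bid": 2.0, "pctr": 0.01},  # score 0.02
    {"id": "b", "bid": 0.5, "pctr": 0.08},  # score 0.04
    {"id": "c", "bid": 1.0, "pctr": 0.03},  # score 0.03
]
ranked_ids = [ad["id"] for ad in rank_ads(ads)]
print(ranked_ids)
```

Under this scoring rule a cheap ad with a high predicted CTR can outrank an expensive ad that is rarely clicked, which is why the accuracy of the CTR estimate affects revenue directly.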
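...
@@ -36,13 +51,11 @@ Figure 1. Comparison of the LR and DNN model structures
 </p>
 The part of LR marked by blue arrows maps directly onto the corresponding structure in the DNN, which shows that LR and DNN have something in common (for example, weighted summation),
-but for the same input dimension the former's model complexity can be much lower than the latter's (in a sense, the more complex a model is, the more potential it has to learn complex information).
+but for the same input dimension the former's model complexity can be much lower than the latter's (in a sense, the more complex a model is, the more potential it has to learn complex information);
 for LR to match the learning capacity of a DNN, the input dimension must grow, that is, more features must be added,
 which is why LR is invariably tied to large-scale feature engineering.
-LR's advantage over DNN models is its capacity for large-scale sparse features; for both memory and computation, industry has very mature optimization methods.
+LR's advantage over DNN models is its capacity for large-scale sparse features; for both memory and computation, industry has very mature optimization methods;
 DNN models, in turn, can learn new features on their own, which improves the efficiency of feature use to some extent,
 so that with features of the same scale a DNN model is more likely to achieve better results.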
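To make the complexity comparison concrete, the sketch below counts trainable parameters for an LR model and a small fully connected DNN that share the same input dimension. The input dimension and the layer widths are made-up values chosen for illustration; they are not taken from this example's `train.py`.

```python
def lr_param_count(input_dim):
    """One weight per input feature plus one bias."""
    return input_dim + 1


def dnn_param_count(input_dim, layer_dims):
    """Sum of (fan_in + 1) * width over fully connected layers."""
    total, fan_in = 0, input_dim
    for width in layer_dims:
        total += (fan_in + 1) * width  # weights plus one bias per unit
        fan_in = width
    return total


input_dim = 1000                  # hypothetical feature dimension
hidden_dims = [128, 64, 32, 1]    # hypothetical layer widths
lr_params = lr_param_count(input_dim)
dnn_params = dnn_param_count(input_dim, hidden_dims)
print(lr_params, dnn_params)
```

With these numbers the DNN has over a hundred times as many parameters as LR for the same input, which is the gap that LR has to close by adding hand-built features.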
...
@@ -59,10 +72,62 @@ LR's advantage over DNN models is its capacity for large-scale sparse features, including
 We use the first approach directly, as a classification task.
-We use the dataset of the Kaggle `Click-through rate prediction` task\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model.
+We use the dataset of the Kaggle `Click-through rate prediction` task\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model in this example.
+For the feature processing details, see [data process](./dataset.md).
+The input format of the demo model in this tutorial is as follows:
+```
+# <dnn input ids> \t <lr input sparse values> \t click
+1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
+23 231 \t 1230:0.12 13421:0.9 \t 1
+```
+The format is described in detail below:
+- `dnn input ids` uses a one-hot representation; only the IDs whose value is 1 need to be listed (note that this is not a variable-length input)
+- `lr input sparse values` uses the `ID:VALUE` representation; the values should preferably be normalized to the range `[-1, 1]`.
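A minimal sketch of reading one line in the format above, assuming the three columns are separated by real tab characters and the items inside a column by spaces. The helper name `parse_record` is hypothetical and is not part of this example's `reader.py`.

```python
def parse_record(line):
    """Split '<dnn ids> TAB <lr id:value pairs> TAB click' into Python values."""
    dnn_col, lr_col, click_col = line.rstrip("\n").split("\t")
    # IDs of the one-hot positions that are set to 1.
    dnn_ids = [int(tok) for tok in dnn_col.split()]
    # Sparse (index, value) pairs for the LR input.
    lr_pairs = []
    for tok in lr_col.split():
        idx, value = tok.split(":")
        lr_pairs.append((int(idx), float(value)))
    return dnn_ids, lr_pairs, int(click_col)


sample = "1 23 190\t230:0.12 3421:0.9 23451:0.12\t0"
dnn_ids, lr_pairs, click = parse_record(sample)
print(dnn_ids, lr_pairs, click)
```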
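+In addition, training requires a file that describes the input dimensions of the two submodels, dnn and lr. The file format is as follows:
-For the feature processing details, see [data process](./dataset.md)
+```
+dnn_input_dim: <int>
+lr_input_dim: <int>
+```
+where `<int>` stands for an integer value.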
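A small sketch of parsing that two-line meta file, assuming it contains exactly the `key: <int>` lines shown above. The function name `parse_data_meta` is hypothetical (the repository's own loader, `reader.load_data_meta`, is not shown in this diff), and the dimensions in the sample text are made up.

```python
def parse_data_meta(text):
    """Return (dnn_input_dim, lr_input_dim) from the meta file contents."""
    dims = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split(":", 1)
        dims[key.strip()] = int(value)
    return dims["dnn_input_dim"], dims["lr_input_dim"]


# Made-up dimensions, for illustration only.
meta_text = "dnn_input_dim: 61\nlr_input_dim: 10040001"
dnn_dim, lr_dim = parse_data_meta(meta_text)
print(dnn_dim, lr_dim)
```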
+The `avazu_data_processer.py` script in this directory processes the downloaded demo dataset\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\]. Its usage is as follows:
+```
+usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
+                               OUTPUT_DIR
+                               [--num_lines_to_detect NUM_LINES_TO_DETECT]
+                               [--test_set_size TEST_SET_SIZE]
+                               [--train_size TRAIN_SIZE]
+PaddlePaddle CTR example
+optional arguments:
+  -h, --help            show this help message and exit
+  --data_path DATA_PATH
+                        path of the Avazu dataset
+  --output_dir OUTPUT_DIR
+                        directory to output
+  --num_lines_to_detect NUM_LINES_TO_DETECT
+                        number of records to detect dataset's meta info
+  --test_set_size TEST_SET_SIZE
+                        size of the validation dataset(default: 10000)
+  --train_size TRAIN_SIZE
+                        size of the trainset (default: 100000)
+```
+- `data_path`: path of the data to process
+- `output_dir`: output path for the generated data
+- `num_lines_to_detect`: how much data to scan in advance to generate the IDs, given as a number of file lines
+- `test_set_size`: number of lines in the generated test set
+- `train_size`: number of lines in the generated training set
 ## Wide & Deep Learning Model
...
@@ -201,18 +266,20 @@ trainer.train(
 ## Running Training and Testing
 Training the model takes the following steps:
-1. Download the training data; the data from the Kaggle CTR competition\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] can be used
+1. Prepare the training data
     1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data)
     2. Decompress train.gz to get train.txt
-2. Run `python train.py --train_data_path train.txt` to start training
+    3. Run `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data
+2. Run `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training
 In step 2 above, command-line arguments can be passed to `train.py` to customize the training process. The arguments and their usage are as follows
 ```
 usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
-                [--batch_size BATCH_SIZE] [--test_set_size TEST_SET_SIZE]
+                [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
                 [--num_passes NUM_PASSES]
-                [--num_lines_to_detact NUM_LINES_TO_DETACT]
+                [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
+                DATA_META_FILE --model_type MODEL_TYPE
 PaddlePaddle CTR example
...
@@ -220,16 +287,78 @@ optional arguments:
   -h, --help            show this help message and exit
   --train_data_path TRAIN_DATA_PATH
                         path of training dataset
+  --test_data_path TEST_DATA_PATH
+                        path of testing dataset
   --batch_size BATCH_SIZE
                         size of mini-batch (default:10000)
-  --test_set_size TEST_SET_SIZE
-                        size of the validation dataset(default: 10000)
   --num_passes NUM_PASSES
                         number of passes to train
-  --num_lines_to_detact NUM_LINES_TO_DETACT
-                        number of records to detect dataset's meta info
+  --model_output_prefix MODEL_OUTPUT_PREFIX
+                        prefix of path for model to store (default:
+                        ./ctr_models)
+  --data_meta_file DATA_META_FILE
+                        path of data meta info file
+  --model_type MODEL_TYPE
+                        model type, classification: 0, regression 1 (default
+                        classification)
+```
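+- `train_data_path`: path of the training set
+- `test_data_path`: path of the test set
+- `num_passes`: number of passes to train the model
+- `data_meta_file`: see the description in the "Data and Task Abstraction" section.
+- `model_type`: whether the model performs classification or regression
+## Predicting with the Trained Model
+A trained model can be used to predict on new data. The format of the prediction data is
+```
+# <dnn input ids> \t <lr input sparse values>
+1 23 190 \t 230:0.12 3421:0.9 23451:0.12
+23 231 \t 1230:0.12 13421:0.9
+```
+The only difference from the training data format is that there is no label, that is, the value of the third column `click` in the training data.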
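Since the two formats differ only in the label column, a training line can be turned into a prediction line by dropping its last field. The sketch below assumes tab-separated columns; `to_infer_line` is a hypothetical helper name, not a function from this repository.

```python
def to_infer_line(train_line):
    """Drop the trailing click column from one training line."""
    dnn_col, lr_col, _click = train_line.rstrip("\n").split("\t")
    return "%s\t%s\n" % (dnn_col, lr_col)


infer_line = to_infer_line("23 231\t1230:0.12 13421:0.9\t1\n")
print(repr(infer_line))
```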
+`infer.py` is used as follows
+```
+usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
+                --prediction_output_path PREDICTION_OUTPUT_PATH
+                [--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
+PaddlePaddle CTR example
+optional arguments:
+  -h, --help            show this help message and exit
+  --model_gz_path MODEL_GZ_PATH
+                        path of model parameters gz file
+  --data_path DATA_PATH
+                        path of the dataset to infer
+  --prediction_output_path PREDICTION_OUTPUT_PATH
+                        path to output the prediction
+  --data_meta_path DATA_META_PATH
+                        path of trainset's meta info, default is ./data.meta
+  --model_type MODEL_TYPE
+                        model type, classification: 0, regression 1 (default
+                        classification)
+```
+- `model_gz_path`: path of the `gz`-compressed model
+- `data_path`: path of the data to predict on
+- `prediction_output_path`: path where the predictions are written
+- `data_meta_path`: see the description in the "Data and Task Abstraction" section.
+- `model_type`: classification or regression
+The demo data can be predicted with the following command (`--model_type` is a required argument)
+```
+python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt --model_type 0
+```
+The final predictions are written to `predictions.txt`.
 ## References
 1. <https://en.wikipedia.org/wiki/Click-through_rate>
 2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
...
ctr/data_provider.py → ctr/avazu_data_processer.py @ 7a2d0cae
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+import os
 import sys
 import csv
+import cPickle
+import argparse
 import numpy as np
+from utils import logger, TaskMode
+parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
+parser.add_argument(
+    '--data_path',
+    type=str,
+    required=True,
+    help="path of the Avazu dataset")
+parser.add_argument(
+    '--output_dir', type=str, required=True, help="directory to output")
+parser.add_argument(
+    '--num_lines_to_detect',
+    type=int,
+    default=500000,
+    help="number of records to detect dataset's meta info")
+parser.add_argument(
+    '--test_set_size',
+    type=int,
+    default=10000,
+    help="size of the validation dataset(default: 10000)")
+parser.add_argument(
+    '--train_size',
+    type=int,
+    default=100000,
+    help="size of the trainset (default: 100000)")
+args = parser.parse_args()
 '''
 The fields of the dataset are:
...
@@ -22,7 +50,7 @@ The fields of the dataset are:
     15. device_conn_type
     16. C14-C21 -- anonymized categorical variables
-We will treat following fields as categorical features:
+We will treat the following fields as categorical features:
     - C1
     - banner_pos
...
@@ -40,6 +68,14 @@ and some other features as id features:
 The `hour` field will be treated as a continuous feature and will be transformed
 to one-hot representation which has 24 bits.
+This script will output 3 files:
+1. train.txt
+2. test.txt
+3. infer.txt
+all the files are for demo.
 '''
 feature_dims = {}
...
@@ -161,6 +197,7 @@ def detect_dataset(path, topn, id_fea_space=10000):
     NOTE the records should be randomly shuffled first.
     '''
     # create categorical statis objects.
+    logger.warning('detecting dataset')
     with open(path, 'rb') as csvfile:
         reader = csv.DictReader(csvfile)
...
@@ -174,9 +211,6 @@ def detect_dataset(path, topn, id_fea_space=10000):
     for key, item in fields.items():
         feature_dims[key] = item.size()
-    #for key in id_features:
-    #feature_dims[key] = id_fea_space
     feature_dims['hour'] = 24
     feature_dims['click'] = 1
...
@@ -184,10 +218,17 @@ def detect_dataset(path, topn, id_fea_space=10000):
         feature_dims[key] for key in categorial_features + ['hour']) + 1
     feature_dims['lr_input'] = np.sum(feature_dims[key]
                                       for key in id_features) + 1
     return feature_dims
+def load_data_meta(meta_path):
+    '''
+    Load dataset's meta information.
+    '''
+    feature_dims, fields = cPickle.load(open(meta_path, 'rb'))
+    return feature_dims, fields
 def concat_sparse_vectors(inputs, dims):
     '''
     Concatenate more than one sparse vector into one.
...
@@ -211,67 +252,162 @@ class AvazuDataset(object):
     '''
     Load AVAZU dataset as train set.
     '''
-    TRAIN_MODE = 0
-    TEST_MODE = 1
-    def __init__(self, train_path, n_records_as_test=-1):
+    def __init__(self,
+                 train_path,
+                 n_records_as_test=-1,
+                 fields=None,
+                 feature_dims=None):
         self.train_path = train_path
         self.n_records_as_test = n_records_as_test
-        # task model: 0 train, 1 test
-        self.mode = 0
-    def train(self):
-        self.mode = self.TRAIN_MODE
-        return self._parse(
-            self.train_path, skip_n_lines=self.n_records_as_test)
-    def test(self):
-        self.mode = self.TEST_MODE
-        return self._parse(self.train_path, top_n_lines=self.n_records_as_test)
-    def _parse(self, path, skip_n_lines=-1, top_n_lines=-1):
-        with open(path, 'rb') as csvfile:
-            reader = csv.DictReader(csvfile)
-            categorial_dims = [
-                feature_dims[key] for key in categorial_features + ['hour']
-            ]
-            id_dims = [feature_dims[key] for key in id_features]
+        self.fields = fields
+        # default is train mode.
+        self.mode = TaskMode.create_train()
+        self.categorial_dims = [
+            feature_dims[key] for key in categorial_features + ['hour']
+        ]
+        self.id_dims = [feature_dims[key] for key in id_features]
+    def train(self):
+        '''
+        Load trainset.
+        '''
+        logger.info("load trainset from %s" % self.train_path)
+        self.mode = TaskMode.create_train()
+        with open(self.train_path) as f:
+            reader = csv.DictReader(f)
             for row_id, row in enumerate(reader):
-                if skip_n_lines > 0 and row_id < skip_n_lines:
+                # skip top n lines
+                if self.n_records_as_test > 0 and row_id < self.n_records_as_test:
                     continue
-                if top_n_lines > 0 and row_id > top_n_lines:
-                    break
-                record = []
-                for key in categorial_features:
-                    record.append(fields[key].gen(row[key]))
-                record.append([int(row['hour'][-2:])])
-                dense_input = concat_sparse_vectors(record, categorial_dims)
-                record = []
-                for key in id_features:
-                    if 'cross' not in key:
-                        record.append(fields[key].gen(row[key]))
-                    else:
-                        fea0 = fields[key].cross_fea0
-                        fea1 = fields[key].cross_fea1
-                        record.append(
-                            fields[key].gen_cross_fea(row[fea0], row[fea1]))
-                sparse_input = concat_sparse_vectors(record, id_dims)
-                record = [dense_input, sparse_input]
-                record.append(list((int(row['click']), )))
-                yield record
-if __name__ == '__main__':
-    path = 'train.txt'
-    print detect_dataset(path, 400000)
-    filereader = AvazuDataset(path)
-    for no, rcd in enumerate(filereader.train()):
-        print no, rcd
-        if no > 1000:
-            break
+                rcd = self._parse_record(row)
+                if rcd:
+                    yield rcd
+    def test(self):
+        '''
+        Load testset.
+        '''
+        logger.info("load testset from %s" % self.train_path)
+        self.mode = TaskMode.create_test()
+        with open(self.train_path) as f:
+            reader = csv.DictReader(f)
+            for row_id, row in enumerate(reader):
+                # keep only the top n lines
+                if self.n_records_as_test > 0 and row_id > self.n_records_as_test:
+                    break
+                rcd = self._parse_record(row)
+                if rcd:
+                    yield rcd
+    def infer(self):
+        '''
+        Load inferset.
+        '''
+        logger.info("load inferset from %s" % self.train_path)
+        self.mode = TaskMode.create_infer()
+        with open(self.train_path) as f:
+            reader = csv.DictReader(f)
+            for row_id, row in enumerate(reader):
+                rcd = self._parse_record(row)
+                if rcd:
+                    yield rcd
+    def _parse_record(self, row):
+        '''
+        Parse a CSV row and get a record.
+        '''
+        record = []
+        for key in categorial_features:
+            record.append(self.fields[key].gen(row[key]))
+        record.append([int(row['hour'][-2:])])
+        dense_input = concat_sparse_vectors(record, self.categorial_dims)
+        record = []
+        for key in id_features:
+            if 'cross' not in key:
+                record.append(self.fields[key].gen(row[key]))
+            else:
+                fea0 = self.fields[key].cross_fea0
+                fea1 = self.fields[key].cross_fea1
+                record.append(
+                    self.fields[key].gen_cross_fea(row[fea0], row[fea1]))
+        sparse_input = concat_sparse_vectors(record, self.id_dims)
+        record = [dense_input, sparse_input]
+        if not self.mode.is_infer():
+            record.append(list((int(row['click']), )))
+        return record
+def ids2dense(vec, dim):
+    return vec
+def ids2sparse(vec):
+    return ["%d:1" % x for x in vec]
+detect_dataset(args.data_path, args.num_lines_to_detect)
+dataset = AvazuDataset(
+    args.data_path,
+    args.test_set_size,
+    fields=fields,
+    feature_dims=feature_dims)
+output_trainset_path = os.path.join(args.output_dir, 'train.txt')
+output_testset_path = os.path.join(args.output_dir, 'test.txt')
+output_infer_path = os.path.join(args.output_dir, 'infer.txt')
+output_meta_path = os.path.join(args.output_dir, 'data.meta.txt')
+with open(output_trainset_path, 'w') as f:
+    for id, record in enumerate(dataset.train()):
+        if id and id % 10000 == 0:
+            logger.info("load %d records" % id)
+        if id > args.train_size:
+            break
+        dnn_input, lr_input, click = record
+        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
+        lr_input = ids2sparse(lr_input)
+        line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
+                                 ' '.join(map(str, lr_input)), click[0])
+        f.write(line)
+    logger.info('write to %s' % output_trainset_path)
+with open(output_testset_path, 'w') as f:
+    for id, record in enumerate(dataset.test()):
+        dnn_input, lr_input, click = record
+        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
+        lr_input = ids2sparse(lr_input)
+        line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
+                                 ' '.join(map(str, lr_input)), click[0])
+        f.write(line)
+    logger.info('write to %s' % output_testset_path)
+with open(output_infer_path, 'w') as f:
+    for id, record in enumerate(dataset.infer()):
+        dnn_input, lr_input = record
+        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
+        lr_input = ids2sparse(lr_input)
+        line = "%s\t%s\n" % (' '.join(map(str, dnn_input)),
+                             ' '.join(map(str, lr_input)), )
+        f.write(line)
+        if id > args.test_set_size:
+            break
+    logger.info('write to %s' % output_infer_path)
+with open(output_meta_path, 'w') as f:
+    lines = [
+        "dnn_input_dim: %d" % feature_dims['dnn_input'],
+        "lr_input_dim: %d" % feature_dims['lr_input']
+    ]
+    f.write('\n'.join(lines))
+    logger.info('write data meta into %s' % output_meta_path)
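The body of `concat_sparse_vectors(inputs, dims)` is not shown in this diff, so the sketch below is only one plausible reading of its docstring: each field's IDs are shifted by the total size of the fields before it, so that all fields share a single ID space. The `_sketch` suffix marks the function as hypothetical, and the input values are made up. `ids2sparse` is copied from the script to show how the merged IDs become the `ID:VALUE` column.

```python
def concat_sparse_vectors_sketch(inputs, dims):
    """Merge per-field ID lists into one list over a shared ID space.

    Assumption: field k occupies the range [offset_k, offset_k + dims[k]),
    where offset_k is the sum of the dims of all earlier fields.
    """
    merged = []
    offset = 0
    for ids, dim in zip(inputs, dims):
        for i in ids:
            if not 0 <= i < dim:
                raise ValueError("id %d out of range for dim %d" % (i, dim))
            merged.append(offset + i)
        offset += dim
    return merged


def ids2sparse(vec):
    # Same one-liner as in the script: every active ID gets the value 1.
    return ["%d:1" % x for x in vec]


# Three hypothetical fields of sizes 3, 4 and 24 (the last one like `hour`).
merged = concat_sparse_vectors_sketch([[1], [0, 2], [23]], [3, 4, 24])
lr_column = " ".join(ids2sparse(merged))
print(merged, lr_column)
```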
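ctr/dataset.md @ 7a2d0cae
 # Data and Processing
 ## Dataset Introduction
+This tutorial demonstrates how to preprocess the dataset of the Kaggle CTR task\[[3](#参考文献)\] into the format this model requires. For the detailed data format, see [README.md](./README.md).
+The strength of the Wide & Deep Model\[[2](#参考文献)\] is that it combines dense features with large-scale sparse features,
+so feature processing also handles the two kinds of features separately:
+the dense values for the Deep part are all converted into ID features,
+which embedding then turns into dense vector inputs; the Wide part mainly raises the dimensionality through cross products of IDs.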
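One way to picture the ID cross product used for the Wide part is to join two raw categorical values and hash the pair into a fixed ID space. This is an illustrative assumption, not the method in `avazu_data_processer.py`, whose `gen_cross_fea` body is not shown in this diff. The function name, the sample values, and the use of `zlib.crc32` are all hypothetical; the default space of 10000 mirrors the `id_fea_space=10000` default visible in `detect_dataset`.

```python
import zlib


def cross_feature_id(value_a, value_b, space=10000):
    """Map a pair of raw categorical values to one ID in [0, space).

    crc32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings.
    """
    key = ("%s_x_%s" % (value_a, value_b)).encode("utf-8")
    return zlib.crc32(key) % space


# Hypothetical raw values for two categorical fields.
crossed_id = cross_feature_id("site_85f7", "app_ecad")
print(crossed_id)
```

Crossing two fields with m and n distinct values yields up to m * n combinations, which is how the Wide input grows in dimension; hashing caps that growth at `space` at the cost of occasional collisions.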
 The dataset is stored in `csv` format, and its fields are as follows:
 - `id`: ad identifier
...
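ctr/index.html @ 7a2d0cae
...
@@ -42,15 +42,30 @@
 <div id="markdown" style='display:none'>
 # Click-Through Rate Prediction
+The files in this example directory and their descriptions are as follows:
+```
+├── README.md               # this tutorial, in markdown
+├── dataset.md              # dataset processing tutorial
+├── images                  # images used in this tutorial
+│   ├── lr_vs_dnn.jpg
+│   └── wide_deep.png
+├── infer.py                # inference script
+├── network_conf.py         # model network configuration
+├── reader.py               # data reader
+├── train.py                # training script
+├── utils.py                # helper functions
+└── avazu_data_processer.py # preprocessing script for the demo data
+```
 ## Background
-CTR (Click-Through Rate) prediction\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\] expresses the probability that a user clicks a particular link,
-and is commonly used to measure the effectiveness of an online advertising system.
+CTR (Click-Through Rate) prediction\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\]
+estimates the probability that a user clicks a particular link, and is an important step in ad serving. Accurate CTR prediction matters greatly for maximizing the revenue of an online advertising system.
-When there are multiple ad slots, the predicted CTR is generally used as the basis for ranking.
-For example, in a search engine's advertising system, when a user enters a query with commercial value, the system roughly performs the following steps to display ads:
+When there are multiple ad slots, the predicted CTR is generally used as the basis for ranking. For example, in a search engine's advertising system, when a user enters a query with commercial value, the system roughly performs the following steps to display ads:
-1. Recall the set of ads that match the query
+1. Retrieve the set of ads relevant to the user's query
 2. Filter by business rules and relevance
 3. Rank according to the auction mechanism and CTR
 4. Display the ads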
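...
@@ -78,13 +93,11 @@ Figure 1. Comparison of the LR and DNN model structures
 </p>
 The part of LR marked by blue arrows maps directly onto the corresponding structure in the DNN, which shows that LR and DNN have something in common (for example, weighted summation),
-but for the same input dimension the former's model complexity can be much lower than the latter's (in a sense, the more complex a model is, the more potential it has to learn complex information).
+but for the same input dimension the former's model complexity can be much lower than the latter's (in a sense, the more complex a model is, the more potential it has to learn complex information);
 for LR to match the learning capacity of a DNN, the input dimension must grow, that is, more features must be added,
 which is why LR is invariably tied to large-scale feature engineering.
-LR's advantage over DNN models is its capacity for large-scale sparse features; for both memory and computation, industry has very mature optimization methods.
+LR's advantage over DNN models is its capacity for large-scale sparse features; for both memory and computation, industry has very mature optimization methods;
 DNN models, in turn, can learn new features on their own, which improves the efficiency of feature use to some extent,
 so that with features of the same scale a DNN model is more likely to achieve better results.
...
@@ -101,10 +114,62 @@ LR's advantage over DNN models is its capacity for large-scale sparse features, including
 We use the first approach directly, as a classification task.
-We use the dataset of the Kaggle `Click-through rate prediction` task\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model.
+We use the dataset of the Kaggle `Click-through rate prediction` task\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model in this example.
+For the feature processing details, see [data process](./dataset.md).
+The input format of the demo model in this tutorial is as follows:
+```
+# <dnn input ids> \t <lr input sparse values> \t click
+1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
+23 231 \t 1230:0.12 13421:0.9 \t 1
+```
+The format is described in detail below:
+- `dnn input ids` uses a one-hot representation; only the IDs whose value is 1 need to be listed (note that this is not a variable-length input)
+- `lr input sparse values` uses the `ID:VALUE` representation; the values should preferably be normalized to the range `[-1, 1]`.
+In addition, training requires a file that describes the input dimensions of the two submodels, dnn and lr. The file format is as follows:
-For the feature processing details, see [data process](./dataset.md)
+```
+dnn_input_dim: <int>
+lr_input_dim: <int>
+```
+where `<int>` stands for an integer value.
+The `avazu_data_processer.py` script in this directory processes the downloaded demo dataset\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\]. Its usage is as follows:
+```
+usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
+                               OUTPUT_DIR
+                               [--num_lines_to_detect NUM_LINES_TO_DETECT]
+                               [--test_set_size TEST_SET_SIZE]
+                               [--train_size TRAIN_SIZE]
+PaddlePaddle CTR example
+optional arguments:
+  -h, --help            show this help message and exit
+  --data_path DATA_PATH
+                        path of the Avazu dataset
+  --output_dir OUTPUT_DIR
+                        directory to output
+  --num_lines_to_detect NUM_LINES_TO_DETECT
+                        number of records to detect dataset's meta info
+  --test_set_size TEST_SET_SIZE
+                        size of the validation dataset(default: 10000)
+  --train_size TRAIN_SIZE
+                        size of the trainset (default: 100000)
+```
+- `data_path`: path of the data to process
+- `output_dir`: output path for the generated data
+- `num_lines_to_detect`: how much data to scan in advance to generate the IDs, given as a number of file lines
+- `test_set_size`: number of lines in the generated test set
+- `train_size`: number of lines in the generated training set
 ## Wide & Deep Learning Model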
...
@@ -243,18 +308,20 @@ trainer.train(
 ## Running Training and Testing
 Training the model takes the following steps:
-1. Download the training data; the data from the Kaggle CTR competition\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] can be used
+1. Prepare the training data
     1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data)
     2. Decompress train.gz to get train.txt
-2. Run `python train.py --train_data_path train.txt` to start training
+    3. Run `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data
+2. Run `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training
 In step 2 above, command-line arguments can be passed to `train.py` to customize the training process. The arguments and their usage are as follows
 ```
 usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
-                [--batch_size BATCH_SIZE] [--test_set_size TEST_SET_SIZE]
+                [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
                 [--num_passes NUM_PASSES]
-                [--num_lines_to_detact NUM_LINES_TO_DETACT]
+                [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
+                DATA_META_FILE --model_type MODEL_TYPE
 PaddlePaddle CTR example
...
@@ -262,16 +329,78 @@ optional arguments:
   -h, --help            show this help message and exit
   --train_data_path TRAIN_DATA_PATH
                         path of training dataset
+  --test_data_path TEST_DATA_PATH
+                        path of testing dataset
   --batch_size BATCH_SIZE
                         size of mini-batch (default:10000)
-  --test_set_size TEST_SET_SIZE
-                        size of the validation dataset(default: 10000)
   --num_passes NUM_PASSES
                         number of passes to train
-  --num_lines_to_detact NUM_LINES_TO_DETACT
-                        number of records to detect dataset's meta info
+  --model_output_prefix MODEL_OUTPUT_PREFIX
+                        prefix of path for model to store (default:
+                        ./ctr_models)
+  --data_meta_file DATA_META_FILE
+                        path of data meta info file
+  --model_type MODEL_TYPE
+                        model type, classification: 0, regression 1 (default
+                        classification)
+```
+- `train_data_path`: path of the training set
+- `test_data_path`: path of the test set
+- `num_passes`: number of passes to train the model
+- `data_meta_file`: see the description in the "Data and Task Abstraction" section.
+- `model_type`: whether the model performs classification or regression
+## Predicting with the Trained Model
+A trained model can be used to predict on new data. The format of the prediction data is
+```
+# <dnn input ids> \t <lr input sparse values>
+1 23 190 \t 230:0.12 3421:0.9 23451:0.12
+23 231 \t 1230:0.12 13421:0.9
+```
+The only difference from the training data format is that there is no label, that is, the value of the third column `click` in the training data.
+`infer.py` is used as follows
+```
+usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
+                --prediction_output_path PREDICTION_OUTPUT_PATH
+                [--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
+PaddlePaddle CTR example
+optional arguments:
+  -h, --help            show this help message and exit
+  --model_gz_path MODEL_GZ_PATH
+                        path of model parameters gz file
+  --data_path DATA_PATH
+                        path of the dataset to infer
+  --prediction_output_path PREDICTION_OUTPUT_PATH
+                        path to output the prediction
+  --data_meta_path DATA_META_PATH
+                        path of trainset's meta info, default is ./data.meta
+  --model_type MODEL_TYPE
+                        model type, classification: 0, regression 1 (default
+                        classification)
+```
+- `model_gz_path`: path of the `gz`-compressed model
+- `data_path`: path of the data to predict on
+- `prediction_output_path`: path where the predictions are written
+- `data_meta_path`: see the description in the "Data and Task Abstraction" section.
+- `model_type`: classification or regression
+The demo data can be predicted with the following command (`--model_type` is a required argument)
+```
+python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt --model_type 0
+```
+The final predictions are written to `predictions.txt`.
 ## References
 1. <https://en.wikipedia.org/wiki/Click-through_rate>
 2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
...
ctr/infer.py 0 → 100644 @ 7a2d0cae
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+import gzip
+import argparse
+import itertools
+import paddle.v2 as paddle
+import network_conf
+from train import dnn_layer_dims
+import reader
+from utils import logger, ModelType
+parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
+parser.add_argument(
+    '--model_gz_path',
+    type=str,
+    required=True,
+    help="path of model parameters gz file")
+parser.add_argument(
+    '--data_path', type=str, required=True, help="path of the dataset to infer")
+parser.add_argument(
+    '--prediction_output_path',
+    type=str,
+    required=True,
+    help="path to output the prediction")
+parser.add_argument(
+    '--data_meta_path',
+    type=str,
+    default="./data.meta",
+    help="path of trainset's meta info, default is ./data.meta")
+parser.add_argument(
+    '--model_type',
+    type=int,
+    required=True,
+    default=ModelType.CLASSIFICATION,
+    help='model type, classification: %d, regression %d (default classification)'
+    % (ModelType.CLASSIFICATION, ModelType.REGRESSION))
+args = parser.parse_args()
+paddle.init(use_gpu=False, trainer_count=1)
+class CTRInferer(object):
+    def __init__(self, param_path):
+        logger.info("create CTR model")
+        dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_path)
+        # create the model
+        self.ctr_model = network_conf.CTRmodel(
+            dnn_layer_dims,
+            dnn_input_dim,
+            lr_input_dim,
+            model_type=ModelType(args.model_type),
+            is_infer=True)
+        # load parameter
+        logger.info("load model parameters from %s" % param_path)
+        self.parameters = paddle.parameters.Parameters.from_tar(
+            gzip.open(param_path, 'r'))
+        self.inferer = paddle.inference.Inference(
+            output_layer=self.ctr_model.model,
+            parameters=self.parameters, )
+    def infer(self, data_path):
+        logger.info("infer data...")
+        dataset = reader.Dataset()
+        infer_reader = paddle.batch(
+            dataset.infer(args.data_path), batch_size=1000)
+        logger.warning('write predictions to %s' % args.prediction_output_path)
+        output_f = open(args.prediction_output_path, 'w')
+        for id, batch in enumerate(infer_reader()):
+            res = self.inferer.infer(input=batch)
+            predictions = [x for x in itertools.chain.from_iterable(res)]
+            assert len(batch) == len(
+                predictions), "predict error, %d inputs, but %d predictions" % (
+                    len(batch), len(predictions))
+            output_f.write('\n'.join(map(str, predictions)) + '\n')
+if __name__ == '__main__':
+    ctr_inferer = CTRInferer(args.model_gz_path)
+    ctr_inferer.infer(args.data_path)
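The flattening step in `CTRInferer.infer` can be tried without PaddlePaddle. The sketch below assumes the inference call returns one single-element row per input record; a plain list of lists stands in for that result here, and the probabilities are made up.

```python
import itertools


def flatten_predictions(batch_result):
    """Turn [[p0], [p1], ...] into [p0, p1, ...], one value per input."""
    return list(itertools.chain.from_iterable(batch_result))


# Stand-in for the per-batch inference output (made-up values).
batch_result = [[0.12], [0.87], [0.5]]
predictions = flatten_predictions(batch_result)
# Same serialization as in infer.py: one prediction per line.
text = "\n".join(map(str, predictions)) + "\n"
print(text)
```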
ctr/network_conf.py
0 → 100644
浏览文件 @
7a2d0cae
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import paddle.v2 as paddle
from paddle.v2 import layer
from paddle.v2 import data_type as dtype
from utils import logger, ModelType


class CTRmodel(object):
    '''
    A CTR model which implements wide && deep learning model.
    '''

    def __init__(self,
                 dnn_layer_dims,
                 dnn_input_dim,
                 lr_input_dim,
                 model_type=ModelType.create_classification(),
                 is_infer=False):
        '''
        @dnn_layer_dims: list of integer
            dims of each layer in dnn
        @dnn_input_dim: int
            size of dnn's input layer
        @lr_input_dim: int
            size of lr's input layer
        @is_infer: bool
            whether to build a infer model
        '''
        self.dnn_layer_dims = dnn_layer_dims
        self.dnn_input_dim = dnn_input_dim
        self.lr_input_dim = lr_input_dim
        self.model_type = model_type
        self.is_infer = is_infer

        self._declare_input_layers()

        self.dnn = self._build_dnn_submodel_(self.dnn_layer_dims)
        self.lr = self._build_lr_submodel_()

        # model's prediction
        # TODO(superjom) rename it to prediction
        if self.model_type.is_classification():
            self.model = self._build_classification_model(self.dnn, self.lr)
        if self.model_type.is_regression():
            self.model = self._build_regression_model(self.dnn, self.lr)

    def _declare_input_layers(self):
        self.dnn_merged_input = layer.data(
            name='dnn_input',
            type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))

        self.lr_merged_input = layer.data(
            name='lr_input',
            type=paddle.data_type.sparse_vector(self.lr_input_dim))

        if not self.is_infer:
            self.click = paddle.layer.data(
                name='click', type=dtype.dense_vector(1))

    def _build_dnn_submodel_(self, dnn_layer_dims):
        '''
        build DNN submodel.
        '''
        dnn_embedding = layer.fc(
            input=self.dnn_merged_input, size=dnn_layer_dims[0])
        _input_layer = dnn_embedding
        for i, dim in enumerate(dnn_layer_dims[1:]):
            fc = layer.fc(
                input=_input_layer,
                size=dim,
                act=paddle.activation.Relu(),
                name='dnn-fc-%d' % i)
            _input_layer = fc
        return _input_layer

    def _build_lr_submodel_(self):
        '''
        config LR submodel
        '''
        fc = layer.fc(
            input=self.lr_merged_input, size=1, act=paddle.activation.Relu())
        return fc

    def _build_classification_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        self.output = layer.fc(
            input=merge_layer,
            size=1,
            # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
            act=paddle.activation.Sigmoid())

        if not self.is_infer:
            self.train_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
                input=self.output, label=self.click)
        return self.output

    def _build_regression_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        self.output = layer.fc(
            input=merge_layer, size=1, act=paddle.activation.Sigmoid())
        if not self.is_infer:
            self.train_cost = paddle.layer.mse_cost(
                input=self.output, label=self.click)
        return self.output
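To make the wiring of `CTRmodel` concrete, the following is a minimal plain-Python sketch of the forward pass: a DNN tower of stacked fully connected layers, an LR tower with a single output, a concatenation, and a final sigmoid unit. It is not PaddlePaddle code. The helper names (`fc`, `wide_deep_forward`, `binary_cross_entropy`), the dense list inputs, the omitted biases, and the toy weights are all assumptions made for illustration. The `tanh` on the first DNN layer assumes that this is the default activation when `act` is not passed to `layer.fc`, and the loss assumes the classification cost reduces to ordinary binary cross-entropy for a single label.

```python
import math


def fc(inputs, weights, act):
    """Fully connected layer without bias: one output per weight row."""
    return [act(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def wide_deep_forward(dnn_x, lr_x, dnn_weights, lr_weights, out_weights):
    """Hypothetical forward pass mirroring the layer wiring of CTRmodel."""
    # First DNN layer: built without an explicit activation (assumed tanh).
    hidden = fc(dnn_x, dnn_weights[0], math.tanh)
    # Remaining DNN layers use ReLU, like the 'dnn-fc-%d' layers.
    for w in dnn_weights[1:]:
        hidden = fc(hidden, w, relu)
    # LR tower: a single fully connected unit over the wide features.
    wide = fc(lr_x, lr_weights, relu)
    # Concatenate both towers and squash to a value in (0, 1).
    merged = hidden + wide
    return fc(merged, out_weights, sigmoid)[0]


def binary_cross_entropy(p, click):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    return -(click * math.log(p) + (1 - click) * math.log(1.0 - p))


# Toy example: 3 DNN features -> 2 hidden -> 1 hidden, 2 LR features.
dnn_weights = [[[0.2, -0.1, 0.4], [0.3, 0.3, -0.2]], [[0.5, -0.6]]]
lr_weights = [[0.7, 0.1]]
out_weights = [[1.0, 1.0]]
ctr = wide_deep_forward([1.0, 0.0, 1.0], [1.0, 0.5], dnn_weights, lr_weights,
                        out_weights)
print("predicted ctr: %f, loss if clicked: %f" %
      (ctr, binary_cross_entropy(ctr, 1)))
```

With all weights set to zero the sketch predicts exactly 0.5, which is a quick sanity check that the two towers are only combined through the final sigmoid unit.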
ctr/reader.py  0 → 100644  View file @ 7a2d0cae
from utils import logger, TaskMode, load_dnn_input_record, load_lr_input_record

feeding_index = {'dnn_input': 0, 'lr_input': 1, 'click': 2}


class Dataset(object):
    def __init__(self):
        self.mode = TaskMode.create_train()

    def train(self, path):
        '''
        Load trainset.
        '''
        logger.info("load trainset from %s" % path)
        self.mode = TaskMode.create_train()
        self.path = path
        return self._parse

    def test(self, path):
        '''
        Load testset.
        '''
        logger.info("load testset from %s" % path)
        self.path = path
        self.mode = TaskMode.create_test()
        return self._parse

    def infer(self, path):
        '''
        Load infer set.
        '''
        logger.info("load inferset from %s" % path)
        self.path = path
        self.mode = TaskMode.create_infer()
        return self._parse

    def _parse(self):
        '''
        Parse dataset.
        '''
        with open(self.path) as f:
            for line_id, line in enumerate(f):
                fs = line.strip().split('\t')
                dnn_input = load_dnn_input_record(fs[0])
                lr_input = load_lr_input_record(fs[1])
                if not self.mode.is_infer():
                    click = [int(fs[2])]
                    yield dnn_input, lr_input, click
                else:
                    yield dnn_input, lr_input


def load_data_meta(path):
    '''
    load data meta info from path, return (dnn_input_dim, lr_input_dim)
    '''
    with open(path) as f:
        lines = f.read().split('\n')
        err_info = "wrong meta format"
        assert len(lines) == 2, err_info
        assert 'dnn_input_dim:' in lines[0] and 'lr_input_dim:' in lines[1], err_info
        res = map(int, [_.split(':')[1] for _ in lines])
        logger.info('dnn input dim: %d' % res[0])
        logger.info('lr input dim: %d' % res[1])
        return res
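The on-disk formats that `_parse` and `load_data_meta` expect can be read off the code: each data line holds three tab-separated fields (space-separated feature ids, space-separated `id:value` pairs, and the click label), and the meta file holds a `dnn_input_dim:` line followed by an `lr_input_dim:` line. The sketch below parses one invented record and one invented meta text. `parse_record` and `parse_meta` are hypothetical stand-ins, not functions from the commit, and the sample values are made up. Two deliberate differences from the original are assumptions of this sketch: it uses list comprehensions instead of `map`, so the results stay indexable under Python 3 as well, and it strips the meta text first, since splitting a file that ends with a newline on `'\n'` yields a third, empty element and would trip the `len(lines) == 2` assertion.

```python
def parse_record(line, is_infer=False):
    """Split one tab-separated record into (dnn_input, lr_input[, click])."""
    fs = line.strip().split('\t')
    # Field 0: space-separated feature ids for the DNN (sparse binary) input.
    dnn_input = [int(tok) for tok in fs[0].split()]
    # Field 1: space-separated "id:value" pairs for the LR (sparse float) input.
    lr_input = []
    for tok in fs[1].split():
        idx, val = tok.split(':')
        lr_input.append((int(idx), float(val)))
    if is_infer:
        return dnn_input, lr_input
    # Field 2: the click label, wrapped in a list as a 1-d dense vector.
    return dnn_input, lr_input, [int(fs[2])]


def parse_meta(text):
    """Parse the two-line meta text into [dnn_input_dim, lr_input_dim]."""
    lines = text.strip().split('\n')
    assert len(lines) == 2, "wrong meta format"
    assert lines[0].startswith('dnn_input_dim:'), "wrong meta format"
    assert lines[1].startswith('lr_input_dim:'), "wrong meta format"
    return [int(entry.split(':')[1]) for entry in lines]


# Invented sample inputs, for illustration only.
record = parse_record("1 5 9\t3:1.0 7:0.5\t1\n")
dims = parse_meta("dnn_input_dim:61\nlr_input_dim:10040001\n")
print(record)
print(dims)
```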
ctr/train.py  View file @ 7a2d0cae
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
+import os
 import argparse
-import logging
-import paddle.v2 as paddle
-from paddle.v2 import layer
-from paddle.v2 import data_type as dtype
-from data_provider import field_index, detect_dataset, AvazuDataset
-
-parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
-parser.add_argument(
-    '--train_data_path',
-    type=str,
-    required=True,
-    help="path of training dataset")
-parser.add_argument(
-    '--batch_size',
-    type=int,
-    default=10000,
-    help="size of mini-batch (default:10000)")
-parser.add_argument(
-    '--test_set_size',
-    type=int,
-    default=10000,
-    help="size of the validation dataset(default: 10000)")
-parser.add_argument(
-    '--num_passes', type=int, default=10, help="number of passes to train")
-parser.add_argument(
-    '--num_lines_to_detact',
-    type=int,
-    default=500000,
-    help="number of records to detect dataset's meta info")
-args = parser.parse_args()
-
-dnn_layer_dims = [128, 64, 32, 1]
-data_meta_info = detect_dataset(args.train_data_path, args.num_lines_to_detact)
-
-logging.warning('detect categorical fields in dataset %s' %
-                args.train_data_path)
-for key, item in data_meta_info.items():
-    logging.warning(' - {}\t{}'.format(key, item))
-
-paddle.init(use_gpu=False, trainer_count=1)
-
-# ==============================================================================
-# input layers
-# ==============================================================================
-dnn_merged_input = layer.data(
-    name='dnn_input',
-    type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))
-
-lr_merged_input = layer.data(
-    name='lr_input',
-    type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))
-
-click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
-
-
-# ==============================================================================
-# network structure
-# ==============================================================================
-def build_dnn_submodel(dnn_layer_dims):
-    dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
-    _input_layer = dnn_embedding
-    for i, dim in enumerate(dnn_layer_dims[1:]):
-        fc = layer.fc(
-            input=_input_layer,
-            size=dim,
-            act=paddle.activation.Relu(),
-            name='dnn-fc-%d' % i)
-        _input_layer = fc
-    return _input_layer
-
-
-# config LR submodel
-def build_lr_submodel():
-    fc = layer.fc(
-        input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
-    return fc
-
-
-# conbine DNN and LR submodels
-def combine_submodels(dnn, lr):
-    merge_layer = layer.concat(input=[dnn, lr])
-    fc = layer.fc(
-        input=merge_layer,
-        size=1,
-        name='output',
-        # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
-        act=paddle.activation.Sigmoid())
-    return fc
-
-
-dnn = build_dnn_submodel(dnn_layer_dims)
-lr = build_lr_submodel()
-output = combine_submodels(dnn, lr)
+import gzip
+
+import reader
+import paddle.v2 as paddle
+from utils import logger, ModelType
+from network_conf import CTRmodel
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
+    parser.add_argument(
+        '--train_data_path',
+        type=str,
+        required=True,
+        help="path of training dataset")
+    parser.add_argument(
+        '--test_data_path', type=str, help='path of testing dataset')
+    parser.add_argument(
+        '--batch_size',
+        type=int,
+        default=10000,
+        help="size of mini-batch (default:10000)")
+    parser.add_argument(
+        '--num_passes', type=int, default=10, help="number of passes to train")
+    parser.add_argument(
+        '--model_output_prefix',
+        type=str,
+        default='./ctr_models',
+        help='prefix of path for model to store (default: ./ctr_models)')
+    parser.add_argument(
+        '--data_meta_file',
+        type=str,
+        required=True,
+        help='path of data meta info file', )
+    parser.add_argument(
+        '--model_type',
+        type=int,
+        required=True,
+        default=ModelType.CLASSIFICATION,
+        help='model type, classification: %d, regression %d (default classification)'
+        % (ModelType.CLASSIFICATION, ModelType.REGRESSION))
+    return parser.parse_args()
+
+
+dnn_layer_dims = [128, 64, 32, 1]

 # ==============================================================================
 # cost and train period
 # ==============================================================================
-classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
-    input=output, label=click)
-
-params = paddle.parameters.create(classification_cost)
-
-optimizer = paddle.optimizer.Momentum(momentum=0.01)
-
-trainer = paddle.trainer.SGD(
-    cost=classification_cost, parameters=params, update_equation=optimizer)
-
-dataset = AvazuDataset(
-    args.train_data_path, n_records_as_test=args.test_set_size)
-
-
-def event_handler(event):
-    if isinstance(event, paddle.event.EndIteration):
-        num_samples = event.batch_id * args.batch_size
-        if event.batch_id % 100 == 0:
-            logging.warning("Pass %d, Samples %d, Cost %f" %
-                            (event.pass_id, num_samples, event.cost))
-
-        if event.batch_id % 1000 == 0:
-            result = trainer.test(
-                reader=paddle.batch(dataset.test, batch_size=args.batch_size),
-                feeding=field_index)
-            logging.warning("Test %d-%d, Cost %f" %
-                            (event.pass_id, event.batch_id, result.cost))
-
-
-trainer.train(
-    reader=paddle.batch(
-        paddle.reader.shuffle(dataset.train, buf_size=500),
-        batch_size=args.batch_size),
-    feeding=field_index,
-    event_handler=event_handler,
-    num_passes=args.num_passes)
+
+
+def train():
+    args = parse_args()
+    args.model_type = ModelType(args.model_type)
+    paddle.init(use_gpu=False, trainer_count=1)
+    dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_file)
+
+    # create ctr model.
+    model = CTRmodel(
+        dnn_layer_dims,
+        dnn_input_dim,
+        lr_input_dim,
+        model_type=args.model_type,
+        is_infer=False)
+
+    params = paddle.parameters.create(model.train_cost)
+    optimizer = paddle.optimizer.AdaGrad()
+
+    trainer = paddle.trainer.SGD(
+        cost=model.train_cost, parameters=params, update_equation=optimizer)
+
+    dataset = reader.Dataset()
+
+    def __event_handler__(event):
+        if isinstance(event, paddle.event.EndIteration):
+            num_samples = event.batch_id * args.batch_size
+            if event.batch_id % 100 == 0:
+                logger.warning("Pass %d, Samples %d, Cost %f, %s" % (
+                    event.pass_id, num_samples, event.cost, event.metrics))
+
+            if event.batch_id % 1000 == 0:
+                if args.test_data_path:
+                    result = trainer.test(
+                        reader=paddle.batch(
+                            dataset.test(args.test_data_path),
+                            batch_size=args.batch_size),
+                        feeding=reader.feeding_index)
+                    logger.warning("Test %d-%d, Cost %f, %s" %
+                                   (event.pass_id, event.batch_id, result.cost,
+                                    result.metrics))
+
+                    path = "{}-pass-{}-batch-{}-test-{}.tar.gz".format(
+                        args.model_output_prefix, event.pass_id, event.batch_id,
+                        result.cost)
+                    with gzip.open(path, 'w') as f:
+                        params.to_tar(f)
+
+    trainer.train(
+        reader=paddle.batch(
+            paddle.reader.shuffle(
+                dataset.train(args.train_data_path), buf_size=500),
+            batch_size=args.batch_size),
+        feeding=reader.feeding_index,
+        event_handler=__event_handler__,
+        num_passes=args.num_passes)
+
+
+if __name__ == '__main__':
+    train()
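In the new `train.py`, one `reader.Dataset` instance serves both `dataset.train(...)` and `dataset.test(...)`. Both calls return the same bound method `_parse`, which reads `self.path` only when iteration starts. As far as the code shown here goes, a `dataset.test(...)` call made from the event handler therefore appears to change the file that the next training pass opens. The sketch below shows one way to avoid that kind of shared state by binding the path in a closure. `make_reader` and `write_lines` are hypothetical helpers that are not part of the commit, the sample records are invented, and record parsing is reduced to a tab split.

```python
import os
import tempfile


def make_reader(path):
    """Return a reader creator that always reads from its own `path`."""

    def reader():
        with open(path) as f:
            for line in f:
                yield line.rstrip('\n').split('\t')

    return reader


def write_lines(lines):
    """Write lines to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    return path


train_path = write_lines(['1 2\t1:0.5\t1', '3\t2:1.0\t0'])
test_path = write_lines(['9\t9:9.0\t1'])

train_reader = make_reader(train_path)
test_reader = make_reader(test_path)  # does not affect train_reader

first_pass = list(train_reader())
test_records = list(test_reader())    # evaluation between two passes
second_pass = list(train_reader())    # still reads the training file

os.remove(train_path)
os.remove(test_path)
```

Because each reader creator captures its own `path`, interleaving evaluation with training leaves later training passes unchanged.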
ctr/utils.py  0 → 100644  View file @ 7a2d0cae
import logging

logging.basicConfig()
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)


class TaskMode:
    TRAIN_MODE = 0
    TEST_MODE = 1
    INFER_MODE = 2

    def __init__(self, mode):
        self.mode = mode

    def is_train(self):
        return self.mode == self.TRAIN_MODE

    def is_test(self):
        return self.mode == self.TEST_MODE

    def is_infer(self):
        return self.mode == self.INFER_MODE

    @staticmethod
    def create_train():
        return TaskMode(TaskMode.TRAIN_MODE)

    @staticmethod
    def create_test():
        return TaskMode(TaskMode.TEST_MODE)

    @staticmethod
    def create_infer():
        return TaskMode(TaskMode.INFER_MODE)


class ModelType:
    CLASSIFICATION = 0
    REGRESSION = 1

    def __init__(self, mode):
        self.mode = mode

    def is_classification(self):
        return self.mode == self.CLASSIFICATION

    def is_regression(self):
        return self.mode == self.REGRESSION

    @staticmethod
    def create_classification():
        return ModelType(ModelType.CLASSIFICATION)

    @staticmethod
    def create_regression():
        return ModelType(ModelType.REGRESSION)


def load_dnn_input_record(sent):
    return map(int, sent.split())


def load_lr_input_record(sent):
    res = []
    for _ in [x.split(':') for x in sent.split()]:
        res.append((
            int(_[0]),
            float(_[1]), ))
    return res
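`ModelType` wraps a plain integer code, and `ctr/train.py` converts the `--model_type` command-line integer with `ModelType(args.model_type)`. That option is declared there with both `required=True` and a `default`, and argparse reports an error when a required option is missing, so the default is never used. The sketch below shows one possible alternative, assuming the intent was an optional flag that falls back to classification and rejects unknown codes via `choices`. `parse_model_type` is a hypothetical helper, and `ModelType` is re-declared in reduced form only to keep the example self-contained.

```python
import argparse


class ModelType(object):
    """Reduced copy of the ModelType wrapper, for a self-contained demo."""
    CLASSIFICATION = 0
    REGRESSION = 1

    def __init__(self, mode):
        self.mode = mode

    def is_classification(self):
        return self.mode == self.CLASSIFICATION

    def is_regression(self):
        return self.mode == self.REGRESSION


def parse_model_type(argv):
    """Parse --model_type from argv and wrap it in a ModelType."""
    parser = argparse.ArgumentParser(description="model type demo")
    parser.add_argument(
        '--model_type',
        type=int,
        # Only the two known codes are accepted; anything else is an error.
        choices=[ModelType.CLASSIFICATION, ModelType.REGRESSION],
        # Used when the flag is omitted, because the flag is not required.
        default=ModelType.CLASSIFICATION,
        help='model type, classification: %d, regression: %d' %
        (ModelType.CLASSIFICATION, ModelType.REGRESSION))
    return ModelType(parser.parse_args(argv).model_type)


default_type = parse_model_type([])
regression_type = parse_model_type(['--model_type', '1'])
print(default_type.is_classification(), regression_type.is_regression())
```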
nce_cost/train.py  View file @ 7a2d0cae
...
@@ -43,7 +43,10 @@ def train(model_save_dir):
                 parameters.to_tar(f)
     trainer.train(
-        paddle.batch(paddle.dataset.imikolov.train(word_dict, 5), 64),
+        paddle.batch(
+            paddle.reader.shuffle(
+                lambda: paddle.dataset.imikolov.train(word_dict, 5)(),
+                buf_size=1000), 64),
         num_passes=1000,
         event_handler=event_handler)
...
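The `nce_cost/train.py` change wraps the training reader in `paddle.reader.shuffle(..., buf_size=1000)` before batching. The general idea of a buffered shuffle can be sketched in plain Python as follows: fill a buffer of up to `buf_size` items, shuffle it, emit it, and repeat. This is an illustration of the technique only and makes no claim about how PaddlePaddle implements it. `buffered_shuffle` and its `seed` parameter are hypothetical names introduced for the example.

```python
import random


def buffered_shuffle(reader, buf_size, seed=None):
    """Wrap a reader creator so items are shuffled within fixed-size buffers."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for item in reader():
            buf.append(item)
            if len(buf) >= buf_size:
                # Buffer is full: shuffle it and emit all of its items.
                rng.shuffle(buf)
                for entry in buf:
                    yield entry
                buf = []
        if buf:
            # Emit the final, partially filled buffer.
            rng.shuffle(buf)
            for entry in buf:
                yield entry

    return shuffled_reader


def toy_reader():
    """A reader creator yielding ten integers (stand-in for real records)."""
    return iter(range(10))


shuffled = list(buffered_shuffle(toy_reader, buf_size=4, seed=0)())
print(shuffled)
```

Items only move within their own buffer, so a larger `buf_size` gives a more thorough shuffle at the cost of holding more records in memory.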