PaddlePaddle / models
Commit 3065a876
Authored Jun 29, 2017 by Superjom
refactor ctr model
Parent 0a27ca9d
Showing 9 changed files with 1064 additions and 144 deletions
ctr/README.md +104 -8
ctr/avazu_data_processer.py +415 -0
ctr/dataset.md +7 -0
ctr/index.html +104 -8
ctr/infer.py +81 -0
ctr/network_conf.py +112 -0
ctr/reader.py +66 -0
ctr/train.py +107 -128
ctr/utils.py +68 -0
ctr/README.md
View file @ 3065a876
# Click-Through Rate Prediction
The files in this example's directory, with a description of each:
```
├── README.md               # this tutorial's markdown document
├── dataset.md              # dataset processing tutorial
├── images                  # image directory for this tutorial
│   ├── lr_vs_dnn.jpg
│   └── wide_deep.png
├── infer.py                # prediction script
├── network_conf.py         # model network configuration
├── reader.py               # data provider
├── train.py                # training script
└── utils.py                # helper functions
```
## Background
CTR (Click-Through Rate)\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\] is the probability that a user clicks a particular link,
...
...
@@ -61,8 +76,40 @@ LR's advantage over DNN models is its capacity for large-scale sparse features, including
We use the dataset of the Kaggle `Click-through rate prediction` task\[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model.
For the details of feature processing, see [data process](./dataset.md).
The demo model in this tutorial takes input in the following format:
```
# <dnn input ids> \t <lr input sparse values> \t click
1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
23 231 \t 1230:0.12 13421:0.9 \t 1
```
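To make the record layout above concrete, here is a minimal Python 3 sketch that splits one line into its three fields. The function name `parse_line` and the `has_label` parameter are hypothetical and not part of this commit; the sketch assumes the fields are separated by real tab characters, as the `\t` placeholders suggest.

```python
def parse_line(line, has_label=True):
    """Split one record into (dnn_ids, lr_pairs, click).

    Hypothetical helper: assumes tab-separated fields, space-separated
    ids in field 0, and `index:value` pairs in field 1.
    """
    fields = line.rstrip("\n").split("\t")
    dnn_ids = [int(tok) for tok in fields[0].split()]
    lr_pairs = []
    for tok in fields[1].split():
        index, value = tok.split(":")
        lr_pairs.append((int(index), float(value)))
    # Inference data has no third field, so the label is optional.
    click = int(fields[2]) if has_label else None
    return dnn_ids, lr_pairs, click


record = parse_line("1 23 190\t230:0.12 3421:0.9 23451:0.12\t0")
print(record)
```

Passing `has_label=False` handles the two-field format used for inference data.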
The demo dataset\[[2](#references)\] can be processed with the `avazu_data_processer.py` script, which is used as follows:
```
usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
                               OUTPUT_DIR
                               [--num_lines_to_detect NUM_LINES_TO_DETECT]
                               [--test_set_size TEST_SET_SIZE]
                               [--train_size TRAIN_SIZE]

PaddlePaddle CTR example

optional arguments:
  -h, --help            show this help message and exit
  --data_path DATA_PATH
                        path of the Avazu dataset
  --output_dir OUTPUT_DIR
                        directory to output
  --num_lines_to_detect NUM_LINES_TO_DETECT
                        number of records to detect dataset's meta info
  --test_set_size TEST_SET_SIZE
                        size of the validation dataset(default: 10000)
  --train_size TRAIN_SIZE
                        size of the trainset (default: 100000)
```
## Wide & Deep Learning Model
...
...
@@ -204,15 +251,17 @@ trainer.train(
1. Download the training data; the data from the Kaggle CTR competition\[[2](#references)\] can be used
    1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data)
    2. Unpack train.gz to get train.txt
    3. Run `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data
2. Run `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training

In step 2 above, command-line arguments can be passed to `train.py` to customize the training process. The arguments and their usage are as follows:
```
usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
                [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
                [--num_passes NUM_PASSES]
                [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
                DATA_META_FILE --model_type MODEL_TYPE

PaddlePaddle CTR example
...
...
@@ -220,16 +269,63 @@ optional arguments:
  -h, --help            show this help message and exit
  --train_data_path TRAIN_DATA_PATH
                        path of training dataset
  --test_data_path TEST_DATA_PATH
                        path of testing dataset
  --batch_size BATCH_SIZE
                        size of mini-batch (default:10000)
  --num_passes NUM_PASSES
                        number of passes to train
  --model_output_prefix MODEL_OUTPUT_PREFIX
                        prefix of path for model to store (default:
                        ./ctr_models)
  --data_meta_file DATA_META_FILE
                        path of data meta info file
  --model_type MODEL_TYPE
                        model type, classification: 0, regression 1 (default
                        classification)
```
## Predicting with the Trained Model
The trained model can be used to predict on new data. The prediction data has the following format:
```
# <dnn input ids> \t <lr input sparse values>
1 23 190 \t 230:0.12 3421:0.9 23451:0.12
23 231 \t 1230:0.12 13421:0.9
```
`infer.py` is used as follows:
```
usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
                --prediction_output_path PREDICTION_OUTPUT_PATH
                [--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE

PaddlePaddle CTR example

optional arguments:
  -h, --help            show this help message and exit
  --model_gz_path MODEL_GZ_PATH
                        path of model parameters gz file
  --data_path DATA_PATH
                        path of the dataset to infer
  --prediction_output_path PREDICTION_OUTPUT_PATH
                        path to output the prediction
  --data_meta_path DATA_META_PATH
                        path of trainset's meta info, default is ./data.meta
  --model_type MODEL_TYPE
                        model type, classification: 0, regression 1 (default
                        classification)
```
The demo data can be predicted with the following command:
```
python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt
```
The final predictions are written to `predictions.txt`.
## References
1. <https://en.wikipedia.org/wiki/Click-through_rate>
2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
...
...
ctr/data_provider.py → ctr/avazu_data_processer.py
View file @ 3065a876
+import os
 import sys
 import csv
+import cPickle
+import argparse
 import numpy as np
+
+from utils import logger, TaskMode
+
+parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
+parser.add_argument(
+    '--data_path', type=str, required=True, help="path of the Avazu dataset")
+parser.add_argument(
+    '--output_dir', type=str, required=True, help="directory to output")
+parser.add_argument(
+    '--num_lines_to_detect',
+    type=int,
+    default=500000,
+    help="number of records to detect dataset's meta info")
+parser.add_argument(
+    '--test_set_size',
+    type=int,
+    default=10000,
+    help="size of the validation dataset(default: 10000)")
+parser.add_argument(
+    '--train_size',
+    type=int,
+    default=100000,
+    help="size of the trainset (default: 100000)")
+args = parser.parse_args()
 '''
 The fields of the dataset are:
...
...
@@ -40,6 +67,14 @@ and some other features as id features:
 The `hour` field will be treated as a continuous feature and will be transformed
 to one-hot representation which has 24 bits.
+
+This script will output 3 files:
+
+1. train.txt
+2. test.txt
+3. infer.txt
+
+all the files are for demo.
 '''

 feature_dims = {}
...
...
@@ -161,6 +196,7 @@ def detect_dataset(path, topn, id_fea_space=10000):
     NOTE the records should be randomly shuffled first.
     '''
     # create categorical statis objects.
+    logger.warning('detecting dataset')
     with open(path, 'rb') as csvfile:
         reader = csv.DictReader(csvfile)
...
...
@@ -174,9 +210,6 @@ def detect_dataset(path, topn, id_fea_space=10000):
     for key, item in fields.items():
         feature_dims[key] = item.size()

-    #for key in id_features:
-    #feature_dims[key] = id_fea_space
-
     feature_dims['hour'] = 24
     feature_dims['click'] = 1
...
...
@@ -184,10 +217,20 @@ def detect_dataset(path, topn, id_fea_space=10000):
         feature_dims[key] for key in categorial_features + ['hour']) + 1
     feature_dims['lr_input'] = np.sum(feature_dims[key]
                                       for key in id_features) + 1
+    # logger.warning("dump dataset's meta info to %s" % meta_out_path)
+    # cPickle.dump([feature_dims, fields], open(meta_out_path, 'wb'))

     return feature_dims


+def load_data_meta(meta_path):
+    '''
+    Load dataset's meta infomation.
+    '''
+    feature_dims, fields = cPickle.load(open(meta_path, 'rb'))
+    return feature_dims, fields
+
+
 def concat_sparse_vectors(inputs, dims):
     '''
     Concaterate more than one sparse vectors into one.
...
...
@@ -211,67 +254,162 @@ class AvazuDataset(object):
     '''
     Load AVAZU dataset as train set.
     '''
-    TRAIN_MODE = 0
-    TEST_MODE = 1

-    def __init__(self, train_path, n_records_as_test=-1):
+    def __init__(self,
+                 train_path,
+                 n_records_as_test=-1,
+                 fields=None,
+                 feature_dims=None):
         self.train_path = train_path
         self.n_records_as_test = n_records_as_test
-        # task model: 0 train, 1 test
-        self.mode = 0
+        self.fields = fields
+        # default is train mode.
+        self.mode = TaskMode.create_train()

-    def train(self):
-        self.mode = self.TRAIN_MODE
-        return self._parse(
-            self.train_path, skip_n_lines=self.n_records_as_test)
+        self.categorial_dims = [
+            feature_dims[key] for key in categorial_features + ['hour']
+        ]
+        self.id_dims = [feature_dims[key] for key in id_features]

-    def test(self):
-        self.mode = self.TEST_MODE
-        return self._parse(
-            self.train_path, top_n_lines=self.n_records_as_test)
-
-    def _parse(self, path, skip_n_lines=-1, top_n_lines=-1):
-        with open(path, 'rb') as csvfile:
-            reader = csv.DictReader(csvfile)
-            categorial_dims = [
-                feature_dims[key] for key in categorial_features + ['hour']
-            ]
-            id_dims = [feature_dims[key] for key in id_features]
+    def train(self):
+        '''
+        Load trainset.
+        '''
+        logger.info("load trainset from %s" % self.train_path)
+        self.mode = TaskMode.create_train()
+        with open(self.train_path) as f:
+            reader = csv.DictReader(f)

             for row_id, row in enumerate(reader):
-                if skip_n_lines > 0 and row_id < skip_n_lines:
+                # skip top n lines
+                if self.n_records_as_test > 0 and row_id < self.n_records_as_test:
                     continue
-                if top_n_lines > 0 and row_id > top_n_lines:
-                    break

-                record = []
-                for key in categorial_features:
-                    record.append(fields[key].gen(row[key]))
-                record.append([int(row['hour'][-2:])])
-                dense_input = concat_sparse_vectors(record, categorial_dims)
-
-                record = []
-                for key in id_features:
-                    if 'cross' not in key:
-                        record.append(fields[key].gen(row[key]))
-                    else:
-                        fea0 = fields[key].cross_fea0
-                        fea1 = fields[key].cross_fea1
-                        record.append(
-                            fields[key].gen_cross_fea(row[fea0], row[fea1]))
+                rcd = self._parse_record(row)
+                if rcd:
+                    yield rcd

-                sparse_input = concat_sparse_vectors(record, id_dims)
+    def test(self):
+        '''
+        Load testset.
+        '''
+        logger.info("load testset from %s" % self.train_path)
+        self.mode = TaskMode.create_test()
+        with open(self.train_path) as f:
+            reader = csv.DictReader(f)

-                record = [dense_input, sparse_input]
+            for row_id, row in enumerate(reader):
+                # skip top n lines
+                if self.n_records_as_test > 0 and row_id > self.n_records_as_test:
+                    break

-                record.append(list((int(row['click']), )))
-                yield record
+                rcd = self._parse_record(row)
+                if rcd:
+                    yield rcd

+    def infer(self):
+        '''
+        Load inferset.
+        '''
+        logger.info("load inferset from %s" % self.train_path)
+        self.mode = TaskMode.create_infer()
+        with open(self.train_path) as f:
+            reader = csv.DictReader(f)

-if __name__ == '__main__':
-    path = 'train.txt'
-    print detect_dataset(path, 400000)
+            for row_id, row in enumerate(reader):
+                rcd = self._parse_record(row)
+                if rcd:
+                    yield rcd

-    filereader = AvazuDataset(path)
-    for no, rcd in enumerate(filereader.train()):
-        print no, rcd
-        if no > 1000: break
+    def _parse_record(self, row):
+        '''
+        Parse a CSV row and get a record.
+        '''
+        record = []
+        for key in categorial_features:
+            record.append(self.fields[key].gen(row[key]))
+        record.append([int(row['hour'][-2:])])
+        dense_input = concat_sparse_vectors(record, self.categorial_dims)
+
+        record = []
+        for key in id_features:
+            if 'cross' not in key:
+                record.append(self.fields[key].gen(row[key]))
+            else:
+                fea0 = self.fields[key].cross_fea0
+                fea1 = self.fields[key].cross_fea1
+                record.append(
+                    self.fields[key].gen_cross_fea(row[fea0], row[fea1]))
+
+        sparse_input = concat_sparse_vectors(record, self.id_dims)
+
+        record = [dense_input, sparse_input]
+
+        if not self.mode.is_infer():
+            record.append(list((int(row['click']), )))
+        return record
+
+
+def ids2dense(vec, dim):
+    return vec
+
+
+def ids2sparse(vec):
+    return ["%d:1" % x for x in vec]
+
+
+detect_dataset(args.data_path, args.num_lines_to_detect)
+dataset = AvazuDataset(
+    args.data_path,
+    args.test_set_size,
+    fields=fields,
+    feature_dims=feature_dims)
+
+output_trainset_path = os.path.join(args.output_dir, 'train.txt')
+output_testset_path = os.path.join(args.output_dir, 'test.txt')
+output_infer_path = os.path.join(args.output_dir, 'infer.txt')
+output_meta_path = os.path.join(args.output_dir, 'data.meta.txt')
+
+with open(output_trainset_path, 'w') as f:
+    for id, record in enumerate(dataset.train()):
+        if id and id % 10000 == 0:
+            logger.info("load %d records" % id)
+        if id > args.train_size:
+            break
+        dnn_input, lr_input, click = record
+        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
+        lr_input = ids2sparse(lr_input)
+        line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
+                                 ' '.join(map(str, lr_input)), click[0])
+        f.write(line)
+    logger.info('write to %s' % output_trainset_path)
+
+with open(output_testset_path, 'w') as f:
+    for id, record in enumerate(dataset.test()):
+        dnn_input, lr_input, click = record
+        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
+        lr_input = ids2sparse(lr_input)
+        line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
+                                 ' '.join(map(str, lr_input)), click[0])
+        f.write(line)
+    logger.info('write to %s' % output_testset_path)
+
+with open(output_infer_path, 'w') as f:
+    for id, record in enumerate(dataset.infer()):
+        dnn_input, lr_input = record
+        dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
+        lr_input = ids2sparse(lr_input)
+        line = "%s\t%s\n" % (' '.join(map(str, dnn_input)),
+                             ' '.join(map(str, lr_input)), )
+        f.write(line)
+        if id > args.test_set_size:
+            break
+    logger.info('write to %s' % output_infer_path)
+
+with open(output_meta_path, 'w') as f:
+    lines = [
+        "dnn_input_dim: %d" % feature_dims['dnn_input'],
+        "lr_input_dim: %d" % feature_dims['lr_input']
+    ]
+    f.write('\n'.join(lines))
+    logger.info('write data meta into %s' % output_meta_path)
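The body of `concat_sparse_vectors(inputs, dims)` is elided in this diff. The sketch below is one plausible reading, not the committed implementation: it assumes each input is a list of local ids and that merging shifts every id by the total dimension of the vectors before it. The sample values (`site`, `hour`) are made up for illustration.

```python
def concat_sparse_vectors(inputs, dims):
    """Merge several local-id lists into one global-id list.

    Assumed behaviour (the real body is not shown in the diff): ids of the
    k-th vector are offset by the summed dimensions of vectors 0..k-1.
    """
    assert len(inputs) == len(dims)
    merged = []
    offset = 0
    for ids, dim in zip(inputs, dims):
        for local_id in ids:
            assert 0 <= local_id < dim, "id out of range for its field"
            merged.append(offset + local_id)
        offset += dim
    return merged


# Hypothetical record: a categorical field with 10 values plus the 24-slot hour.
site = [3]
hour = [int("14102114"[-2:])]  # same `row['hour'][-2:]` slice as the script
global_ids = concat_sparse_vectors([site, hour], [10, 24])
print(global_ids)

# ids2sparse in the script then renders ids as "index:1" tokens.
print(["%d:1" % x for x in global_ids])
```

Under this reading, the `+ 1` added to `feature_dims['dnn_input']` and `feature_dims['lr_input']` leaves one spare slot beyond the concatenated range.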
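ctr/dataset.md
View file @ 3065a876
# Data and Processing
## Dataset Introduction
This tutorial demonstrates how to preprocess the dataset of the Kaggle CTR task\[[3](#参考文献)\] into the format this model requires. For the detailed data format, see [README.md](./README.md).
The advantage of the Wide && Deep Model\[[2](#参考文献)\] is that it combines dense features with large-scale sparse features,
so feature processing also handles the dense and the sparse kind separately:
the dense values of the Deep part are all converted into ID features,
which embedding turns into dense vector inputs; the Wide part mainly raises dimensionality through cross products of IDs.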
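As an illustration of the ID cross product used on the Wide side, the sketch below maps a pair of field values to a single id in a fixed-size space. This is a hashing-based sketch under assumptions: the commit's own `gen_cross_fea` is not shown in this diff, and the name `cross_feature_id`, the bucket count, and the use of CRC32 are all hypothetical.

```python
import zlib


def cross_feature_id(value_a, value_b, num_buckets):
    """Map the pair (value_a, value_b) to one id in [0, num_buckets).

    Hypothetical helper: joins the two raw values and hashes the result,
    so every distinct pair behaves like its own sparse feature.
    """
    key = ("%s_x_%s" % (value_a, value_b)).encode("utf-8")
    return zlib.crc32(key) % num_buckets


# Two raw field values from one record (made-up values).
pair_id = cross_feature_id("site_85f751fd", "device_1", num_buckets=10000)
print(pair_id)
```

Crossing two fields multiplies the number of possible values, which is why the Wide input is much higher-dimensional than either field alone; hashing caps that space at the cost of occasional collisions.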
The dataset is stored in `csv` format, with the following fields:
- `id`: ad identifier
...
...
ctr/index.html
View file @ 3065a876
...
...
@@ -42,6 +42,21 @@
<div id="markdown" style='display:none'>
...
...
ctr/infer.py
0 → 100644
View file @ 3065a876
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import gzip
import argparse
import itertools

import paddle.v2 as paddle
import network_conf
from train import dnn_layer_dims
import reader
from utils import logger, ModelType

parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
    '--model_gz_path',
    type=str,
    required=True,
    help="path of model parameters gz file")
parser.add_argument(
    '--data_path', type=str, required=True, help="path of the dataset to infer")
parser.add_argument(
    '--prediction_output_path',
    type=str,
    required=True,
    help="path to output the prediction")
parser.add_argument(
    '--data_meta_path',
    type=str,
    default="./data.meta",
    help="path of trainset's meta info, default is ./data.meta")
parser.add_argument(
    '--model_type',
    type=int,
    required=True,
    default=ModelType.CLASSIFICATION,
    help='model type, classification: %d, regression %d (default classification)'
    % (ModelType.CLASSIFICATION, ModelType.REGRESSION))

args = parser.parse_args()

paddle.init(use_gpu=False, trainer_count=1)


class CTRInferer(object):
    def __init__(self, param_path):
        logger.info("create CTR model")
        dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_path)
        # create the mdoel
        self.ctr_model = network_conf.CTRmodel(
            dnn_layer_dims,
            dnn_input_dim,
            lr_input_dim,
            model_type=args.model_type,
            is_infer=True)
        # load parameter
        logger.info("load model parameters from %s" % param_path)
        self.parameters = paddle.parameters.Parameters.from_tar(
            gzip.open(param_path, 'r'))
        self.inferer = paddle.inference.Inference(
            output_layer=self.ctr_model.model,
            parameters=self.parameters, )

    def infer(self, data_path):
        logger.info("infer data...")
        dataset = reader.Dataset()
        infer_reader = paddle.batch(
            dataset.infer(args.data_path), batch_size=1000)
        logger.warning('write predictions to %s' % args.prediction_output_path)
        output_f = open(args.prediction_output_path, 'w')
        for id, batch in enumerate(infer_reader()):
            res = self.inferer.infer(input=batch)
            predictions = [x for x in itertools.chain.from_iterable(res)]
            assert len(batch) == len(
                predictions), "predict error, %d inputs, but %d predictions" % (
                    len(batch), len(predictions))
            output_f.write('\n'.join(map(str, predictions)) + '\n')


if __name__ == '__main__':
    ctr_inferer = CTRInferer(args.model_gz_path)
    ctr_inferer.infer(args.data_path)
ctr/network_conf.py
0 → 100644
View file @ 3065a876
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import paddle.v2 as paddle
from paddle.v2 import layer
from paddle.v2 import data_type as dtype
from utils import logger, ModelType


class CTRmodel(object):
    '''
    A CTR model which implements wide && deep learning model.
    '''

    def __init__(self,
                 dnn_layer_dims,
                 dnn_input_dim,
                 lr_input_dim,
                 model_type=ModelType.create_classification(),
                 is_infer=False):
        '''
        @dnn_layer_dims: list of integer
            dims of each layer in dnn
        @dnn_input_dim: int
            size of dnn's input layer
        @lr_input_dim: int
            size of lr's input layer
        @is_infer: bool
            whether to build a infer model
        '''
        self.dnn_layer_dims = dnn_layer_dims
        self.dnn_input_dim = dnn_input_dim
        self.lr_input_dim = lr_input_dim
        self.model_type = model_type
        self.is_infer = is_infer

        self._declare_input_layers()

        self.dnn = self._build_dnn_submodel_(self.dnn_layer_dims)
        self.lr = self._build_lr_submodel_()

        # model's prediction
        # TODO(superjom) rename it to prediction
        if self.model_type.is_classification():
            self.model = self._build_classification_model(self.dnn, self.lr)
        if self.model_type.is_regression():
            self.model = self._build_regression_model(self.dnn, self.lr)

    def _declare_input_layers(self):
        self.dnn_merged_input = layer.data(
            name='dnn_input',
            type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))

        self.lr_merged_input = layer.data(
            name='lr_input',
            type=paddle.data_type.sparse_vector(self.lr_input_dim))

        if not self.is_infer:
            self.click = paddle.layer.data(
                name='click', type=dtype.dense_vector(1))

    def _build_dnn_submodel_(self, dnn_layer_dims):
        '''
        build DNN submodel.
        '''
        dnn_embedding = layer.fc(
            input=self.dnn_merged_input, size=dnn_layer_dims[0])
        _input_layer = dnn_embedding
        for i, dim in enumerate(dnn_layer_dims[1:]):
            fc = layer.fc(
                input=_input_layer,
                size=dim,
                act=paddle.activation.Relu(),
                name='dnn-fc-%d' % i)
            _input_layer = fc
        return _input_layer

    def _build_lr_submodel_(self):
        '''
        config LR submodel
        '''
        fc = layer.fc(
            input=self.lr_merged_input,
            size=1,
            name='lr',
            act=paddle.activation.Relu())
        return fc

    def _build_classification_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        self.output = layer.fc(
            input=merge_layer,
            size=1,
            name='output',
            # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
            act=paddle.activation.Sigmoid())

        self.train_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
            input=self.output, label=self.click)
        return self.output

    def _build_regression_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        self.output = layer.fc(
            input=merge_layer,
            size=1,
            name='output',
            act=paddle.activation.Sigmoid())
        if not self.is_infer:
            self.train_cost = paddle.layer.mse_cost(
                input=self.output, label=self.click)
        return self.output
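The final step of `CTRmodel` concatenates the DNN and LR sub-model outputs and feeds them through a single sigmoid unit. The plain-Python sketch below shows only that arithmetic, without PaddlePaddle. The function names, weights, and inputs are invented for illustration and are not learned parameters from this commit.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def combine(dnn_out, lr_out, weights, bias):
    """Concatenate both sub-model outputs, then apply one sigmoid unit.

    Mirrors the shape of `layer.concat` followed by `layer.fc(size=1)`
    with a Sigmoid activation; all numbers here are hypothetical.
    """
    merged = list(dnn_out) + list(lr_out)
    z = sum(w * x for w, x in zip(weights, merged)) + bias
    return sigmoid(z)


# dnn_layer_dims ends in 1, so each sub-model contributes a single value.
p_click = combine(dnn_out=[0.5], lr_out=[2.0], weights=[1.0, -0.25], bias=0.0)
print(p_click)
```

With these made-up numbers the weighted sum is exactly zero, so the predicted click probability is 0.5.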
ctr/reader.py
0 → 100644
View file @ 3065a876
from utils import logger, TaskMode, load_dnn_input_record, load_lr_input_record

feeding_index = {'dnn_input': 0, 'lr_input': 1, 'click': 2}


class Dataset(object):
    def __init__(self):
        self.mode = TaskMode.create_train()

    def train(self, path):
        '''
        Load trainset.
        '''
        logger.info("load trainset from %s" % path)
        self.mode = TaskMode.create_train()
        self.path = path
        return self._parse

    def test(self, path):
        '''
        Load testset.
        '''
        logger.info("load testset from %s" % path)
        self.path = path
        self.mode = TaskMode.create_test()
        return self._parse

    def infer(self, path):
        '''
        Load infer set.
        '''
        logger.info("load inferset from %s" % path)
        self.path = path
        self.mode = TaskMode.create_infer()
        return self._parse

    def _parse(self):
        '''
        Parse dataset.
        '''
        with open(self.path) as f:
            for line_id, line in enumerate(f):
                fs = line.strip().split('\t')
                dnn_input = load_dnn_input_record(fs[0])
                lr_input = load_lr_input_record(fs[1])
                if not self.mode.is_infer():
                    click = [int(fs[2])]
                    yield dnn_input, lr_input, click
                else:
                    yield dnn_input, lr_input


def load_data_meta(path):
    '''
    load data meta info from path, return (dnn_input_dim, lr_input_dim)
    '''
    with open(path) as f:
        lines = f.read().split('\n')
        err_info = "wrong meta format"
        assert len(lines) == 2, err_info
        assert 'dnn_input_dim:' in lines[0] and 'lr_input_dim:' in lines[
            1], err_info
        res = map(int, [_.split(':')[1] for _ in lines])
        logger.info('dnn input dim: %d' % res[0])
        logger.info('lr input dim: %d' % res[1])
        return res
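The meta file that `load_data_meta` reads has exactly two lines, `dnn_input_dim: N` and `lr_input_dim: M`, with no trailing newline. The committed code is Python 2, where `map` returns a list that can be indexed. The round-trip sketch below is a Python 3 adaptation under that assumption; `write_meta`, `read_meta`, and the dimension values are hypothetical.

```python
import os
import tempfile


def write_meta(path, dnn_input_dim, lr_input_dim):
    # Two lines and no trailing newline, matching the '\n'.join(...) in the
    # processing script.
    with open(path, "w") as f:
        f.write("dnn_input_dim: %d\nlr_input_dim: %d" %
                (dnn_input_dim, lr_input_dim))


def read_meta(path):
    """Return (dnn_input_dim, lr_input_dim) from a two-line meta file."""
    with open(path) as f:
        lines = f.read().split("\n")
    if (len(lines) != 2 or "dnn_input_dim:" not in lines[0] or
            "lr_input_dim:" not in lines[1]):
        raise ValueError("wrong meta format")
    # A tuple instead of Python 2's list-returning map().
    return tuple(int(line.split(":")[1]) for line in lines)


meta_path = os.path.join(tempfile.mkdtemp(), "data.meta.txt")
write_meta(meta_path, 61, 10040001)  # made-up dimensions
dims = read_meta(meta_path)
print(dims)
```

A trailing newline in the file would produce a third, empty line and fail the length check, which is why the writer omits it.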
ctr/train.py
View file @ 3065a876
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 import argparse
-import logging
-import paddle.v2 as paddle
-from paddle.v2 import layer
-from paddle.v2 import data_type as dtype
-from data_provider import field_index, detect_dataset, AvazuDataset
-
-parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
-parser.add_argument(
-    '--train_data_path',
-    type=str,
-    required=True,
-    help="path of training dataset")
-parser.add_argument(
-    '--batch_size',
-    type=int,
-    default=10000,
-    help="size of mini-batch (default:10000)")
-parser.add_argument(
-    '--test_set_size',
-    type=int,
-    default=10000,
-    help="size of the validation dataset(default: 10000)")
-parser.add_argument(
-    '--num_passes', type=int, default=10, help="number of passes to train")
-parser.add_argument(
-    '--num_lines_to_detact',
-    type=int,
-    default=500000,
-    help="number of records to detect dataset's meta info")
-args = parser.parse_args()
-
-dnn_layer_dims = [128, 64, 32, 1]
-data_meta_info = detect_dataset(args.train_data_path, args.num_lines_to_detact)
-
-logging.warning('detect categorical fields in dataset %s' %
-                args.train_data_path)
-for key, item in data_meta_info.items():
-    logging.warning(' - {}\t{}'.format(key, item))
-
-paddle.init(use_gpu=False, trainer_count=1)
+import gzip

-# ==============================================================================
-# input layers
-# ==============================================================================
-dnn_merged_input = layer.data(
-    name='dnn_input',
-    type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))
-
-lr_merged_input = layer.data(
-    name='lr_input',
-    type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))
-
-click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
+import reader
+import paddle.v2 as paddle
+from utils import logger, ModelType
+from network_conf import CTRmodel
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
+    parser.add_argument(
+        '--train_data_path',
+        type=str,
+        required=True,
+        help="path of training dataset")
+    parser.add_argument(
+        '--test_data_path', type=str, help='path of testing dataset')
+    parser.add_argument(
+        '--batch_size',
+        type=int,
+        default=10000,
+        help="size of mini-batch (default:10000)")
+    parser.add_argument(
+        '--num_passes', type=int, default=10, help="number of passes to train")
+    parser.add_argument(
+        '--model_output_prefix',
+        type=str,
+        default='./ctr_models',
+        help='prefix of path for model to store (default: ./ctr_models)')
+    parser.add_argument(
+        '--data_meta_file',
+        type=str,
+        required=True,
+        help='path of data meta info file', )
+    parser.add_argument(
+        '--model_type',
+        type=int,
+        required=True,
+        default=ModelType.CLASSIFICATION,
+        help='model type, classification: %d, regression %d (default classification)'
+        % (ModelType.CLASSIFICATION, ModelType.REGRESSION))
+    return parser.parse_args()

-# ==============================================================================
-# network structure
-# ==============================================================================
-def build_dnn_submodel(dnn_layer_dims):
-    dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
-    _input_layer = dnn_embedding
-    for i, dim in enumerate(dnn_layer_dims[1:]):
-        fc = layer.fc(
-            input=_input_layer,
-            size=dim,
-            act=paddle.activation.Relu(),
-            name='dnn-fc-%d' % i)
-        _input_layer = fc
-    return _input_layer
-
-
-# config LR submodel
-def build_lr_submodel():
-    fc = layer.fc(
-        input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
-    return fc
-
-
-# conbine DNN and LR submodels
-def combine_submodels(dnn, lr):
-    merge_layer = layer.concat(input=[dnn, lr])
-    fc = layer.fc(
-        input=merge_layer,
-        size=1,
-        name='output',
-        # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
-        act=paddle.activation.Sigmoid())
-    return fc
-
-
-dnn = build_dnn_submodel(dnn_layer_dims)
-lr = build_lr_submodel()
-output = combine_submodels(dnn, lr)
+dnn_layer_dims = [128, 64, 32, 1]

 # ==============================================================================
 # cost and train period
 # ==============================================================================
-classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
-    input=output, label=click)
-
-params = paddle.parameters.create(classification_cost)
-
-optimizer = paddle.optimizer.Momentum(momentum=0.01)
-
-trainer = paddle.trainer.SGD(
-    cost=classification_cost, parameters=params, update_equation=optimizer)
-
-dataset = AvazuDataset(
-    args.train_data_path, n_records_as_test=args.test_set_size)
-
-
-def event_handler(event):
-    if isinstance(event, paddle.event.EndIteration):
-        num_samples = event.batch_id * args.batch_size
-        if event.batch_id % 100 == 0:
-            logging.warning("Pass %d, Samples %d, Cost %f" %
-                            (event.pass_id, num_samples, event.cost))
-
-        if event.batch_id % 1000 == 0:
-            result = trainer.test(
-                reader=paddle.batch(dataset.test, batch_size=args.batch_size),
-                feeding=field_index)
-            logging.warning("Test %d-%d, Cost %f" %
-                            (event.pass_id, event.batch_id, result.cost))
-
-
-trainer.train(
-    reader=paddle.batch(
-        paddle.reader.shuffle(dataset.train, buf_size=500),
-        batch_size=args.batch_size),
-    feeding=field_index,
-    event_handler=event_handler,
-    num_passes=args.num_passes)
+def train():
+    args = parse_args()
+    args.model_type = ModelType(args.model_type)
+    paddle.init(use_gpu=False, trainer_count=1)
+    dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_file)
+
+    # create ctr model.
+    model = CTRmodel(
+        dnn_layer_dims,
+        dnn_input_dim,
+        lr_input_dim,
+        model_type=args.model_type,
+        is_infer=False)
+
+    params = paddle.parameters.create(model.train_cost)
+    optimizer = paddle.optimizer.AdaGrad()
+
+    trainer = paddle.trainer.SGD(
+        cost=model.train_cost, parameters=params, update_equation=optimizer)
+
+    # dataset = reader.AvazuDataset(
+    #     args.train_data_path,
+    #     n_records_as_test=args.test_set_size,
+    #     fields=reader.fields,
+    #     feature_dims=reader.feature_dims)
+    dataset = reader.Dataset()
+
+    def __event_handler__(event):
+        if isinstance(event, paddle.event.EndIteration):
+            num_samples = event.batch_id * args.batch_size
+            if event.batch_id % 100 == 0:
+                logger.warning("Pass %d, Samples %d, Cost %f, %s" % (
+                    event.pass_id, num_samples, event.cost, event.metrics))
+
+            if event.batch_id % 1000 == 0:
+                if args.test_data_path:
+                    result = trainer.test(
+                        reader=paddle.batch(
+                            dataset.test(args.test_data_path),
+                            batch_size=args.batch_size),
+                        feeding=reader.feeding_index)
+                    logger.warning("Test %d-%d, Cost %f, %s" %
+                                   (event.pass_id, event.batch_id, result.cost,
+                                    result.metrics))
+
+                    path = "{}-pass-{}-batch-{}-test-{}.tar.gz".format(
+                        args.model_output_prefix, event.pass_id, event.batch_id,
+                        result.cost)
+                    with gzip.open(path, 'w') as f:
+                        params.to_tar(f)
+
+    trainer.train(
+        reader=paddle.batch(
+            paddle.reader.shuffle(
+                dataset.train(args.train_data_path), buf_size=500),
+            batch_size=args.batch_size),
+        feeding=reader.feeding_index,
+        event_handler=__event_handler__,
+        num_passes=args.num_passes)
+
+
+if __name__ == '__main__':
+    train()
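`train()` wraps the dataset reader in a buffered shuffle and then a batcher before handing it to the trainer. The sketch below imitates that pipeline in plain Python 3 to show the reader-creator pattern, where each stage is a function that returns a generator function. `shuffle` and `batch` here are simplified stand-ins written for this example, not PaddlePaddle's implementations, and the `seed` parameter and `toy_reader` are hypothetical.

```python
import random


def shuffle(reader, buf_size, seed=0):
    """Stand-in for a buffered shuffle: shuffles within windows of buf_size."""

    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for item in reader():
            buf.append(item)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for x in buf:
                    yield x
                buf = []
        # Flush whatever is left in the last, partially filled buffer.
        rng.shuffle(buf)
        for x in buf:
            yield x

    return shuffled_reader


def batch(reader, batch_size):
    """Stand-in for a batcher: groups consecutive items into lists."""

    def batch_reader():
        current = []
        for item in reader():
            current.append(item)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            yield current

    return batch_reader


def toy_reader():
    # Records shaped like reader.Dataset output: (dnn_input, lr_input, click).
    for i in range(10):
        yield [i], [(i, 1.0)], [i % 2]


batches = list(batch(shuffle(toy_reader, buf_size=4), batch_size=3)())
print([len(b) for b in batches])
```

Because `reader.Dataset.train(path)` returns the bound method `self._parse` rather than a generator, it fits this pattern: each pass can call it again to restart from the top of the file.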
ctr/utils.py
0 → 100644
View file @ 3065a876
import logging

logging.basicConfig()
logger = logging.getLogger("logger")
logger.setLevel(logging.INFO)


class TaskMode:
    TRAIN_MODE = 0
    TEST_MODE = 1
    INFER_MODE = 2

    def __init__(self, mode):
        self.mode = mode

    def is_train(self):
        return self.mode == self.TRAIN_MODE

    def is_test(self):
        return self.mode == self.TEST_MODE

    def is_infer(self):
        return self.mode == self.INFER_MODE

    @staticmethod
    def create_train():
        return TaskMode(TaskMode.TRAIN_MODE)

    @staticmethod
    def create_test():
        return TaskMode(TaskMode.TEST_MODE)

    @staticmethod
    def create_infer():
        return TaskMode(TaskMode.INFER_MODE)


class ModelType:
    CLASSIFICATION = 0
    REGRESSION = 1

    def __init__(self, mode):
        self.mode = mode

    def is_classification(self):
        return self.mode == self.CLASSIFICATION

    def is_regression(self):
        return self.mode == self.REGRESSION

    @staticmethod
    def create_classification():
        return ModelType(ModelType.CLASSIFICATION)

    @staticmethod
    def create_regression():
        return ModelType(ModelType.REGRESSION)


def load_dnn_input_record(sent):
    return map(int, sent.split())


def load_lr_input_record(sent):
    res = []
    for _ in [x.split(':') for x in sent.split()]:
        res.append((
            int(_[0]),
            float(_[1]), ))
    return res