Commit 4c03cf0f authored by Qiao Longfei

update doc and model

Parent 0f3eeda1
Running the example in this directory requires PaddlePaddle v0.10.0. If your installed PaddlePaddle version is lower than this, please follow the instructions in the [installation documentation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html) to update your installation.
---
# Deep Factorization Machine for Click-Through Rate Prediction
# DNN Model for Click-Through Rate Prediction
## Introduction
This model implements the DeepFM model proposed in the following paper:
This model implements the DNN model proposed in the following paper:
```text
@inproceedings{guo2017deepfm,
@@ -17,8 +14,6 @@
}
```
The DeepFM model combines factorization machines and deep neural networks to capture both low-order and high-order feature interactions. For details on factorization machines, please refer to the paper [Factorization Machines](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf).
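For reference, the factorization machine of Rendle (2010) scores an input $x \in \mathbb{R}^n$ as

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j$$

where the first two terms form the linear (first-order) part and the last term captures pairwise (second-order) interactions through the latent factor vectors $\mathbf{v}_i$.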
## Dataset
This example uses the Criteo dataset from the Kaggle [Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/).
@@ -30,18 +25,8 @@ cd data && ./download.sh && cd ..
```
## Model
The DeepFM model is composed of a factorization machine (FM) and a deep neural network (DNN). All input features are fed into both the FM and the DNN, and their outputs are combined to form the final prediction. The embedding layer for the sparse features in the DNN shares its parameters with the latent vectors (factors) of the FM layer.
The factorization machine layer in PaddlePaddle computes the second-order feature interactions. The following code example combines the factorization machine layer with a fully connected layer to form a complete factorization machine:
This example implements only the DNN part of the model described in the DeepFM paper; the full DeepFM model will be provided in a separate example.
```python
def fm_layer(input, factor_size):
    # First-order term: a linear layer over the raw features.
    first_order = paddle.layer.fc(input=input, size=1, act=paddle.activation.Linear())
    # Second-order term: pairwise feature interactions via latent factors.
    second_order = paddle.layer.factorization_machine(input=input, factor_size=factor_size)
    # Sum the two terms to form the FM output.
    fm = paddle.layer.addto(input=[first_order, second_order],
                            act=paddle.activation.Linear(),
                            bias_attr=False)
    return fm
```
## Data preparation
@@ -58,11 +43,10 @@ python preprocess.py --datadir ./data/raw --outdir ./data
```bash
python train.py \
--train_data_path data/train.txt \
--test_data_path data/valid.txt \
2>&1 | tee train.log
```
After batch 40000 of training pass 9, the testing AUC is 0.807178 and the cost is 0.445196.
After batch 40000 of training pass 1, the testing AUC is 0.807178 and the cost is 0.445196.
## Inference
The command-line options for inference can be listed with `python infer.py -h`.
@@ -70,7 +54,6 @@ python train.py \
Run inference on the test set:
```bash
python infer.py \
--model_gz_path models/model-pass-9-batch-10000.tar.gz \
--data_path data/test.txt \
--prediction_output_path ./predict.txt
--model_path models/pass-0/ \
--data_path data/valid.txt
```
The minimum PaddlePaddle version needed for the code sample in this directory is v0.11.0. If you are on a version of PaddlePaddle earlier than v0.11.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Deep Factorization Machine for Click-Through Rate prediction
# DNN for Click-Through Rate prediction
## Introduction
This model implements the DeepFM model proposed in the following paper:
This model implements the DNN part proposed in the following paper:
```text
@inproceedings{guo2017deepfm,
@@ -38,25 +35,9 @@ cd data && ./download.sh && cd ..
```
## Model
The DeepFM model is composed of the factorization machine layer (FM) and deep
neural networks (DNN). All the input features are fed to both FM and DNN.
The outputs from FM and DNN are combined to form the final output. The embedding
layer for sparse features in the DNN shares its parameters with the latent
vectors (factors) of the FM layer.
The factorization machine layer in PaddlePaddle computes the second-order
interactions. The following code example combines the factorization machine
layer and a fully connected layer to form the full version of the factorization
machine:
```python
def fm_layer(input, factor_size):
    # First-order term: a linear layer over the raw features.
    first_order = paddle.layer.fc(input=input, size=1, act=paddle.activation.Linear())
    # Second-order term: pairwise feature interactions via latent factors.
    second_order = paddle.layer.factorization_machine(input=input, factor_size=factor_size)
    # Sum the two terms to form the FM output.
    fm = paddle.layer.addto(input=[first_order, second_order],
                            act=paddle.activation.Linear(),
                            bias_attr=False)
    return fm
```
This demo implements only the DNN part of the model described in the DeepFM paper; the full DeepFM model will be provided in a separate example.
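By contrast, here is a minimal sketch of a DNN tower in the fluid API. The `concat` wiring mirrors `network_conf.py` below, while the hidden widths (400, 400, 400) are an assumption for illustration, not taken from this commit:

```python
import math
import paddle.fluid as fluid

def dnn_tower(dense_input, sparse_embed_seq):
    # Concatenate the 26 sparse-slot embeddings with the 13 dense features.
    data = fluid.layers.concat(sparse_embed_seq + [dense_input], axis=1)
    fc = data
    for width in (400, 400, 400):  # hidden widths are an assumption
        fc = fluid.layers.fc(
            input=fc, size=width, act='relu',
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.Normal(
                    scale=1 / math.sqrt(fc.shape[1]))))
    # Two-way softmax over click / no-click, as in network_conf.py.
    return fluid.layers.fc(input=fc, size=2, act='softmax')
```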
## Data preparation
@@ -76,11 +57,10 @@ To train the model:
```bash
python train.py \
--train_data_path data/train.txt \
--test_data_path data/valid.txt \
2>&1 | tee train.log
```
After batch 40000 of training pass 9, the testing AUC is `0.807178` and the testing cost is `0.445196`.
After batch 40000 of training pass 1, the testing AUC is `0.807178` and the testing cost is `0.445196`.
## Inference
@@ -89,7 +69,6 @@ The command line options for inference can be listed by `python infer.py -h`.
To run inference on the test dataset:
```bash
python infer.py \
--model_gz_path models/model-pass-9-batch-10000.tar.gz \
--data_path data/test.txt \
--prediction_output_path ./predict.txt
--model_path models/ \
--data_path data/valid.txt
```
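The `--model_path` directory is written by `fluid.io.save_inference_model` in `train.py`. As a rough sketch (assuming the `models/pass-0/` layout from the command above), it can be loaded back like this:

```python
import paddle.fluid as fluid

place = fluid.CPUPlace()
exe = fluid.Executor(place)
# Returns the pruned inference program, the names of its feed variables,
# and the fetch targets that were saved by train.py.
inference_program, feed_names, fetch_targets = fluid.io.load_inference_model(
    'models/pass-0/', exe)
```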
import os
import gzip
import argparse
import itertools
import numpy as np
import paddle
import paddle.fluid as fluid
from network_conf import DeepFM
import reader
from network_conf import ctr_dnn_model
def parse_args():
@@ -24,10 +21,10 @@ def parse_args():
required=True,
help="The path of the dataset to infer")
parser.add_argument(
'--factor_size',
'--embedding_size',
type=int,
default=10,
help="The factor size for the factorization machine (default:10)")
help="The size for embedding layer (default:10)")
return parser.parse_args()
@@ -39,15 +36,14 @@ def infer():
inference_scope = fluid.core.Scope()
dataset = reader.Dataset()
test_reader = paddle.batch(dataset.train(args.data_path), batch_size=1000)
test_reader = paddle.batch(dataset.train([args.data_path]), batch_size=1000)
startup_program = fluid.framework.Program()
test_program = fluid.framework.Program()
with fluid.framework.program_guard(test_program, startup_program):
loss, data_list, auc_var, batch_auc_var = DeepFM(args.factor_size)
loss, data_list, auc_var, batch_auc_var = ctr_dnn_model(args.embedding_size)
exe = fluid.Executor(place)
#exe.run(startup_program)
feeder = fluid.DataFeeder(feed_list=data_list, place=place)
......
@@ -5,7 +5,7 @@ dense_feature_dim = 13
sparse_feature_dim = 117568
def DeepFM(factor_size, infer=False):
def ctr_dnn_model(embedding_size):
dense_input = fluid.layers.data(
name="dense_input", shape=[dense_feature_dim], dtype='float32')
sparse_input_ids = [
@@ -17,7 +17,7 @@ def DeepFM(factor_size, infer=False):
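# Every sparse slot uses the same parameter name ("SparseFeatFactors"),
# so all 26 categorical fields share a single embedding table.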
def embedding_layer(input):
return fluid.layers.embedding(
input=input,
size=[sparse_feature_dim, factor_size],
size=[sparse_feature_dim, embedding_size],
param_attr=fluid.ParamAttr(name="SparseFeatFactors", initializer=fluid.initializer.Normal(scale=1/math.sqrt(sparse_feature_dim))))
sparse_embed_seq = map(embedding_layer, sparse_input_ids)
@@ -32,16 +32,17 @@ def DeepFM(factor_size, infer=False):
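# Binary click / no-click output via a 2-way softmax; column 1 of `predict`
# is the estimated click-through probability.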
predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(scale=1/math.sqrt(fc3.shape[1]))))
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
data_list = [dense_input] + sparse_input_ids
if not infer:
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
cost = fluid.layers.cross_entropy(input=predict, label=label)
avg_cost = fluid.layers.reduce_sum(cost)
accuracy = fluid.layers.accuracy(input=predict, label=label)
auc_var, batch_auc_var, auc_states = fluid.layers.auc(input=predict, label=label, num_thresholds=2**12, slide_steps=20)
data_list.append(label)
return avg_cost, data_list, auc_var, batch_auc_var
else:
return predict, data_list
cost = fluid.layers.cross_entropy(input=predict, label=label)
avg_cost = fluid.layers.reduce_sum(cost)
accuracy = fluid.layers.accuracy(input=predict, label=label)
auc_var, batch_auc_var, auc_states = fluid.layers.auc(input=predict, label=label, num_thresholds=2**12, slide_steps=20)
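# `auc_var` accumulates AUC over all batches evaluated so far, while
# `batch_auc_var` is computed over a sliding window of `slide_steps` batches.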
data_list.append(label)
return avg_cost, data_list, auc_var, batch_auc_var
class Dataset:
def _reader_creator(self, path, is_infer):
def _reader_creator(self, file_list, is_infer):
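# Each input line holds three tab-separated fields:
#   comma-separated dense floats \t comma-separated sparse ids \t label
# In infer mode the label field is not read.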
def reader():
with open(path, 'r') as f:
for line in f:
features = line.rstrip('\n').split('\t')
dense_feature = map(float, features[0].split(','))
sparse_feature = map(lambda x: [int(x)], features[1].split(','))
if not is_infer:
label = [float(features[2])]
yield [dense_feature
] + sparse_feature + [label]
else:
yield [dense_feature] + sparse_feature
for file in file_list:
with open(file, 'r') as f:
for line in f:
features = line.rstrip('\n').split('\t')
dense_feature = map(float, features[0].split(','))
sparse_feature = map(lambda x: [int(x)], features[1].split(','))
if not is_infer:
label = [float(features[2])]
yield [dense_feature
] + sparse_feature + [label]
else:
yield [dense_feature] + sparse_feature
return reader
def train(self, path):
return self._reader_creator(path, False)
def train(self, file_list):
return self._reader_creator(file_list, False)
def test(self, path):
return self._reader_creator(path, False)
def test(self, file_list):
return self._reader_creator(file_list, False)
def infer(self, path):
return self._reader_creator(path, True)
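# v2-style feeding map: each input name -> its column index within a sample
# (dense_input, sparse_input, categorical slots C1-C26, then the label).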
feeding = {
'dense_input': 0,
'sparse_input': 1,
'C1': 2,
'C2': 3,
'C3': 4,
'C4': 5,
'C5': 6,
'C6': 7,
'C7': 8,
'C8': 9,
'C9': 10,
'C10': 11,
'C11': 12,
'C12': 13,
'C13': 14,
'C14': 15,
'C15': 16,
'C16': 17,
'C17': 18,
'C18': 19,
'C19': 20,
'C20': 21,
'C21': 22,
'C22': 23,
'C23': 24,
'C24': 25,
'C25': 26,
'C26': 27,
'label': 28
}
def infer(self, file_list):
return self._reader_creator(file_list, True)
@@ -4,7 +4,7 @@ import argparse
import paddle.fluid as fluid
from network_conf import DeepFM
from network_conf import ctr_dnn_model
import reader
import paddle
@@ -14,7 +14,7 @@ logger.setLevel(logging.INFO)
def parse_args():
parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example")
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--train_data_path',
type=str,
@@ -23,7 +23,7 @@ def parse_args():
parser.add_argument(
'--test_data_path',
type=str,
default='./data/test.txt',
default='./data/valid.txt',
help="The path of testing dataset")
parser.add_argument(
'--batch_size',
@@ -31,10 +31,10 @@ def parse_args():
default=1000,
help="The size of mini-batch (default:1000)")
parser.add_argument(
'--factor_size',
'--embedding_size',
type=int,
default=10,
help="The factor size for the factorization machine (default:10)")
help="The size for embedding layer (default:10)")
parser.add_argument(
'--num_passes',
type=int,
@@ -55,14 +55,14 @@ def train():
if not os.path.isdir(args.model_output_dir):
os.mkdir(args.model_output_dir)
loss, data_list, auc_var, batch_auc_var = DeepFM(args.factor_size)
loss, data_list, auc_var, batch_auc_var = ctr_dnn_model(args.embedding_size)
optimizer = fluid.optimizer.Adam(learning_rate=1e-4)
optimize_ops, params_grads = optimizer.minimize(loss)
optimizer.minimize(loss)
dataset = reader.Dataset()
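# paddle.reader.shuffle buffers up to `buf_size` samples (here batch_size * 100)
# and yields them in random order; paddle.batch then groups them into mini-batches.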
train_reader = paddle.batch(
paddle.reader.shuffle(
dataset.train(args.train_data_path),
dataset.train([args.train_data_path]),
buf_size=args.batch_size * 100),
batch_size=args.batch_size)
place = fluid.CPUPlace()
@@ -80,13 +80,15 @@ def train():
feed=feeder.feed(data),
fetch_list=[loss, auc_var, batch_auc_var]
)
print('pass:' + str(pass_id) + ' batch:' + str(batch_id) + ' loss: ' + str(loss_val) + " auc: " + str(auc_val) + " batch_auc: " + str(batch_auc_val))
print('pass:' + str(pass_id) + ' batch:' + str(batch_id) +
' loss: ' + str(loss_val) + " auc: " + str(auc_val) +
" batch_auc: " + str(batch_auc_val))
batch_id += 1
if batch_id % 100 == 0 and batch_id != 0:
model_dir = 'output/batch-' + str(batch_id)
if batch_id % 1000 == 0 and batch_id != 0:
model_dir = args.model_output_dir + '/batch-' + str(batch_id)
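# save_inference_model prunes the program to the given fetch targets and
# writes the pruned program together with its parameters under model_dir.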
fluid.io.save_inference_model(model_dir, data_name_list, [loss, auc_var], exe)
model_dir = 'output/pass-' + str(pass_id)
fluid.io.save_inference_model(model_dir, data_name_list, [loss_var, auc_var], exe)
model_dir = args.model_output_dir + '/pass-' + str(pass_id)
fluid.io.save_inference_model(model_dir, data_name_list, [loss, auc_var], exe)
if __name__ == '__main__':
......