Commit 4c03cf0f, authored by Qiao Longfei

update doc and model

Parent 0f3eeda1
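The code samples in this directory require PaddlePaddle v0.10.0. If your installed PaddlePaddle version is lower than this, please update it by following the [installation documentation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
--- # DNN-based click-through rate prediction model
# Deep factorization machine for click-through rate prediction
## Introduction ## Introduction
This model implements the DeepFM model proposed in the following paper: This model implements the DNN model proposed in the following paper: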
```text ```text
@inproceedings{guo2017deepfm, @inproceedings{guo2017deepfm,
...@@ -17,8 +14,6 @@ ...@@ -17,8 +14,6 @@
} }
``` ```
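The DeepFM model combines the low-order feature interactions of a factorization machine with the high-order feature interactions of a deep neural network. For details on factorization machines, see the paper [Factorization Machines](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf)
## Dataset ## Dataset
This example uses the Criteo dataset from the [Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/) hosted by Kaggle. This example uses the Criteo dataset from the [Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/) hosted by Kaggle.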
...@@ -30,18 +25,8 @@ cd data && ./download.sh && cd .. ...@@ -30,18 +25,8 @@ cd data && ./download.sh && cd ..
``` ```
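## Model ## Model
The DeepFM model consists of a factorization machine (FM) and a deep neural network (DNN). All input features are fed to both the FM and the DNN, and their outputs are combined to form the final output. The embedding layer generated from the sparse features in the DNN shares parameters with the latent vectors (factors) of the FM layer. This example implements only the DNN part of the model described in the DeepFM paper; DeepFM will be provided in another example.
The factorization machine layer in PaddlePaddle computes the second-order feature interactions. The following code example combines the factorization machine layer with a fully connected layer to form a complete factorization machine: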
```python
def fm_layer(input, factor_size):
first_order = paddle.layer.fc(input=input, size=1, act=paddle.activation.Linear())
second_order = paddle.layer.factorization_machine(input=input, factor_size=factor_size)
fm = paddle.layer.addto(input=[first_order, second_order],
act=paddle.activation.Linear(),
bias_attr=False)
return fm
``` ```
## Data preparation ## Data preparation
...@@ -58,11 +43,10 @@ python preprocess.py --datadir ./data/raw --outdir ./data ...@@ -58,11 +43,10 @@ python preprocess.py --datadir ./data/raw --outdir ./data
```bash ```bash
python train.py \ python train.py \
--train_data_path data/train.txt \ --train_data_path data/train.txt \
--test_data_path data/valid.txt \
2>&1 | tee train.log 2>&1 | tee train.log
``` ```
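After batch 40000 of pass 9, the test AUC is 0.807178 and the cost is 0.445196. After batch 40000 of pass 1, the test AUC is 0.807178 and the cost is 0.445196.
## Inference ## Inference
The command-line options for inference can be listed with `python infer.py -h`. The command-line options for inference can be listed with `python infer.py -h`.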
...@@ -70,7 +54,6 @@ python train.py \ ...@@ -70,7 +54,6 @@ python train.py \
To run inference on the test set: To run inference on the test set:
```bash ```bash
python infer.py \ python infer.py \
--model_gz_path models/model-pass-9-batch-10000.tar.gz \ --model_path models/pass-0/ \
--data_path data/test.txt \ --data_path data/valid.txt
--prediction_output_path ./predict.txt
``` ```
The minimum PaddlePaddle version needed for the code sample in this directory is v0.11.0. If you are on a version of PaddlePaddle earlier than v0.11.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
--- # DNN for Click-Through Rate prediction
# Deep Factorization Machine for Click-Through Rate prediction
## Introduction ## Introduction
This model implements the DeepFM proposed in the following paper: This model implements the DNN part proposed in the following paper:
```text ```text
@inproceedings{guo2017deepfm, @inproceedings{guo2017deepfm,
...@@ -38,25 +35,9 @@ cd data && ./download.sh && cd .. ...@@ -38,25 +35,9 @@ cd data && ./download.sh && cd ..
``` ```
## Model ## Model
The DeepFM model is composed of the factorization machine layer (FM) and deep This demo implements only the DNN part of the model described in the DeepFM paper.
neural networks (DNN). All the input features are fed to both FM and DNN. The DeepFM model will be provided in another example.
The outputs of FM and DNN are combined to form the final output. The embedding
layer for sparse features in the DNN shares the parameters with the latent
vectors (factors) of the FM layer.
The factorization machine layer in PaddlePaddle computes the second order
interactions. The following code example combines the factorization machine
layer and fully connected layer to form the full version of factorization
machine:
```python
def fm_layer(input, factor_size):
first_order = paddle.layer.fc(input=input, size=1, act=paddle.activation.Linear())
second_order = paddle.layer.factorization_machine(input=input, factor_size=factor_size)
fm = paddle.layer.addto(input=[first_order, second_order],
act=paddle.activation.Linear(),
bias_attr=False)
return fm
``` ```
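For readers unfamiliar with the second-order term that the removed `fm_layer` relied on, the following is a minimal pure-Python sketch of it, assuming the standard formulation from the Factorization Machines paper. It is not PaddlePaddle's implementation; the function names `fm_second_order` and `fm_second_order_naive` are hypothetical and exist only for this illustration.

```python
def fm_second_order(x, v):
    """Second-order FM term: sum over i<j of <v_i, v_j> * x_i * x_j.

    x: list of n feature values.
    v: list of n latent vectors, each of length k (the "factor size").
    Uses the linear-time identity
        0.5 * sum_f ((sum_i v[i][f] * x[i])**2 - sum_i (v[i][f] * x[i])**2)
    """
    n, k = len(x), len(v[0])
    total = 0.0
    for f in range(k):
        weighted = [v[i][f] * x[i] for i in range(n)]
        total += sum(weighted) ** 2 - sum(w * w for w in weighted)
    return 0.5 * total


def fm_second_order_naive(x, v):
    """Same quantity computed directly over all pairs, for comparison."""
    n = len(x)
    return sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i in range(n)
        for j in range(i + 1, n)
    )


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    v = [[1.0, 0.0], [0.5, 1.0], [2.0, -1.0]]
    print(fm_second_order(x, v), fm_second_order_naive(x, v))
```

Both functions return the same value; the first avoids the quadratic loop over feature pairs, which matters when the sparse feature space is large.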
## Data preparation ## Data preparation
...@@ -76,11 +57,10 @@ To train the model: ...@@ -76,11 +57,10 @@ To train the model:
```bash ```bash
python train.py \ python train.py \
--train_data_path data/train.txt \ --train_data_path data/train.txt \
--test_data_path data/valid.txt \
2>&1 | tee train.log 2>&1 | tee train.log
``` ```
After training pass 9 batch 40000, the testing AUC is `0.807178` and the testing After training pass 1 batch 40000, the testing AUC is `0.807178` and the testing
cost is `0.445196`. cost is `0.445196`.
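As a reminder of what the AUC figure above measures, here is a minimal sketch that computes AUC as the fraction of (positive, negative) pairs that are ranked correctly. This is an illustration only: the training script obtains AUC from `fluid.layers.auc`, and the helper name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    scores: predicted click probabilities.
    labels: 1 for a click, 0 for no click.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The quadratic loop is fine for a handful of examples but not for a full validation set.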
## Infer ## Infer
...@@ -89,7 +69,6 @@ The command line options for infering can be listed by `python infer.py -h`. ...@@ -89,7 +69,6 @@ The command line options for infering can be listed by `python infer.py -h`.
To make inference for the test dataset: To make inference for the test dataset:
```bash ```bash
python infer.py \ python infer.py \
--model_gz_path models/model-pass-9-batch-10000.tar.gz \ --model_path models/ \
--data_path data/test.txt \ --data_path data/valid.txt
--prediction_output_path ./predict.txt
``` ```
import os
import gzip
import argparse import argparse
import itertools
import numpy as np
import numpy as np
import paddle import paddle
import paddle.fluid as fluid import paddle.fluid as fluid
from network_conf import DeepFM
import reader import reader
from network_conf import ctr_dnn_model
def parse_args(): def parse_args():
...@@ -24,10 +21,10 @@ def parse_args(): ...@@ -24,10 +21,10 @@ def parse_args():
required=True, required=True,
help="The path of the dataset to infer") help="The path of the dataset to infer")
parser.add_argument( parser.add_argument(
'--factor_size', '--embedding_size',
type=int, type=int,
default=10, default=10,
help="The factor size for the factorization machine (default:10)") help="The size for embedding layer (default:10)")
return parser.parse_args() return parser.parse_args()
...@@ -39,15 +36,14 @@ def infer(): ...@@ -39,15 +36,14 @@ def infer():
inference_scope = fluid.core.Scope() inference_scope = fluid.core.Scope()
dataset = reader.Dataset() dataset = reader.Dataset()
test_reader = paddle.batch(dataset.train(args.data_path), batch_size=1000) test_reader = paddle.batch(dataset.train([args.data_path]), batch_size=1000)
startup_program = fluid.framework.Program() startup_program = fluid.framework.Program()
test_program = fluid.framework.Program() test_program = fluid.framework.Program()
with fluid.framework.program_guard(test_program, startup_program): with fluid.framework.program_guard(test_program, startup_program):
loss, data_list, auc_var, batch_auc_var = DeepFM(args.factor_size) loss, data_list, auc_var, batch_auc_var = ctr_dnn_model(args.embedding_size)
exe = fluid.Executor(place) exe = fluid.Executor(place)
#exe.run(startup_program)
feeder = fluid.DataFeeder(feed_list=data_list, place=place) feeder = fluid.DataFeeder(feed_list=data_list, place=place)
......
...@@ -5,7 +5,7 @@ dense_feature_dim = 13 ...@@ -5,7 +5,7 @@ dense_feature_dim = 13
sparse_feature_dim = 117568 sparse_feature_dim = 117568
def DeepFM(factor_size, infer=False): def ctr_dnn_model(embedding_size):
dense_input = fluid.layers.data( dense_input = fluid.layers.data(
name="dense_input", shape=[dense_feature_dim], dtype='float32') name="dense_input", shape=[dense_feature_dim], dtype='float32')
sparse_input_ids = [ sparse_input_ids = [
...@@ -17,7 +17,7 @@ def DeepFM(factor_size, infer=False): ...@@ -17,7 +17,7 @@ def DeepFM(factor_size, infer=False):
def embedding_layer(input): def embedding_layer(input):
return fluid.layers.embedding( return fluid.layers.embedding(
input=input, input=input,
size=[sparse_feature_dim, factor_size], size=[sparse_feature_dim, embedding_size],
param_attr=fluid.ParamAttr(name="SparseFeatFactors", initializer=fluid.initializer.Normal(scale=1/math.sqrt(sparse_feature_dim)))) param_attr=fluid.ParamAttr(name="SparseFeatFactors", initializer=fluid.initializer.Normal(scale=1/math.sqrt(sparse_feature_dim))))
sparse_embed_seq = map(embedding_layer, sparse_input_ids) sparse_embed_seq = map(embedding_layer, sparse_input_ids)
...@@ -32,16 +32,17 @@ def DeepFM(factor_size, infer=False): ...@@ -32,16 +32,17 @@ def DeepFM(factor_size, infer=False):
predict = fluid.layers.fc(input=fc3, size=2, act='softmax', predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(scale=1/math.sqrt(fc3.shape[1])))) param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(scale=1/math.sqrt(fc3.shape[1]))))
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
data_list = [dense_input] + sparse_input_ids data_list = [dense_input] + sparse_input_ids
if not infer:
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
cost = fluid.layers.cross_entropy(input=predict, label=label) cost = fluid.layers.cross_entropy(input=predict, label=label)
avg_cost = fluid.layers.reduce_sum(cost) avg_cost = fluid.layers.reduce_sum(cost)
accuracy = fluid.layers.accuracy(input=predict, label=label) accuracy = fluid.layers.accuracy(input=predict, label=label)
auc_var, batch_auc_var, auc_states = fluid.layers.auc(input=predict, label=label, num_thresholds=2**12, slide_steps=20) auc_var, batch_auc_var, auc_states = fluid.layers.auc(input=predict, label=label, num_thresholds=2**12, slide_steps=20)
data_list.append(label) data_list.append(label)
return avg_cost, data_list, auc_var, batch_auc_var
else: return avg_cost, data_list, auc_var, batch_auc_var
return predict, data_list
class Dataset: class Dataset:
def _reader_creator(self, path, is_infer): def _reader_creator(self, file_list, is_infer):
def reader(): def reader():
with open(path, 'r') as f: for file in file_list:
for line in f: with open(file, 'r') as f:
features = line.rstrip('\n').split('\t') for line in f:
dense_feature = map(float, features[0].split(',')) features = line.rstrip('\n').split('\t')
sparse_feature = map(lambda x: [int(x)], features[1].split(',')) dense_feature = map(float, features[0].split(','))
if not is_infer: sparse_feature = map(lambda x: [int(x)], features[1].split(','))
label = [float(features[2])] if not is_infer:
yield [dense_feature label = [float(features[2])]
] + sparse_feature + [label] yield [dense_feature
else: ] + sparse_feature + [label]
yield [dense_feature] + sparse_feature else:
yield [dense_feature] + sparse_feature
return reader return reader
def train(self, path): def train(self, file_list):
return self._reader_creator(path, False) return self._reader_creator(file_list, False)
def test(self, path): def test(self, file_list):
return self._reader_creator(path, False) return self._reader_creator(file_list, False)
def infer(self, path): def infer(self, file_list):
return self._reader_creator(path, True) return self._reader_creator(file_list, True)
feeding = {
'dense_input': 0,
'sparse_input': 1,
'C1': 2,
'C2': 3,
'C3': 4,
'C4': 5,
'C5': 6,
'C6': 7,
'C7': 8,
'C8': 9,
'C9': 10,
'C10': 11,
'C11': 12,
'C12': 13,
'C13': 14,
'C14': 15,
'C15': 16,
'C16': 17,
'C17': 18,
'C18': 19,
'C19': 20,
'C20': 21,
'C21': 22,
'C22': 23,
'C23': 24,
'C24': 25,
'C25': 26,
'C26': 27,
'label': 28
}
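The per-line parsing done inside `Dataset._reader_creator` above can be sketched as a standalone function. The sketch assumes the preprocessed format implied by that reader: tab-separated fields holding comma-separated dense values, comma-separated sparse ids, and a label. The name `parse_line` is hypothetical. It uses list comprehensions instead of `map`, because under Python 3 `map` returns an iterator and `[dense_feature] + sparse_feature` would no longer concatenate.

```python
def parse_line(line, is_infer=False):
    """Turn one preprocessed text line into a feed record.

    Returns [dense_features, [id_1], [id_2], ..., [label]];
    the trailing [label] is omitted when is_infer is True.
    """
    features = line.rstrip('\n').split('\t')
    dense_feature = [float(x) for x in features[0].split(',')]
    # Each sparse id is wrapped in its own list, one entry per sparse slot.
    sparse_feature = [[int(x)] for x in features[1].split(',')]
    if is_infer:
        return [dense_feature] + sparse_feature
    label = [float(features[2])]
    return [dense_feature] + sparse_feature + [label]


if __name__ == "__main__":
    print(parse_line("0.1,0.2\t3,7\t1\n"))
```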
...@@ -4,7 +4,7 @@ import argparse ...@@ -4,7 +4,7 @@ import argparse
import paddle.fluid as fluid import paddle.fluid as fluid
from network_conf import DeepFM from network_conf import ctr_dnn_model
import reader import reader
import paddle import paddle
...@@ -14,7 +14,7 @@ logger.setLevel(logging.INFO) ...@@ -14,7 +14,7 @@ logger.setLevel(logging.INFO)
def parse_args(): def parse_args():
parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example") parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument( parser.add_argument(
'--train_data_path', '--train_data_path',
type=str, type=str,
...@@ -23,7 +23,7 @@ def parse_args(): ...@@ -23,7 +23,7 @@ def parse_args():
parser.add_argument( parser.add_argument(
'--test_data_path', '--test_data_path',
type=str, type=str,
default='./data/test.txt', default='./data/valid.txt',
help="The path of testing dataset") help="The path of testing dataset")
parser.add_argument( parser.add_argument(
'--batch_size', '--batch_size',
...@@ -31,10 +31,10 @@ def parse_args(): ...@@ -31,10 +31,10 @@ def parse_args():
default=1000, default=1000,
help="The size of mini-batch (default:1000)") help="The size of mini-batch (default:1000)")
parser.add_argument( parser.add_argument(
'--factor_size', '--embedding_size',
type=int, type=int,
default=10, default=10,
help="The factor size for the factorization machine (default:10)") help="The size for embedding layer (default:10)")
parser.add_argument( parser.add_argument(
'--num_passes', '--num_passes',
type=int, type=int,
...@@ -55,14 +55,14 @@ def train(): ...@@ -55,14 +55,14 @@ def train():
if not os.path.isdir(args.model_output_dir): if not os.path.isdir(args.model_output_dir):
os.mkdir(args.model_output_dir) os.mkdir(args.model_output_dir)
loss, data_list, auc_var, batch_auc_var = DeepFM(args.factor_size) loss, data_list, auc_var, batch_auc_var = ctr_dnn_model(args.embedding_size)
optimizer = fluid.optimizer.Adam(learning_rate=1e-4) optimizer = fluid.optimizer.Adam(learning_rate=1e-4)
optimize_ops, params_grads = optimizer.minimize(loss) optimizer.minimize(loss)
dataset = reader.Dataset() dataset = reader.Dataset()
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle( paddle.reader.shuffle(
dataset.train(args.train_data_path), dataset.train([args.train_data_path]),
buf_size=args.batch_size * 100), buf_size=args.batch_size * 100),
batch_size=args.batch_size) batch_size=args.batch_size)
place = fluid.CPUPlace() place = fluid.CPUPlace()
...@@ -80,13 +80,15 @@ def train(): ...@@ -80,13 +80,15 @@ def train():
feed=feeder.feed(data), feed=feeder.feed(data),
fetch_list=[loss, auc_var, batch_auc_var] fetch_list=[loss, auc_var, batch_auc_var]
) )
print('pass:' + str(pass_id) + ' batch:' + str(batch_id) + ' loss: ' + str(loss_val) + " auc: " + str(auc_val) + " batch_auc: " + str(batch_auc_val)) print('pass:' + str(pass_id) + ' batch:' + str(batch_id) +
' loss: ' + str(loss_val) + " auc: " + str(auc_val) +
" batch_auc: " + str(batch_auc_val))
batch_id += 1 batch_id += 1
if batch_id % 100 == 0 and batch_id != 0: if batch_id % 1000 == 0 and batch_id != 0:
model_dir = 'output/batch-' + str(batch_id) model_dir = args.model_output_dir + '/batch-' + str(batch_id)
fluid.io.save_inference_model(model_dir, data_name_list, [loss, auc_var], exe) fluid.io.save_inference_model(model_dir, data_name_list, [loss, auc_var], exe)
model_dir = 'output/pass-' + str(pass_id) model_dir = args.model_output_dir + '/pass-' + str(pass_id)
fluid.io.save_inference_model(model_dir, data_name_list, [loss_var, auc_var], exe) fluid.io.save_inference_model(model_dir, data_name_list, [loss, auc_var], exe)
if __name__ == '__main__': if __name__ == '__main__':
......