Commit e94ba724 authored by CandyCaneLane

	new file:   deepfm/README.md

	new file:   deepfm/args.py
	new file:   deepfm/criteo_reader.py
	new file:   deepfm/data/aid_data/feat_dict_10.pkl2
	new file:   deepfm/data/aid_data/train_file_idx.txt
	new file:   deepfm/data/download_preprocess.sh
	new file:   deepfm/data/preprocess.py
	new file:   deepfm/infer.py
	new file:   deepfm/local_train.py
	new file:   deepfm/network_conf.py
	new file:   deepfm/picture/deepfm_result.png
Parent eb359e55
# DeepFM for CTR Prediction
## Introduction
This model implementation reproduces the result of the paper "DeepFM: A Factorization-Machine based Neural Network for CTR Prediction" on the Criteo dataset.
```text
@inproceedings{guo2017deepfm,
  title={DeepFM: A Factorization-Machine based Neural Network for CTR Prediction},
  author={Huifeng Guo and Ruiming Tang and Yunming Ye and Zhenguo Li and Xiuqiang He},
  booktitle={Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI)},
  pages={1725--1731},
  year={2017}
}
```
## Environment
- PaddlePaddle 1.5
## Download and preprocess data
We evaluate the effectiveness of our implemented DeepFM on the Criteo dataset. The dataset was used for the [Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/) hosted by Kaggle and includes 45 million users' click records. Each row contains the features of one ad impression, and the first column is a label indicating whether this ad has been clicked or not. There are 13 continuous features and 26 categorical ones.

To preprocess the raw dataset, we min-max normalize the continuous features to [0, 1] and keep only the categorical feature values that occur at least 10 times. The dataset is randomly split into two parts: 90% for training and the remaining 10% for testing.
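The sketch below illustrates this per-feature transformation. It is a minimal, self-contained example with placeholder statistics and dictionary entries; the real min/max statistics and the feature dictionary are produced by `data/preprocess.py` and stored in `data/aid_data/feat_dict_10.pkl2`.

```python
# Illustrative sketch only; cont_min, cont_max and feat_dict are placeholders.
cont_min, cont_max = 0.0, 5775.0      # placeholder min/max for one continuous field
feat_dict = {'05db9164': 14}          # placeholder categorical-value -> index mapping


def normalize_continuous(x):
    """Min-max normalize a continuous feature into [0, 1]; missing values map to 0."""
    if x == '':
        return 0.0
    return (float(x) - cont_min) / (cont_max - cont_min)


def encode_categorical(value):
    """Map a categorical value to its index; rare or unseen values map to index 0."""
    return feat_dict.get(value, 0)


print(normalize_continuous('5'))          # ~0.000866
print(encode_categorical('05db9164'))     # 14
print(encode_categorical('a_rare_hash'))  # 0
```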
Download and preprocess data:
```bash
cd data && sh download_preprocess.sh && cd ..
```
After executing these commands, three folders, "train_data", "test_data" and "aid_data", will be generated. The folder "train_data" contains 90% of the preprocessed data, and the remaining 10% is in "test_data". The folder "aid_data" contains the generated feature dictionary "feat_dict_10.pkl2".
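For reference, the `data` directory is then expected to look roughly as follows (the exact number and placement of `part-*` files depends on the random split):

```text
data/
├── train_data/              # ~90% of the part-* files
├── test_data/               # remaining ~10% of the part-* files
└── aid_data/
    ├── feat_dict_10.pkl2    # feature dictionary
    └── train_file_idx.txt   # indices of the parts assigned to train_data
```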
## Train
```bash
nohup python local_train.py --model_output_dir models >> train_log 2>&1 &
```
## Infer
```bash
nohup python infer.py --model_output_dir models --test_epoch 1 >> infer_log 2>&1 &
```
Note: the last line of the log reports the total Logloss and AUC over all test data.
## Result
Reproducing this result requires training with the default hyperparameters, which are defined in `args.py`. With the default settings (10 threads, batch size 100, etc.), one pass over the training data takes about 1.8 hours on CPU.
After the 22nd training epoch, the test Logloss is `0.44797` and the test AUC is `0.8046`.
<p align="center">
<img src="./picture/deepfm_result.png" height=200 hspace='10'/> <br />
</p>
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
    parser.add_argument(
        '--train_data_dir',
        type=str,
        default='data/train_data',
        help='The path of train data (default: data/train_data)')
    parser.add_argument(
        '--test_data_dir',
        type=str,
        default='data/test_data',
        help='The path of test data (default: data/test_data)')
    parser.add_argument(
        '--batch_size',
        type=int,
        default=100,
        help="The size of mini-batch (default: 100)")
    parser.add_argument(
        '--embedding_size',
        type=int,
        default=10,
        help="The size of the embedding layer (default: 10)")
    parser.add_argument(
        '--num_epoch',
        type=int,
        default=30,
        help="The number of epochs to train (default: 30)")
    parser.add_argument(
        '--model_output_dir',
        type=str,
        required=True,
        help='The path in which models are stored (e.g. models)')
    parser.add_argument(
        '--num_thread',
        type=int,
        default=10,
        help='The number of threads (default: 10)')
    parser.add_argument(
        '--test_epoch',
        type=str,
        default='1',
        help='The epoch of the saved model to load for inference (default: 1)')
    parser.add_argument(
        '--layer_sizes',
        nargs='+',
        type=int,
        default=[400, 400, 400],
        help='The size of each DNN layer (default: [400, 400, 400])')
    parser.add_argument(
        '--act',
        type=str,
        default='relu',
        help='The activation of each DNN layer (default: relu)')
    parser.add_argument(
        '--lr', type=float, default=1e-4, help='Learning rate (default: 1e-4)')
    parser.add_argument(
        '--reg',
        type=float,
        default=1e-4,
        help='Regularization coefficient (default: 1e-4)')
    parser.add_argument(
        '--num_field',
        type=int,
        default=39,
        help='The number of feature fields: 13 continuous + 26 categorical (default: 39)')
    parser.add_argument(
        '--num_feat',
        type=int,
        default=1086460,  # 2090493
        help='The number of features (default: 1086460)')
    return parser.parse_args()
import sys
import os
from collections import Counter

import paddle.fluid.incubate.data_generator as dg

try:
    import cPickle as pickle
except ImportError:
    import pickle


class CriteoDataset(dg.MultiSlotDataGenerator):
    def setup(self):
        # Min/max statistics of the 13 continuous features, used for
        # min-max normalization.
        self.cont_min_ = [0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        self.cont_max_ = [
            5775, 257675, 65535, 969, 23159456, 431037, 56311, 6047, 29019, 46,
            231, 4008, 7393
        ]
        self.cont_diff_ = [
            self.cont_max_[i] - self.cont_min_[i]
            for i in range(len(self.cont_min_))
        ]
        self.continuous_range_ = range(1, 14)
        self.categorical_range_ = range(14, 40)
        self.feat_dict_ = pickle.load(
            open('data/aid_data/feat_dict_10.pkl2', 'rb'))

    def _process_line(self, line):
        features = line.rstrip('\n').split('\t')
        feat_idx = []
        feat_value = []
        # Continuous features: min-max normalize; missing values map to index 0.
        for idx in self.continuous_range_:
            if features[idx] == '':
                feat_idx.append(0)
                feat_value.append(0.0)
            else:
                feat_idx.append(self.feat_dict_[idx])
                feat_value.append(
                    (float(features[idx]) - self.cont_min_[idx - 1]) /
                    self.cont_diff_[idx - 1])
        # Categorical features: look up the feature dictionary; missing or
        # unseen values map to index 0.
        for idx in self.categorical_range_:
            if features[idx] == '' or features[idx] not in self.feat_dict_:
                feat_idx.append(0)
                feat_value.append(0.0)
            else:
                feat_idx.append(self.feat_dict_[features[idx]])
                feat_value.append(1.0)
        label = [int(features[0])]
        return feat_idx, feat_value, label

    def test(self, filelist):
        def local_iter():
            for fname in filelist:
                with open(fname.strip(), 'r') as fin:
                    for line in fin:
                        feat_idx, feat_value, label = self._process_line(line)
                        yield [feat_idx, feat_value, label]

        return local_iter

    def generate_sample(self, line):
        def data_iter():
            feat_idx, feat_value, label = self._process_line(line)
            yield [('feat_idx', feat_idx), ('feat_value', feat_value),
                   ('label', label)]

        return data_iter


if __name__ == '__main__':
    criteo_dataset = CriteoDataset()
    criteo_dataset.setup()
    criteo_dataset.run_from_stdin()
This diff is collapsed.
[143, 174, 214, 126, 27, 100, 74, 15, 83, 167, 87, 13, 90, 107, 1, 123, 76, 59, 44, 22, 203, 75, 216, 169, 101, 229, 63, 183, 112, 140, 91, 14, 115, 211, 227, 171, 51, 173, 137, 194, 223, 159, 168, 182, 208, 215, 7, 41, 120, 16, 77, 0, 220, 109, 166, 156, 29, 26, 95, 102, 196, 151, 98, 42, 163, 40, 114, 199, 35, 225, 179, 17, 62, 86, 149, 180, 133, 54, 170, 55, 68, 8, 99, 135, 181, 46, 134, 118, 201, 148, 210, 79, 25, 116, 38, 158, 141, 81, 37, 49, 39, 61, 34, 9, 150, 121, 65, 185, 213, 3, 11, 190, 20, 157, 108, 47, 24, 198, 104, 222, 127, 50, 4, 202, 142, 218, 48, 186, 32, 130, 85, 191, 53, 221, 224, 128, 33, 165, 172, 110, 69, 72, 152, 19, 88, 18, 119, 117, 111, 66, 177, 92, 106, 228, 212, 89, 195, 21, 113, 58, 43, 164, 138, 23, 70, 73, 178, 5, 122, 139, 97, 161, 162, 30, 136, 155, 93, 132, 52, 105, 80, 36, 10, 204, 45, 192, 125, 219, 209, 129, 124, 67, 176, 205, 154, 31, 60, 153, 146, 207, 56, 6, 71, 82, 217, 84, 226]
\ No newline at end of file
#!/bin/bash
wget --no-check-certificate https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz
tar zxf dac.tar.gz
rm -f dac.tar.gz
python preprocess.py
rm *.txt
rm -r raw_data
import os
import shutil
import pickle
from collections import Counter

import numpy


def get_raw_data():
    """Split the raw train.txt into parts of 200,000 lines under raw_data/."""
    if not os.path.isdir('raw_data'):
        os.mkdir('raw_data')

    fin = open('train.txt', 'r')
    fout = open('raw_data/part-0', 'w')
    for line_idx, line in enumerate(fin):
        if line_idx % 200000 == 0 and line_idx != 0:
            fout.close()
            cur_part_idx = int(line_idx / 200000)
            fout = open('raw_data/part-' + str(cur_part_idx), 'w')
        fout.write(line)
    fout.close()
    fin.close()


def split_data():
    """Randomly move 90% of the parts to train_data/ and the rest to test_data/."""
    split_rate_ = 0.9
    dir_train_file_idx_ = 'aid_data/train_file_idx.txt'
    filelist_ = [
        'raw_data/part-%d' % x for x in range(len(os.listdir('raw_data')))
    ]

    if not os.path.exists(dir_train_file_idx_):
        train_file_idx = list(
            numpy.random.choice(
                len(filelist_), int(len(filelist_) * split_rate_), False))
        with open(dir_train_file_idx_, 'w') as fout:
            fout.write(str(train_file_idx))
    else:
        with open(dir_train_file_idx_, 'r') as fin:
            train_file_idx = eval(fin.read())

    for idx in range(len(filelist_)):
        if idx in train_file_idx:
            shutil.move(filelist_[idx], 'train_data')
        else:
            shutil.move(filelist_[idx], 'test_data')


def get_feat_dict():
    freq_ = 10
    dir_feat_dict_ = 'aid_data/feat_dict_' + str(freq_) + '.pkl2'
    continuous_range_ = range(1, 14)
    categorical_range_ = range(14, 40)

    if not os.path.exists(dir_feat_dict_):
        # Count the occurrences of each categorical feature value.
        feat_cnt = Counter()
        with open('train.txt', 'r') as fin:
            for line_idx, line in enumerate(fin):
                if line_idx % 100000 == 0:
                    print('generating feature dict', line_idx / 45000000)
                features = line.rstrip('\n').split('\t')
                for idx in categorical_range_:
                    if features[idx] == '':
                        continue
                    feat_cnt.update([features[idx]])

        # Only retain categorical feature values that occur at least freq_ times.
        dis_feat_set = set()
        for feat, ot in feat_cnt.items():
            if ot >= freq_:
                dis_feat_set.add(feat)

        # Build a dictionary mapping continuous and categorical features to indices.
        feat_dict = {}
        tc = 1
        # Continuous features: one index per field.
        for idx in continuous_range_:
            feat_dict[idx] = tc
            tc += 1
        # Categorical features: one index per retained value.
        cnt_feat_set = set()
        with open('train.txt', 'r') as fin:
            for line_idx, line in enumerate(fin):
                features = line.rstrip('\n').split('\t')
                for idx in categorical_range_:
                    if features[idx] == '' or features[idx] not in dis_feat_set:
                        continue
                    if features[idx] not in cnt_feat_set:
                        cnt_feat_set.add(features[idx])
                        feat_dict[features[idx]] = tc
                        tc += 1

        # Save the dictionary.
        with open(dir_feat_dict_, 'wb') as fout:
            pickle.dump(feat_dict, fout)
        print('args.num_feat ', len(feat_dict) + 1)


if __name__ == '__main__':
    if not os.path.isdir('train_data'):
        os.mkdir('train_data')
    if not os.path.isdir('test_data'):
        os.mkdir('test_data')
    if not os.path.isdir('aid_data'):
        os.mkdir('aid_data')

    get_raw_data()
    split_data()
    get_feat_dict()
    print('Done!')
import logging
import os
import pickle

import numpy as np

# disable gpu training for this example
os.environ['CUDA_VISIBLE_DEVICES'] = ''

import paddle
import paddle.fluid as fluid

from args import parse_args
from criteo_reader import CriteoDataset
from network_conf import ctr_deepfm_model

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger('fluid')
logger.setLevel(logging.INFO)


def infer():
    args = parse_args()

    place = fluid.CPUPlace()
    inference_scope = fluid.Scope()

    test_files = [
        args.test_data_dir + '/' + x for x in os.listdir(args.test_data_dir)
    ]
    criteo_dataset = CriteoDataset()
    criteo_dataset.setup()
    test_reader = paddle.batch(
        criteo_dataset.test(test_files), batch_size=args.batch_size)

    startup_program = fluid.framework.Program()
    test_program = fluid.framework.Program()
    cur_model_path = args.model_output_dir + '/epoch_' + args.test_epoch

    with fluid.scope_guard(inference_scope):
        with fluid.framework.program_guard(test_program, startup_program):
            loss, auc, data_list = ctr_deepfm_model(
                args.embedding_size, args.num_field, args.num_feat,
                args.layer_sizes, args.act, args.reg)

            exe = fluid.Executor(place)
            feeder = fluid.DataFeeder(feed_list=data_list, place=place)
            fluid.io.load_persistables(
                executor=exe,
                dirname=cur_model_path,
                main_program=fluid.default_main_program())

            # Reset the accumulated AUC states before evaluating the test set.
            auc_states_names = ['_generated_var_2', '_generated_var_3']
            for name in auc_states_names:
                param = inference_scope.var(name).get_tensor()
                param_array = np.zeros(param._get_dims()).astype("int64")
                param.set(param_array, place)

            loss_all = 0
            num_ins = 0
            for batch_id, data_test in enumerate(test_reader()):
                loss_val, auc_val = exe.run(test_program,
                                            feed=feeder.feed(data_test),
                                            fetch_list=[loss.name, auc.name])
                num_ins += len(data_test)
                loss_all += loss_val
                logger.info('TEST --> batch: {} loss: {} auc_val: {}'.format(
                    batch_id, loss_all / num_ins, auc_val))
            print(
                'The last log info is the total Logloss and AUC for all test data.'
            )


if __name__ == '__main__':
    infer()
from args import parse_args
import os
import sys
import time
import numpy
import pickle

import paddle.fluid as fluid

from network_conf import ctr_deepfm_model


def train():
    args = parse_args()

    print('---------- Configuration Arguments ----------')
    for key, value in args.__dict__.items():
        print(key + ':' + str(value))

    if not os.path.isdir(args.model_output_dir):
        os.mkdir(args.model_output_dir)

    loss, auc, data_list = ctr_deepfm_model(args.embedding_size, args.num_field,
                                            args.num_feat, args.layer_sizes,
                                            args.act, args.reg)
    optimizer = fluid.optimizer.SGD(
        learning_rate=args.lr,
        regularization=fluid.regularizer.L2DecayRegularizer(args.reg))
    optimizer.minimize(loss)

    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())

    # Feed training data through the criteo_reader pipe with multiple threads.
    dataset = fluid.DatasetFactory().create_dataset()
    dataset.set_use_var(data_list)
    pipe_command = 'python criteo_reader.py'
    dataset.set_pipe_command(pipe_command)
    dataset.set_batch_size(args.batch_size)
    dataset.set_thread(args.num_thread)
    train_filelist = [
        args.train_data_dir + '/' + x for x in os.listdir(args.train_data_dir)
    ]

    print('---------------------------------------------')
    for epoch_id in range(args.num_epoch):
        start = time.time()
        dataset.set_filelist(train_filelist)
        exe.train_from_dataset(
            program=fluid.default_main_program(),
            dataset=dataset,
            fetch_list=[loss],
            fetch_info=['epoch %d batch loss' % (epoch_id + 1)],
            print_period=1000,
            debug=False)
        model_dir = args.model_output_dir + '/epoch_' + str(epoch_id + 1)
        sys.stderr.write('epoch%d is finished and takes %f s\n' % (
            (epoch_id + 1), time.time() - start))
        fluid.io.save_persistables(
            executor=exe,
            dirname=model_dir,
            main_program=fluid.default_main_program())


if __name__ == '__main__':
    train()
import paddle.fluid as fluid
import math


def ctr_deepfm_model(embedding_size, num_field, num_feat, layer_sizes, act,
                     reg):
    init_value_ = 0.1

    raw_feat_idx = fluid.layers.data(
        name='feat_idx', shape=[num_field], dtype='int64')
    raw_feat_value = fluid.layers.data(
        name='feat_value', shape=[num_field], dtype='float32')
    label = fluid.layers.data(
        name='label', shape=[1], dtype='float32')  # None * 1
    feat_idx = fluid.layers.reshape(raw_feat_idx,
                                    [-1, num_field, 1])  # None * num_field * 1
    feat_value = fluid.layers.reshape(
        raw_feat_value, [-1, num_field, 1])  # None * num_field * 1

    # -------------------- first order term --------------------
    first_weights = fluid.layers.embedding(
        input=feat_idx,
        dtype='float32',
        size=[num_feat + 1, 1],
        padding_idx=0,
        param_attr=fluid.ParamAttr(
            initializer=fluid.initializer.TruncatedNormalInitializer(
                loc=0.0, scale=init_value_),
            regularizer=fluid.regularizer.L1DecayRegularizer(
                reg)))  # None * num_field * 1
    y_first_order = fluid.layers.reduce_sum((first_weights * feat_value), 1)

    # -------------------- second order term --------------------
    feat_embeddings = fluid.layers.embedding(
        input=feat_idx,
        dtype='float32',
        size=[num_feat + 1, embedding_size],
        padding_idx=0,
        param_attr=fluid.ParamAttr(
            initializer=fluid.initializer.TruncatedNormalInitializer(
                loc=0.0, scale=init_value_ / math.sqrt(float(
                    embedding_size)))))  # None * num_field * embedding_size
    feat_embeddings = feat_embeddings * feat_value  # None * num_field * embedding_size

    # sum_square part
    summed_features_emb = fluid.layers.reduce_sum(feat_embeddings,
                                                  1)  # None * embedding_size
    summed_features_emb_square = fluid.layers.square(
        summed_features_emb)  # None * embedding_size

    # square_sum part
    squared_features_emb = fluid.layers.square(
        feat_embeddings)  # None * num_field * embedding_size
    squared_sum_features_emb = fluid.layers.reduce_sum(
        squared_features_emb, 1)  # None * embedding_size

    y_second_order = 0.5 * fluid.layers.reduce_sum(
        summed_features_emb_square - squared_sum_features_emb, 1,
        keep_dim=True)  # None * 1

    # -------------------- DNN --------------------
    y_dnn = fluid.layers.reshape(feat_embeddings,
                                 [-1, num_field * embedding_size])
    for s in layer_sizes:
        y_dnn = fluid.layers.fc(
            input=y_dnn,
            size=s,
            act=act,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.TruncatedNormalInitializer(
                    loc=0.0, scale=init_value_ / math.sqrt(float(10)))),
            bias_attr=fluid.ParamAttr(
                initializer=fluid.initializer.TruncatedNormalInitializer(
                    loc=0.0, scale=init_value_)))
    y_dnn = fluid.layers.fc(
        input=y_dnn,
        size=1,
        act=None,
        param_attr=fluid.ParamAttr(
            initializer=fluid.initializer.TruncatedNormalInitializer(
                loc=0.0, scale=init_value_)),
        bias_attr=fluid.ParamAttr(
            initializer=fluid.initializer.TruncatedNormalInitializer(
                loc=0.0, scale=init_value_)))

    # ------------------- DeepFM ------------------
    predict = fluid.layers.sigmoid(y_first_order + y_second_order + y_dnn)
    cost = fluid.layers.log_loss(input=predict, label=label)
    batch_cost = fluid.layers.reduce_sum(cost)

    # for auc
    predict_2d = fluid.layers.concat([1 - predict, predict], 1)
    label_int = fluid.layers.cast(label, 'int64')
    auc_var, batch_auc_var, auc_states = fluid.layers.auc(input=predict_2d,
                                                          label=label_int,
                                                          slide_steps=0)

    return batch_cost, auc_var, [raw_feat_idx, raw_feat_value, label]