README.md 6.8 KB
Newer Older
O
overlordmax 已提交
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# fibinet

 以下是本例的简要目录结构及说明: 

```
├── data #样例数据
	├── sample_data
		├── train
			├── sample_train.txt
	├── download.sh
	├── run.sh
	├── get_slot_data.py
├── __init__.py
├── README.md # 文档
├── model.py #模型文件
├── config.yaml #配置文件
```

F
frankwhzhang 已提交
19 20 21 22 23
注:在阅读该示例前,建议您先了解以下内容:

[paddlerec入门教程](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)


F
frankwhzhang 已提交
24 25
---
## 内容
F
frankwhzhang 已提交
26

F
frankwhzhang 已提交
27 28 29 30 31 32 33 34 35
- [模型简介](#模型简介)
- [数据准备](#数据准备)
- [运行环境](#运行环境)
- [快速开始](#快速开始)
- [论文复现](#论文复现)
- [进阶使用](#进阶使用)
- [FAQ](#FAQ)

## 模型简介
O
overlordmax 已提交
36 37 38

[《FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction》]( https://arxiv.org/pdf/1905.09433.pdf)是新浪微博机器学习团队发表在RecSys19上的一篇论文,文章指出当前的许多通过特征组合进行CTR预估的工作主要使用特征向量的内积或哈达玛积来计算交叉特征,这种方法忽略了特征本身的重要程度。提出通过使用Squeeze-Excitation network (SENET) 结构动态学习特征的重要性以及使用一个双线性函数来更好的建模交叉特征。

F
frankwhzhang 已提交
39 40 41 42 43 44 45
本项目在paddlepaddle上实现FibiNET的网络结构,并在开源数据集Criteo上验证模型效果, 本模型配置默认使用demo数据集,若进行精度验证,请参考[论文复现](#论文复现)部分。

本项目支持功能

训练:单机CPU、单机单卡GPU、单机多卡GPU、本地模拟参数服务器训练、增量训练,配置请参考 [启动训练](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)   

预测:单机CPU、单机单卡GPU ;配置请参考[PaddleRec 离线预测](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md) 
O
overlordmax 已提交
46

F
frankwhzhang 已提交
47
## 数据准备
O
overlordmax 已提交
48 49 50 51 52 53 54

数据地址:[Criteo]( https://fleet.bj.bcebos.com/ctr_data.tar.gz)

(1)将原始训练集按9:1划分为训练集和验证集

(2)数值特征(连续特征)进行归一化处理

O
fix bug  
overlordmax 已提交
55 56 57 58 59 60
执行run.sh生成训练集和测试集

```
sh run.sh
```

F
frankwhzhang 已提交
61
原始的数据格式为13个dense部分特征+离散化特征,用'\t'切分, 对应的数据是data/train_data_full data/test_data_full
F
frankwhzhang 已提交
62 63 64 65
```
0   1   1   5   0   1382    4   15  2   181 1   2       2   68fd1e64    80e26c9b    fb936136    7b4723c4    25c83c98    7e0ccccf    de7995b8    1f89b562    a73ee510    a8cd5504    b2cb9c98    37c9c164    2824a5f6    1adce6ef    8ba8b39a    891b62e7    e5ba7672    f54016b9    21ddcdc9    b1252a9d    07b5194c        3a171ecb    c5c50484    e8b83407    9727dd16
```

F
frankwhzhang 已提交
66
经过get_slot_data.py处理后,得到如下数据, dense_feature中的值会merge在一起,对应net.py中的self._dense_data_var, '1:715353'表示net.py中的self._sparse_data_var[1] = 715353, 对应的数据是data/slot_train_data_full, data/slot_test_data_full
F
frankwhzhang 已提交
67 68 69 70 71
```
click:0 dense_feature:0.05 dense_feature:0.00663349917081 dense_feature:0.05 dense_feature:0.0 dense_feature:0.02159375 dense_feature:0.008 dense_feature:0.15 dense_feature:0.04 dense_feature:0.362 dense_feature:0.1 dense_feature:0.2 dense_feature:0.0 dense_feature:0.04 1:715353 2:817085 3:851010 4:833725 5:286835 6:948614 7:881652 8:507110 9:27346 10:646986 11:643076 12:200960 13:18464 14:202774 15:532679 16:729573 17:342789 18:562805 19:880474 20:984402 21:666449 22:26235 23:700326 24:452909 25:884722 26:787527
```


F
frankwhzhang 已提交
72

F
frankwhzhang 已提交
73
## 运行环境
O
overlordmax 已提交
74

F
frankwhzhang 已提交
75 76 77 78 79
PaddlePaddle>=1.7.2 

python 2.7/3.5/3.6/3.7

PaddleRec >=0.1
O
overlordmax 已提交
80

F
frankwhzhang 已提交
81
os : windows/linux/macos
O
overlordmax 已提交
82 83


F
frankwhzhang 已提交
84 85 86 87

## 快速开始

### 单机训练
O
overlordmax 已提交
88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115

CPU环境

在config.yaml文件中设置好设备,epochs等。

```
# select runner by name
mode: [single_cpu_train, single_cpu_infer]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: single_cpu_train
  class: train
  # num of epochs
  epochs: 4
  # device to run training or infer
  device: cpu
  save_checkpoint_interval: 2 # save model interval of epochs
  save_inference_interval: 4 # save inference
  save_checkpoint_path: "increment_model" # save checkpoint path
  save_inference_path: "inference" # save inference path
  save_inference_feed_varnames: [] # feed vars of save inference
  save_inference_fetch_varnames: [] # fetch vars of save inference
  init_model_path: "" # load model path
  print_interval: 10
  phases: [phase1]
```

F
frankwhzhang 已提交
116
### 单机预测
O
overlordmax 已提交
117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

CPU环境

在config.yaml文件中设置好epochs、device等参数。

```
- name: single_cpu_infer
  class: infer
  # num of epochs
  epochs: 1
  # device to run training or infer
  device: cpu #选择预测的设备
  init_model_path: "increment_dnn" # load model path
  phases: [phase2]
```

F
frankwhzhang 已提交
133
### 运行
O
overlordmax 已提交
134
```
C
Chengmo 已提交
135
python -m paddlerec.run -m models/rank/fibinet
O
overlordmax 已提交
136 137 138
```


F
frankwhzhang 已提交
139
### 结果展示
O
overlordmax 已提交
140

F
frankwhzhang 已提交
141
样例数据训练结果展示:
O
overlordmax 已提交
142 143

```
O
fix bug  
overlordmax 已提交
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
Running SingleStartup.
W0623 12:03:35.130075   509 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0623 12:03:35.134771   509 device_context.cc:245] device: 0, cuDNN Version: 7.3.
Running SingleRunner.
batch: 100, AUC: [0.6449976], BATCH_AUC: [0.69029814]
batch: 200, AUC: [0.6769844], BATCH_AUC: [0.70255003]
batch: 300, AUC: [0.67131597], BATCH_AUC: [0.68954499]
batch: 400, AUC: [0.68129822], BATCH_AUC: [0.70892718]
batch: 500, AUC: [0.68242937], BATCH_AUC: [0.69269376]
batch: 600, AUC: [0.68741928], BATCH_AUC: [0.72034578]
...
batch: 1400, AUC: [0.84607023], BATCH_AUC: [0.93358024]
batch: 1500, AUC: [0.84796116], BATCH_AUC: [0.95302841]
batch: 1600, AUC: [0.84949111], BATCH_AUC: [0.92868531]
batch: 1700, AUC: [0.85113661], BATCH_AUC: [0.95452616]
batch: 1800, AUC: [0.85260467], BATCH_AUC: [0.92847032]
epoch 3 done, use time: 1618.1106688976288
O
overlordmax 已提交
161 162
```

F
frankwhzhang 已提交
163
样例数据预测结果展示
O
overlordmax 已提交
164 165 166

```
load persistables from increment_model/3
O
fix bug  
overlordmax 已提交
167 168 169 170 171 172 173 174
batch: 20, AUC: [0.85304064], BATCH_AUC: [0.94178556]
batch: 40, AUC: [0.85304544], BATCH_AUC: [0.95207907]
batch: 60, AUC: [0.85303907], BATCH_AUC: [0.94782551]
batch: 80, AUC: [0.85298773], BATCH_AUC: [0.93987691]
...
batch: 1780, AUC: [0.866046], BATCH_AUC: [0.96424594]
batch: 1800, AUC: [0.86633785], BATCH_AUC: [0.96900967]
batch: 1820, AUC: [0.86662365], BATCH_AUC: [0.96759972]
O
overlordmax 已提交
175 176
```

F
frankwhzhang 已提交
177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
## 论文复现

用原论文的完整数据复现论文效果需要在config.yaml中修改batch_size=1000, thread_num=8, epoch_num=4

使用gpu p100 单卡训练 60h 测试auc:0.79


修改后运行方案:修改config.yaml中的'workspace'为config.yaml的目录位置,执行
```
python -m paddlerec.run -m /home/your/dir/config.yaml #调试模式 直接指定本地config的绝对路径
```

## 进阶使用

## FAQ