# wide&deep

The directory structure of this example is briefly described below:

```
├── data # data
    ├── sample_data # sample data
        ├── train # training data
            ├── train_data.txt
    ├── create_data.sh # data download script
    ├── data_preparation.py # data processing program
    ├── get_slot_data.py # data processing program
    ├── run.sh # one-step data download script
    ├── args.py # script arguments
├── __init__.py 
├── config.yaml # configuration file
├── model.py # model file
├── README.md # documentation
```

Note: before reading this example, we recommend that you first go through the following:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)

## Contents

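- [Model introduction](#model-introduction)
- [Data preparation](#data-preparation)
- [Runtime environment](#runtime-environment)
- [Quick start](#quick-start)
- [Paper reproduction](#paper-reproduction)
- [Advanced usage](#advanced-usage)
- [FAQ](#faq)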

## Model introduction

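[Wide & Deep Learning for Recommender Systems](https://arxiv.org/pdf/1606.07792.pdf) is a recommendation framework published by Google in 2016. Wide & Deep jointly trains a shallow (wide) model and a deep model, combining the memorization ability of the shallow model with the generalization ability of the deep model, so that a single model balances accuracy and scalability in a recommender system. It is evaluated on both recommendation quality and serving performance:

1. Quality: in an online A/B experiment on Google Play, the Wide & Deep model raised the app download rate by 3.9% relative to a highly optimized wide-only model, and it also improved on the deep-only model.
2. Performance: the batch of apps to be scored for one request is split into smaller batches that are processed in parallel with multiple threads, which improves throughput. The latency of a single response dropped from 31 ms to 14 ms.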
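To make the joint structure concrete, here is a minimal pure-Python sketch of a Wide & Deep forward pass: the wide (linear) logit and the deep (feed-forward) logit are summed and passed through a single sigmoid, as described in the paper. The parameter names, the one-hidden-layer network, and the toy values are assumptions made for illustration; this is not the code in model.py.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(v):
    """Element-wise ReLU over a list of floats."""
    return [max(0.0, x) for x in v]


def dense(x, weights, bias):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def wide_deep_predict(wide_x, deep_x, params):
    # Wide part: a linear model over raw and cross-product features.
    wide_logit = sum(w * x for w, x in zip(params["w_wide"], wide_x))
    # Deep part: a small feed-forward network over dense or embedded features.
    hidden = relu(dense(deep_x, params["W1"], params["b1"]))
    deep_logit = sum(w * h for w, h in zip(params["w_deep"], hidden))
    # Joint model: both logits are summed before one shared sigmoid.
    return sigmoid(wide_logit + deep_logit + params["bias"])


# Hypothetical toy parameters, chosen only to make the example runnable.
params = {
    "w_wide": [0.5, -0.25],
    "W1": [[1.0, 0.0], [0.0, 1.0]],
    "b1": [0.0, 0.0],
    "w_deep": [0.25, 0.25],
    "bias": 0.0,
}
prob = wide_deep_predict([1.0, 2.0], [2.0, -4.0], params)
print(round(prob, 4))
```

Because both parts feed one loss through the shared sigmoid, gradients update the wide and deep weights together, which is what distinguishes joint training from an ensemble of two separately trained models.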

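To verify accuracy, see the [Paper reproduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#论文复现) section.

Features supported by this project:

Training: single-machine CPU, single-machine single-GPU, single-machine multi-GPU, locally simulated parameter-server training, and incremental training. For configuration, see [Start training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).

Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).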

## Data preparation

This example implements Wide & Deep on PaddleRec and validates the model on the open-source Census-income dataset.

Data sources:

[adult.data](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data)

[adult.test](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test)

Enter the models/rank/wide_deep/data directory and run the one-step data generation script run.sh to fetch the data.
```
sh run.sh
```
This example uses the pandas library. If it is not already installed in your environment, install it with the command pip install pandas.  

Sample output:
```
--2020-09-27 16:57:38--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Resolving archive.ics.uci.edu... 128.195.10.252
Connecting to archive.ics.uci.edu|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3974305 (3.8M) [application/x-httpd-php]
Saving to: data/adult.data

100%[===================================================================================================================>] 3,974,305   12.6K/s   in 6m 17s

2020-09-27 17:03:57 (10.3 KB/s) - data/adult.data saved [3974305/3974305]

--2020-09-27 17:03:57--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
Resolving archive.ics.uci.edu... 128.195.10.252
Connecting to archive.ics.uci.edu|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2003153 (1.9M) [application/x-httpd-php]
Saving to: data/adult.test

100%[==================================================================================================================>] 2,003,153   12.7K/s   in 51s

2020-09-27 17:08:04 (13.5 KB/s) - data/adult.test saved [2003153/2003153]
```
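For orientation, the sketch below shows one way to read rows in the adult.data format and turn the income column into a 0/1 label. It uses the standard csv module rather than pandas and is not the logic in data_preparation.py; the function name parse_adult and the two sample rows are illustrative assumptions, while the column names follow the UCI dataset description.

```python
import csv
import io

# Column names as given in the UCI Adult (Census-income) dataset description.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket",
]


def parse_adult(text):
    """Parse adult.data / adult.test style text into (features, label) pairs."""
    samples = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank lines and non-data lines such as the adult.test header
        record = dict(zip(COLUMNS, row))
        # Labels in adult.test end with a period (">50K."), so strip it first.
        label = 1 if record.pop("income_bracket").rstrip(".") == ">50K" else 0
        samples.append((record, label))
    return samples


# Two illustrative rows in the adult.data format (not the full dataset).
sample_text = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)
samples = parse_adult(sample_text)
print([label for _, label in samples])
```

All feature values are kept as strings here; converting the numeric columns and encoding the categorical ones is left to the actual preprocessing scripts.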

## Runtime environment

PaddlePaddle>=1.7.2

python 2.7/3.5/3.6/3.7

PaddleRec >=0.1

os : windows/linux/macos

## Quick start
Sample data is provided so that you can try the model quickly. Run the following command in the paddlerec directory to start training:

```
python -m paddlerec.run -m models/rank/wide_deep/config.yaml
```
Sample output from a quick run on the sample data:
```
PaddleRec: Runner train_runner Begin
Executor Mode: train
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Warning:please make sure there are no hidden files in the dataset folder and check these hidden files:[]
Running SingleStartup.
Running SingleRunner.
I0927 17:16:18.305258  3437 parallel_executor.cc:440] The Program will be executed on CPU using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0927 17:16:18.310783  3437 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0927 17:16:18.314724  3437 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0927 17:16:18.317752  3437 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
2020-09-27 17:16:18,475-INFO:  [Train] batch: 20, time_each_interval: 0.18s, ACC: [0.6], BATCH_AUC: [0.41666667], AUC: [0.61538462]
2020-09-27 17:16:18,583-INFO:  [Train] batch: 40, time_each_interval: 0.11s, ACC: [0.8], BATCH_AUC: [0.875], AUC: [0.59693471]
2020-09-27 17:16:18,625-INFO:  [Train] batch: 60, time_each_interval: 0.04s, ACC: [0.4], BATCH_AUC: [1.], AUC: [0.59405999]
2020-09-27 17:16:18,666-INFO:  [Train] batch: 80, time_each_interval: 0.04s, ACC: [0.8], BATCH_AUC: [0.5], AUC: [0.56687606]
epoch 0 done, use time: 0.503633022308, global metrics: ACC=[1.], BATCH_AUC=[0.], AUC=[0.56696623]
PaddleRec Finish
```
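If you want to track these metrics programmatically, a small parser such as the sketch below can pull the batch number and metric values out of each log line. The regular expressions are written against the log format shown in this README and are an assumption; other PaddleRec versions may format their output differently.

```python
import re

# Matches "NAME: [value]" pairs such as "AUC: [0.61538462]" or "BATCH_AUC: [1.]".
METRIC_RE = re.compile(r"(\w+): \[([0-9.eE+-]+)\]")
# Matches the phase tag and batch counter, e.g. "[Train] batch: 20".
BATCH_RE = re.compile(r"\[(Train|Infer)\] batch: (\d+)")


def parse_metrics(line):
    """Return (phase, batch, {metric: value}) for a batch log line, else None."""
    head = BATCH_RE.search(line)
    if head is None:
        return None
    metrics = {name: float(value) for name, value in METRIC_RE.findall(line)}
    return head.group(1), int(head.group(2)), metrics


log = """\
2020-09-27 17:16:18,475-INFO:  [Train] batch: 20, time_each_interval: 0.18s, ACC: [0.6], BATCH_AUC: [0.41666667], AUC: [0.61538462]
2020-09-27 17:16:18,625-INFO:  [Train] batch: 60, time_each_interval: 0.04s, ACC: [0.4], BATCH_AUC: [1.], AUC: [0.59405999]
PaddleRec Finish
"""

parsed = [m for m in map(parse_metrics, log.splitlines()) if m is not None]
for phase, batch, metrics in parsed:
    print(phase, batch, metrics["AUC"])
```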

## Paper reproduction

To help users get every model running quickly, sample data is provided with each model. To reproduce the results reported in this README, follow the steps below in order.
The model's metrics on the full dataset are:  

| Model | auc | acc | batch_size | thread_num | epoch_num | Time of each epoch |
| :------| :------ | :------ | :------| :------ | :------| :------ | 
| wide_deep | 0.8987 | 0.775 | 40 | 1 | 80 | about 10 s |
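As a reminder of what the auc and acc columns measure, the following sketch computes both from a list of labels and predicted scores. It uses the exact pairwise definition of AUC and a fixed 0.5 threshold for accuracy; both choices are assumptions made for illustration, and the framework may compute these metrics differently (for example with a bucketed approximation).

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    hits = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return hits / len(labels)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores, made up for this example.
labels = [1, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.2]
print(accuracy(labels, scores), auc(labels, scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a toy example but not for the full dataset.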

1. Make sure your current directory is PaddleRec/models/rank/wide_deep
2. Run the one-step data processing script in the data directory:  
``` 
cd data
sh run.sh
cd ..
```
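3. Return to the wide_deep directory, open config.yaml, and change the following parameters:  
Set workspace to your current absolute path (the pwd command prints it)  
In train_sample, change batch_size from 5 to 40  
In train_sample, change data_path to {workspace}/data/slot_train_data  
In infer_sample, change batch_size from 5 to 40  
In infer_sample, change data_path to {workspace}/data/slot_test_data  
In train_runner, change epochs to 80  
In infer_runner, change init_model_path to increment/79  
4. Run the command below. The model trains for 80 epochs, then runs inference with the model saved after the last epoch and reports the corresponding auc and acc metrics  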
```
python -m paddlerec.run -m ./config.yaml
```
5. Sample inference output after training on the full dataset:
```
PaddleRec: Runner infer_runner Begin
Executor Mode: infer
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Warning:please make sure there are no hidden files in the dataset folder and check these hidden files:[]
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment/79
2020-09-27 17:37:17,679-INFO:  [Infer] batch: 20, time_each_interval: 0.77s, ACC: [0.8], AUC: [0.89880283]
2020-09-27 17:37:18,452-INFO:  [Infer] batch: 40, time_each_interval: 0.77s, ACC: [0.825], AUC: [0.89879974]
2020-09-27 17:37:19,023-INFO:  [Infer] batch: 60, time_each_interval: 0.57s, ACC: [0.7], AUC: [0.89880376]
2020-09-27 17:37:19,591-INFO:  [Infer] batch: 80, time_each_interval: 0.57s, ACC: [0.925], AUC: [0.89879592]
2020-09-27 17:37:20,195-INFO:  [Infer] batch: 100, time_each_interval: 0.60s, ACC: [0.725], AUC: [0.89879213]
2020-09-27 17:37:20,822-INFO:  [Infer] batch: 120, time_each_interval: 0.63s, ACC: [0.775], AUC: [0.89879757]
2020-09-27 17:37:21,303-INFO:  [Infer] batch: 140, time_each_interval: 0.48s, ACC: [0.775], AUC: [0.89879296]
2020-09-27 17:37:21,798-INFO:  [Infer] batch: 160, time_each_interval: 0.49s, ACC: [0.875], AUC: [0.89879267]
2020-09-27 17:37:22,265-INFO:  [Infer] batch: 180, time_each_interval: 0.47s, ACC: [0.85], AUC: [0.89879272]
2020-09-27 17:37:22,835-INFO:  [Infer] batch: 200, time_each_interval: 0.57s, ACC: [0.725], AUC: [0.89878928]
2020-09-27 17:37:23,364-INFO:  [Infer] batch: 220, time_each_interval: 0.53s, ACC: [0.825], AUC: [0.89878807]
2020-09-27 17:37:23,859-INFO:  [Infer] batch: 240, time_each_interval: 0.49s, ACC: [0.7], AUC: [0.8987825]
2020-09-27 17:37:24,337-INFO:  [Infer] batch: 260, time_each_interval: 0.48s, ACC: [0.775], AUC: [0.89878314]
2020-09-27 17:37:24,877-INFO:  [Infer] batch: 280, time_each_interval: 0.54s, ACC: [0.875], AUC: [0.89877827]
2020-09-27 17:37:25,410-INFO:  [Infer] batch: 300, time_each_interval: 0.53s, ACC: [0.75], AUC: [0.89877518]
2020-09-27 17:37:25,985-INFO:  [Infer] batch: 320, time_each_interval: 0.57s, ACC: [0.75], AUC: [0.89876936]
2020-09-27 17:37:26,447-INFO:  [Infer] batch: 340, time_each_interval: 0.46s, ACC: [0.775], AUC: [0.89876268]
2020-09-27 17:37:26,725-INFO:  [Infer] batch: 360, time_each_interval: 0.28s, ACC: [0.75], AUC: [0.8987574]
2020-09-27 17:37:26,889-INFO:  [Infer] batch: 380, time_each_interval: 0.16s, ACC: [0.8], AUC: [0.89874688]
2020-09-27 17:37:27,065-INFO:  [Infer] batch: 400, time_each_interval: 0.18s, ACC: [0.8], AUC: [0.89875484]
Infer infer_phase of epoch increment/79 done, use time: 10.2139520645, global metrics: ACC=[0.775], AUC=[0.89875217]
PaddleRec Finish
```
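The parameter changes in step 3 can also be expressed as a small function over the parsed configuration. The sketch below works on a plain dictionary and assumes a layout in which config.yaml has top-level dataset and runner lists whose entries are identified by a name field; that layout, the function name, and the starting values in sample_config are assumptions for illustration, so check them against your actual config.yaml before relying on this.

```python
import copy


def apply_full_data_overrides(config, workspace):
    """Return a copy of `config` with the step-3 changes applied."""
    cfg = copy.deepcopy(config)
    cfg["workspace"] = workspace
    data_paths = {
        "train_sample": "{workspace}/data/slot_train_data",
        "infer_sample": "{workspace}/data/slot_test_data",
    }
    for dataset in cfg["dataset"]:
        if dataset["name"] in data_paths:
            dataset["batch_size"] = 40
            dataset["data_path"] = data_paths[dataset["name"]]
    for runner in cfg["runner"]:
        if runner["name"] == "train_runner":
            runner["epochs"] = 80
        elif runner["name"] == "infer_runner":
            runner["init_model_path"] = "increment/79"
    return cfg


# Hypothetical stand-in for the parsed config.yaml; the starting values are made up.
sample_config = {
    "workspace": "models/rank/wide_deep",
    "dataset": [
        {"name": "train_sample", "batch_size": 5, "data_path": "{workspace}/data/sample_data/train"},
        {"name": "infer_sample", "batch_size": 5, "data_path": "{workspace}/data/sample_data/train"},
    ],
    "runner": [
        {"name": "train_runner", "epochs": 4},
        {"name": "infer_runner", "init_model_path": "increment/0"},
    ],
}
updated = apply_full_data_overrides(sample_config, "/abs/path/to/PaddleRec/models/rank/wide_deep")
print(updated["runner"])
```

The function returns a modified copy and leaves the input untouched, so the sample-data settings remain available for quick runs.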

## Advanced usage

## FAQ