PaddlePaddle / PaddleRec
Commit 1cded004 (unverified)
Authored Aug 26, 2020 by tangwei12; committed Aug 26, 2020 by GitHub

Merge branch 'master' into multiview-simnet

Parents: c7fd7a75, 03cec6d1
Showing 10 changed files with 530 additions and 221 deletions (+530, -221)
models/recall/youtube_dnn/README.md (+146, -0)
models/recall/youtube_dnn/config.yaml (+15, -15)
models/recall/youtube_dnn/data/test/data.txt (+64, -0)
models/recall/youtube_dnn/data/test/small_data.txt (+0, -100)
models/recall/youtube_dnn/data/train/data.txt (+128, -0)
models/recall/youtube_dnn/data/train/samll_data.txt (+0, -100)
models/recall/youtube_dnn/data_prepare.sh (+1, -0)
models/recall/youtube_dnn/generate_ramdom_data.py (+40, -0)
models/recall/youtube_dnn/infer.py (+126, -0)
models/recall/youtube_dnn/reader.py (+10, -6)
models/recall/youtube_dnn/README.md (new file, mode 0 → 100644) @ 1cded004
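# YouTube-DNN
A brief overview of this example's directory structure:
```
├── data # sample data
    ├── train
        ├── data.txt
    ├── test
        ├── data.txt
├── generate_ramdom_data # script that generates random training data
├── __init__.py
├── README.md # documentation
├── model.py # model file
├── config.yaml # configuration file
├── data_prepare.sh # one-step data preparation script
├── reader.py # reader
├── infer.py # inference program
```
Note: before reading this example, we recommend going through the [PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md).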
---
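## Contents
- [Model Overview](#model-overview)
- [Data Preparation](#data-preparation)
- [Runtime Environment](#runtime-environment)
- [Quick Start](#quick-start)
- [Paper Reproduction](#paper-reproduction)
- [Advanced Usage](#advanced-usage)
- [FAQ](#FAQ)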
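## Model Overview
[Deep Neural Networks for YouTube Recommendations](https://link.zhihu.com/?target=https%3A//static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) describes how the YouTube team at Google applied DNNs to recommender systems. It is a classic embedding-based recall model: the model learns interest vectors for users and items, scores the similarity between a user and an item by their inner product, and uses those scores to produce the final candidate set. YouTube completes the recommendation pipeline with two deep networks:
1. The first stage, the **Candidate Generation Model**, quickly filters candidate videos, reducing the candidate set from millions to hundreds.
2. The second stage, the **Ranking Model**, performs fine-grained ranking of those few hundred candidates.

This project implements the recall part of YouTube DNN, the Candidate Generation Model, on PaddlePaddle. It produces vector representations of users and items, which other methods (such as user-item cosine similarity) can then use to recommend items to users.

Because the original paper did not release a dataset, this project validates the network on randomly generated data.

Supported features:

Training: single-machine CPU, single-machine single-GPU, locally simulated parameter-server training, and incremental training. For configuration, see [Launching Training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).

Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec Offline Inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).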
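
For reference, the paper frames candidate generation as extreme multiclass classification: the probability that video $i$ is watched is a softmax over inner products between a user embedding $u$ and the video embeddings $v_j$ of the corpus $V$. How this maps onto the example is an assumption read off the accompanying scripts, not something the README states: the fetched `l3.tmp_2` output would play the role of $u$, and the columns of the `l4_weight` parameter (shape `[32, 100]`) the role of the $v_j$.

```latex
P(w_t = i \mid U, C) = \frac{e^{v_i^\top u}}{\sum_{j \in V} e^{v_j^\top u}}
```

Here $w_t$ is the video watched at time $t$, $U$ is the user, and $C$ is the context.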
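## Data Preparation
Run `python generate_ramdom_data.py` to generate random training data. Each line has the following format:
```
#watch_vec;search_vec;other_feat;label
0.01,0.02,...,0.09;0.01,0.02,...,0.09;0.01,0.02,...,0.09;20
```
For convenience, a one-step data generation script is provided:
```
sh data_prepare.sh
```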
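
The line format above can be illustrated with a small round trip. This is a minimal sketch, not code from the repository: the helper names `format_line` and `parse_line` are hypothetical, and the two-element vectors are toy values (the example itself uses 64-dimensional vectors).

```python
def format_line(watch_vec, search_vec, other_feat, label):
    """Serialize one sample as watch_vec;search_vec;other_feat;label."""
    groups = [watch_vec, search_vec, other_feat]
    # Values inside a group are comma-separated; groups are semicolon-separated.
    fields = [",".join(str(x) for x in group) for group in groups]
    return ";".join(fields + [str(label)])


def parse_line(line):
    """Split a line back into three float lists and an integer label."""
    watch, search, other, label = line.rstrip().split(";")

    def to_floats(field):
        return [float(x) for x in field.split(",")]

    return to_floats(watch), to_floats(search), to_floats(other), int(label)


line = format_line([0.01, 0.02], [0.03, 0.04], [0.05, 0.06], 20)
print(line)              # 0.01,0.02;0.03,0.04;0.05,0.06;20
print(parse_line(line))  # ([0.01, 0.02], [0.03, 0.04], [0.05, 0.06], 20)
```

Converting the fields to `float` and `int` while parsing is a choice made for this sketch; it makes a malformed line fail early with a `ValueError`.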
## Runtime Environment
PaddlePaddle >= 1.7.2
Python 2.7/3.5/3.6/3.7
PaddleRec >= 0.1
OS: Windows/Linux/macOS
## Quick Start
### Single-Machine Training
```
mode: [cpu_single_train]
runner:
- name: cpu_single_train
  class: train
  device: cpu # if use_gpu, set it to gpu
  epochs: 20
  save_checkpoint_interval: 1
  save_inference_interval: 1
  save_checkpoint_path: "increment_youtubednn"
  save_inference_path: "inference_youtubednn"
  save_inference_feed_varnames: ["watch_vec", "search_vec", "other_feat"] # feed vars of save inference
  save_inference_fetch_varnames: ["l3.tmp_2"]
  print_interval: 1
```
### Single-Machine Inference
Inference computes the cosine similarity between every user and every item, then recommends the top-k videos to each user.

CPU inference:
```
python infer.py --test_epoch 19 --inference_model_dir ./inference_youtubednn --increment_model_dir ./increment_youtubednn --watch_vec_size 64 --search_vec_size 64 --other_feat_size 64 --topk 5
```
GPU inference:
```
python infer.py --use_gpu 1 --test_epoch 19 --inference_model_dir ./inference_youtubednn --increment_model_dir ./increment_youtubednn --watch_vec_size 64 --search_vec_size 64 --other_feat_size 64 --topk 5
```
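
The cosine-similarity ranking step can be sketched in vectorized NumPy. This is a minimal illustration under stated assumptions, not the code in `infer.py`: the function name `topk_videos` and the toy vectors are hypothetical, user vectors are assumed to be stacked one per row, and video vectors are assumed to be stored one per column. The rescaling `0.5 + 0.5 * cos` and the `1e-4` guard mirror the `cos_sim` helper in `infer.py`.

```python
import numpy as np


def topk_videos(user_vecs, video_vecs, k=5, eps=1e-4):
    """Return, for each user, the indices of the k most similar videos.

    user_vecs:  array of shape [num_users, dim], one user per row
    video_vecs: array of shape [dim, num_videos], one video per column
    """
    user_norms = np.linalg.norm(user_vecs, axis=1, keepdims=True)    # [U, 1]
    video_norms = np.linalg.norm(video_vecs, axis=0, keepdims=True)  # [1, V]
    # Cosine similarity of every user with every video; eps avoids 0/0.
    cos = np.dot(user_vecs, video_vecs) / (user_norms * video_norms + eps)
    sim = 0.5 + 0.5 * cos  # rescale from [-1, 1] to [0, 1]
    # Sort each row by descending similarity and keep the first k columns.
    return np.argsort(-sim, axis=1, kind="stable")[:, :k]


users = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
videos = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]])
print(topk_videos(users, videos, k=2))
# user 0 -> videos [0, 2]; user 1 -> videos [1, 2]
```

Sorting indices with `argsort` rather than looking up sorted values keeps tied scores mapped to distinct video indices.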
### Run
```
python -m paddlerec.run -m paddlerec.models.recall.youtube_dnn
```
### Results
Training results on the sample data:
```
Running SingleStartup.
Running SingleRunner.
batch: 1, acc: [0.03125]
batch: 2, acc: [0.0625]
batch: 3, acc: [0.]
...
epoch 0 done, use time: 0.0605320930481, global metrics: acc=[0.]
...
epoch 19 done, use time: 0.33447098732, global metrics: acc=[0.]
```
Inference results on the sample data:
```
user:0, top K videos:[40, 31, 4, 33, 93]
user:1, top K videos:[35, 57, 58, 40, 17]
user:2, top K videos:[35, 17, 88, 40, 9]
user:3, top K videos:[73, 35, 39, 58, 38]
user:4, top K videos:[40, 31, 57, 4, 73]
user:5, top K videos:[38, 9, 7, 88, 22]
user:6, top K videos:[35, 73, 14, 58, 28]
user:7, top K videos:[35, 73, 58, 38, 56]
user:8, top K videos:[38, 40, 9, 35, 99]
user:9, top K videos:[88, 73, 9, 35, 28]
user:10, top K videos:[35, 52, 28, 54, 73]
```
## Advanced Usage
## FAQ
models/recall/youtube_dnn/config.yaml @ 1cded004
@@ -17,11 +17,10 @@ workspace: "models/recall/youtube_dnn"
 dataset:
 - name: dataset_train
-  batch_size: 5
-  type: DataLoader
-  #type: QueueDataset
+  batch_size: 32
+  type: DataLoader # or QueueDataset
   data_path: "{workspace}/data/train"
-  data_converter: "{workspace}/random_reader.py"
+  data_converter: "{workspace}/reader.py"
 hyper_parameters:
   watch_vec_size: 64
@@ -30,22 +29,23 @@ hyper_parameters:
   output_size: 100
   layers: [128, 64, 32]
   optimizer:
-    class: adam
-    learning_rate: 0.001
-    strategy: async
+    class: SGD
+    learning_rate: 0.01
-mode: train_runner
+mode: [cpu_single_train]
 runner:
-- name: train_runner
+- name: cpu_single_train
   class: train
   device: cpu
-  epochs: 3
-  save_checkpoint_interval: 2
-  save_inference_interval: 4
-  save_checkpoint_path: "increment"
-  save_inference_path: "inference"
-  print_interval: 10
+  epochs: 20
+  save_checkpoint_interval: 1
+  save_inference_interval: 1
+  save_checkpoint_path: "increment_youtubednn"
+  save_inference_path: "inference_youtubednn"
+  save_inference_feed_varnames: ["watch_vec", "search_vec", "other_feat"] # feed vars of save inference
+  save_inference_fetch_varnames: ["l3.tmp_2"]
+  print_interval: 1
 phase:
 - name: train
models/recall/youtube_dnn/data/test/data.txt (new file, mode 0 → 100644) @ 1cded004
This diff is collapsed.
models/recall/youtube_dnn/data/test/small_data.txt (deleted, mode 100644 → 0) @ c7fd7a75
4764,174,1
4764,2958,0
4764,452,0
4764,1946,0
4764,3208,0
2044,2237,1
2044,1998,0
2044,328,0
2044,1542,0
2044,1932,0
4276,65,1
4276,3247,0
4276,942,0
4276,3666,0
4276,2222,0
3933,682,1
3933,2451,0
3933,3695,0
3933,1643,0
3933,3568,0
1151,1265,1
1151,118,0
1151,2532,0
1151,2083,0
1151,2350,0
1757,876,1
1757,201,0
1757,3633,0
1757,1068,0
1757,2549,0
3370,276,1
3370,2435,0
3370,606,0
3370,910,0
3370,2146,0
5137,1018,1
5137,2163,0
5137,3167,0
5137,2315,0
5137,3595,0
3933,2831,1
3933,2881,0
3933,2949,0
3933,3660,0
3933,417,0
3102,999,1
3102,1902,0
3102,2161,0
3102,3042,0
3102,1113,0
2022,336,1
2022,1672,0
2022,2656,0
2022,3649,0
2022,883,0
2664,655,1
2664,3660,0
2664,1711,0
2664,3386,0
2664,1668,0
25,701,1
25,32,0
25,2482,0
25,3177,0
25,2767,0
1738,1643,1
1738,2187,0
1738,228,0
1738,650,0
1738,3101,0
5411,1241,1
5411,2546,0
5411,3019,0
5411,3618,0
5411,1674,0
638,579,1
638,3512,0
638,783,0
638,2111,0
638,1880,0
3554,200,1
3554,2893,0
3554,2428,0
3554,969,0
3554,2741,0
4283,1074,1
4283,3056,0
4283,2032,0
4283,405,0
4283,1505,0
5111,200,1
5111,3488,0
5111,477,0
5111,2790,0
5111,40,0
3964,515,1
3964,1528,0
3964,2173,0
3964,1701,0
3964,2832,0
models/recall/youtube_dnn/data/train/data.txt (new file, mode 0 → 100644) @ 1cded004
This diff is collapsed.
models/recall/youtube_dnn/data/train/samll_data.txt (deleted, mode 100644 → 0) @ c7fd7a75
4764,174,1
4764,2958,0
4764,452,0
4764,1946,0
4764,3208,0
2044,2237,1
2044,1998,0
2044,328,0
2044,1542,0
2044,1932,0
4276,65,1
4276,3247,0
4276,942,0
4276,3666,0
4276,2222,0
3933,682,1
3933,2451,0
3933,3695,0
3933,1643,0
3933,3568,0
1151,1265,1
1151,118,0
1151,2532,0
1151,2083,0
1151,2350,0
1757,876,1
1757,201,0
1757,3633,0
1757,1068,0
1757,2549,0
3370,276,1
3370,2435,0
3370,606,0
3370,910,0
3370,2146,0
5137,1018,1
5137,2163,0
5137,3167,0
5137,2315,0
5137,3595,0
3933,2831,1
3933,2881,0
3933,2949,0
3933,3660,0
3933,417,0
3102,999,1
3102,1902,0
3102,2161,0
3102,3042,0
3102,1113,0
2022,336,1
2022,1672,0
2022,2656,0
2022,3649,0
2022,883,0
2664,655,1
2664,3660,0
2664,1711,0
2664,3386,0
2664,1668,0
25,701,1
25,32,0
25,2482,0
25,3177,0
25,2767,0
1738,1643,1
1738,2187,0
1738,228,0
1738,650,0
1738,3101,0
5411,1241,1
5411,2546,0
5411,3019,0
5411,3618,0
5411,1674,0
638,579,1
638,3512,0
638,783,0
638,2111,0
638,1880,0
3554,200,1
3554,2893,0
3554,2428,0
3554,969,0
3554,2741,0
4283,1074,1
4283,3056,0
4283,2032,0
4283,405,0
4283,1505,0
5111,200,1
5111,3488,0
5111,477,0
5111,2790,0
5111,40,0
3964,515,1
3964,1528,0
3964,2173,0
3964,1701,0
3964,2832,0
models/recall/youtube_dnn/data_prepare.sh (new file, mode 0 → 100644) @ 1cded004
python generate_ramdom_data.py
models/recall/youtube_dnn/generate_ramdom_data.py (new file, mode 0 → 100644) @ 1cded004
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import numpy as np

# Build a random data set.
sample_size = 100
batch_size = 32
watch_vec_size = 64
search_vec_size = 64
other_feat_size = 64
output_size = 100

watch_vecs = np.random.rand(batch_size * sample_size, watch_vec_size).tolist()
search_vecs = np.random.rand(batch_size * sample_size,
                             search_vec_size).tolist()
other_vecs = np.random.rand(batch_size * sample_size, other_feat_size).tolist()
labels = np.random.randint(
    output_size, size=(batch_size * sample_size)).tolist()

output_path = "./data/train/data.txt"
with open(output_path, 'w') as fout:
    for i in range(batch_size * sample_size):
        _str_ = ','.join(map(str, watch_vecs[i])) + ";" + ','.join(
            map(str, search_vecs[i])) + ";" + ','.join(
                map(str, other_vecs[i])) + ";" + str(labels[i])
        fout.write(_str_)
        fout.write("\n")
models/recall/youtube_dnn/infer.py (new file, mode 0 → 100644) @ 1cded004
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import copy
import numpy as np
import argparse
import paddle.fluid as fluid
import pandas as pd
from paddle.fluid.incubate.fleet.utils import utils


def parse_args():
    parser = argparse.ArgumentParser("PaddlePaddle Youtube DNN infer example")
    parser.add_argument(
        '--use_gpu', type=int, default='0', help='whether use gpu')
    parser.add_argument(
        "--batch_size", type=int, default=32, help="batch_size")
    parser.add_argument(
        "--test_epoch", type=int, default=19, help="test_epoch")
    parser.add_argument(
        '--inference_model_dir',
        type=str,
        default='./inference_youtubednn',
        help='inference_model_dir')
    parser.add_argument(
        '--increment_model_dir',
        type=str,
        default='./increment_youtubednn',
        help='persistable_model_dir')
    parser.add_argument(
        '--watch_vec_size', type=int, default=64, help='watch_vec_size')
    parser.add_argument(
        '--search_vec_size', type=int, default=64, help='search_vec_size')
    parser.add_argument(
        '--other_feat_size', type=int, default=64, help='other_feat_size')
    parser.add_argument('--topk', type=int, default=5, help='topk')
    args = parser.parse_args()
    return args


def infer(args):
    video_save_path = os.path.join(args.increment_model_dir,
                                   str(args.test_epoch), "l4_weight")
    video_vec, = utils.load_var("l4_weight", [32, 100], 'float32',
                                video_save_path)

    place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
    exe = fluid.Executor(place)
    cur_model_path = os.path.join(args.inference_model_dir,
                                  str(args.test_epoch))

    user_vec = None
    with fluid.scope_guard(fluid.Scope()):
        infer_program, feed_target_names, fetch_vars = fluid.io.load_inference_model(
            cur_model_path, exe)
        # Build a random data set.
        sample_size = 100
        watch_vecs = []
        search_vecs = []
        other_feats = []

        for i in range(sample_size):
            watch_vec = np.random.rand(1, args.watch_vec_size)
            search_vec = np.random.rand(1, args.search_vec_size)
            other_feat = np.random.rand(1, args.other_feat_size)
            watch_vecs.append(watch_vec)
            search_vecs.append(search_vec)
            other_feats.append(other_feat)

        for i in range(sample_size):
            l3 = exe.run(infer_program,
                         feed={
                             "watch_vec": watch_vecs[i].astype('float32'),
                             "search_vec": search_vecs[i].astype('float32'),
                             "other_feat": other_feats[i].astype('float32'),
                         },
                         return_numpy=True,
                         fetch_list=fetch_vars)
            if user_vec is not None:
                user_vec = np.concatenate([user_vec, l3[0]], axis=0)
            else:
                user_vec = l3[0]

    # get topk result
    user_video_sim_list = []
    for i in range(user_vec.shape[0]):
        for j in range(video_vec.shape[1]):
            user_video_sim = cos_sim(user_vec[i], video_vec[:, j])
            user_video_sim_list.append(user_video_sim)

        tmp_list = copy.deepcopy(user_video_sim_list)
        tmp_list.sort()
        max_sim_index = [
            user_video_sim_list.index(one)
            for one in tmp_list[::-1][:args.topk]
        ]

        print("user:{0}, top K videos:{1}".format(i, max_sim_index))
        user_video_sim_list = []


def cos_sim(vector_a, vector_b):
    vector_a = np.mat(vector_a)
    vector_b = np.mat(vector_b)
    num = float(vector_a * vector_b.T)
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    cos = num / (denom + 1e-4)
    sim = 0.5 + 0.5 * cos
    return sim


if __name__ == "__main__":
    args = parse_args()
    infer(args)
models/recall/youtube_dnn/random_reader.py → models/recall/youtube_dnn/reader.py @ 1cded004
@@ -39,13 +39,17 @@ class Reader(ReaderBase):
             """
             This function needs to be implemented by the user, based on data format
             """
+            features = line.rstrip().split(";")
+            watch_vec = features[0].split(',')
+            search_vec = features[1].split(',')
+            other_feat = features[2].split(',')
+            label = features[3]
+            assert (len(watch_vec) == self.watch_vec_size)
+            assert (len(search_vec) == self.search_vec_size)
+            assert (len(other_feat) == self.other_feat_size)
             feature_name = ["watch_vec", "search_vec", "other_feat", "label"]
             yield list(
-                zip(feature_name, [
-                    np.random.rand(self.watch_vec_size).tolist()
-                ] + [np.random.rand(self.search_vec_size).tolist()] + [
-                    np.random.rand(self.other_feat_size).tolist()
-                ] + [[np.random.randint(self.output_size)]]))
+                zip(feature_name, [watch_vec] + [search_vec] + [other_feat] +
+                    [label]))
         return reader