BaiXuePrincess / PaddleRec
Forked from PaddlePaddle / PaddleRec (in sync with the upstream project)
Commit 5fe475a0
Authored Jun 02, 2020 by wuzhihua
Committed by GitHub on Jun 02, 2020

Merge pull request #23 from yaoxuefeng6/add_lr
add LR in rank

Parents: f07cf23d, 0f3048bd
Showing 9 changed files with 542 additions and 2 deletions (+542 -2)

models/rank/deepfm/data/get_slot_data.py (+4 -2)
models/rank/logistic_regression/__init__.py (+13 -0)
models/rank/logistic_regression/config.yaml (+74 -0)
models/rank/logistic_regression/data/download_preprocess.py (+39 -0)
models/rank/logistic_regression/data/get_slot_data.py (+100 -0)
models/rank/logistic_regression/data/preprocess.py (+114 -0)
models/rank/logistic_regression/data/run.sh (+13 -0)
models/rank/logistic_regression/data/sample_data/train/sample_train.txt (+100 -0)
models/rank/logistic_regression/model.py (+85 -0)
models/rank/deepfm/data/get_slot_data.py @ 5fe475a0
@@ -12,9 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import yaml
+import yaml, os
 from paddlerec.core.reader import Reader
 from paddlerec.core.utils import envs
 import paddle.fluid.incubate.data_generator as dg
 try:
     import cPickle as pickle
 except ImportError:
@@ -44,7 +46,7 @@ class TrainReader(dg.MultiSlotDataGenerator):
         self.continuous_range_ = range(1, 14)
         self.categorical_range_ = range(14, 40)
         # load preprocessed feature dict
-        self.feat_dict_name = "aid_data/feat_dict_10.pkl2"
+        self.feat_dict_name = "sample_data/feat_dict_10.pkl2"
         self.feat_dict_ = pickle.load(open(self.feat_dict_name, 'rb'))

     def _process_line(self, line):
models/rank/logistic_regression/__init__.py (new file, mode 0 → 100755) @ 5fe475a0
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
models/rank/logistic_regression/config.yaml (new file, mode 0 → 100755) @ 5fe475a0
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# global settings
debug: false
workspace: "paddlerec.models.rank.deepfm"

dataset:
- name: train_sample
  type: QueueDataset
  batch_size: 5
  data_path: "{workspace}/data/sample_data/train"
  sparse_slots: "label feat_idx"
  dense_slots: "feat_value:39"
- name: infer_sample
  type: QueueDataset
  batch_size: 5
  data_path: "{workspace}/data/sample_data/train"
  sparse_slots: "label feat_idx"
  dense_slots: "feat_value:39"

hyper_parameters:
  optimizer:
    class: SGD
    learning_rate: 0.0001
  sparse_feature_number: 1086460
  sparse_feature_dim: 9
  num_field: 39
  reg: 0.001

mode: train_runner
# if infer, change mode to "infer_runner" and change phase to "infer_phase"

runner:
- name: train_runner
  trainer_class: single_train
  epochs: 2
  device: cpu
  init_model_path: ""
  save_checkpoint_interval: 1
  save_inference_interval: 1
  save_checkpoint_path: "increment"
  save_inference_path: "inference"
  print_interval: 1
- name: infer_runner
  trainer_class: single_infer
  epochs: 1
  device: cpu
  init_model_path: "increment/0"
  print_interval: 1

phase:
- name: phase1
  model: "{workspace}/model.py"
  dataset_name: train_sample
  thread_num: 1
#- name: infer_phase
#  model: "{workspace}/model.py"
#  dataset_name: infer_sample
#  thread_num: 1
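
Reviewer note: the `sparse_slots` and `dense_slots` strings determine the order in which model.py later indexes `_sparse_data_var` and `_dense_data_var` (label first, then `feat_idx`; `feat_value` as the only dense slot). The sketch below shows one plausible reading of those strings. It assumes a whitespace-separated list of names and a `name:dim` convention for dense slots. `parse_slots` is a hypothetical helper written for illustration, not PaddleRec's actual parser.

```python
def parse_slots(sparse_slots, dense_slots):
    """Split slot declarations into ordered names and dense dimensions.

    Hypothetical helper: assumes sparse slots are whitespace-separated
    names and dense slots use a "name:dim" form.
    """
    sparse = sparse_slots.split()
    dense = {}
    for item in dense_slots.split():
        name, _, dim = item.partition(":")
        # Assume a dimension of 1 when none is given.
        dense[name] = int(dim) if dim else 1
    return sparse, dense


sparse, dense = parse_slots("label feat_idx", "feat_value:39")
print(sparse)  # position 0 is the label slot, position 1 is feat_idx
print(dense)   # feat_value carries 39 values, matching num_field: 39
```

Under this reading, the dense dimension 39 has to agree with `num_field`, since the model reshapes `feat_value` to `[-1, num_field]`.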
models/rank/logistic_regression/data/download_preprocess.py (new file, mode 0 → 100755) @ 5fe475a0
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import shutil
import sys

LOCAL_PATH = os.path.dirname(os.path.abspath(__file__))
TOOLS_PATH = os.path.join(LOCAL_PATH, "..", "..", "tools")
sys.path.append(TOOLS_PATH)

from paddlerec.tools.tools import download_file_and_uncompress, download_file

if __name__ == '__main__':
    url = "https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz"
    url2 = "https://paddlerec.bj.bcebos.com/deepfm%2Ffeat_dict_10.pkl2"

    print("download and extract starting...")
    download_file_and_uncompress(url)
    download_file(url2, "./aid_data/feat_dict_10.pkl2", True)
    print("download and extract finished")

    print("preprocessing...")
    os.system("python preprocess.py")
    print("preprocess done")

    shutil.rmtree("raw_data")
    print("done")
models/rank/logistic_regression/data/get_slot_data.py (new file, mode 0 → 100755) @ 5fe475a0
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import yaml
import os

from paddlerec.core.reader import Reader
from paddlerec.core.utils import envs
import paddle.fluid.incubate.data_generator as dg

try:
    import cPickle as pickle
except ImportError:
    import pickle


class TrainReader(dg.MultiSlotDataGenerator):
    def __init__(self, config):
        dg.MultiSlotDataGenerator.__init__(self)

        if os.path.isfile(config):
            with open(config, 'r') as rb:
                _config = yaml.load(rb.read(), Loader=yaml.FullLoader)
        else:
            raise ValueError("reader config only support yaml")

    def init(self):
        self.cont_min_ = [0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        self.cont_max_ = [
            5775, 257675, 65535, 969, 23159456, 431037, 56311, 6047, 29019, 46,
            231, 4008, 7393
        ]
        self.cont_diff_ = [
            self.cont_max_[i] - self.cont_min_[i]
            for i in range(len(self.cont_min_))
        ]
        self.continuous_range_ = range(1, 14)
        self.categorical_range_ = range(14, 40)
        # load preprocessed feature dict
        self.feat_dict_name = "sample_data/feat_dict_10.pkl2"
        self.feat_dict_ = pickle.load(open(self.feat_dict_name, 'rb'))

    def _process_line(self, line):
        features = line.rstrip('\n').split('\t')
        feat_idx = []
        feat_value = []
        for idx in self.continuous_range_:
            if features[idx] == '':
                feat_idx.append(0)
                feat_value.append(0.0)
            else:
                feat_idx.append(self.feat_dict_[idx])
                feat_value.append(
                    (float(features[idx]) - self.cont_min_[idx - 1]) /
                    self.cont_diff_[idx - 1])
        for idx in self.categorical_range_:
            if features[idx] == '' or features[idx] not in self.feat_dict_:
                feat_idx.append(0)
                feat_value.append(0.0)
            else:
                feat_idx.append(self.feat_dict_[features[idx]])
                feat_value.append(1.0)
        label = [int(features[0])]
        return feat_idx, feat_value, label

    def generate_sample(self, line):
        """
        Read the data line by line and process it as a dictionary
        """

        def data_iter():
            feat_idx, feat_value, label = self._process_line(line)
            s = ""
            for i in [('feat_idx', feat_idx), ('feat_value', feat_value),
                      ('label', label)]:
                k = i[0]
                v = i[1]
                for j in v:
                    s += " " + k + ":" + str(j)
            # print() call form works on both Python 2 and Python 3
            print(s.strip())
            yield None

        return data_iter


reader = TrainReader("../config.yaml")  # run this file in original folder to find config.yaml
reader.init()
reader.run_from_stdin()
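
Reviewer note: the reader's per-line logic can be tried without Paddle or the pickled dictionary. The following is a minimal standalone sketch of the same encoding: min-max scaling for continuous columns, a dictionary lookup with index 0 for missing or unknown categorical values, and `name:value` output tokens. `encode_line` and the toy dictionary and ranges are made up for illustration. The real reader uses 13 continuous and 26 categorical columns.

```python
def encode_line(line, feat_dict, cont_min, cont_max):
    """Encode one tab-separated record as a slot-format string.

    Column 0 is the label, the next len(cont_min) columns are continuous,
    and the remaining columns are categorical.
    """
    features = line.rstrip("\n").split("\t")
    n_cont = len(cont_min)
    feat_idx, feat_value = [], []

    for col in range(1, n_cont + 1):
        if features[col] == "":
            # Missing continuous value: index 0, value 0.0
            feat_idx.append(0)
            feat_value.append(0.0)
        else:
            span = cont_max[col - 1] - cont_min[col - 1]
            feat_idx.append(feat_dict[col])
            feat_value.append((float(features[col]) - cont_min[col - 1]) / span)

    for col in range(n_cont + 1, len(features)):
        token = features[col]
        if token == "" or token not in feat_dict:
            # Missing or infrequent category: index 0, value 0.0
            feat_idx.append(0)
            feat_value.append(0.0)
        else:
            feat_idx.append(feat_dict[token])
            feat_value.append(1.0)

    label = [int(features[0])]
    parts = []
    for name, values in (("feat_idx", feat_idx), ("feat_value", feat_value),
                         ("label", label)):
        parts.extend("%s:%s" % (name, v) for v in values)
    return " ".join(parts)


# Toy setup: 2 continuous columns, 2 categorical columns.
toy_dict = {1: 1, 2: 2, "a": 3}
print(encode_line("1\t5\t\ta\tzz\n", toy_dict, [0, -3], [10, 7]))
```

The output groups all `feat_idx` tokens first, then all `feat_value` tokens, then the label, which is the order the reader in this commit prints them.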
models/rank/logistic_regression/data/preprocess.py (new file, mode 0 → 100755) @ 5fe475a0
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy
from collections import Counter
import shutil
import pickle


def get_raw_data():
    if not os.path.isdir('raw_data'):
        os.mkdir('raw_data')

    fin = open('train.txt', 'r')
    fout = open('raw_data/part-0', 'w')
    for line_idx, line in enumerate(fin):
        if line_idx % 200000 == 0 and line_idx != 0:
            fout.close()
            cur_part_idx = int(line_idx / 200000)
            fout = open('raw_data/part-' + str(cur_part_idx), 'w')
        fout.write(line)
    fout.close()
    fin.close()


def split_data():
    split_rate_ = 0.9
    dir_train_file_idx_ = 'aid_data/train_file_idx.txt'
    filelist_ = [
        'raw_data/part-%d' % x for x in range(len(os.listdir('raw_data')))
    ]

    if not os.path.exists(dir_train_file_idx_):
        train_file_idx = list(
            numpy.random.choice(
                len(filelist_), int(len(filelist_) * split_rate_), False))
        with open(dir_train_file_idx_, 'w') as fout:
            fout.write(str(train_file_idx))
    else:
        with open(dir_train_file_idx_, 'r') as fin:
            train_file_idx = eval(fin.read())

    for idx in range(len(filelist_)):
        if idx in train_file_idx:
            shutil.move(filelist_[idx], 'train_data')
        else:
            shutil.move(filelist_[idx], 'test_data')


def get_feat_dict():
    freq_ = 10
    dir_feat_dict_ = 'aid_data/feat_dict_' + str(freq_) + '.pkl2'
    continuous_range_ = range(1, 14)
    categorical_range_ = range(14, 40)

    if not os.path.exists(dir_feat_dict_):
        # Count the number of occurrences of discrete features
        feat_cnt = Counter()
        with open('train.txt', 'r') as fin:
            for line_idx, line in enumerate(fin):
                if line_idx % 100000 == 0:
                    print('generating feature dict', line_idx / 45000000)
                features = line.rstrip('\n').split('\t')
                for idx in categorical_range_:
                    if features[idx] == '':
                        continue
                    feat_cnt.update([features[idx]])

        # Only retain discrete features with high frequency
        dis_feat_set = set()
        for feat, ot in feat_cnt.items():
            if ot >= freq_:
                dis_feat_set.add(feat)

        # Create a dictionary for continuous and discrete features
        feat_dict = {}
        tc = 1
        # Continuous features
        for idx in continuous_range_:
            feat_dict[idx] = tc
            tc += 1
        for feat in dis_feat_set:
            feat_dict[feat] = tc
            tc += 1
        # Save dictionary
        with open(dir_feat_dict_, 'wb') as fout:
            pickle.dump(feat_dict, fout, protocol=2)
        print('args.num_feat ', len(feat_dict) + 1)


if __name__ == '__main__':
    if not os.path.isdir('train_data'):
        os.mkdir('train_data')
    if not os.path.isdir('test_data'):
        os.mkdir('test_data')
    if not os.path.isdir('aid_data'):
        os.mkdir('aid_data')

    get_raw_data()
    split_data()
    get_feat_dict()

    print('Done!')
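
Reviewer note: `get_feat_dict` assigns ids 1..13 to the continuous columns and then one id per categorical value seen at least `freq_` times, leaving id 0 free for missing or rare values. A minimal in-memory sketch of that scheme follows. `build_feat_dict` and the toy rows are hypothetical, and unlike the script it sorts the retained values so that the ids are reproducible (the script iterates over a set, so its id order is not fixed).

```python
from collections import Counter


def build_feat_dict(rows, n_cont, min_freq):
    """Map continuous column numbers and frequent categorical values to ids.

    rows: lists of strings, column 0 = label, next n_cont = continuous,
    the rest = categorical. Id 0 is never assigned.
    """
    counts = Counter()
    for row in rows:
        for token in row[1 + n_cont:]:
            if token != "":
                counts[token] += 1

    feat_dict = {}
    next_id = 1
    # Continuous columns are keyed by their column number.
    for col in range(1, n_cont + 1):
        feat_dict[col] = next_id
        next_id += 1
    # Categorical values are keyed by the raw string; rare ones are dropped.
    for token in sorted(t for t, c in counts.items() if c >= min_freq):
        feat_dict[token] = next_id
        next_id += 1
    return feat_dict


rows = [
    ["1", "0.5", "a", "b"],
    ["0", "", "a", ""],
    ["1", "2", "c", "b"],
]
feat_dict = build_feat_dict(rows, n_cont=1, min_freq=2)
print(feat_dict)
print("embedding rows needed:", len(feat_dict) + 1)
```

The final count mirrors the script's `len(feat_dict) + 1`, where the extra row accounts for the reserved id 0.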
models/rank/logistic_regression/data/run.sh (new file, mode 0 → 100644) @ 5fe475a0
python download_preprocess.py

mkdir slot_train_data
for i in `ls ./train_data`
do
    cat train_data/$i | python get_slot_data.py > slot_train_data/$i
done

mkdir slot_test_data
for i in `ls ./test_data`
do
    cat test_data/$i | python get_slot_data.py > slot_test_data/$i
done
models/rank/logistic_regression/data/sample_data/train/sample_train.txt (new file, mode 0 → 100644) @ 5fe475a0
This diff is collapsed.
models/rank/logistic_regression/model.py (new file, mode 0 → 100755) @ 5fe475a0
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math

import paddle.fluid as fluid

from paddlerec.core.utils import envs
from paddlerec.core.model import Model as ModelBase


class Model(ModelBase):
    def __init__(self, config):
        ModelBase.__init__(self, config)

    def _init_hyper_parameters(self):
        self.sparse_feature_number = envs.get_global_env(
            "hyper_parameters.sparse_feature_number", None)
        self.num_field = envs.get_global_env("hyper_parameters.num_field",
                                             None)
        self.reg = envs.get_global_env("hyper_parameters.reg", 1e-4)

    def net(self, inputs, is_infer=False):
        init_value_ = 0.1
        is_distributed = True if envs.get_trainer() == "CtrTrainer" else False

        # ------------------------- network input --------------------------

        raw_feat_idx = self._sparse_data_var[1]
        raw_feat_value = self._dense_data_var[0]
        self.label = self._sparse_data_var[0]

        feat_idx = raw_feat_idx
        feat_value = fluid.layers.reshape(
            raw_feat_value, [-1, self.num_field])  # None * num_field

        first_weights_re = fluid.embedding(
            input=feat_idx,
            is_sparse=True,
            is_distributed=is_distributed,
            dtype='float32',
            size=[self.sparse_feature_number + 1, 1],
            padding_idx=0,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.TruncatedNormalInitializer(
                    loc=0.0, scale=init_value_),
                regularizer=fluid.regularizer.L1DecayRegularizer(self.reg)))
        first_weights = fluid.layers.reshape(
            first_weights_re, shape=[-1, self.num_field])  # None * num_field
        y_first_order = fluid.layers.reduce_sum(
            first_weights * feat_value, 1, keep_dim=True)

        b_linear = fluid.layers.create_parameter(
            shape=[1],
            dtype='float32',
            default_initializer=fluid.initializer.ConstantInitializer(value=0))

        self.predict = fluid.layers.sigmoid(y_first_order + b_linear)

        cost = fluid.layers.log_loss(
            input=self.predict, label=fluid.layers.cast(self.label, "float32"))
        avg_cost = fluid.layers.reduce_sum(cost)

        self._cost = avg_cost

        predict_2d = fluid.layers.concat([1 - self.predict, self.predict], 1)
        label_int = fluid.layers.cast(self.label, 'int64')
        auc_var, batch_auc_var, _ = fluid.layers.auc(input=predict_2d,
                                                     label=label_int,
                                                     slide_steps=0)
        self._metrics["AUC"] = auc_var
        self._metrics["BATCH_AUC"] = batch_auc_var
        if is_infer:
            self._infer_results["AUC"] = auc_var
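
Reviewer note: stripped of the Paddle graph API, the network is a one-weight-per-feature lookup, a weighted sum over the fields, a bias, and a sigmoid. The commit sums the per-sample log loss (`reduce_sum`) even though the variable is named `avg_cost`. A minimal pure-Python sketch of that forward pass is below. `lr_forward` is a hypothetical name, the smoothing term `eps=1e-4` is assumed to match the default of `fluid.layers.log_loss`, and row 0 of the weight list is held at zero to mimic `padding_idx=0`. Regularization, AUC, and training are left out.

```python
import math


def lr_forward(weights, bias, feat_idx, feat_value, labels, eps=1e-4):
    """Return (predictions, summed log loss) for a batch.

    weights: one scalar weight per feature id (weights[0] should be 0.0).
    feat_idx, feat_value: per-sample lists with one entry per field.
    labels: 0 or 1 per sample.
    """
    predictions = []
    total_loss = 0.0
    for idx_row, val_row, y in zip(feat_idx, feat_value, labels):
        # First-order term: sum over fields of w[id] * value, plus bias.
        logit = bias + sum(weights[i] * v for i, v in zip(idx_row, val_row))
        p = 1.0 / (1.0 + math.exp(-logit))
        total_loss += -y * math.log(p + eps) - (1 - y) * math.log(1 - p + eps)
        predictions.append(p)
    return predictions, total_loss


weights = [0.0, 0.5, -1.0]  # id 0 is the padding row
preds, loss = lr_forward(
    weights,
    bias=0.0,
    feat_idx=[[1, 2], [0, 1]],
    feat_value=[[1.0, 0.5], [0.0, 2.0]],
    labels=[0, 1],
)
print(preds, loss)
```

Because the loss is a sum, its magnitude grows with `batch_size`, which matters when comparing the logged cost across runs with different batch sizes.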