From d02319d8d3ad07d2ce32dfb060e8f6b68d818b1c Mon Sep 17 00:00:00 2001
From: zhang wenhui
Date: Tue, 4 Aug 2020 15:33:01 +0800
Subject: [PATCH] add readme (#164)

* fix readme.md
* fix README.md
* fix README.md
* fix readme
* fix readme
* fix
* fix widedeeep

Co-authored-by: overlordmax <515704170@qq.com>
Co-authored-by: tangwei12

---
 models/multitask/esmm/README.md               | 122 ++++++++++++++
 models/multitask/esmm/data/run.sh             |  26 +++
 models/multitask/mmoe/README.md               | 149 +++++++++++++++++
 .../multitask/mmoe/data/data_preparation.py   | 118 ++++++++++++++
 models/multitask/share-bottom/README.md       | 151 ++++++++++++++++++
 .../share-bottom/data/data_preparation.py     | 118 ++++++++++++++
 models/multitask/share-bottom/data/run.sh     |  16 ++
 models/rank/flen/README.md                    |  95 ++++++-----
 models/rank/flen/data/get_data.py             |  70 ++++++++
 models/rank/wide_deep/README.md               | 127 +++++++++++++++
 10 files changed, 943 insertions(+), 49 deletions(-)
 create mode 100644 models/multitask/esmm/README.md
 create mode 100644 models/multitask/esmm/data/run.sh
 create mode 100644 models/multitask/mmoe/README.md
 create mode 100644 models/multitask/mmoe/data/data_preparation.py
 create mode 100644 models/multitask/share-bottom/README.md
 create mode 100644 models/multitask/share-bottom/data/data_preparation.py
 create mode 100644 models/multitask/share-bottom/data/run.sh
 create mode 100644 models/rank/flen/data/get_data.py
 create mode 100644 models/rank/wide_deep/README.md

diff --git a/models/multitask/esmm/README.md b/models/multitask/esmm/README.md
new file mode 100644
index 00000000..91a1df76
--- /dev/null
+++ b/models/multitask/esmm/README.md
@@ -0,0 +1,122 @@
+# ESMM
+
+Below is a brief overview of this example's directory structure:
+
+```
+├── data # data files
+    ├── train # training data
+        ├── small.txt
+    ├── test # test data
+        ├── small.txt
+    ├── run.sh
+├── __init__.py
+├── config.yaml # configuration file
+├── esmm_reader.py # data reader
+├── model.py # model definition
+```
+
+Note: before reading this example, we recommend that you first go through the following:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
+
+## Contents
+
+- [Model overview](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#模型简介)
+- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#数据准备)
+- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#运行环境)
+- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#快速开始)
+- [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#论文复现)
+- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#进阶使用)
+- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#FAQ)
+
+## Model overview
+
+Unlike CTR estimation, CVR estimation faces two key problems:
+
+1. **Sample Selection Bias (SSB)**: a conversion can only happen after a click. Conventional CVR models are trained on clicked impressions, with clicks that did not convert as negatives and clicks that converted as positives. At serving time, however, the model scores samples from the entire impression space, not just the clicked ones. That is, the training data and the data to be predicted come from different distributions, and this bias poses a serious challenge to the model's ability to generalize.
+2. **Data Sparsity (DS)**: the clicked samples available as CVR training data are far fewer than the impression samples used to train CTR models.
+
+ESMM is the model proposed in the SIGIR'2018 paper [《Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate》](https://arxiv.org/abs/1804.07931). Building on Multi-Task Learning, it introduces a new CVR estimation model that effectively resolves both of the key problems above in real-world CVR estimation: data sparsity and sample selection bias.
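+
+The key idea is that ESMM never fits pCVR on clicked samples directly; it learns pCTR and pCTCVR over the entire impression space and recovers pCVR through the identity pCTCVR = pCTR * pCVR. As a rough illustration, here is a minimal paddle.fluid sketch of that composition (layer sizes and names are illustrative assumptions, not the ones used in model.py):
+
+```python
+import paddle.fluid as fluid
+
+def esmm_heads(features, ctr_label, ctcvr_label):
+    # Two towers over the same shared input representation.
+    ctr_hidden = fluid.layers.fc(input=features, size=64, act='relu')
+    ctr_prob = fluid.layers.fc(input=ctr_hidden, size=1, act='sigmoid')   # pCTR
+    cvr_hidden = fluid.layers.fc(input=features, size=64, act='relu')
+    cvr_prob = fluid.layers.fc(input=cvr_hidden, size=1, act='sigmoid')   # pCVR
+    # pCTCVR = pCTR * pCVR, so both supervised losses are defined on the
+    # full impression space and pCVR is only ever learned implicitly.
+    ctcvr_prob = fluid.layers.elementwise_mul(ctr_prob, cvr_prob)
+    loss = fluid.layers.log_loss(input=ctr_prob, label=ctr_label) + \
+        fluid.layers.log_loss(input=ctcvr_prob, label=ctcvr_label)
+    return fluid.layers.reduce_mean(loss)
+```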
+
+This project implements the ESMM network in PaddlePaddle and validates it on the open-source dataset [Ali-CCP: Alibaba Click and Conversion Prediction](https://tianchi.aliyun.com/datalab/dataSet.html?dataId=408). The default configuration of this model uses the demo dataset; for accuracy verification, see the [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#论文复现) section.
+
+Supported features:
+
+Training: single-machine CPU, single-machine single-GPU, single-machine multi-GPU, locally simulated parameter-server training, and incremental training. For configuration, see [Launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).
+
+Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).
+
+## Data preparation
+
+Dataset: [Ali-CCP: Alibaba Click and Conversion Prediction](https://tianchi.aliyun.com/datalab/dataSet.html?dataId=408)
+
+```
+cd data
+sh run.sh
+```
+
+See the demo data under data/train for the data format.
+
+
+## Runtime environment
+
+PaddlePaddle>=1.7.2
+
+python 2.7/3.5/3.6/3.7
+
+PaddleRec >=0.1
+
+OS: Windows/Linux/macOS
+
+## Quick start
+
+### Single-machine training
+
+CPU environment
+
+Set the device, number of epochs, and other options in config.yaml.
+
+```
+dataset:
+  - name: dataset_train
+    batch_size: 5
+    type: QueueDataset
+    data_path: "{workspace}/data/train"
+    data_converter: "{workspace}/esmm_reader.py"
+  - name: dataset_infer
+    batch_size: 5
+    type: QueueDataset
+    data_path: "{workspace}/data/test"
+    data_converter: "{workspace}/esmm_reader.py"
+```
+
+### Single-machine inference
+
+CPU environment
+
+Set parameters such as epochs and device in config.yaml.
+
+```
+  - name: infer_runner
+    class: infer
+    init_model_path: "increment/1"
+    device: cpu
+    print_interval: 1
+    phases: [infer]
+```
+
+
+## Reproducing the paper
+
+To reproduce the paper's results on the full original dataset, set batch_size=1000, thread_num=8, and epoch_num=4 in config.yaml.
+
+
+After making these changes, set 'workspace' in config.yaml to the directory containing config.yaml, then run:
+
+```
+python -m paddlerec.run -m /home/your/dir/config.yaml # debug mode: pass the absolute path of your local config directly
+```
+
+## Advanced usage
+
+## FAQ
diff --git a/models/multitask/esmm/data/run.sh b/models/multitask/esmm/data/run.sh
new file mode 100644
index 00000000..c5698ffa
--- /dev/null
+++ b/models/multitask/esmm/data/run.sh
@@ -0,0 +1,26 @@
+mkdir train_data
+mkdir test_data
+mkdir vocab
+mkdir data
+train_source_path="./data/sample_train.tar.gz"
+train_target_path="train_data"
+test_source_path="./data/sample_test.tar.gz"
+test_target_path="test_data"
+cd data
+echo "downloading sample_train.tar.gz......"
+curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_train.tar.gz?Expires=1586435769&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=ahUDqhvKT1cGjC4%2FIER2EWtq7o4%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_train.tar.gz
+cd ..
+echo "unzipping sample_train.tar.gz......"
+tar -xzvf ${train_source_path} -C ${train_target_path} && rm -rf ${train_source_path}
+cd data
+echo "downloading sample_test.tar.gz......"
+curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_test.tar.gz?Expires=1586435821&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=OwLMPjt1agByQtRVi8pazsAliNk%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_test.tar.gz
+cd ..
+echo "unzipping sample_test.tar.gz......"
+tar -xzvf ${test_source_path} -C ${test_target_path} && rm -rf ${test_source_path}
+echo "preprocessing data......"
+python reader.py --train_data_path ${train_target_path} \
+                 --test_data_path ${test_target_path} \
+                 --vocab_path vocab/vocab_size.txt \
+                 --train_sample_size 6400 \
+                 --test_sample_size 6400
diff --git a/models/multitask/mmoe/README.md b/models/multitask/mmoe/README.md
new file mode 100644
index 00000000..19f4674d
--- /dev/null
+++ b/models/multitask/mmoe/README.md
@@ -0,0 +1,149 @@
+# MMOE
+
+Below is a brief overview of this example's directory structure:
+
+```
+├── data # data files
+    ├── train # training data
+        ├── train_data.txt
+    ├── test # test data
+        ├── test_data.txt
+    ├── run.sh
+    ├── data_preparation.py
+├── __init__.py
+├── config.yaml # configuration file
+├── census_reader.py # data reader
+├── model.py # model definition
+```
+
+Note: before reading this example, we recommend that you first go through the following:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
+
+## Contents
+
+- [Model overview](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#模型简介)
+- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#数据准备)
+- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#运行环境)
+- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#快速开始)
+- [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现)
+- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#进阶使用)
+- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#FAQ)
+
+## Model overview
+
+By learning what different tasks share and where they differ, a multi-task model can improve the learning efficiency and quality of every task. Multi-task learning frameworks widely adopt the shared-bottom structure, in which tasks share the bottom hidden layers. This structure inherently reduces the risk of overfitting, but its effectiveness can suffer from task differences and data-distribution mismatch. The paper [《Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts》](https://www.kdd.org/kdd2018/accepted-papers/view/modeling-task-relationships-in-multi-task-learning-with-multi-gate-mixture-) proposes the Multi-gate Mixture-of-Experts (MMOE) multi-task learning structure. MMOE captures task relationships and learns task-specific functions on top of shared representations, without a significant increase in parameters.
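+
+Concretely, every task reuses the same pool of expert networks, and a per-task softmax gate decides how the experts are mixed for that task. A minimal paddle.fluid sketch of the idea (layer counts and sizes are illustrative assumptions, not those of the shipped model.py):
+
+```python
+import paddle.fluid as fluid
+
+def mmoe_block(x, num_experts=8, expert_size=16, tower_size=8, num_tasks=2):
+    # Shared experts: small MLPs over the same input features.
+    experts = [fluid.layers.fc(input=x, size=expert_size, act='relu')
+               for _ in range(num_experts)]
+    expert_stack = fluid.layers.stack(experts, axis=1)   # [batch, E, expert_size]
+    task_outputs = []
+    for _ in range(num_tasks):
+        # Per-task gate: a softmax over the experts, used to mix their outputs.
+        gate = fluid.layers.fc(input=x, size=num_experts, act='softmax')
+        gate = fluid.layers.unsqueeze(gate, axes=[1])    # [batch, 1, E]
+        mixed = fluid.layers.squeeze(
+            fluid.layers.matmul(gate, expert_stack), axes=[1])
+        tower = fluid.layers.fc(input=mixed, size=tower_size, act='relu')
+        task_outputs.append(fluid.layers.fc(input=tower, size=2, act='softmax'))
+    return task_outputs
+```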
+
+We define the MMOE network in PaddlePaddle and validate it on the open-source Census-income dataset. The AUC of the two tasks:
+
+1.income
+
+> max_mmoe_test_auc_income:0.94937
+>
+> mean_mmoe_test_auc_income:0.94465
+
+2.marital
+
+> max_mmoe_test_auc_marital:0.99419
+>
+> mean_mmoe_test_auc_marital:0.99324
+
+For accuracy verification, see the [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现) section.
+
+Supported features:
+
+Training: single-machine CPU, single-machine single-GPU, single-machine multi-GPU, locally simulated parameter-server training, and incremental training. For configuration, see [Launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).
+
+Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).
+
+## Data preparation
+
+Dataset: [Census-income Data](https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz)
+
+After extracting the data, fill in the file paths in the run.sh script and run it:
+
+```sh
+mkdir train_data
+mkdir test_data
+mkdir data
+train_path="data/census-income.data"
+test_path="data/census-income.test"
+train_data_path="train_data/"
+test_data_path="test_data/"
+pip install -r requirements.txt
+wget -P data/ https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz
+tar -zxvf data/census.tar.gz -C data/
+
+python data_preparation.py --train_path ${train_path} \
+                           --test_path ${test_path} \
+                           --train_data_path ${train_data_path} \
+                           --test_data_path ${test_data_path}
+
+```
+
+The generated data is comma-separated:
+
+```
+0,0,73,0,0,0,0,1700.09,0,0
+```
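+
+Given the column reordering in data_preparation.py, the first two values of each generated line are the task labels (marital_stat first, then income_50k); everything after them is a feature. A hypothetical parsing helper, just to make the layout concrete (this is not the shipped census_reader.py, and note that to_csv also writes a header row first):
+
+```python
+def parse_line(line):
+    # "0,0,73,...": marital label, income label, then the feature values.
+    values = line.strip().split(',')
+    marital_label = float(values[0])
+    income_label = float(values[1])
+    features = [float(v) for v in values[2:]]
+    return marital_label, income_label, features
+```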
+
+
+## Runtime environment
+
+PaddlePaddle>=1.7.2
+
+python 2.7/3.5/3.6/3.7
+
+PaddleRec >=0.1
+
+OS: Windows/Linux/macOS
+
+## Quick start
+
+### Single-machine training
+
+CPU environment
+
+Set the device, number of epochs, and other options in config.yaml.
+
+```
+dataset:
+- name: dataset_train
+  batch_size: 5
+  type: QueueDataset
+  data_path: "{workspace}/data/train"
+  data_converter: "{workspace}/census_reader.py"
+- name: dataset_infer
+  batch_size: 5
+  type: QueueDataset
+  data_path: "{workspace}/data/train"
+  data_converter: "{workspace}/census_reader.py"
+```
+
+### Single-machine inference
+
+CPU environment
+
+Set parameters such as epochs and device in config.yaml.
+
+```
+- name: infer_runner
+  class: infer
+  init_model_path: "increment/0"
+  device: cpu
+```
+
+## Reproducing the paper
+
+To reproduce the paper's results on the full original dataset, set batch_size=1000, thread_num=8, and epoch_num=4 in config.yaml.
+
+Training on a single P100 GPU takes about 6.5h; test AUC: best 0.9940, mean 0.9932.
+
+After making these changes, set 'workspace' in config.yaml to the directory containing config.yaml, then run:
+
+```
+python -m paddlerec.run -m /home/your/dir/config.yaml # debug mode: pass the absolute path of your local config directly
+```
+
+## Advanced usage
+
+## FAQ
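+
+One note for standalone use: data_preparation.py imports its command-line flags from an args.py helper that is not part of this patch. A minimal stand-in consistent with the four flags run.sh passes could look like this (the defaults here are illustrative assumptions):
+
+```python
+import argparse
+
+def data_preparation_args():
+    # Mirrors the four flags that run.sh passes to data_preparation.py.
+    parser = argparse.ArgumentParser(description='census data preparation')
+    parser.add_argument('--train_path', type=str, default='data/census-income.data')
+    parser.add_argument('--test_path', type=str, default='data/census-income.test')
+    parser.add_argument('--train_data_path', type=str, default='train_data/')
+    parser.add_argument('--test_data_path', type=str, default='test_data/')
+    return parser.parse_args()
+```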
diff --git a/models/multitask/mmoe/data/data_preparation.py b/models/multitask/mmoe/data/data_preparation.py
new file mode 100644
index 00000000..ad0775d1
--- /dev/null
+++ b/models/multitask/mmoe/data/data_preparation.py
@@ -0,0 +1,118 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pandas as pd
+from args import *
+
+
+def fun1(x):
+    # Binarize the income label: 1 for '50000+.', else 0.
+    if x == ' 50000+.':
+        return 1
+    else:
+        return 0
+
+
+def fun2(x):
+    # Binarize the marital label: 1 for 'Never married', else 0.
+    if x == ' Never married':
+        return 1
+    else:
+        return 0
+
+
+def data_preparation(train_path, test_path, train_data_path, test_data_path):
+    # The column names are from
+    # https://www2.1010data.com/documentationcenter/prod/Tutorials/MachineLearningExamples/CensusIncomeDataSet.html
+    column_names = [
+        'age', 'class_worker', 'det_ind_code', 'det_occ_code', 'education',
+        'wage_per_hour', 'hs_college', 'marital_stat', 'major_ind_code',
+        'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member',
+        'unemp_reason', 'full_or_part_emp', 'capital_gains', 'capital_losses',
+        'stock_dividends', 'tax_filer_stat', 'region_prev_res',
+        'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'instance_weight',
+        'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same',
+        'mig_prev_sunbelt', 'num_emp', 'fam_under_18', 'country_father',
+        'country_mother', 'country_self', 'citizenship', 'own_or_self',
+        'vet_question', 'vet_benefits', 'weeks_worked', 'year', 'income_50k'
+    ]
+
+    # Load the dataset in Pandas
+    train_df = pd.read_csv(
+        train_path,
+        delimiter=',',
+        header=None,
+        index_col=None,
+        names=column_names)
+    other_df = pd.read_csv(
+        test_path,
+        delimiter=',',
+        header=None,
+        index_col=None,
+        names=column_names)
+
+    # First group of tasks according to the paper
+    label_columns = ['income_50k', 'marital_stat']
+
+    # One-hot encoding categorical columns
+    categorical_columns = [
+        'class_worker', 'det_ind_code', 'det_occ_code', 'education',
+        'hs_college', 'major_ind_code', 'major_occ_code', 'race',
+        'hisp_origin', 'sex', 'union_member', 'unemp_reason',
+        'full_or_part_emp', 'tax_filer_stat', 'region_prev_res',
+        'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'mig_chg_msa',
+        'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt',
+        'fam_under_18', 'country_father', 'country_mother', 'country_self',
+        'citizenship', 'vet_question'
+    ]
+    train_raw_labels = train_df[label_columns]
+    other_raw_labels = other_df[label_columns]
+    transformed_train = pd.get_dummies(train_df, columns=categorical_columns)
+    transformed_other = pd.get_dummies(other_df, columns=categorical_columns)
+
+    # Filling the missing column in the other set
+    transformed_other[
+        'det_hh_fam_stat_ Grandchild <18 ever marr not in subfamily'] = 0
+    # Binarize the two task labels
+    transformed_train['income_50k'] = transformed_train['income_50k'].apply(
+        lambda x: fun1(x))
+    transformed_train['marital_stat'] = transformed_train[
+        'marital_stat'].apply(lambda x: fun2(x))
+    transformed_other['income_50k'] = transformed_other['income_50k'].apply(
+        lambda x: fun1(x))
+    transformed_other['marital_stat'] = transformed_other[
+        'marital_stat'].apply(lambda x: fun2(x))
+    # Split the other dataset into 1:1 validation to test according to the paper
+    validation_indices = transformed_other.sample(
+        frac=0.5, replace=False, random_state=1).index
+    test_indices = list(set(transformed_other.index) - set(validation_indices))
+    validation_data = transformed_other.iloc[validation_indices]
+    test_data = transformed_other.iloc[test_indices]
+
+    # Move the two label columns to the front: marital_stat first, then income_50k
+    cols = transformed_train.columns.tolist()
+    cols.insert(0, cols.pop(cols.index('income_50k')))
+    cols.insert(0, cols.pop(cols.index('marital_stat')))
+    transformed_train = transformed_train[cols]
+    test_data = test_data[cols]
+    validation_data = validation_data[cols]
+
+    # Only the train and test splits are written out; the validation split is
+    # kept in memory for the shape check below.
+    print(transformed_train.shape, transformed_other.shape,
+          validation_data.shape, test_data.shape)
+    transformed_train.to_csv(train_data_path + 'train_data.csv', index=False)
+    test_data.to_csv(test_data_path + 'test_data.csv', index=False)
+
+
+args = data_preparation_args()
+data_preparation(args.train_path, args.test_path, args.train_data_path,
+                 args.test_data_path)
diff --git a/models/multitask/share-bottom/README.md b/models/multitask/share-bottom/README.md
new file mode 100644
index 00000000..1995e70f
--- /dev/null
+++ b/models/multitask/share-bottom/README.md
@@ -0,0 +1,151 @@
+# Share_bottom
+
+Below is a brief overview of this example's directory structure:
+
+```
+├── data # data files
+    ├── train # training data
+        ├── train_data.txt
+    ├── test # test data
+        ├── test_data.txt
+    ├── run.sh
+    ├── data_preparation.py
+├── __init__.py
+├── config.yaml # configuration file
+├── census_reader.py # data reader
+├── model.py # model definition
+
+```
+
+Note: before reading this example, we recommend that you first go through the following:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
+
+## Contents
+
+- [Model overview](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#模型简介)
+- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#数据准备)
+- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#运行环境)
+- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#快速开始)
+- [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#论文复现)
+- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#进阶使用)
+- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#FAQ)
+
+## Model overview
+
+share_bottom is the basic framework for multi-task learning: the bottom-layer parameters and network structure are shared across tasks. Its strength is that it can learn multiple tasks well while greatly reducing the number of network parameters. Its weakness is equally clear: because the bottom layers are fully shared, two weakly related tasks can produce conflicting optimization signals that hurt the final result. Many later neural multi-task models evolved from share_bottom, and models such as MMOE were designed precisely to fix this weakness on weakly related tasks.
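+
+Structurally, it is just one shared bottom network with a small private tower per task. A minimal paddle.fluid sketch (layer sizes are illustrative assumptions, not those of the shipped model.py):
+
+```python
+import paddle.fluid as fluid
+
+def share_bottom_block(x, bottom_size=64, tower_size=8, num_tasks=2):
+    # One bottom network shared by every task...
+    bottom = fluid.layers.fc(input=x, size=bottom_size, act='relu')
+    task_outputs = []
+    for _ in range(num_tasks):
+        # ...and a small task-specific tower on top of it.
+        tower = fluid.layers.fc(input=bottom, size=tower_size, act='relu')
+        task_outputs.append(fluid.layers.fc(input=tower, size=2, act='softmax'))
+    return task_outputs
+```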
+
+We implement the share_bottom network in PaddlePaddle and validate it on the open-source dataset [Census-income Data](https://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD)). The AUC of the two tasks:
+
+1.income
+
+> max_sb_test_auc_income:0.94993
+>
+> mean_sb_test_auc_income: 0.93120
+
+2.marital
+
+> max_sb_test_auc_marital:0.99384
+>
+> mean_sb_test_auc_marital:0.99256
+
+The default configuration of this model uses the demo dataset; for accuracy verification, see the [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#论文复现) section.
+
+Supported features:
+
+Training: single-machine CPU, single-machine single-GPU, single-machine multi-GPU, locally simulated parameter-server training, and incremental training. For configuration, see [Launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).
+
+Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).
+
+## Data preparation
+
+Dataset: [Census-income Data](https://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD))
+
+After extracting the data, fill in the file paths in the create_data.sh script and run it:
+
+```sh
+mkdir train_data
+mkdir test_data
+mkdir data
+train_path="data/census-income.data"
+test_path="data/census-income.test"
+train_data_path="train_data/"
+test_data_path="test_data/"
+pip install -r requirements.txt
+wget -P data/ https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz
+tar -zxvf data/census.tar.gz -C data/
+
+python data_preparation.py --train_path ${train_path} \
+                           --test_path ${test_path} \
+                           --train_data_path ${train_data_path} \
+                           --test_data_path ${test_data_path}
+
+```
+
+The generated data is comma-separated (see the MMOE example for the line layout):
+
+```
+0,0,73,0,0,0,0,1700.09,0,0
+```
+
+
+
+## Runtime environment
+
+PaddlePaddle>=1.7.2
+
+python 2.7/3.5/3.6/3.7
+
+PaddleRec >=0.1
+
+OS: Windows/Linux/macOS
+
+## Quick start
+
+### Single-machine training
+
+CPU environment
+
+Set the device, number of epochs, and other options in config.yaml.
+
+```sh
+dataset:
+- name: dataset_train
+  batch_size: 5
+  type: QueueDataset
+  data_path: "{workspace}/data/train"
+  data_converter: "{workspace}/census_reader.py"
+- name: dataset_infer
+  batch_size: 5
+  type: QueueDataset
+  data_path: "{workspace}/data/train"
+  data_converter: "{workspace}/census_reader.py"
+```
+
+### Single-machine inference
+
+CPU environment
+
+Set parameters such as epochs and device in config.yaml.
+
+```sh
+- name: infer_runner
+  class: infer
+  init_model_path: "increment/0"
+  device: cpu
+```
+
+## Reproducing the paper
+
+To reproduce the paper's results on the full original dataset, set batch_size=32, thread_num=8, and epoch_num=100 in config.yaml.
+
+Training 100 epochs on a single P100 GPU takes about 4.5h; test AUC: best 0.9939, mean 0.9931.
+
+After making these changes, set 'workspace' in config.yaml to the directory containing config.yaml, then run:
+
+```text
+python -m paddlerec.run -m /home/your/dir/config.yaml # debug mode: pass the absolute path of your local config directly
+```
+
+## Advanced usage
+
+## FAQ
diff --git a/models/multitask/share-bottom/data/data_preparation.py b/models/multitask/share-bottom/data/data_preparation.py
new file mode 100644
index 00000000..ad0775d1
--- /dev/null
+++ b/models/multitask/share-bottom/data/data_preparation.py
@@ -0,0 +1,118 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pandas as pd
+from args import *
+
+
+def fun1(x):
+    # Binarize the income label: 1 for '50000+.', else 0.
+    if x == ' 50000+.':
+        return 1
+    else:
+        return 0
+
+
+def fun2(x):
+    # Binarize the marital label: 1 for 'Never married', else 0.
+    if x == ' Never married':
+        return 1
+    else:
+        return 0
+
+
+def data_preparation(train_path, test_path, train_data_path, test_data_path):
+    # The column names are from
+    # https://www2.1010data.com/documentationcenter/prod/Tutorials/MachineLearningExamples/CensusIncomeDataSet.html
+    column_names = [
+        'age', 'class_worker', 'det_ind_code', 'det_occ_code', 'education',
+        'wage_per_hour', 'hs_college', 'marital_stat', 'major_ind_code',
+        'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member',
+        'unemp_reason', 'full_or_part_emp', 'capital_gains', 'capital_losses',
+        'stock_dividends', 'tax_filer_stat', 'region_prev_res',
+        'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'instance_weight',
+        'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same',
+        'mig_prev_sunbelt', 'num_emp', 'fam_under_18', 'country_father',
+        'country_mother', 'country_self', 'citizenship', 'own_or_self',
+        'vet_question', 'vet_benefits', 'weeks_worked', 'year', 'income_50k'
+    ]
+
+    # Load the dataset in Pandas
+    train_df = pd.read_csv(
+        train_path,
+        delimiter=',',
+        header=None,
+        index_col=None,
+        names=column_names)
+    other_df = pd.read_csv(
+        test_path,
+        delimiter=',',
+        header=None,
+        index_col=None,
+        names=column_names)
+
+    # First group of tasks according to the paper
+    label_columns = ['income_50k', 'marital_stat']
+
+    # One-hot encoding categorical columns
+    categorical_columns = [
+        'class_worker', 'det_ind_code', 'det_occ_code', 'education',
+        'hs_college', 'major_ind_code', 'major_occ_code', 'race',
+        'hisp_origin', 'sex', 'union_member', 'unemp_reason',
+        'full_or_part_emp', 'tax_filer_stat', 'region_prev_res',
+        'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'mig_chg_msa',
+        'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt',
+        'fam_under_18', 'country_father', 'country_mother', 'country_self',
+        'citizenship', 'vet_question'
+    ]
+    train_raw_labels = train_df[label_columns]
+    other_raw_labels = other_df[label_columns]
+    transformed_train = pd.get_dummies(train_df, columns=categorical_columns)
+    transformed_other = pd.get_dummies(other_df, columns=categorical_columns)
+
+    # Filling the missing column in the other set
+    transformed_other[
+        'det_hh_fam_stat_ Grandchild <18 ever marr not in subfamily'] = 0
+    # Binarize the two task labels
+    transformed_train['income_50k'] = transformed_train['income_50k'].apply(
+        lambda x: fun1(x))
+    transformed_train['marital_stat'] = transformed_train[
+        'marital_stat'].apply(lambda x: fun2(x))
+    transformed_other['income_50k'] = transformed_other['income_50k'].apply(
+        lambda x: fun1(x))
+    transformed_other['marital_stat'] = transformed_other[
+        'marital_stat'].apply(lambda x: fun2(x))
+    # Split the other dataset into 1:1 validation to test according to the paper
+    validation_indices = transformed_other.sample(
+        frac=0.5, replace=False, random_state=1).index
+    test_indices = list(set(transformed_other.index) - set(validation_indices))
+    validation_data = transformed_other.iloc[validation_indices]
+    test_data = transformed_other.iloc[test_indices]
+
+    # Move the two label columns to the front: marital_stat first, then income_50k
+    cols = transformed_train.columns.tolist()
+    cols.insert(0, cols.pop(cols.index('income_50k')))
+    cols.insert(0, cols.pop(cols.index('marital_stat')))
+    transformed_train = transformed_train[cols]
+    test_data = test_data[cols]
+    validation_data = validation_data[cols]
+
+    # Only the train and test splits are written out; the validation split is
+    # kept in memory for the shape check below.
+    print(transformed_train.shape, transformed_other.shape,
+          validation_data.shape, test_data.shape)
+    transformed_train.to_csv(train_data_path + 'train_data.csv', index=False)
+    test_data.to_csv(test_data_path + 'test_data.csv', index=False)
+
+
+args = data_preparation_args()
+data_preparation(args.train_path, args.test_path, args.train_data_path,
+                 args.test_data_path)
diff --git a/models/multitask/share-bottom/data/run.sh b/models/multitask/share-bottom/data/run.sh
new file mode 100644
index 00000000..b60d42b3
--- /dev/null
+++ b/models/multitask/share-bottom/data/run.sh
@@ -0,0 +1,16 @@
+mkdir train_data
+mkdir test_data
+mkdir data
+train_path="data/census-income.data"
+test_path="data/census-income.test"
+train_data_path="train_data/"
+test_data_path="test_data/"
+pip install -r requirements.txt
+
+wget -P data/ https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz
+tar -zxvf data/census.tar.gz -C data/
+
+python data_preparation.py --train_path ${train_path} \
+                           --test_path ${test_path} \
+                           --train_data_path ${train_data_path} \
+                           --test_data_path ${test_data_path}
diff --git a/models/rank/flen/README.md b/models/rank/flen/README.md
index 9dafeac6..de1d663c 100644
--- a/models/rank/flen/README.md
+++ b/models/rank/flen/README.md
@@ -15,23 +15,53 @@
 ├── config.yaml # configuration file
 ```
 
-## Overview
+Note: before reading this example, we recommend that you first go through the following:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
+
+
+---
+## Contents
+
+- [Model overview](#模型简介)
+- [Data preparation](#数据准备)
+- [Runtime environment](#运行环境)
+- [Quick start](#快速开始)
+- [Reproducing the paper](#论文复现)
+- [Advanced usage](#进阶使用)
+- [FAQ](#FAQ)
+
+## Model overview
 
 [《FLEN: Leveraging Field for Scalable CTR Prediction》](https://arxiv.org/pdf/1911.04690.pdf) proposes the field-wise bi-interaction pooling technique, which resolves the high time and space complexity of exploiting feature-field information at scale, and also introduces DiceFactor, a method for alleviating the gradient coupling problem. The model has been deployed in Meitu's large-scale recommender system, delivering consistent, across-the-board business gains.
 
-This project validates the model on the avazu dataset
+This project validates the model on the avazu dataset. The default model configuration uses the demo dataset; for accuracy verification, see the [Reproducing the paper](#论文复现) section.
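+
+The core of the model is field-wise bi-interaction (FwBI) pooling: features are first pooled into one embedding per field group (e.g. user/item/context), and pairwise interactions are then computed between the field-level embeddings via the FM identity, which is linear in the number of fields. A simplified paddle.fluid sketch of that pooling step (the full FwBI in the paper also carries learnable field-pair weights, omitted here):
+
+```python
+import paddle.fluid as fluid
+
+def field_wise_bi_interaction(field_embs):
+    # field_embs: a list of [batch, k] pooled embeddings, one per field group.
+    # FM identity: sum_{i<j} e_i * e_j = 0.5 * ((sum_i e_i)^2 - sum_i e_i^2)
+    sum_emb = fluid.layers.sums(field_embs)
+    square_of_sum = fluid.layers.square(sum_emb)
+    sum_of_square = fluid.layers.sums(
+        [fluid.layers.square(e) for e in field_embs])
+    return fluid.layers.scale(square_of_sum - sum_of_square, scale=0.5)
+```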
+
+Supported features:
+
+Training: single-machine CPU, single-machine single-GPU, single-machine multi-GPU, locally simulated parameter-server training, and incremental training. For configuration, see [Launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).
+
+Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).
+
+## Data preparation
+
 
-## Data download and preprocessing
 
-## Environment
-PaddlePaddle 1.7.2
-
-python3.7
-
-PaddleRec
+## Runtime environment
 
-## Single-machine training
+PaddlePaddle>=1.7.2
+
+python 2.7/3.5/3.6/3.7
+
+PaddleRec >=0.1
+
+OS: Windows/Linux/macOS
+
+
+
+## Quick start
+### Single-machine training
 
 CPU environment
 
 Set the device, number of epochs, and other options in config.yaml.
 
@@ -60,7 +90,7 @@
  phases: [phase1]
 ```
 
-## Single-machine inference
+### Single-machine inference
 
 CPU environment
 
@@ -77,54 +107,21 @@
 Set parameters such as epochs and device in config.yaml.
 
 ```
  - name: infer_runner
    class: infer
    init_model_path: "increment/0"
    device: cpu
    print_interval: 1
    phases: [phase2]
 ```
 
-## Run
+### Run
 
 ```
 python -m paddlerec.run -m paddlerec.models.rank.flen
 ```
 
-## Model results
-
-Test the model on the sample data
-
-Training:
-
-```
-0702 13:38:20.903220  7368 parallel_executor.cc:440] The Program will be executed on CPU using ParallelExecutor, 2 cards are used, so 2 programs are executed in parallel.
-I0702 13:38:20.925912  7368 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
-I0702 13:38:20.933356  7368 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
-batch: 2, AUC: [0.09090909 0.        ], BATCH_AUC: [0.09090909 0.        ]
-batch: 4, AUC: [0.31578947 0.29411765], BATCH_AUC: [0.31578947 0.29411765]
-batch: 6, AUC: [0.41333333 0.33333333], BATCH_AUC: [0.41333333 0.33333333]
-batch: 8, AUC: [0.4453125  0.44166667], BATCH_AUC: [0.4453125  0.44166667]
-batch: 10, AUC: [0.39473684 0.38888889], BATCH_AUC: [0.44117647 0.41176471]
-batch: 12, AUC: [0.41860465 0.45535714], BATCH_AUC: [0.5078125  0.54545455]
-batch: 14, AUC: [0.43413729 0.42746615], BATCH_AUC: [0.56666667 0.56      ]
-batch: 16, AUC: [0.46433566 0.47460087], BATCH_AUC: [0.53       0.59247649]
-batch: 18, AUC: [0.44009217 0.44642857], BATCH_AUC: [0.46 0.47]
-batch: 20, AUC: [0.42705314 0.43781095], BATCH_AUC: [0.45878136 0.4874552 ]
-batch: 22, AUC: [0.45176471 0.46011281], BATCH_AUC: [0.48046875 0.45878136]
-batch: 24, AUC: [0.48375    0.48910256], BATCH_AUC: [0.56630824 0.59856631]
-epoch 0 done, use time: 0.21532440185546875
-PaddleRec Finish
-```
-
-Inference:
-
-```
-PaddleRec: Runner single_cpu_infer Begin
-Executor Mode: infer
-processor_register begin
-Running SingleInstance.
-Running SingleNetwork.
-QueueDataset can not support PY3, change to DataLoader
-QueueDataset can not support PY3, change to DataLoader
-Running SingleInferStartup.
-Running SingleInferRunner.
-load persistables from increment_model/0
-batch: 20, AUC: [0.49121353], BATCH_AUC: [0.66176471]
-batch: 40, AUC: [0.51156463], BATCH_AUC: [0.55197133]
-Infer phase2 of 0 done, use time: 0.3941819667816162
-PaddleRec Finish
-```
+## Reproducing the paper
 
+To reproduce the paper's results on the full original dataset, set batch_size=512, thread_num=8, and epoch_num=1 in config.yaml.
 
+Results on the full dataset are to be added later.
 
+## Advanced usage
+
+## FAQ
diff --git a/models/rank/flen/data/get_data.py b/models/rank/flen/data/get_data.py
new file mode 100644
index 00000000..27a26536
--- /dev/null
+++ b/models/rank/flen/data/get_data.py
@@ -0,0 +1,70 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+import pandas as pd
+from sklearn.preprocessing import LabelEncoder
+
+data = pd.read_csv('./avazu_sample.txt')
+# Split the original 'hour' field (YYMMDDHH) into day and hour-of-day.
+data['day'] = data['hour'].apply(lambda x: str(x)[4:6])
+data['hour'] = data['hour'].apply(lambda x: str(x)[6:])
+
+sparse_features = [
+    'hour',
+    'C1',
+    'banner_pos',
+    'site_id',
+    'site_domain',
+    'site_category',
+    'app_id',
+    'app_domain',
+    'app_category',
+    'device_id',
+    'device_model',
+    'device_type',
+    'device_conn_type',  # 'device_ip',
+    'C14',
+    'C15',
+    'C16',
+    'C17',
+    'C18',
+    'C19',
+    'C20',
+    'C21',
+]
+
+data[sparse_features] = data[sparse_features].fillna('-1')
+
+# 1. Label-encode the sparse features (dense features would only need a
+# simple transformation).
+for feat in sparse_features:
+    lbe = LabelEncoder()
+    data[feat] = lbe.fit_transform(data[feat])
+
+cols = [
+    'click', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C1',
+    'device_model', 'device_type', 'device_id', 'app_id', 'app_domain',
+    'app_category', 'banner_pos', 'site_id', 'site_domain', 'site_category',
+    'device_conn_type', 'hour'
+]
+# Record the maximum encoded value of each feature as its vocabulary size,
+# written as a single comma-separated line.
+data = data[cols]
+line = ''
+vacob_file = open('vacob_file.txt', 'w')
+for col in cols[1:]:
+    max_val = data[col].max()
+    line += str(max_val) + ','
+vacob_file.write(line)
+vacob_file.close()
+
+# Make sure the output directory exists (compatible with Python 2 and 3).
+if not os.path.isdir('./train_data'):
+    os.makedirs('./train_data')
+data.to_csv('./train_data/train_data.txt', index=False, header=None)
diff --git a/models/rank/wide_deep/README.md b/models/rank/wide_deep/README.md
new file mode 100644
index 00000000..c32047c0
--- /dev/null
+++ b/models/rank/wide_deep/README.md
@@ -0,0 +1,127 @@
+# wide&deep
+
+Below is a brief overview of this example's directory structure:
+
+```
+├── data # data files
+    ├── train # training data
+        ├── train_data.txt
+    ├── create_data.sh
+    ├── data_preparation.py
+    ├── get_slot_data.py
+    ├── run.sh
+├── __init__.py
+├── config.yaml # configuration file
+├── model.py # model definition
+```
+
+Note: before reading this example, we recommend that you first go through the following:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
+
+## Contents
+
+- [Model overview](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#模型简介)
+- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#数据准备)
+- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#运行环境)
+- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#快速开始)
+- [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#论文复现)
+- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#进阶使用)
+- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#FAQ)
+
+## Model overview
+
+[《Wide & Deep Learning for Recommender Systems》](https://arxiv.org/pdf/1606.07792.pdf) is the recommendation framework Google published in 2016. wide&deep jointly trains a shallow (wide) model and a deep model in one framework, combining the memorization ability of the shallow part with the generalization ability of the deep part, so that a single model serves both the accuracy and the scalability of a recommender system (a structural sketch follows the list below). The paper evaluates it from two angles, recommendation effect and serving performance:
+
+1. Effect: in an online A/B experiment on Google Play, the wide&deep model improved the app acquisition rate by +3.9% over a highly optimized wide-only model, with a measurable gain over the deep-only model as well.
+2. Performance: by splitting the batch of apps each request must score into smaller batches processed by parallel threads, single-request latency dropped from 31ms to 14ms.
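+
+The joint structure described above boils down to summing a wide (linear) logit and a deep (MLP) logit before the sigmoid. A minimal paddle.fluid sketch (layer sizes are illustrative assumptions, not necessarily those of the shipped model.py):
+
+```python
+import paddle.fluid as fluid
+
+def wide_deep_net(wide_input, deep_input):
+    # Wide part: a linear model over raw/cross features.
+    wide_logit = fluid.layers.fc(input=wide_input, size=1, act=None)
+    # Deep part: an MLP over dense/embedded features.
+    hidden = deep_input
+    for size in [64, 32, 16]:
+        hidden = fluid.layers.fc(input=hidden, size=size, act='relu')
+    deep_logit = fluid.layers.fc(input=hidden, size=1, act=None)
+    # Joint training: the logits are summed, then squashed to a probability.
+    return fluid.layers.sigmoid(wide_logit + deep_logit)
+```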
+
+This example implements wide&deep in PaddlePaddle and validates it on the open-source Census-income dataset. Mean acc and AUC on the test set:
+
+> mean_acc: 0.76195
+>
+> mean_auc: 0.90577
+
+For accuracy verification, see the [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#论文复现) section.
+
+Supported features:
+
+Training: single-machine CPU, single-machine single-GPU, single-machine multi-GPU, locally simulated parameter-server training, and incremental training. For configuration, see [Launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).
+
+Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).
+
+## Data preparation
+
+Datasets:
+
+[adult.data](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data)
+
+[adult.test](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test)
+
+## Runtime environment
+
+PaddlePaddle>=1.7.2
+
+python 2.7/3.5/3.6/3.7
+
+PaddleRec >=0.1
+
+OS: Windows/Linux/macOS
+
+## Quick start
+
+### Single-machine training
+
+CPU environment
+
+Set the device, number of epochs, and other options in config.yaml.
+
+```sh
+dataset:
+  - name: sample_1
+    type: QueueDataset
+    batch_size: 5
+    data_path: "{workspace}/data/sample_data/train"
+    sparse_slots: "label"
+    dense_slots: "wide_input:8 deep_input:58"
+  - name: infer_sample
+    type: QueueDataset
+    batch_size: 5
+    data_path: "{workspace}/data/sample_data/train"
+    sparse_slots: "label"
+    dense_slots: "wide_input:8 deep_input:58"
+```
+
+### Single-machine inference
+
+CPU environment
+
+Set parameters such as epochs and device in config.yaml.
+
+```
+  - name: infer_runner
+    class: infer
+    device: cpu
+    init_model_path: "increment/0"
+```
+
+
+## Reproducing the paper
+
+To reproduce the paper's results on the full original dataset, set batch_size=40, thread_num=8, and epoch_num=40 in config.yaml. With the full Census-income data, the mean acc and AUC on the test set are:
+
+mean_acc: 0.76195 , mean_auc: 0.90577
+
+After making these changes, set 'workspace' in config.yaml to the directory containing config.yaml, then run:
+
+```
+python -m paddlerec.run -m /home/your/dir/config.yaml # debug mode: pass the absolute path of your local config directly
+```
+
+## Advanced usage
+
+## FAQ
-- 
GitLab