提交 3e43efca 编写于 作者: M malin10

Merge branch 'master' of https://github.com/PaddlePaddle/PaddleRec into gru4rec

......@@ -16,13 +16,20 @@ before_install:
# For pylint dockstring checker
- sudo apt-get update
- sudo apt-get install -y python-pip libpython-dev
- sudo apt-get remove python-urllib3
- sudo apt-get purge python-urllib3
- sudo rm /usr/lib/python2.7/dist-packages/chardet-*
- sudo pip install -U pip
- sudo pip install --upgrade setuptools
- sudo pip install six --upgrade --ignore-installed six
- sudo pip install pillow
- sudo pip install PyYAML
- sudo pip install pylint pytest astroid isort pre-commit
- sudo pip install kiwisolver
- sudo pip install paddlepaddle==1.7.2 --ignore-installed urllib3
- sudo pip install scikit-build
- sudo pip install Pillow==5.3.0
- sudo pip install opencv-python==3.4.3.18
- sudo pip install rarfile==3.0
- sudo pip install paddlepaddle==1.7.2
- sudo python setup.py install
- |
function timeout() { perl -e 'alarm shift; exec @ARGV' "$@"; }
......
此差异已折叠。
(简体中文|[English](./README.md))
<p align="center">
<img align="center" src="doc/imgs/logo.png">
<p>
<p align="center">
<img align="center" src="doc/imgs/structure.png">
<p>
<p align="center">
<img align="center" src="doc/imgs/overview.png">
<p>
<h2 align="center">什么是推荐系统?</h2>
<p align="center">
<img align="center" src="doc/imgs/rec-overview.png">
<p>
- 推荐系统是在互联网信息爆炸式增长的时代背景下,帮助用户高效获得感兴趣信息的关键;
- 推荐系统也是帮助产品最大限度吸引用户、留存用户、增加用户粘性、提高用户转化率的银弹。
- 有无数优秀的产品依靠用户可感知的推荐系统建立了良好的口碑,也有无数的公司依靠直击用户痛点的推荐系统在行业中占领了一席之地。
> 可以说,谁能掌握和利用好推荐系统,谁就能在信息分发的激烈竞争中抢得先机。
> 但与此同时,有着许多问题困扰着推荐系统的开发者,比如:庞大的数据量,复杂的模型结构,低效的分布式训练环境,波动的在离线一致性,苛刻的上线部署要求,以上种种,不胜枚举。
<h2 align="center">什么是PaddleRec?</h2>
- 源于飞桨生态的搜索推荐模型 **一站式开箱即用工具**
- 适合初学者,开发者,研究者的推荐系统全流程解决方案
- 包含内容理解、匹配、召回、排序、 多任务、重排序等多个任务的完整推荐搜索算法库
| 方向 | 模型 | 单机CPU | 单机GPU | 分布式CPU | 分布式GPU | 论文 |
| :------: | :-----------------------------------------------------------------------: | :-----: | :-----: | :-------: | :-------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 内容理解 | [Text-Classifcation](models/contentunderstanding/classification/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][Convolutional neural networks for sentence classication](https://www.aclweb.org/anthology/D14-1181.pdf) |
| 内容理解 | [TagSpace](models/contentunderstanding/tagspace/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) |
| 匹配 | [DSSM](models/match/dssm/model.py) | ✓ | ✓ | ✓ | x | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) |
| 匹配 | [MultiView-Simnet](models/match/multiview-simnet/model.py) | ✓ | ✓ | ✓ | x | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) |
| 召回 | [TDM](models/treebased/tdm/model.py) | ✓ | >=1.8.0 | ✓ | >=1.8.0 | [KDD 2018][Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf) |
| 召回 | [fasttext](models/recall/fasttext/model.py) | ✓ | ✓ | x | x | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) |
| 召回 | [Word2Vec](models/recall/word2vec/model.py) | ✓ | ✓ | ✓ | x | [NIPS 2013][Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) |
| 召回 | [SSR](models/recall/ssr/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2016][Multi-Rate Deep Learning for Temporal Recommendation](http://sonyis.me/paperpdf/spr209-song_sigir16.pdf) |
| 召回 | [Gru4Rec](models/recall/gru4rec/model.py) | ✓ | ✓ | ✓ | ✓ | [2015][Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939) |
| 召回 | [Youtube_dnn](models/recall/youtube_dnn/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys 2016][Deep Neural Networks for YouTube Recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) |
| 召回 | [NCF](models/recall/ncf/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2017][Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf) |
| 召回 | [GNN](models/recall/gnn/model.py) | ✓ | ✓ | ✓ | ✓ | [AAAI 2019][Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855) |
| 排序 | [Logistic Regression](models/rank/logistic_regression/model.py) | ✓ | x | ✓ | x | / |
| 排序 | [Dnn](models/rank/dnn/model.py) | ✓ | ✓ | ✓ | ✓ | / |
| 排序 | [FM](models/rank/fm/model.py) | ✓ | x | ✓ | x | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) |
| 排序 | [FFM](models/rank/ffm/model.py) | ✓ | x | ✓ | x | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) |
| 排序 | [FNN](models/rank/fnn/model.py) | ✓ | x | ✓ | x | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) |
| 排序 | [Deep Crossing](models/rank/deep_crossing/model.py) | ✓ | x | ✓ | x | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) |
| 排序 | [Pnn](models/rank/pnn/model.py) | ✓ | x | ✓ | x | [ICDM 2016][Product-based Neural Networks for User Response Prediction](https://arxiv.org/pdf/1611.00144.pdf) |
| 排序 | [DCN](models/rank/dcn/model.py) | ✓ | x | ✓ | x | [KDD 2017][Deep & Cross Network for Ad Click Predictions](https://dl.acm.org/doi/pdf/10.1145/3124749.3124754) |
| 排序 | [NFM](models/rank/nfm/model.py) | ✓ | x | ✓ | x | [SIGIR 2017][Neural Factorization Machines for Sparse Predictive Analytics](https://dl.acm.org/doi/pdf/10.1145/3077136.3080777) |
| 排序 | [AFM](models/rank/afm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks](https://arxiv.org/pdf/1708.04617.pdf) |
| 排序 | [DeepFM](models/rank/deepfm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/pdf/1703.04247.pdf) |
| 排序 | [xDeepFM](models/rank/xdeepfm/model.py) | ✓ | x | ✓ | x | [KDD 2018][xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/3219819.3220023) |
| 排序 | [DIN](models/rank/din/model.py) | ✓ | x | ✓ | x | [KDD 2018][Deep Interest Network for Click-Through Rate Prediction](https://dl.acm.org/doi/pdf/10.1145/3219819.3219823) |
| 排序 | [DIEN](models/rank/dien/model.py) | ✓ | x | ✓ | x | [AAAI 2019][Deep Interest Evolution Network for Click-Through Rate Prediction](https://www.aaai.org/ojs/index.php/AAAI/article/view/4545/4423) |
| 排序 | [BST](models/rank/BST/model.py) | ✓ | x | ✓ | x | [DLP_KDD 2019][Behavior Sequence Transformer for E-commerce Recommendation in Alibaba](https://arxiv.org/pdf/1905.06874v1.pdf) |
| 排序 | [AutoInt](models/rank/AutoInt/model.py) | ✓ | x | ✓ | x | [CIKM 2019][AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/pdf/1810.11921.pdf) |
| 排序 | [Wide&Deep](models/rank/wide_deep/model.py) | ✓ | x | ✓ | x | [DLRS 2016][Wide & Deep Learning for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/2988450.2988454) |
| 排序 | [FGCNN](models/rank/fgcnn/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2019][Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1904.04447.pdf) |
| 排序 | [Fibinet](models/rank/fibinet/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction]( https://arxiv.org/pdf/1905.09433.pdf) |
| 排序 | [Flen](models/rank/flen/model.py) | ✓ | ✓ | ✓ | ✓ | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf) |
| 多任务 | [ESMM](models/multitask/esmm/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
| 多任务 | [MMOE](models/multitask/mmoe/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |
| 多任务 | [ShareBottom](models/multitask/share-bottom/model.py) | ✓ | ✓ | ✓ | ✓ | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf) |
| 重排序 | [Listwise](models/rerank/listwise/model.py) | ✓ | ✓ | ✓ | x | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) |
<h2 align="center">快速安装</h2>
### 环境要求
* Python 2.7/ 3.5 / 3.6 / 3.7
* PaddlePaddle >= 1.7.2
* 操作系统: Windows/Mac/Linux
> Windows下PaddleRec目前仅支持单机训练,分布式训练建议使用Linux环境
### 安装命令
- 安装方法一 **PIP源直接安装**
```bash
python -m pip install paddle-rec
```
> 该方法会默认下载安装`paddlepaddle v1.7.2 cpu版本`,若提示`PaddlePaddle`无法安装,则依照下述方法首先安装`PaddlePaddle`,再安装`PaddleRec`:
> - 可以在[该地址](https://pypi.org/project/paddlepaddle/1.7.2/#files),下载PaddlePaddle后手动安装whl包
> - 可以先pip安装`PaddlePaddle`,`python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple`
> - 其他安装问题可以在[Paddle Issue](https://github.com/PaddlePaddle/Paddle/issues)或[PaddleRec Issue](https://github.com/PaddlePaddle/PaddleRec/issues)提出,会有工程师及时解答
- 安装方法二 **源码编译安装**
- 安装飞桨 **注:需要用户安装版本 == 1.7.2 的飞桨**
```shell
python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple
```
- 源码安装PaddleRec
```
git clone https://github.com/PaddlePaddle/PaddleRec/
cd PaddleRec
python setup.py install
```
- PaddleRec-GPU安装方法
在使用方法一或方法二完成PaddleRec安装后,需再手动安装`paddlepaddle-gpu`,并根据自身环境(Cuda/Cudnn)选择合适的版本,安装教程请查阅[飞桨-开始使用](https://www.paddlepaddle.org.cn/install/quick)
<h2 align="center">一键启动</h2>
我们以排序模型中的`dnn`模型为例介绍PaddleRec的一键启动。训练数据来源为[Criteo数据集](https://www.kaggle.com/c/criteo-display-ad-challenge/),我们从中截取了100条数据:
```bash
# 使用CPU进行单机训练
python -m paddlerec.run -m paddlerec.models.rank.dnn
```
<h2 align="center">帮助文档</h2>
### 项目背景
* [推荐系统介绍](doc/rec_background.md)
* [分布式深度学习介绍](doc/ps_background.md)
### 快速开始
* [十分钟上手PaddleRec](https://aistudio.baidu.com/aistudio/projectdetail/559336)
### 入门教程
* [数据准备](doc/slot_reader.md)
* [模型调参](doc/model.md)
* [启动单机训练](doc/train.md)
* [启动分布式训练](doc/distributed_train.md)
* [启动预测](doc/predict.md)
* [快速部署](doc/serving.md)
### 进阶教程
* [自定义Reader](doc/custom_reader.md)
* [自定义模型](doc/model_develop.md)
* [自定义流程](doc/trainer_develop.md)
* [yaml配置说明](doc/yaml.md)
* [PaddleRec设计文档](doc/design.md)
### Benchmark
* [Benchmark](doc/benchmark.md)
### FAQ
* [常见问题FAQ](doc/faq.md)
<h2 align="center">社区</h2>
<p align="center">
<br>
<img alt="Release" src="https://img.shields.io/badge/Release-0.1.0-yellowgreen">
<img alt="License" src="https://img.shields.io/github/license/PaddlePaddle/PaddleRec">
<img alt="Slack" src="https://img.shields.io/badge/Join-Slack-green">
<br>
<p>
### 版本历史
- 2020.06.17 - PaddleRec v0.1.0
- 2020.06.03 - PaddleRec v0.0.2
- 2020.05.14 - PaddleRec v0.0.1
### 许可证书
本项目的发布受[Apache 2.0 license](LICENSE)许可认证。
### 联系我们
如有意见、建议及使用中的BUG,欢迎在[GitHub Issue](https://github.com/PaddlePaddle/PaddleRec/issues)提交
亦可通过以下方式与我们沟通交流:
- QQ群号码:`861717190`
- 微信小助手微信号:`paddlerec2020`
<p align="center"><img width="200" height="200" margin="500" src="./doc/imgs/QQ_group.png"/>&#8194;&#8194;&#8194;&#8194;&#8194<img width="200" height="200" src="doc/imgs/weixin_supporter.png"/></p>
<p align="center">PaddleRec交流QQ群&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;PaddleRec微信小助手</p>
([简体中文](./README.md)|English)
<p align="center">
<img align="center" src="doc/imgs/logo.png">
<p>
<p align="center">
<img align="center" src="doc/imgs/overview_en.png">
<p>
<h2 align="center">What is recommendation system ?</h2>
<p align="center">
<img align="center" src="doc/imgs/rec-overview-en.png">
<p>
- Recommendation system helps users quickly find useful and interesting information from massive data.
- Recommendation system is also a silver bullet to attract users, retain users, increase users' stickness or conversionn.
> Who can better use the recommendation system, who can gain more advantage in the fierce competition.
>
> At the same time, there are many problems in the process of using the recommendation system, such as: huge data, complex model, inefficient distributed training, and so on.
<h2 align="center">What is PaddleRec ?</h2>
- A quick start tool of search & recommendation algorithm based on [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html)
- A complete solution of recommendation system for beginners, developers and researchers.
- Recommendation algorithm library including content-understanding, match, recall, rank, multi-task, re-rank etc.
| Type | Algorithm | CPU | GPU | Parameter-Server | Multi-GPU | Paper |
| :-------------------: | :-----------------------------------------------------------------------: | :---: | :-----: | :--------------: | :-------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Content-Understanding | [Text-Classifcation](models/contentunderstanding/classification/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][Convolutional neural networks for sentence classication](https://www.aclweb.org/anthology/D14-1181.pdf) |
| Content-Understanding | [TagSpace](models/contentunderstanding/tagspace/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) |
| Match | [DSSM](models/match/dssm/model.py) | ✓ | ✓ | ✓ | x | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) |
| Match | [MultiView-Simnet](models/match/multiview-simnet/model.py) | ✓ | ✓ | ✓ | x | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) |
| Recall | [TDM](models/treebased/tdm/model.py) | ✓ | >=1.8.0 | ✓ | >=1.8.0 | [KDD 2018][Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf) |
| Recall | [fasttext](models/recall/fasttext/model.py) | ✓ | ✓ | x | x | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) |
| Recall | [Word2Vec](models/recall/word2vec/model.py) | ✓ | ✓ | ✓ | x | [NIPS 2013][Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) |
| Recall | [SSR](models/recall/ssr/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2016][Multi-Rate Deep Learning for Temporal Recommendation](http://sonyis.me/paperpdf/spr209-song_sigir16.pdf) |
| Recall | [Gru4Rec](models/recall/gru4rec/model.py) | ✓ | ✓ | ✓ | ✓ | [2015][Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939) |
| Recall | [Youtube_dnn](models/recall/youtube_dnn/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys 2016][Deep Neural Networks for YouTube Recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) |
| Recall | [NCF](models/recall/ncf/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2017][Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf) |
| Recall | [GNN](models/recall/gnn/model.py) | ✓ | ✓ | ✓ | ✓ | [AAAI 2019][Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855) |
| Recall | [RALM](models/recall/look-alike_recall/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2019][Real-time Attention Based Look-alike Model for Recommender System](https://arxiv.org/pdf/1906.05022.pdf) |
| Rank | [Logistic Regression](models/rank/logistic_regression/model.py) | ✓ | x | ✓ | x | / |
| Rank | [Dnn](models/rank/dnn/model.py) | ✓ | ✓ | ✓ | ✓ | / |
| Rank | [FM](models/rank/fm/model.py) | ✓ | x | ✓ | x | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) |
| Rank | [FFM](models/rank/ffm/model.py) | ✓ | x | ✓ | x | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) |
| Rank | [FNN](models/rank/fnn/model.py) | ✓ | x | ✓ | x | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) |
| Rank | [Deep Crossing](models/rank/deep_crossing/model.py) | ✓ | x | ✓ | x | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) |
| Rank | [Pnn](models/rank/pnn/model.py) | ✓ | x | ✓ | x | [ICDM 2016][Product-based Neural Networks for User Response Prediction](https://arxiv.org/pdf/1611.00144.pdf) |
| Rank | [DCN](models/rank/dcn/model.py) | ✓ | x | ✓ | x | [KDD 2017][Deep & Cross Network for Ad Click Predictions](https://dl.acm.org/doi/pdf/10.1145/3124749.3124754) |
| Rank | [NFM](models/rank/nfm/model.py) | ✓ | x | ✓ | x | [SIGIR 2017][Neural Factorization Machines for Sparse Predictive Analytics](https://dl.acm.org/doi/pdf/10.1145/3077136.3080777) |
| Rank | [AFM](models/rank/afm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks](https://arxiv.org/pdf/1708.04617.pdf) |
| Rank | [DeepFM](models/rank/deepfm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/pdf/1703.04247.pdf) |
| Rank | [xDeepFM](models/rank/xdeepfm/model.py) | ✓ | x | ✓ | x | [KDD 2018][xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/3219819.3220023) |
| Rank | [DIN](models/rank/din/model.py) | ✓ | x | ✓ | x | [KDD 2018][Deep Interest Network for Click-Through Rate Prediction](https://dl.acm.org/doi/pdf/10.1145/3219819.3219823) |
| Rank | [DIEN](models/rank/dien/model.py) | ✓ | x | ✓ | x | [AAAI 2019][Deep Interest Evolution Network for Click-Through Rate Prediction](https://www.aaai.org/ojs/index.php/AAAI/article/view/4545/4423) |
| Rank | [BST](models/rank/BST/model.py) | ✓ | x | ✓ | x | [DLP-KDD 2019][Behavior Sequence Transformer for E-commerce Recommendation in Alibaba](https://arxiv.org/pdf/1905.06874v1.pdf) |
| Rank | [AutoInt](models/rank/AutoInt/model.py) | ✓ | x | ✓ | x | [CIKM 2019][AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/pdf/1810.11921.pdf) |
| Rank | [Wide&Deep](models/rank/wide_deep/model.py) | ✓ | x | ✓ | x | [DLRS 2016][Wide & Deep Learning for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/2988450.2988454) |
| Rank | [FGCNN](models/rank/fgcnn/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2019][Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1904.04447.pdf) |
| Rank | [Fibinet](models/rank/fibinet/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction]( https://arxiv.org/pdf/1905.09433.pdf) |
| Rank | [Flen](models/rank/flen/model.py) | ✓ | ✓ | ✓ | ✓ | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf) |
| Multi-Task | [ESMM](models/multitask/esmm/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
| Multi-Task | [MMOE](models/multitask/mmoe/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |
| Multi-Task | [ShareBottom](models/multitask/share-bottom/model.py) | ✓ | ✓ | ✓ | ✓ | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf) |
| Re-Rank | [Listwise](models/rerank/listwise/model.py) | ✓ | ✓ | ✓ | x | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) |
<h2 align="center">Getting Started</h2>
### Environmental requirements
* Python 2.7/ 3.5 / 3.6 / 3.7
* PaddlePaddle >= 1.7.2
* operating system: Windows/Mac/Linux
> Linux is recommended for distributed training
### Installation
1. **Install by pip**
```bash
python -m pip install paddle-rec
```
> This method will download and install `paddlepaddle-v1.7.2-cpu`. If `PaddlePaddle` can not be installed automatically,You need to install `PaddlePaddle` manually,and then install `PaddleRec` again:
> - Download [PaddlePaddle](https://pypi.org/project/paddlepaddle/1.7.2/#files) and install by pip.
> - Install `PaddlePaddle` by pip,`python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple`
> - Other installation problems can be raised in [Paddle Issue](https://github.com/PaddlePaddle/Paddle/issues) or [PaddleRec Issue](https://github.com/PaddlePaddle/PaddleRec/issues)
2. **Install by source code**
- Install PaddlePaddle
```shell
python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple
```
- Install PaddleRec by source code
```
git clone https://github.com/PaddlePaddle/PaddleRec/
cd PaddleRec
python setup.py install
```
- Install PaddleRec-GPU
After installing `PaddleRec`,please install the appropriate version of `paddlepaddle-gpu` according to your environment (CUDA / cudnn),refer to the installation tutorial [Installation Manuals](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)
<h2 align="center">Quick Start</h2>
We take the `dnn` algorithm as an example to get start of `PaddleRec`, and we take 100 pieces of training data from [Criteo Dataset](https://www.kaggle.com/c/criteo-display-ad-challenge/):
```bash
# Training with cpu
git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
cd paddle-rec
python -m paddlerec.run -m models/rank/dnn/config.yaml
```
<h2 align="center">Documentation</h2>
### Background
* [Recommendation System](doc/rec_background.md)
* [Distributed deep learning](doc/ps_background.md)
### Introductory Project
* [Get start of PaddleRec in ten minutes](https://aistudio.baidu.com/aistudio/projectdetail/559336)
### Introductory tutorial
* [Data](doc/slot_reader.md)
* [Model](doc/model.md)
* [Loacl Train](doc/train.md)
* [Distributed Train](doc/distributed_train.md)
* [Predict](doc/predict.md)
* [Serving](doc/serving.md)
### Advanced tutorial
* [Custom Reader](doc/custom_reader.md)
* [Custom Model](doc/model_develop.md)
* [Custom Training Process](doc/trainer_develop.md)
* [Configuration description of yaml](doc/yaml.md)
* [Design document of PaddleRec](doc/design.md)
### Benchmark
* [Benchmark](doc/benchmark.md)
### FAQ
* [Common Problem FAQ](doc/faq.md)
<h2 align="center">Community</h2>
<p align="center">
<br>
<img alt="Release" src="https://img.shields.io/badge/Release-0.1.0-yellowgreen">
<img alt="License" src="https://img.shields.io/github/license/PaddlePaddle/PaddleRec">
<img alt="Slack" src="https://img.shields.io/badge/Join-Slack-green">
<br>
<p>
### Version history
- 2020.06.17 - PaddleRec v0.1.0
- 2020.06.03 - PaddleRec v0.0.2
- 2020.05.14 - PaddleRec v0.0.1
### License
[Apache 2.0 license](LICENSE)
### Contact us
For any feedback, please propose a [GitHub Issue](https://github.com/PaddlePaddle/PaddleRec/issues)
You can also communicate with us in the following ways:
- QQ group id:`861717190`
- Wechat account:`paddlerec2020`
<p align="center"><img width="200" height="200" margin="500" src="./doc/imgs/QQ_group.png"/>&#8194;&#8194;&#8194;&#8194;&#8194<img width="200" height="200" src="doc/imgs/weixin_supporter.png"/></p>
<p align="center">PaddleRec QQ Group&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;PaddleRec Wechat account</p>
echo "Run before_hook.sh ..."
wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz
wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz --no-check-certificate
tar -xf PaddleRec.tar.gz
......@@ -10,6 +10,6 @@ python setup.py install
pip uninstall -y paddlepaddle
pip install paddlepaddle-gpu==<$ PADDLEPADDLE_VERSION $> --index-url=http://pip.baidu.com/pypi/simple --trusted-host pip.baidu.com
pip install paddlepaddle==<$ PADDLEPADDLE_VERSION $> --index-url=http://pip.baidu.com/pypi/simple --trusted-host pip.baidu.com
echo "End before_hook.sh ..."
echo "Run before_hook.sh ..."
wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz
wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz --no-check-certificate
tar -xf PaddleRec.tar.gz
......
......@@ -39,7 +39,12 @@ function _before_submit() {
elif [ ${DISTRIBUTE_MODE} == "COLLECTIVE_GPU_K8S" ]; then
_gen_gpu_before_hook
_gen_k8s_config
_gen_k8s_job
_gen_k8s_gpu_job
_gen_end_hook
elif [ ${DISTRIBUTE_MODE} == "PS_CPU_K8S" ]; then
_gen_cpu_before_hook
_gen_k8s_config
_gen_k8s_cpu_job
_gen_end_hook
fi
......@@ -54,6 +59,7 @@ function _gen_mpi_config() {
-e "s#<$ OUTPUT_PATH $>#$OUTPUT_PATH#g" \
-e "s#<$ THIRDPARTY_PATH $>#$THIRDPARTY_PATH#g" \
-e "s#<$ CPU_NUM $>#$max_thread_num#g" \
-e "s#<$ USE_PYTHON3 $>#$USE_PYTHON3#g" \
-e "s#<$ FLAGS_communicator_is_sgd_optimizer $>#$FLAGS_communicator_is_sgd_optimizer#g" \
-e "s#<$ FLAGS_communicator_send_queue_size $>#$FLAGS_communicator_send_queue_size#g" \
-e "s#<$ FLAGS_communicator_thread_pool_size $>#$FLAGS_communicator_thread_pool_size#g" \
......@@ -71,6 +77,7 @@ function _gen_k8s_config() {
-e "s#<$ AFS_REMOTE_MOUNT_POINT $>#$AFS_REMOTE_MOUNT_POINT#g" \
-e "s#<$ OUTPUT_PATH $>#$OUTPUT_PATH#g" \
-e "s#<$ CPU_NUM $>#$max_thread_num#g" \
-e "s#<$ USE_PYTHON3 $>#$USE_PYTHON3#g" \
-e "s#<$ FLAGS_communicator_is_sgd_optimizer $>#$FLAGS_communicator_is_sgd_optimizer#g" \
-e "s#<$ FLAGS_communicator_send_queue_size $>#$FLAGS_communicator_send_queue_size#g" \
-e "s#<$ FLAGS_communicator_thread_pool_size $>#$FLAGS_communicator_thread_pool_size#g" \
......@@ -101,6 +108,7 @@ function _gen_end_hook() {
function _gen_mpi_job() {
echo "gen mpi_job.sh"
sed -e "s#<$ GROUP_NAME $>#$GROUP_NAME#g" \
-e "s#<$ JOB_NAME $>#$OLD_JOB_NAME#g" \
-e "s#<$ AK $>#$AK#g" \
-e "s#<$ SK $>#$SK#g" \
-e "s#<$ MPI_PRIORITY $>#$PRIORITY#g" \
......@@ -109,18 +117,34 @@ function _gen_mpi_job() {
${abs_dir}/cloud/mpi_job.sh.template >${PWD}/job.sh
}
function _gen_k8s_job() {
function _gen_k8s_gpu_job() {
echo "gen k8s_job.sh"
sed -e "s#<$ GROUP_NAME $>#$GROUP_NAME#g" \
-e "s#<$ JOB_NAME $>#$OLD_JOB_NAME#g" \
-e "s#<$ AK $>#$AK#g" \
-e "s#<$ SK $>#$SK#g" \
-e "s#<$ K8S_PRIORITY $>#$PRIORITY#g" \
-e "s#<$ K8S_TRAINERS $>#$K8S_TRAINERS#g" \
-e "s#<$ K8S_CPU_CORES $>#$K8S_CPU_CORES#g" \
-e "s#<$ K8S_GPU_CARD $>#$K8S_GPU_CARD#g" \
-e "s#<$ START_CMD $>#$START_CMD#g" \
${abs_dir}/cloud/k8s_job.sh.template >${PWD}/job.sh
}
function _gen_k8s_cpu_job() {
echo "gen k8s_job.sh"
sed -e "s#<$ GROUP_NAME $>#$GROUP_NAME#g" \
-e "s#<$ JOB_NAME $>#$OLD_JOB_NAME#g" \
-e "s#<$ AK $>#$AK#g" \
-e "s#<$ SK $>#$SK#g" \
-e "s#<$ K8S_PRIORITY $>#$PRIORITY#g" \
-e "s#<$ K8S_TRAINERS $>#$K8S_TRAINERS#g" \
-e "s#<$ K8S_PS_NUM $>#$K8S_PS_NUM#g" \
-e "s#<$ K8S_PS_CORES $>#$K8S_PS_CORES#g" \
-e "s#<$ K8S_CPU_CORES $>#$K8S_CPU_CORES#g" \
-e "s#<$ START_CMD $>#$START_CMD#g" \
${abs_dir}/cloud/k8s_cpu_job.sh.template >${PWD}/job.sh
}
#-----------------------------------------------------------------------------------------------------------------
......@@ -145,6 +169,7 @@ function _submit() {
function package_hook() {
cur_time=`date +"%Y%m%d%H%M"`
new_job_name="${JOB_NAME}_${cur_time}"
export OLD_JOB_NAME=${JOB_NAME}
export JOB_NAME=${new_job_name}
export job_file_path="${PWD}/${new_job_name}"
mkdir ${job_file_path}
......
......@@ -19,6 +19,8 @@ afs_local_mount_point="/root/paddlejob/workspace/env_run/afs/"
# 新k8s afs挂载帮助文档: http://wiki.baidu.com/pages/viewpage.action?pageId=906443193
PADDLE_PADDLEREC_ROLE=WORKER
PADDLEREC_CLUSTER_TYPE=K8S
use_python3=<$ USE_PYTHON3 $>
CPU_NUM=<$ CPU_NUM $>
GLOG_v=0
......
#!/bin/bash
###############################################################
## 注意-- 注意--注意 ##
## K8S PS-CPU多机作业作业示例 ##
###############################################################
job_name=<$ JOB_NAME $>
# 作业参数
group_name="<$ GROUP_NAME $>"
job_version="paddle-fluid-v1.7.1"
start_cmd="<$ START_CMD $>"
wall_time="2000:00:00"
k8s_priority=<$ K8S_PRIORITY $>
k8s_trainers=<$ K8S_TRAINERS $>
k8s_cpu_cores=<$ K8S_CPU_CORES $>
k8s_ps_num=<$ K8S_PS_NUM $>
k8s_ps_cores=<$ K8S_PS_CORES $>
# 你的ak/sk(可在paddlecloud web页面【个人中心】处获取)
ak=<$ AK $>
sk=<$ SK $>
paddlecloud job --ak ${ak} --sk ${sk} \
train --job-name ${job_name} \
--group-name ${group_name} \
--job-conf config.ini \
--start-cmd "${start_cmd}" \
--files ./* \
--job-version ${job_version} \
--k8s-priority ${k8s_priority} \
--wall-time ${wall_time} \
--k8s-trainers ${k8s_trainers} \
--k8s-cpu-cores ${k8s_cpu_cores} \
--k8s-ps-num ${k8s_ps_num} \
--k8s-ps-cores ${k8s_ps_cores} \
--is-standalone 0 \
--distribute-job-type "PSERVER" \
--json
\ No newline at end of file
......@@ -3,18 +3,30 @@
## 注意-- 注意--注意 ##
## K8S NCCL2多机作业作业示例 ##
###############################################################
job_name=${JOB_NAME}
job_name=<$ JOB_NAME $>
# 作业参数
group_name="<$ GROUP_NAME $>"
job_version="paddle-fluid-v1.7.1"
start_cmd="<$ START_CMD $>"
wall_time="10:00:00"
wall_time="2000:00:00"
k8s_priority=<$ K8S_PRIORITY $>
k8s_trainers=<$ K8S_TRAINERS $>
k8s_cpu_cores=<$ K8S_CPU_CORES $>
k8s_gpu_cards=<$ K8S_GPU_CARD $>
is_stand_alone=0
nccl="--distribute-job-type "NCCL2""
if [ ${k8s_trainers} == 1 ];then
is_stand_alone=1
nccl="--job-remark single-trainer"
if [ ${k8s_gpu_cards} == 1];then
nccl="--job-remark single-gpu"
echo "Attention: Use single GPU card for PaddleRec distributed training, please set runner class from 'cluster_train' to 'train' in config.yaml."
fi
fi
# 你的ak/sk(可在paddlecloud web页面【个人中心】处获取)
ak=<$ AK $>
sk=<$ SK $>
......@@ -27,9 +39,11 @@ paddlecloud job --ak ${ak} --sk ${sk} \
--files ./* \
--job-version ${job_version} \
--k8s-trainers ${k8s_trainers} \
--k8s-cpu-cores ${k8s_cpu_cores} \
--k8s-gpu-cards ${k8s_gpu_cards} \
--k8s-priority ${k8s_priority} \
--wall-time ${wall_time} \
--is-standalone 0 \
--distribute-job-type "NCCL2" \
--json
\ No newline at end of file
--is-standalone ${is_stand_alone} \
--json \
${nccl}
\ No newline at end of file
......@@ -17,6 +17,8 @@ output_path=<$ OUTPUT_PATH $>
thirdparty_path=<$ THIRDPARTY_PATH $>
PADDLE_PADDLEREC_ROLE=WORKER
PADDLEREC_CLUSTER_TYPE=MPI
use_python3=<$ USE_PYTHON3 $>
CPU_NUM=<$ CPU_NUM $>
GLOG_v=0
......
......@@ -3,13 +3,13 @@
## 注意--注意--注意 ##
## MPI 类型作业演示 ##
###############################################################
job_name=${JOB_NAME}
job_name=<$ JOB_NAME $>
# 作业参数
group_name=<$ GROUP_NAME $>
job_version="paddle-fluid-v1.7.1"
start_cmd="<$ START_CMD $>"
wall_time="2:00:00"
wall_time="2000:00:00"
# 你的ak/sk(可在paddlecloud web页面【个人中心】处获取)
ak=<$ AK $>
......
......@@ -67,10 +67,10 @@ class ClusterEngine(Engine):
@staticmethod
def workspace_replace():
workspace = envs.get_runtime_environ("workspace")
remote_workspace = envs.get_runtime_environ("remote_workspace")
for k, v in os.environ.items():
v = v.replace("{workspace}", workspace)
v = v.replace("{workspace}", remote_workspace)
os.environ[k] = str(v)
def run(self):
......@@ -98,14 +98,12 @@ class ClusterEngine(Engine):
cluster_env_check_tool = PaddleCloudMpiEnv()
else:
raise ValueError(
"Paddlecloud with Mpi don't support GPU training, check your config"
"Paddlecloud with Mpi don't support GPU training, check your config.yaml & backend.yaml"
)
elif cluster_type.upper() == "K8S":
if fleet_mode == "PS":
if device == "CPU":
raise ValueError(
"PS-CPU on paddlecloud is not supported at this time, comming soon"
)
cluster_env_check_tool = CloudPsCpuEnv()
elif device == "GPU":
raise ValueError(
"PS-GPU on paddlecloud is not supported at this time, comming soon"
......@@ -115,7 +113,7 @@ class ClusterEngine(Engine):
cluster_env_check_tool = CloudCollectiveEnv()
elif device == "CPU":
raise ValueError(
"Unexpected config -> device: CPU with fleet_mode: Collective, check your config"
"Unexpected config -> device: CPU with fleet_mode: Collective, check your config.yaml"
)
else:
raise ValueError("cluster_type {} error, must in MPI/K8S".format(
......@@ -161,23 +159,30 @@ class ClusterEnvBase(object):
self.cluster_env["PADDLE_VERSION"] = self.backend_env.get(
"config.paddle_version", "1.7.2")
# python_version
self.cluster_env["USE_PYTHON3"] = self.backend_env.get(
"config.use_python3", "0")
# communicator
max_thread_num = int(envs.get_runtime_environ("max_thread_num"))
self.cluster_env[
"FLAGS_communicator_is_sgd_optimizer"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_is_sgd_optimizer", 0)
self.cluster_env[
"FLAGS_communicator_send_queue_size"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_send_queue_size", 5)
"config.communicator.FLAGS_communicator_send_queue_size",
max_thread_num)
self.cluster_env[
"FLAGS_communicator_thread_pool_size"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_thread_pool_size", 32)
self.cluster_env[
"FLAGS_communicator_max_merge_var_num"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_max_merge_var_num", 5)
"config.communicator.FLAGS_communicator_max_merge_var_num",
max_thread_num)
self.cluster_env[
"FLAGS_communicator_max_send_grad_num_before_recv"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_max_send_grad_num_before_recv",
5)
max_thread_num)
self.cluster_env["FLAGS_communicator_fake_rpc"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_fake_rpc", 0)
self.cluster_env["FLAGS_rpc_retry_times"] = self.backend_env.get(
......@@ -234,7 +239,7 @@ class PaddleCloudMpiEnv(ClusterEnvBase):
"config.train_data_path", "")
if self.cluster_env["TRAIN_DATA_PATH"] == "":
raise ValueError(
"No -- TRAIN_DATA_PATH -- found in your backend.yaml, please check."
"No -- TRAIN_DATA_PATH -- found in your backend.yaml, please add train_data_path in your backend yaml."
)
# test_data_path
self.cluster_env["TEST_DATA_PATH"] = self.backend_env.get(
......@@ -274,7 +279,7 @@ class PaddleCloudK8sEnv(ClusterEnvBase):
category=UserWarning,
stacklevel=2)
warnings.warn(
"The remote mount point will be mounted to the ./afs/",
"The remote afs path will be mounted to the ./afs/",
category=UserWarning,
stacklevel=2)
......@@ -293,3 +298,21 @@ class CloudCollectiveEnv(PaddleCloudK8sEnv):
"submit.k8s_gpu_card", 1)
self.cluster_env["K8S_CPU_CORES"] = self.backend_env.get(
"submit.k8s_cpu_cores", 1)
class CloudPsCpuEnv(PaddleCloudK8sEnv):
def __init__(self):
super(CloudPsCpuEnv, self).__init__()
def env_check(self):
super(CloudPsCpuEnv, self).env_check()
self.cluster_env["DISTRIBUTE_MODE"] = "PS_CPU_K8S"
self.cluster_env["K8S_TRAINERS"] = self.backend_env.get(
"submit.k8s_trainers", 1)
self.cluster_env["K8S_CPU_CORES"] = self.backend_env.get(
"submit.k8s_cpu_cores", 2)
self.cluster_env["K8S_PS_NUM"] = self.backend_env.get(
"submit.k8s_ps_num", 1)
self.cluster_env["K8S_PS_CORES"] = self.backend_env.get(
"submit.k8s_ps_cores", 2)
......@@ -22,6 +22,19 @@ trainers = {}
def trainer_registry():
trainers["SingleTrainer"] = os.path.join(trainer_abs, "single_trainer.py")
trainers["ClusterTrainer"] = os.path.join(trainer_abs,
"cluster_trainer.py")
trainers["CtrCodingTrainer"] = os.path.join(trainer_abs,
"ctr_coding_trainer.py")
trainers["CtrModulTrainer"] = os.path.join(trainer_abs,
"ctr_modul_trainer.py")
trainers["TDMSingleTrainer"] = os.path.join(trainer_abs,
"tdm_single_trainer.py")
trainers["TDMClusterTrainer"] = os.path.join(trainer_abs,
"tdm_cluster_trainer.py")
trainers["OnlineLearningTrainer"] = os.path.join(
trainer_abs, "online_learning_trainer.py")
# Definition of procedure execution process
trainers["CtrCodingTrainer"] = os.path.join(trainer_abs,
"ctr_coding_trainer.py")
......
......@@ -23,34 +23,58 @@ class Metric(object):
__metaclass__ = abc.ABCMeta
def __init__(self, config):
""" """
"""R
"""
pass
def clear(self, scope=None, **kwargs):
"""
clear current value
Args:
scope: value container
params: extend varilable for clear
def clear(self, scope=None):
"""R
"""
if scope is None:
scope = fluid.global_scope()
place = fluid.CPUPlace()
for (varname, dtype) in self._need_clear_list:
if scope.find_var(varname) is None:
for key in self._global_metric_state_vars:
varname, dtype = self._global_metric_state_vars[key]
var = scope.find_var(varname)
if not var:
continue
var = scope.var(varname).get_tensor()
var = var.get_tensor()
data_array = np.zeros(var._get_dims()).astype(dtype)
var.set(data_array, place)
def calculate(self, scope, params):
def _get_global_metric_state(self, fleet, scope, metric_name, mode="sum"):
"""R
"""
calculate result
Args:
scope: value container
params: extend varilable for clear
var = scope.find_var(metric_name)
if not var:
return None
input = np.array(var.get_tensor())
if fleet is None:
return input
fleet._role_maker._barrier_worker()
old_shape = np.array(input.shape)
input = input.reshape(-1)
output = np.copy(input) * 0
fleet._role_maker._all_reduce(input, output, mode=mode)
output = output.reshape(old_shape)
return output
def calc_global_metrics(self, fleet, scope=None):
"""R
"""
if scope is None:
scope = fluid.global_scope()
global_metrics = dict()
for key in self._global_metric_state_vars:
varname, dtype = self._global_metric_state_vars[key]
global_metrics[key] = self._get_global_metric_state(fleet, scope,
varname)
return self._calculate(global_metrics)
def _calculate(self, global_metrics):
pass
@abc.abstractmethod
......
......@@ -12,6 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from precision import Precision
from .recall_k import RecallK
from .pairwise_pn import PosNegRatio
from .precision_recall import PrecisionRecall
from .auc import AUC
__all__ = ['Precision']
__all__ = ['RecallK', 'PosNegRatio', 'AUC', 'PrecisionRecall']
......@@ -18,102 +18,60 @@ import numpy as np
import paddle.fluid as fluid
from paddlerec.core.metric import Metric
from paddle.fluid.layers.tensor import Variable
class AUCMetric(Metric):
class AUC(Metric):
"""
Metric For Fluid Model
"""
def __init__(self, config, fleet):
def __init__(self,
input,
label,
curve='ROC',
num_thresholds=2**12 - 1,
topk=1,
slide_steps=1):
""" """
self.config = config
self.fleet = fleet
def clear(self, scope, params):
"""
Clear current metric value, usually set to zero
Args:
scope : paddle runtime var container
params(dict) :
label : a group name for metric
metric_dict : current metric_items in group
Return:
None
"""
self._label = params['label']
self._metric_dict = params['metric_dict']
self._result = {}
place = fluid.CPUPlace()
for metric_name in self._metric_dict:
metric_config = self._metric_dict[metric_name]
if scope.find_var(metric_config['var'].name) is None:
continue
metric_var = scope.var(metric_config['var'].name).get_tensor()
data_type = 'float32'
if 'data_type' in metric_config:
data_type = metric_config['data_type']
data_array = np.zeros(metric_var._get_dims()).astype(data_type)
metric_var.set(data_array, place)
def get_metric(self, scope, metric_name):
"""
reduce metric named metric_name from all worker
Return:
metric reduce result
"""
metric = np.array(scope.find_var(metric_name).get_tensor())
old_metric_shape = np.array(metric.shape)
metric = metric.reshape(-1)
global_metric = np.copy(metric) * 0
self.fleet._role_maker.all_reduce_worker(metric, global_metric)
global_metric = global_metric.reshape(old_metric_shape)
return global_metric[0]
def get_global_metrics(self, scope, metric_dict):
"""
reduce all metric in metric_dict from all worker
Return:
dict : {matric_name : metric_result}
"""
self.fleet._role_maker._barrier_worker()
result = {}
for metric_name in metric_dict:
metric_item = metric_dict[metric_name]
if scope.find_var(metric_item['var'].name) is None:
result[metric_name] = None
continue
result[metric_name] = self.get_metric(scope,
metric_item['var'].name)
return result
def calculate_auc(self, global_pos, global_neg):
"""R
"""
num_bucket = len(global_pos)
area = 0.0
pos = 0.0
neg = 0.0
new_pos = 0.0
new_neg = 0.0
total_ins_num = 0
for i in range(num_bucket):
index = num_bucket - 1 - i
new_pos = pos + global_pos[index]
total_ins_num += global_pos[index]
new_neg = neg + global_neg[index]
total_ins_num += global_neg[index]
area += (new_neg - neg) * (pos + new_pos) / 2
pos = new_pos
neg = new_neg
auc_value = None
if pos * neg == 0 or total_ins_num == 0:
auc_value = 0.5
else:
auc_value = area / (pos * neg)
return auc_value
def calculate_bucket_error(self, global_pos, global_neg):
if not isinstance(input, Variable):
raise ValueError("input must be Variable, but received %s" %
type(input))
if not isinstance(label, Variable):
raise ValueError("label must be Variable, but received %s" %
type(label))
auc_out, batch_auc_out, [
batch_stat_pos, batch_stat_neg, stat_pos, stat_neg
] = fluid.layers.auc(input,
label,
curve=curve,
num_thresholds=num_thresholds,
topk=topk,
slide_steps=slide_steps)
prob = fluid.layers.slice(input, axes=[1], starts=[1], ends=[2])
label_cast = fluid.layers.cast(label, dtype="float32")
label_cast.stop_gradient = True
sqrerr, abserr, prob, q, pos, total = \
fluid.contrib.layers.ctr_metric_bundle(prob, label_cast)
self._global_metric_state_vars = dict()
self._global_metric_state_vars['stat_pos'] = (stat_pos.name, "float32")
self._global_metric_state_vars['stat_neg'] = (stat_neg.name, "float32")
self._global_metric_state_vars['total_ins_num'] = (total.name,
"float32")
self._global_metric_state_vars['pos_ins_num'] = (pos.name, "float32")
self._global_metric_state_vars['q'] = (q.name, "float32")
self._global_metric_state_vars['prob'] = (prob.name, "float32")
self._global_metric_state_vars['abserr'] = (abserr.name, "float32")
self._global_metric_state_vars['sqrerr'] = (sqrerr.name, "float32")
self.metrics = dict()
self.metrics["AUC"] = auc_out
self.metrics["BATCH_AUC"] = batch_auc_out
def _calculate_bucket_error(self, global_pos, global_neg):
"""R
"""
num_bucket = len(global_pos)
......@@ -161,56 +119,69 @@ class AUCMetric(Metric):
bucket_error = error_sum / error_count if error_count > 0 else 0.0
return bucket_error
def calculate(self, scope, params):
""" """
self._label = params['label']
self._metric_dict = params['metric_dict']
self.fleet._role_maker._barrier_worker()
result = self.get_global_metrics(scope, self._metric_dict)
def _calculate_auc(self, global_pos, global_neg):
"""R
"""
num_bucket = len(global_pos)
area = 0.0
pos = 0.0
neg = 0.0
new_pos = 0.0
new_neg = 0.0
total_ins_num = 0
for i in range(num_bucket):
index = num_bucket - 1 - i
new_pos = pos + global_pos[index]
total_ins_num += global_pos[index]
new_neg = neg + global_neg[index]
total_ins_num += global_neg[index]
area += (new_neg - neg) * (pos + new_pos) / 2
pos = new_pos
neg = new_neg
auc_value = None
if pos * neg == 0 or total_ins_num == 0:
auc_value = 0.5
else:
auc_value = area / (pos * neg)
return auc_value
def _calculate(self, global_metrics):
result = dict()
for key in self._global_metric_state_vars:
if key not in global_metrics:
raise ValueError("%s not existed" % key)
result[key] = global_metrics[key][0]
if result['total_ins_num'] == 0:
self._result = result
self._result['auc'] = 0
self._result['bucket_error'] = 0
self._result['actual_ctr'] = 0
self._result['predict_ctr'] = 0
self._result['mae'] = 0
self._result['rmse'] = 0
self._result['copc'] = 0
self._result['mean_q'] = 0
return self._result
if 'stat_pos' in result and 'stat_neg' in result:
result['auc'] = self.calculate_auc(result['stat_pos'],
result['stat_neg'])
result['bucket_error'] = self.calculate_auc(result['stat_pos'],
result['stat_neg'])
if 'pos_ins_num' in result:
result['auc'] = 0
result['bucket_error'] = 0
result['actual_ctr'] = 0
result['predict_ctr'] = 0
result['mae'] = 0
result['rmse'] = 0
result['copc'] = 0
result['mean_q'] = 0
else:
result['auc'] = self._calculate_auc(result['stat_pos'],
result['stat_neg'])
result['bucket_error'] = self._calculate_bucket_error(
result['stat_pos'], result['stat_neg'])
result['actual_ctr'] = result['pos_ins_num'] / result[
'total_ins_num']
if 'abserr' in result:
result['mae'] = result['abserr'] / result['total_ins_num']
if 'sqrerr' in result:
result['rmse'] = math.sqrt(result['sqrerr'] /
result['total_ins_num'])
if 'prob' in result:
result['predict_ctr'] = result['prob'] / result['total_ins_num']
if abs(result['predict_ctr']) > 1e-6:
result['copc'] = result['actual_ctr'] / result['predict_ctr']
if 'q' in result:
result['mean_q'] = result['q'] / result['total_ins_num']
self._result = result
return result
def get_result(self):
""" """
return self._result
def __str__(self):
""" """
result = self.get_result()
result_str = "%s AUC=%.6f BUCKET_ERROR=%.6f MAE=%.6f RMSE=%.6f " \
result_str = "AUC=%.6f BUCKET_ERROR=%.6f MAE=%.6f RMSE=%.6f " \
"Actural_CTR=%.6f Predicted_CTR=%.6f COPC=%.6f MEAN Q_VALUE=%.6f Ins number=%s" % \
(self._label, result['auc'], result['bucket_error'], result['mae'], result['rmse'],
(result['auc'], result['bucket_error'], result['mae'], result['rmse'],
result['actual_ctr'],
result['predict_ctr'], result['copc'], result['mean_q'], result['total_ins_num'])
return result_str
def get_result(self):
return self.metrics
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import paddle.fluid as fluid
from paddlerec.core.metric import Metric
from paddle.fluid.initializer import Constant
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.layers.tensor import Variable
class PosNegRatio(Metric):
"""
Metric For Fluid Model
"""
def __init__(self, pos_score, neg_score):
""" """
kwargs = locals()
del kwargs['self']
helper = LayerHelper("PaddleRec_PosNegRatio", **kwargs)
if "pos_score" not in kwargs or "neg_score" not in kwargs:
raise ValueError(
"PosNegRatio expect pos_score and neg_score as inputs.")
pos_score = kwargs.get('pos_score')
neg_score = kwargs.get('neg_score')
if not isinstance(pos_score, Variable):
raise ValueError("pos_score must be Variable, but received %s" %
type(pos_score))
if not isinstance(neg_score, Variable):
raise ValueError("neg_score must be Variable, but received %s" %
type(neg_score))
wrong = fluid.layers.cast(
fluid.layers.less_equal(pos_score, neg_score), dtype='float32')
wrong_cnt = fluid.layers.reduce_sum(wrong)
right = fluid.layers.cast(
fluid.layers.less_than(neg_score, pos_score), dtype='float32')
right_cnt = fluid.layers.reduce_sum(right)
global_right_cnt, _ = helper.create_or_get_global_variable(
name="right_cnt", persistable=True, dtype='float32', shape=[1])
global_wrong_cnt, _ = helper.create_or_get_global_variable(
name="wrong_cnt", persistable=True, dtype='float32', shape=[1])
for var in [global_right_cnt, global_wrong_cnt]:
helper.set_variable_initializer(
var, Constant(
value=0.0, force_cpu=True))
helper.append_op(
type="elementwise_add",
inputs={"X": [global_right_cnt],
"Y": [right_cnt]},
outputs={"Out": [global_right_cnt]})
helper.append_op(
type="elementwise_add",
inputs={"X": [global_wrong_cnt],
"Y": [wrong_cnt]},
outputs={"Out": [global_wrong_cnt]})
self.pn = (global_right_cnt + 1.0) / (global_wrong_cnt + 1.0)
self._global_metric_state_vars = dict()
self._global_metric_state_vars['right_cnt'] = (global_right_cnt.name,
"float32")
self._global_metric_state_vars['wrong_cnt'] = (global_wrong_cnt.name,
"float32")
self.metrics = dict()
self.metrics['WrongCnt'] = global_wrong_cnt
self.metrics['RightCnt'] = global_right_cnt
self.metrics['PN'] = self.pn
def _calculate(self, global_metrics):
for key in self._global_communicate_var:
if key not in global_metrics:
raise ValueError("%s not existed" % key)
pn = (global_metrics['right_cnt'][0] + 1.0) / (
global_metrics['wrong_cnt'][0] + 1.0)
return "RightCnt=%s WrongCnt=%s PN=%s" % (
str(global_metrics['right_cnt'][0]),
str(global_metrics['wrong_cnt'][0]), str(pn))
def get_result(self):
return self.metrics
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import paddle.fluid as fluid
from paddlerec.core.metric import Metric
from paddle.fluid.initializer import Constant
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.layers.tensor import Variable
class PrecisionRecall(Metric):
"""
Metric For Fluid Model
"""
def __init__(self, input, label, class_num):
"""R
"""
kwargs = locals()
del kwargs['self']
self.num_cls = class_num
if not isinstance(input, Variable):
raise ValueError("input must be Variable, but received %s" %
type(input))
if not isinstance(label, Variable):
raise ValueError("label must be Variable, but received %s" %
type(label))
helper = LayerHelper("PaddleRec_PrecisionRecall", **kwargs)
label = fluid.layers.cast(label, dtype="int32")
label.stop_gradient = True
max_probs, indices = fluid.layers.nn.topk(input, k=1)
indices = fluid.layers.cast(indices, dtype="int32")
indices.stop_gradient = True
states_info, _ = helper.create_or_get_global_variable(
name="states_info",
persistable=True,
dtype='float32',
shape=[self.num_cls, 4])
states_info.stop_gradient = True
helper.set_variable_initializer(
states_info, Constant(
value=0.0, force_cpu=True))
batch_metrics, _ = helper.create_or_get_global_variable(
name="batch_metrics",
persistable=False,
dtype='float32',
shape=[6])
accum_metrics, _ = helper.create_or_get_global_variable(
name="global_metrics",
persistable=False,
dtype='float32',
shape=[6])
batch_states = fluid.layers.fill_constant(
shape=[self.num_cls, 4], value=0.0, dtype="float32")
batch_states.stop_gradient = True
helper.append_op(
type="precision_recall",
attrs={'class_number': self.num_cls},
inputs={
'MaxProbs': [max_probs],
'Indices': [indices],
'Labels': [label],
'StatesInfo': [states_info]
},
outputs={
'BatchMetrics': [batch_metrics],
'AccumMetrics': [accum_metrics],
'AccumStatesInfo': [batch_states]
})
helper.append_op(
type="assign",
inputs={'X': [batch_states]},
outputs={'Out': [states_info]})
batch_states.stop_gradient = True
states_info.stop_gradient = True
self._global_metric_state_vars = dict()
self._global_metric_state_vars['states_info'] = (states_info.name,
"float32")
self.metrics = dict()
self.metrics["precision_recall_f1"] = accum_metrics
self.metrics["[TP FP TN FN]"] = states_info
def _calculate(self, global_metrics):
for key in self._global_metric_state_vars:
if key not in global_metrics:
raise ValueError("%s not existed" % key)
def calc_precision(tp_count, fp_count):
if tp_count > 0.0 or fp_count > 0.0:
return tp_count / (tp_count + fp_count)
return 1.0
def calc_recall(tp_count, fn_count):
if tp_count > 0.0 or fn_count > 0.0:
return tp_count / (tp_count + fn_count)
return 1.0
def calc_f1_score(precision, recall):
if precision > 0.0 or recall > 0.0:
return 2 * precision * recall / (precision + recall)
return 0.0
states = global_metrics["states_info"]
total_tp_count = 0.0
total_fp_count = 0.0
total_fn_count = 0.0
macro_avg_precision = 0.0
macro_avg_recall = 0.0
for i in range(self.num_cls):
total_tp_count += states[i][0]
total_fp_count += states[i][1]
total_fn_count += states[i][3]
macro_avg_precision += calc_precision(states[i][0], states[i][1])
macro_avg_recall += calc_recall(states[i][0], states[i][3])
metrics = []
macro_avg_precision /= self.num_cls
macro_avg_recall /= self.num_cls
metrics.append(macro_avg_precision)
metrics.append(macro_avg_recall)
metrics.append(calc_f1_score(macro_avg_precision, macro_avg_recall))
micro_avg_precision = calc_precision(total_tp_count, total_fp_count)
metrics.append(micro_avg_precision)
micro_avg_recall = calc_recall(total_tp_count, total_fn_count)
metrics.append(micro_avg_recall)
metrics.append(calc_f1_score(micro_avg_precision, micro_avg_recall))
return "total metrics: [TP, FP, TN, FN]=%s; precision_recall_f1=%s" % (
str(states), str(np.array(metrics).astype('float32')))
def get_result(self):
return self.metrics
......@@ -18,92 +18,86 @@ import numpy as np
import paddle.fluid as fluid
from paddlerec.core.metric import Metric
from paddle.fluid.layers import nn, accuracy
from paddle.fluid.layers import accuracy
from paddle.fluid.initializer import Constant
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.layers.tensor import Variable
class Precision(Metric):
class RecallK(Metric):
"""
Metric For Fluid Model
"""
def __init__(self, **kwargs):
def __init__(self, input, label, k=20):
""" """
helper = LayerHelper("PaddleRec_Precision", **kwargs)
self.batch_accuracy = accuracy(
kwargs.get("input"), kwargs.get("label"), kwargs.get("k"))
local_ins_num, _ = helper.create_or_get_global_variable(
name="local_ins_num", persistable=True, dtype='float32',
shape=[1])
local_pos_num, _ = helper.create_or_get_global_variable(
name="local_pos_num", persistable=True, dtype='float32',
shape=[1])
batch_pos_num, _ = helper.create_or_get_global_variable(
name="batch_pos_num",
persistable=False,
dtype='float32',
shape=[1])
batch_ins_num, _ = helper.create_or_get_global_variable(
name="batch_ins_num",
persistable=False,
dtype='float32',
shape=[1])
tmp_ones = helper.create_global_variable(
name="batch_size_like_ones",
persistable=False,
dtype='float32',
shape=[-1])
for var in [
batch_pos_num, batch_ins_num, local_pos_num, local_ins_num
]:
print(var, type(var))
kwargs = locals()
del kwargs['self']
self.k = k
if not isinstance(input, Variable):
raise ValueError("input must be Variable, but received %s" %
type(input))
if not isinstance(label, Variable):
raise ValueError("label must be Variable, but received %s" %
type(label))
helper = LayerHelper("PaddleRec_RecallK", **kwargs)
batch_accuracy = accuracy(input, label, self.k)
global_ins_cnt, _ = helper.create_or_get_global_variable(
name="ins_cnt", persistable=True, dtype='float32', shape=[1])
global_pos_cnt, _ = helper.create_or_get_global_variable(
name="pos_cnt", persistable=True, dtype='float32', shape=[1])
for var in [global_ins_cnt, global_pos_cnt]:
helper.set_variable_initializer(
var, Constant(
value=0.0, force_cpu=True))
helper.append_op(
type='fill_constant_batch_size_like',
inputs={"Input": kwargs.get("label")},
outputs={'Out': [tmp_ones]},
attrs={
'shape': [-1, 1],
'dtype': tmp_ones.dtype,
'value': float(1.0),
})
helper.append_op(
type="reduce_sum",
inputs={"X": [tmp_ones]},
outputs={"Out": [batch_ins_num]})
helper.append_op(
type="elementwise_mul",
inputs={"X": [batch_ins_num],
"Y": [self.batch_accuracy]},
outputs={"Out": [batch_pos_num]})
tmp_ones = fluid.layers.fill_constant(
shape=fluid.layers.shape(label), dtype="float32", value=1.0)
batch_ins = fluid.layers.reduce_sum(tmp_ones)
batch_pos = batch_ins * batch_accuracy
helper.append_op(
type="elementwise_add",
inputs={"X": [local_pos_num],
"Y": [batch_pos_num]},
outputs={"Out": [local_pos_num]})
inputs={"X": [global_ins_cnt],
"Y": [batch_ins]},
outputs={"Out": [global_ins_cnt]})
helper.append_op(
type="elementwise_add",
inputs={"X": [local_ins_num],
"Y": [batch_ins_num]},
outputs={"Out": [local_ins_num]})
inputs={"X": [global_pos_cnt],
"Y": [batch_pos]},
outputs={"Out": [global_pos_cnt]})
self.acc = global_pos_cnt / global_ins_cnt
self.accuracy = local_pos_num / local_ins_num
self._global_metric_state_vars = dict()
self._global_metric_state_vars['ins_cnt'] = (global_ins_cnt.name,
"float32")
self._global_metric_state_vars['pos_cnt'] = (global_pos_cnt.name,
"float32")
self._need_clear_list = [("local_ins_num", "float32"),
("local_pos_num", "float32")]
metric_name = "Acc(Recall@%d)" % self.k
self.metrics = dict()
metric_varname = "P@%d" % kwargs.get("k")
self.metrics[metric_varname] = self.accuracy
self.metrics["InsCnt"] = global_ins_cnt
self.metrics["RecallCnt"] = global_pos_cnt
self.metrics[metric_name] = self.acc
# self.metrics["batch_metrics"] = batch_metrics
def _calculate(self, global_metrics):
for key in self._global_metric_state_vars:
if key not in global_metrics:
raise ValueError("%s not existed" % key)
ins_cnt = global_metrics['ins_cnt'][0]
pos_cnt = global_metrics['pos_cnt'][0]
if ins_cnt == 0:
acc = 0
else:
acc = float(pos_cnt) / ins_cnt
return "InsCnt=%s RecallCnt=%s Acc(Recall@%d)=%s" % (
str(ins_cnt), str(pos_cnt), self.k, str(acc))
def get_result(self):
return self.metrics
......@@ -107,6 +107,7 @@ class Trainer(object):
self.device = Device.GPU
gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
self._place = fluid.CUDAPlace(gpu_id)
print("PaddleRec run on device GPU: {}".format(gpu_id))
self._exe = fluid.Executor(self._place)
elif device == "CPU":
self.device = Device.CPU
......@@ -146,6 +147,7 @@ class Trainer(object):
elif engine.upper() == "CLUSTER":
self.engine = EngineMode.CLUSTER
self.is_fleet = True
self.which_cluster_type()
else:
raise ValueError("Not Support Engine {}".format(engine))
self._context["is_fleet"] = self.is_fleet
......@@ -165,6 +167,14 @@ class Trainer(object):
self._context["is_pslib"] = (fleet_mode.upper() == "PSLIB")
self._context["fleet_mode"] = fleet_mode
def which_cluster_type(self):
cluster_type = os.getenv("PADDLEREC_CLUSTER_TYPE", "MPI")
print("PADDLEREC_CLUSTER_TYPE: {}".format(cluster_type))
if cluster_type and cluster_type.upper() == "K8S":
self._context["cluster_type"] = "K8S"
else:
self._context["cluster_type"] = "MPI"
def which_executor_mode(self):
executor_mode = envs.get_runtime_environ("train.trainer.executor_mode")
if executor_mode.upper() not in ["TRAIN", "INFER"]:
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
General Trainer, applicable to many situations: Single/Cluster/Local_Cluster + PS/COLLECTIVE
"""
from __future__ import print_function
import os
from paddlerec.core.utils import envs
from paddlerec.core.trainer import Trainer, EngineMode, FleetMode
class FineTuningTrainer(Trainer):
"""
Trainer for various situations
"""
def __init__(self, config=None):
Trainer.__init__(self, config)
self.processor_register()
self.abs_dir = os.path.dirname(os.path.abspath(__file__))
self.runner_env_name = "runner." + self._context["runner_name"]
def processor_register(self):
print("processor_register begin")
self.regist_context_processor('uninit', self.instance)
self.regist_context_processor('network_pass', self.network)
self.regist_context_processor('startup_pass', self.startup)
self.regist_context_processor('train_pass', self.runner)
self.regist_context_processor('terminal_pass', self.terminal)
def instance(self, context):
instance_class_path = envs.get_global_env(
self.runner_env_name + ".instance_class_path", default_value=None)
if instance_class_path:
instance_class = envs.lazy_instance_by_fliename(
instance_class_path, "Instance")(context)
else:
if self.engine == EngineMode.SINGLE:
instance_class_name = "SingleInstance"
else:
raise ValueError(
"FineTuningTrainer can only support SingleTraining.")
instance_path = os.path.join(self.abs_dir, "framework",
"instance.py")
instance_class = envs.lazy_instance_by_fliename(
instance_path, instance_class_name)(context)
instance_class.instance(context)
def network(self, context):
network_class_path = envs.get_global_env(
self.runner_env_name + ".network_class_path", default_value=None)
if network_class_path:
network_class = envs.lazy_instance_by_fliename(network_class_path,
"Network")(context)
else:
if self.engine == EngineMode.SINGLE:
network_class_name = "FineTuningNetwork"
else:
raise ValueError(
"FineTuningTrainer can only support SingleTraining.")
network_path = os.path.join(self.abs_dir, "framework",
"network.py")
network_class = envs.lazy_instance_by_fliename(
network_path, network_class_name)(context)
network_class.build_network(context)
def startup(self, context):
startup_class_path = envs.get_global_env(
self.runner_env_name + ".startup_class_path", default_value=None)
if startup_class_path:
startup_class = envs.lazy_instance_by_fliename(startup_class_path,
"Startup")(context)
else:
if self.engine == EngineMode.SINGLE and not context["is_infer"]:
startup_class_name = "FineTuningStartup"
else:
raise ValueError(
"FineTuningTrainer can only support SingleTraining.")
startup_path = os.path.join(self.abs_dir, "framework",
"startup.py")
startup_class = envs.lazy_instance_by_fliename(
startup_path, startup_class_name)(context)
startup_class.startup(context)
def runner(self, context):
runner_class_path = envs.get_global_env(
self.runner_env_name + ".runner_class_path", default_value=None)
if runner_class_path:
runner_class = envs.lazy_instance_by_fliename(runner_class_path,
"Runner")(context)
else:
if self.engine == EngineMode.SINGLE and not context["is_infer"]:
runner_class_name = "SingleRunner"
else:
raise ValueError(
"FineTuningTrainer can only support SingleTraining.")
runner_path = os.path.join(self.abs_dir, "framework", "runner.py")
runner_class = envs.lazy_instance_by_fliename(
runner_path, runner_class_name)(context)
runner_class.run(context)
def terminal(self, context):
terminal_class_path = envs.get_global_env(
self.runner_env_name + ".terminal_class_path", default_value=None)
if terminal_class_path:
terminal_class = envs.lazy_instance_by_fliename(
terminal_class_path, "Terminal")(context)
terminal_class.terminal(context)
else:
terminal_class_name = "TerminalBase"
if self.engine != EngineMode.SINGLE and self.fleet_mode != FleetMode.COLLECTIVE:
terminal_class_name = "PSTerminal"
terminal_path = os.path.join(self.abs_dir, "framework",
"terminal.py")
terminal_class = envs.lazy_instance_by_fliename(
terminal_path, terminal_class_name)(context)
terminal_class.terminal(context)
context['is_exit'] = True
......@@ -123,10 +123,21 @@ class QueueDataset(DatasetBase):
os.path.join(train_data_path, x)
for x in os.listdir(train_data_path)
]
file_list.sort()
need_split_files = False
if context["engine"] == EngineMode.LOCAL_CLUSTER:
# for local cluster: split files for multi process
need_split_files = True
elif context["engine"] == EngineMode.CLUSTER and context[
"cluster_type"] == "K8S":
# for k8s mount afs, split files for every node
need_split_files = True
if need_split_files:
file_list = split_files(file_list, context["fleet"].worker_index(),
context["fleet"].worker_num())
print("File_list: {}".format(file_list))
dataset.set_filelist(file_list)
for model_dict in context["phases"]:
if model_dict["dataset_name"] == dataset_name:
......
......@@ -23,7 +23,7 @@ from paddlerec.core.trainers.framework.dataset import DataLoader, QueueDataset
__all__ = [
"NetworkBase", "SingleNetwork", "PSNetwork", "PslibNetwork",
"CollectiveNetwork"
"CollectiveNetwork", "FineTuningNetwork"
]
......@@ -99,7 +99,90 @@ class SingleNetwork(NetworkBase):
context["dataset"] = {}
for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] + ".type")
if type != "DataLoader":
if type == "QueueDataset":
dataset_class = QueueDataset(context)
context["dataset"][dataset[
"name"]] = dataset_class.create_dataset(dataset["name"],
context)
context["status"] = "startup_pass"
class FineTuningNetwork(NetworkBase):
"""R
"""
def __init__(self, context):
print("Running FineTuningNetwork.")
def build_network(self, context):
context["model"] = {}
for model_dict in context["phases"]:
context["model"][model_dict["name"]] = {}
train_program = fluid.Program()
startup_program = fluid.Program()
scope = fluid.Scope()
dataset_name = model_dict["dataset_name"]
with fluid.program_guard(train_program, startup_program):
with fluid.unique_name.guard():
with fluid.scope_guard(scope):
model_path = envs.os_path_adapter(
envs.workspace_adapter(model_dict["model"]))
model = envs.lazy_instance_by_fliename(
model_path, "Model")(context["env"])
model._data_var = model.input_data(
dataset_name=model_dict["dataset_name"])
if envs.get_global_env("dataset." + dataset_name +
".type") == "DataLoader":
model._init_dataloader(
is_infer=context["is_infer"])
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name,
model._data_loader)
model.net(model._data_var, context["is_infer"])
finetuning_varnames = envs.get_global_env(
"runner." + context["runner_name"] +
".finetuning_aspect_varnames",
default_value=[])
if len(finetuning_varnames) == 0:
raise ValueError(
"nothing need to be fine tuning, you may use other traning mode"
)
if len(finetuning_varnames) != 1:
raise ValueError(
"fine tuning mode can only accept one varname now"
)
varname = finetuning_varnames[0]
finetuning_vars = train_program.global_block().vars[
varname]
finetuning_vars.stop_gradient = True
optimizer = model.optimizer()
optimizer.minimize(model._cost)
context["model"][model_dict["name"]][
"main_program"] = train_program
context["model"][model_dict["name"]][
"startup_program"] = startup_program
context["model"][model_dict["name"]]["scope"] = scope
context["model"][model_dict["name"]]["model"] = model
context["model"][model_dict["name"]][
"default_main_program"] = train_program.clone()
context["model"][model_dict["name"]]["compiled_program"] = None
context["dataset"] = {}
for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] + ".type")
if type == "QueueDataset":
dataset_class = QueueDataset(context)
context["dataset"][dataset[
"name"]] = dataset_class.create_dataset(dataset["name"],
......@@ -133,9 +216,7 @@ class PSNetwork(NetworkBase):
if envs.get_global_env("dataset." + dataset_name +
".type") == "DataLoader":
model._init_dataloader(is_infer=False)
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name,
model._data_loader)
model.net(model._data_var, False)
optimizer = model.optimizer()
strategy = self._build_strategy(context)
......@@ -160,7 +241,11 @@ class PSNetwork(NetworkBase):
for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] +
".type")
if type != "DataLoader":
if type == "DataLoader":
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name,
model._data_loader)
elif type == "QueueDataset":
dataset_class = QueueDataset(context)
context["dataset"][dataset[
"name"]] = dataset_class.create_dataset(
......@@ -229,9 +314,6 @@ class PslibNetwork(NetworkBase):
if envs.get_global_env("dataset." + dataset_name +
".type") == "DataLoader":
model._init_dataloader(is_infer=False)
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name,
model._data_loader)
model.net(model._data_var, False)
optimizer = model.optimizer()
......@@ -257,7 +339,11 @@ class PslibNetwork(NetworkBase):
for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] +
".type")
if type != "DataLoader":
if type == "DataLoader":
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name, context[
"model"][model_dict["name"]]["model"]._data_loader)
elif type == "QueueDataset":
dataset_class = QueueDataset(context)
context["dataset"][dataset[
"name"]] = dataset_class.create_dataset(
......@@ -323,7 +409,10 @@ class CollectiveNetwork(NetworkBase):
context["dataset"] = {}
for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] + ".type")
if type != "DataLoader":
if type == "QueueDataset":
raise ValueError(
"Collective don't support QueueDataset training, please use DataLoader."
)
dataset_class = QueueDataset(context)
context["dataset"][dataset[
"name"]] = dataset_class.create_dataset(dataset["name"],
......
......@@ -16,10 +16,12 @@ from __future__ import print_function
import os
import time
import warnings
import numpy as np
import paddle.fluid as fluid
from paddlerec.core.utils import envs
from paddlerec.core.metric import Metric
__all__ = [
"RunnerBase", "SingleRunner", "PSRunner", "CollectiveRunner", "PslibRunner"
......@@ -77,9 +79,10 @@ class RunnerBase(object):
name = "dataset." + reader_name + "."
if envs.get_global_env(name + "type") == "DataLoader":
self._executor_dataloader_train(model_dict, context)
return self._executor_dataloader_train(model_dict, context)
else:
self._executor_dataset_train(model_dict, context)
return None
def _executor_dataset_train(self, model_dict, context):
reader_name = model_dict["dataset_name"]
......@@ -137,8 +140,10 @@ class RunnerBase(object):
metrics_varnames = []
metrics_format = []
metrics_names = ["total_batch"]
metrics_format.append("{}: {{}}".format("batch"))
for name, var in metrics.items():
metrics_names.append(name)
metrics_varnames.append(var.name)
metrics_format.append("{}: {{}}".format(name))
metrics_format = ", ".join(metrics_format)
......@@ -147,6 +152,7 @@ class RunnerBase(object):
reader.start()
batch_id = 0
scope = context["model"][model_name]["scope"]
result = None
with fluid.scope_guard(scope):
try:
while True:
......@@ -168,6 +174,10 @@ class RunnerBase(object):
except fluid.core.EOFException:
reader.reset()
if batch_id > 0:
result = dict(zip(metrics_names, metrics))
return result
def _get_dataloader_program(self, model_dict, context):
model_name = model_dict["name"]
if context["model"][model_name]["compiled_program"] == None:
......@@ -275,6 +285,7 @@ class RunnerBase(object):
return (epoch_id + 1) % epoch_interval == 0
def save_inference_model():
# get global env
name = "runner." + context["runner_name"] + "."
save_interval = int(
envs.get_global_env(name + "save_inference_interval", -1))
......@@ -287,18 +298,44 @@ class RunnerBase(object):
if feed_varnames is None or fetch_varnames is None or feed_varnames == "" or fetch_varnames == "" or \
len(feed_varnames) == 0 or len(fetch_varnames) == 0:
return
fetch_vars = [
fluid.default_main_program().global_block().vars[varname]
for varname in fetch_varnames
]
# check feed var exist
for var_name in feed_varnames:
if var_name not in fluid.default_main_program().global_block(
).vars:
raise ValueError(
"Feed variable: {} not in default_main_program, global block has follow vars: {}".
format(var_name,
fluid.default_main_program().global_block()
.vars.keys()))
# check fetch var exist
fetch_vars = []
for var_name in fetch_varnames:
if var_name not in fluid.default_main_program().global_block(
).vars:
raise ValueError(
"Fetch variable: {} not in default_main_program, global block has follow vars: {}".
format(var_name,
fluid.default_main_program().global_block()
.vars.keys()))
else:
fetch_vars.append(fluid.default_main_program()
.global_block().vars[var_name])
dirname = envs.get_global_env(name + "save_inference_path", None)
assert dirname is not None
dirname = os.path.join(dirname, str(epoch_id))
if is_fleet:
context["fleet"].save_inference_model(
context["exe"], dirname, feed_varnames, fetch_vars)
warnings.warn(
"Save inference model in cluster training is not recommended! Using save checkpoint instead.",
category=UserWarning,
stacklevel=2)
if context["fleet"].worker_index() == 0:
context["fleet"].save_inference_model(
context["exe"], dirname, feed_varnames, fetch_vars)
else:
fluid.io.save_inference_model(dirname, feed_varnames,
fetch_vars, context["exe"])
......@@ -314,7 +351,8 @@ class RunnerBase(object):
return
dirname = os.path.join(dirname, str(epoch_id))
if is_fleet:
context["fleet"].save_persistables(context["exe"], dirname)
if context["fleet"].worker_index() == 0:
context["fleet"].save_persistables(context["exe"], dirname)
else:
fluid.io.save_persistables(context["exe"], dirname)
......@@ -336,11 +374,28 @@ class SingleRunner(RunnerBase):
".epochs"))
for epoch in range(epochs):
for model_dict in context["phases"]:
model_class = context["model"][model_dict["name"]]["model"]
metrics = model_class._metrics
begin_time = time.time()
self._run(context, model_dict)
result = self._run(context, model_dict)
end_time = time.time()
seconds = end_time - begin_time
print("epoch {} done, use time: {}".format(epoch, seconds))
message = "epoch {} done, use time: {}".format(epoch, seconds)
metrics_result = []
for key in metrics:
if isinstance(metrics[key], Metric):
_str = metrics[key].calc_global_metrics(
None,
context["model"][model_dict["name"]]["scope"])
metrics_result.append(_str)
elif result is not None:
_str = "{}={}".format(key, result[key])
metrics_result.append(_str)
if len(metrics_result) > 0:
message += ", global metrics: " + ", ".join(metrics_result)
print(message)
with fluid.scope_guard(context["model"][model_dict["name"]][
"scope"]):
train_prog = context["model"][model_dict["name"]][
......@@ -362,12 +417,32 @@ class PSRunner(RunnerBase):
envs.get_global_env("runner." + context["runner_name"] +
".epochs"))
model_dict = context["env"]["phase"][0]
model_class = context["model"][model_dict["name"]]["model"]
metrics = model_class._metrics
for epoch in range(epochs):
begin_time = time.time()
self._run(context, model_dict)
result = self._run(context, model_dict)
end_time = time.time()
seconds = end_time - begin_time
print("epoch {} done, use time: {}".format(epoch, seconds))
message = "epoch {} done, use time: {}".format(epoch, seconds)
# TODO, wait for PaddleCloudRoleMaker supports gloo
from paddle.fluid.incubate.fleet.base.role_maker import GeneralRoleMaker
if context["fleet"] is not None and isinstance(context["fleet"],
GeneralRoleMaker):
metrics_result = []
for key in metrics:
if isinstance(metrics[key], Metric):
_str = metrics[key].calc_global_metrics(
context["fleet"],
context["model"][model_dict["name"]]["scope"])
metrics_result.append(_str)
elif result is not None:
_str = "{}={}".format(key, result[key])
metrics_result.append(_str)
if len(metrics_result) > 0:
message += ", global metrics: " + ", ".join(metrics_result)
print(message)
with fluid.scope_guard(context["model"][model_dict["name"]][
"scope"]):
train_prog = context["model"][model_dict["name"]][
......@@ -476,14 +551,30 @@ class SingleInferRunner(RunnerBase):
for index, epoch_name in enumerate(self.epoch_model_name_list):
for model_dict in context["phases"]:
model_class = context["model"][model_dict["name"]]["model"]
metrics = model_class._infer_results
self._load(context, model_dict,
self.epoch_model_path_list[index])
begin_time = time.time()
self._run(context, model_dict)
result = self._run(context, model_dict)
end_time = time.time()
seconds = end_time - begin_time
print("Infer {} of {} done, use time: {}".format(model_dict[
"name"], epoch_name, seconds))
message = "Infer {} of epoch {} done, use time: {}".format(
model_dict["name"], epoch_name, seconds)
metrics_result = []
for key in metrics:
if isinstance(metrics[key], Metric):
_str = metrics[key].calc_global_metrics(
None,
context["model"][model_dict["name"]]["scope"])
metrics_result.append(_str)
elif result is not None:
_str = "{}={}".format(key, result[key])
metrics_result.append(_str)
if len(metrics_result) > 0:
message += ", global metrics: " + ", ".join(metrics_result)
print(message)
context["status"] = "terminal_pass"
def _load(self, context, model_dict, model_path):
......
......@@ -17,9 +17,13 @@ from __future__ import print_function
import warnings
import paddle.fluid as fluid
import paddle.fluid.core as core
from paddlerec.core.utils import envs
__all__ = ["StartupBase", "SingleStartup", "PSStartup", "CollectiveStartup"]
__all__ = [
"StartupBase", "SingleStartup", "PSStartup", "CollectiveStartup",
"FineTuningStartup"
]
class StartupBase(object):
......@@ -65,6 +69,122 @@ class SingleStartup(StartupBase):
context["status"] = "train_pass"
class FineTuningStartup(StartupBase):
"""R
"""
def __init__(self, context):
self.op_name_scope = "op_namescope"
self.clip_op_name_scope = "@CLIP"
self.self.op_role_var_attr_name = core.op_proto_and_checker_maker.kOpRoleVarAttrName(
)
print("Running SingleStartup.")
def _is_opt_role_op(self, op):
# NOTE: depend on oprole to find out whether this op is for
# optimize
op_maker = core.op_proto_and_checker_maker
optimize_role = core.op_proto_and_checker_maker.OpRole.Optimize
if op_maker.kOpRoleAttrName() in op.attr_names and \
int(op.all_attrs()[op_maker.kOpRoleAttrName()]) == int(optimize_role):
return True
return False
def _get_params_grads(self, program):
"""
Get optimizer operators, parameters and gradients from origin_program
Returns:
opt_ops (list): optimize operators.
params_grads (dict): parameter->gradient.
"""
block = program.global_block()
params_grads = []
# tmp set to dedup
optimize_params = set()
origin_var_dict = program.global_block().vars
for op in block.ops:
if self._is_opt_role_op(op):
# Todo(chengmo): Whether clip related op belongs to Optimize guard should be discussed
# delete clip op from opt_ops when run in Parameter Server mode
if self.op_name_scope in op.all_attrs(
) and self.clip_op_name_scope in op.attr(self.op_name_scope):
op._set_attr(
"op_role",
int(core.op_proto_and_checker_maker.OpRole.Backward))
continue
if op.attr(self.op_role_var_attr_name):
param_name = op.attr(self.op_role_var_attr_name)[0]
grad_name = op.attr(self.op_role_var_attr_name)[1]
if not param_name in optimize_params:
optimize_params.add(param_name)
params_grads.append([
origin_var_dict[param_name],
origin_var_dict[grad_name]
])
return params_grads
@staticmethod
def is_persistable(var):
"""
Check whether the given variable is persistable.
Args:
var(Variable): The variable to be checked.
Returns:
bool: True if the given `var` is persistable
False if not.
Examples:
.. code-block:: python
import paddle.fluid as fluid
param = fluid.default_main_program().global_block().var('fc.b')
res = fluid.io.is_persistable(param)
"""
if var.desc.type() == core.VarDesc.VarType.FEED_MINIBATCH or \
var.desc.type() == core.VarDesc.VarType.FETCH_LIST or \
var.desc.type() == core.VarDesc.VarType.READER:
return False
return var.persistable
def load(self, context, is_fleet=False, main_program=None):
dirname = envs.get_global_env(
"runner." + context["runner_name"] + ".init_model_path", None)
if dirname is None or dirname == "":
return
print("going to load ", dirname)
params_grads = self._get_params_grads(main_program)
update_params = [p for p, _ in params_grads]
need_load_vars = []
parameters = list(
filter(FineTuningStartup.is_persistable, main_program.list_vars()))
for param in parameters:
if param not in update_params:
need_load_vars.append(param)
fluid.io.load_vars(context["exe"], dirname, main_program,
need_load_vars)
print("load from {} success".format(dirname))
def startup(self, context):
for model_dict in context["phases"]:
with fluid.scope_guard(context["model"][model_dict["name"]][
"scope"]):
train_prog = context["model"][model_dict["name"]][
"main_program"]
startup_prog = context["model"][model_dict["name"]][
"startup_program"]
with fluid.program_guard(train_prog, startup_prog):
context["exe"].run(startup_prog)
self.load(context, main_program=train_prog)
context["status"] = "train_pass"
class PSStartup(StartupBase):
def __init__(self, context):
print("Running PSStartup.")
......
......@@ -39,9 +39,21 @@ def dataloader_by_name(readerclass,
data_path = os.path.join(package_base, data_path.split("::")[1])
files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)]
files.sort()
need_split_files = False
if context["engine"] == EngineMode.LOCAL_CLUSTER:
# for local cluster: split files for multi process
need_split_files = True
elif context["engine"] == EngineMode.CLUSTER and context[
"cluster_type"] == "K8S":
# for k8s mount mode, split files for every node
need_split_files = True
print("need_split_files: {}".format(need_split_files))
if need_split_files:
files = split_files(files, context["fleet"].worker_index(),
context["fleet"].worker_num())
print("file_list : {}".format(files))
reader = reader_class(yaml_file)
......@@ -81,10 +93,20 @@ def slotdataloader_by_name(readerclass, dataset_name, yaml_file, context):
data_path = os.path.join(package_base, data_path.split("::")[1])
files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)]
files.sort()
need_split_files = False
if context["engine"] == EngineMode.LOCAL_CLUSTER:
# for local cluster: split files for multi process
need_split_files = True
elif context["engine"] == EngineMode.CLUSTER and context[
"cluster_type"] == "K8S":
# for k8s mount mode, split files for every node
need_split_files = True
if need_split_files:
files = split_files(files, context["fleet"].worker_index(),
context["fleet"].worker_num())
print("file_list: {}".format(files))
sparse = get_global_env(name + "sparse_slots", "#")
if sparse == "":
......@@ -135,10 +157,20 @@ def slotdataloader(readerclass, train, yaml_file, context):
data_path = os.path.join(package_base, data_path.split("::")[1])
files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)]
files.sort()
need_split_files = False
if context["engine"] == EngineMode.LOCAL_CLUSTER:
# for local cluster: split files for multi process
need_split_files = True
elif context["engine"] == EngineMode.CLUSTER and context[
"cluster_type"] == "K8S":
# for k8s mount mode, split files for every node
need_split_files = True
if need_split_files:
files = split_files(files, context["fleet"].worker_index(),
context["fleet"].worker_num())
print("file_list: {}".format(files))
sparse = get_global_env("sparse_slots", "#", namespace)
if sparse == "":
......
# PaddleRec 自定义数据集及Reader
用户自定义数据集及配置异步Reader,需要关注以下几个步骤:
* [数据集整理](#数据集整理)
* [在模型组网中加入输入占位符](#在模型组网中加入输入占位符)
* [Reader实现](#Reader的实现)
* [在yaml文件中配置Reader](#在yaml文件中配置reader)
我们以CTR-DNN模型为例,给出了从数据整理,变量定义,Reader写法,调试的完整历程。
* [数据及Reader示例-DNN](#数据及Reader示例-DNN)
## 数据集整理
PaddleRec支持模型自定义数据集。
关于数据的tips:
1. 数据量:
PaddleRec面向大规模数据设计,可以轻松支持亿级的数据读取,工业级的数据读写api:`dataset`在搜索、推荐、信息流等业务得到了充分打磨。
2. 文件类型:
支持任意直接可读的文本数据,`dataset`同时支持`.gz`格式的文本压缩数据,无需额外代码,可直接读取。数据样本应以`\n`为标志,按行组织。
3. 文件存放位置:
文件通常存放在训练节点本地,但同时,`dataset`支持使用`hadoop`远程读取数据,数据无需下载到本地,为dataset配置hadoop相关账户及地址即可。
4. 数据类型
Reader处理的是以行为单位的`string`数据,喂入网络的数据需要转为`int`,`float`的数值数据,不支持`string`喂入网络,不建议明文保存及处理训练数据。
5. Tips
Dataset模式下,训练线程与数据读取线程的关系强相关,为了多线程充分利用,`强烈建议将文件合理的拆为多个小文件`,尤其是在分布式训练场景下,可以均衡各个节点的数据量,同时加快数据的下载速度。
## 在模型组网中加入输入占位符
Reader读取文件后,产出的数据喂入网络,需要有占位符进行接收。占位符在Paddle中使用`fluid.data``fluid.layers.data`进行定义。`data`的定义可以参考[fluid.data](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/data_cn.html#data)以及[fluid.layers.data](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/layers_cn/data_cn.html#data)
假如您希望输入三个数据,分别是维度32的数据A,维度变长的稀疏数据B,以及一个一维的标签数据C,并希望梯度可以经过该变量向前传递,则示例如下:
数据A的定义:
```python
var_a = fluid.data(name='A', shape= [-1, 32], dtype='float32')
```
数据B的定义,变长数据的使用可以参考[LoDTensor](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/basic_concept/lod_tensor.html#cn-user-guide-lod-tensor)
```python
var_b = fluid.data(name='B', shape=[-1, 1], lod_level=1, dtype='int64')
```
数据C的定义:
```python
var_c = fluid.data(name='C', shape=[-1, 1], dtype='int32')
var_c.stop_gradient = False
```
当我们完成以上三个数据的定义后,在PaddleRec的模型定义中,还需将其加入model基类成员变量`self._data_var`
```python
self._data_var.append(var_a)
self._data_var.append(var_b)
self._data_var.append(var_c)
```
至此,我们完成了在组网中定义输入数据的工作。
## Reader的实现
### Reader的实现范式
Reader的逻辑需要一个单独的python文件进行描述。我们试写一个`test_reader.py`,实现的具体流程如下:
1. 首先我们需要引入Reader基类
```python
from paddlerec.core.reader import ReaderBase
```
2. 创建一个子类,继承Reader的基类,训练所需Reader命名为`TrainerReader`
```python
class TrainerReader(ReaderBase):
def init(self):
pass
def generator_sample(self, line):
pass
```
3.`init(self)`函数中声明一些在数据读取中会用到的变量,必要时可以在`config.yaml`文件中配置变量,利用`env.get_global_env()`拿到。
比如,我们希望从yaml文件中读取一个数据预处理变量`avg=10`,目的是将数据A的数据缩小10倍,可以这样实现:
首先更改yaml文件,在某个space下加入该变量
```yaml
...
train:
reader:
avg: 10
...
```
再更改Reader的init函数
```python
from paddlerec.core.utils import envs
class TrainerReader(Reader):
def init(self):
self.avg = envs.get_global_env("avg", None, "train.reader")
def generator_sample(self, line):
pass
```
4. 继承并实现基类中的`generate_sample(self, line)`函数,逐行读取数据。
- 该函数应返回一个可以迭代的reader方法(带有yield的函数不再是一个普通的函数,而是一个生成器generator,成为了可以迭代的对象,等价于一个数组、链表、文件、字符串etc.)
- 在这个可以迭代的函数中,如示例代码中的`def reader()`,我们定义数据读取的逻辑。以行为单位的数据进行截取,转换及预处理。
- 最后,我们需要将数据整理为特定的格式,才能够被PaddleRec的Reader正确读取,并灌入的训练的网络中。简单来说,数据的输出顺序与我们在网络中创建的`inputs`必须是严格一一对应的,并转换为类似字典的形式。
示例: 假设数据ABC在文本数据中,每行以这样的形式存储:
```shell
0.1,0.2,0.3...3.0,3.1,3.2 \t 99999,99998,99997 \t 1 \n
```
则示例代码如下:
```python
from paddlerec.core.utils import envs
class TrainerReader(Reader):
def init(self):
self.avg = envs.get_global_env("avg", None, "train.reader")
def generator_sample(self, line):
def reader(self, line):
# 先分割 '\n', 再以 '\t'为标志分割为list
variables = (line.strip('\n')).split('\t')
# A是第一个元素,并且每个数据之间使用','分割
var_a = variables[0].split(',') # list
var_a = [float(i) / self.avg for i in var_a] # 将str数据转换为float
# B是第二个元素,同样以 ',' 分割
var_b = variables[1].split(',') # list
var_b = [int(i) for i in var_b] # 将str数据转换为int
# C是第三个元素, 只有一个元素,没有分割符
var_c = variables[2]
var_c = int(var_c) # 将str数据转换为int
var_c = [var_c] # 将单独的数据元素置入list中
# 将数据与数据名结合,组织为dict的形式
# 如下,output形式为{ A: var_a, B: var_b, C: var_c}
variable_name = ['A', 'B', 'C']
output = zip(variable_name, [var_a] + [var_b] + [var_c])
# 将数据输出,使用yield方法,将该函数变为了一个可迭代的对象
yield output
```
至此,我们完成了Reader的实现。
### 在yaml文件中配置Reader
在模型的yaml配置文件中,主要的修改是三个,如下
```yaml
reader:
batch_size: 2
class: "{workspace}/reader.py"
train_data_path: "{workspace}/data/train_data"
reader_debug_mode: False
```
batch_size: 顾名思义,是小批量训练时的样本大小
class: 运行改模型所需reader的路径
train_data_path: 训练数据所在文件夹
reader_debug_mode: 测试reader语法,及输出是否符合预期的debug模式的开关
## 数据及Reader示例-DNN
Reader代码来源于[criteo_reader.py](../models/rank/criteo_reader.py), 组网代码来源于[model.py](../models/rank/dnn/model.py)
### Criteo数据集格式
CTR-DNN训练及测试数据集选用[Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/)所用的Criteo数据集。该数据集包括两部分:训练集和测试集。训练集包含一段时间内Criteo的部分流量,测试集则对应训练数据后一天的广告点击流量。
每一行数据格式如下所示:
```bash
<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
```
其中```<label>```表示广告是否被点击,点击用1表示,未点击用0表示。```<integer feature>```代表数值特征(连续特征),共有13个连续特征。```<categorical feature>```代表分类特征(离散特征),共有26个离散特征。相邻两个特征用```\t```分隔,缺失特征用空格表示。测试集中```<label>```特征已被移除。
### Criteo数据集的预处理
数据预处理共包括两步:
- 将原始训练集按9:1划分为训练集和验证集
- 数值特征(连续特征)需进行归一化处理,但需要注意的是,对每一个特征```<integer feature i>```,归一化时用到的最大值并不是用全局最大值,而是取排序后95%位置处的特征值作为最大值,同时保留极值。
### CTR网络输入的定义
正如前所述,Criteo数据集中,分为连续数据与离散(稀疏)数据,所以整体而言,CTR-DNN模型的数据输入层包括三个,分别是:`dense_input`用于输入连续数据,维度由超参数`dense_feature_dim`指定,数据类型是归一化后的浮点型数据。`sparse_input_ids`用于记录离散数据,在Criteo数据集中,共有26个slot,所以我们创建了名为`C1~C26`的26个稀疏参数输入,并设置`lod_level=1`,代表其为变长数据,数据类型为整数;最后是每条样本的`label`,代表了是否被点击,数据类型是整数,0代表负样例,1代表正样例。
在Paddle中数据输入的声明使用`paddle.fluid.layers.data()`,会创建指定类型的占位符,数据IO会依据此定义进行数据的输入。
稀疏参数输入的定义:
```python
def sparse_inputs():
ids = envs.get_global_env("hyper_parameters.sparse_inputs_slots", None)
sparse_input_ids = [
fluid.layers.data(name="S" + str(i),
shape=[1],
lod_level=1,
dtype="int64") for i in range(1, ids)
]
return sparse_input_ids
```
稠密参数输入的定义:
```python
def dense_input():
dim = envs.get_global_env("hyper_parameters.dense_input_dim", None)
dense_input_var = fluid.layers.data(name="D",
shape=[dim],
dtype="float32")
return dense_input_var
```
标签的定义:
```python
def label_input():
label = fluid.layers.data(name="click", shape=[1], dtype="int64")
return label
```
组合起来,正确的声明他们:
```python
self.sparse_inputs = sparse_inputs()
self.dense_input = dense_input()
self.label_input = label_input()
self._data_var.append(self.dense_input)
for input in self.sparse_inputs:
self._data_var.append(input)
self._data_var.append(self.label_input)
```
### Criteo Reader写法
```python
# 引入PaddleRec的Reader基类
from paddlerec.core.reader import ReaderBase
# 引入PaddleRec的读取yaml配置文件的方法
from paddlerec.core.utils import envs
# 定义TrainReader,需要继承 paddlerec.core.reader.Reader
class Reader(ReaderBase)::
# 数据预处理逻辑,继承自基类
# 如果无需处理, 使用pass跳过该函数的执行
def init(self):
self.cont_min_ = [0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
self.cont_max_ = [20, 600, 100, 50, 64000, 500, 100, 50, 500, 10, 10, 10, 50]
self.cont_diff_ = [20, 603, 100, 50, 64000, 500, 100, 50, 500, 10, 10, 10, 50]
self.hash_dim_ = envs.get_global_env("hyper_parameters.sparse_feature_number", None, "train.model")
self.continuous_range_ = range(1, 14)
self.categorical_range_ = range(14, 40)
# 读取数据方法,继承自基类
# 实现可以迭代的reader函数,逐行处理数据
def generate_sample(self, line):
"""
Read the data line by line and process it as a dictionary
"""
def reader():
"""
This function needs to be implemented by the user, based on data format
"""
features = line.rstrip('\n').split('\t')
dense_feature = []
sparse_feature = []
for idx in self.continuous_range_:
if features[idx] == "":
dense_feature.append(0.0)
else:
dense_feature.append(
(float(features[idx]) - self.cont_min_[idx - 1]) /
self.cont_diff_[idx - 1])
for idx in self.categorical_range_:
sparse_feature.append(
[hash(str(idx) + features[idx]) % self.hash_dim_])
label = [int(features[0])]
feature_name = ["D"]
for idx in self.categorical_range_:
feature_name.append("S" + str(idx - 13))
feature_name.append("label")
yield zip(feature_name, [dense_feature] + sparse_feature + [label])
return reader
```
### 调试Reader
在Linux下运行时,默认启动`Dataset`模式,在Win/Mac下运行时,默认启动`Dataloader`模式。
通过在`config.yaml`中添加或修改`reader_debug_mode=True`打开debug模式,只会结合组网运行reader的部分,读取10条样本,并print,方便您观察格式是否符合预期或隐藏bug。
```yaml
reader:
batch_size: 2
class: "{workspace}/../criteo_reader.py"
train_data_path: "{workspace}/data/train"
reader_debug_mode: True
```
修改后,使用paddlerec.run执行该修改后的yaml文件,可以观察输出。
```bash
python -m paddlerec.run -m ./models/rank/dnn/config.yaml -e single
```
### Dataset调试
dataset输出的数据格式如下:
` dense_input:size ; dense_input:value ; sparse_input:size ; sparse_input:value ; ... ; sparse_input:size ; sparse_input:value ; label:size ; label:value `
基本规律是对于每个变量,会先输出其维度大小,再输出其具体值。
直接debug `criteo_reader`理想的输出为(截取了一个片段):
```bash
...
13 0.0 0.00497512437811 0.05 0.08 0.207421875 0.028 0.35 0.08 0.082 0.0 0.4 0.0 0.08 1 737395 1 210498 1 903564 1 286224 1 286835 1 906818 1 90
6116 1 67180 1 27346 1 51086 1 142177 1 95024 1 157883 1 873363 1 600281 1 812592 1 228085 1 35900 1 880474 1 984402 1 100885 1 26235 1 410878 1 798162 1 499868 1 306163 1 0
...
```
可以看到首先输出的是13维的dense参数,随后是分立的sparse参数,最后一个是1维的label,数值为0,输出符合预期。
>使用Dataset的一些注意事项
> - Dataset的基本原理:将数据print到缓存,再由C++端的代码实现读取,因此,我们不能在dataset的读取代码中,加入与数据读取无关的print信息,会导致C++端拿到错误的数据信息。
> - dataset目前只支持在`unbuntu`及`CentOS`等标准Linux环境下使用,在`Windows`及`Mac`下使用时,会产生预料之外的错误,请知悉。
### DataLoader调试
dataloader的输出格式为`list: [ list[var_1], list[var_2], ... , list[var_3]]`,每条样本的数据会被放在一个 **list[list]** 中,list[0]为第一个variable。
直接debug `criteo_reader`理想的输出为(截取了一个片段):
```bash
...
[[0.0, 0.004975124378109453, 0.05, 0.08, 0.207421875, 0.028, 0.35, 0.08, 0.082, 0.0, 0.4, 0.0, 0.08], [560746], [902436], [262029], [182633], [368411], [735166], [321120], [39572], [185732], [140298], [926671], [81559], [461249], [728372], [915018], [907965], [818961], [850958], [311492], [980340], [254960], [175041], [524857], [764893], [526288], [220126], [0]]
...
```
可以看到首先输出的是13维的dense参数的list,随后是分立的sparse参数,各自在一个list中,最后一个是1维的label的list,数值为0,输出符合预期。
......@@ -9,6 +9,7 @@
- [第三步:增加集群运行`backend.yaml`配置](#第三步增加集群运行backendyaml配置)
- [MPI集群的Parameter Server模式配置](#mpi集群的parameter-server模式配置)
- [K8S集群的Collective模式配置](#k8s集群的collective模式配置)
- [K8S集群的PS-CPU模式配置](#k8s集群的ps-cpu模式配置)
- [第四步:任务提交](#第四步任务提交)
- [使用PaddleCloud Client提交](#使用paddlecloud-client提交)
- [第一步:在`before_hook.sh`里手动安装PaddleRec](#第一步在before_hooksh里手动安装paddlerec)
......@@ -34,10 +35,10 @@
分布式运行首先需要更改`config.yaml`,主要调整以下内容:
- workspace: 调整为在节点运行时的工作目录
- runner_class: 从单机的"train"调整为"cluster_train"
- fleet_mode: 选则参数服务器模式,抑或GPU Collective模式
- distribute_strategy: 可选项,选择分布式训练的策略
- workspace: 调整为在远程节点运行时的工作目录,一般设置为`"./"`即可
- runner_class: 从单机的"train"调整为"cluster_train",单机训练->分布式训练(例外情况,k8s上单机单卡训练仍然为train,后续支持)
- fleet_mode: 选择参数服务器模式(ps),或者GPU的all-reduce模式(collective)
- distribute_strategy: 可选项,选择分布式训练的策略,目前只在参数服务器模式下生效,可选项:`sync、asycn、half_async、geo`
配置选项具体参数,可以参考[yaml配置说明](./yaml.md)
......@@ -47,50 +48,72 @@
```yaml
# workspace
workspace: "paddlerec.models.rank.dnn"
workspace: "models/rank/dnn"
mode: [single_cpu_train]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: single_cpu_train
class: train
# num of epochs
epochs: 4
# device to run training or infer
device: cpu
save_checkpoint_interval: 2 # save model interval of epochs
save_checkpoint_path: "increment_dnn" # save checkpoint path
init_model_path: "" # load model path
save_checkpoint_interval: 2
save_checkpoint_path: "increment_dnn"
init_model_path: ""
print_interval: 10
phases: [phase1]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
data_path: "{workspace}/data/sample_data/train"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
thread_num: 1
```
分布式的训练配置可以改为:
```yaml
# workspace
# 改变一:代码上传至节点后,与运行shell同在一个默认目录下
# 改变一:代码上传至节点后,在默认目录下
workspace: "./"
mode: [ps_cluster]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: ps_cluster
# 改变二:调整runner的class
class: cluster_train
# num of epochs
epochs: 4
# device to run training or infer
device: cpu
# 改变三 & 四: 指定fleet_mode 与 distribute_strategy
fleet_mode: ps
distribute_strategy: async
save_checkpoint_interval: 2 # save model interval of epochs
save_checkpoint_path: "increment_dnn" # save checkpoint path
init_model_path: "" # load model path
save_checkpoint_interval: 2
save_checkpoint_path: "increment_dnn"
init_model_path: ""
print_interval: 10
phases: [phase1]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
# 改变五: 改变数据的读取目录
# 通常而言,mpi模式下,数据会下载到远程节点执行目录的'./train_data'下, k8s则与挂载位置有关
data_path: "{workspace}/train_data"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
# 分布式训练节点的CPU_NUM环境变量与thread_num相等,多个phase时,取最大的thread_num
thread_num: 1
```
除此之外,还需关注数据及模型加载的路径,一般而言:
......@@ -110,6 +133,8 @@ cluster_type: mpi # k8s 可选
config:
# 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2
paddle_version: "1.7.2"
# 是否使用PaddleCloud运行环境下的Python3,默认使用python2
use_python3: 1
# hdfs/afs的配置信息填写
fs_name: "afs://xxx.com"
......@@ -130,11 +155,13 @@ config:
# paddle参数服务器分布式底层超参,无特殊需求不理不改
communicator:
# 使用SGD优化器时,建议设置为1
FLAGS_communicator_is_sgd_optimizer: 0
# 以下三个变量默认都等于训练时的线程数:CPU_NUM
FLAGS_communicator_send_queue_size: 5
FLAGS_communicator_thread_pool_size: 32
FLAGS_communicator_max_merge_var_num: 5
FLAGS_communicator_max_send_grad_num_before_recv: 5
FLAGS_communicator_thread_pool_size: 32
FLAGS_communicator_fake_rpc: 0
FLAGS_rpc_retry_times: 3
......@@ -165,7 +192,14 @@ submit:
# for k8s gpu
# k8s gpu 模式下,训练节点数,及每个节点上的GPU卡数
k8s_trainers: 2
k8s_cpu_cores: 4
k8s_gpu_card: 1
# for k8s ps-cpu
k8s_trainers: 2
k8s_cpu_cores: 4
k8s_ps_num: 2
k8s_ps_cores: 4
```
......@@ -173,18 +207,51 @@ submit:
除此之外,我们还需要关注上传到工作目录的文件(`files选项`)的路径问题,在示例中是`./*.py`,说明我们执行任务提交时,与这些py文件在同一目录。若不在同一目录,则需要适当调整files路径,或改为这些文件的绝对路径。
不建议利用`files`上传数据文件,可以通过指定`train_data_path`自动下载,或指定`afs_remote_mount_point`挂载实现数据到节点的转移。
不建议利用`files`上传过大的数据文件,可以通过指定`train_data_path`自动下载,或在k8s模式下指定`afs_remote_mount_point`挂载实现数据到节点的转移。
#### MPI集群的Parameter Server模式配置
下面是一个利用PaddleCloud提交MPI参数服务器模式任务的`backend.yaml`示例
首先调整`config.yaml`:
```yaml
workspace: "./"
mode: [ps_cluster]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
data_path: "{workspace}/train_data"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
runner:
- name: ps_cluster
class: cluster_train
epochs: 2
device: cpu
fleet_mode: ps
save_checkpoint_interval: 1
save_checkpoint_path: "increment_dnn"
init_model_path: ""
print_interval: 1
phases: [phase1]
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
thread_num: 1
```
再新增`backend.yaml`
```yaml
backend: "PaddleCloud"
cluster_type: mpi # k8s 可选
cluster_type: mpi # k8s可选
config:
# 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2
paddle_version: "1.7.2"
# hdfs/afs的配置信息填写
......@@ -229,9 +296,45 @@ submit:
下面是一个利用PaddleCloud提交K8S集群进行GPU训练的`backend.yaml`示例
首先调整`config.yaml`
```yaml
workspace: "./"
mode: [collective_cluster]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
data_path: "{workspace}/afs/挂载数据文件夹的路径"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
runner:
- name: collective_cluster
class: cluster_train
epochs: 2
device: gpu
fleet_mode: collective
save_checkpoint_interval: 1 # save model interval of epochs
save_checkpoint_path: "increment_dnn" # save checkpoint path
init_model_path: "" # load model path
print_interval: 1
phases: [phase1]
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
thread_num: 1
```
再增加`backend.yaml`
```yaml
backend: "PaddleCloud"
cluster_type: mpi # k8s 可选
cluster_type: k8s # mpi 可选
config:
# 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2
......@@ -271,9 +374,93 @@ submit:
# for k8s gpu
# k8s gpu 模式下,训练节点数,及每个节点上的GPU卡数
k8s_trainers: 2
k8s_cpu_cores: 4
k8s_gpu_card: 1
```
#### K8S集群的PS-CPU模式配置
下面是一个利用PaddleCloud提交K8S集群进行参数服务器CPU训练的`backend.yaml`示例
首先调整`config.yaml`:
```yaml
workspace: "./"
mode: [ps_cluster]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
data_path: "{workspace}/afs/挂载数据文件夹的路径"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
runner:
- name: ps_cluster
class: cluster_train
epochs: 2
device: cpu
fleet_mode: ps
save_checkpoint_interval: 1
save_checkpoint_path: "increment_dnn"
init_model_path: ""
print_interval: 1
phases: [phase1]
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
thread_num: 1
```
再新增`backend.yaml`
```yaml
backend: "PaddleCloud"
cluster_type: k8s # mpi 可选
config:
# 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2
paddle_version: "1.7.2"
# hdfs/afs的配置信息填写
fs_name: "afs://xxx.com"
fs_ugi: "usr,pwd"
# 填任务输出目录的远程地址,如afs:/user/your/path/ 则此处填 /user/your/path
output_path: ""
# for k8s
# 填远程挂载地址,如afs:/user/your/path/ 则此处填 /user/your/path
afs_remote_mount_point: ""
submit:
# PaddleCloud 个人信息 AK 及 SK
ak: ""
sk: ""
# 任务运行优先级,默认high
priority: "high"
# 任务名称
job_name: "PaddleRec_CTR"
# 训练资源所在组
group: ""
# 节点上的任务启动命令
start_cmd: "python -m paddlerec.run -m ./config.yaml"
# 本地需要上传到节点工作目录的文件
files: ./*.py ./*.yaml
# for k8s gpu
# k8s ps-cpu 模式下,训练节点数,参数服务器节点数,及每个节点上的cpu核心数及内存限制
k8s_trainers: 2
k8s_cpu_cores: 4
k8s_ps_num: 2
k8s_ps_cores: 4
```
### 第四步:任务提交
当我们准备好`config.yaml``backend.yaml`,便可以进行一键任务提交,命令为:
......
# 如何给模型增加Metric
## PaddleRec Metric使用示例
```
from paddlerec.core.model import ModelBase
from paddlerec.core.metrics import RecallK
class Model(ModelBase):
def __init__(self, config):
ModelBase.__init__(self, config)
def net(self, inputs, is_infer=False):
...
acc = RecallK(input=logits, label=label, k=20)
self._metrics["Train_P@20"] = acc
```
## Metric类
### 成员变量
> _global_metric_state_vars(dict),
字典类型,用以存储metric计算过程中需要的中间状态变量。一般情况下,这些中间状态需要是Persistable=True的变量,所以会在模型保存的时候也会被保存下来。因此infer阶段需手动将这些中间状态值清零,进而保证预测结果的正确性。
### 成员函数
> clear(self, scope):
从scope中将self._global_metric_state_vars中的状态值全清零。该函数一般用在**infer**阶段开始的时候。用以保证预测指标的正确性。
> calc_global_metrics(self, fleet, scope=None):
将self._global_metric_state_vars中的状态值在所有训练节点上做all_reduce操作,进而下一步调用_calculate()函数计算全局指标。若fleet=None,则all_reduce的结果为自己本身,即单机全局指标计算。
> get_result(self): 返回训练过程中需要fetch,并定期打印至屏幕的变量。返回类型为dict。
## Metrics
### AUC
> AUC(input ,label, curve='ROC', num_thresholds=2**12 - 1, topk=1, slide_steps=1)
Auc,全称Area Under the Curve(AUC),该层根据前向输出和标签计算AUC,在二分类(binary classification)估计中广泛使用。在二分类(binary classification)中广泛使用。相关定义参考 https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve 。
#### 参数
- **input(Tensor|LoDTensor)**: 数据类型为float32,float64。浮点二维变量。输入为网络的预测值。shape为[batch_size, 2]。
- **label(Tensor|LoDTensor)**: 数据类型为int64,int32。输入为数据集的标签。shape为[batch_size, 1]。
- **curve(str)**: 曲线类型,可以为 ROC 或 PR,默认 ROC。
- **num_thresholds(int)**: 将roc曲线离散化时使用的临界值数。默认200。
- **topk(int)**: 取topk的输出值用于计算。
- **slide_steps(int)**: - 当计算batch auc时,不仅用当前步也用于先前步。slide_steps=1,表示用当前步;slide_steps = 3表示用当前步和前两步;slide_steps = 0,则用所有步。
#### 返回值
该指标训练过程中定期的变量有两个:
- **AUC**: 整体AUC值
- **BATCH_AUC**:当前batch的AUC值
### PrecisionRecall
> PrecisionRecall(input, label, class_num)
计算precison, recall, f1。
#### 参数
- **input(Tensor|LoDTensor)**: 数据类型为float32,float64。输入为网络的预测值。shape为[batch_size, class_num]
- **label(Tensor|LoDTensor)**: 数据类型为int32。输入为数据集的标签。shape为 [batch_size, 1]
- **class_num(int)**: 类别个数。
#### 返回值
- **[TP FP TN FN]**: 形状为[class_num, 4]的变量,用以表征每种类型的TP,FP,TN和FN值。TP=true positive, FP=false positive, TN=true negative, FN=false negative。若需计算每种类型的precison, recall,f1, 则可根据如下公式进行计算:
precision = TP / (TP + FP); recall = TP = TP / (TP + FN); F1 = 2 * precision * recall / (precision + recall)。
- **precision_recall_f1**: 形状为[6],分别代表[macro_avg_precision, macro_avg_recall, macro_avg_f1, micro_avg_precision, micro_avg_recall, micro_avg_f1],这里macro代表先计算每种类型的准确率,召回率,F1,然后求平均。micro代表先计算所有类型的整体TP,TN, FP, FN等中间值,然后在计算准确率,召回率,F1.
### RecallK
> RecallK(input, label, k=20)
TopK的召回准确率,对于任意一条样本来说,若前top_k个分类结果中包含正确分类标签,则视为正样本。
#### 参数
- **input(Tensor|LoDTensor)**: 数据类型为float32,float64。输入为网络的预测值。shape为[batch_size, class_dim]
- **label(Tensor|LoDTensor)**: 数据类型为int64,int32。输入为数据集的标签。shape为 [batch_size, 1]
- **k(int)**: 取每个类别中top_k个预测值用于计算召回准确率。
#### 返回值
- **InsCnt**:样本总数
- **RecallCnt**: topk可以正确被召回的样本数
- **Acc(Recall@k)**: RecallCnt/InsCnt,即Topk召回准确率。
## PairWise_PN
> PosNegRatio(pos_score, neg_score)
正逆序指标,一般用在输入是pairwise的模型中。例如输入既包含正样本,也包含负样本,模型需要去学习最大化正负样本打分的差异。
#### 参数
- **pos_score(Tensor|LoDTensor)**: 正样本的打分,数据类型为float32,float64。浮点二维变量,值的范围为[0,1]。
- **neg_score(Tensor|LoDTensor)**:负样本的打分。数据类型为float32,float64。浮点二维变量,值的范围为[0,1]。
#### 返回值
- **RightCnt**: pos_score > neg_score的样本数
- **WrongCnt**: pos_score <= neg_score的样本数
- **PN**: (RightCnt + 1.0) / (WrongCnt + 1.0), 正逆序,+1.0是为了避免除0错误。
### Customized_Metric
如果你需要在自定义metric,那么你需要按如下步骤操作:
1. 继承paddlerec.core.Metric,定义你的MyMetric类。
2. 在MyMetric的构造函数中,自定义Metric组网,声明self._global_metric_state_vars私有变量。
3. 定义_calculate(global_metrics),全局指标计算。该函数的输入globla_metrics,存储了self._global_metric_state_vars中所有中间状态变量的全局统计值。最终结果以str格式返回。
自定义Metric模版如下,你可以参考注释,或paddlerec.core.metrics下已经实现的precision_recall, auc, pairwise_pn, recall_k等指标的计算方式,自定义自己的Metric类。
```
from paddlerec.core.Metric import Metric
class MyMetric(Metric):
def __init__(self):
# 1. 自定义Metric组网
** 1. your code **
# 2. 设置中间状态字典
self._global_metric_state_vars = dict()
** 2. your code **
def get_result(self):
# 3. 定义训练过程中需要打印的变量,以字典格式返回
self. _metrics = dict()
** 3. your code **
def _calculate(self, global_metrics):
# 4. 全局指标计算,global_metrics为字典类型,存储了self._global_metric_state_vars中所有中间状态变量的全局统计值。返回格式为str。
** your code **
```
......@@ -92,7 +92,7 @@ def input_data(self, is_infer=False, **kwargs):
return train_inputs
```
更多数据读取教程,请参考[自定义数据集及Reader](custom_dataset_reader.md)
更多数据读取教程,请参考[自定义数据集及Reader](custom_reader.md)
### 组网的定义
......@@ -113,6 +113,8 @@ def input_data(self, is_infer=False, **kwargs):
可以参考官方模型的示例学习net的构造方法。
除可以使用Paddle的Metrics接口外,PaddleRec也统一封装了一些常见的Metrics评价指标,并允许开发者定义自己的Metrics类,相关文件参考[Metrics开发文档](metrics.md)
## 如何运行自定义模型
记录`model.py`,`config.yaml`及数据读取`reader.py`的文件路径,建议置于同一文件夹下,如`/home/custom_model`下,更改`config.yaml`中的配置选项
......
# PaddleRec 预训练模型
PaddleRec基于业务实践,使用真实数据,产出了推荐领域算法的若干预训练模型,方便开发者进行算法调研。
## 文本分类预训练模型
### 获取地址
```bash
wget xxx.tar.gz
```
### 使用方法
解压后,得到的是一个paddle的模型文件夹,使用`PaddleRec/models/contentunderstanding/classification_finetue`模型进行加载
......@@ -20,7 +20,7 @@ python -m paddlerec.run -m paddlerec.models.xxx.yyy
例如启动`recall`下的`word2vec`模型的默认配置;
```shell
python -m paddlerec.run -m paddlerec.models.recall.word2vec
python -m paddlerec.run -m models/recall/word2vec
```
### 2. 启动内置模型的个性化配置训练
......
# PaddleRec yaml配置说明
# PaddleRec config.yaml配置说明
## 全局变量
......@@ -12,31 +12,31 @@
## runner变量
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :---------------------------: | :----------: | :-------------------------------------------: | :------: | :------------------------------------------------------------------: |
| name | string | 任意 | 是 | 指定runner名称 |
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :---------------------------: | :----------: | :-------------------------------------------------------: | :------: | :------------------------------------------------------------------: |
| name | string | 任意 | 是 | 指定runner名称 |
| class | string | train(默认) / infer / local_cluster_train / cluster_train | 是 | 指定运行runner的类别(单机/分布式, 训练/预测) |
| device | string | cpu(默认) / gpu | 否 | 程序执行设备 |
| fleet_mode | string | ps(默认) / pslib / collective | 否 | 分布式运行模式 |
| selected_gpus | string | "0"(默认) | 否 | 程序运行GPU卡号,若以"0,1"的方式指定多卡,则会默认启用collective模式 |
| worker_num | int | 1(默认) | 否 | 参数服务器模式下worker的数量 |
| server_num | int | 1(默认) | 否 | 参数服务器模式下server的数量 |
| distribute_strategy | string | async(默认)/sync/half_async/geo | 否 | 参数服务器模式下训练模式的选择 |
| epochs | int | >= 1 | 否 | 模型训练迭代轮数 |
| phases | list[string] | 由phase name组成的list | 否 | 当前runner的训练过程列表,顺序执行 |
| init_model_path | string | 路径 | 否 | 初始化模型地址 |
| save_checkpoint_interval | int | >= 1 | 否 | Save参数的轮数间隔 |
| save_checkpoint_path | string | 路径 | 否 | Save参数的地址 |
| save_inference_interval | int | >= 1 | 否 | Save预测模型的轮数间隔 |
| save_inference_path | string | 路径 | 否 | Save预测模型的地址 |
| save_inference_feed_varnames | list[string] | 组网中指定Variable的name | 否 | 预测模型的入口变量name |
| save_inference_fetch_varnames | list[string] | 组网中指定Variable的name | 否 | 预测模型的出口变量name |
| print_interval | int | >= 1 | 否 | 训练指标打印batch间隔 |
| instance_class_path | string | 路径 | 否 | 自定义instance流程实现的地址 |
| network_class_path | string | 路径 | 否 | 自定义network流程实现的地址 |
| startup_class_path | string | 路径 | 否 | 自定义startup流程实现的地址 |
| runner_class_path | string | 路径 | 否 | 自定义runner流程实现的地址 |
| terminal_class_path | string | 路径 | 否 | 自定义terminal流程实现的地址 |
| device | string | cpu(默认) / gpu | 否 | 程序执行设备 |
| fleet_mode | string | ps(默认) / pslib / collective | 否 | 分布式运行模式 |
| selected_gpus | string | "0"(默认) | 否 | 程序运行GPU卡号,若以"0,1"的方式指定多卡,则会默认启用collective模式 |
| worker_num | int | 1(默认) | 否 | 参数服务器模式下worker的数量 |
| server_num | int | 1(默认) | 否 | 参数服务器模式下server的数量 |
| distribute_strategy | string | async(默认)/sync/half_async/geo | 否 | 参数服务器模式下训练模式的选择 |
| epochs | int | >= 1 | 否 | 模型训练迭代轮数 |
| phases | list[string] | 由phase name组成的list | 否 | 当前runner的训练过程列表,顺序执行 |
| init_model_path | string | 路径 | 否 | 初始化模型地址 |
| save_checkpoint_interval | int | >= 1 | 否 | Save参数的轮数间隔 |
| save_checkpoint_path | string | 路径 | 否 | Save参数的地址 |
| save_inference_interval | int | >= 1 | 否 | Save预测模型的轮数间隔 |
| save_inference_path | string | 路径 | 否 | Save预测模型的地址 |
| save_inference_feed_varnames | list[string] | 组网中指定Variable的name | 否 | 预测模型的入口变量name |
| save_inference_fetch_varnames | list[string] | 组网中指定Variable的name | 否 | 预测模型的出口变量name |
| print_interval | int | >= 1 | 否 | 训练指标打印batch间隔 |
| instance_class_path | string | 路径 | 否 | 自定义instance流程实现的地址 |
| network_class_path | string | 路径 | 否 | 自定义network流程实现的地址 |
| startup_class_path | string | 路径 | 否 | 自定义startup流程实现的地址 |
| runner_class_path | string | 路径 | 否 | 自定义runner流程实现的地址 |
| terminal_class_path | string | 路径 | 否 | 自定义terminal流程实现的地址 |
......@@ -70,3 +70,55 @@
| optimizer.learning_rate | float | > 0 | 否 | 指定学习率 |
| reg | float | > 0 | 否 | L2正则化参数,只在SGD下生效 |
| others | / | / | / | 由各个模型组网独立指定 |
# PaddleRec backend.yaml配置说明
## 全局变量
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :----------: | :----: | :-------------: | :------: | :----------------------------------------------: |
| backend | string | paddlecloud/k8s | 是 | 使用PaddleCloud平台提交,还是在公有云K8S集群提交 |
| cluster_type | string | mpi/k8s | 是 | 指定运行的计算集群: mpi 还是 k8s |
## config
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :--------------------: | :----: | :-------------------------------------: | :------: | :------------------------------------------------------------------------------------------: |
| paddle_version | string | paddle官方版本号,如1.7.2/1.8.0/1.8.3等 | 否 | 指定运行训练使用的Paddle版本,默认1.7.2 |
| use_python3 | int | 0(默认)/1 | 否 | 指定是否使用python3进行训练 |
| fs_name | string | "afs://xxx.com" | 是 | hdfs/afs集群名称所需配置 |
| fs_ugi | string | "usr,pwd" | 是 | hdfs/afs集群密钥所需配置 |
| output_path | string | "/user/your/path" | 否 | 任务输出的远程目录 |
| train_data_path | string | "/user/your/path" | 是 | mpi集群下指定训练数据路径,paddlecloud会自动将数据分片并下载到工作目录的`./train_data`文件夹 |
| test_data_path | string | "/user/your/path" | 否 | mpi集群下指定测试数据路径,会自动下载到工作目录的`./test_data`文件夹 |
| thirdparty_path | string | "/user/your/path" | 否 | mpi集群下指定thirdparty路径,会自动下载到工作目录的`./thirdparty`文件夹 |
| afs_remote_mount_point | string | "/user/your/path" | 是 | k8s集群下指定远程路径的地址,会挂载到工作目录的`./afs/下` |
### config.communicator
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :----------------------------------------------: | :---: | :------------: | :------: | :----------------------------------------------------: |
| FLAGS_communicator_is_sgd_optimizer | int | 0(默认)/1 | 否 | 异步分布式训练时的多线程的梯度融合方式是否使用SGD模式 |
| FLAGS_communicator_send_queue_size | int | 线程数(默认) | 否 | 分布式训练时发送队列的大小 |
| FLAGS_communicator_max_merge_var_num | int | 线程数(默认) | 否 | 分布式训练多线程梯度融合时,线程数的配置 |
| FLAGS_communicator_max_send_grad_num_before_recv | int | 线程数(默认) | 否 | 分布式训练使用独立recv参数线程时,与send的步调配置超参 |
| FLAGS_communicator_thread_pool_size | int | 32(默认) | 否 | 分布式训练时,多线程发送参数的线程池大小 |
| FLAGS_communicator_fake_rpc | int | 0(默认)/1 | 否 | 分布式训练时,选择不进行通信 |
| FLAGS_rpc_retry_times | int | 3(默认) | 否 | 分布式训练时,GRPC的失败重试次数 |
## submit
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :-----------: | :----: | :-------------------------: | :------: | :------------------------------------------------------: |
| ak | string | PaddleCloud平台提供的ak密钥 | 是 | paddlecloud用户配置 |
| sk | string | PaddleCloud平台提供的sk密钥 | 否 | paddlecloud用户配置 |
| priority | string | normal/high/very_high | 否 | 任务优先级 |
| job_name | string | 任意 | 是 | 任务名称 |
| group | string | 计算资源所在组名称 | 是 | 组名称 |
| start_cmd | string | 任意 | 是 | 启动命令,默认`python -m paddlerec.run -m ./config.yaml` |
| files | string | 任意 | 是 | 随任务提交上传的文件,给出相对或绝对路径 |
| nodes | int | >=1(默认1) | 否 | mpi集群下的节点数 |
| k8s_trainers | int | >=1(默认1) | 否 | k8s集群下worker的节点数 |
| k8s_cpu_cores | int | >=1(默认1) | 否 | k8s集群下worker的CPU核数 |
| k8s_gpu_card | int | >=1(默认1) | 否 | k8s集群下worker的GPU卡数 |
| k8s_ps_num | int | >=1(默认1) | 否 | k8s集群下server的节点数 |
| k8s_ps_cores | int | >=1(默认1) | 否 | k8s集群下server的CPU核数 |
......@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
workspace: "paddlerec.models.contentunderstanding.classification"
workspace: "models/contentunderstanding/classification"
dataset:
- name: data1
......
......@@ -39,8 +39,11 @@
##使用教程(快速开始)
```
python -m paddlerec.run -m paddlerec.models.contentunderstanding.tagspace
python -m paddlerec.run -m paddlerec.models.contentunderstanding.classification
git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
cd paddle-rec
python -m paddlerec.run -m models/contentunderstanding/tagspace/config.yaml
python -m paddlerec.run -m models/contentunderstanding/classification/config.yaml
```
## 使用教程(复现论文)
......
......@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
workspace: "paddlerec.models.contentunderstanding.tagspace"
workspace: "models/contentunderstanding/tagspace"
dataset:
- name: sample_1
......
......@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
workspace: "paddlerec.models.demo.movie_recommand"
workspace: "models/demo/movie_recommand"
# list of dataset
dataset:
......
......@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
workspace: "paddlerec.models.demo.movie_recommand"
workspace: "models/demo/movie_recommand"
# list of dataset
dataset:
......
......@@ -13,7 +13,7 @@
# limitations under the License.
workspace: "paddlerec.models.match.dssm"
workspace: "models/match/dssm"
dataset:
- name: dataset_train
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyrigh t(c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
workspace: "models/match/match-pyramid"
dataset:
- name: dataset_train
batch_size: 128
type: DataLoader
data_path: "{workspace}/data/train"
data_converter: "{workspace}/train_reader.py"
- name: dataset_infer
batch_size: 1
type: DataLoader
data_path: "{workspace}/data/test"
data_converter: "{workspace}/test_reader.py"
hyper_parameters:
optimizer:
class: adam
learning_rate: 0.001
strategy: async
emb_path: "./data/embedding.npy"
sentence_left_size: 20
sentence_right_size: 500
vocab_size: 193368
emb_size: 50
kernel_num: 8
hidden_size: 20
hidden_act: "relu"
out_size: 1
channels: 1
conv_filter: [2,10]
conv_act: "relu"
pool_size: [6,50]
pool_stride: [6,50]
pool_type: "max"
pool_padding: "VALID"
mode: [train_runner , infer_runner]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: train_runner
class: train
# num of epochs
epochs: 2
# device to run training or infer
device: cpu
save_checkpoint_interval: 1 # save model interval of epochs
save_inference_interval: 1 # save inference
save_checkpoint_path: "inference" # save checkpoint path
save_inference_path: "inference" # save inference path
save_inference_feed_varnames: [] # feed vars of save inference
save_inference_fetch_varnames: [] # fetch vars of save inference
init_model_path: "" # load model path
print_interval: 2
phases: phase_train
- name: infer_runner
class: infer
# device to run training or infer
device: cpu
print_interval: 1
init_model_path: "inference/1" # load model path
phases: phase_infer
# runner will run all the phase in each epoch
phase:
- name: phase_train
model: "{workspace}/model.py" # user-defined model
dataset_name: dataset_train # select dataset by name
thread_num: 1
- name: phase_infer
model: "{workspace}/model.py" # user-defined model
dataset_name: dataset_infer # select dataset by name
thread_num: 1
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy as np
import random
# Read Word Dict and Inverse Word Dict
def read_word_dict(filename):
word_dict = {}
for line in open(filename):
line = line.strip().split()
word_dict[int(line[1])] = line[0]
print('[%s]\n\tWord dict size: %d' % (filename, len(word_dict)))
return word_dict
# Read Embedding File
def read_embedding(filename):
embed = {}
for line in open(filename):
line = line.strip().split()
embed[int(line[0])] = list(map(float, line[1:]))
print('[%s]\n\tEmbedding size: %d' % (filename, len(embed)))
return embed
# Convert Embedding Dict 2 numpy array
def convert_embed_2_numpy(embed_dict, embed=None):
for k in embed_dict:
embed[k] = np.array(embed_dict[k])
print('Generate numpy embed:', embed.shape)
return embed
# Read Data
def read_data(filename):
data = {}
for line in open(filename):
line = line.strip().split()
data[line[0]] = list(map(int, line[2:]))
print('[%s]\n\tData size: %s' % (filename, len(data)))
return data
# Read Relation Data
def read_relation(filename):
data = []
for line in open(filename):
line = line.strip().split()
data.append((int(line[0]), line[1], line[2]))
print('[%s]\n\tInstance size: %s' % (filename, len(data)))
return data
Letor07Path = "./data"
word_dict = read_word_dict(filename=os.path.join(Letor07Path, 'word_dict.txt'))
query_data = read_data(filename=os.path.join(Letor07Path, 'qid_query.txt'))
doc_data = read_data(filename=os.path.join(Letor07Path, 'docid_doc.txt'))
embed_dict = read_embedding(filename=os.path.join(Letor07Path,
'embed_wiki-pdc_d50_norm'))
_PAD_ = len(word_dict) #193367
embed_dict[_PAD_] = np.zeros((50, ), dtype=np.float32)
word_dict[_PAD_] = '[PAD]'
W_init_embed = np.float32(np.random.uniform(-0.02, 0.02, [len(word_dict), 50]))
embedding = convert_embed_2_numpy(embed_dict, embed=W_init_embed)
np.save("embedding.npy", embedding)
batch_size = 64
data1_maxlen = 20
data2_maxlen = 500
embed_size = 50
train_iters = 2500
def make_train():
rel_set = {}
pair_list = []
rel = read_relation(filename=os.path.join(Letor07Path,
'relation.train.fold1.txt'))
for label, d1, d2 in rel:
if d1 not in rel_set:
rel_set[d1] = {}
if label not in rel_set[d1]:
rel_set[d1][label] = []
rel_set[d1][label].append(d2)
for d1 in rel_set:
label_list = sorted(rel_set[d1].keys(), reverse=True)
for hidx, high_label in enumerate(label_list[:-1]):
for low_label in label_list[hidx + 1:]:
for high_d2 in rel_set[d1][high_label]:
for low_d2 in rel_set[d1][low_label]:
pair_list.append((d1, high_d2, low_d2))
print('Pair Instance Count:', len(pair_list))
f = open("./data/train/train.txt", "w")
for batch in range(800):
X1 = np.zeros((batch_size * 2, data1_maxlen), dtype=np.int32)
X2 = np.zeros((batch_size * 2, data2_maxlen), dtype=np.int32)
X1[:] = _PAD_
X2[:] = _PAD_
for i in range(batch_size):
d1, d2p, d2n = random.choice(pair_list)
d1_len = min(data1_maxlen, len(query_data[d1]))
d2p_len = min(data2_maxlen, len(doc_data[d2p]))
d2n_len = min(data2_maxlen, len(doc_data[d2n]))
X1[i, :d1_len] = query_data[d1][:d1_len]
X2[i, :d2p_len] = doc_data[d2p][:d2p_len]
X1[i + batch_size, :d1_len] = query_data[d1][:d1_len]
X2[i + batch_size, :d2n_len] = doc_data[d2n][:d2n_len]
for i in range(batch_size * 2):
q = [str(x) for x in list(X1[i])]
d = [str(x) for x in list(X2[i])]
f.write(",".join(q) + "\t" + ",".join(d) + "\n")
f.close()
def make_test():
rel = read_relation(filename=os.path.join(Letor07Path,
'relation.test.fold1.txt'))
f = open("./data/test/test.txt", "w")
for label, d1, d2 in rel:
X1 = np.zeros(data1_maxlen, dtype=np.int32)
X2 = np.zeros(data2_maxlen, dtype=np.int32)
X1[:] = _PAD_
X2[:] = _PAD_
d1_len = min(data1_maxlen, len(query_data[d1]))
d2_len = min(data2_maxlen, len(doc_data[d2]))
X1[:d1_len] = query_data[d1][:d1_len]
X2[:d2_len] = doc_data[d2][:d2_len]
q = [str(x) for x in list(X1)]
d = [str(x) for x in list(X2)]
f.write(",".join(q) + "\t" + ",".join(d) + "\t" + str(label) + "\t" +
d1 + "\n")
f.close()
make_train()
make_test()
2 9639 GX099-60-3149248
1 9639 GX028-47-6554966
1 9639 GX031-84-2802741
1 9639 GX031-86-1702683
1 9639 GX031-89-11392170
1 9639 GX035-46-10142187
1 9639 GX039-07-1333080
1 9639 GX040-05-15096071
1 9639 GX045-35-10693225
1 9639 GX045-74-6226888
1 9639 GX046-31-8871083
1 9639 GX046-56-6274894
1 9639 GX050-09-14629105
1 9639 GX097-05-12714275
1 9639 GX101-06-7768196
1 9639 GX124-50-4934142
1 9639 GX259-01-13320140
1 9639 GX259-50-8109630
1 9639 GX259-72-16176934
1 9639 GX259-98-7821925
1 9639 GX260-27-13260880
1 9639 GX260-54-6363694
1 9639 GX260-78-6999656
1 9639 GX261-04-0843988
1 9639 GX261-23-4964814
0 9639 GX021-75-7026755
0 9639 GX021-80-16449591
0 9639 GX025-40-7135810
0 9639 GX031-89-9020252
0 9639 GX037-45-0533209
0 9639 GX038-17-11223353
0 9639 GX057-07-13335832
0 9639 GX081-50-12756687
0 9639 GX124-43-2364716
0 9639 GX129-60-0000000
0 9639 GX219-07-7475581
0 9639 GX233-90-7976935
0 9639 GX267-49-2983064
0 9639 GX267-74-2413254
0 9639 GX270-05-13614294
1 9329 GX234-05-0812081
0 9329 GX000-00-0000000
0 9329 GX008-50-3899336
0 9329 GX011-75-8470249
0 9329 GX020-42-13388867
0 9329 GX024-91-8520306
0 9329 GX026-88-6087429
0 9329 GX027-22-1703847
0 9329 GX034-11-2617393
0 9329 GX036-02-7994497
0 9329 GX046-08-13858054
0 9329 GX059-85-11403109
0 9329 GX099-37-0232298
0 9329 GX099-46-11473306
0 9329 GX108-04-9589788
0 9329 GX110-50-11723940
0 9329 GX124-11-4119164
0 9329 GX149-82-15204191
0 9329 GX165-95-6198495
0 9329 GX225-56-4184936
0 9329 GX229-57-4487470
0 9329 GX230-37-4125963
0 9329 GX231-40-14574318
0 9329 GX238-44-10302536
0 9329 GX239-85-8572461
0 9329 GX244-17-10154048
0 9329 GX245-16-4169590
0 9329 GX245-46-6341859
0 9329 GX246-91-8487173
0 9329 GX262-88-13259441
0 9329 GX263-41-4135561
0 9329 GX264-07-6385713
0 9329 GX264-38-12253757
0 9329 GX264-90-15990025
0 9329 GX265-89-6212449
0 9329 GX268-41-12034794
0 9329 GX268-83-5140660
0 9329 GX270-46-0293828
0 9329 GX270-64-11852140
0 9329 GX271-10-12458597
2 9326 GX272-03-6610348
1 9326 GX011-12-0595978
0 9326 GX000-00-0000000
0 9326 GX000-38-9492606
0 9326 GX000-84-4587136
0 9326 GX002-41-5566464
0 9326 GX002-51-2615036
0 9326 GX004-56-12238694
0 9326 GX004-72-2476906
0 9326 GX008-13-1835206
0 9326 GX008-64-7705528
0 9326 GX009-87-0976731
0 9326 GX012-24-7688369
0 9326 GX012-96-8727608
0 9326 GX023-87-16736657
0 9326 GX025-21-11820239
0 9326 GX025-22-15113698
0 9326 GX025-51-13959128
0 9326 GX025-57-11414648
0 9326 GX025-64-7587631
0 9326 GX027-62-4542881
0 9326 GX031-25-4759403
0 9326 GX036-10-7902858
0 9326 GX047-04-9457544
0 9326 GX047-06-4014803
0 9326 GX048-00-15113058
0 9326 GX048-02-12975919
0 9326 GX048-78-3273874
0 9326 GX235-35-0963257
0 9326 GX235-98-3789570
0 9326 GX236-51-15473637
0 9326 GX237-96-0892713
0 9326 GX239-35-7413891
0 9326 GX239-95-0176537
0 9326 GX251-34-10377030
0 9326 GX254-19-11374782
0 9326 GX260-63-10533444
0 9326 GX265-94-14886230
0 9326 GX269-78-1500497
0 9326 GX270-59-10270517
2 8946 GX046-79-6984659
2 8946 GX148-33-1869479
2 8946 GX252-36-12638222
1 8946 GX017-47-13290921
1 8946 GX030-69-3218092
1 8946 GX034-82-4550348
1 8946 GX044-01-9283107
1 8946 GX047-98-6660623
1 8946 GX057-96-12580825
1 8946 GX059-94-12068143
1 8946 GX060-13-13600036
1 8946 GX060-74-6594973
1 8946 GX093-08-1158999
0 8946 GX000-00-0000000
0 8946 GX000-42-15811803
0 8946 GX000-81-16418910
0 8946 GX008-38-10557859
0 8946 GX011-01-10891808
0 8946 GX013-71-5708874
0 8946 GX015-72-4458924
0 8946 GX023-91-9869060
0 8946 GX027-56-6376748
0 8946 GX037-11-10829529
0 8946 GX038-55-0681330
0 8946 GX043-86-4200105
0 8946 GX047-52-3712485
0 8946 GX053-77-4836617
0 8946 GX070-62-1070063
0 8946 GX105-53-13372327
0 8946 GX218-61-6263172
0 8946 GX223-72-13625320
0 8946 GX230-68-14727182
0 8946 GX235-34-7733230
0 8946 GX251-73-0159347
0 8946 GX254-47-1098586
0 8946 GX263-76-6934681
0 8946 GX263-84-8668756
0 8946 GX264-70-14223639
0 8946 GX269-12-5910753
0 8946 GX271-93-9895614
1 9747 GX006-77-1973537
1 9747 GX244-83-8716953
1 9747 GX269-92-7189826
0 9747 GX000-00-0000000
0 9747 GX001-51-8693413
0 9747 GX003-10-2820641
0 9747 GX003-74-0557776
0 9747 GX003-79-13695689
0 9747 GX009-57-0938999
0 9747 GX009-59-8595527
0 9747 GX009-80-10629348
0 9747 GX010-37-0206372
0 9747 GX013-46-2187318
0 9747 GX014-58-4004859
0 9747 GX015-79-5393654
0 9747 GX032-50-7316370
0 9747 GX049-33-2206612
0 9747 GX050-34-0439256
0 9747 GX062-76-0914936
0 9747 GX065-73-7392661
0 9747 GX148-27-15770966
0 9747 GX155-71-0504939
0 9747 GX229-75-14750078
0 9747 GX231-01-0640962
0 9747 GX236-45-15598812
0 9747 GX247-19-9516715
0 9747 GX247-34-4277646
0 9747 GX247-63-10766287
0 9747 GX248-23-15998266
0 9747 GX249-85-9742193
0 9747 GX250-31-7671617
0 9747 GX252-56-2141580
0 9747 GX253-15-3406713
0 9747 GX264-07-15838087
0 9747 GX264-43-6543997
0 9747 GX266-18-14688076
0 9747 GX267-50-2036010
0 9747 GX268-28-0548507
0 9747 GX269-49-14171555
0 9747 GX269-63-15607386
2 9740 GX005-94-14208849
2 9740 GX008-51-5639660
2 9740 GX012-37-2342061
2 9740 GX019-75-13916532
2 9740 GX074-76-16261807
2 9740 GX077-07-2951943
2 9740 GX229-28-11068981
2 9740 GX237-80-7497206
2 9740 GX257-53-10589749
2 9740 GX258-06-0611419
2 9740 GX268-55-9791226
1 9740 GX007-62-1126118
1 9740 GX015-78-0216468
1 9740 GX038-65-1678199
1 9740 GX041-25-14803324
1 9740 GX063-71-0401425
1 9740 GX077-08-15801730
1 9740 GX098-07-2885671
1 9740 GX135-28-6485892
1 9740 GX228-85-10518518
1 9740 GX231-93-11279468
1 9740 GX234-70-15061254
1 9740 GX236-31-11149347
1 9740 GX240-68-1184464
1 9740 GX248-03-7275316
1 9740 GX253-11-9846012
1 9740 GX255-05-10638500
1 9740 GX267-73-4450097
1 9740 GX269-19-0642640
0 9740 GX001-74-5132048
0 9740 GX001-88-2603815
0 9740 GX004-83-7935833
0 9740 GX007-01-16750210
0 9740 GX040-11-5249209
0 9740 GX042-38-2886005
0 9740 GX052-20-4359789
0 9740 GX067-74-3718011
0 9740 GX077-01-13481396
0 9740 GX242-92-8868913
0 9740 GX262-74-4596688
2 8835 GX010-99-5715419
2 8835 GX049-99-2518724
0 8835 GX000-00-0000000
0 8835 GX007-91-6779497
0 8835 GX008-14-0788708
0 8835 GX008-15-13942125
0 8835 GX011-58-14336551
0 8835 GX012-79-10684001
0 8835 GX013-00-10822427
0 8835 GX013-03-5962783
0 8835 GX015-54-0251701
0 8835 GX017-36-5859317
0 8835 GX017-60-0601078
0 8835 GX027-24-16202205
0 8835 GX030-11-15814183
0 8835 GX030-76-11969233
此差异已折叠。
此差异已折叠。
#!/bin/bash
echo "...........load data................."
wget --no-check-certificate 'https://paddlerec.bj.bcebos.com/match_pyramid/match_pyramid_data.tar.gz'
mv ./match_pyramid_data.tar.gz ./data
rm -rf ./data/relation.test.fold1.txt ./data/realtion.train.fold1.txt
tar -xvf ./data/match_pyramid_data.tar.gz
echo "...........data process..............."
python ./data/process.py
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import numpy as np
def eval_MAP(pred, gt):
map_value = 0.0
r = 0.0
c = list(zip(pred, gt))
random.shuffle(c)
c = sorted(c, key=lambda x: x[0], reverse=True)
for j, (p, g) in enumerate(c):
if g != 0:
r += 1
map_value += r / (j + 1.0)
if r == 0:
return 0.0
else:
return map_value / r
filename = './data/relation.test.fold1.txt'
gt = []
qid = []
f = open(filename, "r")
f.readline()
num = 0
for line in f.readlines():
num = num + 1
line = line.strip().split()
gt.append(int(line[0]))
qid.append(line[1])
f.close()
print(num)
filename = './result.txt'
pred = []
for line in open(filename):
line = line.strip().split(",")
line[1] = line[1].split(":")
line = line[1][1].strip(" ")
line = line.strip("[")
line = line.strip("]")
pred.append(float(line))
result_dict = {}
for i in range(len(qid)):
if qid[i] not in result_dict:
result_dict[qid[i]] = []
result_dict[qid[i]].append([gt[i], pred[i]])
print(len(result_dict))
map = 0
for qid in result_dict:
gt = np.array(result_dict[qid])[:, 0]
pred = np.array(result_dict[qid])[:, 1]
map += eval_MAP(pred, gt)
map = map / len(result_dict)
print("map=", map)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import random
import numpy as np
import paddle
import paddle.fluid as fluid
from paddlerec.core.utils import envs
from paddlerec.core.model import ModelBase
class Model(ModelBase):
def __init__(self, config):
ModelBase.__init__(self, config)
def _init_hyper_parameters(self):
self.emb_path = envs.get_global_env("hyper_parameters.emb_path")
self.sentence_left_size = envs.get_global_env(
"hyper_parameters.sentence_left_size")
self.sentence_right_size = envs.get_global_env(
"hyper_parameters.sentence_right_size")
self.vocab_size = envs.get_global_env("hyper_parameters.vocab_size")
self.emb_size = envs.get_global_env("hyper_parameters.emb_size")
self.kernel_num = envs.get_global_env("hyper_parameters.kernel_num")
self.hidden_size = envs.get_global_env("hyper_parameters.hidden_size")
self.hidden_act = envs.get_global_env("hyper_parameters.hidden_act")
self.out_size = envs.get_global_env("hyper_parameters.out_size")
self.channels = envs.get_global_env("hyper_parameters.channels")
self.conv_filter = envs.get_global_env("hyper_parameters.conv_filter")
self.conv_act = envs.get_global_env("hyper_parameters.conv_act")
self.pool_size = envs.get_global_env("hyper_parameters.pool_size")
self.pool_stride = envs.get_global_env("hyper_parameters.pool_stride")
self.pool_type = envs.get_global_env("hyper_parameters.pool_type")
self.pool_padding = envs.get_global_env(
"hyper_parameters.pool_padding")
def input_data(self, is_infer=False, **kwargs):
sentence_left = fluid.data(
name="sentence_left",
shape=[-1, self.sentence_left_size, 1],
dtype='int64',
lod_level=0)
sentence_right = fluid.data(
name="sentence_right",
shape=[-1, self.sentence_right_size, 1],
dtype='int64',
lod_level=0)
return [sentence_left, sentence_right]
def embedding_layer(self, input):
"""
embedding layer
"""
if os.path.isfile(self.emb_path):
embedding_array = np.load(self.emb_path)
emb = fluid.layers.embedding(
input=input,
size=[self.vocab_size, self.emb_size],
padding_idx=0,
param_attr=fluid.ParamAttr(
name="word_embedding",
initializer=fluid.initializer.NumpyArrayInitializer(
embedding_array)))
else:
emb = fluid.layers.embedding(
input=input,
size=[self.vocab_size, self.emb_size],
padding_idx=0,
param_attr=fluid.ParamAttr(
name="word_embedding",
initializer=fluid.initializer.Xavier()))
return emb
def conv_pool_layer(self, input):
"""
convolution and pool layer
"""
# data format NCHW
# same padding
conv = fluid.layers.conv2d(
input=input,
num_filters=self.kernel_num,
stride=1,
padding="SAME",
filter_size=self.conv_filter,
act=self.conv_act)
pool = fluid.layers.pool2d(
input=conv,
pool_size=self.pool_size,
pool_stride=self.pool_stride,
pool_type=self.pool_type,
pool_padding=self.pool_padding)
return pool
def net(self, inputs, is_infer=False):
left_emb = self.embedding_layer(inputs[0])
right_emb = self.embedding_layer(inputs[1])
cross = fluid.layers.matmul(left_emb, right_emb, transpose_y=True)
cross = fluid.layers.reshape(cross,
[-1, 1, cross.shape[1], cross.shape[2]])
conv_pool = self.conv_pool_layer(input=cross)
relu_hid = fluid.layers.fc(input=conv_pool,
size=self.hidden_size,
act=self.hidden_act)
prediction = fluid.layers.fc(
input=relu_hid,
size=self.out_size, )
if is_infer:
self._infer_results["prediction"] = prediction
return
pos = fluid.layers.slice(
prediction, axes=[0, 1], starts=[0, 0], ends=[64, 1])
neg = fluid.layers.slice(
prediction, axes=[0, 1], starts=[64, 0], ends=[128, 1])
loss_part1 = fluid.layers.elementwise_sub(
fluid.layers.fill_constant(
shape=[64, 1], value=1.0, dtype='float32'),
pos)
loss_part2 = fluid.layers.elementwise_add(loss_part1, neg)
loss_part3 = fluid.layers.elementwise_max(
fluid.layers.fill_constant(
shape=[64, 1], value=0.0, dtype='float32'),
loss_part2)
avg_cost = fluid.layers.mean(loss_part3)
self._cost = avg_cost
# match-pyramid文本匹配模型
## 介绍
在许多自然语言处理任务中,匹配两个文本是一个基本问题。一种有效的方法是从单词,短语和句子中提取有意义的匹配模式以产生匹配分数。受卷积神经网络在图像识别中的成功启发,神经元可以根据提取的基本视觉模式(例如定向的边角和边角)捕获许多复杂的模式,所以我们尝试将文本匹配建模为图像识别问题。本模型对齐原作者庞亮开源的tensorflow代码:https://github.com/pl8787/MatchPyramid-TensorFlow/blob/master/model/model_mp.py, 实现了下述论文中提出的Match-Pyramid模型:
```text
@inproceedings{Pang L , Lan Y , Guo J , et al. Text Matching as Image Recognition[J]. 2016.,
title={Text Matching as Image Recognition},
author={Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, Xueqi Cheng},
year={2016}
}
```
## 数据准备
训练及测试数据集选用Letor07数据集和 embed_wiki-pdc_d50_norm 词向量初始化embedding层。
该数据集包括:
1.词典文件:我们将每个单词映射得到一个唯一的编号wid,并将此映射保存在单词词典文件中。例如:word_dict.txt
2.语料库文件:我们使用字符串标识符的值表示一个句子的编号。第二个数字表示句子的长度。例如:qid_query.txt和docid_doc.txt
3.关系文件:关系文件被用来存储两个句子之间的关系,如query 和document之间的关系。例如:relation.train.fold1.txt, relation.test.fold1.txt
4.嵌入层文件:我们将预训练的词向量存储在嵌入文件中。例如:embed_wiki-pdc_d50_norm
## 数据下载和预处理
本文提供了数据集的下载以及一键生成训练和测试数据的预处理脚本,您可以直接一键运行:bash data_process.sh
执行该脚本,会从国内源的服务器上下载Letor07数据集,删除掉data文件夹中原有的relation.test.fold1.txt和relation.train.fold1.txt,并将完整的数据集解压到data文件夹。随后运行 process.py 将全量训练数据放置于`./data/train`,全量测试数据放置于`./data/test`。并生成用于初始化embedding层的embedding.npy文件
执行该脚本的理想输出为:
```
bash data_process.sh
...........load data...............
--2020-07-13 13:24:50-- https://paddlerec.bj.bcebos.com/match_pyramid/match_pyramid_data.tar.gz
Resolving paddlerec.bj.bcebos.com... 10.70.0.165
Connecting to paddlerec.bj.bcebos.com|10.70.0.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 214449643 (205M) [application/x-gzip]
Saving to: “match_pyramid_data.tar.gz”
100%[==========================================================================================================>] 214,449,643 114M/s in 1.8s
2020-07-13 13:24:52 (114 MB/s) - “match_pyramid_data.tar.gz” saved [214449643/214449643]
data/
data/relation.test.fold1.txt
data/relation.test.fold2.txt
data/relation.test.fold3.txt
data/relation.test.fold4.txt
data/relation.test.fold5.txt
data/relation.train.fold1.txt
data/relation.train.fold2.txt
data/relation.train.fold3.txt
data/relation.train.fold4.txt
data/relation.train.fold5.txt
data/relation.txt
data/docid_doc.txt
data/qid_query.txt
data/word_dict.txt
data/embed_wiki-pdc_d50_norm
...........data process...............
[./data/word_dict.txt]
Word dict size: 193367
[./data/qid_query.txt]
Data size: 1692
[./data/docid_doc.txt]
Data size: 65323
[./data/embed_wiki-pdc_d50_norm]
Embedding size: 109282
('Generate numpy embed:', (193368, 50))
[./data/relation.train.fold1.txt]
Instance size: 47828
('Pair Instance Count:', 325439)
[./data/relation.test.fold1.txt]
Instance size: 13652
```
## 一键训练并测试评估
本文提供了一键执行训练,测试和评估的脚本,您可以直接一键运行:bash run.sh
执行该脚本后,会执行python -m paddlerec.run -m ./config.yaml 命令开始训练并测试模型,将测试的结果保存到result.txt文件,最后通过执行eval.py进行评估得到数据的map指标
执行该脚本的理想输出为:
```
..............test.................
13651
336
('map=', 0.420878322843591)
```
## 每个文件的作用
paddlerec可以:
通过config.yaml规定模型的参数
通过model.py规定模型的组网
使用train_reader.py读取训练集中的数据
使用test_reader.py读取测试集中的数据。
本文额外提供:
data_process.sh用来一键处理数据
run.sh用来一键启动训练,直接得出测试结果
eval.py通过保存的测试结果,计算map指标
如需详细了解paddlerec的使用方法请参考https://github.com/PaddlePaddle/PaddleRec/blob/master/README_CN.md 页面下方的教程。
#!/bin/bash
echo "................run................."
python -m paddlerec.run -m ./config.yaml >result1.txt
grep -A1 "prediction" ./result1.txt >./result.txt
rm -f result1.txt
python eval.py
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
from paddlerec.core.reader import ReaderBase
class Reader(ReaderBase):
def init(self):
pass
def generate_sample(self, line):
"""
Read the data line by line and process it as a dictionary
"""
def reader():
"""
This function needs to be implemented by the user, based on data format
"""
features = line.strip('\n').split('\t')
doc1 = [int(word_id) for word_id in features[0].split(",")]
doc2 = [int(word_id) for word_id in features[1].split(",")]
features_name = ["doc1", "doc2"]
yield zip(features_name, [doc1] + [doc2])
return reader
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
from paddlerec.core.reader import ReaderBase
class Reader(ReaderBase):
def init(self):
pass
def generate_sample(self, line):
"""
Read the data line by line and process it as a dictionary
"""
def reader():
"""
This function needs to be implemented by the user, based on data format
"""
features = line.strip('\n').split('\t')
doc1 = [int(word_id) for word_id in features[0].split(",")]
doc2 = [int(word_id) for word_id in features[1].split(",")]
features_name = ["doc1", "doc2"]
yield zip(features_name, [doc1] + [doc2])
return reader
......@@ -13,7 +13,7 @@
# limitations under the License.
# workspace
workspace: "paddlerec.models.match.multiview-simnet"
workspace: "models/match/multiview-simnet"
# list of dataset
dataset:
......
......@@ -34,8 +34,11 @@
## 使用教程(快速开始)
### 训练
```shell
python -m paddlerec.run -m paddlerec.models.match.dssm # dssm
python -m paddlerec.run -m paddlerec.models.match.multiview-simnet # multiview-simnet
git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
cd paddle-rec
python -m paddlerec.run -m models/match/dssm/config.yaml # dssm
python -m paddlerec.run -m models/match/multiview-simnet/config.yaml # multiview-simnet
```
### 预测
......
# ESMM
以下是本例的简要目录结构及说明:
```
├── data # 文档
├── train #训练数据
├──small.txt
├── test #测试数据
├── small.txt
├── run.sh
├── __init__.py
├── config.yaml #配置文件
├── esmm_reader.py #数据读取文件
├── model.py #模型文件
```
注:在阅读该示例前,建议您先了解以下内容:
[paddlerec入门教程](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## 内容
- [模型简介](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#模型简介)
- [数据准备](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#数据准备)
- [运行环境](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#运行环境)
- [快速开始](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#快速开始)
- [论文复现](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#论文复现)
- [进阶使用](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#进阶使用)
- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#FAQ)
## 模型简介
不同于CTR预估问题,CVR预估面临两个关键问题:
1. **Sample Selection Bias (SSB)** 转化是在点击之后才“有可能”发生的动作,传统CVR模型通常以点击数据为训练集,其中点击未转化为负例,点击并转化为正例。但是训练好的模型实际使用时,则是对整个空间的样本进行预估,而非只对点击样本进行预估。即是说,训练数据与实际要预测的数据来自不同分布,这个偏差对模型的泛化能力构成了很大挑战。
2. **Data Sparsity (DS)** 作为CVR训练数据的点击样本远小于CTR预估训练使用的曝光样本。
ESMM是发表在 SIGIR’2018 的论文[《Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate》]( https://arxiv.org/abs/1804.07931 )文章基于 Multi-Task Learning 的思路,提出一种新的CVR预估模型——ESMM,有效解决了真实场景中CVR预估面临的数据稀疏以及样本选择偏差这两个关键问题
本项目在paddlepaddle上实现ESMM的网络结构,并在开源数据集[Ali-CCP:Alibaba Click and Conversion Prediction]( https://tianchi.aliyun.com/datalab/dataSet.html?dataId=408 )上验证模型效果, 本模型配置默认使用demo数据集,若进行精度验证,请参考[论文复现](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#论文复现)部分。
本项目支持功能
训练:单机CPU、单机单卡GPU、单机多卡GPU、本地模拟参数服务器训练、增量训练,配置请参考 [启动训练](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
预测:单机CPU、单机单卡GPU ;配置请参考[PaddleRec 离线预测](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
## 数据准备
数据地址:[Ali-CCP:Alibaba Click and Conversion Prediction]( https://tianchi.aliyun.com/datalab/dataSet.html?dataId=408 )
```
cd data
sh run.sh
```
数据格式参见demo数据:data/train
## 运行环境
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os : windows/linux/macos
## 快速开始
### 单机训练
CPU环境
在config.yaml文件中设置好设备,epochs等。
```
dataset:
- name: dataset_train
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/train"
data_converter: "{workspace}/esmm_reader.py"
- name: dataset_infer
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/test"
data_converter: "{workspace}/esmm_reader.py"
```
### 单机预测
CPU环境
在config.yaml文件中设置好epochs、device等参数。
```
- name: infer_runner
class: infer
init_model_path: "increment/1"
device: cpu
print_interval: 1
phases: [infer]
```
## 论文复现
用原论文的完整数据复现论文效果需要在config.yaml中修改batch_size=1000, thread_num=8, epoch_num=4
修改后运行方案:修改config.yaml中的'workspace'为config.yaml的目录位置,执行
```
python -m paddlerec.run -m /home/your/dir/config.yaml #调试模式 直接指定本地config的绝对路径
```
## 进阶使用
## FAQ
......@@ -13,7 +13,7 @@
# limitations under the License.
workspace: "paddlerec.models.multitask.esmm"
workspace: "models/multitask/esmm"
dataset:
- name: dataset_train
......
mkdir train_data
mkdir test_data
mkdir vocab
mkdir data
train_source_path="./data/sample_train.tar.gz"
train_target_path="train_data"
test_source_path="./data/sample_test.tar.gz"
test_target_path="test_data"
cd data
echo "downloading sample_train.tar.gz......"
curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_train.tar.gz?Expires=1586435769&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=ahUDqhvKT1cGjC4%2FIER2EWtq7o4%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_train.tar.gz
cd ..
echo "unzipping sample_train.tar.gz......"
tar -xzvf ${train_source_path} -C ${train_target_path} && rm -rf ${train_source_path}
cd data
echo "downloading sample_test.tar.gz......"
curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_test.tar.gz?Expires=1586435821&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=OwLMPjt1agByQtRVi8pazsAliNk%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_test.tar.gz
cd ..
echo "unzipping sample_test.tar.gz......"
tar -xzvf ${test_source_path} -C ${test_target_path} && rm -rf ${test_source_path}
echo "preprocessing data......"
python reader.py --train_data_path ${train_target_path} \
--test_data_path ${test_target_path} \
--vocab_path vocab/vocab_size.txt \
--train_sample_size 6400 \
--test_sample_size 6400 \
# MMOE
以下是本例的简要目录结构及说明:
```
├── data # 文档
├── train #训练数据
├── train_data.txt
├── test #测试数据
├── test_data.txt
├── run.sh
├── data_preparation.py
├── __init__.py
├── config.yaml #配置文件
├── census_reader.py #数据读取文件
├── model.py #模型文件
```
注:在阅读该示例前,建议您先了解以下内容:
[paddlerec入门教程](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## 内容
- [模型简介](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#模型简介)
- [数据准备](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#数据准备)
- [运行环境](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#运行环境)
- [快速开始](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#快速开始)
- [论文复现](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现)
- [进阶使用](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#进阶使用)
- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#FAQ)
## 模型简介
多任务模型通过学习不同任务的联系和差异,可提高每个任务的学习效率和质量。多任务学习的的框架广泛采用shared-bottom的结构,不同任务间共用底部的隐层。这种结构本质上可以减少过拟合的风险,但是效果上可能受到任务差异和数据分布带来的影响。 论文[《Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts》]( https://www.kdd.org/kdd2018/accepted-papers/view/modeling-task-relationships-in-multi-task-learning-with-multi-gate-mixture- )中提出了一个Multi-gate Mixture-of-Experts(MMOE)的多任务学习结构。MMOE模型刻画了任务相关性,基于共享表示来学习特定任务的函数,避免了明显增加参数的缺点。
我们在Paddlepaddle定义MMOE的网络结构,在开源数据集Census-income Data上验证模型效果,两个任务的auc分别为:
1.income
> max_mmoe_test_auc_income:0.94937
>
> mean_mmoe_test_auc_income:0.94465
2.marital
> max_mmoe_test_auc_marital:0.99419
>
> mean_mmoe_test_auc_marital:0.99324
若进行精度验证,请参考[论文复现](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现)部分。
本项目支持功能
训练:单机CPU、单机单卡GPU、单机多卡GPU、本地模拟参数服务器训练、增量训练,配置请参考 [启动训练](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
预测:单机CPU、单机单卡GPU ;配置请参考[PaddleRec 离线预测](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
## 数据准备
数据地址: [Census-income Data](https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz )
数据解压后, 在run.sh脚本文件中添加文件的路径,并运行脚本。
```sh
mkdir train_data
mkdir test_data
mkdir data
train_path="data/census-income.data"
test_path="data/census-income.test"
train_data_path="train_data/"
test_data_path="test_data/"
pip install -r requirements.txt
wget -P data/ https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz
tar -zxvf data/census.tar.gz -C data/
python data_preparation.py --train_path ${train_path} \
--test_path ${test_path} \
--train_data_path ${train_data_path}\
--test_data_path ${test_data_path}
```
生成的格式以逗号为分割点
```
0,0,73,0,0,0,0,1700.09,0,0
```
## 运行环境
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os : windows/linux/macos
## 快速开始
### 单机训练
CPU环境
在config.yaml文件中设置好设备,epochs等。
```
dataset:
- name: dataset_train
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/train"
data_converter: "{workspace}/census_reader.py"
- name: dataset_infer
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/train"
data_converter: "{workspace}/census_reader.py"
```
### 单机预测
CPU环境
在config.yaml文件中设置好epochs、device等参数。
```
- name: infer_runner
class: infer
init_model_path: "increment/0"
device: cpu
```
## 论文复现
用原论文的完整数据复现论文效果需要在config.yaml中修改batch_size=1000, thread_num=8, epoch_num=4
使用gpu p100 单卡训练 6.5h 测试auc: best:0.9940, mean:0.9932
修改后运行方案:修改config.yaml中的'workspace'为config.yaml的目录位置,执行
```
python -m paddlerec.run -m /home/your/dir/config.yaml #调试模式 直接指定本地config的绝对路径
```
## 进阶使用
## FAQ
......@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
workspace: "paddlerec.models.multitask.mmoe"
workspace: "models/multitask/mmoe"
dataset:
- name: dataset_train
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pandas as pd
import numpy as np
import paddle.fluid as fluid
from args import *
def fun1(x):
if x == ' 50000+.':
return 1
else:
return 0
def fun2(x):
if x == ' Never married':
return 1
else:
return 0
def data_preparation(train_path, test_path, train_data_path, test_data_path):
# The column names are from
# https://www2.1010data.com/documentationcenter/prod/Tutorials/MachineLearningExamples/CensusIncomeDataSet.html
column_names = [
'age', 'class_worker', 'det_ind_code', 'det_occ_code', 'education',
'wage_per_hour', 'hs_college', 'marital_stat', 'major_ind_code',
'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member',
'unemp_reason', 'full_or_part_emp', 'capital_gains', 'capital_losses',
'stock_dividends', 'tax_filer_stat', 'region_prev_res',
'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'instance_weight',
'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same',
'mig_prev_sunbelt', 'num_emp', 'fam_under_18', 'country_father',
'country_mother', 'country_self', 'citizenship', 'own_or_self',
'vet_question', 'vet_benefits', 'weeks_worked', 'year', 'income_50k'
]
# Load the dataset in Pandas
train_df = pd.read_csv(
train_path,
delimiter=',',
header=None,
index_col=None,
names=column_names)
other_df = pd.read_csv(
test_path,
delimiter=',',
header=None,
index_col=None,
names=column_names)
# First group of tasks according to the paper
label_columns = ['income_50k', 'marital_stat']
# One-hot encoding categorical columns
categorical_columns = [
'class_worker', 'det_ind_code', 'det_occ_code', 'education',
'hs_college', 'major_ind_code', 'major_occ_code', 'race',
'hisp_origin', 'sex', 'union_member', 'unemp_reason',
'full_or_part_emp', 'tax_filer_stat', 'region_prev_res',
'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'mig_chg_msa',
'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt',
'fam_under_18', 'country_father', 'country_mother', 'country_self',
'citizenship', 'vet_question'
]
train_raw_labels = train_df[label_columns]
other_raw_labels = other_df[label_columns]
transformed_train = pd.get_dummies(train_df, columns=categorical_columns)
transformed_other = pd.get_dummies(other_df, columns=categorical_columns)
# Filling the missing column in the other set
transformed_other[
'det_hh_fam_stat_ Grandchild <18 ever marr not in subfamily'] = 0
# get label
transformed_train['income_50k'] = transformed_train['income_50k'].apply(
lambda x: fun1(x))
transformed_train['marital_stat'] = transformed_train[
'marital_stat'].apply(lambda x: fun2(x))
transformed_other['income_50k'] = transformed_other['income_50k'].apply(
lambda x: fun1(x))
transformed_other['marital_stat'] = transformed_other[
'marital_stat'].apply(lambda x: fun2(x))
# Split the other dataset into 1:1 validation to test according to the paper
validation_indices = transformed_other.sample(
frac=0.5, replace=False, random_state=1).index
test_indices = list(set(transformed_other.index) - set(validation_indices))
validation_data = transformed_other.iloc[validation_indices]
test_data = transformed_other.iloc[test_indices]
cols = transformed_train.columns.tolist()
cols.insert(0, cols.pop(cols.index('income_50k')))
cols.insert(0, cols.pop(cols.index('marital_stat')))
transformed_train = transformed_train[cols]
test_data = test_data[cols]
validation_data = validation_data[cols]
print(transformed_train.shape, transformed_other.shape,
validation_data.shape, test_data.shape)
transformed_train.to_csv(train_data_path + 'train_data.csv', index=False)
test_data.to_csv(test_data_path + 'test_data.csv', index=False)
args = data_preparation_args()
data_preparation(args.train_path, args.test_path, args.train_data_path,
args.test_data_path)
......@@ -44,9 +44,12 @@
## 使用教程(快速开始)
```shell
python -m paddlerec.run -m paddlerec.models.multitask.mmoe # mmoe
python -m paddlerec.run -m paddlerec.models.multitask.share-bottom # share-bottom
python -m paddlerec.run -m paddlerec.models.multitask.esmm # esmm
git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
cd paddle-rec
python -m paddlerec.run -m models/multitask/mmoe/config.yaml # mmoe
python -m paddlerec.run -m models/multitask/share-bottom/config.yaml # share-bottom
python -m paddlerec.run -m models/multitask/esmm/config.yaml # esmm
```
## 使用教程(复现论文)
......
此差异已折叠。
......@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
workspace: "paddlerec.models.multitask.share-bottom"
workspace: "models/multitask/share-bottom"
dataset:
- name: dataset_train
......
此差异已折叠。
mkdir train_data
mkdir test_data
mkdir data
train_path="data/census-income.data"
test_path="data/census-income.test"
train_data_path="train_data/"
test_data_path="test_data/"
pip install -r requirements.txt
wget -P data/ https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz
tar -zxvf data/census.tar.gz -C data/
python data_preparation.py --train_path ${train_path} \
--test_path ${test_path} \
--train_data_path ${train_data_path}\
--test_data_path ${test_data_path}
......@@ -14,7 +14,7 @@
# global settings
debug: false
workspace: "paddlerec.models.rank.AutoInt"
workspace: "models/rank/AutoInt"
dataset:
......
......@@ -14,7 +14,7 @@
# global settings
debug: false
workspace: "paddlerec.models.rank.BST"
workspace: "models/rank/BST"
dataset:
- name: sample_1
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册