Commit 3e43efca, authored by malin10

Merge branch 'master' of https://github.com/PaddlePaddle/PaddleRec into gru4rec

...@@ -16,13 +16,20 @@ before_install:
      # For pylint dockstring checker
      - sudo apt-get update
      - sudo apt-get install -y python-pip libpython-dev
+     - sudo apt-get remove python-urllib3
+     - sudo apt-get purge python-urllib3
+     - sudo rm /usr/lib/python2.7/dist-packages/chardet-*
      - sudo pip install -U pip
+     - sudo pip install --upgrade setuptools
      - sudo pip install six --upgrade --ignore-installed six
+     - sudo pip install pillow
      - sudo pip install PyYAML
      - sudo pip install pylint pytest astroid isort pre-commit
      - sudo pip install kiwisolver
-     - sudo pip install paddlepaddle==1.7.2 --ignore-installed urllib3
+     - sudo pip install scikit-build
+     - sudo pip install Pillow==5.3.0
+     - sudo pip install opencv-python==3.4.3.18
+     - sudo pip install rarfile==3.0
+     - sudo pip install paddlepaddle==1.7.2
      - sudo python setup.py install
      - |
        function timeout() { perl -e 'alarm shift; exec @ARGV' "$@"; }
......
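The CI recipe above installs pylint, pytest, astroid, isort, and pre-commit before running its hooks. As a minimal sketch (assuming the repository ships a pre-commit configuration, which this diff does not show), the same checks can be reproduced locally:

```bash
# Install the lint/test toolchain used by the CI before_install step above
pip install pylint pytest astroid isort pre-commit

# Run all configured pre-commit hooks against the working tree
pre-commit run --all-files

# Run the test suite the same way CI would
pytest
```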
README.md diff (summary): the English README that previously lived at README.md is replaced by the Chinese README, and the English text moves to README_EN.md (the language switcher now reads "(简体中文|[English](./README_EN.md))"). Besides the language swap, the new README adds the RALM look-alike recall model ([KDD 2019] Real-time Attention Based Look-alike Model for Recommender System) to the model table, adds a 预训练模型 entry (doc/pre_train_model.md) to the documentation list, switches the overview images from doc/imgs/overview_en.png and doc/imgs/rec-overview-en.png to doc/imgs/structure.png, doc/imgs/overview.png and doc/imgs/rec-overview.png, and changes the quick-start instructions to clone the repository and run `python -m paddlerec.run -m models/rank/dnn/config.yaml` instead of `python -m paddlerec.run -m paddlerec.models.rank.dnn`. The full Chinese and English READMEs are reproduced below.
(简体中文|[English](./README.md))
<p align="center">
<img align="center" src="doc/imgs/logo.png">
<p>
<p align="center">
<img align="center" src="doc/imgs/structure.png">
<p>
<p align="center">
<img align="center" src="doc/imgs/overview.png">
<p>
<h2 align="center">什么是推荐系统?</h2>
<p align="center">
<img align="center" src="doc/imgs/rec-overview.png">
<p>
- 推荐系统是在互联网信息爆炸式增长的时代背景下,帮助用户高效获得感兴趣信息的关键;
- 推荐系统也是帮助产品最大限度吸引用户、留存用户、增加用户粘性、提高用户转化率的银弹。
- 有无数优秀的产品依靠用户可感知的推荐系统建立了良好的口碑,也有无数的公司依靠直击用户痛点的推荐系统在行业中占领了一席之地。
> 可以说,谁能掌握和利用好推荐系统,谁就能在信息分发的激烈竞争中抢得先机。
> 但与此同时,有着许多问题困扰着推荐系统的开发者,比如:庞大的数据量,复杂的模型结构,低效的分布式训练环境,波动的在离线一致性,苛刻的上线部署要求,以上种种,不胜枚举。
<h2 align="center">什么是PaddleRec?</h2>
- 源于飞桨生态的搜索推荐模型 **一站式开箱即用工具**
- 适合初学者,开发者,研究者的推荐系统全流程解决方案
- 包含内容理解、匹配、召回、排序、 多任务、重排序等多个任务的完整推荐搜索算法库
| 方向 | 模型 | 单机CPU | 单机GPU | 分布式CPU | 分布式GPU | 论文 |
| :------: | :-----------------------------------------------------------------------: | :-----: | :-----: | :-------: | :-------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 内容理解 | [Text-Classification](models/contentunderstanding/classification/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][Convolutional neural networks for sentence classification](https://www.aclweb.org/anthology/D14-1181.pdf) |
| 内容理解 | [TagSpace](models/contentunderstanding/tagspace/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) |
| 匹配 | [DSSM](models/match/dssm/model.py) | ✓ | ✓ | ✓ | x | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) |
| 匹配 | [MultiView-Simnet](models/match/multiview-simnet/model.py) | ✓ | ✓ | ✓ | x | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) |
| 召回 | [TDM](models/treebased/tdm/model.py) | ✓ | >=1.8.0 | ✓ | >=1.8.0 | [KDD 2018][Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf) |
| 召回 | [fasttext](models/recall/fasttext/model.py) | ✓ | ✓ | x | x | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) |
| 召回 | [Word2Vec](models/recall/word2vec/model.py) | ✓ | ✓ | ✓ | x | [NIPS 2013][Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) |
| 召回 | [SSR](models/recall/ssr/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2016][Multi-Rate Deep Learning for Temporal Recommendation](http://sonyis.me/paperpdf/spr209-song_sigir16.pdf) |
| 召回 | [Gru4Rec](models/recall/gru4rec/model.py) | ✓ | ✓ | ✓ | ✓ | [2015][Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939) |
| 召回 | [Youtube_dnn](models/recall/youtube_dnn/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys 2016][Deep Neural Networks for YouTube Recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) |
| 召回 | [NCF](models/recall/ncf/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2017][Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf) |
| 召回 | [GNN](models/recall/gnn/model.py) | ✓ | ✓ | ✓ | ✓ | [AAAI 2019][Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855) |
| 排序 | [Logistic Regression](models/rank/logistic_regression/model.py) | ✓ | x | ✓ | x | / |
| 排序 | [Dnn](models/rank/dnn/model.py) | ✓ | ✓ | ✓ | ✓ | / |
| 排序 | [FM](models/rank/fm/model.py) | ✓ | x | ✓ | x | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) |
| 排序 | [FFM](models/rank/ffm/model.py) | ✓ | x | ✓ | x | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) |
| 排序 | [FNN](models/rank/fnn/model.py) | ✓ | x | ✓ | x | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) |
| 排序 | [Deep Crossing](models/rank/deep_crossing/model.py) | ✓ | x | ✓ | x | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) |
| 排序 | [Pnn](models/rank/pnn/model.py) | ✓ | x | ✓ | x | [ICDM 2016][Product-based Neural Networks for User Response Prediction](https://arxiv.org/pdf/1611.00144.pdf) |
| 排序 | [DCN](models/rank/dcn/model.py) | ✓ | x | ✓ | x | [KDD 2017][Deep & Cross Network for Ad Click Predictions](https://dl.acm.org/doi/pdf/10.1145/3124749.3124754) |
| 排序 | [NFM](models/rank/nfm/model.py) | ✓ | x | ✓ | x | [SIGIR 2017][Neural Factorization Machines for Sparse Predictive Analytics](https://dl.acm.org/doi/pdf/10.1145/3077136.3080777) |
| 排序 | [AFM](models/rank/afm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks](https://arxiv.org/pdf/1708.04617.pdf) |
| 排序 | [DeepFM](models/rank/deepfm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/pdf/1703.04247.pdf) |
| 排序 | [xDeepFM](models/rank/xdeepfm/model.py) | ✓ | x | ✓ | x | [KDD 2018][xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/3219819.3220023) |
| 排序 | [DIN](models/rank/din/model.py) | ✓ | x | ✓ | x | [KDD 2018][Deep Interest Network for Click-Through Rate Prediction](https://dl.acm.org/doi/pdf/10.1145/3219819.3219823) |
| 排序 | [DIEN](models/rank/dien/model.py) | ✓ | x | ✓ | x | [AAAI 2019][Deep Interest Evolution Network for Click-Through Rate Prediction](https://www.aaai.org/ojs/index.php/AAAI/article/view/4545/4423) |
| 排序 | [BST](models/rank/BST/model.py) | ✓ | x | ✓ | x | [DLP-KDD 2019][Behavior Sequence Transformer for E-commerce Recommendation in Alibaba](https://arxiv.org/pdf/1905.06874v1.pdf) |
| 排序 | [AutoInt](models/rank/AutoInt/model.py) | ✓ | x | ✓ | x | [CIKM 2019][AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/pdf/1810.11921.pdf) |
| 排序 | [Wide&Deep](models/rank/wide_deep/model.py) | ✓ | x | ✓ | x | [DLRS 2016][Wide & Deep Learning for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/2988450.2988454) |
| 排序 | [FGCNN](models/rank/fgcnn/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2019][Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1904.04447.pdf) |
| 排序 | [Fibinet](models/rank/fibinet/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction]( https://arxiv.org/pdf/1905.09433.pdf) |
| 排序 | [Flen](models/rank/flen/model.py) | ✓ | ✓ | ✓ | ✓ | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf) |
| 多任务 | [ESMM](models/multitask/esmm/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
| 多任务 | [MMOE](models/multitask/mmoe/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |
| 多任务 | [ShareBottom](models/multitask/share-bottom/model.py) | ✓ | ✓ | ✓ | ✓ | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf) |
| 重排序 | [Listwise](models/rerank/listwise/model.py) | ✓ | ✓ | ✓ | x | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) |
<h2 align="center">快速安装</h2>
### 环境要求
* Python 2.7/ 3.5 / 3.6 / 3.7
* PaddlePaddle >= 1.7.2
* 操作系统: Windows/Mac/Linux
> Windows下PaddleRec目前仅支持单机训练,分布式训练建议使用Linux环境
### 安装命令
- 安装方法一 **PIP源直接安装**
```bash
python -m pip install paddle-rec
```
> 该方法会默认下载安装`paddlepaddle v1.7.2 cpu版本`,若提示`PaddlePaddle`无法安装,则依照下述方法首先安装`PaddlePaddle`,再安装`PaddleRec`:
> - 可以在[该地址](https://pypi.org/project/paddlepaddle/1.7.2/#files),下载PaddlePaddle后手动安装whl包
> - 可以先pip安装`PaddlePaddle`,`python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple`
> - 其他安装问题可以在[Paddle Issue](https://github.com/PaddlePaddle/Paddle/issues)或[PaddleRec Issue](https://github.com/PaddlePaddle/PaddleRec/issues)提出,会有工程师及时解答
- 安装方法二 **源码编译安装**
- 安装飞桨 **注:需要用户安装版本 == 1.7.2 的飞桨**
```shell
python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple
```
- 源码安装PaddleRec
```
git clone https://github.com/PaddlePaddle/PaddleRec/
cd PaddleRec
python setup.py install
```
- PaddleRec-GPU安装方法
在使用方法一或方法二完成PaddleRec安装后,需再手动安装`paddlepaddle-gpu`,并根据自身环境(Cuda/Cudnn)选择合适的版本,安装教程请查阅[飞桨-开始使用](https://www.paddlepaddle.org.cn/install/quick)
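A minimal sketch of the GPU switch described above; the exact `paddlepaddle-gpu` version tag depends on your CUDA/cuDNN combination (the `post107` tag below, for CUDA 10.0 + cuDNN 7, is only an assumed example; take the real tag from the linked install guide):

```bash
# Replace the CPU-only Paddle that paddle-rec pulled in with a GPU build.
# The version tag below is illustrative; pick the one matching your CUDA/cuDNN.
python -m pip uninstall -y paddlepaddle
python -m pip install paddlepaddle-gpu==1.7.2.post107 -i https://mirror.baidu.com/pypi/simple
```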
<h2 align="center">一键启动</h2>
我们以排序模型中的`dnn`模型为例介绍PaddleRec的一键启动。训练数据来源为[Criteo数据集](https://www.kaggle.com/c/criteo-display-ad-challenge/),我们从中截取了100条数据:
```bash
# 使用CPU进行单机训练
python -m paddlerec.run -m paddlerec.models.rank.dnn
```
<h2 align="center">帮助文档</h2>
### 项目背景
* [推荐系统介绍](doc/rec_background.md)
* [分布式深度学习介绍](doc/ps_background.md)
### 快速开始
* [十分钟上手PaddleRec](https://aistudio.baidu.com/aistudio/projectdetail/559336)
### 入门教程
* [数据准备](doc/slot_reader.md)
* [模型调参](doc/model.md)
* [启动单机训练](doc/train.md)
* [启动分布式训练](doc/distributed_train.md)
* [启动预测](doc/predict.md)
* [快速部署](doc/serving.md)
### 进阶教程
* [自定义Reader](doc/custom_reader.md)
* [自定义模型](doc/model_develop.md)
* [自定义流程](doc/trainer_develop.md)
* [yaml配置说明](doc/yaml.md)
* [PaddleRec设计文档](doc/design.md)
### Benchmark
* [Benchmark](doc/benchmark.md)
### FAQ
* [常见问题FAQ](doc/faq.md)
<h2 align="center">社区</h2>
<p align="center">
<br>
<img alt="Release" src="https://img.shields.io/badge/Release-0.1.0-yellowgreen">
<img alt="License" src="https://img.shields.io/github/license/PaddlePaddle/PaddleRec">
<img alt="Slack" src="https://img.shields.io/badge/Join-Slack-green">
<br>
<p>
### 版本历史
- 2020.06.17 - PaddleRec v0.1.0
- 2020.06.03 - PaddleRec v0.0.2
- 2020.05.14 - PaddleRec v0.0.1
### 许可证书
本项目的发布受[Apache 2.0 license](LICENSE)许可认证。
### 联系我们
如有意见、建议及使用中的BUG,欢迎在[GitHub Issue](https://github.com/PaddlePaddle/PaddleRec/issues)提交
亦可通过以下方式与我们沟通交流:
- QQ群号码:`861717190`
- 微信小助手微信号:`paddlerec2020`
<p align="center"><img width="200" height="200" margin="500" src="./doc/imgs/QQ_group.png"/>&#8194;&#8194;&#8194;&#8194;&#8194<img width="200" height="200" src="doc/imgs/weixin_supporter.png"/></p>
<p align="center">PaddleRec交流QQ群&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;PaddleRec微信小助手</p>
([简体中文](./README.md)|English)
<p align="center">
<img align="center" src="doc/imgs/logo.png">
<p>
<p align="center">
<img align="center" src="doc/imgs/overview_en.png">
<p>
<h2 align="center">What is recommendation system ?</h2>
<p align="center">
<img align="center" src="doc/imgs/rec-overview-en.png">
<p>
- A recommendation system helps users quickly find useful and interesting information in massive amounts of data.
- A recommendation system is also a silver bullet for attracting users, retaining users, and increasing user stickiness and conversion.
> Whoever makes better use of recommendation systems gains the advantage in this fierce competition.
>
> At the same time, building a recommendation system raises many practical problems, such as huge data volumes, complex model structures, and inefficient distributed training.
<h2 align="center">What is PaddleRec ?</h2>
- A quick-start tool for search & recommendation algorithms, based on [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html)
- A complete recommendation-system solution for beginners, developers, and researchers
- A recommendation algorithm library covering content understanding, matching, recall, ranking, multi-task learning, re-ranking, and more
| Type | Algorithm | CPU | GPU | Parameter-Server | Multi-GPU | Paper |
| :-------------------: | :-----------------------------------------------------------------------: | :---: | :-----: | :--------------: | :-------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Content-Understanding | [Text-Classification](models/contentunderstanding/classification/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][Convolutional neural networks for sentence classification](https://www.aclweb.org/anthology/D14-1181.pdf) |
| Content-Understanding | [TagSpace](models/contentunderstanding/tagspace/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) |
| Match | [DSSM](models/match/dssm/model.py) | ✓ | ✓ | ✓ | x | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) |
| Match | [MultiView-Simnet](models/match/multiview-simnet/model.py) | ✓ | ✓ | ✓ | x | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) |
| Recall | [TDM](models/treebased/tdm/model.py) | ✓ | >=1.8.0 | ✓ | >=1.8.0 | [KDD 2018][Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf) |
| Recall | [fasttext](models/recall/fasttext/model.py) | ✓ | ✓ | x | x | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) |
| Recall | [Word2Vec](models/recall/word2vec/model.py) | ✓ | ✓ | ✓ | x | [NIPS 2013][Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) |
| Recall | [SSR](models/recall/ssr/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2016][Multi-Rate Deep Learning for Temporal Recommendation](http://sonyis.me/paperpdf/spr209-song_sigir16.pdf) |
| Recall | [Gru4Rec](models/recall/gru4rec/model.py) | ✓ | ✓ | ✓ | ✓ | [2015][Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939) |
| Recall | [Youtube_dnn](models/recall/youtube_dnn/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys 2016][Deep Neural Networks for YouTube Recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) |
| Recall | [NCF](models/recall/ncf/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2017][Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf) |
| Recall | [GNN](models/recall/gnn/model.py) | ✓ | ✓ | ✓ | ✓ | [AAAI 2019][Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855) |
| Recall | [RALM](models/recall/look-alike_recall/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2019][Real-time Attention Based Look-alike Model for Recommender System](https://arxiv.org/pdf/1906.05022.pdf) |
| Rank | [Logistic Regression](models/rank/logistic_regression/model.py) | ✓ | x | ✓ | x | / |
| Rank | [Dnn](models/rank/dnn/model.py) | ✓ | ✓ | ✓ | ✓ | / |
| Rank | [FM](models/rank/fm/model.py) | ✓ | x | ✓ | x | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) |
| Rank | [FFM](models/rank/ffm/model.py) | ✓ | x | ✓ | x | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) |
| Rank | [FNN](models/rank/fnn/model.py) | ✓ | x | ✓ | x | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) |
| Rank | [Deep Crossing](models/rank/deep_crossing/model.py) | ✓ | x | ✓ | x | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) |
| Rank | [Pnn](models/rank/pnn/model.py) | ✓ | x | ✓ | x | [ICDM 2016][Product-based Neural Networks for User Response Prediction](https://arxiv.org/pdf/1611.00144.pdf) |
| Rank | [DCN](models/rank/dcn/model.py) | ✓ | x | ✓ | x | [KDD 2017][Deep & Cross Network for Ad Click Predictions](https://dl.acm.org/doi/pdf/10.1145/3124749.3124754) |
| Rank | [NFM](models/rank/nfm/model.py) | ✓ | x | ✓ | x | [SIGIR 2017][Neural Factorization Machines for Sparse Predictive Analytics](https://dl.acm.org/doi/pdf/10.1145/3077136.3080777) |
| Rank | [AFM](models/rank/afm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks](https://arxiv.org/pdf/1708.04617.pdf) |
| Rank | [DeepFM](models/rank/deepfm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/pdf/1703.04247.pdf) |
| Rank | [xDeepFM](models/rank/xdeepfm/model.py) | ✓ | x | ✓ | x | [KDD 2018][xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/3219819.3220023) |
| Rank | [DIN](models/rank/din/model.py) | ✓ | x | ✓ | x | [KDD 2018][Deep Interest Network for Click-Through Rate Prediction](https://dl.acm.org/doi/pdf/10.1145/3219819.3219823) |
| Rank | [DIEN](models/rank/dien/model.py) | ✓ | x | ✓ | x | [AAAI 2019][Deep Interest Evolution Network for Click-Through Rate Prediction](https://www.aaai.org/ojs/index.php/AAAI/article/view/4545/4423) |
| Rank | [BST](models/rank/BST/model.py) | ✓ | x | ✓ | x | [DLP-KDD 2019][Behavior Sequence Transformer for E-commerce Recommendation in Alibaba](https://arxiv.org/pdf/1905.06874v1.pdf) |
| Rank | [AutoInt](models/rank/AutoInt/model.py) | ✓ | x | ✓ | x | [CIKM 2019][AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/pdf/1810.11921.pdf) |
| Rank | [Wide&Deep](models/rank/wide_deep/model.py) | ✓ | x | ✓ | x | [DLRS 2016][Wide & Deep Learning for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/2988450.2988454) |
| Rank | [FGCNN](models/rank/fgcnn/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2019][Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1904.04447.pdf) |
| Rank | [Fibinet](models/rank/fibinet/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction]( https://arxiv.org/pdf/1905.09433.pdf) |
| Rank | [Flen](models/rank/flen/model.py) | ✓ | ✓ | ✓ | ✓ | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf) |
| Multi-Task | [ESMM](models/multitask/esmm/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
| Multi-Task | [MMOE](models/multitask/mmoe/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |
| Multi-Task | [ShareBottom](models/multitask/share-bottom/model.py) | ✓ | ✓ | ✓ | ✓ | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf) |
| Re-Rank | [Listwise](models/rerank/listwise/model.py) | ✓ | ✓ | ✓ | x | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) |
<h2 align="center">Getting Started</h2>
### Environmental requirements
* Python 2.7/ 3.5 / 3.6 / 3.7
* PaddlePaddle >= 1.7.2
* operating system: Windows/Mac/Linux
> Linux is recommended for distributed training
### Installation
1. **Install by pip**
```bash
python -m pip install paddle-rec
```
> This method downloads and installs the CPU build of `paddlepaddle` 1.7.2 by default. If `PaddlePaddle` cannot be installed automatically, install `PaddlePaddle` manually first and then install `PaddleRec` again:
> - Download the `PaddlePaddle` wheel from [this page](https://pypi.org/project/paddlepaddle/1.7.2/#files) and install it by pip, or
> - Install `PaddlePaddle` directly with pip: `python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple`
> - Other installation problems can be raised in [Paddle Issue](https://github.com/PaddlePaddle/Paddle/issues) or [PaddleRec Issue](https://github.com/PaddlePaddle/PaddleRec/issues)
2. **Install by source code**
- Install PaddlePaddle
```shell
python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple
```
- Install PaddleRec by source code
```
git clone https://github.com/PaddlePaddle/PaddleRec/
cd PaddleRec
python setup.py install
```
- Install PaddleRec-GPU
After installing `PaddleRec`, please install the appropriate version of `paddlepaddle-gpu` for your environment (CUDA/cuDNN); refer to the [Installation Manuals](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html).
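After swapping in `paddlepaddle-gpu`, a quick sanity check helps confirm the wheel matches your driver and CUDA runtime. This sketch relies on the built-in `install_check` utility of the Paddle 1.7 line:

```bash
# Verify the freshly installed Paddle build (CPU or GPU) actually runs.
python -c "import paddle.fluid as fluid; print('CUDA build:', fluid.is_compiled_with_cuda()); fluid.install_check.run_check()"
```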
<h2 align="center">Quick Start</h2>
We take the `dnn` algorithm as an example to get started with `PaddleRec`, using 100 sample records taken from the [Criteo Dataset](https://www.kaggle.com/c/criteo-display-ad-challenge/):
```bash
# Training with cpu
git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
cd paddle-rec
python -m paddlerec.run -m models/rank/dnn/config.yaml
```
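The `-m` argument is simply a path to a model's `config.yaml` inside the cloned repository, so other bundled models launch the same way (assuming each model directory listed in the table above ships a `config.yaml` next to its `model.py`, as `models/rank/dnn` does):

```bash
# Same one-liner, different models (paths assumed from the model table above)
python -m paddlerec.run -m models/recall/word2vec/config.yaml
python -m paddlerec.run -m models/multitask/mmoe/config.yaml
```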
<h2 align="center">Documentation</h2>
### Background
* [Recommendation System](doc/rec_background.md)
* [Distributed deep learning](doc/ps_background.md)
### Introductory Project
* [Get started with PaddleRec in ten minutes](https://aistudio.baidu.com/aistudio/projectdetail/559336)
### Introductory tutorial
* [Data](doc/slot_reader.md)
* [Model](doc/model.md)
* [Local Train](doc/train.md)
* [Distributed Train](doc/distributed_train.md)
* [Predict](doc/predict.md)
* [Serving](doc/serving.md)
### Advanced tutorial
* [Custom Reader](doc/custom_reader.md)
* [Custom Model](doc/model_develop.md)
* [Custom Training Process](doc/trainer_develop.md)
* [Configuration description of yaml](doc/yaml.md)
* [Design document of PaddleRec](doc/design.md)
### Benchmark
* [Benchmark](doc/benchmark.md)
### FAQ
* [Common Problem FAQ](doc/faq.md)
<h2 align="center">Community</h2>
<p align="center">
<br>
<img alt="Release" src="https://img.shields.io/badge/Release-0.1.0-yellowgreen">
<img alt="License" src="https://img.shields.io/github/license/PaddlePaddle/PaddleRec">
<img alt="Slack" src="https://img.shields.io/badge/Join-Slack-green">
<br>
<p>
### Version history
- 2020.06.17 - PaddleRec v0.1.0
- 2020.06.03 - PaddleRec v0.0.2
- 2020.05.14 - PaddleRec v0.0.1
### License
[Apache 2.0 license](LICENSE)
### Contact us
For any feedback, bug report, or suggestion, please open a [GitHub Issue](https://github.com/PaddlePaddle/PaddleRec/issues).
You can also communicate with us in the following ways:
- QQ group id:`861717190`
- Wechat account:`paddlerec2020`
<p align="center"><img width="200" height="200" margin="500" src="./doc/imgs/QQ_group.png"/>&#8194;&#8194;&#8194;&#8194;&#8194<img width="200" height="200" src="doc/imgs/weixin_supporter.png"/></p>
<p align="center">PaddleRec QQ Group&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;PaddleRec Wechat account</p>
echo "Run before_hook.sh ..." echo "Run before_hook.sh ..."
wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz --no-check-certificate
tar -xf PaddleRec.tar.gz tar -xf PaddleRec.tar.gz
...@@ -10,6 +10,6 @@ python setup.py install ...@@ -10,6 +10,6 @@ python setup.py install
pip uninstall -y paddlepaddle pip uninstall -y paddlepaddle
pip install paddlepaddle-gpu==<$ PADDLEPADDLE_VERSION $> --index-url=http://pip.baidu.com/pypi/simple --trusted-host pip.baidu.com pip install paddlepaddle==<$ PADDLEPADDLE_VERSION $> --index-url=http://pip.baidu.com/pypi/simple --trusted-host pip.baidu.com
echo "End before_hook.sh ..." echo "End before_hook.sh ..."
echo "Run before_hook.sh ..." echo "Run before_hook.sh ..."
wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz --no-check-certificate
tar -xf PaddleRec.tar.gz tar -xf PaddleRec.tar.gz
......
...@@ -39,7 +39,12 @@ function _before_submit() {
      elif [ ${DISTRIBUTE_MODE} == "COLLECTIVE_GPU_K8S" ]; then
          _gen_gpu_before_hook
          _gen_k8s_config
-         _gen_k8s_job
+         _gen_k8s_gpu_job
+         _gen_end_hook
+     elif [ ${DISTRIBUTE_MODE} == "PS_CPU_K8S" ]; then
+         _gen_cpu_before_hook
+         _gen_k8s_config
+         _gen_k8s_cpu_job
          _gen_end_hook
      fi
...@@ -54,6 +59,7 @@ function _gen_mpi_config() {
          -e "s#<$ OUTPUT_PATH $>#$OUTPUT_PATH#g" \
          -e "s#<$ THIRDPARTY_PATH $>#$THIRDPARTY_PATH#g" \
          -e "s#<$ CPU_NUM $>#$max_thread_num#g" \
+         -e "s#<$ USE_PYTHON3 $>#$USE_PYTHON3#g" \
          -e "s#<$ FLAGS_communicator_is_sgd_optimizer $>#$FLAGS_communicator_is_sgd_optimizer#g" \
          -e "s#<$ FLAGS_communicator_send_queue_size $>#$FLAGS_communicator_send_queue_size#g" \
          -e "s#<$ FLAGS_communicator_thread_pool_size $>#$FLAGS_communicator_thread_pool_size#g" \
...@@ -71,6 +77,7 @@ function _gen_k8s_config() {
          -e "s#<$ AFS_REMOTE_MOUNT_POINT $>#$AFS_REMOTE_MOUNT_POINT#g" \
          -e "s#<$ OUTPUT_PATH $>#$OUTPUT_PATH#g" \
          -e "s#<$ CPU_NUM $>#$max_thread_num#g" \
+         -e "s#<$ USE_PYTHON3 $>#$USE_PYTHON3#g" \
          -e "s#<$ FLAGS_communicator_is_sgd_optimizer $>#$FLAGS_communicator_is_sgd_optimizer#g" \
          -e "s#<$ FLAGS_communicator_send_queue_size $>#$FLAGS_communicator_send_queue_size#g" \
          -e "s#<$ FLAGS_communicator_thread_pool_size $>#$FLAGS_communicator_thread_pool_size#g" \
...@@ -101,6 +108,7 @@ function _gen_end_hook() {
  function _gen_mpi_job() {
      echo "gen mpi_job.sh"
      sed -e "s#<$ GROUP_NAME $>#$GROUP_NAME#g" \
+         -e "s#<$ JOB_NAME $>#$OLD_JOB_NAME#g" \
          -e "s#<$ AK $>#$AK#g" \
          -e "s#<$ SK $>#$SK#g" \
          -e "s#<$ MPI_PRIORITY $>#$PRIORITY#g" \
...@@ -109,18 +117,34 @@ function _gen_mpi_job() {
          ${abs_dir}/cloud/mpi_job.sh.template >${PWD}/job.sh
  }
- function _gen_k8s_job() {
+ function _gen_k8s_gpu_job() {
      echo "gen k8s_job.sh"
      sed -e "s#<$ GROUP_NAME $>#$GROUP_NAME#g" \
+         -e "s#<$ JOB_NAME $>#$OLD_JOB_NAME#g" \
          -e "s#<$ AK $>#$AK#g" \
          -e "s#<$ SK $>#$SK#g" \
          -e "s#<$ K8S_PRIORITY $>#$PRIORITY#g" \
          -e "s#<$ K8S_TRAINERS $>#$K8S_TRAINERS#g" \
+         -e "s#<$ K8S_CPU_CORES $>#$K8S_CPU_CORES#g" \
          -e "s#<$ K8S_GPU_CARD $>#$K8S_GPU_CARD#g" \
          -e "s#<$ START_CMD $>#$START_CMD#g" \
          ${abs_dir}/cloud/k8s_job.sh.template >${PWD}/job.sh
  }
+ function _gen_k8s_cpu_job() {
+     echo "gen k8s_job.sh"
+     sed -e "s#<$ GROUP_NAME $>#$GROUP_NAME#g" \
+         -e "s#<$ JOB_NAME $>#$OLD_JOB_NAME#g" \
+         -e "s#<$ AK $>#$AK#g" \
+         -e "s#<$ SK $>#$SK#g" \
+         -e "s#<$ K8S_PRIORITY $>#$PRIORITY#g" \
+         -e "s#<$ K8S_TRAINERS $>#$K8S_TRAINERS#g" \
+         -e "s#<$ K8S_PS_NUM $>#$K8S_PS_NUM#g" \
+         -e "s#<$ K8S_PS_CORES $>#$K8S_PS_CORES#g" \
+         -e "s#<$ K8S_CPU_CORES $>#$K8S_CPU_CORES#g" \
+         -e "s#<$ START_CMD $>#$START_CMD#g" \
+         ${abs_dir}/cloud/k8s_cpu_job.sh.template >${PWD}/job.sh
+ }
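All of the `_gen_*_job` and `_gen_*_config` helpers above are plain `sed` template expansions: each `<$ NAME $>` placeholder in a template is replaced by the corresponding shell variable. A self-contained sketch of that mechanism, with a made-up template file and made-up values:

```bash
# Stand-alone illustration of the <$ NAME $> substitution used above.
# template.txt and the variable values are invented for this demo.
cat > template.txt <<'EOF'
job_name=<$ JOB_NAME $>
k8s_trainers=<$ K8S_TRAINERS $>
k8s_ps_num=<$ K8S_PS_NUM $>
EOF

JOB_NAME="demo_ps_cpu"
K8S_TRAINERS=4
K8S_PS_NUM=2

sed -e "s#<$ JOB_NAME $>#$JOB_NAME#g" \
    -e "s#<$ K8S_TRAINERS $>#$K8S_TRAINERS#g" \
    -e "s#<$ K8S_PS_NUM $>#$K8S_PS_NUM#g" \
    template.txt > job.sh

cat job.sh
# job_name=demo_ps_cpu
# k8s_trainers=4
# k8s_ps_num=2
```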
  #-----------------------------------------------------------------------------------------------------------------
...@@ -145,6 +169,7 @@ function _submit() {
  function package_hook() {
      cur_time=`date +"%Y%m%d%H%M"`
      new_job_name="${JOB_NAME}_${cur_time}"
+     export OLD_JOB_NAME=${JOB_NAME}
      export JOB_NAME=${new_job_name}
      export job_file_path="${PWD}/${new_job_name}"
      mkdir ${job_file_path}
......
...@@ -19,6 +19,8 @@ afs_local_mount_point="/root/paddlejob/workspace/env_run/afs/"
  # 新k8s afs挂载帮助文档: http://wiki.baidu.com/pages/viewpage.action?pageId=906443193
  PADDLE_PADDLEREC_ROLE=WORKER
+ PADDLEREC_CLUSTER_TYPE=K8S
+ use_python3=<$ USE_PYTHON3 $>
  CPU_NUM=<$ CPU_NUM $>
  GLOG_v=0
......
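The two new fields are consumed later by the job's entry script; a hypothetical consumer (not the actual PaddleRec launcher) might branch on them like this:

```bash
# Hypothetical reader of the new config.ini fields shown above.
PADDLEREC_CLUSTER_TYPE=K8S   # assumed value, normally rendered from the template
use_python3=1                # assumed value, normally rendered from <$ USE_PYTHON3 $>

if [ "${use_python3}" = "1" ]; then
    py=python3
else
    py=python
fi
echo "cluster type: ${PADDLEREC_CLUSTER_TYPE}, interpreter: ${py}"
```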
#!/bin/bash
###############################################################
## 注意-- 注意--注意 ##
## K8S PS-CPU多机作业作业示例 ##
###############################################################
job_name=<$ JOB_NAME $>
# 作业参数
group_name="<$ GROUP_NAME $>"
job_version="paddle-fluid-v1.7.1"
start_cmd="<$ START_CMD $>"
wall_time="2000:00:00"
k8s_priority=<$ K8S_PRIORITY $>
k8s_trainers=<$ K8S_TRAINERS $>
k8s_cpu_cores=<$ K8S_CPU_CORES $>
k8s_ps_num=<$ K8S_PS_NUM $>
k8s_ps_cores=<$ K8S_PS_CORES $>
# 你的ak/sk(可在paddlecloud web页面【个人中心】处获取)
ak=<$ AK $>
sk=<$ SK $>
paddlecloud job --ak ${ak} --sk ${sk} \
train --job-name ${job_name} \
--group-name ${group_name} \
--job-conf config.ini \
--start-cmd "${start_cmd}" \
--files ./* \
--job-version ${job_version} \
--k8s-priority ${k8s_priority} \
--wall-time ${wall_time} \
--k8s-trainers ${k8s_trainers} \
--k8s-cpu-cores ${k8s_cpu_cores} \
--k8s-ps-num ${k8s_ps_num} \
--k8s-ps-cores ${k8s_ps_cores} \
--is-standalone 0 \
--distribute-job-type "PSERVER" \
--json
\ No newline at end of file
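Once `cluster.sh` fills in the placeholders, the generated `job.sh` boils down to a single `paddlecloud` call. A sketch with invented values (ak/sk, group name, priority, and resource counts are placeholders, not real settings):

```bash
# Example rendering of the PS-CPU job template above; every value is illustrative.
paddlecloud job --ak "your_ak" --sk "your_sk" \
    train --job-name paddlerec_demo_202006171030 \
    --group-name "your_k8s_group" \
    --job-conf config.ini \
    --start-cmd "python -m paddlerec.run -m ./config.yaml" \
    --files ./* \
    --job-version paddle-fluid-v1.7.1 \
    --k8s-priority high \
    --wall-time 2000:00:00 \
    --k8s-trainers 4 \
    --k8s-cpu-cores 8 \
    --k8s-ps-num 2 \
    --k8s-ps-cores 4 \
    --is-standalone 0 \
    --distribute-job-type "PSERVER" \
    --json
```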
...@@ -3,18 +3,30 @@ ...@@ -3,18 +3,30 @@
## 注意-- 注意--注意 ## ## 注意-- 注意--注意 ##
## K8S NCCL2多机作业作业示例 ## ## K8S NCCL2多机作业作业示例 ##
############################################################### ###############################################################
job_name=${JOB_NAME} job_name=<$ JOB_NAME $>
# 作业参数 # 作业参数
group_name="<$ GROUP_NAME $>" group_name="<$ GROUP_NAME $>"
job_version="paddle-fluid-v1.7.1" job_version="paddle-fluid-v1.7.1"
start_cmd="<$ START_CMD $>" start_cmd="<$ START_CMD $>"
wall_time="10:00:00" wall_time="2000:00:00"
k8s_priority=<$ K8S_PRIORITY $> k8s_priority=<$ K8S_PRIORITY $>
k8s_trainers=<$ K8S_TRAINERS $> k8s_trainers=<$ K8S_TRAINERS $>
k8s_cpu_cores=<$ K8S_CPU_CORES $>
k8s_gpu_cards=<$ K8S_GPU_CARD $> k8s_gpu_cards=<$ K8S_GPU_CARD $>
is_stand_alone=0
nccl="--distribute-job-type "NCCL2""
if [ ${k8s_trainers} == 1 ];then
is_stand_alone=1
nccl="--job-remark single-trainer"
if [ ${k8s_gpu_cards} == 1 ];then
nccl="--job-remark single-gpu"
echo "Attention: Use single GPU card for PaddleRec distributed training, please set runner class from 'cluster_train' to 'train' in config.yaml."
fi
fi
# Your ak/sk (available from the [Personal Center] page of the paddlecloud web console) # Your ak/sk (available from the [Personal Center] page of the paddlecloud web console)
ak=<$ AK $> ak=<$ AK $>
sk=<$ SK $> sk=<$ SK $>
...@@ -27,9 +39,11 @@ paddlecloud job --ak ${ak} --sk ${sk} \ ...@@ -27,9 +39,11 @@ paddlecloud job --ak ${ak} --sk ${sk} \
--files ./* \ --files ./* \
--job-version ${job_version} \ --job-version ${job_version} \
--k8s-trainers ${k8s_trainers} \ --k8s-trainers ${k8s_trainers} \
--k8s-cpu-cores ${k8s_cpu_cores} \
--k8s-gpu-cards ${k8s_gpu_cards} \ --k8s-gpu-cards ${k8s_gpu_cards} \
--k8s-priority ${k8s_priority} \ --k8s-priority ${k8s_priority} \
--wall-time ${wall_time} \ --wall-time ${wall_time} \
--is-standalone 0 \ --is-standalone ${is_stand_alone} \
--distribute-job-type "NCCL2" \ --json \
--json ${nccl}
\ No newline at end of file
\ No newline at end of file
...@@ -17,6 +17,8 @@ output_path=<$ OUTPUT_PATH $> ...@@ -17,6 +17,8 @@ output_path=<$ OUTPUT_PATH $>
thirdparty_path=<$ THIRDPARTY_PATH $> thirdparty_path=<$ THIRDPARTY_PATH $>
PADDLE_PADDLEREC_ROLE=WORKER PADDLE_PADDLEREC_ROLE=WORKER
PADDLEREC_CLUSTER_TYPE=MPI
use_python3=<$ USE_PYTHON3 $>
CPU_NUM=<$ CPU_NUM $> CPU_NUM=<$ CPU_NUM $>
GLOG_v=0 GLOG_v=0
......
...@@ -3,13 +3,13 @@ ...@@ -3,13 +3,13 @@
## Attention -- Attention -- Attention ## ## Attention -- Attention -- Attention ##
## Demo of an MPI-type job ## ## Demo of an MPI-type job ##
############################################################### ###############################################################
job_name=${JOB_NAME} job_name=<$ JOB_NAME $>
# Job parameters # Job parameters
group_name=<$ GROUP_NAME $> group_name=<$ GROUP_NAME $>
job_version="paddle-fluid-v1.7.1" job_version="paddle-fluid-v1.7.1"
start_cmd="<$ START_CMD $>" start_cmd="<$ START_CMD $>"
wall_time="2:00:00" wall_time="2000:00:00"
# Your ak/sk (available from the [Personal Center] page of the paddlecloud web console) # Your ak/sk (available from the [Personal Center] page of the paddlecloud web console)
ak=<$ AK $> ak=<$ AK $>
......
...@@ -67,10 +67,10 @@ class ClusterEngine(Engine): ...@@ -67,10 +67,10 @@ class ClusterEngine(Engine):
@staticmethod @staticmethod
def workspace_replace(): def workspace_replace():
workspace = envs.get_runtime_environ("workspace") remote_workspace = envs.get_runtime_environ("remote_workspace")
for k, v in os.environ.items(): for k, v in os.environ.items():
v = v.replace("{workspace}", workspace) v = v.replace("{workspace}", remote_workspace)
os.environ[k] = str(v) os.environ[k] = str(v)
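A small illustration (hypothetical environment entries) of what the switch from workspace to remote_workspace means: every "{workspace}" placeholder exported to the environment now resolves to the remote path rather than the local one.

import os

os.environ["TRAIN_DATA_PATH"] = "{workspace}/data/train"    # hypothetical entry
remote_workspace = "afs:/user/paddle/demo_model"             # hypothetical remote path
for k, v in os.environ.items():
    os.environ[k] = str(v).replace("{workspace}", remote_workspace)
# TRAIN_DATA_PATH is now "afs:/user/paddle/demo_model/data/train"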
def run(self): def run(self):
...@@ -98,14 +98,12 @@ class ClusterEngine(Engine): ...@@ -98,14 +98,12 @@ class ClusterEngine(Engine):
cluster_env_check_tool = PaddleCloudMpiEnv() cluster_env_check_tool = PaddleCloudMpiEnv()
else: else:
raise ValueError( raise ValueError(
"Paddlecloud with Mpi don't support GPU training, check your config" "Paddlecloud with Mpi don't support GPU training, check your config.yaml & backend.yaml"
) )
elif cluster_type.upper() == "K8S": elif cluster_type.upper() == "K8S":
if fleet_mode == "PS": if fleet_mode == "PS":
if device == "CPU": if device == "CPU":
raise ValueError( cluster_env_check_tool = CloudPsCpuEnv()
"PS-CPU on paddlecloud is not supported at this time, comming soon"
)
elif device == "GPU": elif device == "GPU":
raise ValueError( raise ValueError(
"PS-GPU on paddlecloud is not supported at this time, comming soon" "PS-GPU on paddlecloud is not supported at this time, comming soon"
...@@ -115,7 +113,7 @@ class ClusterEngine(Engine): ...@@ -115,7 +113,7 @@ class ClusterEngine(Engine):
cluster_env_check_tool = CloudCollectiveEnv() cluster_env_check_tool = CloudCollectiveEnv()
elif device == "CPU": elif device == "CPU":
raise ValueError( raise ValueError(
"Unexpected config -> device: CPU with fleet_mode: Collective, check your config" "Unexpected config -> device: CPU with fleet_mode: Collective, check your config.yaml"
) )
else: else:
raise ValueError("cluster_type {} error, must in MPI/K8S".format( raise ValueError("cluster_type {} error, must in MPI/K8S".format(
...@@ -161,23 +159,30 @@ class ClusterEnvBase(object): ...@@ -161,23 +159,30 @@ class ClusterEnvBase(object):
self.cluster_env["PADDLE_VERSION"] = self.backend_env.get( self.cluster_env["PADDLE_VERSION"] = self.backend_env.get(
"config.paddle_version", "1.7.2") "config.paddle_version", "1.7.2")
# python_version
self.cluster_env["USE_PYTHON3"] = self.backend_env.get(
"config.use_python3", "0")
# communicator # communicator
max_thread_num = int(envs.get_runtime_environ("max_thread_num"))
self.cluster_env[ self.cluster_env[
"FLAGS_communicator_is_sgd_optimizer"] = self.backend_env.get( "FLAGS_communicator_is_sgd_optimizer"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_is_sgd_optimizer", 0) "config.communicator.FLAGS_communicator_is_sgd_optimizer", 0)
self.cluster_env[ self.cluster_env[
"FLAGS_communicator_send_queue_size"] = self.backend_env.get( "FLAGS_communicator_send_queue_size"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_send_queue_size", 5) "config.communicator.FLAGS_communicator_send_queue_size",
max_thread_num)
self.cluster_env[ self.cluster_env[
"FLAGS_communicator_thread_pool_size"] = self.backend_env.get( "FLAGS_communicator_thread_pool_size"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_thread_pool_size", 32) "config.communicator.FLAGS_communicator_thread_pool_size", 32)
self.cluster_env[ self.cluster_env[
"FLAGS_communicator_max_merge_var_num"] = self.backend_env.get( "FLAGS_communicator_max_merge_var_num"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_max_merge_var_num", 5) "config.communicator.FLAGS_communicator_max_merge_var_num",
max_thread_num)
self.cluster_env[ self.cluster_env[
"FLAGS_communicator_max_send_grad_num_before_recv"] = self.backend_env.get( "FLAGS_communicator_max_send_grad_num_before_recv"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_max_send_grad_num_before_recv", "config.communicator.FLAGS_communicator_max_send_grad_num_before_recv",
5) max_thread_num)
self.cluster_env["FLAGS_communicator_fake_rpc"] = self.backend_env.get( self.cluster_env["FLAGS_communicator_fake_rpc"] = self.backend_env.get(
"config.communicator.FLAGS_communicator_fake_rpc", 0) "config.communicator.FLAGS_communicator_fake_rpc", 0)
self.cluster_env["FLAGS_rpc_retry_times"] = self.backend_env.get( self.cluster_env["FLAGS_rpc_retry_times"] = self.backend_env.get(
...@@ -234,7 +239,7 @@ class PaddleCloudMpiEnv(ClusterEnvBase): ...@@ -234,7 +239,7 @@ class PaddleCloudMpiEnv(ClusterEnvBase):
"config.train_data_path", "") "config.train_data_path", "")
if self.cluster_env["TRAIN_DATA_PATH"] == "": if self.cluster_env["TRAIN_DATA_PATH"] == "":
raise ValueError( raise ValueError(
"No -- TRAIN_DATA_PATH -- found in your backend.yaml, please check." "No -- TRAIN_DATA_PATH -- found in your backend.yaml, please add train_data_path in your backend yaml."
) )
# test_data_path # test_data_path
self.cluster_env["TEST_DATA_PATH"] = self.backend_env.get( self.cluster_env["TEST_DATA_PATH"] = self.backend_env.get(
...@@ -274,7 +279,7 @@ class PaddleCloudK8sEnv(ClusterEnvBase): ...@@ -274,7 +279,7 @@ class PaddleCloudK8sEnv(ClusterEnvBase):
category=UserWarning, category=UserWarning,
stacklevel=2) stacklevel=2)
warnings.warn( warnings.warn(
"The remote mount point will be mounted to the ./afs/", "The remote afs path will be mounted to the ./afs/",
category=UserWarning, category=UserWarning,
stacklevel=2) stacklevel=2)
...@@ -293,3 +298,21 @@ class CloudCollectiveEnv(PaddleCloudK8sEnv): ...@@ -293,3 +298,21 @@ class CloudCollectiveEnv(PaddleCloudK8sEnv):
"submit.k8s_gpu_card", 1) "submit.k8s_gpu_card", 1)
self.cluster_env["K8S_CPU_CORES"] = self.backend_env.get( self.cluster_env["K8S_CPU_CORES"] = self.backend_env.get(
"submit.k8s_cpu_cores", 1) "submit.k8s_cpu_cores", 1)
class CloudPsCpuEnv(PaddleCloudK8sEnv):
def __init__(self):
super(CloudPsCpuEnv, self).__init__()
def env_check(self):
super(CloudPsCpuEnv, self).env_check()
self.cluster_env["DISTRIBUTE_MODE"] = "PS_CPU_K8S"
self.cluster_env["K8S_TRAINERS"] = self.backend_env.get(
"submit.k8s_trainers", 1)
self.cluster_env["K8S_CPU_CORES"] = self.backend_env.get(
"submit.k8s_cpu_cores", 2)
self.cluster_env["K8S_PS_NUM"] = self.backend_env.get(
"submit.k8s_ps_num", 1)
self.cluster_env["K8S_PS_CORES"] = self.backend_env.get(
"submit.k8s_ps_cores", 2)
...@@ -22,6 +22,19 @@ trainers = {} ...@@ -22,6 +22,19 @@ trainers = {}
def trainer_registry(): def trainer_registry():
trainers["SingleTrainer"] = os.path.join(trainer_abs, "single_trainer.py")
trainers["ClusterTrainer"] = os.path.join(trainer_abs,
"cluster_trainer.py")
trainers["CtrCodingTrainer"] = os.path.join(trainer_abs,
"ctr_coding_trainer.py")
trainers["CtrModulTrainer"] = os.path.join(trainer_abs,
"ctr_modul_trainer.py")
trainers["TDMSingleTrainer"] = os.path.join(trainer_abs,
"tdm_single_trainer.py")
trainers["TDMClusterTrainer"] = os.path.join(trainer_abs,
"tdm_cluster_trainer.py")
trainers["OnlineLearningTrainer"] = os.path.join(
trainer_abs, "online_learning_trainer.py")
# Definition of procedure execution process # Definition of procedure execution process
trainers["CtrCodingTrainer"] = os.path.join(trainer_abs, trainers["CtrCodingTrainer"] = os.path.join(trainer_abs,
"ctr_coding_trainer.py") "ctr_coding_trainer.py")
......
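A brief sketch of how the registry above is typically consumed: the trainer name chosen at runtime is looked up in `trainers` and the class is loaded lazily from the registered file path (the selection below is illustrative).

trainer_name = "SingleTrainer"             # hypothetical selection
trainer_path = trainers[trainer_name]      # ".../single_trainer.py"
# trainer_class = envs.lazy_instance_by_fliename(trainer_path, trainer_name)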
...@@ -23,34 +23,58 @@ class Metric(object): ...@@ -23,34 +23,58 @@ class Metric(object):
__metaclass__ = abc.ABCMeta __metaclass__ = abc.ABCMeta
def __init__(self, config): def __init__(self, config):
""" """ """R
"""
pass pass
def clear(self, scope=None, **kwargs): def clear(self, scope=None):
""" """R
clear current value
Args:
scope: value container
params: extend varilable for clear
""" """
if scope is None: if scope is None:
scope = fluid.global_scope() scope = fluid.global_scope()
place = fluid.CPUPlace() place = fluid.CPUPlace()
for (varname, dtype) in self._need_clear_list: for key in self._global_metric_state_vars:
if scope.find_var(varname) is None: varname, dtype = self._global_metric_state_vars[key]
var = scope.find_var(varname)
if not var:
continue continue
var = scope.var(varname).get_tensor() var = var.get_tensor()
data_array = np.zeros(var._get_dims()).astype(dtype) data_array = np.zeros(var._get_dims()).astype(dtype)
var.set(data_array, place) var.set(data_array, place)
def calculate(self, scope, params): def _get_global_metric_state(self, fleet, scope, metric_name, mode="sum"):
"""R
""" """
calculate result var = scope.find_var(metric_name)
Args: if not var:
scope: value container return None
params: extend varilable for clear input = np.array(var.get_tensor())
if fleet is None:
return input
fleet._role_maker._barrier_worker()
old_shape = np.array(input.shape)
input = input.reshape(-1)
output = np.copy(input) * 0
fleet._role_maker._all_reduce(input, output, mode=mode)
output = output.reshape(old_shape)
return output
def calc_global_metrics(self, fleet, scope=None):
"""R
""" """
if scope is None:
scope = fluid.global_scope()
global_metrics = dict()
for key in self._global_metric_state_vars:
varname, dtype = self._global_metric_state_vars[key]
global_metrics[key] = self._get_global_metric_state(fleet, scope,
varname)
return self._calculate(global_metrics)
def _calculate(self, global_metrics):
pass pass
@abc.abstractmethod @abc.abstractmethod
......
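A minimal, assumed-simplified sketch of the contract the refactored base class introduces: a subclass registers its persistable state tensors in _global_metric_state_vars, and calc_global_metrics() pulls each tensor from the scope, all-reduces it across workers when a fleet is given, and hands the resulting dict to _calculate().

from paddlerec.core.metric import Metric

class ToyCounter(Metric):                  # hypothetical subclass, for illustration only
    def __init__(self, cnt_var):
        # cnt_var: a persistable fluid Variable accumulated inside the network
        self._global_metric_state_vars = {"cnt": (cnt_var.name, "float32")}

    def _calculate(self, global_metrics):
        return "cnt=%s" % str(global_metrics["cnt"][0])

    def get_result(self):
        return {}

# runner side: toy.calc_global_metrics(fleet, scope) reads "cnt" from the scope,
# all-reduces it across workers when fleet is not None, then returns _calculate()'s string.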
...@@ -12,6 +12,9 @@ ...@@ -12,6 +12,9 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from precision import Precision from .recall_k import RecallK
from .pairwise_pn import PosNegRatio
from .precision_recall import PrecisionRecall
from .auc import AUC
__all__ = ['Precision'] __all__ = ['RecallK', 'PosNegRatio', 'AUC', 'PrecisionRecall']
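A hedged usage sketch of the re-exported classes (assuming the package path paddlerec.core.metrics; variable names are illustrative):

from paddlerec.core.metrics import AUC, RecallK

# assuming `prob` (N x 2 softmax output) and `label` are Variables built inside a model:
# auc = AUC(input=prob, label=label)             # exposes auc.metrics["AUC"] and ["BATCH_AUC"]
# recall = RecallK(input=prob, label=label, k=1)  # exposes recall.metrics["Acc(Recall@1)"]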
...@@ -18,102 +18,60 @@ import numpy as np ...@@ -18,102 +18,60 @@ import numpy as np
import paddle.fluid as fluid import paddle.fluid as fluid
from paddlerec.core.metric import Metric from paddlerec.core.metric import Metric
from paddle.fluid.layers.tensor import Variable
class AUCMetric(Metric): class AUC(Metric):
""" """
Metric For Fluid Model Metric For Fluid Model
""" """
def __init__(self, config, fleet): def __init__(self,
input,
label,
curve='ROC',
num_thresholds=2**12 - 1,
topk=1,
slide_steps=1):
""" """ """ """
self.config = config if not isinstance(input, Variable):
self.fleet = fleet raise ValueError("input must be Variable, but received %s" %
type(input))
def clear(self, scope, params): if not isinstance(label, Variable):
""" raise ValueError("label must be Variable, but received %s" %
Clear current metric value, usually set to zero type(label))
Args:
scope : paddle runtime var container auc_out, batch_auc_out, [
params(dict) : batch_stat_pos, batch_stat_neg, stat_pos, stat_neg
label : a group name for metric ] = fluid.layers.auc(input,
metric_dict : current metric_items in group label,
Return: curve=curve,
None num_thresholds=num_thresholds,
""" topk=topk,
self._label = params['label'] slide_steps=slide_steps)
self._metric_dict = params['metric_dict']
self._result = {} prob = fluid.layers.slice(input, axes=[1], starts=[1], ends=[2])
place = fluid.CPUPlace() label_cast = fluid.layers.cast(label, dtype="float32")
for metric_name in self._metric_dict: label_cast.stop_gradient = True
metric_config = self._metric_dict[metric_name] sqrerr, abserr, prob, q, pos, total = \
if scope.find_var(metric_config['var'].name) is None: fluid.contrib.layers.ctr_metric_bundle(prob, label_cast)
continue
metric_var = scope.var(metric_config['var'].name).get_tensor() self._global_metric_state_vars = dict()
data_type = 'float32' self._global_metric_state_vars['stat_pos'] = (stat_pos.name, "float32")
if 'data_type' in metric_config: self._global_metric_state_vars['stat_neg'] = (stat_neg.name, "float32")
data_type = metric_config['data_type'] self._global_metric_state_vars['total_ins_num'] = (total.name,
data_array = np.zeros(metric_var._get_dims()).astype(data_type) "float32")
metric_var.set(data_array, place) self._global_metric_state_vars['pos_ins_num'] = (pos.name, "float32")
self._global_metric_state_vars['q'] = (q.name, "float32")
def get_metric(self, scope, metric_name): self._global_metric_state_vars['prob'] = (prob.name, "float32")
""" self._global_metric_state_vars['abserr'] = (abserr.name, "float32")
reduce metric named metric_name from all worker self._global_metric_state_vars['sqrerr'] = (sqrerr.name, "float32")
Return:
metric reduce result self.metrics = dict()
""" self.metrics["AUC"] = auc_out
metric = np.array(scope.find_var(metric_name).get_tensor()) self.metrics["BATCH_AUC"] = batch_auc_out
old_metric_shape = np.array(metric.shape)
metric = metric.reshape(-1) def _calculate_bucket_error(self, global_pos, global_neg):
global_metric = np.copy(metric) * 0
self.fleet._role_maker.all_reduce_worker(metric, global_metric)
global_metric = global_metric.reshape(old_metric_shape)
return global_metric[0]
def get_global_metrics(self, scope, metric_dict):
"""
reduce all metric in metric_dict from all worker
Return:
dict : {matric_name : metric_result}
"""
self.fleet._role_maker._barrier_worker()
result = {}
for metric_name in metric_dict:
metric_item = metric_dict[metric_name]
if scope.find_var(metric_item['var'].name) is None:
result[metric_name] = None
continue
result[metric_name] = self.get_metric(scope,
metric_item['var'].name)
return result
def calculate_auc(self, global_pos, global_neg):
"""R
"""
num_bucket = len(global_pos)
area = 0.0
pos = 0.0
neg = 0.0
new_pos = 0.0
new_neg = 0.0
total_ins_num = 0
for i in range(num_bucket):
index = num_bucket - 1 - i
new_pos = pos + global_pos[index]
total_ins_num += global_pos[index]
new_neg = neg + global_neg[index]
total_ins_num += global_neg[index]
area += (new_neg - neg) * (pos + new_pos) / 2
pos = new_pos
neg = new_neg
auc_value = None
if pos * neg == 0 or total_ins_num == 0:
auc_value = 0.5
else:
auc_value = area / (pos * neg)
return auc_value
def calculate_bucket_error(self, global_pos, global_neg):
"""R """R
""" """
num_bucket = len(global_pos) num_bucket = len(global_pos)
...@@ -161,56 +119,69 @@ class AUCMetric(Metric): ...@@ -161,56 +119,69 @@ class AUCMetric(Metric):
bucket_error = error_sum / error_count if error_count > 0 else 0.0 bucket_error = error_sum / error_count if error_count > 0 else 0.0
return bucket_error return bucket_error
def calculate(self, scope, params): def _calculate_auc(self, global_pos, global_neg):
""" """ """R
self._label = params['label'] """
self._metric_dict = params['metric_dict'] num_bucket = len(global_pos)
self.fleet._role_maker._barrier_worker() area = 0.0
result = self.get_global_metrics(scope, self._metric_dict) pos = 0.0
neg = 0.0
new_pos = 0.0
new_neg = 0.0
total_ins_num = 0
for i in range(num_bucket):
index = num_bucket - 1 - i
new_pos = pos + global_pos[index]
total_ins_num += global_pos[index]
new_neg = neg + global_neg[index]
total_ins_num += global_neg[index]
area += (new_neg - neg) * (pos + new_pos) / 2
pos = new_pos
neg = new_neg
auc_value = None
if pos * neg == 0 or total_ins_num == 0:
auc_value = 0.5
else:
auc_value = area / (pos * neg)
return auc_value
def _calculate(self, global_metrics):
result = dict()
for key in self._global_metric_state_vars:
if key not in global_metrics:
raise ValueError("%s not existed" % key)
result[key] = global_metrics[key][0]
if result['total_ins_num'] == 0: if result['total_ins_num'] == 0:
self._result = result result['auc'] = 0
self._result['auc'] = 0 result['bucket_error'] = 0
self._result['bucket_error'] = 0 result['actual_ctr'] = 0
self._result['actual_ctr'] = 0 result['predict_ctr'] = 0
self._result['predict_ctr'] = 0 result['mae'] = 0
self._result['mae'] = 0 result['rmse'] = 0
self._result['rmse'] = 0 result['copc'] = 0
self._result['copc'] = 0 result['mean_q'] = 0
self._result['mean_q'] = 0 else:
return self._result result['auc'] = self._calculate_auc(result['stat_pos'],
if 'stat_pos' in result and 'stat_neg' in result: result['stat_neg'])
result['auc'] = self.calculate_auc(result['stat_pos'], result['bucket_error'] = self._calculate_bucket_error(
result['stat_neg']) result['stat_pos'], result['stat_neg'])
result['bucket_error'] = self.calculate_auc(result['stat_pos'],
result['stat_neg'])
if 'pos_ins_num' in result:
result['actual_ctr'] = result['pos_ins_num'] / result[ result['actual_ctr'] = result['pos_ins_num'] / result[
'total_ins_num'] 'total_ins_num']
if 'abserr' in result:
result['mae'] = result['abserr'] / result['total_ins_num'] result['mae'] = result['abserr'] / result['total_ins_num']
if 'sqrerr' in result:
result['rmse'] = math.sqrt(result['sqrerr'] / result['rmse'] = math.sqrt(result['sqrerr'] /
result['total_ins_num']) result['total_ins_num'])
if 'prob' in result:
result['predict_ctr'] = result['prob'] / result['total_ins_num'] result['predict_ctr'] = result['prob'] / result['total_ins_num']
if abs(result['predict_ctr']) > 1e-6: if abs(result['predict_ctr']) > 1e-6:
result['copc'] = result['actual_ctr'] / result['predict_ctr'] result['copc'] = result['actual_ctr'] / result['predict_ctr']
if 'q' in result:
result['mean_q'] = result['q'] / result['total_ins_num'] result['mean_q'] = result['q'] / result['total_ins_num']
self._result = result
return result
def get_result(self):
""" """
return self._result
def __str__(self): result_str = "AUC=%.6f BUCKET_ERROR=%.6f MAE=%.6f RMSE=%.6f " \
""" """
result = self.get_result()
result_str = "%s AUC=%.6f BUCKET_ERROR=%.6f MAE=%.6f RMSE=%.6f " \
"Actural_CTR=%.6f Predicted_CTR=%.6f COPC=%.6f MEAN Q_VALUE=%.6f Ins number=%s" % \ "Actural_CTR=%.6f Predicted_CTR=%.6f COPC=%.6f MEAN Q_VALUE=%.6f Ins number=%s" % \
(self._label, result['auc'], result['bucket_error'], result['mae'], result['rmse'], (result['auc'], result['bucket_error'], result['mae'], result['rmse'],
result['actual_ctr'], result['actual_ctr'],
result['predict_ctr'], result['copc'], result['mean_q'], result['total_ins_num']) result['predict_ctr'], result['copc'], result['mean_q'], result['total_ins_num'])
return result_str return result_str
def get_result(self):
return self.metrics
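A worked example (hypothetical bucket counts) of the _calculate_auc routine above, with two score buckets where bucket 1 holds the higher scores:

# 2 positives and 1 negative land in bucket 1; 0 positives and 3 negatives land in bucket 0
global_pos = [0.0, 2.0]
global_neg = [3.0, 1.0]
# walking buckets from high to low:
#   bucket 1: area += (1 - 0) * (0 + 2) / 2 = 1.0
#   bucket 0: area += (4 - 1) * (2 + 2) / 2 = 6.0
# auc = area / (pos * neg) = 7.0 / (2 * 4) = 0.875
# sanity check: 6 of the 8 pos/neg pairs are ordered correctly and 2 are tied -> (6 + 0.5 * 2) / 8 = 0.875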
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import paddle.fluid as fluid
from paddlerec.core.metric import Metric
from paddle.fluid.initializer import Constant
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.layers.tensor import Variable
class PosNegRatio(Metric):
"""
Metric For Fluid Model
"""
def __init__(self, pos_score, neg_score):
""" """
kwargs = locals()
del kwargs['self']
helper = LayerHelper("PaddleRec_PosNegRatio", **kwargs)
if "pos_score" not in kwargs or "neg_score" not in kwargs:
raise ValueError(
"PosNegRatio expect pos_score and neg_score as inputs.")
pos_score = kwargs.get('pos_score')
neg_score = kwargs.get('neg_score')
if not isinstance(pos_score, Variable):
raise ValueError("pos_score must be Variable, but received %s" %
type(pos_score))
if not isinstance(neg_score, Variable):
raise ValueError("neg_score must be Variable, but received %s" %
type(neg_score))
wrong = fluid.layers.cast(
fluid.layers.less_equal(pos_score, neg_score), dtype='float32')
wrong_cnt = fluid.layers.reduce_sum(wrong)
right = fluid.layers.cast(
fluid.layers.less_than(neg_score, pos_score), dtype='float32')
right_cnt = fluid.layers.reduce_sum(right)
global_right_cnt, _ = helper.create_or_get_global_variable(
name="right_cnt", persistable=True, dtype='float32', shape=[1])
global_wrong_cnt, _ = helper.create_or_get_global_variable(
name="wrong_cnt", persistable=True, dtype='float32', shape=[1])
for var in [global_right_cnt, global_wrong_cnt]:
helper.set_variable_initializer(
var, Constant(
value=0.0, force_cpu=True))
helper.append_op(
type="elementwise_add",
inputs={"X": [global_right_cnt],
"Y": [right_cnt]},
outputs={"Out": [global_right_cnt]})
helper.append_op(
type="elementwise_add",
inputs={"X": [global_wrong_cnt],
"Y": [wrong_cnt]},
outputs={"Out": [global_wrong_cnt]})
self.pn = (global_right_cnt + 1.0) / (global_wrong_cnt + 1.0)
self._global_metric_state_vars = dict()
self._global_metric_state_vars['right_cnt'] = (global_right_cnt.name,
"float32")
self._global_metric_state_vars['wrong_cnt'] = (global_wrong_cnt.name,
"float32")
self.metrics = dict()
self.metrics['WrongCnt'] = global_wrong_cnt
self.metrics['RightCnt'] = global_right_cnt
self.metrics['PN'] = self.pn
def _calculate(self, global_metrics):
for key in self._global_metric_state_vars:
if key not in global_metrics:
raise ValueError("%s not existed" % key)
pn = (global_metrics['right_cnt'][0] + 1.0) / (
global_metrics['wrong_cnt'][0] + 1.0)
return "RightCnt=%s WrongCnt=%s PN=%s" % (
str(global_metrics['right_cnt'][0]),
str(global_metrics['wrong_cnt'][0]), str(pn))
def get_result(self):
return self.metrics
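A short, hedged usage note for PosNegRatio: pos_score and neg_score are the model's scores for the positive and negative item of the same pair (both Variables), and the ratio is smoothed by +1 on both counts. The numbers below are illustrative:

right_cnt, wrong_cnt = 90.0, 10.0            # hypothetical accumulated counters
pn = (right_cnt + 1.0) / (wrong_cnt + 1.0)   # PN ≈ 8.27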
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import paddle.fluid as fluid
from paddlerec.core.metric import Metric
from paddle.fluid.initializer import Constant
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.layers.tensor import Variable
class PrecisionRecall(Metric):
"""
Metric For Fluid Model
"""
def __init__(self, input, label, class_num):
"""R
"""
kwargs = locals()
del kwargs['self']
self.num_cls = class_num
if not isinstance(input, Variable):
raise ValueError("input must be Variable, but received %s" %
type(input))
if not isinstance(label, Variable):
raise ValueError("label must be Variable, but received %s" %
type(label))
helper = LayerHelper("PaddleRec_PrecisionRecall", **kwargs)
label = fluid.layers.cast(label, dtype="int32")
label.stop_gradient = True
max_probs, indices = fluid.layers.nn.topk(input, k=1)
indices = fluid.layers.cast(indices, dtype="int32")
indices.stop_gradient = True
states_info, _ = helper.create_or_get_global_variable(
name="states_info",
persistable=True,
dtype='float32',
shape=[self.num_cls, 4])
states_info.stop_gradient = True
helper.set_variable_initializer(
states_info, Constant(
value=0.0, force_cpu=True))
batch_metrics, _ = helper.create_or_get_global_variable(
name="batch_metrics",
persistable=False,
dtype='float32',
shape=[6])
accum_metrics, _ = helper.create_or_get_global_variable(
name="global_metrics",
persistable=False,
dtype='float32',
shape=[6])
batch_states = fluid.layers.fill_constant(
shape=[self.num_cls, 4], value=0.0, dtype="float32")
batch_states.stop_gradient = True
helper.append_op(
type="precision_recall",
attrs={'class_number': self.num_cls},
inputs={
'MaxProbs': [max_probs],
'Indices': [indices],
'Labels': [label],
'StatesInfo': [states_info]
},
outputs={
'BatchMetrics': [batch_metrics],
'AccumMetrics': [accum_metrics],
'AccumStatesInfo': [batch_states]
})
helper.append_op(
type="assign",
inputs={'X': [batch_states]},
outputs={'Out': [states_info]})
batch_states.stop_gradient = True
states_info.stop_gradient = True
self._global_metric_state_vars = dict()
self._global_metric_state_vars['states_info'] = (states_info.name,
"float32")
self.metrics = dict()
self.metrics["precision_recall_f1"] = accum_metrics
self.metrics["[TP FP TN FN]"] = states_info
def _calculate(self, global_metrics):
for key in self._global_metric_state_vars:
if key not in global_metrics:
raise ValueError("%s not existed" % key)
def calc_precision(tp_count, fp_count):
if tp_count > 0.0 or fp_count > 0.0:
return tp_count / (tp_count + fp_count)
return 1.0
def calc_recall(tp_count, fn_count):
if tp_count > 0.0 or fn_count > 0.0:
return tp_count / (tp_count + fn_count)
return 1.0
def calc_f1_score(precision, recall):
if precision > 0.0 or recall > 0.0:
return 2 * precision * recall / (precision + recall)
return 0.0
states = global_metrics["states_info"]
total_tp_count = 0.0
total_fp_count = 0.0
total_fn_count = 0.0
macro_avg_precision = 0.0
macro_avg_recall = 0.0
for i in range(self.num_cls):
total_tp_count += states[i][0]
total_fp_count += states[i][1]
total_fn_count += states[i][3]
macro_avg_precision += calc_precision(states[i][0], states[i][1])
macro_avg_recall += calc_recall(states[i][0], states[i][3])
metrics = []
macro_avg_precision /= self.num_cls
macro_avg_recall /= self.num_cls
metrics.append(macro_avg_precision)
metrics.append(macro_avg_recall)
metrics.append(calc_f1_score(macro_avg_precision, macro_avg_recall))
micro_avg_precision = calc_precision(total_tp_count, total_fp_count)
metrics.append(micro_avg_precision)
micro_avg_recall = calc_recall(total_tp_count, total_fn_count)
metrics.append(micro_avg_recall)
metrics.append(calc_f1_score(micro_avg_precision, micro_avg_recall))
return "total metrics: [TP, FP, TN, FN]=%s; precision_recall_f1=%s" % (
str(states), str(np.array(metrics).astype('float32')))
def get_result(self):
return self.metrics
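A worked example (hypothetical per-class states) of the _calculate routine above with class_num = 2, where each row of states_info is [TP, FP, TN, FN]:

# class 0: [4, 1, _, 2]  -> precision = 4/5 = 0.80, recall = 4/6 ≈ 0.667
# class 1: [3, 2, _, 1]  -> precision = 3/5 = 0.60, recall = 3/4 = 0.75
# macro: precision = 0.70, recall ≈ 0.708, F1 ≈ 0.704
# micro: TP = 7, FP = 3, FN = 3 -> precision = recall = F1 = 0.70  (TN is not used)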
...@@ -18,92 +18,86 @@ import numpy as np ...@@ -18,92 +18,86 @@ import numpy as np
import paddle.fluid as fluid import paddle.fluid as fluid
from paddlerec.core.metric import Metric from paddlerec.core.metric import Metric
from paddle.fluid.layers import nn, accuracy from paddle.fluid.layers import accuracy
from paddle.fluid.initializer import Constant from paddle.fluid.initializer import Constant
from paddle.fluid.layer_helper import LayerHelper from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.layers.tensor import Variable
class Precision(Metric): class RecallK(Metric):
""" """
Metric For Fluid Model Metric For Fluid Model
""" """
def __init__(self, **kwargs): def __init__(self, input, label, k=20):
""" """ """ """
helper = LayerHelper("PaddleRec_Precision", **kwargs) kwargs = locals()
self.batch_accuracy = accuracy( del kwargs['self']
kwargs.get("input"), kwargs.get("label"), kwargs.get("k")) self.k = k
local_ins_num, _ = helper.create_or_get_global_variable(
name="local_ins_num", persistable=True, dtype='float32', if not isinstance(input, Variable):
shape=[1]) raise ValueError("input must be Variable, but received %s" %
local_pos_num, _ = helper.create_or_get_global_variable( type(input))
name="local_pos_num", persistable=True, dtype='float32', if not isinstance(label, Variable):
shape=[1]) raise ValueError("label must be Variable, but received %s" %
type(label))
batch_pos_num, _ = helper.create_or_get_global_variable(
name="batch_pos_num", helper = LayerHelper("PaddleRec_RecallK", **kwargs)
persistable=False, batch_accuracy = accuracy(input, label, self.k)
dtype='float32', global_ins_cnt, _ = helper.create_or_get_global_variable(
shape=[1]) name="ins_cnt", persistable=True, dtype='float32', shape=[1])
batch_ins_num, _ = helper.create_or_get_global_variable( global_pos_cnt, _ = helper.create_or_get_global_variable(
name="batch_ins_num", name="pos_cnt", persistable=True, dtype='float32', shape=[1])
persistable=False,
dtype='float32', for var in [global_ins_cnt, global_pos_cnt]:
shape=[1])
tmp_ones = helper.create_global_variable(
name="batch_size_like_ones",
persistable=False,
dtype='float32',
shape=[-1])
for var in [
batch_pos_num, batch_ins_num, local_pos_num, local_ins_num
]:
print(var, type(var))
helper.set_variable_initializer( helper.set_variable_initializer(
var, Constant( var, Constant(
value=0.0, force_cpu=True)) value=0.0, force_cpu=True))
helper.append_op( tmp_ones = fluid.layers.fill_constant(
type='fill_constant_batch_size_like', shape=fluid.layers.shape(label), dtype="float32", value=1.0)
inputs={"Input": kwargs.get("label")}, batch_ins = fluid.layers.reduce_sum(tmp_ones)
outputs={'Out': [tmp_ones]}, batch_pos = batch_ins * batch_accuracy
attrs={
'shape': [-1, 1],
'dtype': tmp_ones.dtype,
'value': float(1.0),
})
helper.append_op(
type="reduce_sum",
inputs={"X": [tmp_ones]},
outputs={"Out": [batch_ins_num]})
helper.append_op(
type="elementwise_mul",
inputs={"X": [batch_ins_num],
"Y": [self.batch_accuracy]},
outputs={"Out": [batch_pos_num]})
helper.append_op( helper.append_op(
type="elementwise_add", type="elementwise_add",
inputs={"X": [local_pos_num], inputs={"X": [global_ins_cnt],
"Y": [batch_pos_num]}, "Y": [batch_ins]},
outputs={"Out": [local_pos_num]}) outputs={"Out": [global_ins_cnt]})
helper.append_op( helper.append_op(
type="elementwise_add", type="elementwise_add",
inputs={"X": [local_ins_num], inputs={"X": [global_pos_cnt],
"Y": [batch_ins_num]}, "Y": [batch_pos]},
outputs={"Out": [local_ins_num]}) outputs={"Out": [global_pos_cnt]})
self.acc = global_pos_cnt / global_ins_cnt
self.accuracy = local_pos_num / local_ins_num self._global_metric_state_vars = dict()
self._global_metric_state_vars['ins_cnt'] = (global_ins_cnt.name,
"float32")
self._global_metric_state_vars['pos_cnt'] = (global_pos_cnt.name,
"float32")
self._need_clear_list = [("local_ins_num", "float32"), metric_name = "Acc(Recall@%d)" % self.k
("local_pos_num", "float32")]
self.metrics = dict() self.metrics = dict()
metric_varname = "P@%d" % kwargs.get("k") self.metrics["InsCnt"] = global_ins_cnt
self.metrics[metric_varname] = self.accuracy self.metrics["RecallCnt"] = global_pos_cnt
self.metrics[metric_name] = self.acc
# self.metrics["batch_metrics"] = batch_metrics
def _calculate(self, global_metrics):
for key in self._global_metric_state_vars:
if key not in global_metrics:
raise ValueError("%s not existed" % key)
ins_cnt = global_metrics['ins_cnt'][0]
pos_cnt = global_metrics['pos_cnt'][0]
if ins_cnt == 0:
acc = 0
else:
acc = float(pos_cnt) / ins_cnt
return "InsCnt=%s RecallCnt=%s Acc(Recall@%d)=%s" % (
str(ins_cnt), str(pos_cnt), self.k, str(acc))
def get_result(self): def get_result(self):
return self.metrics return self.metrics
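A hedged usage sketch of RecallK (variable names are illustrative): input is the per-class probability output and label the ground-truth index; the accumulated counters yield the epoch-level recall.

# recall = RecallK(input=softmax_out, label=label_var, k=20)
ins_cnt, pos_cnt = 1000.0, 640.0    # hypothetical accumulated counters
acc = pos_cnt / ins_cnt              # Acc(Recall@20) = 0.64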
...@@ -107,6 +107,7 @@ class Trainer(object): ...@@ -107,6 +107,7 @@ class Trainer(object):
self.device = Device.GPU self.device = Device.GPU
gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
self._place = fluid.CUDAPlace(gpu_id) self._place = fluid.CUDAPlace(gpu_id)
print("PaddleRec run on device GPU: {}".format(gpu_id))
self._exe = fluid.Executor(self._place) self._exe = fluid.Executor(self._place)
elif device == "CPU": elif device == "CPU":
self.device = Device.CPU self.device = Device.CPU
...@@ -146,6 +147,7 @@ class Trainer(object): ...@@ -146,6 +147,7 @@ class Trainer(object):
elif engine.upper() == "CLUSTER": elif engine.upper() == "CLUSTER":
self.engine = EngineMode.CLUSTER self.engine = EngineMode.CLUSTER
self.is_fleet = True self.is_fleet = True
self.which_cluster_type()
else: else:
raise ValueError("Not Support Engine {}".format(engine)) raise ValueError("Not Support Engine {}".format(engine))
self._context["is_fleet"] = self.is_fleet self._context["is_fleet"] = self.is_fleet
...@@ -165,6 +167,14 @@ class Trainer(object): ...@@ -165,6 +167,14 @@ class Trainer(object):
self._context["is_pslib"] = (fleet_mode.upper() == "PSLIB") self._context["is_pslib"] = (fleet_mode.upper() == "PSLIB")
self._context["fleet_mode"] = fleet_mode self._context["fleet_mode"] = fleet_mode
def which_cluster_type(self):
cluster_type = os.getenv("PADDLEREC_CLUSTER_TYPE", "MPI")
print("PADDLEREC_CLUSTER_TYPE: {}".format(cluster_type))
if cluster_type and cluster_type.upper() == "K8S":
self._context["cluster_type"] = "K8S"
else:
self._context["cluster_type"] = "MPI"
def which_executor_mode(self): def which_executor_mode(self):
executor_mode = envs.get_runtime_environ("train.trainer.executor_mode") executor_mode = envs.get_runtime_environ("train.trainer.executor_mode")
if executor_mode.upper() not in ["TRAIN", "INFER"]: if executor_mode.upper() not in ["TRAIN", "INFER"]:
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
FineTuning Trainer, currently supporting single-machine training only
"""
from __future__ import print_function
import os
from paddlerec.core.utils import envs
from paddlerec.core.trainer import Trainer, EngineMode, FleetMode
class FineTuningTrainer(Trainer):
"""
Trainer for various situations
"""
def __init__(self, config=None):
Trainer.__init__(self, config)
self.processor_register()
self.abs_dir = os.path.dirname(os.path.abspath(__file__))
self.runner_env_name = "runner." + self._context["runner_name"]
def processor_register(self):
print("processor_register begin")
self.regist_context_processor('uninit', self.instance)
self.regist_context_processor('network_pass', self.network)
self.regist_context_processor('startup_pass', self.startup)
self.regist_context_processor('train_pass', self.runner)
self.regist_context_processor('terminal_pass', self.terminal)
def instance(self, context):
instance_class_path = envs.get_global_env(
self.runner_env_name + ".instance_class_path", default_value=None)
if instance_class_path:
instance_class = envs.lazy_instance_by_fliename(
instance_class_path, "Instance")(context)
else:
if self.engine == EngineMode.SINGLE:
instance_class_name = "SingleInstance"
else:
raise ValueError(
"FineTuningTrainer can only support SingleTraining.")
instance_path = os.path.join(self.abs_dir, "framework",
"instance.py")
instance_class = envs.lazy_instance_by_fliename(
instance_path, instance_class_name)(context)
instance_class.instance(context)
def network(self, context):
network_class_path = envs.get_global_env(
self.runner_env_name + ".network_class_path", default_value=None)
if network_class_path:
network_class = envs.lazy_instance_by_fliename(network_class_path,
"Network")(context)
else:
if self.engine == EngineMode.SINGLE:
network_class_name = "FineTuningNetwork"
else:
raise ValueError(
"FineTuningTrainer can only support SingleTraining.")
network_path = os.path.join(self.abs_dir, "framework",
"network.py")
network_class = envs.lazy_instance_by_fliename(
network_path, network_class_name)(context)
network_class.build_network(context)
def startup(self, context):
startup_class_path = envs.get_global_env(
self.runner_env_name + ".startup_class_path", default_value=None)
if startup_class_path:
startup_class = envs.lazy_instance_by_fliename(startup_class_path,
"Startup")(context)
else:
if self.engine == EngineMode.SINGLE and not context["is_infer"]:
startup_class_name = "FineTuningStartup"
else:
raise ValueError(
"FineTuningTrainer can only support SingleTraining.")
startup_path = os.path.join(self.abs_dir, "framework",
"startup.py")
startup_class = envs.lazy_instance_by_fliename(
startup_path, startup_class_name)(context)
startup_class.startup(context)
def runner(self, context):
runner_class_path = envs.get_global_env(
self.runner_env_name + ".runner_class_path", default_value=None)
if runner_class_path:
runner_class = envs.lazy_instance_by_fliename(runner_class_path,
"Runner")(context)
else:
if self.engine == EngineMode.SINGLE and not context["is_infer"]:
runner_class_name = "SingleRunner"
else:
raise ValueError(
"FineTuningTrainer can only support SingleTraining.")
runner_path = os.path.join(self.abs_dir, "framework", "runner.py")
runner_class = envs.lazy_instance_by_fliename(
runner_path, runner_class_name)(context)
runner_class.run(context)
def terminal(self, context):
terminal_class_path = envs.get_global_env(
self.runner_env_name + ".terminal_class_path", default_value=None)
if terminal_class_path:
terminal_class = envs.lazy_instance_by_fliename(
terminal_class_path, "Terminal")(context)
terminal_class.terminal(context)
else:
terminal_class_name = "TerminalBase"
if self.engine != EngineMode.SINGLE and self.fleet_mode != FleetMode.COLLECTIVE:
terminal_class_name = "PSTerminal"
terminal_path = os.path.join(self.abs_dir, "framework",
"terminal.py")
terminal_class = envs.lazy_instance_by_fliename(
terminal_path, terminal_class_name)(context)
terminal_class.terminal(context)
context['is_exit'] = True
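A hedged sketch of the runner entry (normally written in config.yaml) that drives the FineTuningTrainer above; only finetuning_aspect_varnames and the *_class_path keys are taken from the code, the rest is illustrative.

finetune_runner = {
    "name": "fine_tune_runner",
    "finetuning_aspect_varnames": ["emb"],   # exactly one varname is accepted for now
    # optional overrides resolved by the trainer:
    # "instance_class_path": ..., "network_class_path": ...,
    # "startup_class_path": ..., "runner_class_path": ..., "terminal_class_path": ...
}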
...@@ -123,10 +123,21 @@ class QueueDataset(DatasetBase): ...@@ -123,10 +123,21 @@ class QueueDataset(DatasetBase):
os.path.join(train_data_path, x) os.path.join(train_data_path, x)
for x in os.listdir(train_data_path) for x in os.listdir(train_data_path)
] ]
file_list.sort()
need_split_files = False
if context["engine"] == EngineMode.LOCAL_CLUSTER: if context["engine"] == EngineMode.LOCAL_CLUSTER:
# for local cluster: split files for multi process
need_split_files = True
elif context["engine"] == EngineMode.CLUSTER and context[
"cluster_type"] == "K8S":
# for k8s mount afs, split files for every node
need_split_files = True
if need_split_files:
file_list = split_files(file_list, context["fleet"].worker_index(), file_list = split_files(file_list, context["fleet"].worker_index(),
context["fleet"].worker_num()) context["fleet"].worker_num())
print("File_list: {}".format(file_list)) print("File_list: {}".format(file_list))
dataset.set_filelist(file_list) dataset.set_filelist(file_list)
for model_dict in context["phases"]: for model_dict in context["phases"]:
if model_dict["dataset_name"] == dataset_name: if model_dict["dataset_name"] == dataset_name:
......
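An assumed-simplified sketch of the file splitting enabled above for LOCAL_CLUSTER and K8S cluster runs, where each worker keeps only its own shard of the sorted file list (the real split_files utility may shard differently):

def split_files_demo(files, worker_index, worker_num):
    files = sorted(files)
    return [f for i, f in enumerate(files) if i % worker_num == worker_index]

# split_files_demo(["part-0", "part-1", "part-2", "part-3"], 1, 2) -> ["part-1", "part-3"]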
...@@ -23,7 +23,7 @@ from paddlerec.core.trainers.framework.dataset import DataLoader, QueueDataset ...@@ -23,7 +23,7 @@ from paddlerec.core.trainers.framework.dataset import DataLoader, QueueDataset
__all__ = [ __all__ = [
"NetworkBase", "SingleNetwork", "PSNetwork", "PslibNetwork", "NetworkBase", "SingleNetwork", "PSNetwork", "PslibNetwork",
"CollectiveNetwork" "CollectiveNetwork", "FineTuningNetwork"
] ]
...@@ -99,7 +99,90 @@ class SingleNetwork(NetworkBase): ...@@ -99,7 +99,90 @@ class SingleNetwork(NetworkBase):
context["dataset"] = {} context["dataset"] = {}
for dataset in context["env"]["dataset"]: for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] + ".type") type = envs.get_global_env("dataset." + dataset["name"] + ".type")
if type != "DataLoader":
if type == "QueueDataset":
dataset_class = QueueDataset(context)
context["dataset"][dataset[
"name"]] = dataset_class.create_dataset(dataset["name"],
context)
context["status"] = "startup_pass"
class FineTuningNetwork(NetworkBase):
"""R
"""
def __init__(self, context):
print("Running FineTuningNetwork.")
def build_network(self, context):
context["model"] = {}
for model_dict in context["phases"]:
context["model"][model_dict["name"]] = {}
train_program = fluid.Program()
startup_program = fluid.Program()
scope = fluid.Scope()
dataset_name = model_dict["dataset_name"]
with fluid.program_guard(train_program, startup_program):
with fluid.unique_name.guard():
with fluid.scope_guard(scope):
model_path = envs.os_path_adapter(
envs.workspace_adapter(model_dict["model"]))
model = envs.lazy_instance_by_fliename(
model_path, "Model")(context["env"])
model._data_var = model.input_data(
dataset_name=model_dict["dataset_name"])
if envs.get_global_env("dataset." + dataset_name +
".type") == "DataLoader":
model._init_dataloader(
is_infer=context["is_infer"])
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name,
model._data_loader)
model.net(model._data_var, context["is_infer"])
finetuning_varnames = envs.get_global_env(
"runner." + context["runner_name"] +
".finetuning_aspect_varnames",
default_value=[])
if len(finetuning_varnames) == 0:
raise ValueError(
"nothing need to be fine tuning, you may use other traning mode"
)
if len(finetuning_varnames) != 1:
raise ValueError(
"fine tuning mode can only accept one varname now"
)
varname = finetuning_varnames[0]
finetuning_vars = train_program.global_block().vars[
varname]
finetuning_vars.stop_gradient = True
optimizer = model.optimizer()
optimizer.minimize(model._cost)
context["model"][model_dict["name"]][
"main_program"] = train_program
context["model"][model_dict["name"]][
"startup_program"] = startup_program
context["model"][model_dict["name"]]["scope"] = scope
context["model"][model_dict["name"]]["model"] = model
context["model"][model_dict["name"]][
"default_main_program"] = train_program.clone()
context["model"][model_dict["name"]]["compiled_program"] = None
context["dataset"] = {}
for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] + ".type")
if type == "QueueDataset":
dataset_class = QueueDataset(context) dataset_class = QueueDataset(context)
context["dataset"][dataset[ context["dataset"][dataset[
"name"]] = dataset_class.create_dataset(dataset["name"], "name"]] = dataset_class.create_dataset(dataset["name"],
...@@ -133,9 +216,7 @@ class PSNetwork(NetworkBase): ...@@ -133,9 +216,7 @@ class PSNetwork(NetworkBase):
if envs.get_global_env("dataset." + dataset_name + if envs.get_global_env("dataset." + dataset_name +
".type") == "DataLoader": ".type") == "DataLoader":
model._init_dataloader(is_infer=False) model._init_dataloader(is_infer=False)
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name,
model._data_loader)
model.net(model._data_var, False) model.net(model._data_var, False)
optimizer = model.optimizer() optimizer = model.optimizer()
strategy = self._build_strategy(context) strategy = self._build_strategy(context)
...@@ -160,7 +241,11 @@ class PSNetwork(NetworkBase): ...@@ -160,7 +241,11 @@ class PSNetwork(NetworkBase):
for dataset in context["env"]["dataset"]: for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] + type = envs.get_global_env("dataset." + dataset["name"] +
".type") ".type")
if type != "DataLoader": if type == "DataLoader":
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name,
model._data_loader)
elif type == "QueueDataset":
dataset_class = QueueDataset(context) dataset_class = QueueDataset(context)
context["dataset"][dataset[ context["dataset"][dataset[
"name"]] = dataset_class.create_dataset( "name"]] = dataset_class.create_dataset(
...@@ -229,9 +314,6 @@ class PslibNetwork(NetworkBase): ...@@ -229,9 +314,6 @@ class PslibNetwork(NetworkBase):
if envs.get_global_env("dataset." + dataset_name + if envs.get_global_env("dataset." + dataset_name +
".type") == "DataLoader": ".type") == "DataLoader":
model._init_dataloader(is_infer=False) model._init_dataloader(is_infer=False)
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name,
model._data_loader)
model.net(model._data_var, False) model.net(model._data_var, False)
optimizer = model.optimizer() optimizer = model.optimizer()
...@@ -257,7 +339,11 @@ class PslibNetwork(NetworkBase): ...@@ -257,7 +339,11 @@ class PslibNetwork(NetworkBase):
for dataset in context["env"]["dataset"]: for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] + type = envs.get_global_env("dataset." + dataset["name"] +
".type") ".type")
if type != "DataLoader": if type == "DataLoader":
data_loader = DataLoader(context)
data_loader.get_dataloader(context, dataset_name, context[
"model"][model_dict["name"]]["model"]._data_loader)
elif type == "QueueDataset":
dataset_class = QueueDataset(context) dataset_class = QueueDataset(context)
context["dataset"][dataset[ context["dataset"][dataset[
"name"]] = dataset_class.create_dataset( "name"]] = dataset_class.create_dataset(
...@@ -323,7 +409,10 @@ class CollectiveNetwork(NetworkBase): ...@@ -323,7 +409,10 @@ class CollectiveNetwork(NetworkBase):
context["dataset"] = {} context["dataset"] = {}
for dataset in context["env"]["dataset"]: for dataset in context["env"]["dataset"]:
type = envs.get_global_env("dataset." + dataset["name"] + ".type") type = envs.get_global_env("dataset." + dataset["name"] + ".type")
if type != "DataLoader": if type == "QueueDataset":
raise ValueError(
"Collective don't support QueueDataset training, please use DataLoader."
)
dataset_class = QueueDataset(context) dataset_class = QueueDataset(context)
context["dataset"][dataset[ context["dataset"][dataset[
"name"]] = dataset_class.create_dataset(dataset["name"], "name"]] = dataset_class.create_dataset(dataset["name"],
......
...@@ -16,10 +16,12 @@ from __future__ import print_function ...@@ -16,10 +16,12 @@ from __future__ import print_function
import os import os
import time import time
import warnings
import numpy as np import numpy as np
import paddle.fluid as fluid import paddle.fluid as fluid
from paddlerec.core.utils import envs from paddlerec.core.utils import envs
from paddlerec.core.metric import Metric
__all__ = [ __all__ = [
"RunnerBase", "SingleRunner", "PSRunner", "CollectiveRunner", "PslibRunner" "RunnerBase", "SingleRunner", "PSRunner", "CollectiveRunner", "PslibRunner"
...@@ -77,9 +79,10 @@ class RunnerBase(object): ...@@ -77,9 +79,10 @@ class RunnerBase(object):
name = "dataset." + reader_name + "." name = "dataset." + reader_name + "."
if envs.get_global_env(name + "type") == "DataLoader": if envs.get_global_env(name + "type") == "DataLoader":
self._executor_dataloader_train(model_dict, context) return self._executor_dataloader_train(model_dict, context)
else: else:
self._executor_dataset_train(model_dict, context) self._executor_dataset_train(model_dict, context)
return None
def _executor_dataset_train(self, model_dict, context): def _executor_dataset_train(self, model_dict, context):
reader_name = model_dict["dataset_name"] reader_name = model_dict["dataset_name"]
...@@ -137,8 +140,10 @@ class RunnerBase(object): ...@@ -137,8 +140,10 @@ class RunnerBase(object):
metrics_varnames = [] metrics_varnames = []
metrics_format = [] metrics_format = []
metrics_names = ["total_batch"]
metrics_format.append("{}: {{}}".format("batch")) metrics_format.append("{}: {{}}".format("batch"))
for name, var in metrics.items(): for name, var in metrics.items():
metrics_names.append(name)
metrics_varnames.append(var.name) metrics_varnames.append(var.name)
metrics_format.append("{}: {{}}".format(name)) metrics_format.append("{}: {{}}".format(name))
metrics_format = ", ".join(metrics_format) metrics_format = ", ".join(metrics_format)
...@@ -147,6 +152,7 @@ class RunnerBase(object): ...@@ -147,6 +152,7 @@ class RunnerBase(object):
reader.start() reader.start()
batch_id = 0 batch_id = 0
scope = context["model"][model_name]["scope"] scope = context["model"][model_name]["scope"]
result = None
with fluid.scope_guard(scope): with fluid.scope_guard(scope):
try: try:
while True: while True:
...@@ -168,6 +174,10 @@ class RunnerBase(object): ...@@ -168,6 +174,10 @@ class RunnerBase(object):
except fluid.core.EOFException: except fluid.core.EOFException:
reader.reset() reader.reset()
if batch_id > 0:
result = dict(zip(metrics_names, metrics))
return result
def _get_dataloader_program(self, model_dict, context): def _get_dataloader_program(self, model_dict, context):
model_name = model_dict["name"] model_name = model_dict["name"]
if context["model"][model_name]["compiled_program"] == None: if context["model"][model_name]["compiled_program"] == None:
...@@ -275,6 +285,7 @@ class RunnerBase(object): ...@@ -275,6 +285,7 @@ class RunnerBase(object):
return (epoch_id + 1) % epoch_interval == 0 return (epoch_id + 1) % epoch_interval == 0
def save_inference_model(): def save_inference_model():
# get global env
name = "runner." + context["runner_name"] + "." name = "runner." + context["runner_name"] + "."
save_interval = int( save_interval = int(
envs.get_global_env(name + "save_inference_interval", -1)) envs.get_global_env(name + "save_inference_interval", -1))
...@@ -287,18 +298,44 @@ class RunnerBase(object): ...@@ -287,18 +298,44 @@ class RunnerBase(object):
if feed_varnames is None or fetch_varnames is None or feed_varnames == "" or fetch_varnames == "" or \ if feed_varnames is None or fetch_varnames is None or feed_varnames == "" or fetch_varnames == "" or \
len(feed_varnames) == 0 or len(fetch_varnames) == 0: len(feed_varnames) == 0 or len(fetch_varnames) == 0:
return return
fetch_vars = [
fluid.default_main_program().global_block().vars[varname] # check feed var exist
for varname in fetch_varnames for var_name in feed_varnames:
] if var_name not in fluid.default_main_program().global_block(
).vars:
raise ValueError(
"Feed variable: {} not in default_main_program, global block has follow vars: {}".
format(var_name,
fluid.default_main_program().global_block()
.vars.keys()))
# check fetch var exist
fetch_vars = []
for var_name in fetch_varnames:
if var_name not in fluid.default_main_program().global_block(
).vars:
raise ValueError(
"Fetch variable: {} not in default_main_program, global block has follow vars: {}".
format(var_name,
fluid.default_main_program().global_block()
.vars.keys()))
else:
fetch_vars.append(fluid.default_main_program()
.global_block().vars[var_name])
dirname = envs.get_global_env(name + "save_inference_path", None) dirname = envs.get_global_env(name + "save_inference_path", None)
assert dirname is not None assert dirname is not None
dirname = os.path.join(dirname, str(epoch_id)) dirname = os.path.join(dirname, str(epoch_id))
if is_fleet: if is_fleet:
context["fleet"].save_inference_model( warnings.warn(
context["exe"], dirname, feed_varnames, fetch_vars) "Save inference model in cluster training is not recommended! Using save checkpoint instead.",
category=UserWarning,
stacklevel=2)
if context["fleet"].worker_index() == 0:
context["fleet"].save_inference_model(
context["exe"], dirname, feed_varnames, fetch_vars)
else: else:
fluid.io.save_inference_model(dirname, feed_varnames, fluid.io.save_inference_model(dirname, feed_varnames,
fetch_vars, context["exe"]) fetch_vars, context["exe"])
...@@ -314,7 +351,8 @@ class RunnerBase(object): ...@@ -314,7 +351,8 @@ class RunnerBase(object):
return return
dirname = os.path.join(dirname, str(epoch_id)) dirname = os.path.join(dirname, str(epoch_id))
if is_fleet: if is_fleet:
context["fleet"].save_persistables(context["exe"], dirname) if context["fleet"].worker_index() == 0:
context["fleet"].save_persistables(context["exe"], dirname)
else: else:
fluid.io.save_persistables(context["exe"], dirname) fluid.io.save_persistables(context["exe"], dirname)
...@@ -336,11 +374,28 @@ class SingleRunner(RunnerBase): ...@@ -336,11 +374,28 @@ class SingleRunner(RunnerBase):
".epochs")) ".epochs"))
for epoch in range(epochs): for epoch in range(epochs):
for model_dict in context["phases"]: for model_dict in context["phases"]:
model_class = context["model"][model_dict["name"]]["model"]
metrics = model_class._metrics
begin_time = time.time() begin_time = time.time()
self._run(context, model_dict) result = self._run(context, model_dict)
end_time = time.time() end_time = time.time()
seconds = end_time - begin_time seconds = end_time - begin_time
print("epoch {} done, use time: {}".format(epoch, seconds)) message = "epoch {} done, use time: {}".format(epoch, seconds)
metrics_result = []
for key in metrics:
if isinstance(metrics[key], Metric):
_str = metrics[key].calc_global_metrics(
None,
context["model"][model_dict["name"]]["scope"])
metrics_result.append(_str)
elif result is not None:
_str = "{}={}".format(key, result[key])
metrics_result.append(_str)
if len(metrics_result) > 0:
message += ", global metrics: " + ", ".join(metrics_result)
print(message)
with fluid.scope_guard(context["model"][model_dict["name"]][ with fluid.scope_guard(context["model"][model_dict["name"]][
"scope"]): "scope"]):
train_prog = context["model"][model_dict["name"]][ train_prog = context["model"][model_dict["name"]][
...@@ -362,12 +417,32 @@ class PSRunner(RunnerBase): ...@@ -362,12 +417,32 @@ class PSRunner(RunnerBase):
envs.get_global_env("runner." + context["runner_name"] + envs.get_global_env("runner." + context["runner_name"] +
".epochs")) ".epochs"))
model_dict = context["env"]["phase"][0] model_dict = context["env"]["phase"][0]
model_class = context["model"][model_dict["name"]]["model"]
metrics = model_class._metrics
for epoch in range(epochs): for epoch in range(epochs):
begin_time = time.time() begin_time = time.time()
self._run(context, model_dict) result = self._run(context, model_dict)
end_time = time.time() end_time = time.time()
seconds = end_time - begin_time seconds = end_time - begin_time
print("epoch {} done, use time: {}".format(epoch, seconds)) message = "epoch {} done, use time: {}".format(epoch, seconds)
# TODO: remove this restriction once PaddleCloudRoleMaker supports gloo
from paddle.fluid.incubate.fleet.base.role_maker import GeneralRoleMaker
if context["fleet"] is not None and isinstance(context["fleet"],
GeneralRoleMaker):
metrics_result = []
for key in metrics:
if isinstance(metrics[key], Metric):
_str = metrics[key].calc_global_metrics(
context["fleet"],
context["model"][model_dict["name"]]["scope"])
metrics_result.append(_str)
elif result is not None:
_str = "{}={}".format(key, result[key])
metrics_result.append(_str)
if len(metrics_result) > 0:
message += ", global metrics: " + ", ".join(metrics_result)
print(message)
with fluid.scope_guard(context["model"][model_dict["name"]][ with fluid.scope_guard(context["model"][model_dict["name"]][
"scope"]): "scope"]):
train_prog = context["model"][model_dict["name"]][ train_prog = context["model"][model_dict["name"]][
...@@ -476,14 +551,30 @@ class SingleInferRunner(RunnerBase): ...@@ -476,14 +551,30 @@ class SingleInferRunner(RunnerBase):
for index, epoch_name in enumerate(self.epoch_model_name_list): for index, epoch_name in enumerate(self.epoch_model_name_list):
for model_dict in context["phases"]: for model_dict in context["phases"]:
model_class = context["model"][model_dict["name"]]["model"]
metrics = model_class._infer_results
self._load(context, model_dict, self._load(context, model_dict,
self.epoch_model_path_list[index]) self.epoch_model_path_list[index])
begin_time = time.time() begin_time = time.time()
self._run(context, model_dict) result = self._run(context, model_dict)
end_time = time.time() end_time = time.time()
seconds = end_time - begin_time seconds = end_time - begin_time
print("Infer {} of {} done, use time: {}".format(model_dict[ message = "Infer {} of epoch {} done, use time: {}".format(
"name"], epoch_name, seconds)) model_dict["name"], epoch_name, seconds)
metrics_result = []
for key in metrics:
if isinstance(metrics[key], Metric):
_str = metrics[key].calc_global_metrics(
None,
context["model"][model_dict["name"]]["scope"])
metrics_result.append(_str)
elif result is not None:
_str = "{}={}".format(key, result[key])
metrics_result.append(_str)
if len(metrics_result) > 0:
message += ", global metrics: " + ", ".join(metrics_result)
print(message)
context["status"] = "terminal_pass" context["status"] = "terminal_pass"
def _load(self, context, model_dict, model_path): def _load(self, context, model_dict, model_path):
......
...@@ -17,9 +17,13 @@ from __future__ import print_function ...@@ -17,9 +17,13 @@ from __future__ import print_function
import warnings import warnings
import paddle.fluid as fluid import paddle.fluid as fluid
import paddle.fluid.core as core
from paddlerec.core.utils import envs from paddlerec.core.utils import envs
__all__ = ["StartupBase", "SingleStartup", "PSStartup", "CollectiveStartup"] __all__ = [
"StartupBase", "SingleStartup", "PSStartup", "CollectiveStartup",
"FineTuningStartup"
]
class StartupBase(object): class StartupBase(object):
...@@ -65,6 +69,122 @@ class SingleStartup(StartupBase): ...@@ -65,6 +69,122 @@ class SingleStartup(StartupBase):
context["status"] = "train_pass" context["status"] = "train_pass"
class FineTuningStartup(StartupBase):
"""R
"""
def __init__(self, context):
self.op_name_scope = "op_namescope"
self.clip_op_name_scope = "@CLIP"
self.op_role_var_attr_name = core.op_proto_and_checker_maker.kOpRoleVarAttrName(
)
print("Running SingleStartup.")
def _is_opt_role_op(self, op):
# NOTE: depend on oprole to find out whether this op is for
# optimize
op_maker = core.op_proto_and_checker_maker
optimize_role = core.op_proto_and_checker_maker.OpRole.Optimize
if op_maker.kOpRoleAttrName() in op.attr_names and \
int(op.all_attrs()[op_maker.kOpRoleAttrName()]) == int(optimize_role):
return True
return False
def _get_params_grads(self, program):
"""
Get optimizer operators, parameters and gradients from origin_program
Returns:
opt_ops (list): optimize operators.
params_grads (dict): parameter->gradient.
"""
block = program.global_block()
params_grads = []
# tmp set to dedup
optimize_params = set()
origin_var_dict = program.global_block().vars
for op in block.ops:
if self._is_opt_role_op(op):
# Todo(chengmo): Whether clip related op belongs to Optimize guard should be discussed
# delete clip op from opt_ops when run in Parameter Server mode
if self.op_name_scope in op.all_attrs(
) and self.clip_op_name_scope in op.attr(self.op_name_scope):
op._set_attr(
"op_role",
int(core.op_proto_and_checker_maker.OpRole.Backward))
continue
if op.attr(self.op_role_var_attr_name):
param_name = op.attr(self.op_role_var_attr_name)[0]
grad_name = op.attr(self.op_role_var_attr_name)[1]
if not param_name in optimize_params:
optimize_params.add(param_name)
params_grads.append([
origin_var_dict[param_name],
origin_var_dict[grad_name]
])
return params_grads
@staticmethod
def is_persistable(var):
"""
Check whether the given variable is persistable.
Args:
var(Variable): The variable to be checked.
Returns:
bool: True if the given `var` is persistable
False if not.
Examples:
.. code-block:: python
import paddle.fluid as fluid
param = fluid.default_main_program().global_block().var('fc.b')
res = fluid.io.is_persistable(param)
"""
if var.desc.type() == core.VarDesc.VarType.FEED_MINIBATCH or \
var.desc.type() == core.VarDesc.VarType.FETCH_LIST or \
var.desc.type() == core.VarDesc.VarType.READER:
return False
return var.persistable
def load(self, context, is_fleet=False, main_program=None):
dirname = envs.get_global_env(
"runner." + context["runner_name"] + ".init_model_path", None)
if dirname is None or dirname == "":
return
print("going to load ", dirname)
params_grads = self._get_params_grads(main_program)
update_params = [p for p, _ in params_grads]
need_load_vars = []
parameters = list(
filter(FineTuningStartup.is_persistable, main_program.list_vars()))
for param in parameters:
if param not in update_params:
need_load_vars.append(param)
fluid.io.load_vars(context["exe"], dirname, main_program,
need_load_vars)
print("load from {} success".format(dirname))
def startup(self, context):
for model_dict in context["phases"]:
with fluid.scope_guard(context["model"][model_dict["name"]][
"scope"]):
train_prog = context["model"][model_dict["name"]][
"main_program"]
startup_prog = context["model"][model_dict["name"]][
"startup_program"]
with fluid.program_guard(train_prog, startup_prog):
context["exe"].run(startup_prog)
self.load(context, main_program=train_prog)
context["status"] = "train_pass"
class PSStartup(StartupBase): class PSStartup(StartupBase):
def __init__(self, context): def __init__(self, context):
print("Running PSStartup.") print("Running PSStartup.")
......
...@@ -39,9 +39,21 @@ def dataloader_by_name(readerclass, ...@@ -39,9 +39,21 @@ def dataloader_by_name(readerclass,
data_path = os.path.join(package_base, data_path.split("::")[1]) data_path = os.path.join(package_base, data_path.split("::")[1])
files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)] files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)]
files.sort()
need_split_files = False
if context["engine"] == EngineMode.LOCAL_CLUSTER: if context["engine"] == EngineMode.LOCAL_CLUSTER:
# for local cluster: split files for multi process
need_split_files = True
elif context["engine"] == EngineMode.CLUSTER and context[
"cluster_type"] == "K8S":
# for k8s mount mode, split files for every node
need_split_files = True
print("need_split_files: {}".format(need_split_files))
if need_split_files:
files = split_files(files, context["fleet"].worker_index(), files = split_files(files, context["fleet"].worker_index(),
context["fleet"].worker_num()) context["fleet"].worker_num())
print("file_list : {}".format(files)) print("file_list : {}".format(files))
reader = reader_class(yaml_file) reader = reader_class(yaml_file)
...@@ -81,10 +93,20 @@ def slotdataloader_by_name(readerclass, dataset_name, yaml_file, context): ...@@ -81,10 +93,20 @@ def slotdataloader_by_name(readerclass, dataset_name, yaml_file, context):
data_path = os.path.join(package_base, data_path.split("::")[1]) data_path = os.path.join(package_base, data_path.split("::")[1])
files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)] files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)]
files.sort()
need_split_files = False
if context["engine"] == EngineMode.LOCAL_CLUSTER: if context["engine"] == EngineMode.LOCAL_CLUSTER:
# for local cluster: split files for multi process
need_split_files = True
elif context["engine"] == EngineMode.CLUSTER and context[
"cluster_type"] == "K8S":
# for k8s mount mode, split files for every node
need_split_files = True
if need_split_files:
files = split_files(files, context["fleet"].worker_index(), files = split_files(files, context["fleet"].worker_index(),
context["fleet"].worker_num()) context["fleet"].worker_num())
print("file_list: {}".format(files))
sparse = get_global_env(name + "sparse_slots", "#") sparse = get_global_env(name + "sparse_slots", "#")
if sparse == "": if sparse == "":
...@@ -135,10 +157,20 @@ def slotdataloader(readerclass, train, yaml_file, context): ...@@ -135,10 +157,20 @@ def slotdataloader(readerclass, train, yaml_file, context):
data_path = os.path.join(package_base, data_path.split("::")[1]) data_path = os.path.join(package_base, data_path.split("::")[1])
files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)] files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)]
files.sort()
need_split_files = False
if context["engine"] == EngineMode.LOCAL_CLUSTER: if context["engine"] == EngineMode.LOCAL_CLUSTER:
# for local cluster: split files for multi process
need_split_files = True
elif context["engine"] == EngineMode.CLUSTER and context[
"cluster_type"] == "K8S":
# for k8s mount mode, split files for every node
need_split_files = True
if need_split_files:
files = split_files(files, context["fleet"].worker_index(), files = split_files(files, context["fleet"].worker_index(),
context["fleet"].worker_num()) context["fleet"].worker_num())
print("file_list: {}".format(files))
sparse = get_global_env("sparse_slots", "#", namespace) sparse = get_global_env("sparse_slots", "#", namespace)
if sparse == "": if sparse == "":
......
# PaddleRec 自定义数据集及Reader
用户自定义数据集及配置异步Reader,需要关注以下几个步骤:
* [数据集整理](#数据集整理)
* [在模型组网中加入输入占位符](#在模型组网中加入输入占位符)
* [Reader实现](#Reader的实现)
* [在yaml文件中配置Reader](#在yaml文件中配置reader)
我们以CTR-DNN模型为例,给出了从数据整理,变量定义,Reader写法,调试的完整历程。
* [数据及Reader示例-DNN](#数据及Reader示例-DNN)
## 数据集整理
PaddleRec支持模型自定义数据集。
关于数据的tips:
1. 数据量:
PaddleRec面向大规模数据设计,可以轻松支持亿级的数据读取,工业级的数据读写api:`dataset`在搜索、推荐、信息流等业务得到了充分打磨。
2. 文件类型:
支持任意直接可读的文本数据,`dataset`同时支持`.gz`格式的文本压缩数据,无需额外代码,可直接读取。数据样本应以`\n`为标志,按行组织。
3. 文件存放位置:
文件通常存放在训练节点本地,但同时,`dataset`支持使用`hadoop`远程读取数据,数据无需下载到本地,为dataset配置hadoop相关账户及地址即可。
4. 数据类型
Reader处理的是以行为单位的`string`数据,喂入网络的数据需要转为`int`,`float`的数值数据,不支持`string`喂入网络,不建议明文保存及处理训练数据。
5. Tips
Dataset模式下,训练线程与数据读取线程强相关,为了充分利用多线程,`强烈建议将文件合理地拆为多个小文件`,尤其是在分布式训练场景下,这样可以均衡各个节点的数据量,同时加快数据的下载速度,拆分方式可参考下方的示意代码。
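下面给出一个按行轮询拆分大文件的极简示意(`split_file`为示意函数,并非PaddleRec提供的接口,拆分份数请结合线程数与节点数自行调整):

```python
# 仅为示意:将一个大的文本数据文件按行轮询写入 num_parts 个小文件
def split_file(src_path, num_parts=10):
    outs = [open("{}.part{}".format(src_path, i), "w") for i in range(num_parts)]
    with open(src_path) as fin:
        for i, line in enumerate(fin):
            outs[i % num_parts].write(line)
    for out in outs:
        out.close()

# 使用方式(路径为假设):split_file("./train_data/all.txt", num_parts=10)
```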
## 在模型组网中加入输入占位符
Reader读取文件后,产出的数据喂入网络,需要有占位符进行接收。占位符在Paddle中使用`fluid.data`或`fluid.layers.data`进行定义。`data`的定义可以参考[fluid.data](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/data_cn.html#data)以及[fluid.layers.data](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/layers_cn/data_cn.html#data)
假如您希望输入三个数据,分别是维度32的数据A,维度变长的稀疏数据B,以及一个一维的标签数据C,并希望梯度可以经过该变量向前传递,则示例如下:
数据A的定义:
```python
var_a = fluid.data(name='A', shape= [-1, 32], dtype='float32')
```
数据B的定义,变长数据的使用可以参考[LoDTensor](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/basic_concept/lod_tensor.html#cn-user-guide-lod-tensor)
```python
var_b = fluid.data(name='B', shape=[-1, 1], lod_level=1, dtype='int64')
```
数据C的定义:
```python
var_c = fluid.data(name='C', shape=[-1, 1], dtype='int32')
var_c.stop_gradient = False
```
当我们完成以上三个数据的定义后,在PaddleRec的模型定义中,还需将其加入model基类成员变量`self._data_var`
```python
self._data_var.append(var_a)
self._data_var.append(var_b)
self._data_var.append(var_c)
```
至此,我们完成了在组网中定义输入数据的工作。
## Reader的实现
### Reader的实现范式
Reader的逻辑需要一个单独的python文件进行描述。我们试写一个`test_reader.py`,实现的具体流程如下:
1. 首先我们需要引入Reader基类
```python
from paddlerec.core.reader import ReaderBase
```
2. 创建一个子类,继承Reader的基类,训练所需Reader命名为`TrainerReader`
```python
class TrainerReader(ReaderBase):
def init(self):
pass
def generate_sample(self, line):
pass
```
3.`init(self)`函数中声明一些在数据读取中会用到的变量,必要时可以在`config.yaml`文件中配置变量,利用`env.get_global_env()`拿到。
比如,我们希望从yaml文件中读取一个数据预处理变量`avg=10`,目的是将数据A的数据缩小10倍,可以这样实现:
首先更改yaml文件,在某个space下加入该变量
```yaml
...
train:
reader:
avg: 10
...
```
再更改Reader的init函数
```python
from paddlerec.core.reader import ReaderBase
from paddlerec.core.utils import envs

class TrainerReader(ReaderBase):
def init(self):
self.avg = envs.get_global_env("avg", None, "train.reader")
def generate_sample(self, line):
pass
```
4. 继承并实现基类中的`generate_sample(self, line)`函数,逐行读取数据。
- 该函数应返回一个可以迭代的reader方法(带有yield的函数不再是一个普通的函数,而是一个生成器generator,成为了可以迭代的对象,等价于一个数组、链表、文件、字符串etc.)
- 在这个可以迭代的函数中,如示例代码中的`def reader()`,我们定义数据读取的逻辑。以行为单位的数据进行截取,转换及预处理。
- 最后,我们需要将数据整理为特定的格式,才能够被PaddleRec的Reader正确读取,并灌入训练的网络中。简单来说,数据的输出顺序与我们在网络中创建的`inputs`必须是严格一一对应的,并转换为类似字典的形式。
示例: 假设数据ABC在文本数据中,每行以这样的形式存储:
```shell
0.1,0.2,0.3...3.0,3.1,3.2 \t 99999,99998,99997 \t 1 \n
```
则示例代码如下:
```python
from paddlerec.core.reader import ReaderBase
from paddlerec.core.utils import envs

class TrainerReader(ReaderBase):
def init(self):
self.avg = envs.get_global_env("avg", None, "train.reader")
def generate_sample(self, line):
def reader():
# 先分割 '\n', 再以 '\t'为标志分割为list
variables = (line.strip('\n')).split('\t')
# A是第一个元素,并且每个数据之间使用','分割
var_a = variables[0].split(',') # list
var_a = [float(i) / self.avg for i in var_a] # 将str数据转换为float
# B是第二个元素,同样以 ',' 分割
var_b = variables[1].split(',') # list
var_b = [int(i) for i in var_b] # 将str数据转换为int
# C是第三个元素, 只有一个元素,没有分割符
var_c = variables[2]
var_c = int(var_c) # 将str数据转换为int
var_c = [var_c] # 将单独的数据元素置入list中
# 将数据与数据名结合,组织为dict的形式
# 如下,output形式为{ A: var_a, B: var_b, C: var_c}
variable_name = ['A', 'B', 'C']
output = zip(variable_name, [var_a] + [var_b] + [var_c])
# 将数据输出,使用yield方法,将该函数变为了一个可迭代的对象
yield output
# 将内部的reader函数返回,generate_sample需要返回这个可迭代的reader
return reader
```
至此,我们完成了Reader的实现。
### 在yaml文件中配置Reader
在模型的yaml配置文件中,主要的修改是三个,如下
```yaml
reader:
batch_size: 2
class: "{workspace}/reader.py"
train_data_path: "{workspace}/data/train_data"
reader_debug_mode: False
```
batch_size: 顾名思义,是小批量训练时的样本大小
class: 运行该模型所需reader的路径
train_data_path: 训练数据所在文件夹
reader_debug_mode: 测试reader语法,及输出是否符合预期的debug模式的开关
## 数据及Reader示例-DNN
Reader代码来源于[criteo_reader.py](../models/rank/criteo_reader.py), 组网代码来源于[model.py](../models/rank/dnn/model.py)
### Criteo数据集格式
CTR-DNN训练及测试数据集选用[Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/)所用的Criteo数据集。该数据集包括两部分:训练集和测试集。训练集包含一段时间内Criteo的部分流量,测试集则对应训练数据后一天的广告点击流量。
每一行数据格式如下所示:
```bash
<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
```
其中```<label>```表示广告是否被点击,点击用1表示,未点击用0表示。```<integer feature>```代表数值特征(连续特征),共有13个连续特征。```<categorical feature>```代表分类特征(离散特征),共有26个离散特征。相邻两个特征用```\t```分隔,缺失特征用空格表示。测试集中```<label>```特征已被移除。
### Criteo数据集的预处理
数据预处理共包括两步:
- 将原始训练集按9:1划分为训练集和验证集
- 数值特征(连续特征)需进行归一化处理,但需要注意的是,对每一个特征```<integer feature i>```,归一化时用到的最大值并不是全局最大值,而是取排序后95%位置处的特征值作为最大值,同时保留极值,具体做法可参考下方的示意代码。
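下面给出一个构造此类归一化函数的极简示意(`build_normalizer`为示意函数,percentile、变量名均为假设,仅说明“以95%分位值作为最大值、不截断极值”的做法):

```python
import numpy as np

# 仅为示意:values 为某一维连续特征在训练集上的全部取值
def build_normalizer(values, percentile=95):
    values = np.sort(np.asarray(values, dtype=np.float64))
    v_min = values[0]
    # 取排序后95%位置处的特征值,作为归一化用的"最大值"
    v_max = values[int(len(values) * percentile / 100) - 1]
    diff = max(v_max - v_min, 1e-6)

    def normalize(x):
        # 极值不做截断,按同一线性变换缩放,保留超过95%分位的取值信息
        return (x - v_min) / diff

    return normalize
```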
### CTR网络输入的定义
正如前所述,Criteo数据集分为连续数据与离散(稀疏)数据,所以整体而言,CTR-DNN模型的数据输入层包括三个,分别是:`dense_input`用于输入连续数据,维度由超参数`dense_input_dim`指定,数据类型是归一化后的浮点型数据;`sparse_input_ids`用于记录离散数据,在Criteo数据集中,共有26个slot,所以我们创建了名为`S1~S26`的26个稀疏参数输入,并设置`lod_level=1`,代表其为变长数据,数据类型为整数;最后是每条样本的`label`,代表了是否被点击,数据类型是整数,0代表负样例,1代表正样例。
在Paddle中数据输入的声明使用`paddle.fluid.layers.data()`,会创建指定类型的占位符,数据IO会依据此定义进行数据的输入。
稀疏参数输入的定义:
```python
def sparse_inputs():
ids = envs.get_global_env("hyper_parameters.sparse_inputs_slots", None)
sparse_input_ids = [
fluid.layers.data(name="S" + str(i),
shape=[1],
lod_level=1,
dtype="int64") for i in range(1, ids)
]
return sparse_input_ids
```
稠密参数输入的定义:
```python
def dense_input():
dim = envs.get_global_env("hyper_parameters.dense_input_dim", None)
dense_input_var = fluid.layers.data(name="D",
shape=[dim],
dtype="float32")
return dense_input_var
```
标签的定义:
```python
def label_input():
label = fluid.layers.data(name="click", shape=[1], dtype="int64")
return label
```
组合起来,正确的声明他们:
```python
self.sparse_inputs = sparse_inputs()
self.dense_input = dense_input()
self.label_input = label_input()
self._data_var.append(self.dense_input)
for input in self.sparse_inputs:
self._data_var.append(input)
self._data_var.append(self.label_input)
```
### Criteo Reader写法
```python
# 引入PaddleRec的Reader基类
from paddlerec.core.reader import ReaderBase
# 引入PaddleRec的读取yaml配置文件的方法
from paddlerec.core.utils import envs
# 定义Reader,需要继承 paddlerec.core.reader.ReaderBase
class Reader(ReaderBase):
# 数据预处理逻辑,继承自基类
# 如果无需处理, 使用pass跳过该函数的执行
def init(self):
self.cont_min_ = [0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
self.cont_max_ = [20, 600, 100, 50, 64000, 500, 100, 50, 500, 10, 10, 10, 50]
self.cont_diff_ = [20, 603, 100, 50, 64000, 500, 100, 50, 500, 10, 10, 10, 50]
self.hash_dim_ = envs.get_global_env("hyper_parameters.sparse_feature_number", None, "train.model")
self.continuous_range_ = range(1, 14)
self.categorical_range_ = range(14, 40)
# 读取数据方法,继承自基类
# 实现可以迭代的reader函数,逐行处理数据
def generate_sample(self, line):
"""
Read the data line by line and process it as a dictionary
"""
def reader():
"""
This function needs to be implemented by the user, based on data format
"""
features = line.rstrip('\n').split('\t')
dense_feature = []
sparse_feature = []
for idx in self.continuous_range_:
if features[idx] == "":
dense_feature.append(0.0)
else:
dense_feature.append(
(float(features[idx]) - self.cont_min_[idx - 1]) /
self.cont_diff_[idx - 1])
for idx in self.categorical_range_:
sparse_feature.append(
[hash(str(idx) + features[idx]) % self.hash_dim_])
label = [int(features[0])]
feature_name = ["D"]
for idx in self.categorical_range_:
feature_name.append("S" + str(idx - 13))
feature_name.append("label")
yield zip(feature_name, [dense_feature] + sparse_feature + [label])
return reader
```
### 调试Reader
在Linux下运行时,默认启动`Dataset`模式,在Win/Mac下运行时,默认启动`Dataloader`模式。
通过在`config.yaml`中添加或修改`reader_debug_mode=True`打开debug模式,此时只会结合组网运行reader的部分,读取10条样本并print,方便您观察数据格式是否符合预期,以及排查隐藏的bug。
```yaml
reader:
batch_size: 2
class: "{workspace}/../criteo_reader.py"
train_data_path: "{workspace}/data/train"
reader_debug_mode: True
```
修改后,使用paddlerec.run执行该修改后的yaml文件,可以观察输出。
```bash
python -m paddlerec.run -m ./models/rank/dnn/config.yaml -e single
```
### Dataset调试
dataset输出的数据格式如下:
` dense_input:size ; dense_input:value ; sparse_input:size ; sparse_input:value ; ... ; sparse_input:size ; sparse_input:value ; label:size ; label:value `
基本规律是对于每个变量,会先输出其维度大小,再输出其具体值。
直接debug `criteo_reader`理想的输出为(截取了一个片段):
```bash
...
13 0.0 0.00497512437811 0.05 0.08 0.207421875 0.028 0.35 0.08 0.082 0.0 0.4 0.0 0.08 1 737395 1 210498 1 903564 1 286224 1 286835 1 906818 1 90
6116 1 67180 1 27346 1 51086 1 142177 1 95024 1 157883 1 873363 1 600281 1 812592 1 228085 1 35900 1 880474 1 984402 1 100885 1 26235 1 410878 1 798162 1 499868 1 306163 1 0
...
```
可以看到首先输出的是13维的dense参数,随后是分立的sparse参数,最后一个是1维的label,数值为0,输出符合预期。
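若想以程序方式核对这段输出,可以按“先size后value”的规律做一个极简解析(`parse_debug_line`为示意函数,并非PaddleRec提供的接口):

```python
# 仅为示意:解析dataset debug输出的一行,var_num为该行包含的变量个数(此处为1个dense + 26个sparse + 1个label)
def parse_debug_line(line, var_num=28):
    tokens = line.split()
    pos, parsed = 0, []
    for _ in range(var_num):
        size = int(tokens[pos])                          # 先读出该变量的维度大小
        parsed.append(tokens[pos + 1:pos + 1 + size])    # 再读出对应个数的取值
        pos += 1 + size
    return parsed
```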
>使用Dataset的一些注意事项
> - Dataset的基本原理:将数据print到缓存,再由C++端的代码实现读取,因此,我们不能在dataset的读取代码中,加入与数据读取无关的print信息,会导致C++端拿到错误的数据信息。
> - dataset目前只支持在`Ubuntu`及`CentOS`等标准Linux环境下使用,在`Windows`及`Mac`下使用时,会产生预料之外的错误,请知悉。
### DataLoader调试
dataloader的输出格式为`list: [ list[var_1], list[var_2], ... , list[var_n]]`,每条样本的数据会被放在一个 **list[list]** 中,list[0]为第一个variable。
直接debug `criteo_reader`理想的输出为(截取了一个片段):
```bash
...
[[0.0, 0.004975124378109453, 0.05, 0.08, 0.207421875, 0.028, 0.35, 0.08, 0.082, 0.0, 0.4, 0.0, 0.08], [560746], [902436], [262029], [182633], [368411], [735166], [321120], [39572], [185732], [140298], [926671], [81559], [461249], [728372], [915018], [907965], [818961], [850958], [311492], [980340], [254960], [175041], [524857], [764893], [526288], [220126], [0]]
...
```
可以看到首先输出的是13维的dense参数的list,随后是分立的sparse参数,各自在一个list中,最后一个是1维的label的list,数值为0,输出符合预期。
...@@ -9,6 +9,7 @@ ...@@ -9,6 +9,7 @@
- [第三步:增加集群运行`backend.yaml`配置](#第三步增加集群运行backendyaml配置) - [第三步:增加集群运行`backend.yaml`配置](#第三步增加集群运行backendyaml配置)
- [MPI集群的Parameter Server模式配置](#mpi集群的parameter-server模式配置) - [MPI集群的Parameter Server模式配置](#mpi集群的parameter-server模式配置)
- [K8S集群的Collective模式配置](#k8s集群的collective模式配置) - [K8S集群的Collective模式配置](#k8s集群的collective模式配置)
- [K8S集群的PS-CPU模式配置](#k8s集群的ps-cpu模式配置)
- [第四步:任务提交](#第四步任务提交) - [第四步:任务提交](#第四步任务提交)
- [使用PaddleCloud Client提交](#使用paddlecloud-client提交) - [使用PaddleCloud Client提交](#使用paddlecloud-client提交)
- [第一步:在`before_hook.sh`里手动安装PaddleRec](#第一步在before_hooksh里手动安装paddlerec) - [第一步:在`before_hook.sh`里手动安装PaddleRec](#第一步在before_hooksh里手动安装paddlerec)
...@@ -34,10 +35,10 @@ ...@@ -34,10 +35,10 @@
分布式运行首先需要更改`config.yaml`,主要调整以下内容: 分布式运行首先需要更改`config.yaml`,主要调整以下内容:
- workspace: 调整为在节点运行时的工作目录 - workspace: 调整为在远程节点运行时的工作目录,一般设置为`"./"`即可
- runner_class: 从单机的"train"调整为"cluster_train" - runner_class: 从单机的"train"调整为"cluster_train",单机训练->分布式训练(例外情况,k8s上单机单卡训练仍然为train,后续支持)
- fleet_mode: 选则参数服务器模式,抑或GPU Collective模式 - fleet_mode: 选择参数服务器模式(ps),或者GPU的all-reduce模式(collective)
- distribute_strategy: 可选项,选择分布式训练的策略 - distribute_strategy: 可选项,选择分布式训练的策略,目前只在参数服务器模式下生效,可选项:`sync、async、half_async、geo`
配置选项具体参数,可以参考[yaml配置说明](./yaml.md) 配置选项具体参数,可以参考[yaml配置说明](./yaml.md)
...@@ -47,50 +48,72 @@ ...@@ -47,50 +48,72 @@
```yaml ```yaml
# workspace # workspace
workspace: "paddlerec.models.rank.dnn" workspace: "models/rank/dnn"
mode: [single_cpu_train] mode: [single_cpu_train]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner: runner:
- name: single_cpu_train - name: single_cpu_train
class: train class: train
# num of epochs
epochs: 4 epochs: 4
# device to run training or infer
device: cpu device: cpu
save_checkpoint_interval: 2 # save model interval of epochs save_checkpoint_interval: 2
save_checkpoint_path: "increment_dnn" # save checkpoint path save_checkpoint_path: "increment_dnn"
init_model_path: "" # load model path init_model_path: ""
print_interval: 10 print_interval: 10
phases: [phase1] phases: [phase1]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
data_path: "{workspace}/data/sample_data/train"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
thread_num: 1
``` ```
分布式的训练配置可以改为: 分布式的训练配置可以改为:
```yaml ```yaml
# workspace # 改变一:代码上传至节点后,在默认目录下
# 改变一:代码上传至节点后,与运行shell同在一个默认目录下
workspace: "./" workspace: "./"
mode: [ps_cluster] mode: [ps_cluster]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner: runner:
- name: ps_cluster - name: ps_cluster
# 改变二:调整runner的class # 改变二:调整runner的class
class: cluster_train class: cluster_train
# num of epochs
epochs: 4 epochs: 4
# device to run training or infer
device: cpu device: cpu
# 改变三 & 四: 指定fleet_mode 与 distribute_strategy # 改变三 & 四: 指定fleet_mode 与 distribute_strategy
fleet_mode: ps fleet_mode: ps
distribute_strategy: async distribute_strategy: async
save_checkpoint_interval: 2 # save model interval of epochs save_checkpoint_interval: 2
save_checkpoint_path: "increment_dnn" # save checkpoint path save_checkpoint_path: "increment_dnn"
init_model_path: "" # load model path init_model_path: ""
print_interval: 10 print_interval: 10
phases: [phase1] phases: [phase1]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
# 改变五: 改变数据的读取目录
# 通常而言,mpi模式下,数据会下载到远程节点执行目录的'./train_data'下, k8s则与挂载位置有关
data_path: "{workspace}/train_data"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
# 分布式训练节点的CPU_NUM环境变量与thread_num相等,多个phase时,取最大的thread_num
thread_num: 1
``` ```
除此之外,还需关注数据及模型加载的路径,一般而言: 除此之外,还需关注数据及模型加载的路径,一般而言:
...@@ -110,6 +133,8 @@ cluster_type: mpi # k8s 可选 ...@@ -110,6 +133,8 @@ cluster_type: mpi # k8s 可选
config: config:
# 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2 # 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2
paddle_version: "1.7.2" paddle_version: "1.7.2"
# 是否使用PaddleCloud运行环境下的Python3,默认使用python2
use_python3: 1
# hdfs/afs的配置信息填写 # hdfs/afs的配置信息填写
fs_name: "afs://xxx.com" fs_name: "afs://xxx.com"
...@@ -130,11 +155,13 @@ config: ...@@ -130,11 +155,13 @@ config:
# paddle参数服务器分布式底层超参,无特殊需求不理不改 # paddle参数服务器分布式底层超参,无特殊需求不理不改
communicator: communicator:
# 使用SGD优化器时,建议设置为1
FLAGS_communicator_is_sgd_optimizer: 0 FLAGS_communicator_is_sgd_optimizer: 0
# 以下三个变量默认都等于训练时的线程数:CPU_NUM
FLAGS_communicator_send_queue_size: 5 FLAGS_communicator_send_queue_size: 5
FLAGS_communicator_thread_pool_size: 32
FLAGS_communicator_max_merge_var_num: 5 FLAGS_communicator_max_merge_var_num: 5
FLAGS_communicator_max_send_grad_num_before_recv: 5 FLAGS_communicator_max_send_grad_num_before_recv: 5
FLAGS_communicator_thread_pool_size: 32
FLAGS_communicator_fake_rpc: 0 FLAGS_communicator_fake_rpc: 0
FLAGS_rpc_retry_times: 3 FLAGS_rpc_retry_times: 3
...@@ -165,7 +192,14 @@ submit: ...@@ -165,7 +192,14 @@ submit:
# for k8s gpu # for k8s gpu
# k8s gpu 模式下,训练节点数,及每个节点上的GPU卡数 # k8s gpu 模式下,训练节点数,及每个节点上的GPU卡数
k8s_trainers: 2 k8s_trainers: 2
k8s_cpu_cores: 4
k8s_gpu_card: 1 k8s_gpu_card: 1
# for k8s ps-cpu
k8s_trainers: 2
k8s_cpu_cores: 4
k8s_ps_num: 2
k8s_ps_cores: 4
``` ```
...@@ -173,18 +207,51 @@ submit: ...@@ -173,18 +207,51 @@ submit:
除此之外,我们还需要关注上传到工作目录的文件(`files选项`)的路径问题,在示例中是`./*.py`,说明我们执行任务提交时,与这些py文件在同一目录。若不在同一目录,则需要适当调整files路径,或改为这些文件的绝对路径。 除此之外,我们还需要关注上传到工作目录的文件(`files选项`)的路径问题,在示例中是`./*.py`,说明我们执行任务提交时,与这些py文件在同一目录。若不在同一目录,则需要适当调整files路径,或改为这些文件的绝对路径。
不建议利用`files`上传数据文件,可以通过指定`train_data_path`自动下载,或指定`afs_remote_mount_point`挂载实现数据到节点的转移。 不建议利用`files`上传过大的数据文件,可以通过指定`train_data_path`自动下载,或在k8s模式下指定`afs_remote_mount_point`挂载实现数据到节点的转移。
#### MPI集群的Parameter Server模式配置 #### MPI集群的Parameter Server模式配置
下面是一个利用PaddleCloud提交MPI参数服务器模式任务的`backend.yaml`示例 下面是一个利用PaddleCloud提交MPI参数服务器模式任务的`backend.yaml`示例
首先调整`config.yaml`:
```yaml
workspace: "./"
mode: [ps_cluster]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
data_path: "{workspace}/train_data"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
runner:
- name: ps_cluster
class: cluster_train
epochs: 2
device: cpu
fleet_mode: ps
save_checkpoint_interval: 1
save_checkpoint_path: "increment_dnn"
init_model_path: ""
print_interval: 1
phases: [phase1]
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
thread_num: 1
```
再新增`backend.yaml`
```yaml ```yaml
backend: "PaddleCloud" backend: "PaddleCloud"
cluster_type: mpi # k8s 可选 cluster_type: mpi # k8s可选
config: config:
# 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2
paddle_version: "1.7.2" paddle_version: "1.7.2"
# hdfs/afs的配置信息填写 # hdfs/afs的配置信息填写
...@@ -229,9 +296,45 @@ submit: ...@@ -229,9 +296,45 @@ submit:
下面是一个利用PaddleCloud提交K8S集群进行GPU训练的`backend.yaml`示例 下面是一个利用PaddleCloud提交K8S集群进行GPU训练的`backend.yaml`示例
首先调整`config.yaml`
```yaml
workspace: "./"
mode: [collective_cluster]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
data_path: "{workspace}/afs/挂载数据文件夹的路径"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
runner:
- name: collective_cluster
class: cluster_train
epochs: 2
device: gpu
fleet_mode: collective
save_checkpoint_interval: 1 # save model interval of epochs
save_checkpoint_path: "increment_dnn" # save checkpoint path
init_model_path: "" # load model path
print_interval: 1
phases: [phase1]
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
thread_num: 1
```
再增加`backend.yaml`
```yaml ```yaml
backend: "PaddleCloud" backend: "PaddleCloud"
cluster_type: mpi # k8s 可选 cluster_type: k8s # mpi 可选
config: config:
# 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2 # 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2
...@@ -271,9 +374,93 @@ submit: ...@@ -271,9 +374,93 @@ submit:
# for k8s gpu # for k8s gpu
# k8s gpu 模式下,训练节点数,及每个节点上的GPU卡数 # k8s gpu 模式下,训练节点数,及每个节点上的GPU卡数
k8s_trainers: 2 k8s_trainers: 2
k8s_cpu_cores: 4
k8s_gpu_card: 1 k8s_gpu_card: 1
``` ```
#### K8S集群的PS-CPU模式配置
下面是一个利用PaddleCloud提交K8S集群进行参数服务器CPU训练的`backend.yaml`示例
首先调整`config.yaml`:
```yaml
workspace: "./"
mode: [ps_cluster]
dataset:
- name: dataloader_train
batch_size: 2
type: DataLoader
data_path: "{workspace}/afs/挂载数据文件夹的路径"
sparse_slots: "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
dense_slots: "dense_var:13"
runner:
- name: ps_cluster
class: cluster_train
epochs: 2
device: cpu
fleet_mode: ps
save_checkpoint_interval: 1
save_checkpoint_path: "increment_dnn"
init_model_path: ""
print_interval: 1
phases: [phase1]
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: dataloader_train
thread_num: 1
```
再新增`backend.yaml`
```yaml
backend: "PaddleCloud"
cluster_type: k8s # mpi 可选
config:
# 填写任务运行的paddle官方版本号 >= 1.7.2, 默认1.7.2
paddle_version: "1.7.2"
# hdfs/afs的配置信息填写
fs_name: "afs://xxx.com"
fs_ugi: "usr,pwd"
# 填任务输出目录的远程地址,如afs:/user/your/path/ 则此处填 /user/your/path
output_path: ""
# for k8s
# 填远程挂载地址,如afs:/user/your/path/ 则此处填 /user/your/path
afs_remote_mount_point: ""
submit:
# PaddleCloud 个人信息 AK 及 SK
ak: ""
sk: ""
# 任务运行优先级,默认high
priority: "high"
# 任务名称
job_name: "PaddleRec_CTR"
# 训练资源所在组
group: ""
# 节点上的任务启动命令
start_cmd: "python -m paddlerec.run -m ./config.yaml"
# 本地需要上传到节点工作目录的文件
files: ./*.py ./*.yaml
# for k8s gpu
# k8s ps-cpu 模式下,训练节点数,参数服务器节点数,及每个节点上的cpu核心数及内存限制
k8s_trainers: 2
k8s_cpu_cores: 4
k8s_ps_num: 2
k8s_ps_cores: 4
```
### 第四步:任务提交 ### 第四步:任务提交
当我们准备好`config.yaml``backend.yaml`,便可以进行一键任务提交,命令为: 当我们准备好`config.yaml``backend.yaml`,便可以进行一键任务提交,命令为:
......
# 如何给模型增加Metric
## PaddleRec Metric使用示例
```
from paddlerec.core.model import ModelBase
from paddlerec.core.metrics import RecallK
class Model(ModelBase):
def __init__(self, config):
ModelBase.__init__(self, config)
def net(self, inputs, is_infer=False):
...
acc = RecallK(input=logits, label=label, k=20)
self._metrics["Train_P@20"] = acc
```
## Metric类
### 成员变量
> _global_metric_state_vars(dict),
字典类型,用以存储metric计算过程中需要的中间状态变量。一般情况下,这些中间状态需要是Persistable=True的变量,所以在模型保存的时候也会被保存下来。因此infer阶段需手动将这些中间状态值清零,进而保证预测结果的正确性。
### 成员函数
> clear(self, scope):
从scope中将self._global_metric_state_vars中的状态值全清零。该函数一般用在**infer**阶段开始的时候。用以保证预测指标的正确性。
> calc_global_metrics(self, fleet, scope=None):
将self._global_metric_state_vars中的状态值在所有训练节点上做all_reduce操作,进而下一步调用_calculate()函数计算全局指标。若fleet=None,则all_reduce的结果为自己本身,即单机全局指标计算。
> get_result(self): 返回训练过程中需要fetch,并定期打印至屏幕的变量。返回类型为dict。
## Metrics
### AUC
> AUC(input ,label, curve='ROC', num_thresholds=2**12 - 1, topk=1, slide_steps=1)
AUC,全称Area Under the Curve,该指标根据前向输出和标签计算AUC,在二分类(binary classification)任务中广泛使用。相关定义参考 https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve 。
#### 参数
- **input(Tensor|LoDTensor)**: 数据类型为float32,float64。浮点二维变量。输入为网络的预测值。shape为[batch_size, 2]。
- **label(Tensor|LoDTensor)**: 数据类型为int64,int32。输入为数据集的标签。shape为[batch_size, 1]。
- **curve(str)**: 曲线类型,可以为 ROC 或 PR,默认 ROC。
- **num_thresholds(int)**: 将roc曲线离散化时使用的临界值数。默认2**12-1。
- **topk(int)**: 取topk的输出值用于计算。
- **slide_steps(int)**: 当计算batch auc时,不仅使用当前步,也使用先前步的数据。slide_steps=1,表示只用当前步;slide_steps=3,表示用当前步和前两步;slide_steps=0,则用所有步。
#### 返回值
该指标在训练过程中定期打印的变量有两个:
- **AUC**: 整体AUC值
- **BATCH_AUC**:当前batch的AUC值
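在组网的`net()`中使用AUC的一个极简示意如下(`predict`为shape=[batch_size, 2]的二分类预测输出,`label`为shape=[batch_size, 1]的整型标签,变量名均为假设):

```
from paddlerec.core.metrics import AUC

# predict: 二分类softmax输出; label: int64标签; 两者均来自组网, 此处仅为假设
auc = AUC(input=predict, label=label)
self._metrics["AUC"] = auc
```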
### PrecisionRecall
> PrecisionRecall(input, label, class_num)
计算precision、recall、F1。
#### 参数
- **input(Tensor|LoDTensor)**: 数据类型为float32,float64。输入为网络的预测值。shape为[batch_size, class_num]
- **label(Tensor|LoDTensor)**: 数据类型为int32。输入为数据集的标签。shape为 [batch_size, 1]
- **class_num(int)**: 类别个数。
#### 返回值
- **[TP FP TN FN]**: 形状为[class_num, 4]的变量,用以表征每种类型的TP,FP,TN和FN值。TP=true positive, FP=false positive, TN=true negative, FN=false negative。若需计算每种类型的precision, recall, F1, 则可根据如下公式进行计算:
precision = TP / (TP + FP); recall = TP / (TP + FN); F1 = 2 * precision * recall / (precision + recall)。
- **precision_recall_f1**: 形状为[6],分别代表[macro_avg_precision, macro_avg_recall, macro_avg_f1, micro_avg_precision, micro_avg_recall, micro_avg_f1],这里macro代表先计算每种类型的准确率,召回率,F1,然后求平均;micro代表先计算所有类型的整体TP,TN,FP,FN等中间值,然后再计算准确率,召回率,F1。
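下面是根据上述[TP FP TN FN]统计值计算各类别precision/recall/F1的一个极简示意(numpy写法,`stat`为取出的[class_num, 4]统计矩阵,函数与变量名均为假设):

```
import numpy as np

def per_class_prf(stat):
    # stat: shape=[class_num, 4], 每行为 [TP, FP, TN, FN]
    tp, fp, tn, fn = stat[:, 0], stat[:, 1], stat[:, 2], stat[:, 3]
    precision = tp / np.maximum(tp + fp, 1e-6)
    recall = tp / np.maximum(tp + fn, 1e-6)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-6)
    return precision, recall, f1
```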
### RecallK
> RecallK(input, label, k=20)
TopK的召回准确率,对于任意一条样本来说,若前top_k个分类结果中包含正确分类标签,则视为正样本。
#### 参数
- **input(Tensor|LoDTensor)**: 数据类型为float32,float64。输入为网络的预测值。shape为[batch_size, class_dim]
- **label(Tensor|LoDTensor)**: 数据类型为int64,int32。输入为数据集的标签。shape为 [batch_size, 1]
- **k(int)**: 取每个类别中top_k个预测值用于计算召回准确率。
#### 返回值
- **InsCnt**:样本总数
- **RecallCnt**: topk可以正确被召回的样本数
- **Acc(Recall@k)**: RecallCnt/InsCnt,即Topk召回准确率。
### PairWise_PN
> PosNegRatio(pos_score, neg_score)
正逆序指标,一般用在输入是pairwise的模型中。例如输入既包含正样本,也包含负样本,模型需要去学习最大化正负样本打分的差异。
#### 参数
- **pos_score(Tensor|LoDTensor)**: 正样本的打分,数据类型为float32,float64。浮点二维变量,值的范围为[0,1]。
- **neg_score(Tensor|LoDTensor)**:负样本的打分。数据类型为float32,float64。浮点二维变量,值的范围为[0,1]。
#### 返回值
- **RightCnt**: pos_score > neg_score的样本数
- **WrongCnt**: pos_score <= neg_score的样本数
- **PN**: (RightCnt + 1.0) / (WrongCnt + 1.0), 正逆序,+1.0是为了避免除0错误。
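在pairwise组网中使用该指标的一个极简示意如下(`pos_score`/`neg_score`为组网中对正、负样本的打分变量,名称仅为假设):

```
from paddlerec.core.metrics import PosNegRatio

pn = PosNegRatio(pos_score=pos_score, neg_score=neg_score)
self._metrics["PN"] = pn
```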
### Customized_Metric
如果你需要自定义metric,那么需要按如下步骤操作:
1. 继承paddlerec.core.Metric,定义你的MyMetric类。
2. 在MyMetric的构造函数中,自定义Metric组网,声明self._global_metric_state_vars私有变量。
3. 定义_calculate(global_metrics),全局指标计算。该函数的输入global_metrics,存储了self._global_metric_state_vars中所有中间状态变量的全局统计值。最终结果以str格式返回。
自定义Metric模版如下,你可以参考注释,或paddlerec.core.metrics下已经实现的precision_recall, auc, pairwise_pn, recall_k等指标的计算方式,自定义自己的Metric类。
```
from paddlerec.core.Metric import Metric
class MyMetric(Metric):
def __init__(self):
# 1. 自定义Metric组网
** 1. your code **
# 2. 设置中间状态字典
self._global_metric_state_vars = dict()
** 2. your code **
def get_result(self):
# 3. 定义训练过程中需要打印的变量,以字典格式返回
self._metrics = dict()
** 3. your code **
def _calculate(self, global_metrics):
# 4. 全局指标计算,global_metrics为字典类型,存储了self._global_metric_state_vars中所有中间状态变量的全局统计值。返回格式为str。
** your code **
```
...@@ -92,7 +92,7 @@ def input_data(self, is_infer=False, **kwargs): ...@@ -92,7 +92,7 @@ def input_data(self, is_infer=False, **kwargs):
return train_inputs return train_inputs
``` ```
更多数据读取教程,请参考[自定义数据集及Reader](custom_dataset_reader.md) 更多数据读取教程,请参考[自定义数据集及Reader](custom_reader.md)
### 组网的定义 ### 组网的定义
...@@ -113,6 +113,8 @@ def input_data(self, is_infer=False, **kwargs): ...@@ -113,6 +113,8 @@ def input_data(self, is_infer=False, **kwargs):
可以参考官方模型的示例学习net的构造方法。 可以参考官方模型的示例学习net的构造方法。
除可以使用Paddle的Metrics接口外,PaddleRec也统一封装了一些常见的Metrics评价指标,并允许开发者定义自己的Metrics类,相关文件参考[Metrics开发文档](metrics.md)
## 如何运行自定义模型 ## 如何运行自定义模型
记录`model.py`,`config.yaml`及数据读取`reader.py`的文件路径,建议置于同一文件夹下,如`/home/custom_model`下,更改`config.yaml`中的配置选项 记录`model.py`,`config.yaml`及数据读取`reader.py`的文件路径,建议置于同一文件夹下,如`/home/custom_model`下,更改`config.yaml`中的配置选项
......
# PaddleRec 预训练模型
PaddleRec基于业务实践,使用真实数据,产出了推荐领域算法的若干预训练模型,方便开发者进行算法调研。
## 文本分类预训练模型
### 获取地址
```bash
wget xxx.tar.gz
```
### 使用方法
解压后,得到的是一个paddle的模型文件夹,使用`PaddleRec/models/contentunderstanding/classification_finetue`模型进行加载
...@@ -20,7 +20,7 @@ python -m paddlerec.run -m paddlerec.models.xxx.yyy ...@@ -20,7 +20,7 @@ python -m paddlerec.run -m paddlerec.models.xxx.yyy
例如启动`recall`下的`word2vec`模型的默认配置; 例如启动`recall`下的`word2vec`模型的默认配置;
```shell ```shell
python -m paddlerec.run -m paddlerec.models.recall.word2vec python -m paddlerec.run -m models/recall/word2vec
``` ```
### 2. 启动内置模型的个性化配置训练 ### 2. 启动内置模型的个性化配置训练
......
# PaddleRec yaml配置说明 # PaddleRec config.yaml配置说明
## 全局变量 ## 全局变量
...@@ -12,31 +12,31 @@ ...@@ -12,31 +12,31 @@
## runner变量 ## runner变量
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 | | 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :---------------------------: | :----------: | :-------------------------------------------: | :------: | :------------------------------------------------------------------: | | :---------------------------: | :----------: | :-------------------------------------------------------: | :------: | :------------------------------------------------------------------: |
| name | string | 任意 | 是 | 指定runner名称 | | name | string | 任意 | 是 | 指定runner名称 |
| class | string | train(默认) / infer / local_cluster_train / cluster_train | 是 | 指定运行runner的类别(单机/分布式, 训练/预测) | | class | string | train(默认) / infer / local_cluster_train / cluster_train | 是 | 指定运行runner的类别(单机/分布式, 训练/预测) |
| device | string | cpu(默认) / gpu | 否 | 程序执行设备 | | device | string | cpu(默认) / gpu | 否 | 程序执行设备 |
| fleet_mode | string | ps(默认) / pslib / collective | 否 | 分布式运行模式 | | fleet_mode | string | ps(默认) / pslib / collective | 否 | 分布式运行模式 |
| selected_gpus | string | "0"(默认) | 否 | 程序运行GPU卡号,若以"0,1"的方式指定多卡,则会默认启用collective模式 | | selected_gpus | string | "0"(默认) | 否 | 程序运行GPU卡号,若以"0,1"的方式指定多卡,则会默认启用collective模式 |
| worker_num | int | 1(默认) | 否 | 参数服务器模式下worker的数量 | | worker_num | int | 1(默认) | 否 | 参数服务器模式下worker的数量 |
| server_num | int | 1(默认) | 否 | 参数服务器模式下server的数量 | | server_num | int | 1(默认) | 否 | 参数服务器模式下server的数量 |
| distribute_strategy | string | async(默认)/sync/half_async/geo | 否 | 参数服务器模式下训练模式的选择 | | distribute_strategy | string | async(默认)/sync/half_async/geo | 否 | 参数服务器模式下训练模式的选择 |
| epochs | int | >= 1 | 否 | 模型训练迭代轮数 | | epochs | int | >= 1 | 否 | 模型训练迭代轮数 |
| phases | list[string] | 由phase name组成的list | 否 | 当前runner的训练过程列表,顺序执行 | | phases | list[string] | 由phase name组成的list | 否 | 当前runner的训练过程列表,顺序执行 |
| init_model_path | string | 路径 | 否 | 初始化模型地址 | | init_model_path | string | 路径 | 否 | 初始化模型地址 |
| save_checkpoint_interval | int | >= 1 | 否 | Save参数的轮数间隔 | | save_checkpoint_interval | int | >= 1 | 否 | Save参数的轮数间隔 |
| save_checkpoint_path | string | 路径 | 否 | Save参数的地址 | | save_checkpoint_path | string | 路径 | 否 | Save参数的地址 |
| save_inference_interval | int | >= 1 | 否 | Save预测模型的轮数间隔 | | save_inference_interval | int | >= 1 | 否 | Save预测模型的轮数间隔 |
| save_inference_path | string | 路径 | 否 | Save预测模型的地址 | | save_inference_path | string | 路径 | 否 | Save预测模型的地址 |
| save_inference_feed_varnames | list[string] | 组网中指定Variable的name | 否 | 预测模型的入口变量name | | save_inference_feed_varnames | list[string] | 组网中指定Variable的name | 否 | 预测模型的入口变量name |
| save_inference_fetch_varnames | list[string] | 组网中指定Variable的name | 否 | 预测模型的出口变量name | | save_inference_fetch_varnames | list[string] | 组网中指定Variable的name | 否 | 预测模型的出口变量name |
| print_interval | int | >= 1 | 否 | 训练指标打印batch间隔 | | print_interval | int | >= 1 | 否 | 训练指标打印batch间隔 |
| instance_class_path | string | 路径 | 否 | 自定义instance流程实现的地址 | | instance_class_path | string | 路径 | 否 | 自定义instance流程实现的地址 |
| network_class_path | string | 路径 | 否 | 自定义network流程实现的地址 | | network_class_path | string | 路径 | 否 | 自定义network流程实现的地址 |
| startup_class_path | string | 路径 | 否 | 自定义startup流程实现的地址 | | startup_class_path | string | 路径 | 否 | 自定义startup流程实现的地址 |
| runner_class_path | string | 路径 | 否 | 自定义runner流程实现的地址 | | runner_class_path | string | 路径 | 否 | 自定义runner流程实现的地址 |
| terminal_class_path | string | 路径 | 否 | 自定义terminal流程实现的地址 | | terminal_class_path | string | 路径 | 否 | 自定义terminal流程实现的地址 |
...@@ -70,3 +70,55 @@ ...@@ -70,3 +70,55 @@
| optimizer.learning_rate | float | > 0 | 否 | 指定学习率 | | optimizer.learning_rate | float | > 0 | 否 | 指定学习率 |
| reg | float | > 0 | 否 | L2正则化参数,只在SGD下生效 | | reg | float | > 0 | 否 | L2正则化参数,只在SGD下生效 |
| others | / | / | / | 由各个模型组网独立指定 | | others | / | / | / | 由各个模型组网独立指定 |
# PaddleRec backend.yaml配置说明
## 全局变量
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :----------: | :----: | :-------------: | :------: | :----------------------------------------------: |
| backend | string | paddlecloud/k8s | 是 | 使用PaddleCloud平台提交,还是在公有云K8S集群提交 |
| cluster_type | string | mpi/k8s | 是 | 指定运行的计算集群: mpi 还是 k8s |
## config
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :--------------------: | :----: | :-------------------------------------: | :------: | :------------------------------------------------------------------------------------------: |
| paddle_version | string | paddle官方版本号,如1.7.2/1.8.0/1.8.3等 | 否 | 指定运行训练使用的Paddle版本,默认1.7.2 |
| use_python3 | int | 0(默认)/1 | 否 | 指定是否使用python3进行训练 |
| fs_name | string | "afs://xxx.com" | 是 | hdfs/afs集群名称所需配置 |
| fs_ugi | string | "usr,pwd" | 是 | hdfs/afs集群密钥所需配置 |
| output_path | string | "/user/your/path" | 否 | 任务输出的远程目录 |
| train_data_path | string | "/user/your/path" | 是 | mpi集群下指定训练数据路径,paddlecloud会自动将数据分片并下载到工作目录的`./train_data`文件夹 |
| test_data_path | string | "/user/your/path" | 否 | mpi集群下指定测试数据路径,会自动下载到工作目录的`./test_data`文件夹 |
| thirdparty_path | string | "/user/your/path" | 否 | mpi集群下指定thirdparty路径,会自动下载到工作目录的`./thirdparty`文件夹 |
| afs_remote_mount_point | string | "/user/your/path" | 是 | k8s集群下指定远程路径的地址,会挂载到工作目录的`./afs/下` |
### config.communicator
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :----------------------------------------------: | :---: | :------------: | :------: | :----------------------------------------------------: |
| FLAGS_communicator_is_sgd_optimizer | int | 0(默认)/1 | 否 | 异步分布式训练时的多线程的梯度融合方式是否使用SGD模式 |
| FLAGS_communicator_send_queue_size | int | 线程数(默认) | 否 | 分布式训练时发送队列的大小 |
| FLAGS_communicator_max_merge_var_num | int | 线程数(默认) | 否 | 分布式训练多线程梯度融合时,线程数的配置 |
| FLAGS_communicator_max_send_grad_num_before_recv | int | 线程数(默认) | 否 | 分布式训练使用独立recv参数线程时,与send的步调配置超参 |
| FLAGS_communicator_thread_pool_size | int | 32(默认) | 否 | 分布式训练时,多线程发送参数的线程池大小 |
| FLAGS_communicator_fake_rpc | int | 0(默认)/1 | 否 | 分布式训练时,选择不进行通信 |
| FLAGS_rpc_retry_times | int | 3(默认) | 否 | 分布式训练时,GRPC的失败重试次数 |
## submit
| 名称 | 类型 | 取值 | 是否必须 | 作用描述 |
| :-----------: | :----: | :-------------------------: | :------: | :------------------------------------------------------: |
| ak | string | PaddleCloud平台提供的ak密钥 | 是 | paddlecloud用户配置 |
| sk | string | PaddleCloud平台提供的sk密钥 | 否 | paddlecloud用户配置 |
| priority | string | normal/high/very_high | 否 | 任务优先级 |
| job_name | string | 任意 | 是 | 任务名称 |
| group | string | 计算资源所在组名称 | 是 | 组名称 |
| start_cmd | string | 任意 | 是 | 启动命令,默认`python -m paddlerec.run -m ./config.yaml` |
| files | string | 任意 | 是 | 随任务提交上传的文件,给出相对或绝对路径 |
| nodes | int | >=1(默认1) | 否 | mpi集群下的节点数 |
| k8s_trainers | int | >=1(默认1) | 否 | k8s集群下worker的节点数 |
| k8s_cpu_cores | int | >=1(默认1) | 否 | k8s集群下worker的CPU核数 |
| k8s_gpu_card | int | >=1(默认1) | 否 | k8s集群下worker的GPU卡数 |
| k8s_ps_num | int | >=1(默认1) | 否 | k8s集群下server的节点数 |
| k8s_ps_cores | int | >=1(默认1) | 否 | k8s集群下server的CPU核数 |
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.contentunderstanding.classification" workspace: "models/contentunderstanding/classification"
dataset: dataset:
- name: data1 - name: data1
......
...@@ -39,8 +39,11 @@ ...@@ -39,8 +39,11 @@
##使用教程(快速开始) ##使用教程(快速开始)
``` ```
python -m paddlerec.run -m paddlerec.models.contentunderstanding.tagspace git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
python -m paddlerec.run -m paddlerec.models.contentunderstanding.classification cd paddle-rec
python -m paddlerec.run -m models/contentunderstanding/tagspace/config.yaml
python -m paddlerec.run -m models/contentunderstanding/classification/config.yaml
``` ```
## 使用教程(复现论文) ## 使用教程(复现论文)
......
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.contentunderstanding.tagspace" workspace: "models/contentunderstanding/tagspace"
dataset: dataset:
- name: sample_1 - name: sample_1
......
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.demo.movie_recommand" workspace: "models/demo/movie_recommand"
# list of dataset # list of dataset
dataset: dataset:
......
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.demo.movie_recommand" workspace: "models/demo/movie_recommand"
# list of dataset # list of dataset
dataset: dataset:
......
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.match.dssm" workspace: "models/match/dssm"
dataset: dataset:
- name: dataset_train - name: dataset_train
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
workspace: "models/match/match-pyramid"
dataset:
- name: dataset_train
batch_size: 128
type: DataLoader
data_path: "{workspace}/data/train"
data_converter: "{workspace}/train_reader.py"
- name: dataset_infer
batch_size: 1
type: DataLoader
data_path: "{workspace}/data/test"
data_converter: "{workspace}/test_reader.py"
hyper_parameters:
optimizer:
class: adam
learning_rate: 0.001
strategy: async
emb_path: "./data/embedding.npy"
sentence_left_size: 20
sentence_right_size: 500
vocab_size: 193368
emb_size: 50
kernel_num: 8
hidden_size: 20
hidden_act: "relu"
out_size: 1
channels: 1
conv_filter: [2,10]
conv_act: "relu"
pool_size: [6,50]
pool_stride: [6,50]
pool_type: "max"
pool_padding: "VALID"
mode: [train_runner , infer_runner]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: train_runner
class: train
# num of epochs
epochs: 2
# device to run training or infer
device: cpu
save_checkpoint_interval: 1 # save model interval of epochs
save_inference_interval: 1 # save inference
save_checkpoint_path: "inference" # save checkpoint path
save_inference_path: "inference" # save inference path
save_inference_feed_varnames: [] # feed vars of save inference
save_inference_fetch_varnames: [] # fetch vars of save inference
init_model_path: "" # load model path
print_interval: 2
phases: phase_train
- name: infer_runner
class: infer
# device to run training or infer
device: cpu
print_interval: 1
init_model_path: "inference/1" # load model path
phases: phase_infer
# runner will run all the phase in each epoch
phase:
- name: phase_train
model: "{workspace}/model.py" # user-defined model
dataset_name: dataset_train # select dataset by name
thread_num: 1
- name: phase_infer
model: "{workspace}/model.py" # user-defined model
dataset_name: dataset_infer # select dataset by name
thread_num: 1
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy as np
import random
# Read Word Dict and Inverse Word Dict
def read_word_dict(filename):
word_dict = {}
for line in open(filename):
line = line.strip().split()
word_dict[int(line[1])] = line[0]
print('[%s]\n\tWord dict size: %d' % (filename, len(word_dict)))
return word_dict
# Read Embedding File
def read_embedding(filename):
embed = {}
for line in open(filename):
line = line.strip().split()
embed[int(line[0])] = list(map(float, line[1:]))
print('[%s]\n\tEmbedding size: %d' % (filename, len(embed)))
return embed
# Convert Embedding Dict 2 numpy array
def convert_embed_2_numpy(embed_dict, embed=None):
for k in embed_dict:
embed[k] = np.array(embed_dict[k])
print('Generate numpy embed:', embed.shape)
return embed
# Read Data
def read_data(filename):
data = {}
for line in open(filename):
line = line.strip().split()
data[line[0]] = list(map(int, line[2:]))
print('[%s]\n\tData size: %s' % (filename, len(data)))
return data
# Read Relation Data
def read_relation(filename):
data = []
for line in open(filename):
line = line.strip().split()
data.append((int(line[0]), line[1], line[2]))
print('[%s]\n\tInstance size: %s' % (filename, len(data)))
return data
Letor07Path = "./data"
word_dict = read_word_dict(filename=os.path.join(Letor07Path, 'word_dict.txt'))
query_data = read_data(filename=os.path.join(Letor07Path, 'qid_query.txt'))
doc_data = read_data(filename=os.path.join(Letor07Path, 'docid_doc.txt'))
embed_dict = read_embedding(filename=os.path.join(Letor07Path,
'embed_wiki-pdc_d50_norm'))
_PAD_ = len(word_dict) #193367
embed_dict[_PAD_] = np.zeros((50, ), dtype=np.float32)
word_dict[_PAD_] = '[PAD]'
W_init_embed = np.float32(np.random.uniform(-0.02, 0.02, [len(word_dict), 50]))
embedding = convert_embed_2_numpy(embed_dict, embed=W_init_embed)
np.save("embedding.npy", embedding)
batch_size = 64
data1_maxlen = 20
data2_maxlen = 500
embed_size = 50
train_iters = 2500
def make_train():
rel_set = {}
pair_list = []
rel = read_relation(filename=os.path.join(Letor07Path,
'relation.train.fold1.txt'))
for label, d1, d2 in rel:
if d1 not in rel_set:
rel_set[d1] = {}
if label not in rel_set[d1]:
rel_set[d1][label] = []
rel_set[d1][label].append(d2)
for d1 in rel_set:
label_list = sorted(rel_set[d1].keys(), reverse=True)
for hidx, high_label in enumerate(label_list[:-1]):
for low_label in label_list[hidx + 1:]:
for high_d2 in rel_set[d1][high_label]:
for low_d2 in rel_set[d1][low_label]:
pair_list.append((d1, high_d2, low_d2))
print('Pair Instance Count:', len(pair_list))
f = open("./data/train/train.txt", "w")
for batch in range(800):
X1 = np.zeros((batch_size * 2, data1_maxlen), dtype=np.int32)
X2 = np.zeros((batch_size * 2, data2_maxlen), dtype=np.int32)
X1[:] = _PAD_
X2[:] = _PAD_
for i in range(batch_size):
d1, d2p, d2n = random.choice(pair_list)
d1_len = min(data1_maxlen, len(query_data[d1]))
d2p_len = min(data2_maxlen, len(doc_data[d2p]))
d2n_len = min(data2_maxlen, len(doc_data[d2n]))
X1[i, :d1_len] = query_data[d1][:d1_len]
X2[i, :d2p_len] = doc_data[d2p][:d2p_len]
X1[i + batch_size, :d1_len] = query_data[d1][:d1_len]
X2[i + batch_size, :d2n_len] = doc_data[d2n][:d2n_len]
for i in range(batch_size * 2):
q = [str(x) for x in list(X1[i])]
d = [str(x) for x in list(X2[i])]
f.write(",".join(q) + "\t" + ",".join(d) + "\n")
f.close()
def make_test():
rel = read_relation(filename=os.path.join(Letor07Path,
'relation.test.fold1.txt'))
f = open("./data/test/test.txt", "w")
for label, d1, d2 in rel:
X1 = np.zeros(data1_maxlen, dtype=np.int32)
X2 = np.zeros(data2_maxlen, dtype=np.int32)
X1[:] = _PAD_
X2[:] = _PAD_
d1_len = min(data1_maxlen, len(query_data[d1]))
d2_len = min(data2_maxlen, len(doc_data[d2]))
X1[:d1_len] = query_data[d1][:d1_len]
X2[:d2_len] = doc_data[d2][:d2_len]
q = [str(x) for x in list(X1)]
d = [str(x) for x in list(X2)]
f.write(",".join(q) + "\t" + ",".join(d) + "\t" + str(label) + "\t" +
d1 + "\n")
f.close()
make_train()
make_test()
2 9639 GX099-60-3149248
1 9639 GX028-47-6554966
1 9639 GX031-84-2802741
1 9639 GX031-86-1702683
1 9639 GX031-89-11392170
1 9639 GX035-46-10142187
1 9639 GX039-07-1333080
1 9639 GX040-05-15096071
1 9639 GX045-35-10693225
1 9639 GX045-74-6226888
1 9639 GX046-31-8871083
1 9639 GX046-56-6274894
1 9639 GX050-09-14629105
1 9639 GX097-05-12714275
1 9639 GX101-06-7768196
1 9639 GX124-50-4934142
1 9639 GX259-01-13320140
1 9639 GX259-50-8109630
1 9639 GX259-72-16176934
1 9639 GX259-98-7821925
1 9639 GX260-27-13260880
1 9639 GX260-54-6363694
1 9639 GX260-78-6999656
1 9639 GX261-04-0843988
1 9639 GX261-23-4964814
0 9639 GX021-75-7026755
0 9639 GX021-80-16449591
0 9639 GX025-40-7135810
0 9639 GX031-89-9020252
0 9639 GX037-45-0533209
0 9639 GX038-17-11223353
0 9639 GX057-07-13335832
0 9639 GX081-50-12756687
0 9639 GX124-43-2364716
0 9639 GX129-60-0000000
0 9639 GX219-07-7475581
0 9639 GX233-90-7976935
0 9639 GX267-49-2983064
0 9639 GX267-74-2413254
0 9639 GX270-05-13614294
1 9329 GX234-05-0812081
0 9329 GX000-00-0000000
0 9329 GX008-50-3899336
0 9329 GX011-75-8470249
0 9329 GX020-42-13388867
0 9329 GX024-91-8520306
0 9329 GX026-88-6087429
0 9329 GX027-22-1703847
0 9329 GX034-11-2617393
0 9329 GX036-02-7994497
0 9329 GX046-08-13858054
0 9329 GX059-85-11403109
0 9329 GX099-37-0232298
0 9329 GX099-46-11473306
0 9329 GX108-04-9589788
0 9329 GX110-50-11723940
0 9329 GX124-11-4119164
0 9329 GX149-82-15204191
0 9329 GX165-95-6198495
0 9329 GX225-56-4184936
0 9329 GX229-57-4487470
0 9329 GX230-37-4125963
0 9329 GX231-40-14574318
0 9329 GX238-44-10302536
0 9329 GX239-85-8572461
0 9329 GX244-17-10154048
0 9329 GX245-16-4169590
0 9329 GX245-46-6341859
0 9329 GX246-91-8487173
0 9329 GX262-88-13259441
0 9329 GX263-41-4135561
0 9329 GX264-07-6385713
0 9329 GX264-38-12253757
0 9329 GX264-90-15990025
0 9329 GX265-89-6212449
0 9329 GX268-41-12034794
0 9329 GX268-83-5140660
0 9329 GX270-46-0293828
0 9329 GX270-64-11852140
0 9329 GX271-10-12458597
2 9326 GX272-03-6610348
1 9326 GX011-12-0595978
0 9326 GX000-00-0000000
0 9326 GX000-38-9492606
0 9326 GX000-84-4587136
0 9326 GX002-41-5566464
0 9326 GX002-51-2615036
0 9326 GX004-56-12238694
0 9326 GX004-72-2476906
0 9326 GX008-13-1835206
0 9326 GX008-64-7705528
0 9326 GX009-87-0976731
0 9326 GX012-24-7688369
0 9326 GX012-96-8727608
0 9326 GX023-87-16736657
0 9326 GX025-21-11820239
0 9326 GX025-22-15113698
0 9326 GX025-51-13959128
0 9326 GX025-57-11414648
0 9326 GX025-64-7587631
0 9326 GX027-62-4542881
0 9326 GX031-25-4759403
0 9326 GX036-10-7902858
0 9326 GX047-04-9457544
0 9326 GX047-06-4014803
0 9326 GX048-00-15113058
0 9326 GX048-02-12975919
0 9326 GX048-78-3273874
0 9326 GX235-35-0963257
0 9326 GX235-98-3789570
0 9326 GX236-51-15473637
0 9326 GX237-96-0892713
0 9326 GX239-35-7413891
0 9326 GX239-95-0176537
0 9326 GX251-34-10377030
0 9326 GX254-19-11374782
0 9326 GX260-63-10533444
0 9326 GX265-94-14886230
0 9326 GX269-78-1500497
0 9326 GX270-59-10270517
2 8946 GX046-79-6984659
2 8946 GX148-33-1869479
2 8946 GX252-36-12638222
1 8946 GX017-47-13290921
1 8946 GX030-69-3218092
1 8946 GX034-82-4550348
1 8946 GX044-01-9283107
1 8946 GX047-98-6660623
1 8946 GX057-96-12580825
1 8946 GX059-94-12068143
1 8946 GX060-13-13600036
1 8946 GX060-74-6594973
1 8946 GX093-08-1158999
0 8946 GX000-00-0000000
0 8946 GX000-42-15811803
0 8946 GX000-81-16418910
0 8946 GX008-38-10557859
0 8946 GX011-01-10891808
0 8946 GX013-71-5708874
0 8946 GX015-72-4458924
0 8946 GX023-91-9869060
0 8946 GX027-56-6376748
0 8946 GX037-11-10829529
0 8946 GX038-55-0681330
0 8946 GX043-86-4200105
0 8946 GX047-52-3712485
0 8946 GX053-77-4836617
0 8946 GX070-62-1070063
0 8946 GX105-53-13372327
0 8946 GX218-61-6263172
0 8946 GX223-72-13625320
0 8946 GX230-68-14727182
0 8946 GX235-34-7733230
0 8946 GX251-73-0159347
0 8946 GX254-47-1098586
0 8946 GX263-76-6934681
0 8946 GX263-84-8668756
0 8946 GX264-70-14223639
0 8946 GX269-12-5910753
0 8946 GX271-93-9895614
1 9747 GX006-77-1973537
1 9747 GX244-83-8716953
1 9747 GX269-92-7189826
0 9747 GX000-00-0000000
0 9747 GX001-51-8693413
0 9747 GX003-10-2820641
0 9747 GX003-74-0557776
0 9747 GX003-79-13695689
0 9747 GX009-57-0938999
0 9747 GX009-59-8595527
0 9747 GX009-80-10629348
0 9747 GX010-37-0206372
0 9747 GX013-46-2187318
0 9747 GX014-58-4004859
0 9747 GX015-79-5393654
0 9747 GX032-50-7316370
0 9747 GX049-33-2206612
0 9747 GX050-34-0439256
0 9747 GX062-76-0914936
0 9747 GX065-73-7392661
0 9747 GX148-27-15770966
0 9747 GX155-71-0504939
0 9747 GX229-75-14750078
0 9747 GX231-01-0640962
0 9747 GX236-45-15598812
0 9747 GX247-19-9516715
0 9747 GX247-34-4277646
0 9747 GX247-63-10766287
0 9747 GX248-23-15998266
0 9747 GX249-85-9742193
0 9747 GX250-31-7671617
0 9747 GX252-56-2141580
0 9747 GX253-15-3406713
0 9747 GX264-07-15838087
0 9747 GX264-43-6543997
0 9747 GX266-18-14688076
0 9747 GX267-50-2036010
0 9747 GX268-28-0548507
0 9747 GX269-49-14171555
0 9747 GX269-63-15607386
2 9740 GX005-94-14208849
2 9740 GX008-51-5639660
2 9740 GX012-37-2342061
2 9740 GX019-75-13916532
2 9740 GX074-76-16261807
2 9740 GX077-07-2951943
2 9740 GX229-28-11068981
2 9740 GX237-80-7497206
2 9740 GX257-53-10589749
2 9740 GX258-06-0611419
2 9740 GX268-55-9791226
1 9740 GX007-62-1126118
1 9740 GX015-78-0216468
1 9740 GX038-65-1678199
1 9740 GX041-25-14803324
1 9740 GX063-71-0401425
1 9740 GX077-08-15801730
1 9740 GX098-07-2885671
1 9740 GX135-28-6485892
1 9740 GX228-85-10518518
1 9740 GX231-93-11279468
1 9740 GX234-70-15061254
1 9740 GX236-31-11149347
1 9740 GX240-68-1184464
1 9740 GX248-03-7275316
1 9740 GX253-11-9846012
1 9740 GX255-05-10638500
1 9740 GX267-73-4450097
1 9740 GX269-19-0642640
0 9740 GX001-74-5132048
0 9740 GX001-88-2603815
0 9740 GX004-83-7935833
0 9740 GX007-01-16750210
0 9740 GX040-11-5249209
0 9740 GX042-38-2886005
0 9740 GX052-20-4359789
0 9740 GX067-74-3718011
0 9740 GX077-01-13481396
0 9740 GX242-92-8868913
0 9740 GX262-74-4596688
2 8835 GX010-99-5715419
2 8835 GX049-99-2518724
0 8835 GX000-00-0000000
0 8835 GX007-91-6779497
0 8835 GX008-14-0788708
0 8835 GX008-15-13942125
0 8835 GX011-58-14336551
0 8835 GX012-79-10684001
0 8835 GX013-00-10822427
0 8835 GX013-03-5962783
0 8835 GX015-54-0251701
0 8835 GX017-36-5859317
0 8835 GX017-60-0601078
0 8835 GX027-24-16202205
0 8835 GX030-11-15814183
0 8835 GX030-76-11969233
#!/bin/bash
echo "...........load data................."
wget --no-check-certificate 'https://paddlerec.bj.bcebos.com/match_pyramid/match_pyramid_data.tar.gz'
mv ./match_pyramid_data.tar.gz ./data
rm -rf ./data/relation.test.fold1.txt ./data/relation.train.fold1.txt
tar -xvf ./data/match_pyramid_data.tar.gz
echo "...........data process..............."
python ./data/process.py
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import numpy as np
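# Mean average precision (MAP) for one query: rank the candidates by predicted
# score (ties broken randomly via the shuffle below) and average the precision
# at every position whose ground-truth label is relevant.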
def eval_MAP(pred, gt):
map_value = 0.0
r = 0.0
c = list(zip(pred, gt))
random.shuffle(c)
c = sorted(c, key=lambda x: x[0], reverse=True)
for j, (p, g) in enumerate(c):
if g != 0:
r += 1
map_value += r / (j + 1.0)
if r == 0:
return 0.0
else:
return map_value / r
filename = './data/relation.test.fold1.txt'
gt = []
qid = []
f = open(filename, "r")
f.readline()
num = 0
for line in f.readlines():
num = num + 1
line = line.strip().split()
gt.append(int(line[0]))
qid.append(line[1])
f.close()
print(num)
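# result.txt is produced by run.sh, which greps the "prediction" lines out of
# the raw run log; each kept line is expected to carry a "prediction: [score]"
# fragment, so split on ',' and ':' and strip the brackets to recover the score.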
filename = './result.txt'
pred = []
for line in open(filename):
line = line.strip().split(",")
line[1] = line[1].split(":")
line = line[1][1].strip(" ")
line = line.strip("[")
line = line.strip("]")
pred.append(float(line))
result_dict = {}
for i in range(len(qid)):
if qid[i] not in result_dict:
result_dict[qid[i]] = []
result_dict[qid[i]].append([gt[i], pred[i]])
print(len(result_dict))
map = 0
for qid in result_dict:
gt = np.array(result_dict[qid])[:, 0]
pred = np.array(result_dict[qid])[:, 1]
map += eval_MAP(pred, gt)
map = map / len(result_dict)
print("map=", map)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import random
import numpy as np
import paddle
import paddle.fluid as fluid
from paddlerec.core.utils import envs
from paddlerec.core.model import ModelBase
class Model(ModelBase):
def __init__(self, config):
ModelBase.__init__(self, config)
def _init_hyper_parameters(self):
self.emb_path = envs.get_global_env("hyper_parameters.emb_path")
self.sentence_left_size = envs.get_global_env(
"hyper_parameters.sentence_left_size")
self.sentence_right_size = envs.get_global_env(
"hyper_parameters.sentence_right_size")
self.vocab_size = envs.get_global_env("hyper_parameters.vocab_size")
self.emb_size = envs.get_global_env("hyper_parameters.emb_size")
self.kernel_num = envs.get_global_env("hyper_parameters.kernel_num")
self.hidden_size = envs.get_global_env("hyper_parameters.hidden_size")
self.hidden_act = envs.get_global_env("hyper_parameters.hidden_act")
self.out_size = envs.get_global_env("hyper_parameters.out_size")
self.channels = envs.get_global_env("hyper_parameters.channels")
self.conv_filter = envs.get_global_env("hyper_parameters.conv_filter")
self.conv_act = envs.get_global_env("hyper_parameters.conv_act")
self.pool_size = envs.get_global_env("hyper_parameters.pool_size")
self.pool_stride = envs.get_global_env("hyper_parameters.pool_stride")
self.pool_type = envs.get_global_env("hyper_parameters.pool_type")
self.pool_padding = envs.get_global_env(
"hyper_parameters.pool_padding")
def input_data(self, is_infer=False, **kwargs):
sentence_left = fluid.data(
name="sentence_left",
shape=[-1, self.sentence_left_size, 1],
dtype='int64',
lod_level=0)
sentence_right = fluid.data(
name="sentence_right",
shape=[-1, self.sentence_right_size, 1],
dtype='int64',
lod_level=0)
return [sentence_left, sentence_right]
def embedding_layer(self, input):
"""
embedding layer
"""
if os.path.isfile(self.emb_path):
embedding_array = np.load(self.emb_path)
emb = fluid.layers.embedding(
input=input,
size=[self.vocab_size, self.emb_size],
padding_idx=0,
param_attr=fluid.ParamAttr(
name="word_embedding",
initializer=fluid.initializer.NumpyArrayInitializer(
embedding_array)))
else:
emb = fluid.layers.embedding(
input=input,
size=[self.vocab_size, self.emb_size],
padding_idx=0,
param_attr=fluid.ParamAttr(
name="word_embedding",
initializer=fluid.initializer.Xavier()))
return emb
def conv_pool_layer(self, input):
"""
convolution and pool layer
"""
# data format NCHW
# same padding
conv = fluid.layers.conv2d(
input=input,
num_filters=self.kernel_num,
stride=1,
padding="SAME",
filter_size=self.conv_filter,
act=self.conv_act)
pool = fluid.layers.pool2d(
input=conv,
pool_size=self.pool_size,
pool_stride=self.pool_stride,
pool_type=self.pool_type,
pool_padding=self.pool_padding)
return pool
def net(self, inputs, is_infer=False):
left_emb = self.embedding_layer(inputs[0])
right_emb = self.embedding_layer(inputs[1])
cross = fluid.layers.matmul(left_emb, right_emb, transpose_y=True)
cross = fluid.layers.reshape(cross,
[-1, 1, cross.shape[1], cross.shape[2]])
conv_pool = self.conv_pool_layer(input=cross)
relu_hid = fluid.layers.fc(input=conv_pool,
size=self.hidden_size,
act=self.hidden_act)
prediction = fluid.layers.fc(
input=relu_hid,
size=self.out_size, )
if is_infer:
self._infer_results["prediction"] = prediction
return
pos = fluid.layers.slice(
prediction, axes=[0, 1], starts=[0, 0], ends=[64, 1])
neg = fluid.layers.slice(
prediction, axes=[0, 1], starts=[64, 0], ends=[128, 1])
loss_part1 = fluid.layers.elementwise_sub(
fluid.layers.fill_constant(
shape=[64, 1], value=1.0, dtype='float32'),
pos)
loss_part2 = fluid.layers.elementwise_add(loss_part1, neg)
loss_part3 = fluid.layers.elementwise_max(
fluid.layers.fill_constant(
shape=[64, 1], value=0.0, dtype='float32'),
loss_part2)
avg_cost = fluid.layers.mean(loss_part3)
self._cost = avg_cost
# Match-Pyramid text matching model
## Introduction
Matching two texts is a fundamental problem in many natural language processing tasks. An effective approach is to extract meaningful matching patterns from words, phrases and sentences to produce a matching score. Inspired by the success of convolutional neural networks in image recognition, where neurons capture complex patterns built from elementary visual patterns such as oriented edges and corners, we model text matching as an image recognition problem (a small illustration of the matching-matrix idea follows the reference below). This implementation is aligned with the TensorFlow code open-sourced by the first author, Liang Pang: https://github.com/pl8787/MatchPyramid-TensorFlow/blob/master/model/model_mp.py, and implements the Match-Pyramid model proposed in the following paper:
```text
@inproceedings{pang2016text,
  title={Text Matching as Image Recognition},
  author={Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, Xueqi Cheng},
  year={2016}
}
```
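Conceptually, Match-Pyramid turns a pair of texts into a word-by-word interaction matrix and then treats that matrix as a one-channel image for the convolution and pooling layers. The following minimal NumPy sketch (toy vocabulary and sentence lengths, purely illustrative and not code from this repository) shows how such a matching matrix is built, mirroring the `matmul(..., transpose_y=True)` and `reshape` calls in model.py:
```python
import numpy as np

# Toy setup: vocabulary of 10 words, embedding size 5 (made-up numbers).
vocab_size, emb_size = 10, 5
embedding = np.random.rand(vocab_size, emb_size).astype("float32")

left = np.array([1, 3, 5])       # word ids of the left sentence
right = np.array([2, 3, 7, 9])   # word ids of the right sentence

left_emb = embedding[left]       # [len_left, emb_size]
right_emb = embedding[right]     # [len_right, emb_size]

# Word-by-word interaction, the analogue of
# fluid.layers.matmul(left_emb, right_emb, transpose_y=True) in model.py.
cross = left_emb @ right_emb.T   # [len_left, len_right]

# Add batch and channel dimensions so it can be fed to conv2d like an image.
image = cross.reshape(1, 1, len(left), len(right))
print(image.shape)               # (1, 1, 3, 4)
```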
## Data preparation
Training and testing use the Letor07 dataset, and the embedding layer is initialized with the embed_wiki-pdc_d50_norm word vectors.
The dataset consists of:
1. Word dictionary file: every word is mapped to a unique id (wid), and the mapping is stored in the dictionary file, e.g. word_dict.txt
2. Corpus files: each sentence is identified by a string id, and the second number on a line gives the sentence length, e.g. qid_query.txt and docid_doc.txt
3. Relation files: store the relation between two sentences, such as a query and a document, e.g. relation.train.fold1.txt and relation.test.fold1.txt (a small parsing sketch follows this list)
4. Embedding file: the pre-trained word vectors, e.g. embed_wiki-pdc_d50_norm
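For reference, a relation file can be loaded with a few lines of Python; each line holds a relevance label, a query id and a document id (a minimal sketch under that assumption, not code from this repository):
```python
# Minimal sketch: read a relation file whose lines look like
# "0 9326 GX047-06-4014803" (label, query id, document id) and group the
# labelled documents by query.
from collections import defaultdict

relations = defaultdict(list)
with open("./data/relation.test.fold1.txt") as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) != 3:
            continue  # skip a possible header or malformed lines
        label, qid, docid = int(parts[0]), parts[1], parts[2]
        relations[qid].append((docid, label))

print(len(relations), "queries loaded")
```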
## Data download and preprocessing
A download and preprocessing script that generates the training and test data in one step is provided; simply run: bash data_process.sh
The script downloads the Letor07 dataset from a domestic mirror, removes the existing relation.test.fold1.txt and relation.train.fold1.txt from the data folder, and extracts the full dataset into data. It then runs process.py, which places the full training data under `./data/train` and the full test data under `./data/test`, and generates the embedding.npy file used to initialize the embedding layer.
The expected output of the script is:
```
bash data_process.sh
...........load data...............
--2020-07-13 13:24:50-- https://paddlerec.bj.bcebos.com/match_pyramid/match_pyramid_data.tar.gz
Resolving paddlerec.bj.bcebos.com... 10.70.0.165
Connecting to paddlerec.bj.bcebos.com|10.70.0.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 214449643 (205M) [application/x-gzip]
Saving to: “match_pyramid_data.tar.gz”
100%[==========================================================================================================>] 214,449,643 114M/s in 1.8s
2020-07-13 13:24:52 (114 MB/s) - “match_pyramid_data.tar.gz” saved [214449643/214449643]
data/
data/relation.test.fold1.txt
data/relation.test.fold2.txt
data/relation.test.fold3.txt
data/relation.test.fold4.txt
data/relation.test.fold5.txt
data/relation.train.fold1.txt
data/relation.train.fold2.txt
data/relation.train.fold3.txt
data/relation.train.fold4.txt
data/relation.train.fold5.txt
data/relation.txt
data/docid_doc.txt
data/qid_query.txt
data/word_dict.txt
data/embed_wiki-pdc_d50_norm
...........data process...............
[./data/word_dict.txt]
Word dict size: 193367
[./data/qid_query.txt]
Data size: 1692
[./data/docid_doc.txt]
Data size: 65323
[./data/embed_wiki-pdc_d50_norm]
Embedding size: 109282
('Generate numpy embed:', (193368, 50))
[./data/relation.train.fold1.txt]
Instance size: 47828
('Pair Instance Count:', 325439)
[./data/relation.test.fold1.txt]
Instance size: 13652
```
## One-click training, testing and evaluation
A script that runs training, testing and evaluation in one step is provided; simply run: bash run.sh
The script runs python -m paddlerec.run -m ./config.yaml to train and test the model, saves the test results to result.txt, and finally runs eval.py to compute the MAP metric on those results.
The expected output of the script is:
```
..............test.................
13651
336
('map=', 0.420878322843591)
```
## What each file does
PaddleRec itself handles the following:
config.yaml defines the model's hyper-parameters,
model.py defines the network structure,
train_reader.py reads the training data,
test_reader.py reads the test data.
This example additionally provides:
data_process.sh to download and preprocess the data in one step,
run.sh to launch training and produce the test results in one step,
eval.py to compute the MAP metric from the saved test results (a tiny usage example follows below).
For a detailed guide to PaddleRec, see the tutorials at the bottom of https://github.com/PaddlePaddle/PaddleRec/blob/master/README_CN.md .
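As a quick sanity check of the metric, the eval_MAP function defined in eval.py can be exercised on a tiny hand-made example (the scores below are made up; run it in a Python session where eval_MAP has been defined or imported):
```python
# Three candidate documents for one query, two of them relevant.
# Ranked by score, the relevant ones land at positions 1 and 3,
# so MAP = (1/1 + 2/3) / 2 ≈ 0.83.
pred = [0.9, 0.2, 0.5]
gt = [1, 1, 0]
print(eval_MAP(pred, gt))
```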
#!/bin/bash
echo "................run................."
python -m paddlerec.run -m ./config.yaml >result1.txt
grep -A1 "prediction" ./result1.txt >./result.txt
rm -f result1.txt
python eval.py
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
from paddlerec.core.reader import ReaderBase
class Reader(ReaderBase):
def init(self):
pass
def generate_sample(self, line):
"""
Read the data line by line and process it as a dictionary
"""
def reader():
"""
This function needs to be implemented by the user, based on data format
"""
features = line.strip('\n').split('\t')
doc1 = [int(word_id) for word_id in features[0].split(",")]
doc2 = [int(word_id) for word_id in features[1].split(",")]
features_name = ["doc1", "doc2"]
yield zip(features_name, [doc1] + [doc2])
return reader
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
from paddlerec.core.reader import ReaderBase
class Reader(ReaderBase):
def init(self):
pass
def generate_sample(self, line):
"""
Read the data line by line and process it as a dictionary
"""
def reader():
"""
This function needs to be implemented by the user, based on data format
"""
features = line.strip('\n').split('\t')
doc1 = [int(word_id) for word_id in features[0].split(",")]
doc2 = [int(word_id) for word_id in features[1].split(",")]
features_name = ["doc1", "doc2"]
yield zip(features_name, [doc1] + [doc2])
return reader
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# limitations under the License. # limitations under the License.
# workspace # workspace
workspace: "paddlerec.models.match.multiview-simnet" workspace: "models/match/multiview-simnet"
# list of dataset # list of dataset
dataset: dataset:
......
...@@ -34,8 +34,11 @@ ...@@ -34,8 +34,11 @@
## 使用教程(快速开始) ## 使用教程(快速开始)
### 训练 ### 训练
```shell ```shell
python -m paddlerec.run -m paddlerec.models.match.dssm # dssm git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
python -m paddlerec.run -m paddlerec.models.match.multiview-simnet # multiview-simnet cd paddle-rec
python -m paddlerec.run -m models/match/dssm/config.yaml # dssm
python -m paddlerec.run -m models/match/multiview-simnet/config.yaml # multiview-simnet
``` ```
### 预测 ### 预测
......
# ESMM
The brief directory structure of this example is shown below:
```
├── data # 文档
├── train #训练数据
├──small.txt
├── test #测试数据
├── small.txt
├── run.sh
├── __init__.py
├── config.yaml #配置文件
├── esmm_reader.py #数据读取文件
├── model.py #模型文件
```
Note: before reading this example, it is recommended that you first go through:
[the PaddleRec beginner's tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## Contents
- [Model introduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#模型简介)
- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#数据准备)
- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#运行环境)
- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#快速开始)
- [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#论文复现)
- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#进阶使用)
- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#FAQ)
## Model introduction
Unlike CTR prediction, CVR prediction faces two key problems:
1. **Sample Selection Bias (SSB)**: a conversion can only happen after a click, so conventional CVR models are trained on clicked impressions, with clicked-but-not-converted samples as negatives and clicked-and-converted samples as positives. At serving time, however, the trained model scores samples from the entire impression space rather than only the clicked ones; the training data and the data to be predicted therefore come from different distributions, which poses a serious challenge to the model's ability to generalize.
2. **Data Sparsity (DS)**: the clicked samples available as CVR training data are far fewer than the impression samples used to train CTR models.
ESMM is the model proposed in the SIGIR 2018 paper [《Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate》]( https://arxiv.org/abs/1804.07931 ). Building on multi-task learning, it introduces a new CVR estimation model that effectively addresses both the data sparsity and the sample selection bias that CVR estimation faces in real-world scenarios.
This project implements the ESMM network in PaddlePaddle and validates it on the open-source dataset [Ali-CCP: Alibaba Click and Conversion Prediction]( https://tianchi.aliyun.com/datalab/dataSet.html?dataId=408 ). The default configuration uses the demo dataset; for accuracy verification please see the [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/esmm#论文复现) section.
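The core idea is that pCVR is never supervised directly: one tower predicts pCTR, the other pCVR, and their product pCTCVR = pCTR × pCVR is trained on conversion labels over the entire impression space together with the CTR task. A minimal NumPy sketch of that objective (illustrative only; it is not the model.py implementation and all numbers are made up):
```python
import numpy as np

def esmm_loss(p_ctr, p_cvr, click, conversion, eps=1e-7):
    """Illustrative ESMM objective: cross-entropy on pCTR with click labels
    plus cross-entropy on pCTCVR = pCTR * pCVR with conversion labels,
    both computed over all impressions."""
    p_ctcvr = p_ctr * p_cvr
    loss_ctr = -np.mean(click * np.log(p_ctr + eps) +
                        (1 - click) * np.log(1 - p_ctr + eps))
    loss_ctcvr = -np.mean(conversion * np.log(p_ctcvr + eps) +
                          (1 - conversion) * np.log(1 - p_ctcvr + eps))
    return loss_ctr + loss_ctcvr

# Toy batch: tower outputs and labels for three impressions.
p_ctr = np.array([0.9, 0.2, 0.7])
p_cvr = np.array([0.5, 0.1, 0.3])
click = np.array([1.0, 0.0, 1.0])
conversion = np.array([1.0, 0.0, 0.0])
print(esmm_loss(p_ctr, p_cvr, click, conversion))
```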
Supported features:
Training: single-machine CPU, single-card GPU, multi-card GPU, locally simulated parameter-server training and incremental training; for configuration see [launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
Prediction: single-machine CPU and single-card GPU; for configuration see [PaddleRec offline prediction](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
## Data preparation
Dataset: [Ali-CCP: Alibaba Click and Conversion Prediction]( https://tianchi.aliyun.com/datalab/dataSet.html?dataId=408 )
```
cd data
sh run.sh
```
See the demo data under data/train for the data format.
## Runtime environment
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os : windows/linux/macos
## Quick start
### Single-machine training
CPU environment
Set the device, number of epochs and other options in config.yaml.
```
dataset:
- name: dataset_train
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/train"
data_converter: "{workspace}/esmm_reader.py"
- name: dataset_infer
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/test"
data_converter: "{workspace}/esmm_reader.py"
```
### Single-machine prediction
CPU environment
Set parameters such as epochs and device in config.yaml.
```
- name: infer_runner
class: infer
init_model_path: "increment/1"
device: cpu
print_interval: 1
phases: [infer]
```
## Reproducing the paper
To reproduce the results of the original paper on the full dataset, set batch_size=1000, thread_num=8 and epoch_num=4 in config.yaml.
After the changes, set 'workspace' in config.yaml to the directory that contains config.yaml and run:
```
python -m paddlerec.run -m /home/your/dir/config.yaml # debug mode: pass the absolute path of the local config directly
```
## Advanced usage
## FAQ
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.multitask.esmm" workspace: "models/multitask/esmm"
dataset: dataset:
- name: dataset_train - name: dataset_train
......
mkdir train_data
mkdir test_data
mkdir vocab
mkdir data
train_source_path="./data/sample_train.tar.gz"
train_target_path="train_data"
test_source_path="./data/sample_test.tar.gz"
test_target_path="test_data"
cd data
echo "downloading sample_train.tar.gz......"
curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_train.tar.gz?Expires=1586435769&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=ahUDqhvKT1cGjC4%2FIER2EWtq7o4%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_train.tar.gz
cd ..
echo "unzipping sample_train.tar.gz......"
tar -xzvf ${train_source_path} -C ${train_target_path} && rm -rf ${train_source_path}
cd data
echo "downloading sample_test.tar.gz......"
curl -# 'http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/file/opensearch/documents/408/sample_test.tar.gz?Expires=1586435821&OSSAccessKeyId=LTAIGx40tjZWxj6q&Signature=OwLMPjt1agByQtRVi8pazsAliNk%3D&response-content-disposition=attachment%3B%20' -H 'Proxy-Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Accept-Language: zh-CN,zh;q=0.9' --compressed --insecure -o sample_test.tar.gz
cd ..
echo "unzipping sample_test.tar.gz......"
tar -xzvf ${test_source_path} -C ${test_target_path} && rm -rf ${test_source_path}
echo "preprocessing data......"
python reader.py --train_data_path ${train_target_path} \
--test_data_path ${test_target_path} \
--vocab_path vocab/vocab_size.txt \
--train_sample_size 6400 \
--test_sample_size 6400 \
# MMOE
The brief directory structure of this example is shown below:
```
├── data # 文档
├── train #训练数据
├── train_data.txt
├── test #测试数据
├── test_data.txt
├── run.sh
├── data_preparation.py
├── __init__.py
├── config.yaml #配置文件
├── census_reader.py #数据读取文件
├── model.py #模型文件
```
Note: before reading this example, it is recommended that you first go through:
[the PaddleRec beginner's tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## Contents
- [Model introduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#模型简介)
- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#数据准备)
- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#运行环境)
- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#快速开始)
- [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现)
- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#进阶使用)
- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#FAQ)
## Model introduction
Multi-task models improve the learning efficiency and quality of each task by exploiting the relations and differences between tasks. Multi-task learning commonly uses a shared-bottom structure in which the lower hidden layers are shared across tasks; this reduces the risk of overfitting, but the results can suffer when the tasks differ or the data distributions diverge. The paper [《Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts》]( https://www.kdd.org/kdd2018/accepted-papers/view/modeling-task-relationships-in-multi-task-learning-with-multi-gate-mixture- ) proposes the Multi-gate Mixture-of-Experts (MMoE) structure, which models task relationships explicitly and learns task-specific functions on top of shared representations without significantly increasing the number of parameters.
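The structure can be summarized in a few lines: every expert sees the same shared input, and each task owns a softmax gate that mixes the expert outputs before its tower. A minimal NumPy sketch of one forward step (shapes and weights are made up for illustration; the real network lives in model.py):
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

batch, input_dim, expert_dim = 4, 16, 8
num_experts, num_tasks = 3, 2

x = np.random.rand(batch, input_dim).astype("float32")
expert_w = np.random.rand(num_experts, input_dim, expert_dim).astype("float32")
gate_w = np.random.rand(num_tasks, input_dim, num_experts).astype("float32")

# Every expert transforms the same shared input.
experts = np.stack([x @ expert_w[e] for e in range(num_experts)], axis=1)

for task in range(num_tasks):
    gate = softmax(x @ gate_w[task])                      # [batch, num_experts]
    tower_in = (gate[:, :, None] * experts).sum(axis=1)   # weighted expert mix
    print("task", task, "tower input shape:", tower_in.shape)
```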
We implement the MMoE network in PaddlePaddle and validate it on the open-source Census-income dataset. The AUC of each task is:
1.income
> max_mmoe_test_auc_income:0.94937
>
> mean_mmoe_test_auc_income:0.94465
2.marital
> max_mmoe_test_auc_marital:0.99419
>
> mean_mmoe_test_auc_marital:0.99324
For accuracy verification, please see the [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现) section.
Supported features:
Training: single-machine CPU, single-card GPU, multi-card GPU, locally simulated parameter-server training and incremental training; for configuration see [launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
Prediction: single-machine CPU and single-card GPU; for configuration see [PaddleRec offline prediction](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
## Data preparation
Dataset: [Census-income Data](https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz )
After extracting the data, fill in the file paths in run.sh and run the script.
```sh
mkdir train_data
mkdir test_data
mkdir data
train_path="data/census-income.data"
test_path="data/census-income.test"
train_data_path="train_data/"
test_data_path="test_data/"
pip install -r requirements.txt
wget -P data/ https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz
tar -zxvf data/census.tar.gz -C data/
python data_preparation.py --train_path ${train_path} \
--test_path ${test_path} \
--train_data_path ${train_data_path}\
--test_data_path ${test_data_path}
```
The generated data is comma-separated, for example (a parsing sketch follows the example):
```
0,0,73,0,0,0,0,1700.09,0,0
```
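Each generated line starts with the two task labels (marital_stat first, then income_50k, as arranged by data_preparation.py) followed by the numeric features; a minimal parsing sketch (illustrative only, not the census_reader.py implementation):
```python
# Split one generated line into the two task labels and the feature vector.
line = "0,0,73,0,0,0,0,1700.09,0,0"
fields = line.strip().split(",")
label_marital = int(float(fields[0]))   # marital_stat label
label_income = int(float(fields[1]))    # income_50k label
features = [float(v) for v in fields[2:]]
print(label_marital, label_income, len(features))
```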
## Runtime environment
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os : windows/linux/macos
## Quick start
### Single-machine training
CPU environment
Set the device, number of epochs and other options in config.yaml.
```
dataset:
- name: dataset_train
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/train"
data_converter: "{workspace}/census_reader.py"
- name: dataset_infer
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/train"
data_converter: "{workspace}/census_reader.py"
```
### Single-machine prediction
CPU environment
Set parameters such as epochs and device in config.yaml.
```
- name: infer_runner
class: infer
init_model_path: "increment/0"
device: cpu
```
## Reproducing the paper
To reproduce the results of the original paper on the full dataset, set batch_size=1000, thread_num=8 and epoch_num=4 in config.yaml.
Training on a single P100 GPU takes about 6.5 hours; test AUC: best 0.9940, mean 0.9932.
After the changes, set 'workspace' in config.yaml to the directory that contains config.yaml and run:
```
python -m paddlerec.run -m /home/your/dir/config.yaml # debug mode: pass the absolute path of the local config directly
```
## Advanced usage
## FAQ
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.multitask.mmoe" workspace: "models/multitask/mmoe"
dataset: dataset:
- name: dataset_train - name: dataset_train
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pandas as pd
import numpy as np
import paddle.fluid as fluid
from args import *
def fun1(x):
if x == ' 50000+.':
return 1
else:
return 0
def fun2(x):
if x == ' Never married':
return 1
else:
return 0
def data_preparation(train_path, test_path, train_data_path, test_data_path):
# The column names are from
# https://www2.1010data.com/documentationcenter/prod/Tutorials/MachineLearningExamples/CensusIncomeDataSet.html
column_names = [
'age', 'class_worker', 'det_ind_code', 'det_occ_code', 'education',
'wage_per_hour', 'hs_college', 'marital_stat', 'major_ind_code',
'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member',
'unemp_reason', 'full_or_part_emp', 'capital_gains', 'capital_losses',
'stock_dividends', 'tax_filer_stat', 'region_prev_res',
'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'instance_weight',
'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same',
'mig_prev_sunbelt', 'num_emp', 'fam_under_18', 'country_father',
'country_mother', 'country_self', 'citizenship', 'own_or_self',
'vet_question', 'vet_benefits', 'weeks_worked', 'year', 'income_50k'
]
# Load the dataset in Pandas
train_df = pd.read_csv(
train_path,
delimiter=',',
header=None,
index_col=None,
names=column_names)
other_df = pd.read_csv(
test_path,
delimiter=',',
header=None,
index_col=None,
names=column_names)
# First group of tasks according to the paper
label_columns = ['income_50k', 'marital_stat']
# One-hot encoding categorical columns
categorical_columns = [
'class_worker', 'det_ind_code', 'det_occ_code', 'education',
'hs_college', 'major_ind_code', 'major_occ_code', 'race',
'hisp_origin', 'sex', 'union_member', 'unemp_reason',
'full_or_part_emp', 'tax_filer_stat', 'region_prev_res',
'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'mig_chg_msa',
'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt',
'fam_under_18', 'country_father', 'country_mother', 'country_self',
'citizenship', 'vet_question'
]
train_raw_labels = train_df[label_columns]
other_raw_labels = other_df[label_columns]
transformed_train = pd.get_dummies(train_df, columns=categorical_columns)
transformed_other = pd.get_dummies(other_df, columns=categorical_columns)
# Filling the missing column in the other set
transformed_other[
'det_hh_fam_stat_ Grandchild <18 ever marr not in subfamily'] = 0
# get label
transformed_train['income_50k'] = transformed_train['income_50k'].apply(
lambda x: fun1(x))
transformed_train['marital_stat'] = transformed_train[
'marital_stat'].apply(lambda x: fun2(x))
transformed_other['income_50k'] = transformed_other['income_50k'].apply(
lambda x: fun1(x))
transformed_other['marital_stat'] = transformed_other[
'marital_stat'].apply(lambda x: fun2(x))
# Split the other dataset into 1:1 validation to test according to the paper
validation_indices = transformed_other.sample(
frac=0.5, replace=False, random_state=1).index
test_indices = list(set(transformed_other.index) - set(validation_indices))
validation_data = transformed_other.iloc[validation_indices]
test_data = transformed_other.iloc[test_indices]
cols = transformed_train.columns.tolist()
cols.insert(0, cols.pop(cols.index('income_50k')))
cols.insert(0, cols.pop(cols.index('marital_stat')))
transformed_train = transformed_train[cols]
test_data = test_data[cols]
validation_data = validation_data[cols]
print(transformed_train.shape, transformed_other.shape,
validation_data.shape, test_data.shape)
transformed_train.to_csv(train_data_path + 'train_data.csv', index=False)
test_data.to_csv(test_data_path + 'test_data.csv', index=False)
args = data_preparation_args()
data_preparation(args.train_path, args.test_path, args.train_data_path,
args.test_data_path)
...@@ -44,9 +44,12 @@ ...@@ -44,9 +44,12 @@
## 使用教程(快速开始) ## 使用教程(快速开始)
```shell ```shell
python -m paddlerec.run -m paddlerec.models.multitask.mmoe # mmoe git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
python -m paddlerec.run -m paddlerec.models.multitask.share-bottom # share-bottom cd paddle-rec
python -m paddlerec.run -m paddlerec.models.multitask.esmm # esmm
python -m paddlerec.run -m models/multitask/mmoe/config.yaml # mmoe
python -m paddlerec.run -m models/multitask/share-bottom/config.yaml # share-bottom
python -m paddlerec.run -m models/multitask/esmm/config.yaml # esmm
``` ```
## 使用教程(复现论文) ## 使用教程(复现论文)
......
# Share_bottom
The brief directory structure of this example is shown below:
```
├── data # 文档
├── train #训练数据
├── train_data.txt
├── test #测试数据
├── test_data.txt
├── run.sh
├── data_preparation.py
├── __init__.py
├── config.yaml #配置文件
├── census_reader.py #数据读取文件
├── model.py #模型文件
```
Note: before reading this example, it is recommended that you first go through:
[the PaddleRec beginner's tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## Contents
- [Model introduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#模型简介)
- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#数据准备)
- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#运行环境)
- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#快速开始)
- [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#论文复现)
- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#进阶使用)
- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#FAQ)
## Model introduction
Shared-bottom is the basic framework for multi-task learning: the lower-layer parameters and network structure are shared across tasks. Its advantage is that it greatly reduces the number of parameters while still learning multiple tasks reasonably well; its obvious drawback is that, because the bottom layers are fully shared, tasks with low correlation can conflict during optimization and hurt the final results. Many later neural multi-task models build on shared-bottom; MMoE, for example, improves on it precisely where low task correlation degrades performance.
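In code terms, shared-bottom is a single shared hidden transformation with one small tower per task on top; a minimal NumPy sketch (made-up shapes, illustrative only; the real network lives in model.py):
```python
import numpy as np

batch, input_dim, bottom_dim, tower_dim = 4, 16, 8, 4
x = np.random.rand(batch, input_dim).astype("float32")

# One bottom network shared by every task ...
w_bottom = np.random.rand(input_dim, bottom_dim).astype("float32")
shared = np.maximum(x @ w_bottom, 0.0)   # ReLU bottom representation

# ... and a separate small tower per task on top of the shared features.
for task in range(2):
    w_tower = np.random.rand(bottom_dim, tower_dim).astype("float32")
    tower_out = np.maximum(shared @ w_tower, 0.0)
    print("task", task, "tower output shape:", tower_out.shape)
```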
We implement the shared-bottom network in PaddlePaddle and validate it on the open-source Census-income dataset. The AUC of each task is:
1.income
>max_sb_test_auc_income:0.94993
>
>mean_sb_test_auc_income: 0.93120
2.marital
> max_sb_test_auc_marital:0.99384
>
> mean_sb_test_auc_marital:0.99256
This project implements the shared-bottom network in PaddlePaddle and validates it on the open-source dataset [Census-income Data](https://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD) ). The default configuration uses the demo dataset; for accuracy verification please see the [Reproducing the paper](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/share-bottom#论文复现) section.
Supported features:
Training: single-machine CPU, single-card GPU, multi-card GPU, locally simulated parameter-server training and incremental training; for configuration see [launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
Prediction: single-machine CPU and single-card GPU; for configuration see [PaddleRec offline prediction](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
## Data preparation
Dataset: [Census-income Data](https://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD) )
After extracting the data, fill in the file paths in create_data.sh and run the script.
```sh
mkdir train_data
mkdir test_data
mkdir data
train_path="data/census-income.data"
test_path="data/census-income.test"
train_data_path="train_data/"
test_data_path="test_data/"
pip install -r requirements.txt
wget -P data/ https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz
tar -zxvf data/census.tar.gz -C data/
python data_preparation.py --train_path ${train_path} \
--test_path ${test_path} \
--train_data_path ${train_data_path}\
--test_data_path ${test_data_path}
```
The generated data is comma-separated, for example:
```
0,0,73,0,0,0,0,1700.09,0,0
```
## Runtime environment
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os : windows/linux/macos
## Quick start
### Single-machine training
CPU environment
Set the device, number of epochs and other options in config.yaml.
```sh
dataset:
- name: dataset_train
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/train"
data_converter: "{workspace}/census_reader.py"
- name: dataset_infer
batch_size: 5
type: QueueDataset
data_path: "{workspace}/data/train"
data_converter: "{workspace}/census_reader.py"
```
### Single-machine prediction
CPU environment
Set parameters such as epochs and device in config.yaml.
```sh
- name: infer_runner
class: infer
init_model_path: "increment/0"
device: cpu
```
## Reproducing the paper
To reproduce the results of the original paper on the full dataset, set batch_size=32, thread_num=8 and epoch_num=100 in config.yaml.
Training for 100 epochs on a single P100 GPU takes about 4.5 hours; test AUC: best 0.9939, mean 0.9931.
After the changes, set 'workspace' in config.yaml to the directory that contains config.yaml and run:
```text
python -m paddlerec.run -m /home/your/dir/config.yaml # debug mode: pass the absolute path of the local config directly
```
## Advanced usage
## FAQ
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.multitask.share-bottom" workspace: "models/multitask/share-bottom"
dataset: dataset:
- name: dataset_train - name: dataset_train
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pandas as pd
import numpy as np
import paddle.fluid as fluid
from args import *
def fun1(x):
if x == ' 50000+.':
return 1
else:
return 0
def fun2(x):
if x == ' Never married':
return 1
else:
return 0
def data_preparation(train_path, test_path, train_data_path, test_data_path):
# The column names are from
# https://www2.1010data.com/documentationcenter/prod/Tutorials/MachineLearningExamples/CensusIncomeDataSet.html
column_names = [
'age', 'class_worker', 'det_ind_code', 'det_occ_code', 'education',
'wage_per_hour', 'hs_college', 'marital_stat', 'major_ind_code',
'major_occ_code', 'race', 'hisp_origin', 'sex', 'union_member',
'unemp_reason', 'full_or_part_emp', 'capital_gains', 'capital_losses',
'stock_dividends', 'tax_filer_stat', 'region_prev_res',
'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'instance_weight',
'mig_chg_msa', 'mig_chg_reg', 'mig_move_reg', 'mig_same',
'mig_prev_sunbelt', 'num_emp', 'fam_under_18', 'country_father',
'country_mother', 'country_self', 'citizenship', 'own_or_self',
'vet_question', 'vet_benefits', 'weeks_worked', 'year', 'income_50k'
]
# Load the dataset in Pandas
train_df = pd.read_csv(
train_path,
delimiter=',',
header=None,
index_col=None,
names=column_names)
other_df = pd.read_csv(
test_path,
delimiter=',',
header=None,
index_col=None,
names=column_names)
# First group of tasks according to the paper
label_columns = ['income_50k', 'marital_stat']
# One-hot encoding categorical columns
categorical_columns = [
'class_worker', 'det_ind_code', 'det_occ_code', 'education',
'hs_college', 'major_ind_code', 'major_occ_code', 'race',
'hisp_origin', 'sex', 'union_member', 'unemp_reason',
'full_or_part_emp', 'tax_filer_stat', 'region_prev_res',
'state_prev_res', 'det_hh_fam_stat', 'det_hh_summ', 'mig_chg_msa',
'mig_chg_reg', 'mig_move_reg', 'mig_same', 'mig_prev_sunbelt',
'fam_under_18', 'country_father', 'country_mother', 'country_self',
'citizenship', 'vet_question'
]
train_raw_labels = train_df[label_columns]
other_raw_labels = other_df[label_columns]
transformed_train = pd.get_dummies(train_df, columns=categorical_columns)
transformed_other = pd.get_dummies(other_df, columns=categorical_columns)
# Filling the missing column in the other set
transformed_other[
'det_hh_fam_stat_ Grandchild <18 ever marr not in subfamily'] = 0
# get label
transformed_train['income_50k'] = transformed_train['income_50k'].apply(
lambda x: fun1(x))
transformed_train['marital_stat'] = transformed_train[
'marital_stat'].apply(lambda x: fun2(x))
transformed_other['income_50k'] = transformed_other['income_50k'].apply(
lambda x: fun1(x))
transformed_other['marital_stat'] = transformed_other[
'marital_stat'].apply(lambda x: fun2(x))
# Split the other dataset into 1:1 validation to test according to the paper
validation_indices = transformed_other.sample(
frac=0.5, replace=False, random_state=1).index
test_indices = list(set(transformed_other.index) - set(validation_indices))
validation_data = transformed_other.iloc[validation_indices]
test_data = transformed_other.iloc[test_indices]
cols = transformed_train.columns.tolist()
cols.insert(0, cols.pop(cols.index('income_50k')))
cols.insert(0, cols.pop(cols.index('marital_stat')))
transformed_train = transformed_train[cols]
test_data = test_data[cols]
validation_data = validation_data[cols]
print(transformed_train.shape, transformed_other.shape,
validation_data.shape, test_data.shape)
transformed_train.to_csv(train_data_path + 'train_data.csv', index=False)
test_data.to_csv(test_data_path + 'test_data.csv', index=False)
args = data_preparation_args()
data_preparation(args.train_path, args.test_path, args.train_data_path,
args.test_data_path)
mkdir train_data
mkdir test_data
mkdir data
train_path="data/census-income.data"
test_path="data/census-income.test"
train_data_path="train_data/"
test_data_path="test_data/"
pip install -r requirements.txt
wget -P data/ https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz
tar -zxvf data/census.tar.gz -C data/
python data_preparation.py --train_path ${train_path} \
--test_path ${test_path} \
--train_data_path ${train_data_path}\
--test_data_path ${test_data_path}
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.AutoInt" workspace: "models/rank/AutoInt"
dataset: dataset:
......
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.BST" workspace: "models/rank/BST"
dataset: dataset:
- name: sample_1 - name: sample_1
......
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.afm" workspace: "models/rank/afm"
dataset: dataset:
- name: train_sample - name: train_sample
......
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.dcn" workspace: "models/rank/dcn"
dataset: dataset:
- name: train_sample - name: train_sample
......
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.deep_crossing" workspace: "models/rank/deep_crossing"
dataset: dataset:
- name: train_sample - name: train_sample
......
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.deepfm" workspace: "models/rank/deepfm"
dataset: dataset:
......
# DIEN
## Note: config.yaml names the training dataset sample_1 and the inference dataset infer_sample. reader.py and infer_reader.py look up the dataset configuration through these names, so if you rename a dataset in config.yaml, you must also update the name in the corresponding reader.py or infer_reader.py.
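Concretely, the coupling is through the string keys the readers pass to envs.get_global_env; the sketch below shows the kind of lookup keys that must stay in sync with the dataset names in config.yaml (illustrative only, mirroring the calls in reader.py and infer_reader.py):
```python
# Dataset names hard-coded in the readers; they must match config.yaml.
TRAIN_DATASET = "sample_1"
INFER_DATASET = "infer_sample"

# reader.py reads e.g. "dataset.sample_1.batch_size",
# infer_reader.py reads e.g. "dataset.infer_sample.data_path".
print("dataset.%s.batch_size" % TRAIN_DATASET)
print("dataset.%s.data_path" % INFER_DATASET)
```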
...@@ -14,19 +14,19 @@ ...@@ -14,19 +14,19 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.dien" workspace: "models/rank/dien"
dataset: dataset:
- name: sample_1 - name: sample_1
type: DataLoader type: DataLoader
batch_size: 5 batch_size: 32
data_path: "{workspace}/data/train_data" data_path: "{workspace}/data/train_data"
data_converter: "{workspace}/reader.py" data_converter: "{workspace}/reader.py"
- name: infer_sample - name: infer_sample
type: DataLoader type: DataLoader
batch_size: 5 batch_size: 32
data_path: "{workspace}/data/train_data" data_path: "{workspace}/data/train_data"
data_converter: "{workspace}/reader.py" data_converter: "{workspace}/infer_reader.py"
hyper_parameters: hyper_parameters:
optimizer: optimizer:
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import random
try:
import cPickle as pickle
except ImportError:
import pickle
import numpy as np
from paddlerec.core.reader import ReaderBase
from paddlerec.core.utils import envs
class Reader(ReaderBase):
def init(self):
self.train_data_path = envs.get_global_env(
"dataset.infer_sample.data_path", None)
self.res = []
self.max_len = 0
self.neg_candidate_item = []
self.neg_candidate_cat = []
self.max_neg_item = 10000
self.max_neg_cat = 1000
data_file_list = os.listdir(self.train_data_path)
for i in range(0, len(data_file_list)):
train_data_file = os.path.join(self.train_data_path,
data_file_list[i])
with open(train_data_file, "r") as fin:
for line in fin:
line = line.strip().split(';')
hist = line[0].split()
self.max_len = max(self.max_len, len(hist))
fo = open("tmp.txt", "w")
fo.write(str(self.max_len))
fo.close()
self.batch_size = envs.get_global_env(
"dataset.infer_sample.batch_size", 32, None)
self.group_size = self.batch_size * 20
def _process_line(self, line):
line = line.strip().split(';')
hist = line[0].split()
hist = [int(i) for i in hist]
cate = line[1].split()
cate = [int(i) for i in cate]
return [hist, cate, [int(line[2])], [int(line[3])], [float(line[4])]]
def generate_sample(self, line):
"""
Read the data line by line and process it as a dictionary
"""
def data_iter():
# feat_idx, feat_value, label = self._process_line(line)
yield self._process_line(line)
return data_iter
def pad_batch_data(self, input, max_len):
res = np.array([x + [0] * (max_len - len(x)) for x in input])
res = res.astype("int64").reshape([-1, max_len])
return res
def make_data(self, b):
max_len = max(len(x[0]) for x in b)
# item = self.pad_batch_data([x[0] for x in b], max_len)
# cat = self.pad_batch_data([x[1] for x in b], max_len)
item = [x[0] for x in b]
cat = [x[1] for x in b]
neg_item = [None] * len(item)
neg_cat = [None] * len(cat)
for i in range(len(b)):
neg_item[i] = []
neg_cat[i] = []
if len(self.neg_candidate_item) < self.max_neg_item:
self.neg_candidate_item.extend(b[i][0])
if len(self.neg_candidate_item) > self.max_neg_item:
self.neg_candidate_item = self.neg_candidate_item[
0:self.max_neg_item]
else:
len_seq = len(b[i][0])
start_idx = random.randint(0, self.max_neg_item - len_seq - 1)
self.neg_candidate_item[start_idx:start_idx + len_seq + 1] = b[
i][0]
if len(self.neg_candidate_cat) < self.max_neg_cat:
self.neg_candidate_cat.extend(b[i][1])
if len(self.neg_candidate_cat) > self.max_neg_cat:
self.neg_candidate_cat = self.neg_candidate_cat[
0:self.max_neg_cat]
else:
len_seq = len(b[i][1])
start_idx = random.randint(0, self.max_neg_cat - len_seq - 1)
self.neg_candidate_cat[start_idx:start_idx + len_seq + 1] = b[
i][1]
for _ in range(len(b[i][0])):
neg_item[i].append(self.neg_candidate_item[random.randint(
0, len(self.neg_candidate_item) - 1)])
for _ in range(len(b[i][1])):
neg_cat[i].append(self.neg_candidate_cat[random.randint(
0, len(self.neg_candidate_cat) - 1)])
len_array = [len(x[0]) for x in b]
mask = np.array(
[[0] * x + [-1e9] * (max_len - x) for x in len_array]).reshape(
[-1, max_len, 1])
target_item_seq = np.array(
[[x[2]] * max_len for x in b]).astype("int64").reshape(
[-1, max_len])
target_cat_seq = np.array(
[[x[3]] * max_len for x in b]).astype("int64").reshape(
[-1, max_len])
res = []
for i in range(len(b)):
res.append([
item[i], cat[i], b[i][2], b[i][3], b[i][4], mask[i],
target_item_seq[i], target_cat_seq[i], neg_item[i], neg_cat[i]
])
return res
def batch_reader(self, reader, batch_size, group_size):
def batch_reader():
bg = []
for line in reader:
bg.append(line)
if len(bg) == group_size:
sortb = sorted(bg, key=lambda x: len(x[0]), reverse=False)
bg = []
for i in range(0, group_size, batch_size):
b = sortb[i:i + batch_size]
yield self.make_data(b)
len_bg = len(bg)
if len_bg != 0:
sortb = sorted(bg, key=lambda x: len(x[0]), reverse=False)
bg = []
remain = len_bg % batch_size
for i in range(0, len_bg - remain, batch_size):
b = sortb[i:i + batch_size]
yield self.make_data(b)
return batch_reader
def base_read(self, file_dir):
res = []
for train_file in file_dir:
with open(train_file, "r") as fin:
for line in fin:
line = line.strip().split(';')
hist = line[0].split()
cate = line[1].split()
res.append([hist, cate, line[2], line[3], float(line[4])])
return res
def generate_batch_from_trainfiles(self, files):
data_set = self.base_read(files)
random.shuffle(data_set)
return self.batch_reader(data_set, self.batch_size,
self.batch_size * 20)
...@@ -51,7 +51,7 @@ class Reader(ReaderBase): ...@@ -51,7 +51,7 @@ class Reader(ReaderBase):
fo.write(str(self.max_len)) fo.write(str(self.max_len))
fo.close() fo.close()
self.batch_size = envs.get_global_env("dataset.sample_1.batch_size", self.batch_size = envs.get_global_env("dataset.sample_1.batch_size",
32, "train.reader") 32, None)
self.group_size = self.batch_size * 20 self.group_size = self.batch_size * 20
def _process_line(self, line): def _process_line(self, line):
......
# DIN
## Note: config.yaml names the training dataset sample_1 and the inference dataset infer_sample. reader.py and infer_reader.py look up the dataset configuration through these names, so if you rename a dataset in config.yaml, you must also update the name in the corresponding reader.py or infer_reader.py.
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.din" workspace: "models/rank/din"
dataset: dataset:
- name: sample_1 - name: sample_1
...@@ -26,7 +26,7 @@ dataset: ...@@ -26,7 +26,7 @@ dataset:
type: DataLoader type: DataLoader
batch_size: 5 batch_size: 5
data_path: "{workspace}/data/train_data" data_path: "{workspace}/data/train_data"
data_converter: "{workspace}/reader.py" data_converter: "{workspace}/infer_reader.py"
hyper_parameters: hyper_parameters:
optimizer: optimizer:
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import random
try:
import cPickle as pickle
except ImportError:
import pickle
import numpy as np
from paddlerec.core.reader import ReaderBase
from paddlerec.core.utils import envs
class Reader(ReaderBase):
def init(self):
self.train_data_path = envs.get_global_env(
"dataset.infer_sample.data_path", None)
self.res = []
self.max_len = 0
data_file_list = os.listdir(self.train_data_path)
for i in range(0, len(data_file_list)):
train_data_file = os.path.join(self.train_data_path,
data_file_list[i])
with open(train_data_file, "r") as fin:
for line in fin:
line = line.strip().split(';')
hist = line[0].split()
self.max_len = max(self.max_len, len(hist))
fo = open("tmp.txt", "w")
fo.write(str(self.max_len))
fo.close()
self.batch_size = envs.get_global_env(
"dataset.infer_sample.batch_size", 32, None)
self.group_size = self.batch_size * 20
def _process_line(self, line):
line = line.strip().split(';')
hist = line[0].split()
hist = [int(i) for i in hist]
cate = line[1].split()
cate = [int(i) for i in cate]
return [hist, cate, [int(line[2])], [int(line[3])], [float(line[4])]]
def generate_sample(self, line):
"""
Read the data line by line and process it as a dictionary
"""
def data_iter():
# feat_idx, feat_value, label = self._process_line(line)
yield self._process_line(line)
return data_iter
def pad_batch_data(self, input, max_len):
res = np.array([x + [0] * (max_len - len(x)) for x in input])
res = res.astype("int64").reshape([-1, max_len])
return res
def make_data(self, b):
max_len = max(len(x[0]) for x in b)
item = self.pad_batch_data([x[0] for x in b], max_len)
cat = self.pad_batch_data([x[1] for x in b], max_len)
len_array = [len(x[0]) for x in b]
mask = np.array(
[[0] * x + [-1e9] * (max_len - x) for x in len_array]).reshape(
[-1, max_len, 1])
target_item_seq = np.array(
[[x[2]] * max_len for x in b]).astype("int64").reshape(
[-1, max_len])
target_cat_seq = np.array(
[[x[3]] * max_len for x in b]).astype("int64").reshape(
[-1, max_len])
res = []
for i in range(len(b)):
res.append([
item[i], cat[i], b[i][2], b[i][3], b[i][4], mask[i],
target_item_seq[i], target_cat_seq[i]
])
return res
def batch_reader(self, reader, batch_size, group_size):
def batch_reader():
bg = []
for line in reader:
bg.append(line)
if len(bg) == group_size:
sortb = sorted(bg, key=lambda x: len(x[0]), reverse=False)
bg = []
for i in range(0, group_size, batch_size):
b = sortb[i:i + batch_size]
yield self.make_data(b)
len_bg = len(bg)
if len_bg != 0:
sortb = sorted(bg, key=lambda x: len(x[0]), reverse=False)
bg = []
remain = len_bg % batch_size
for i in range(0, len_bg - remain, batch_size):
b = sortb[i:i + batch_size]
yield self.make_data(b)
return batch_reader
def base_read(self, file_dir):
res = []
for train_file in file_dir:
with open(train_file, "r") as fin:
for line in fin:
line = line.strip().split(';')
hist = line[0].split()
cate = line[1].split()
res.append([hist, cate, line[2], line[3], float(line[4])])
return res
def generate_batch_from_trainfiles(self, files):
data_set = self.base_read(files)
random.shuffle(data_set)
return self.batch_reader(data_set, self.batch_size,
self.batch_size * 20)
...@@ -47,7 +47,7 @@ class Reader(ReaderBase): ...@@ -47,7 +47,7 @@ class Reader(ReaderBase):
fo.write(str(self.max_len)) fo.write(str(self.max_len))
fo.close() fo.close()
self.batch_size = envs.get_global_env("dataset.sample_1.batch_size", self.batch_size = envs.get_global_env("dataset.sample_1.batch_size",
32, "train.reader") 32, None)
self.group_size = self.batch_size * 20 self.group_size = self.batch_size * 20
def _process_line(self, line): def _process_line(self, line):
......
...@@ -11,12 +11,8 @@ ...@@ -11,12 +11,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
workspace: "./"
backend: "PaddleCloud" backend: "PaddleCloud"
cluster_type: k8s # k8s 可选 cluster_type: k8s # mpi 可选
config: config:
fs_name: "afs://xxx.com" fs_name: "afs://xxx.com"
...@@ -56,5 +52,12 @@ submit: ...@@ -56,5 +52,12 @@ submit:
# for k8s gpu # for k8s gpu
k8s_trainers: 2 k8s_trainers: 2
k8s_cpu_cores: 2
k8s_gpu_card: 1 k8s_gpu_card: 1
# for k8s ps-cpu
k8s_trainers: 2
k8s_cpu_cores: 4
k8s_ps_num: 2
k8s_ps_cores: 4
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# limitations under the License. # limitations under the License.
# workspace # workspace
workspace: "paddlerec.models.rank.dnn" workspace: "models/rank/dnn"
# list of dataset # list of dataset
dataset: dataset:
...@@ -67,7 +67,6 @@ runner: ...@@ -67,7 +67,6 @@ runner:
save_inference_path: "inference" # save inference path save_inference_path: "inference" # save inference path
save_inference_feed_varnames: [] # feed vars of save inference save_inference_feed_varnames: [] # feed vars of save inference
save_inference_fetch_varnames: [] # fetch vars of save inference save_inference_fetch_varnames: [] # fetch vars of save inference
init_model_path: "" # load model path
print_interval: 10 print_interval: 10
phases: [phase1] phases: [phase1]
......
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.ffm" workspace: "models/rank/ffm"
dataset: dataset:
- name: train_sample - name: train_sample
......
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.fgcnn" workspace: "models/rank/fgcnn"
dataset: dataset:
- name: train_sample - name: train_sample
......
...@@ -16,13 +16,35 @@ ...@@ -16,13 +16,35 @@
├── config.yaml #配置文件 ├── config.yaml #配置文件
``` ```
## 简介 注:在阅读该示例前,建议您先了解以下内容:
[paddlerec入门教程](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
---
## 内容
- [模型简介](#模型简介)
- [数据准备](#数据准备)
- [运行环境](#运行环境)
- [快速开始](#快速开始)
- [论文复现](#论文复现)
- [进阶使用](#进阶使用)
- [FAQ](#FAQ)
## 模型简介
[《FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction》]( https://arxiv.org/pdf/1905.09433.pdf)是新浪微博机器学习团队发表在RecSys19上的一篇论文,文章指出当前的许多通过特征组合进行CTR预估的工作主要使用特征向量的内积或哈达玛积来计算交叉特征,这种方法忽略了特征本身的重要程度。提出通过使用Squeeze-Excitation network (SENET) 结构动态学习特征的重要性以及使用一个双线性函数来更好的建模交叉特征。 [《FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction》]( https://arxiv.org/pdf/1905.09433.pdf)是新浪微博机器学习团队发表在RecSys19上的一篇论文,文章指出当前的许多通过特征组合进行CTR预估的工作主要使用特征向量的内积或哈达玛积来计算交叉特征,这种方法忽略了特征本身的重要程度。提出通过使用Squeeze-Excitation network (SENET) 结构动态学习特征的重要性以及使用一个双线性函数来更好的建模交叉特征。
本项目在paddlepaddle上实现FibiNET的网络结构,并在开源数据集Criteo上验证模型效果。 本项目在paddlepaddle上实现FibiNET的网络结构,并在开源数据集Criteo上验证模型效果, 本模型配置默认使用demo数据集,若进行精度验证,请参考[论文复现](#论文复现)部分。
本项目支持功能
训练:单机CPU、单机单卡GPU、单机多卡GPU、本地模拟参数服务器训练、增量训练,配置请参考 [启动训练](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
预测:单机CPU、单机单卡GPU ;配置请参考[PaddleRec 离线预测](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
## 数据下载及预处理 ## 数据准备
数据地址:[Criteo]( https://fleet.bj.bcebos.com/ctr_data.tar.gz) 数据地址:[Criteo]( https://fleet.bj.bcebos.com/ctr_data.tar.gz)
...@@ -36,15 +58,33 @@ ...@@ -36,15 +58,33 @@
sh run.sh sh run.sh
``` ```
## 环境 原始的数据格式为13个dense部分特征+离散化特征,用'\t'切分, 对应的数据是data/train_data_full data/test_data_full
```
0 1 1 5 0 1382 4 15 2 181 1 2 2 68fd1e64 80e26c9b fb936136 7b4723c4 25c83c98 7e0ccccf de7995b8 1f89b562 a73ee510 a8cd5504 b2cb9c98 37c9c164 2824a5f6 1adce6ef 8ba8b39a 891b62e7 e5ba7672 f54016b9 21ddcdc9 b1252a9d 07b5194c 3a171ecb c5c50484 e8b83407 9727dd16
```
经过get_slot_data.py处理后,得到如下数据, dense_feature中的值会merge在一起,对应net.py中的self._dense_data_var, '1:715353'表示net.py中的self._sparse_data_var[1] = 715353, 对应的数据是data/slot_train_data_full, data/slot_test_data_full
```
click:0 dense_feature:0.05 dense_feature:0.00663349917081 dense_feature:0.05 dense_feature:0.0 dense_feature:0.02159375 dense_feature:0.008 dense_feature:0.15 dense_feature:0.04 dense_feature:0.362 dense_feature:0.1 dense_feature:0.2 dense_feature:0.0 dense_feature:0.04 1:715353 2:817085 3:851010 4:833725 5:286835 6:948614 7:881652 8:507110 9:27346 10:646986 11:643076 12:200960 13:18464 14:202774 15:532679 16:729573 17:342789 18:562805 19:880474 20:984402 21:666449 22:26235 23:700326 24:452909 25:884722 26:787527
```
PaddlePaddle 1.7.2 ## 运行环境
python3.7 PaddlePaddle>=1.7.2
PaddleRec python 2.7/3.5/3.6/3.7
## 单机训练 PaddleRec >=0.1
os : windows/linux/macos
## 快速开始
### 单机训练
CPU环境 CPU环境
...@@ -73,7 +113,7 @@ runner: ...@@ -73,7 +113,7 @@ runner:
phases: [phase1] phases: [phase1]
``` ```
## 单机预测 ### 单机预测
CPU环境 CPU环境
...@@ -90,17 +130,15 @@ CPU环境 ...@@ -90,17 +130,15 @@ CPU环境
phases: [phase2] phases: [phase2]
``` ```
## 运行 ### 运行
``` ```
python -m paddlerec.run -m paddlerec.models.rank.fibinet python -m paddlerec.run -m models/rank/fibinet
``` ```
## 模型效果
在样例数据上测试模型 ### 结果展示
训练 样例数据训练结果展示
``` ```
Running SingleStartup. Running SingleStartup.
...@@ -122,7 +160,7 @@ batch: 1800, AUC: [0.85260467], BATCH_AUC: [0.92847032] ...@@ -122,7 +160,7 @@ batch: 1800, AUC: [0.85260467], BATCH_AUC: [0.92847032]
epoch 3 done, use time: 1618.1106688976288 epoch 3 done, use time: 1618.1106688976288
``` ```
预测 样例数据预测结果展示
``` ```
load persistables from increment_model/3 load persistables from increment_model/3
...@@ -136,3 +174,18 @@ batch: 1800, AUC: [0.86633785], BATCH_AUC: [0.96900967] ...@@ -136,3 +174,18 @@ batch: 1800, AUC: [0.86633785], BATCH_AUC: [0.96900967]
batch: 1820, AUC: [0.86662365], BATCH_AUC: [0.96759972] batch: 1820, AUC: [0.86662365], BATCH_AUC: [0.96759972]
``` ```
## 论文复现
用原论文的完整数据复现论文效果需要在config.yaml中修改batch_size=1000, thread_num=8, epoch_num=4
使用gpu p100 单卡训练 60h 测试auc:0.79
修改后运行方案:修改config.yaml中的'workspace'为config.yaml的目录位置,执行
```
python -m paddlerec.run -m /home/your/dir/config.yaml #调试模式 直接指定本地config的绝对路径
```
## 进阶使用
## FAQ
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# limitations under the License. # limitations under the License.
# workspace # workspace
workspace: "paddlerec.models.rank.fibinet" workspace: "models/rank/fibinet"
# list of dataset # list of dataset
dataset: dataset:
......
...@@ -15,23 +15,53 @@ ...@@ -15,23 +15,53 @@
├── config.yaml #配置文件 ├── config.yaml #配置文件
``` ```
## 简介 注:在阅读该示例前,建议您先了解以下内容:
[paddlerec入门教程](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
---
## 内容
- [模型简介](#模型简介)
- [数据准备](#数据准备)
- [运行环境](#运行环境)
- [快速开始](#快速开始)
- [论文复现](#论文复现)
- [进阶使用](#进阶使用)
- [FAQ](#FAQ)
## 模型简介
[《FLEN: Leveraging Field for Scalable CTR Prediction》](https://arxiv.org/pdf/1911.04690.pdf)文章提出了field-wise bi-interaction pooling技术,解决了在大规模应用特征field信息时存在的时间复杂度和空间复杂度高的困境,同时提出了一种缓解梯度耦合问题的方法dicefactor。该模型已应用于美图的大规模推荐系统中,持续稳定地取得业务效果的全面提升。 [《FLEN: Leveraging Field for Scalable CTR Prediction》](https://arxiv.org/pdf/1911.04690.pdf)文章提出了field-wise bi-interaction pooling技术,解决了在大规模应用特征field信息时存在的时间复杂度和空间复杂度高的困境,同时提出了一种缓解梯度耦合问题的方法dicefactor。该模型已应用于美图的大规模推荐系统中,持续稳定地取得业务效果的全面提升。
本项目在avazu数据集上验证模型效果 本项目在avazu数据集上验证模型效果, 本模型配置默认使用demo数据集,若进行精度验证,请参考[论文复现](#论文复现)部分。
本项目支持功能
训练:单机CPU、单机单卡GPU、单机多卡GPU、本地模拟参数服务器训练、增量训练,配置请参考 [启动训练](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
预测:单机CPU、单机单卡GPU ;配置请参考[PaddleRec 离线预测](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
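为便于理解上文提到的field-wise bi-interaction pooling,下面给出一个基于numpy的最小计算示意(其中field group数M、embedding维度D均为示例取值,也未包含dicefactor等细节,并非本模型的实际实现):
```
import numpy as np

def field_wise_bi_interaction(group_emb, r):
    # group_emb: (M, D),每个field group内embedding求和后的向量
    # r: (M, M),组间交互权重;返回 (D,) 的交叉特征向量 sum_{i<j} r[i,j] * (e_i ⊙ e_j)
    M, D = group_emb.shape
    out = np.zeros(D)
    for i in range(M):
        for j in range(i + 1, M):
            out += r[i, j] * group_emb[i] * group_emb[j]
    return out

rng = np.random.RandomState(0)
M, D = 3, 8                       # 示例:3个field group,embedding维度8
print(field_wise_bi_interaction(rng.randn(M, D), rng.rand(M, M)).shape)   # (8,)
```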
## 数据准备
## 数据下载及预处理
## 环境
PaddlePaddle 1.7.2 ## 运行环境
python3.7 PaddlePaddle>=1.7.2
PaddleRec python 2.7/3.5/3.6/3.7
## 单机训练 PaddleRec >=0.1
os : windows/linux/macos
## 快速开始
### 单机训练
CPU环境 CPU环境
...@@ -60,7 +90,7 @@ runner: ...@@ -60,7 +90,7 @@ runner:
phases: [phase1] phases: [phase1]
``` ```
## 单机预测 ### 单机预测
CPU环境 CPU环境
...@@ -77,54 +107,21 @@ CPU环境 ...@@ -77,54 +107,21 @@ CPU环境
phases: [phase2] phases: [phase2]
``` ```
## 运行 ### 运行
``` ```
python -m paddlerec.run -m paddlerec.models.rank.flen python -m paddlerec.run -m models/rank/flen
``` ```
## 模型效果 ## 论文复现
在样例数据上测试模型 用原论文的完整数据复现论文效果需要在config.yaml中修改batch_size=512, thread_num=8, epoch_num=1
训练: 全量数据的效果未来补充。
```
0702 13:38:20.903220 7368 parallel_executor.cc:440] The Program will be executed on CPU using ParallelExecutor, 2 cards are used, so 2 programs are executed in parallel.
I0702 13:38:20.925912 7368 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0702 13:38:20.933356 7368 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
batch: 2, AUC: [0.09090909 0. ], BATCH_AUC: [0.09090909 0. ]
batch: 4, AUC: [0.31578947 0.29411765], BATCH_AUC: [0.31578947 0.29411765]
batch: 6, AUC: [0.41333333 0.33333333], BATCH_AUC: [0.41333333 0.33333333]
batch: 8, AUC: [0.4453125 0.44166667], BATCH_AUC: [0.4453125 0.44166667]
batch: 10, AUC: [0.39473684 0.38888889], BATCH_AUC: [0.44117647 0.41176471]
batch: 12, AUC: [0.41860465 0.45535714], BATCH_AUC: [0.5078125 0.54545455]
batch: 14, AUC: [0.43413729 0.42746615], BATCH_AUC: [0.56666667 0.56 ]
batch: 16, AUC: [0.46433566 0.47460087], BATCH_AUC: [0.53 0.59247649]
batch: 18, AUC: [0.44009217 0.44642857], BATCH_AUC: [0.46 0.47]
batch: 20, AUC: [0.42705314 0.43781095], BATCH_AUC: [0.45878136 0.4874552 ]
batch: 22, AUC: [0.45176471 0.46011281], BATCH_AUC: [0.48046875 0.45878136]
batch: 24, AUC: [0.48375 0.48910256], BATCH_AUC: [0.56630824 0.59856631]
epoch 0 done, use time: 0.21532440185546875
PaddleRec Finish
```
预测
```
PaddleRec: Runner single_cpu_infer Begin
Executor Mode: infer
processor_register begin
Running SingleInstance.
Running SingleNetwork.
QueueDataset can not support PY3, change to DataLoader
QueueDataset can not support PY3, change to DataLoader
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment_model/0
batch: 20, AUC: [0.49121353], BATCH_AUC: [0.66176471]
batch: 40, AUC: [0.51156463], BATCH_AUC: [0.55197133]
Infer phase2 of 0 done, use time: 0.3941819667816162
PaddleRec Finish
```
## 进阶使用
## FAQ
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# limitations under the License. # limitations under the License.
# workspace # workspace
workspace: "paddlerec.models.rank.flen" workspace: "models/rank/flen"
# list of dataset # list of dataset
dataset: dataset:
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data = pd.read_csv('./avazu_sample.txt')
data['day'] = data['hour'].apply(lambda x: str(x)[4:6])
data['hour'] = data['hour'].apply(lambda x: str(x)[6:])
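# 注:avazu原始数据的hour字段形如YYMMDDHH,上面两行分别从中拆出日期(day)与小时(hour)
# 下面的sparse_features列出参与LabelEncoder离散化的全部离散特征列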
sparse_features = [
'hour',
'C1',
'banner_pos',
'site_id',
'site_domain',
'site_category',
'app_id',
'app_domain',
'app_category',
'device_id',
'device_model',
'device_type',
'device_conn_type', # 'device_ip',
'C14',
'C15',
'C16',
'C17',
'C18',
'C19',
'C20',
'C21',
]
data[sparse_features] = data[sparse_features].fillna('-1', )
# 1.Label Encoding for sparse features,and do simple Transformation for dense features
for feat in sparse_features:
lbe = LabelEncoder()
data[feat] = lbe.fit_transform(data[feat])
cols = [
'click', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C1',
'device_model', 'device_type', 'device_id', 'app_id', 'app_domain',
'app_category', 'banner_pos', 'site_id', 'site_domain', 'site_category',
'device_conn_type', 'hour'
]
# 计算每一个特征的最大值,作为vocab_size写入vacob_file.txt
data = data[cols]
line = ''
vacob_file = open('vacob_file.txt', 'w')
for col in cols[1:]:
max_val = data[col].max()
line += str(max_val) + ','
vacob_file.write(line)
vacob_file.close()
data.to_csv('./train_data/train_data.txt', index=False, header=None)
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.fm" workspace: "models/rank/fm"
dataset: dataset:
- name: train_sample - name: train_sample
......
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.fnn" workspace: "models/rank/fnn"
dataset: dataset:
- name: train_sample - name: train_sample
......
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.logistic_regression" workspace: "models/rank/logistic_regression"
dataset: dataset:
......
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.nfm" workspace: "models/rank/nfm"
dataset: dataset:
- name: train_sample - name: train_sample
......
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.pnn" workspace: "models/rank/pnn"
dataset: dataset:
- name: train_sample - name: train_sample
......
...@@ -107,7 +107,7 @@ sh run.sh ...@@ -107,7 +107,7 @@ sh run.sh
### 训练 ### 训练
``` ```
cd models/rank/dnn # 进入选定好的排序模型的目录 以DNN为例 cd models/rank/dnn # 进入选定好的排序模型的目录 以DNN为例
python -m paddlerec.run -m paddlerec.models.rank.dnn # 使用内置配置 python -m paddlerec.run -m models/rank/dnn/config.yaml # 使用内置配置
# 如果需要使用自定义配置,config.yaml中workspace需要使用该模型目录的绝对路径 # 如果需要使用自定义配置,config.yaml中workspace需要使用该模型目录的绝对路径
# 自定义修改超参后,指定配置文件,使用自定义配置 # 自定义修改超参后,指定配置文件,使用自定义配置
python -m paddlerec.run -m ./config.yaml python -m paddlerec.run -m ./config.yaml
......
# wide&deep
以下是本例的简要目录结构及说明:
```
├── data # 样例数据
├── train #训练数据
├── train_data.txt
├── create_data.sh
├── data_preparation.py
├── get_slot_data.py
├── run.sh
├── __init__.py
├── config.yaml #配置文件
├── model.py #模型文件
```
注:在阅读该示例前,建议您先了解以下内容:
[paddlerec入门教程](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
## 内容
- [模型简介](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#模型简介)
- [数据准备](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#数据准备)
- [运行环境](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#运行环境)
- [快速开始](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#快速开始)
- [论文复现](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#论文复现)
- [进阶使用](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#进阶使用)
- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#FAQ)
## 模型简介
[《Wide & Deep Learning for Recommender Systems》]( https://arxiv.org/pdf/1606.07792.pdf)是Google 2016年发布的推荐框架,wide&deep设计了一种融合浅层(wide)模型和深层(deep)模型进行联合训练的框架,综合利用浅层模型的记忆能力和深层模型的泛化能力,实现单模型对推荐系统准确性和扩展性的兼顾。从推荐效果和服务性能两方面进行评价:
1. 效果上,在Google Play 进行线上A/B实验,wide&deep模型相比高度优化的Wide浅层模型,app下载率+3.9%。相比deep模型也有一定提升。
2. 性能上,通过切分一次请求需要处理的app 的Batch size为更小的size,并利用多线程并行请求达到提高处理效率的目的。单次响应耗时从31ms下降到14ms。
本例在paddlepaddle上实现wide&deep并在开源数据集Census-income Data上验证模型效果,在测试集上的平均acc和auc分别为:
> mean_acc: 0.76195
>
> mean_auc: 0.90577
若进行精度验证,请参考[论文复现](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/wide_deep#论文复现)部分。
本项目支持功能
训练:单机CPU、单机单卡GPU、单机多卡GPU、本地模拟参数服务器训练、增量训练,配置请参考 [启动训练](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
预测:单机CPU、单机单卡GPU ;配置请参考[PaddleRec 离线预测](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
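结合上面的描述,下面给出一个wide&deep联合训练网络的最小示意(使用paddle.fluid静态图API;输入维度沿用本例dense_slots配置中的wide_input:8与deep_input:58,deep部分隐层大小为假设值,并非本模型的完整实现):
```
import paddle.fluid as fluid

def wide_deep_net(wide_input, deep_input):
    # wide部分:一层线性变换,对应"记忆"能力
    wide = fluid.layers.fc(input=wide_input, size=1, act=None)
    # deep部分:多层全连接,对应"泛化"能力(隐层大小为示例值)
    deep = deep_input
    for size in [75, 50, 25]:
        deep = fluid.layers.fc(input=deep, size=size, act='relu')
    deep = fluid.layers.fc(input=deep, size=1, act=None)
    # 两路输出相加后过sigmoid,联合训练
    return fluid.layers.sigmoid(fluid.layers.elementwise_add(wide, deep))

wide_in = fluid.data(name='wide_input', shape=[None, 8], dtype='float32')
deep_in = fluid.data(name='deep_input', shape=[None, 58], dtype='float32')
prediction = wide_deep_net(wide_in, deep_in)
```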
## 数据准备
数据地址:
[adult.data](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data)
[adult.test](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test)
## 运行环境
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os : windows/linux/macos
## 快速开始
### 单机训练
CPU环境
在config.yaml文件中设置好设备,epochs等。
```
dataset:
- name: sample_1
type: QueueDataset
batch_size: 5
data_path: "{workspace}/data/sample_data/train"
sparse_slots: "label"
dense_slots: "wide_input:8 deep_input:58"
- name: infer_sample
type: QueueDataset
batch_size: 5
data_path: "{workspace}/data/sample_data/train"
sparse_slots: "label"
dense_slots: "wide_input:8 deep_input:58"
```
### 单机预测
CPU环境
在config.yaml文件中设置好epochs、device等参数。
```
- name: infer_runner
class: infer
device: cpu
init_model_path: "increment/0"
```
## 论文复现
用原论文的完整数据复现论文效果需要在config.yaml中修改batch_size=40, thread_num=8, epoch_num=40
本例在paddlepaddle上实现wide&deep并在开源数据集Census-income Data上验证模型效果,在测试集上的平均acc和auc分别为:
mean_acc: 0.76195 , mean_auc: 0.90577
修改后运行方案:修改config.yaml中的'workspace'为config.yaml的目录位置,执行
```
python -m paddlerec.run -m /home/your/dir/config.yaml #调试模式 直接指定本地config的绝对路径
```
## 进阶使用
## FAQ
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
# global settings # global settings
debug: false debug: false
workspace: "paddlerec.models.rank.wide_deep" workspace: "models/rank/wide_deep"
dataset: dataset:
......
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
debug: false debug: false
workspace: "paddlerec.models.rank.xdeepfm" workspace: "models/rank/xdeepfm"
dataset: dataset:
- name: sample_1 - name: sample_1
......
...@@ -11,7 +11,7 @@ ...@@ -11,7 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.recall.fasttext" workspace: "models/recall/fasttext"
# list of dataset # list of dataset
dataset: dataset:
......
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# limitations under the License. # limitations under the License.
# workspace # workspace
workspace: "paddlerec.models.recall.gnn" workspace: "models/recall/gnn"
# list of dataset # list of dataset
dataset: dataset:
...@@ -42,11 +42,11 @@ hyper_parameters: ...@@ -42,11 +42,11 @@ hyper_parameters:
gnn_propogation_steps: 1 gnn_propogation_steps: 1
# select runner by name # select runner by name
mode: train_runner mode: [single_cpu_train, single_cpu_infer]
# config of each runner. # config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process. # runner is a kind of paddle training class, which wraps the train/infer process.
runner: runner:
- name: train_runner - name: single_cpu_train
class: train class: train
# num of epochs # num of epochs
epochs: 5 epochs: 5
...@@ -59,21 +59,23 @@ runner: ...@@ -59,21 +59,23 @@ runner:
save_inference_feed_varnames: [] # feed vars of save inference save_inference_feed_varnames: [] # feed vars of save inference
save_inference_fetch_varnames: [] # fetch vars of save inference save_inference_fetch_varnames: [] # fetch vars of save inference
init_model_path: "" # load model path init_model_path: "" # load model path
print_interval: 10 print_interval: 1
- name: infer_runner phases: [phase1]
- name: single_cpu_infer
class: infer class: infer
# device to run training or infer # device to run training or infer
device: cpu device: cpu
print_interval: 1 print_interval: 1
init_model_path: "increment_gnn" # load model path init_model_path: "increment_gnn" # load model path
phases: [phase2]
# runner will run all the phase in each epoch # runner will run all the phase in each epoch
phase: phase:
- name: phase_train - name: phase1
model: "{workspace}/model.py" # user-defined model model: "{workspace}/model.py" # user-defined model
dataset_name: dataset_train # select dataset by name dataset_name: dataset_train # select dataset by name
thread_num: 1 thread_num: 1
# - name: phase_infer - name: phase2
# model: "{workspace}/model.py" # user-defined model model: "{workspace}/model.py" # user-defined model
# dataset_name: dataset_infer # select dataset by name dataset_name: dataset_infer # select dataset by name
# thread_num: 1 thread_num: 1
...@@ -20,7 +20,7 @@ import paddle.fluid.layers as layers ...@@ -20,7 +20,7 @@ import paddle.fluid.layers as layers
from paddlerec.core.utils import envs from paddlerec.core.utils import envs
from paddlerec.core.model import ModelBase from paddlerec.core.model import ModelBase
from paddlerec.core.metrics import Precision from paddlerec.core.metrics import RecallK
class Model(ModelBase): class Model(ModelBase):
...@@ -236,7 +236,7 @@ class Model(ModelBase): ...@@ -236,7 +236,7 @@ class Model(ModelBase):
softmax = layers.softmax_with_cross_entropy( softmax = layers.softmax_with_cross_entropy(
logits=logits, label=inputs[6]) # [batch_size, 1] logits=logits, label=inputs[6]) # [batch_size, 1]
self.loss = layers.reduce_mean(softmax) # [1] self.loss = layers.reduce_mean(softmax) # [1]
acc = Precision(input=logits, label=inputs[6], k=20) acc = RecallK(input=logits, label=inputs[6], k=20)
self._cost = self.loss self._cost = self.loss
if is_infer: if is_infer:
......
# GNN # GNN
## 快速开始 以下是本例的简要目录结构及说明:
PaddleRec中每个内置模型都配备了对应的样例数据,用户可基于该数据集快速对模型、环境进行验证,从而降低后续的调试成本。在内置数据集上进行训练的命令为:
``` ```
python -m paddlerec.run -m paddlerec.models.recall.gnn ├── data #样例数据
├── train
├── train.txt
├── test
├── test.txt
├── download.py
├── convert_data.py
├── preprocess.py
├── __init__.py
├── README.md # 文档
├── model.py #模型文件
├── config.yaml #配置文件
├── data_prepare.sh #一键数据处理脚本
├── reader.py #训练数据reader
├── evaluate_reader.py # 预测数据reader
``` ```
注:在阅读该示例前,建议您先了解以下内容:
[paddlerec入门教程](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
---
## 内容
- [模型简介](#模型简介)
- [数据准备](#数据准备)
- [运行环境](#运行环境)
- [快速开始](#快速开始)
- [论文复现](#论文复现)
- [进阶使用](#进阶使用)
- [FAQ](#FAQ)
## 模型简介
SR-GNN模型的介绍可以参阅论文[Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855)
本文解决的是Session-based Recommendation这一问题,过程大致分为以下四步:
1. 首先对所有的session序列通过有向图进行建模。
2. 然后通过GNN,学习每个node(item)的隐向量表示
3. 通过一个attention架构模型得到每个session的embedding
4. 最后通过一个softmax层进行全表预测
本示例中,我们复现了论文效果,在DIGINETICA数据集上P@20可以达到50.7。
同时推荐用户参考[ IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/124382)
本模型配置默认使用demo数据集,若进行精度验证,请参考[论文复现](#论文复现)部分。
本项目支持功能
训练:单机CPU、单机单卡GPU、单机多卡GPU、本地模拟参数服务器训练、增量训练,配置请参考 [启动训练](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
预测:单机CPU、单机单卡GPU ;配置请参考[PaddleRec 离线预测](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
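结合上面第1步中对session序列构图的描述,下面给出一个把单条session构造成有向图(节点去重、出/入度归一化邻接矩阵)的最小numpy示意,仅用于说明建模方式,具体实现以本目录下的代码为准:
```
import numpy as np

def build_session_graph(session):
    # session: 物品id列表,如 [10, 11, 12, 12, 13, 14]
    nodes = list(dict.fromkeys(session))          # 去重并保留出现顺序
    idx = {item: i for i, item in enumerate(nodes)}
    n = len(nodes)
    adj = np.zeros((n, n), dtype=np.float32)
    for a, b in zip(session[:-1], session[1:]):   # 相邻点击构成有向边 a -> b
        adj[idx[a], idx[b]] = 1.0
    out_deg = adj.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1
    in_deg = adj.sum(axis=0, keepdims=True)
    in_deg[in_deg == 0] = 1
    adj_out = adj / out_deg                       # 按出度归一化
    adj_in = (adj / in_deg).T                     # 按入度归一化(行对应目标节点)
    return nodes, adj_in, adj_out

nodes, adj_in, adj_out = build_session_graph([10, 11, 12, 12, 13, 14])
print(nodes)                                      # [10, 11, 12, 13, 14]
```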
## 数据处理 ## 数据处理
- Step1: 原始数据数据集下载,本示例提供了两个开源数据集:DIGINETICA和Yoochoose,可选其中任意一个训练本模型。 本示例中数据处理共包含三步:
- Step1: 原始数据数据集下载,本示例提供了两个开源数据集:DIGINETICA和Yoochoose,可选其中任意一个训练本模型。数据下载命令及原始数据格式如下所示。若采用diginetica数据集,执行完该命令之后,会在data目录下得到原始数据文件train-item-views.csv。若采用yoochoose数据集,执行完该命令之后,会在data目录下得到原始数据文件yoochoose-clicks.dat。
``` ```
cd data && python download.py diginetica # or yoochoose cd data && python download.py diginetica # or yoochoose
``` ```
...@@ -24,14 +80,13 @@ python -m paddlerec.run -m paddlerec.models.recall.gnn ...@@ -24,14 +80,13 @@ python -m paddlerec.run -m paddlerec.models.recall.gnn
4. timeframe - time since the first query in a session, in milliseconds. 4. timeframe - time since the first query in a session, in milliseconds.
5. eventdate - calendar date. 5. eventdate - calendar date.
- Step2: 数据预处理 - Step2: 数据预处理。
1. 以session_id为key合并原始数据集,得到每个session的日期,及顺序点击列表。
2. 过滤掉长度为1的session;过滤掉点击次数小于5的items。
3. 训练集、测试集划分。原始数据集里最新日期七天内的作为训练集,更早之前的数据作为测试集。
``` ```
cd data && python preprocess.py --dataset diginetica # or yoochoose cd data && python preprocess.py --dataset diginetica # or yoochoose
``` ```
1. 以session_id为key合并原始数据集,得到每个session的日期,及顺序点击列表。
2. 过滤掉长度为1的session;过滤掉点击次数小于5的items。
3. 训练集、测试集划分。原始数据集里最新日期七天内的作为测试集,更早之前的数据作为训练集。
- Step3: 数据整理。 将训练文件统一放在data/train目录下,测试文件统一放在data/test目录下。 - Step3: 数据整理。 将训练文件统一放在data/train目录下,测试文件统一放在data/test目录下。
``` ```
cat data/diginetica/train.txt | wc -l >> data/config.txt # or yoochoose1_4 or yoochoose1_64 cat data/diginetica/train.txt | wc -l >> data/config.txt # or yoochoose1_4 or yoochoose1_64
...@@ -40,37 +95,130 @@ python -m paddlerec.run -m paddlerec.models.recall.gnn ...@@ -40,37 +95,130 @@ python -m paddlerec.run -m paddlerec.models.recall.gnn
mv data/diginetica/train.txt data/train mv data/diginetica/train.txt data/train
mv data/diginetica/test.txt data/test mv data/diginetica/test.txt data/test
``` ```
数据处理完成后,data/train目录存放训练数据,data/test目录下存放测试数据,data/config.txt中存放数据统计信息,用以配置模型超参。 数据处理完成后,data/train目录存放训练数据,data/test目录下存放测试数据,数据格式如下:
```
#session\tlabel
10,11,12,12,13,14\t15
```
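读取上述格式的一行数据可参考下面的最小示意(仅用于说明数据格式,实际读取逻辑以本目录代码为准):
```
def parse_train_line(line):
    # 一行形如 "10,11,12,12,13,14\t15":制表符前为session点击序列,后为待预测的下一个物品
    session, label = line.rstrip('\n').split('\t')
    return [int(i) for i in session.split(',')], int(label)

print(parse_train_line("10,11,12,12,13,14\t15"))   # ([10, 11, 12, 12, 13, 14], 15)
```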
data/config.txt中存放数据统计信息,第一行代表训练集中item总数,用以配置模型词表大小,第二行代表训练集大小。
方便起见, 我们提供了一键式数据处理脚本: 方便起见, 我们提供了一键式数据处理脚本:
``` ```
sh data_prepare.sh diginetica # or yoochoose1_4 or yoochoose1_64 sh data_prepare.sh diginetica # or yoochoose1_4 or yoochoose1_64
``` ```
## 实验配置 ## 运行环境
为在真实数据中复现论文中的效果,你还需要完成如下几步,PaddleRec所有配置均通过修改模型目录下的config.yaml文件完成: PaddlePaddle>=1.7.2
1. 真实数据配置。config.yaml中数据集相关配置见`dataset`字段,数据路径通过`data_path`进行配置。用户可以直接将workspace修改为当前项目目录的绝对路径完成设置。 python 2.7/3.5/3.6/3.7
2. 超参配置。
- batch_size: 修改config.yaml中dataset_train数据集的batch_size为100。
- epochs: 修改config.yaml中runner的epochs为5。
- sparse_feature_number: 不同训练数据集(diginetica or yoochoose)配置不一致,diginetica数据集配置为43098,yoochoose数据集配置为37484。具体见数据处理后得到的data/config.txt文件中第一行。
- corpus_size: 不同训练数据集配置不一致,diginetica数据集配置为719470,yoochoose数据集配置为5917745。具体见数据处理后得到的data/config.txt文件中第二行。
## 训练 PaddleRec >=0.1
在完成[实验配置](##实验配置)后,执行如下命令完成训练:
os : windows/linux/macos
## 快速开始
### 单机训练
CPU环境
在config.yaml文件中设置好设备,epochs等。
```
# select runner by name
mode: [single_cpu_train, single_cpu_infer]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: single_cpu_train
class: train
# num of epochs
epochs: 2
# device to run training or infer
device: cpu
save_checkpoint_interval: 1 # save model interval of epochs
save_inference_interval: 1 # save inference
save_checkpoint_path: "increment_gnn" # save checkpoint path
save_inference_path: "inference_gnn" # save inference path
save_inference_feed_varnames: [] # feed vars of save inference
save_inference_fetch_varnames: [] # fetch vars of save inference
init_model_path: "" # load model path
print_interval: 1
phases: [phase1]
```
### 单机预测
CPU环境
在config.yaml文件中设置好epochs、device等参数。
```
- name: single_cpu_infer
class: infer
# device to run training or infer
device: cpu
print_interval: 1
init_model_path: "increment_gnn" # load model path
phases: [phase2]
``` ```
python -m paddlerec.run -m ./config.yaml
### 运行
```
python -m paddlerec.run -m models/recall/gnn/config.yaml
``` ```
## 测试 ### 结果展示
开始测试前,你需要完成如下几步配置:
1. 修改config.yaml中的mode,为infer_runner。
2. 修改config.yaml中的phase,为phase_infer,需按提示注释掉phase_trainer。
3. 修改config.yaml中dataset_infer数据集的batch_size为100。
完成上面两步配置后,执行如下命令完成测试: 样例数据训练结果展示:
```
Running SingleStartup.
Running SingleRunner.
batch: 1, LOSS: [10.67443], InsCnt: [200.], RecallCnt: [0.], Acc(Recall@20): [0.]
batch: 2, LOSS: [10.672471], InsCnt: [300.], RecallCnt: [0.], Acc(Recall@20): [0.]
batch: 3, LOSS: [10.672463], InsCnt: [400.], RecallCnt: [1.], Acc(Recall@20): [0.0025]
batch: 4, LOSS: [10.670724], InsCnt: [500.], RecallCnt: [2.], Acc(Recall@20): [0.004]
batch: 5, LOSS: [10.66949], InsCnt: [600.], RecallCnt: [2.], Acc(Recall@20): [0.00333333]
batch: 6, LOSS: [10.670102], InsCnt: [700.], RecallCnt: [2.], Acc(Recall@20): [0.00285714]
batch: 7, LOSS: [10.671348], InsCnt: [800.], RecallCnt: [2.], Acc(Recall@20): [0.0025]
...
epoch 0 done, use time: 2926.6897077560425, global metrics: LOSS=[6.0788856], InsCnt=719400.0 RecallCnt=224033.0 Acc(Recall@20)=0.3114164581595774
...
epoch 4 done, use time: 3083.101449728012, global metrics: LOSS=[4.249889], InsCnt=3597000.0 RecallCnt=2070666.0 Acc(Recall@20)=0.5756647206005004
``` ```
python -m paddlerec.run -m ./config.yaml 样例数据预测结果展示:
``` ```
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment_gnn/2
batch: 1, InsCnt: [200.], RecallCnt: [96.], Acc(Recall@20): [0.48], LOSS: [5.7198644]
batch: 2, InsCnt: [300.], RecallCnt: [153.], Acc(Recall@20): [0.51], LOSS: [5.4096317]
batch: 3, InsCnt: [400.], RecallCnt: [210.], Acc(Recall@20): [0.525], LOSS: [5.300991]
batch: 4, InsCnt: [500.], RecallCnt: [258.], Acc(Recall@20): [0.516], LOSS: [5.6269655]
batch: 5, InsCnt: [600.], RecallCnt: [311.], Acc(Recall@20): [0.5183333], LOSS: [5.39276]
batch: 6, InsCnt: [700.], RecallCnt: [352.], Acc(Recall@20): [0.50285715], LOSS: [5.633842]
batch: 7, InsCnt: [800.], RecallCnt: [406.], Acc(Recall@20): [0.5075], LOSS: [5.342844]
batch: 8, InsCnt: [900.], RecallCnt: [465.], Acc(Recall@20): [0.51666665], LOSS: [4.918761]
...
Infer phase2 of epoch 0 done, use time: 549.1640813350677, global metrics: InsCnt=60800.0 RecallCnt=31083.0 Acc(Recall@20)=0.511233552631579, LOSS=[5.8957024]
```
## 论文复现
用原论文的完整数据复现论文效果,需要在config.yaml中修改如下超参:
- batch_size: 修改config.yaml中dataset_train数据集的batch_size为100。
- epochs: 修改config.yaml中runner的epochs为5。
- sparse_feature_number: 不同训练数据集(diginetica or yoochoose)配置不一致,diginetica数据集配置为43098,yoochoose数据集配置为37484。具体见数据处理后得到的data/config.txt文件中第一行。
- corpus_size: 不同训练数据集配置不一致,diginetica数据集配置为719470,yoochoose数据集配置为5917745。具体见数据处理后得到的data/config.txt文件中第二行。
使用cpu训练 5轮 测试Recall@20:0.51367
修改后运行方案:修改config.yaml中的'workspace'为config.yaml的目录位置,执行
```
python -m paddlerec.run -m /home/your/dir/config.yaml #调试模式 直接指定本地config的绝对路径
```
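上面提到的sparse_feature_number与corpus_size均来自数据处理生成的data/config.txt(第一行为item总数,第二行为训练集大小),可以用类似下面的小脚本读出后填入config.yaml(示意代码,非本项目自带工具):
```
# 读取data/config.txt,打印应填入config.yaml的两个超参(示意)
with open('data/config.txt') as f:
    lines = [line.strip() for line in f if line.strip()]

print('sparse_feature_number:', int(lines[0]))   # 训练集中item总数,即词表大小
print('corpus_size:', int(lines[1]))             # 训练集样本数
```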
## 进阶使用
## FAQ
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
workspace: "paddlerec.models.recall.gru4rec" workspace: "models/recall/gru4rec"
dataset: dataset:
- name: dataset_train - name: dataset_train
......
# look-alike recall
以下是本例的简要目录结构及说明:
```
├── config.yaml # 配置文件
├── data # 样例数据文件夹
│ ├── build_dataset.py # 生成样例数据程序示例
│ └── train_data # 样例数据
│ └── paddle_train.txt # 默认样例数据
├── __init__.py
├── model.py # 模型文件
└── README.md # 文档
```
注:在阅读该示例前,建议您先了解以下内容:
[paddlerec入门教程](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
---
## 内容
- [模型简介](#模型简介)
- [数据准备](#数据准备)
- [运行环境](#运行环境)
- [快速开始](#快速开始)
- [论文复现](#论文复现)
- [进阶使用](#进阶使用)
- [FAQ](#FAQ)
## 模型简介
本目录模型文件参考论文 [《Real-time Attention Based Look-alike Model for Recommender System》]( https://arxiv.org/pdf/1906.05022),这是发表在KDD 19上的一篇论文。文章指出目前基于深度学习的模型未能缓解"马太效应":模型倾向于偏向拥有较多特征的头部资源,而那些同样优质、但缺少用户交互信息的长尾资源得不到充分的曝光。文章提出推荐广告的经典算法 look-alike 是缓解"马太效应"的一个不错的选择,但受限于推荐场景相比推荐广告更为严格的实时性要求,该算法未被大规模采用。基于以上,文章提出了一种实时性的基于attention的look-alike算法 RALM。
本项目在paddlepaddle上主要实现RALM的网络结构,更多实时性方面的工程尝试请参考原论文。由于原论文没有在开源数据集上验证模型效果,本项目提供了100行样例数据用于验证模型的正确性;若需进行精度验证,请参考样例数据格式构建自己的数据集,或自定义更改相关配置,在工程环境中进行验证。
模型大体为双塔结构,可以理解为target user和user seeds两个塔,其中使用了论文中提出的local_attention和global_attention模块;损失函数基于两塔输出的cosine similarity计算。
本项目支持功能
训练:单机CPU、单机单卡GPU、单机多卡GPU、本地模拟参数服务器训练、增量训练,配置请参考 [启动训练](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)
预测:单机CPU、单机单卡GPU ;配置请参考[PaddleRec 离线预测](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
## 数据准备
数据地址:[样例数据](./data/train_data/paddle_train.txt)
样例数据中每条样本包含三个slot:user_seeds, target_user, label。
(1) user_seeds: 基于当前的资源圈定的种子用户
(2) target_user: 目标用户
(3) label: 点击与否
注:本项目提供的样例数据为完全fake的,没有任何实际参考价值。用户可根据样例数据格式自行构建基于自己项目或者工程的数据集。
执行build_dataset.py生成训练集和测试集
```
cd data
python build_dataset.py
```
运行后生成的数据每行包含3个离散化特征slot,以空格切分,对应的slot是user_seeds, target_user, label
```
user_seeds:2 user_seeds:3 user_seeds:4 user_seeds:5 user_seeds:6 user_seeds:7 user_seeds:8 user_seeds:9 user_seeds:10 user_seeds:11 user_seeds:12 user_seeds:13 user_seeds:14 user_seeds:15 user_seeds:16 user_seeds:17 user_seeds:18 user_seeds:19 target_user:1 label:1
```
## 运行环境
PaddlePaddle>=1.7.2
python 2.7/3.5/3.6/3.7
PaddleRec >=0.1
os : windows/linux/macos
## 快速开始
### 单机训练
CPU环境
在config.yaml文件中设置好设备,epochs等。
```
# select runner by name
mode: [single_cpu_train, single_cpu_infer]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: single_cpu_train
class: train
# num of epochs
epochs: 4
# device to run training or infer
device: cpu
save_checkpoint_interval: 2 # save model interval of epochs
save_inference_interval: 4 # save inference
save_checkpoint_path: "increment_model" # save checkpoint path
save_inference_path: "inference" # save inference path
save_inference_feed_varnames: [] # feed vars of save inference
save_inference_fetch_varnames: [] # fetch vars of save inference
init_model_path: "" # load model path
print_interval: 10
phases: [phase1]
```
### 单机预测
CPU环境
在config.yaml文件中设置好epochs、device等参数。
```
- name: single_cpu_infer
class: infer
# num of epochs
epochs: 1
# device to run training or infer
device: cpu #选择预测的设备
init_model_path: "increment_dnn" # load model path
phases: [phase2]
```
### 运行
```
python -m paddlerec.run -m models/recall/look-alike_recall/config.yaml
```
### 结果展示
样例数据训练结果展示:
```
PaddleRec: Runner train_runner Begin
Executor Mode: train
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Running SingleStartup.
Running SingleRunner.
I0729 15:51:44.029929 22883 parallel_executor.cc:440] The Program will be executed on CPU using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0729 15:51:44.031812 22883 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0729 15:51:44.033733 22883 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0729 15:51:44.035027 22883 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
batch: 1, BATCH_AUC: [0.], AUC: [0.]
batch: 2, BATCH_AUC: [0.], AUC: [0.]
epoch 0 done, use time: 0.0433671474457
PaddleRec Finish
```
## 论文复现
论文中没有提供基于公开数据集的实验结果。
## 进阶使用
## FAQ
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# global settings
debug: false
workspace: "models/recall/look-alike_recall"
dataset:
- name: sample_1
type: DataLoader
batch_size: 32
data_path: "{workspace}/data/train_data"
sparse_slots: "label user_seeds target_user"
- name: infer_sample
type: DataLoader
batch_size: 32
data_path: "{workspace}/data/train_data"
sparse_slots: "label user_seed target_user"
hyper_parameters:
optimizer:
class: SGD
learning_rate: 0.0001
use_DataLoader: True
user_emb_size: 96
user_count: 100000
seeds_num: 20
transformed_size: 96
mode: train_runner
runner:
- name: train_runner
class: train
epochs: 1
device: cpu
init_model_path: ""
save_checkpoint_interval: 1
save_inference_interval: 1
save_checkpoint_path: "increment"
save_inference_path: "inference"
print_interval: 1
- name: infer_runner
class: infer
device: cpu
init_model_path: "increment/0"
print_interval: 1
phase:
- name: phase1
model: "{workspace}/model.py"
dataset_name: sample_1
thread_num: 1
#- name: infer_phase
# model: "{workspace}/model.py"
# dataset_name: infer_sample
# thread_num: 1
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import random
import pickle
def print_to_file(data, fout, slot):
if not isinstance(data, list):
data = [data]
for i in range(len(data)):
fout.write(slot + ":" + str(data[i]))
fout.write(' ')
fake_seed_users = [i for i in range(2, 20)]
target_user = [1]
length = 100
print("make train data")
with open("paddle_train.txt", "w") as fout:
for _ in range(length):
print_to_file(fake_seed_users, fout, "user_seeds")
print_to_file(target_user, fout, "target_user")
print_to_file(1, fout, "label")
fout.write("\n")
user_seeds:2 user_seeds:3 user_seeds:4 user_seeds:5 user_seeds:6 user_seeds:7 user_seeds:8 user_seeds:9 user_seeds:10 user_seeds:11 user_seeds:12 user_seeds:13 user_seeds:14 user_seeds:15 user_seeds:16 user_seeds:17 user_seeds:18 user_seeds:19 target_user:1 label:1
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from paddlerec.core.utils import envs
from paddlerec.core.model import ModelBase
class Model(ModelBase):
def __init__(self, config):
ModelBase.__init__(self, config)
def _init_hyper_parameters(self):
self.user_emb_size = envs.get_global_env(
"hyper_parameters.user_emb_size", 64)
self.user_count = envs.get_global_env("hyper_parameters.user_count",
100000)
        self.transformed_size = envs.get_global_env(
            "hyper_parameters.transformed_size", 96)
def local_attention_unit(self, user_seeds, target_user):
wl = fluid.layers.create_parameter(
shape=[self.user_emb_size, self.user_emb_size], dtype="float32")
out = fluid.layers.matmul(user_seeds,
wl) # batch_size * max_len * emb_size
out = fluid.layers.matmul(
out, target_user, transpose_y=True) # batch_size * max_len * 1
out = fluid.layers.tanh(out)
out = fluid.layers.softmax(out, axis=-2)
out = user_seeds * out
out = fluid.layers.reduce_sum(out, dim=1) # batch_size * emb_size
return out
def global_attention_unit(self, user_seeds):
wg = fluid.layers.create_parameter(
shape=[self.user_emb_size, self.user_emb_size], dtype="float32")
out = fluid.layers.matmul(user_seeds, wg)
out = fluid.layers.tanh(out)
out = fluid.layers.matmul(out, user_seeds, transpose_y=True)
out = fluid.layers.softmax(out)
out = fluid.layers.matmul(out, user_seeds)
out = fluid.layers.reduce_sum(out, dim=1)
return out
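    # Reading guide for the two units above: local_attention_unit scores each
    # seed user against the target user (seeds @ wl @ target^T), squashes with
    # tanh, softmax-normalizes along the sequence axis and returns the weighted
    # sum of the seeds; global_attention_unit is self-attention over the padded
    # seed sequence, reduced the same way to a batch_size * emb_size vector.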
def net(self, inputs, is_infer=False):
init_value_ = 0.1
user_seeds = self._sparse_data_var[1]
target_user = self._sparse_data_var[2]
self.label = self._sparse_data_var[0]
user_emb_attr = fluid.ParamAttr(name="user_emb")
user_seeds_emb = fluid.embedding(
input=user_seeds,
size=[self.user_count, self.user_emb_size],
param_attr=user_emb_attr,
is_sparse=True)
target_user_emb = fluid.embedding(
input=target_user,
size=[self.user_count, self.user_emb_size],
param_attr=user_emb_attr,
is_sparse=True) # batch_size * 1 * emb_size
user_seeds_emb = fluid.layers.reduce_sum(
user_seeds_emb, dim=1) # batch_size(with lod) * emb_size
pad_value = fluid.layers.assign(input=np.array(
[0.0], dtype=np.float32))
user_seeds_emb, _ = fluid.layers.sequence_pad(
user_seeds_emb, pad_value
) # batch_size(without lod) * max_sequence_length(in batch) * emb_size
target_transform_matrix = fluid.layers.create_parameter(
shape=[self.user_emb_size, self.transformed_size], dtype="float32")
seeds_transform_matrix = fluid.layers.create_parameter(
shape=[self.user_emb_size, self.transformed_size], dtype="float32")
        user_seeds_emb_transformed = fluid.layers.matmul(
            user_seeds_emb, seeds_transform_matrix)
        target_user_emb_transformed = fluid.layers.matmul(
            target_user_emb, target_transform_matrix)
        seeds_tower = self.local_attention_unit(
            user_seeds_emb_transformed,
            target_user_emb_transformed) + self.global_attention_unit(
                user_seeds_emb_transformed)
        target_tower = fluid.layers.reduce_sum(
            target_user_emb_transformed, dim=1)
score = fluid.layers.cos_sim(seeds_tower, target_tower)
        self.predict = fluid.layers.sigmoid(score)
        # log_loss expects a probability in (0, 1), so feed the sigmoid output
        # rather than the raw cosine similarity score.
        cost = fluid.layers.log_loss(
            input=self.predict, label=fluid.layers.cast(self.label, "float32"))
avg_cost = fluid.layers.reduce_sum(cost)
self._cost = avg_cost
predict_2d = fluid.layers.concat([1 - self.predict, self.predict], 1)
label_int = fluid.layers.cast(self.label, 'int64')
auc_var, batch_auc_var, _ = fluid.layers.auc(input=predict_2d,
label=label_int,
slide_steps=0)
self._metrics["AUC"] = auc_var
self._metrics["BATCH_AUC"] = batch_auc_var
if is_infer:
self._infer_results["AUC"] = auc_var
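For readers who want to sanity-check the shapes in `local_attention_unit`, here is a minimal NumPy sketch of the same arithmetic with toy, assumed dimensions; it mirrors the Paddle ops above but is not the PaddleRec implementation itself.

```python
import numpy as np


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


batch, max_len, emb = 2, 4, 8                    # toy sizes, assumptions only
seeds = np.random.rand(batch, max_len, emb)      # padded seed-user embeddings
target = np.random.rand(batch, 1, emb)           # target-user embedding
wl = np.random.rand(emb, emb)                    # local attention parameter

scores = np.tanh(seeds @ wl @ target.transpose(0, 2, 1))  # batch x max_len x 1
alpha = softmax(scores, axis=1)                  # normalize over the seed axis
seeds_tower_local = (seeds * alpha).sum(axis=1)  # batch x emb, as in the model
print(seeds_tower_local.shape)                   # (2, 8)
```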
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-workspace: "paddlerec.models.recall.ncf"
+workspace: "models/recall/ncf"
 dataset:
 - name: dataset_train
...
@@ -25,6 +25,7 @@
 | NCF | Neural Collaborative Filtering | [WWW 2017][Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf) |
 | GNN | SR-GNN | [AAAI 2019][Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855) |
 | Fasttext | fasttext | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) |
+| RALM | Real-time Attention Based Look-alike Model | [KDD 2019][Real-time Attention Based Look-alike Model for Recommender System](https://arxiv.org/pdf/1906.05022.pdf) |
 Below is a brief introduction to each model (note: the figures are taken from the papers linked above).
@@ -61,12 +62,15 @@
 ## Tutorial (Quick Start)
 ###
 ```shell
-python -m paddlerec.run -m paddlerec.models.recall.word2vec # word2vec
-python -m paddlerec.run -m paddlerec.models.recall.ssr # ssr
-python -m paddlerec.run -m paddlerec.models.recall.gru4rec # gru4rec
-python -m paddlerec.run -m paddlerec.models.recall.gnn # gnn
-python -m paddlerec.run -m paddlerec.models.recall.ncf # ncf
-python -m paddlerec.run -m paddlerec.models.recall.youtube_dnn # youtube_dnn
+git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
+cd paddle-rec
+
+python -m paddlerec.run -m models/recall/word2vec/config.yaml # word2vec
+python -m paddlerec.run -m models/recall/ssr/config.yaml # ssr
+python -m paddlerec.run -m models/recall/gru4rec/config.yaml # gru4rec
+python -m paddlerec.run -m models/recall/gnn/config.yaml # gnn
+python -m paddlerec.run -m models/recall/ncf/config.yaml # ncf
+python -m paddlerec.run -m models/recall/youtube_dnn/config.yaml # youtube_dnn
 ```
 ## Tutorial (Reproducing Paper Results)
@@ -86,6 +90,9 @@ sh data_prepare.sh
 ### Training
 ```bash
+git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
+cd paddle-rec
+
 cd models/recall/gnn # enter the chosen recall model's directory, taking gnn as an example
 python -m paddlerec.run -m ./config.yaml # after customizing hyperparameters, point to the config file to use the custom settings
 ```
...
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-workspace: "paddlerec.models.recall.ssr"
+workspace: "models/recall/ssr"
 dataset:
 - name: dataset_train
...
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-workspace: "paddlerec.models.recall.word2vec"
+workspace: "models/recall/word2vec"
 # list of dataset
 dataset:
...
@@ -13,7 +13,7 @@
 # limitations under the License.
-workspace: "paddlerec.models.recall.youtube_dnn"
+workspace: "models/recall/youtube_dnn"
 dataset:
 - name: dataset_train
...
@@ -13,33 +13,34 @@
 # limitations under the License.
-workspace: "paddlerec.models.rerank.listwise"
+workspace: "models/rerank/listwise"
 dataset:
 - name: dataset_train
+  batch_size: 5
   type: DataLoader
   data_path: "{workspace}/data/train"
   data_converter: "{workspace}/random_reader.py"
 - name: dataset_infer
+  batch_size: 5
   type: DataLoader
   data_path: "{workspace}/data/test"
   data_converter: "{workspace}/random_reader.py"
 hyper_parameters:
+  optimizer:
+    class: sgd
+    learning_rate: 0.01
+    strategy: async
   hidden_size: 128
   user_vocab: 200
   item_vocab: 1000
   item_len: 5
   embed_size: 16
   batch_size: 1
-  optimizer:
-    class: sgd
-    learning_rate: 0.01
-    strategy: async
 #use infer_runner mode and modify 'phase' below if infer
-mode: train_runner
-#mode: infer_runner
+mode: [train_runner, infer_runner]
 runner:
 - name: train_runner
@@ -48,19 +49,22 @@ runner:
   epochs: 3
   save_checkpoint_interval: 2
   save_inference_interval: 4
-  save_checkpoint_path: "increment"
+  save_checkpoint_path: "increment_listwise"
   save_inference_path: "inference"
+  print_interval: 1
+  phases: [train]
 - name: infer_runner
   class: infer
-  init_model_path: "increment/0"
+  init_model_path: "increment_listwise/2"
   device: cpu
+  phases: [infer]
 phase:
 - name: train
   model: "{workspace}/model.py"
   dataset_name: dataset_train
   thread_num: 1
-#- name: infer
-#  model: "{workspace}/model.py"
-#  dataset_name: dataset_infer
-#  thread_num: 1
+- name: infer
+  model: "{workspace}/model.py"
+  dataset_name: dataset_infer
+  thread_num: 1
@@ -28,7 +28,10 @@
 ## Tutorial (Quick Start)
 ```shell
-python -m paddlerec.run -m paddlerec.models.rerank.listwise # listwise
+git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
+cd paddle-rec
+
+python -m paddlerec.run -m models/rerank/listwise/config.yaml # listwise
 ```
 ## Tutorial (Reproducing Paper Results)
...
@@ -8,7 +8,10 @@
 2. Starting from the single-machine model, distributed parameter-server training can be performed
 ```shell
-python -m paddlerec.run -m paddlerec.models.treebased.tdm
+git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
+cd paddle-rec
+
+python -m paddlerec.run -m models/treebased/tdm/config.yaml
 ```
 ## Preparing the tree structure
...
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-workspace: "paddlerec.models.treebased.tdm"
+workspace: "models/treebased/tdm"
 # list of dataset
 dataset:
...
@@ -16,7 +16,6 @@ import os
 import subprocess
 import sys
 import argparse
-import tempfile
 import warnings
 import copy
@@ -39,6 +38,7 @@ def engine_registry():
     engines["TRANSPILER"]["INFER"] = single_infer_engine
     engines["TRANSPILER"]["LOCAL_CLUSTER_TRAIN"] = local_cluster_engine
     engines["TRANSPILER"]["CLUSTER_TRAIN"] = cluster_engine
+    engines["TRANSPILER"]["ONLINE_LEARNING"] = online_learning
     engines["PSLIB"]["TRAIN"] = local_mpi_engine
     engines["PSLIB"]["LOCAL_CLUSTER_TRAIN"] = local_mpi_engine
     engines["PSLIB"]["CLUSTER_TRAIN"] = cluster_mpi_engine
@@ -259,6 +259,20 @@ def single_infer_engine(args):
     return trainer
+
+def online_learning(args):
+    trainer = "OnlineLearningTrainer"
+    single_envs = {}
+    single_envs["train.trainer.trainer"] = trainer
+    single_envs["train.trainer.threads"] = "2"
+    single_envs["train.trainer.engine"] = "online_learning"
+    single_envs["train.trainer.platform"] = envs.get_platform()
+    print("use {} engine to run model: {}".format(trainer, args.model))
+    set_runtime_envs(single_envs, args.model)
+    trainer = TrainerFactory.create(args.model)
+    return trainer
+
 def cluster_engine(args):
     def master():
         from paddlerec.core.engine.cluster.cluster import ClusterEngine
...
@@ -38,15 +38,18 @@ readme = ""
 def build(dirname):
     package_dir = os.path.dirname(os.path.abspath(__file__))
     shutil.copytree(
-        package_dir, dirname, ignore=shutil.ignore_patterns(".git"))
+        package_dir,
+        dirname,
+        ignore=shutil.ignore_patterns(".git", "models", "build", "dist",
+                                      "*.md"))
     os.mkdir(os.path.join(dirname, "paddlerec"))
     shutil.move(
         os.path.join(dirname, "core"), os.path.join(dirname, "paddlerec"))
     shutil.move(
         os.path.join(dirname, "doc"), os.path.join(dirname, "paddlerec"))
-    shutil.move(
-        os.path.join(dirname, "models"), os.path.join(dirname, "paddlerec"))
     shutil.move(
         os.path.join(dirname, "tests"), os.path.join(dirname, "paddlerec"))
     shutil.move(
@@ -63,17 +66,8 @@ def build(dirname):
     package_dir = {'': dirname}
     package_data = {}
-    models_copy = [
-        'data/*.txt', 'data/*/*.txt', '*.yaml', '*.sh', 'tree/*.npy',
-        'tree/*.txt', 'data/sample_data/*', 'data/sample_data/train/*',
-        'data/sample_data/infer/*', 'data/*/*.csv', 'Criteo_data/*',
-        'Criteo_data/sample_data/train/*'
-    ]
     engine_copy = ['*/*.sh', '*/*.template']
     for package in packages:
-        if package.startswith("paddlerec.models."):
-            package_data[package] = models_copy
         if package.startswith("paddlerec.core.engine"):
             package_data[package] = engine_copy
@@ -98,16 +92,6 @@ build(dirname)
 shutil.rmtree(dirname)
 print(u'''
-\033[32m
-██████╗ █████╗ ██████╗ ██████╗ ██╗ ███████╗██████╗ ███████╗ ██████╗
-██╔══██╗██╔══██╗██╔══██╗██╔══██╗██║ ██╔════╝██╔══██╗██╔════╝██╔════╝
-██████╔╝███████║██║ ██║██║ ██║██║ █████╗ ██████╔╝█████╗ ██║
-██╔═══╝ ██╔══██║██║ ██║██║ ██║██║ ██╔══╝ ██╔══██╗██╔══╝ ██║
-██║ ██║ ██║██████╔╝██████╔╝███████╗███████╗██║ ██║███████╗╚██████╗
-╚═╝ ╚═╝ ╚═╝╚═════╝ ╚═════╝ ╚══════╝╚══════╝╚═╝ ╚═╝╚══════╝ ╚═════╝
-\033[0m
-\033[34m
 Installation Complete. Congratulations!
 How to use it ? Please visit our website: https://github.com/PaddlePaddle/PaddleRec
-\033[0m
 ''')
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import unittest
import numpy as np
from paddlerec.core.metrics import AUC
import paddle
import paddle.fluid as fluid
class TestAUC(unittest.TestCase):
def setUp(self):
self.ins_num = 64
self.batch_nums = 3
self.datas = []
for i in range(self.batch_nums):
probs = np.random.uniform(0, 1.0,
(self.ins_num, 2)).astype('float32')
labels = np.random.choice(range(2), self.ins_num).reshape(
(self.ins_num, 1)).astype('int64')
self.datas.append((probs, labels))
self.place = fluid.core.CPUPlace()
self.num_thresholds = 2**12
python_auc = fluid.metrics.Auc(name="auc",
curve='ROC',
num_thresholds=self.num_thresholds)
for i in range(self.batch_nums):
python_auc.update(self.datas[i][0], self.datas[i][1])
self.auc = np.array(python_auc.eval())
def build_network(self):
predict = fluid.data(
name="predict", shape=[-1, 2], dtype='float32', lod_level=0)
label = fluid.data(
name="label", shape=[-1, 1], dtype='int64', lod_level=0)
auc = AUC(input=predict,
label=label,
num_thresholds=self.num_thresholds,
curve='ROC')
return auc
def test_forward(self):
precision_recall = self.build_network()
metrics = precision_recall.get_result()
fetch_vars = []
metric_keys = []
for item in metrics.items():
fetch_vars.append(item[1])
metric_keys.append(item[0])
exe = fluid.Executor(self.place)
exe.run(fluid.default_startup_program())
for i in range(self.batch_nums):
outs = exe.run(
fluid.default_main_program(),
feed={'predict': self.datas[i][0],
'label': self.datas[i][1]},
fetch_list=fetch_vars,
return_numpy=True)
outs = dict(zip(metric_keys, outs))
self.assertTrue(np.allclose(outs['AUC'], self.auc))
def test_exception(self):
self.assertRaises(Exception, AUC)
self.assertRaises(
Exception, AUC, input=self.datas[0][0], label=self.datas[0][1]),
if __name__ == '__main__':
unittest.main()
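As a reference point for what the asserted value means: AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The snippet below computes that pairwise probability directly on a toy example; it is only an illustration, while the op under test approximates the same quantity with `num_thresholds` histogram buckets.

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.6, 0.35, 0.2])
labels = np.array([1, 1, 1, 0, 0])

pos = scores[labels == 1]
neg = scores[labels == 0]
# count positive/negative pairs where the positive outranks the negative; ties count 0.5
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
print(wins / (len(pos) * len(neg)))  # 1.0 here: every positive outscores every negative
```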
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import unittest
import numpy as np
from paddlerec.core.metrics import PosNegRatio
import paddle
import paddle.fluid as fluid
class TestPosNegRatio(unittest.TestCase):
def setUp(self):
self.ins_num = 64
self.batch_nums = 3
self.datas = []
self.right_cnt = 0.0
self.wrong_cnt = 0.0
for i in range(self.batch_nums):
neg_score = np.random.uniform(0, 1.0,
(self.ins_num, 1)).astype('float32')
pos_score = np.random.uniform(0, 1.0,
(self.ins_num, 1)).astype('float32')
right_cnt = np.sum(np.less(neg_score, pos_score)).astype('int32')
wrong_cnt = np.sum(np.less_equal(pos_score, neg_score)).astype(
'int32')
self.right_cnt += float(right_cnt)
self.wrong_cnt += float(wrong_cnt)
self.datas.append((pos_score, neg_score))
self.place = fluid.core.CPUPlace()
def build_network(self):
pos_score = fluid.data(
name="pos_score", shape=[-1, 1], dtype='float32', lod_level=0)
neg_score = fluid.data(
name="neg_score", shape=[-1, 1], dtype='float32', lod_level=0)
pairwise_pn = PosNegRatio(pos_score=pos_score, neg_score=neg_score)
return pairwise_pn
def test_forward(self):
pn = self.build_network()
metrics = pn.get_result()
fetch_vars = []
metric_keys = []
for item in metrics.items():
fetch_vars.append(item[1])
metric_keys.append(item[0])
exe = fluid.Executor(self.place)
exe.run(fluid.default_startup_program())
for i in range(self.batch_nums):
outs = exe.run(fluid.default_main_program(),
feed={
'pos_score': self.datas[i][0],
'neg_score': self.datas[i][1]
},
fetch_list=fetch_vars,
return_numpy=True)
outs = dict(zip(metric_keys, outs))
self.assertTrue(np.allclose(outs['RightCnt'], self.right_cnt))
self.assertTrue(np.allclose(outs['WrongCnt'], self.wrong_cnt))
self.assertTrue(
np.allclose(outs['PN'],
np.array((self.right_cnt + 1.0) / (self.wrong_cnt + 1.0
))))
def test_exception(self):
self.assertRaises(Exception, PosNegRatio)
self.assertRaises(
Exception,
PosNegRatio,
pos_score=self.datas[0][0],
neg_score=self.datas[0][1]),
if __name__ == '__main__':
unittest.main()
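A note on the expected value asserted above: the positive/negative ratio is `(right_cnt + 1) / (wrong_cnt + 1)`. The `+ 1` terms look like simple smoothing that keeps the ratio finite when no wrong pairs are observed (an interpretation, not something the code states):

```python
# Tiny numeric check of the smoothed ratio used in the test above.
right_cnt, wrong_cnt = 42.0, 0.0
pn = (right_cnt + 1.0) / (wrong_cnt + 1.0)
print(pn)  # 43.0 instead of a division by zero
```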
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import unittest
import numpy as np
from paddlerec.core.metrics import PrecisionRecall
import paddle
import paddle.fluid as fluid
def calc_precision(tp_count, fp_count):
if tp_count > 0.0 or fp_count > 0.0:
return tp_count / (tp_count + fp_count)
return 1.0
def calc_recall(tp_count, fn_count):
if tp_count > 0.0 or fn_count > 0.0:
return tp_count / (tp_count + fn_count)
return 1.0
def calc_f1_score(precision, recall):
if precision > 0.0 or recall > 0.0:
return 2 * precision * recall / (precision + recall)
return 0.0
def get_states(idxs, labels, cls_num, weights=None, batch_nums=1):
ins_num = idxs.shape[0]
# TP FP TN FN
states = np.zeros((cls_num, 4)).astype('float32')
for i in range(ins_num):
w = weights[i] if weights is not None else 1.0
idx = idxs[i][0]
label = labels[i][0]
if idx == label:
states[idx][0] += w
for j in range(cls_num):
states[j][2] += w
states[idx][2] -= w
else:
states[label][3] += w
states[idx][1] += w
for j in range(cls_num):
states[j][2] += w
states[label][2] -= w
states[idx][2] -= w
return states
def compute_metrics(states, cls_num):
total_tp_count = 0.0
total_fp_count = 0.0
total_fn_count = 0.0
macro_avg_precision = 0.0
macro_avg_recall = 0.0
for i in range(cls_num):
total_tp_count += states[i][0]
total_fp_count += states[i][1]
total_fn_count += states[i][3]
macro_avg_precision += calc_precision(states[i][0], states[i][1])
macro_avg_recall += calc_recall(states[i][0], states[i][3])
metrics = []
macro_avg_precision /= cls_num
macro_avg_recall /= cls_num
metrics.append(macro_avg_precision)
metrics.append(macro_avg_recall)
metrics.append(calc_f1_score(macro_avg_precision, macro_avg_recall))
micro_avg_precision = calc_precision(total_tp_count, total_fp_count)
metrics.append(micro_avg_precision)
micro_avg_recall = calc_recall(total_tp_count, total_fn_count)
metrics.append(micro_avg_recall)
metrics.append(calc_f1_score(micro_avg_precision, micro_avg_recall))
return np.array(metrics).astype('float32')
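# Note on the two averaging schemes above: the macro precision/recall average
# the per-class values, so every class counts equally; the micro variants are
# computed from the pooled TP/FP/FN totals, so frequent classes weigh more.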
class TestPrecisionRecall(unittest.TestCase):
def setUp(self):
self.ins_num = 64
self.cls_num = 10
self.batch_nums = 3
self.datas = []
self.states = np.zeros((self.cls_num, 4)).astype('float32')
for i in range(self.batch_nums):
probs = np.random.uniform(0, 1.0, (self.ins_num,
self.cls_num)).astype('float32')
idxs = np.array(np.argmax(
probs, axis=1)).reshape(self.ins_num, 1).astype('int32')
labels = np.random.choice(range(self.cls_num),
self.ins_num).reshape(
(self.ins_num, 1)).astype('int32')
self.datas.append((probs, labels))
states = get_states(idxs, labels, self.cls_num)
self.states = np.add(self.states, states)
self.metrics = compute_metrics(self.states, self.cls_num)
self.place = fluid.core.CPUPlace()
def build_network(self):
predict = fluid.data(
name="predict",
shape=[-1, self.cls_num],
dtype='float32',
lod_level=0)
label = fluid.data(
name="label", shape=[-1, 1], dtype='int32', lod_level=0)
precision_recall = PrecisionRecall(
input=predict, label=label, class_num=self.cls_num)
return precision_recall
def test_forward(self):
precision_recall = self.build_network()
metrics = precision_recall.get_result()
fetch_vars = []
metric_keys = []
for item in metrics.items():
fetch_vars.append(item[1])
metric_keys.append(item[0])
exe = fluid.Executor(self.place)
exe.run(fluid.default_startup_program())
for i in range(self.batch_nums):
outs = exe.run(
fluid.default_main_program(),
feed={'predict': self.datas[i][0],
'label': self.datas[i][1]},
fetch_list=fetch_vars,
return_numpy=True)
outs = dict(zip(metric_keys, outs))
self.assertTrue(np.allclose(outs['[TP FP TN FN]'], self.states))
self.assertTrue(np.allclose(outs['precision_recall_f1'], self.metrics))
def test_exception(self):
self.assertRaises(Exception, PrecisionRecall)
self.assertRaises(
Exception,
PrecisionRecall,
input=self.datas[0][0],
label=self.datas[0][1],
class_num=self.cls_num)
if __name__ == '__main__':
unittest.main()
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import unittest
import numpy as np
from paddlerec.core.metrics import RecallK
import paddle
import paddle.fluid as fluid
class TestRecallK(unittest.TestCase):
def setUp(self):
self.ins_num = 64
self.cls_num = 10
self.topk = 2
self.batch_nums = 3
self.datas = []
self.match_num = 0.0
for i in range(self.batch_nums):
z = np.random.uniform(0, 1.0, (self.ins_num,
self.cls_num)).astype('float32')
pred = np.exp(z) / sum(np.exp(z))
label = np.random.choice(range(self.cls_num),
self.ins_num).reshape(
(self.ins_num, 1)).astype('int64')
self.datas.append((pred, label))
max_k_preds = pred.argsort(
axis=1)[:, -self.topk:][:, ::-1] #top-k label
match_array = np.logical_or.reduce(max_k_preds == label, axis=1)
self.match_num += np.sum(match_array).astype('float32')
self.place = fluid.core.CPUPlace()
def build_network(self):
pred = fluid.data(
name="pred",
shape=[-1, self.cls_num],
dtype='float32',
lod_level=0)
label = fluid.data(
name="label", shape=[-1, 1], dtype='int64', lod_level=0)
recall_k = RecallK(input=pred, label=label, k=self.topk)
return recall_k
def test_forward(self):
net = self.build_network()
metrics = net.get_result()
fetch_vars = []
metric_keys = []
for item in metrics.items():
fetch_vars.append(item[1])
metric_keys.append(item[0])
exe = fluid.Executor(self.place)
exe.run(fluid.default_startup_program())
for i in range(self.batch_nums):
outs = exe.run(
fluid.default_main_program(),
feed={'pred': self.datas[i][0],
'label': self.datas[i][1]},
fetch_list=fetch_vars,
return_numpy=True)
outs = dict(zip(metric_keys, outs))
self.assertTrue(
np.allclose(outs['InsCnt'], self.ins_num * self.batch_nums))
self.assertTrue(np.allclose(outs['RecallCnt'], self.match_num))
self.assertTrue(
np.allclose(outs['Acc(Recall@%d)' % (self.topk)],
np.array(self.match_num / (self.ins_num *
self.batch_nums))))
def test_exception(self):
self.assertRaises(Exception, RecallK)
self.assertRaises(
Exception, RecallK, input=self.datas[0][0],
label=self.datas[0][1]),
if __name__ == '__main__':
unittest.main()
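The top-k matching used in the test above can be easier to read on hand-picked numbers. A small illustrative check (assumed values, mirroring the `max_k_preds` / `match_array` lines):

```python
import numpy as np

pred = np.array([[0.1, 0.7, 0.2],   # top-2 classes: 1, 2
                 [0.5, 0.3, 0.2]])  # top-2 classes: 0, 1
label = np.array([[2], [2]])
topk = 2
max_k_preds = pred.argsort(axis=1)[:, -topk:][:, ::-1]
match_array = np.logical_or.reduce(max_k_preds == label, axis=1)
print(match_array.mean())  # 0.5: only the first row has its label in the top-2
```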