diff --git a/.travis.yml b/.travis.yml index 2d7eddf950dea628e35108418d0f663993578d60..cee9ec6db72f4f84da037faafae6dc15db6a23cc 100644 --- a/.travis.yml +++ b/.travis.yml @@ -16,13 +16,20 @@ before_install: # For pylint dockstring checker - sudo apt-get update - sudo apt-get install -y python-pip libpython-dev + - sudo apt-get remove python-urllib3 + - sudo apt-get purge python-urllib3 + - sudo rm /usr/lib/python2.7/dist-packages/chardet-* - sudo pip install -U pip + - sudo pip install --upgrade setuptools - sudo pip install six --upgrade --ignore-installed six - - sudo pip install pillow - sudo pip install PyYAML - sudo pip install pylint pytest astroid isort pre-commit - sudo pip install kiwisolver - - sudo pip install paddlepaddle==1.7.2 --ignore-installed urllib3 + - sudo pip install scikit-build + - sudo pip install Pillow==5.3.0 + - sudo pip install opencv-python==3.4.3.18 + - sudo pip install rarfile==3.0 + - sudo pip install paddlepaddle==1.7.2 - sudo python setup.py install - | function timeout() { perl -e 'alarm shift; exec @ARGV' "$@"; } diff --git a/README.md b/README.md index 86f45a9e192b6c2e3d3b5339e45eae399b99911f..84c53d2a06ee7b52ae7c89187fb0316730390f01 100644 --- a/README.md +++ b/README.md @@ -1,105 +1,111 @@ -([简体中文](./README_CN.md)|English) +(简体中文|[English](./README_EN.md)) +

- + +

+

+

-

What is recommendation system ?

+

什么是推荐系统?

- +

-- Recommendation system helps users quickly find useful and interesting information from massive data. - -- Recommendation system is also a silver bullet to attract users, retain users, increase users' stickness or conversionn. - - > Who can better use the recommendation system, who can gain more advantage in the fierce competition. - > - > At the same time, there are many problems in the process of using the recommendation system, such as: huge data, complex model, inefficient distributed training, and so on. - -

What is PaddleRec ?

- - -- A quick start tool of search & recommendation algorithm based on [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html) -- A complete solution of recommendation system for beginners, developers and researchers. -- Recommendation algorithm library including content-understanding, match, recall, rank, multi-task, re-rank etc. - - - | Type | Algorithm | CPU | GPU | Parameter-Server | Multi-GPU | Paper | - | :-------------------: | :-----------------------------------------------------------------------: | :---: | :-----: | :--------------: | :-------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | - | Content-Understanding | [Text-Classifcation](models/contentunderstanding/classification/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][Convolutional neural networks for sentence classication](https://www.aclweb.org/anthology/D14-1181.pdf) | - | Content-Understanding | [TagSpace](models/contentunderstanding/tagspace/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) | - | Match | [DSSM](models/match/dssm/model.py) | ✓ | ✓ | ✓ | x | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) | - | Match | [MultiView-Simnet](models/match/multiview-simnet/model.py) | ✓ | ✓ | ✓ | x | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) | - | Recall | [TDM](models/treebased/tdm/model.py) | ✓ | >=1.8.0 | ✓ | >=1.8.0 | [KDD 2018][Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf) | - | 
Recall | [fasttext](models/recall/fasttext/model.py) | ✓ | ✓ | x | x | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) | - | Recall | [Word2Vec](models/recall/word2vec/model.py) | ✓ | ✓ | ✓ | x | [NIPS 2013][Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) | - | Recall | [SSR](models/recall/ssr/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2016][Multi-Rate Deep Learning for Temporal Recommendation](http://sonyis.me/paperpdf/spr209-song_sigir16.pdf) | - | Recall | [Gru4Rec](models/recall/gru4rec/model.py) | ✓ | ✓ | ✓ | ✓ | [2015][Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939) | - | Recall | [Youtube_dnn](models/recall/youtube_dnn/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys 2016][Deep Neural Networks for YouTube Recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) | - | Recall | [NCF](models/recall/ncf/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2017][Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf) | - | Recall | [GNN](models/recall/gnn/model.py) | ✓ | ✓ | ✓ | ✓ | [AAAI 2019][Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855) | - | Rank | [Logistic Regression](models/rank/logistic_regression/model.py) | ✓ | x | ✓ | x | / | - | Rank | [Dnn](models/rank/dnn/model.py) | ✓ | ✓ | ✓ | ✓ | / | - | Rank | [FM](models/rank/fm/model.py) | ✓ | x | ✓ | x | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) | - | Rank | [FFM](models/rank/ffm/model.py) | ✓ | x | ✓ | x | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) | - | Rank | 
[FNN](models/rank/fnn/model.py) | ✓ | x | ✓ | x | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) | - | Rank | [Deep Crossing](models/rank/deep_crossing/model.py) | ✓ | x | ✓ | x | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) | - | Rank | [Pnn](models/rank/pnn/model.py) | ✓ | x | ✓ | x | [ICDM 2016][Product-based Neural Networks for User Response Prediction](https://arxiv.org/pdf/1611.00144.pdf) | - | Rank | [DCN](models/rank/dcn/model.py) | ✓ | x | ✓ | x | [KDD 2017][Deep & Cross Network for Ad Click Predictions](https://dl.acm.org/doi/pdf/10.1145/3124749.3124754) | - | Rank | [NFM](models/rank/nfm/model.py) | ✓ | x | ✓ | x | [SIGIR 2017][Neural Factorization Machines for Sparse Predictive Analytics](https://dl.acm.org/doi/pdf/10.1145/3077136.3080777) | - | Rank | [AFM](models/rank/afm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks](https://arxiv.org/pdf/1708.04617.pdf) | - | Rank | [DeepFM](models/rank/deepfm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/pdf/1703.04247.pdf) | - | Rank | [xDeepFM](models/rank/xdeepfm/model.py) | ✓ | x | ✓ | x | [KDD 2018][xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/3219819.3220023) | - | Rank | [DIN](models/rank/din/model.py) | ✓ | x | ✓ | x | [KDD 2018][Deep Interest Network for Click-Through Rate Prediction](https://dl.acm.org/doi/pdf/10.1145/3219819.3219823) | - | Rank | [DIEN](models/rank/dien/model.py) | ✓ | x | ✓ | x | [AAAI 2019][Deep Interest Evolution Network for Click-Through Rate Prediction](https://www.aaai.org/ojs/index.php/AAAI/article/view/4545/4423) | - | Rank | [BST](models/rank/BST/model.py) | ✓ | x 
| ✓ | x | [DLP-KDD 2019][Behavior Sequence Transformer for E-commerce Recommendation in Alibaba](https://arxiv.org/pdf/1905.06874v1.pdf) | - | Rank | [AutoInt](models/rank/AutoInt/model.py) | ✓ | x | ✓ | x | [CIKM 2019][AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/pdf/1810.11921.pdf) | - | Rank | [Wide&Deep](models/rank/wide_deep/model.py) | ✓ | x | ✓ | x | [DLRS 2016][Wide & Deep Learning for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/2988450.2988454) | - | Rank | [FGCNN](models/rank/fgcnn/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2019][Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1904.04447.pdf) | - | Rank | [Fibinet](models/rank/fibinet/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction]( https://arxiv.org/pdf/1905.09433.pdf) | - | Rank | [Flen](models/rank/flen/model.py) | ✓ | ✓ | ✓ | ✓ | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf) | - | Multi-Task | [ESMM](models/multitask/esmm/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) | - | Multi-Task | [MMOE](models/multitask/mmoe/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) | - | Multi-Task | [ShareBottom](models/multitask/share-bottom/model.py) | ✓ | ✓ | ✓ | ✓ | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf) | - | Re-Rank | [Listwise](models/rerank/listwise/model.py) | ✓ | ✓ | ✓ | x | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) | - - - - - -

Getting Started

- -### Environmental requirements +- 推荐系统是在互联网信息爆炸式增长的时代背景下,帮助用户高效获得感兴趣信息的关键; + +- 推荐系统也是帮助产品最大限度吸引用户、留存用户、增加用户粘性、提高用户转化率的银弹。 + +- 有无数优秀的产品依靠用户可感知的推荐系统建立了良好的口碑,也有无数的公司依靠直击用户痛点的推荐系统在行业中占领了一席之地。 + + > 可以说,谁能掌握和利用好推荐系统,谁就能在信息分发的激烈竞争中抢得先机。 + > 但与此同时,有着许多问题困扰着推荐系统的开发者,比如:庞大的数据量,复杂的模型结构,低效的分布式训练环境,波动的在离线一致性,苛刻的上线部署要求,以上种种,不胜枚举。 + +

什么是PaddleRec?

+ + +- 源于飞桨生态的搜索推荐模型 **一站式开箱即用工具** +- 适合初学者,开发者,研究者的推荐系统全流程解决方案 +- 包含内容理解、匹配、召回、排序、 多任务、重排序等多个任务的完整推荐搜索算法库 + + + | 方向 | 模型 | 单机CPU | 单机GPU | 分布式CPU | 分布式GPU | 论文 | + | :------: | :-----------------------------------------------------------------------: | :-----: | :-----: | :-------: | :-------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | + | 内容理解 | [Text-Classifcation](models/contentunderstanding/classification/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][Convolutional neural networks for sentence classication](https://www.aclweb.org/anthology/D14-1181.pdf) | + | 内容理解 | [TagSpace](models/contentunderstanding/tagspace/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) | + | 匹配 | [DSSM](models/match/dssm/model.py) | ✓ | ✓ | ✓ | x | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) | + | 匹配 | [MultiView-Simnet](models/match/multiview-simnet/model.py) | ✓ | ✓ | ✓ | x | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) | + | 召回 | [TDM](models/treebased/tdm/model.py) | ✓ | >=1.8.0 | ✓ | >=1.8.0 | [KDD 2018][Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf) | + | 召回 | [fasttext](models/recall/fasttext/model.py) | ✓ | ✓ | x | x | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) | + | 召回 | [Word2Vec](models/recall/word2vec/model.py) | ✓ | ✓ | ✓ | x | [NIPS 2013][Distributed Representations of Words and Phrases and their 
Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) | + | 召回 | [SSR](models/recall/ssr/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2016][Multi-Rate Deep Learning for Temporal Recommendation](http://sonyis.me/paperpdf/spr209-song_sigir16.pdf) | + | 召回 | [Gru4Rec](models/recall/gru4rec/model.py) | ✓ | ✓ | ✓ | ✓ | [2015][Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939) | + | 召回 | [Youtube_dnn](models/recall/youtube_dnn/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys 2016][Deep Neural Networks for YouTube Recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) | + | 召回 | [NCF](models/recall/ncf/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2017][Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf) | + | 召回 | [GNN](models/recall/gnn/model.py) | ✓ | ✓ | ✓ | ✓ | [AAAI 2019][Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855) | + | 召回 | [RALM](models/recall/look-alike_recall/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2019][Real-time Attention Based Look-alike Model for Recommender System](https://arxiv.org/pdf/1906.05022.pdf) | + | 排序 | [Logistic Regression](models/rank/logistic_regression/model.py) | ✓ | x | ✓ | x | / | + | 排序 | [Dnn](models/rank/dnn/model.py) | ✓ | ✓ | ✓ | ✓ | / | + | 排序 | [FM](models/rank/fm/model.py) | ✓ | x | ✓ | x | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) | + | 排序 | [FFM](models/rank/ffm/model.py) | ✓ | x | ✓ | x | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) | + | 排序 | [FNN](models/rank/fnn/model.py) | ✓ | x | ✓ | x | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) | + | 排序 | [Deep 
Crossing](models/rank/deep_crossing/model.py) | ✓ | x | ✓ | x | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) | + | 排序 | [Pnn](models/rank/pnn/model.py) | ✓ | x | ✓ | x | [ICDM 2016][Product-based Neural Networks for User Response Prediction](https://arxiv.org/pdf/1611.00144.pdf) | + | 排序 | [DCN](models/rank/dcn/model.py) | ✓ | x | ✓ | x | [KDD 2017][Deep & Cross Network for Ad Click Predictions](https://dl.acm.org/doi/pdf/10.1145/3124749.3124754) | + | 排序 | [NFM](models/rank/nfm/model.py) | ✓ | x | ✓ | x | [SIGIR 2017][Neural Factorization Machines for Sparse Predictive Analytics](https://dl.acm.org/doi/pdf/10.1145/3077136.3080777) | + | 排序 | [AFM](models/rank/afm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks](https://arxiv.org/pdf/1708.04617.pdf) | + | 排序 | [DeepFM](models/rank/deepfm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/pdf/1703.04247.pdf) | + | 排序 | [xDeepFM](models/rank/xdeepfm/model.py) | ✓ | x | ✓ | x | [KDD 2018][xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/3219819.3220023) | + | 排序 | [DIN](models/rank/din/model.py) | ✓ | x | ✓ | x | [KDD 2018][Deep Interest Network for Click-Through Rate Prediction](https://dl.acm.org/doi/pdf/10.1145/3219819.3219823) | + | 排序 | [DIEN](models/rank/dien/model.py) | ✓ | x | ✓ | x | [AAAI 2019][Deep Interest Evolution Network for Click-Through Rate Prediction](https://www.aaai.org/ojs/index.php/AAAI/article/view/4545/4423) | + | 排序 | [BST](models/rank/BST/model.py) | ✓ | x | ✓ | x | [DLP_KDD 2019][Behavior Sequence Transformer for E-commerce Recommendation in Alibaba](https://arxiv.org/pdf/1905.06874v1.pdf) | + | 排序 | 
[AutoInt](models/rank/AutoInt/model.py) | ✓ | x | ✓ | x | [CIKM 2019][AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/pdf/1810.11921.pdf) | + | 排序 | [Wide&Deep](models/rank/wide_deep/model.py) | ✓ | x | ✓ | x | [DLRS 2016][Wide & Deep Learning for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/2988450.2988454) | + | 排序 | [FGCNN](models/rank/fgcnn/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2019][Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1904.04447.pdf) | + | 排序 | [Fibinet](models/rank/fibinet/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction]( https://arxiv.org/pdf/1905.09433.pdf) | + | 排序 | [Flen](models/rank/flen/model.py) | ✓ | ✓ | ✓ | ✓ | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf) | + | 多任务 | [ESMM](models/multitask/esmm/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) | + | 多任务 | [MMOE](models/multitask/mmoe/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) | + | 多任务 | [ShareBottom](models/multitask/share-bottom/model.py) | ✓ | ✓ | ✓ | ✓ | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf) | + | 重排序 | [Listwise](models/rerank/listwise/model.py) | ✓ | ✓ | ✓ | x | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) | + + + + + +

快速安装

+ +### 环境要求 * Python 2.7/ 3.5 / 3.6 / 3.7 * PaddlePaddle >= 1.7.2 -* operating system: Windows/Mac/Linux +* 操作系统: Windows/Mac/Linux - > Linux is recommended for distributed training + > Windows下PaddleRec目前仅支持单机训练,分布式训练建议使用Linux环境 -### Installation +### 安装命令 -1. **Install by pip** +- 安装方法一 **PIP源直接安装** ```bash python -m pip install paddle-rec ``` - > This method will download and install `paddlepaddle-v1.7.2-cpu`. If `PaddlePaddle` can not be installed automatically,You need to install `PaddlePaddle` manually,and then install `PaddleRec` again: - > - Download [PaddlePaddle](https://pypi.org/project/paddlepaddle/1.7.2/#files) and install by pip. - > - Install `PaddlePaddle` by pip,`python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple` - > - Other installation problems can be raised in [Paddle Issue](https://github.com/PaddlePaddle/Paddle/issues) or [PaddleRec Issue](https://github.com/PaddlePaddle/PaddleRec/issues) + > 该方法会默认下载安装`paddlepaddle v1.7.2 cpu版本`,若提示`PaddlePaddle`无法安装,则依照下述方法首先安装`PaddlePaddle`,再安装`PaddleRec`: + > - 可以在[该地址](https://pypi.org/project/paddlepaddle/1.7.2/#files),下载PaddlePaddle后手动安装whl包 + > - 可以先pip安装`PaddlePaddle`,`python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple` + > - 其他安装问题可以在[Paddle Issue](https://github.com/PaddlePaddle/Paddle/issues)或[PaddleRec Issue](https://github.com/PaddlePaddle/PaddleRec/issues)提出,会有工程师及时解答 -2. 
**Install by source code** - - - Install PaddlePaddle +- 安装方法二 **源码编译安装** + + - 安装飞桨 **注:需要用户安装版本 == 1.7.2 的飞桨** ```shell python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple ``` - - Install PaddleRec by source code + - 源码安装PaddleRec ``` git clone https://github.com/PaddlePaddle/PaddleRec/ @@ -107,54 +113,58 @@ python setup.py install ``` -- Install PaddleRec-GPU +- PaddleRec-GPU安装方法 - After installing `PaddleRec`,please install the appropriate version of `paddlepaddle-gpu` according to your environment (CUDA / cudnn),refer to the installation tutorial [Installation Manuals](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html) + 在使用方法一或方法二完成PaddleRec安装后,需再手动安装`paddlepaddle-gpu`,并根据自身环境(Cuda/Cudnn)选择合适的版本,安装教程请查阅[飞桨-开始使用](https://www.paddlepaddle.org.cn/install/quick) -

Quick Start

+

一键启动

-We take the `dnn` algorithm as an example to get start of `PaddleRec`, and we take 100 pieces of training data from [Criteo Dataset](https://www.kaggle.com/c/criteo-display-ad-challenge/): +我们以排序模型中的`dnn`模型为例介绍PaddleRec的一键启动。训练数据来源为[Criteo数据集](https://www.kaggle.com/c/criteo-display-ad-challenge/),我们从中截取了100条数据: ```bash -# Training with cpu -python -m paddlerec.run -m paddlerec.models.rank.dnn +# 使用CPU进行单机训练 +git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec +cd paddle-rec + +python -m paddlerec.run -m models/rank/dnn/config.yaml ``` -

Documentation

+

帮助文档

-### Background -* [Recommendation System](doc/rec_background.md) -* [Distributed deep learning](doc/ps_background.md) +### 项目背景 +* [推荐系统介绍](doc/rec_background.md) +* [分布式深度学习介绍](doc/ps_background.md) -### Introductory Project -* [Get start of PaddleRec in ten minutes](https://aistudio.baidu.com/aistudio/projectdetail/559336) +### 快速开始 +* [十分钟上手PaddleRec](https://aistudio.baidu.com/aistudio/projectdetail/559336) -### Introductory tutorial -* [Data](doc/slot_reader.md) -* [Model](doc/model.md) -* [Loacl Train](doc/train.md) -* [Distributed Train](doc/distributed_train.md) -* [Predict](doc/predict.md) -* [Serving](doc/serving.md) +### 入门教程 +* [数据准备](doc/slot_reader.md) +* [模型调参](doc/model.md) +* [启动单机训练](doc/train.md) +* [启动分布式训练](doc/distributed_train.md) +* [启动预测](doc/predict.md) +* [快速部署](doc/serving.md) +* [预训练模型](doc/pre_train_model.md) -### Advanced tutorial -* [Custom Reader](doc/custom_reader.md) -* [Custom Model](doc/model_develop.md) -* [Custom Training Process](doc/trainer_develop.md) -* [Configuration description of yaml](doc/yaml.md) -* [Design document of PaddleRec](doc/design.md) +### 进阶教程 +* [自定义Reader](doc/custom_reader.md) +* [自定义模型](doc/model_develop.md) +* [自定义流程](doc/trainer_develop.md) +* [yaml配置说明](doc/yaml.md) +* [PaddleRec设计文档](doc/design.md) ### Benchmark * [Benchmark](doc/benchmark.md) ### FAQ -* [Common Problem FAQ](doc/faq.md) +* [常见问题FAQ](doc/faq.md) -

Community

+

社区


@@ -164,22 +174,22 @@ python -m paddlerec.run -m paddlerec.models.rank.dnn

-### Version history +### 版本历史 - 2020.06.17 - PaddleRec v0.1.0 - 2020.06.03 - PaddleRec v0.0.2 - 2020.05.14 - PaddleRec v0.0.1 -### License -[Apache 2.0 license](LICENSE) +### 许可证书 +本项目的发布受[Apache 2.0 license](LICENSE)许可认证。 -### Contact us +### 联系我们 -For any feedback, please propose a [GitHub Issue](https://github.com/PaddlePaddle/PaddleRec/issues) +如有意见、建议及使用中的BUG,欢迎在[GitHub Issue](https://github.com/PaddlePaddle/PaddleRec/issues)提交 -You can also communicate with us in the following ways: +亦可通过以下方式与我们沟通交流: -- QQ group id:`861717190` -- Wechat account:`paddlerec2020` +- QQ群号码:`861717190` +- 微信小助手微信号:`paddlerec2020`

     

-

PaddleRec QQ Group               PaddleRec Wechat account

+

PaddleRec交流QQ群               PaddleRec微信小助手

diff --git a/README_CN.md b/README_CN.md deleted file mode 100644 index 2b6d57f48163dad3dee12cab0aeb3a7b4d6a8920..0000000000000000000000000000000000000000 --- a/README_CN.md +++ /dev/null @@ -1,190 +0,0 @@ -(简体中文|[English](./README.md)) - -

- -

-

- -

-

- -

- - -

什么是推荐系统?

-

- -

- -- 推荐系统是在互联网信息爆炸式增长的时代背景下,帮助用户高效获得感兴趣信息的关键; - -- 推荐系统也是帮助产品最大限度吸引用户、留存用户、增加用户粘性、提高用户转化率的银弹。 - -- 有无数优秀的产品依靠用户可感知的推荐系统建立了良好的口碑,也有无数的公司依靠直击用户痛点的推荐系统在行业中占领了一席之地。 - - > 可以说,谁能掌握和利用好推荐系统,谁就能在信息分发的激烈竞争中抢得先机。 - > 但与此同时,有着许多问题困扰着推荐系统的开发者,比如:庞大的数据量,复杂的模型结构,低效的分布式训练环境,波动的在离线一致性,苛刻的上线部署要求,以上种种,不胜枚举。 - -

什么是PaddleRec?

- - -- 源于飞桨生态的搜索推荐模型 **一站式开箱即用工具** -- 适合初学者,开发者,研究者的推荐系统全流程解决方案 -- 包含内容理解、匹配、召回、排序、 多任务、重排序等多个任务的完整推荐搜索算法库 - - - | 方向 | 模型 | 单机CPU | 单机GPU | 分布式CPU | 分布式GPU | 论文 | - | :------: | :-----------------------------------------------------------------------: | :-----: | :-----: | :-------: | :-------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | - | 内容理解 | [Text-Classifcation](models/contentunderstanding/classification/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][Convolutional neural networks for sentence classication](https://www.aclweb.org/anthology/D14-1181.pdf) | - | 内容理解 | [TagSpace](models/contentunderstanding/tagspace/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) | - | 匹配 | [DSSM](models/match/dssm/model.py) | ✓ | ✓ | ✓ | x | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) | - | 匹配 | [MultiView-Simnet](models/match/multiview-simnet/model.py) | ✓ | ✓ | ✓ | x | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) | - | 召回 | [TDM](models/treebased/tdm/model.py) | ✓ | >=1.8.0 | ✓ | >=1.8.0 | [KDD 2018][Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf) | - | 召回 | [fasttext](models/recall/fasttext/model.py) | ✓ | ✓ | x | x | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) | - | 召回 | [Word2Vec](models/recall/word2vec/model.py) | ✓ | ✓ | ✓ | x | [NIPS 2013][Distributed Representations of Words and Phrases and their 
Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) | - | 召回 | [SSR](models/recall/ssr/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2016][Multi-Rate Deep Learning for Temporal Recommendation](http://sonyis.me/paperpdf/spr209-song_sigir16.pdf) | - | 召回 | [Gru4Rec](models/recall/gru4rec/model.py) | ✓ | ✓ | ✓ | ✓ | [2015][Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939) | - | 召回 | [Youtube_dnn](models/recall/youtube_dnn/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys 2016][Deep Neural Networks for YouTube Recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) | - | 召回 | [NCF](models/recall/ncf/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2017][Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf) | - | 召回 | [GNN](models/recall/gnn/model.py) | ✓ | ✓ | ✓ | ✓ | [AAAI 2019][Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855) | - | 排序 | [Logistic Regression](models/rank/logistic_regression/model.py) | ✓ | x | ✓ | x | / | - | 排序 | [Dnn](models/rank/dnn/model.py) | ✓ | ✓ | ✓ | ✓ | / | - | 排序 | [FM](models/rank/fm/model.py) | ✓ | x | ✓ | x | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) | - | 排序 | [FFM](models/rank/ffm/model.py) | ✓ | x | ✓ | x | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) | - | 排序 | [FNN](models/rank/fnn/model.py) | ✓ | x | ✓ | x | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) | - | 排序 | [Deep Crossing](models/rank/deep_crossing/model.py) | ✓ | x | ✓ | x | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial 
Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) | - | 排序 | [Pnn](models/rank/pnn/model.py) | ✓ | x | ✓ | x | [ICDM 2016][Product-based Neural Networks for User Response Prediction](https://arxiv.org/pdf/1611.00144.pdf) | - | 排序 | [DCN](models/rank/dcn/model.py) | ✓ | x | ✓ | x | [KDD 2017][Deep & Cross Network for Ad Click Predictions](https://dl.acm.org/doi/pdf/10.1145/3124749.3124754) | - | 排序 | [NFM](models/rank/nfm/model.py) | ✓ | x | ✓ | x | [SIGIR 2017][Neural Factorization Machines for Sparse Predictive Analytics](https://dl.acm.org/doi/pdf/10.1145/3077136.3080777) | - | 排序 | [AFM](models/rank/afm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks](https://arxiv.org/pdf/1708.04617.pdf) | - | 排序 | [DeepFM](models/rank/deepfm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/pdf/1703.04247.pdf) | - | 排序 | [xDeepFM](models/rank/xdeepfm/model.py) | ✓ | x | ✓ | x | [KDD 2018][xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/3219819.3220023) | - | 排序 | [DIN](models/rank/din/model.py) | ✓ | x | ✓ | x | [KDD 2018][Deep Interest Network for Click-Through Rate Prediction](https://dl.acm.org/doi/pdf/10.1145/3219819.3219823) | - | 排序 | [DIEN](models/rank/dien/model.py) | ✓ | x | ✓ | x | [AAAI 2019][Deep Interest Evolution Network for Click-Through Rate Prediction](https://www.aaai.org/ojs/index.php/AAAI/article/view/4545/4423) | - | 排序 | [BST](models/rank/BST/model.py) | ✓ | x | ✓ | x | [DLP_KDD 2019][Behavior Sequence Transformer for E-commerce Recommendation in Alibaba](https://arxiv.org/pdf/1905.06874v1.pdf) | - | 排序 | [AutoInt](models/rank/AutoInt/model.py) | ✓ | x | ✓ | x | [CIKM 2019][AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural 
Networks](https://arxiv.org/pdf/1810.11921.pdf) | - | 排序 | [Wide&Deep](models/rank/wide_deep/model.py) | ✓ | x | ✓ | x | [DLRS 2016][Wide & Deep Learning for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/2988450.2988454) | - | 排序 | [FGCNN](models/rank/fgcnn/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2019][Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1904.04447.pdf) | - | 排序 | [Fibinet](models/rank/fibinet/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction]( https://arxiv.org/pdf/1905.09433.pdf) | - | 排序 | [Flen](models/rank/flen/model.py) | ✓ | ✓ | ✓ | ✓ | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf) | - | 多任务 | [ESMM](models/multitask/esmm/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) | - | 多任务 | [MMOE](models/multitask/mmoe/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) | - | 多任务 | [ShareBottom](models/multitask/share-bottom/model.py) | ✓ | ✓ | ✓ | ✓ | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf) | - | 重排序 | [Listwise](models/rerank/listwise/model.py) | ✓ | ✓ | ✓ | x | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) | - - - - - -

快速安装

- -### 环境要求 -* Python 2.7/ 3.5 / 3.6 / 3.7 -* PaddlePaddle >= 1.7.2 -* 操作系统: Windows/Mac/Linux - - > Windows下PaddleRec目前仅支持单机训练,分布式训练建议使用Linux环境 - -### 安装命令 - -- 安装方法一 **PIP源直接安装** - ```bash - python -m pip install paddle-rec - ``` - > 该方法会默认下载安装`paddlepaddle v1.7.2 cpu版本`,若提示`PaddlePaddle`无法安装,则依照下述方法首先安装`PaddlePaddle`,再安装`PaddleRec`: - > - 可以在[该地址](https://pypi.org/project/paddlepaddle/1.7.2/#files),下载PaddlePaddle后手动安装whl包 - > - 可以先pip安装`PaddlePaddle`,`python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple` - > - 其他安装问题可以在[Paddle Issue](https://github.com/PaddlePaddle/Paddle/issues)或[PaddleRec Issue](https://github.com/PaddlePaddle/PaddleRec/issues)提出,会有工程师及时解答 - -- 安装方法二 **源码编译安装** - - - 安装飞桨 **注:需要用户安装版本 == 1.7.2 的飞桨** - - ```shell - python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple - ``` - - - 源码安装PaddleRec - - ``` - git clone https://github.com/PaddlePaddle/PaddleRec/ - cd PaddleRec - python setup.py install - ``` - -- PaddleRec-GPU安装方法 - - 在使用方法一或方法二完成PaddleRec安装后,需再手动安装`paddlepaddle-gpu`,并根据自身环境(Cuda/Cudnn)选择合适的版本,安装教程请查阅[飞桨-开始使用](https://www.paddlepaddle.org.cn/install/quick) - - -

一键启动

- -我们以排序模型中的`dnn`模型为例介绍PaddleRec的一键启动。训练数据来源为[Criteo数据集](https://www.kaggle.com/c/criteo-display-ad-challenge/),我们从中截取了100条数据: - -```bash -# 使用CPU进行单机训练 -python -m paddlerec.run -m paddlerec.models.rank.dnn -``` - - -

帮助文档

- -### 项目背景 -* [推荐系统介绍](doc/rec_background.md) -* [分布式深度学习介绍](doc/ps_background.md) - -### 快速开始 -* [十分钟上手PaddleRec](https://aistudio.baidu.com/aistudio/projectdetail/559336) - -### 入门教程 -* [数据准备](doc/slot_reader.md) -* [模型调参](doc/model.md) -* [启动单机训练](doc/train.md) -* [启动分布式训练](doc/distributed_train.md) -* [启动预测](doc/predict.md) -* [快速部署](doc/serving.md) - - -### 进阶教程 -* [自定义Reader](doc/custom_reader.md) -* [自定义模型](doc/model_develop.md) -* [自定义流程](doc/trainer_develop.md) -* [yaml配置说明](doc/yaml.md) -* [PaddleRec设计文档](doc/design.md) - -### Benchmark -* [Benchmark](doc/benchmark.md) - -### FAQ -* [常见问题FAQ](doc/faq.md) - - -

社区

- -

-
- Release - License - Slack -
-

- -### 版本历史 -- 2020.06.17 - PaddleRec v0.1.0 -- 2020.06.03 - PaddleRec v0.0.2 -- 2020.05.14 - PaddleRec v0.0.1 - -### 许可证书 -本项目的发布受[Apache 2.0 license](LICENSE)许可认证。 - -### 联系我们 - -如有意见、建议及使用中的BUG,欢迎在[GitHub Issue](https://github.com/PaddlePaddle/PaddleRec/issues)提交 - -亦可通过以下方式与我们沟通交流: - -- QQ群号码:`861717190` -- 微信小助手微信号:`paddlerec2020` - -

     

-

PaddleRec交流QQ群               PaddleRec微信小助手

diff --git a/README_EN.md b/README_EN.md new file mode 100644 index 0000000000000000000000000000000000000000..b409c1ad96406c30c8423eb8c693f74a2182088f --- /dev/null +++ b/README_EN.md @@ -0,0 +1,189 @@ +([简体中文](./README.md)|English) +

+ +

+

+ +

+ + +

What is a recommendation system?

+

+ +

+
+- A recommendation system helps users quickly find useful and interesting information in massive data.
+
+- A recommendation system is also a silver bullet for attracting users, retaining users, and increasing user stickiness and conversion.
+
+  > Whoever makes better use of the recommendation system gains the edge in fierce competition.
+  >
+  > At the same time, building a recommendation system raises many practical problems: huge data volumes, complex models, inefficient distributed training, and so on.
+

What is PaddleRec?

+
+
+- A quick-start tool for search & recommendation algorithms, based on [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html)
+- A complete recommendation system solution for beginners, developers and researchers.
+- A recommendation algorithm library covering content understanding, match, recall, rank, multi-task, re-rank, etc.
+
+
+  | Type | Algorithm | CPU | GPU | Parameter-Server | Multi-GPU | Paper |
+  | :-------------------: | :-----------------------------------------------------------------------: | :---: | :-----: | :--------------: | :-------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+  | Content-Understanding | [Text-Classification](models/contentunderstanding/classification/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][Convolutional neural networks for sentence classification](https://www.aclweb.org/anthology/D14-1181.pdf) |
+  | Content-Understanding | [TagSpace](models/contentunderstanding/tagspace/model.py) | ✓ | ✓ | ✓ | x | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) |
+  | Match | [DSSM](models/match/dssm/model.py) | ✓ | ✓ | ✓ | x | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) |
+  | Match | [MultiView-Simnet](models/match/multiview-simnet/model.py) | ✓ | ✓ | ✓ | x | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) |
+  | Recall | [TDM](models/treebased/tdm/model.py) | ✓ | >=1.8.0 | ✓ | >=1.8.0 | [KDD 2018][Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf) |
+  | 
Recall | [fasttext](models/recall/fasttext/model.py) | ✓ | ✓ | x | x | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) | + | Recall | [Word2Vec](models/recall/word2vec/model.py) | ✓ | ✓ | ✓ | x | [NIPS 2013][Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) | + | Recall | [SSR](models/recall/ssr/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2016][Multi-Rate Deep Learning for Temporal Recommendation](http://sonyis.me/paperpdf/spr209-song_sigir16.pdf) | + | Recall | [Gru4Rec](models/recall/gru4rec/model.py) | ✓ | ✓ | ✓ | ✓ | [2015][Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939) | + | Recall | [Youtube_dnn](models/recall/youtube_dnn/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys 2016][Deep Neural Networks for YouTube Recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) | + | Recall | [NCF](models/recall/ncf/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2017][Neural Collaborative Filtering](https://arxiv.org/pdf/1708.05031.pdf) | + | Recall | [GNN](models/recall/gnn/model.py) | ✓ | ✓ | ✓ | ✓ | [AAAI 2019][Session-based Recommendation with Graph Neural Networks](https://arxiv.org/abs/1811.00855) | + | Recall | [RALM](models/recall/look-alike_recall/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2019][Real-time Attention Based Look-alike Model for Recommender System](https://arxiv.org/pdf/1906.05022.pdf) | + | Rank | [Logistic Regression](models/rank/logistic_regression/model.py) | ✓ | x | ✓ | x | / | + | Rank | [Dnn](models/rank/dnn/model.py) | ✓ | ✓ | ✓ | ✓ | / | + | Rank | [FM](models/rank/fm/model.py) | ✓ | x | ✓ | x | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) | + | Rank | 
[FFM](models/rank/ffm/model.py) | ✓ | x | ✓ | x | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) | + | Rank | [FNN](models/rank/fnn/model.py) | ✓ | x | ✓ | x | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) | + | Rank | [Deep Crossing](models/rank/deep_crossing/model.py) | ✓ | x | ✓ | x | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) | + | Rank | [Pnn](models/rank/pnn/model.py) | ✓ | x | ✓ | x | [ICDM 2016][Product-based Neural Networks for User Response Prediction](https://arxiv.org/pdf/1611.00144.pdf) | + | Rank | [DCN](models/rank/dcn/model.py) | ✓ | x | ✓ | x | [KDD 2017][Deep & Cross Network for Ad Click Predictions](https://dl.acm.org/doi/pdf/10.1145/3124749.3124754) | + | Rank | [NFM](models/rank/nfm/model.py) | ✓ | x | ✓ | x | [SIGIR 2017][Neural Factorization Machines for Sparse Predictive Analytics](https://dl.acm.org/doi/pdf/10.1145/3077136.3080777) | + | Rank | [AFM](models/rank/afm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks](https://arxiv.org/pdf/1708.04617.pdf) | + | Rank | [DeepFM](models/rank/deepfm/model.py) | ✓ | x | ✓ | x | [IJCAI 2017][DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/pdf/1703.04247.pdf) | + | Rank | [xDeepFM](models/rank/xdeepfm/model.py) | ✓ | x | ✓ | x | [KDD 2018][xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/3219819.3220023) | + | Rank | [DIN](models/rank/din/model.py) | ✓ | x | ✓ | x | [KDD 2018][Deep Interest Network for Click-Through Rate Prediction](https://dl.acm.org/doi/pdf/10.1145/3219819.3219823) | + | Rank | [DIEN](models/rank/dien/model.py) | ✓ | x | ✓ | x | [AAAI 
2019][Deep Interest Evolution Network for Click-Through Rate Prediction](https://www.aaai.org/ojs/index.php/AAAI/article/view/4545/4423) | + | Rank | [BST](models/rank/BST/model.py) | ✓ | x | ✓ | x | [DLP-KDD 2019][Behavior Sequence Transformer for E-commerce Recommendation in Alibaba](https://arxiv.org/pdf/1905.06874v1.pdf) | + | Rank | [AutoInt](models/rank/AutoInt/model.py) | ✓ | x | ✓ | x | [CIKM 2019][AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/pdf/1810.11921.pdf) | + | Rank | [Wide&Deep](models/rank/wide_deep/model.py) | ✓ | x | ✓ | x | [DLRS 2016][Wide & Deep Learning for Recommender Systems](https://dl.acm.org/doi/pdf/10.1145/2988450.2988454) | + | Rank | [FGCNN](models/rank/fgcnn/model.py) | ✓ | ✓ | ✓ | ✓ | [WWW 2019][Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1904.04447.pdf) | + | Rank | [Fibinet](models/rank/fibinet/model.py) | ✓ | ✓ | ✓ | ✓ | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction]( https://arxiv.org/pdf/1905.09433.pdf) | + | Rank | [Flen](models/rank/flen/model.py) | ✓ | ✓ | ✓ | ✓ | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf) | + | Multi-Task | [ESMM](models/multitask/esmm/model.py) | ✓ | ✓ | ✓ | ✓ | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) | + | Multi-Task | [MMOE](models/multitask/mmoe/model.py) | ✓ | ✓ | ✓ | ✓ | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) | + | Multi-Task | [ShareBottom](models/multitask/share-bottom/model.py) | ✓ | ✓ | ✓ | ✓ | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf) | + | Re-Rank | [Listwise](models/rerank/listwise/model.py) 
| ✓ | ✓ | ✓ | x | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) | + + + + + +

Getting Started

+
+### Environmental requirements
+* Python 2.7 / 3.5 / 3.6 / 3.7
+* PaddlePaddle >= 1.7.2
+* Operating system: Windows/Mac/Linux
+
+  > Linux is recommended for distributed training
+
+### Installation
+
+1. **Install by pip**
+    ```bash
+    python -m pip install paddle-rec
+    ```
+    > This method will download and install `paddlepaddle-v1.7.2-cpu`. If `PaddlePaddle` cannot be installed automatically, install `PaddlePaddle` manually first and then install `PaddleRec` again:
+    > - Download [PaddlePaddle](https://pypi.org/project/paddlepaddle/1.7.2/#files) and install the whl package by pip.
+    > - Install `PaddlePaddle` by pip: `python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple`
+    > - Other installation problems can be raised in [Paddle Issue](https://github.com/PaddlePaddle/Paddle/issues) or [PaddleRec Issue](https://github.com/PaddlePaddle/PaddleRec/issues)
+
+2. **Install by source code**
+
+    - Install PaddlePaddle
+
+      ```shell
+      python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple
+      ```
+
+    - Install PaddleRec by source code
+
+      ```shell
+      git clone https://github.com/PaddlePaddle/PaddleRec/
+      cd PaddleRec
+      python setup.py install
+      ```
+
+- Install PaddleRec-GPU
+
+  After installing `PaddleRec`, please install the appropriate version of `paddlepaddle-gpu` for your environment (CUDA/cuDNN); refer to the installation tutorial [Installation Manuals](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)
+
+
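Both install paths above pin PaddlePaddle 1.7.2, so it is worth confirming what actually landed in the environment before training. A small helper along these lines can check it (an illustrative sketch built only on `pip show`; the function is not part of PaddleRec):

```python
import subprocess
import sys

def installed_version(package):
    """Return the installed version of `package` as reported by pip, or None."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "show", package],
        capture_output=True, text=True)  # requires Python >= 3.7
    for line in result.stdout.splitlines():
        if line.startswith("Version:"):
            return line.split(":", 1)[1].strip()
    return None

# After the steps above, installed_version("paddlepaddle") is expected
# to report 1.7.2; a mismatch usually means a stale wheel is shadowing it.
```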

Quick Start

+
+We take the `dnn` algorithm as an example to get started with `PaddleRec`, using 100 training samples taken from the [Criteo Dataset](https://www.kaggle.com/c/criteo-display-ad-challenge/):
+
+```bash
+# Training with cpu
+git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
+cd paddle-rec
+
+python -m paddlerec.run -m models/rank/dnn/config.yaml
+```
+
+
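For orientation, each line of the Criteo data is tab-separated: a click label, 13 integer count features, and 26 hashed categorical features, with empty fields marking missing values. A standalone parsing sketch under that assumed layout — PaddleRec's actual reader is driven by `config.yaml`, this is only to show the record shape:

```python
def parse_criteo_line(line):
    """Split one Criteo-format sample into label, dense and sparse features."""
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    # 13 integer features; empty fields are treated as 0 here.
    dense = [int(x) if x else 0 for x in fields[1:14]]
    # 26 hashed categorical features; keep the raw hash strings.
    sparse = [x if x else "0" for x in fields[14:40]]
    return label, dense, sparse

# A synthetic sample in the same shape as one Criteo line.
sample = "1\t" + "\t".join(str(i) for i in range(13)) \
         + "\t" + "\t".join(["68fd1e64"] * 26)
label, dense, sparse = parse_criteo_line(sample)
```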

Documentation

+
+### Background
+* [Recommendation System](doc/rec_background.md)
+* [Distributed Deep Learning](doc/ps_background.md)
+
+### Introductory Project
+* [Get started with PaddleRec in ten minutes](https://aistudio.baidu.com/aistudio/projectdetail/559336)
+
+### Introductory tutorial
+* [Data](doc/slot_reader.md)
+* [Model](doc/model.md)
+* [Local Train](doc/train.md)
+* [Distributed Train](doc/distributed_train.md)
+* [Predict](doc/predict.md)
+* [Serving](doc/serving.md)
+
+
+### Advanced tutorial
+* [Custom Reader](doc/custom_reader.md)
+* [Custom Model](doc/model_develop.md)
+* [Custom Training Process](doc/trainer_develop.md)
+* [YAML configuration description](doc/yaml.md)
+* [PaddleRec design document](doc/design.md)
+
+### Benchmark
+* [Benchmark](doc/benchmark.md)
+
+### FAQ
+* [FAQ](doc/faq.md)
+
+

Community

+ +

+
+ Release + License + Slack +
+

+
+### Version history
+- 2020.06.17 - PaddleRec v0.1.0
+- 2020.06.03 - PaddleRec v0.0.2
+- 2020.05.14 - PaddleRec v0.0.1
+
+### License
+[Apache 2.0 license](LICENSE)
+
+### Contact us
+
+For any feedback or bug report, please open a [GitHub Issue](https://github.com/PaddlePaddle/PaddleRec/issues)
+
+You can also communicate with us in the following ways:
+
+- QQ group ID: `861717190`
+- WeChat account: `paddlerec2020`
+

     

+

PaddleRec QQ Group               PaddleRec Wechat account

diff --git a/core/engine/cluster/cloud/before_hook_cpu.sh.template b/core/engine/cluster/cloud/before_hook_cpu.sh.template index 07e5d7337d9171518187ff96c9de9bcb5e734df4..d0bd67b2fbe60221ad51e99073d097675286eac7 100644 --- a/core/engine/cluster/cloud/before_hook_cpu.sh.template +++ b/core/engine/cluster/cloud/before_hook_cpu.sh.template @@ -1,6 +1,6 @@ echo "Run before_hook.sh ..." -wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz +wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz --no-check-certificate tar -xf PaddleRec.tar.gz @@ -10,6 +10,6 @@ python setup.py install pip uninstall -y paddlepaddle -pip install paddlepaddle-gpu==<$ PADDLEPADDLE_VERSION $> --index-url=http://pip.baidu.com/pypi/simple --trusted-host pip.baidu.com +pip install paddlepaddle==<$ PADDLEPADDLE_VERSION $> --index-url=http://pip.baidu.com/pypi/simple --trusted-host pip.baidu.com echo "End before_hook.sh ..." diff --git a/core/engine/cluster/cloud/before_hook_gpu.sh.template b/core/engine/cluster/cloud/before_hook_gpu.sh.template index e1bbde468b900262f28f53e8895f5da219aa140d..1a9d5e189870e84670e60571dfbeadd48e1245b0 100644 --- a/core/engine/cluster/cloud/before_hook_gpu.sh.template +++ b/core/engine/cluster/cloud/before_hook_gpu.sh.template @@ -1,6 +1,6 @@ echo "Run before_hook.sh ..." 
-wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz +wget https://paddlerec.bj.bcebos.com/whl/PaddleRec.tar.gz --no-check-certificate tar -xf PaddleRec.tar.gz diff --git a/core/engine/cluster/cloud/cluster.sh b/core/engine/cluster/cloud/cluster.sh index 35ba5657f36cff46b41c06639e43676af44f264a..399a21e78aa2eba2489c8aa0b4f2214328bd0a50 100644 --- a/core/engine/cluster/cloud/cluster.sh +++ b/core/engine/cluster/cloud/cluster.sh @@ -39,7 +39,12 @@ function _before_submit() { elif [ ${DISTRIBUTE_MODE} == "COLLECTIVE_GPU_K8S" ]; then _gen_gpu_before_hook _gen_k8s_config - _gen_k8s_job + _gen_k8s_gpu_job + _gen_end_hook + elif [ ${DISTRIBUTE_MODE} == "PS_CPU_K8S" ]; then + _gen_cpu_before_hook + _gen_k8s_config + _gen_k8s_cpu_job _gen_end_hook fi @@ -54,6 +59,7 @@ function _gen_mpi_config() { -e "s#<$ OUTPUT_PATH $>#$OUTPUT_PATH#g" \ -e "s#<$ THIRDPARTY_PATH $>#$THIRDPARTY_PATH#g" \ -e "s#<$ CPU_NUM $>#$max_thread_num#g" \ + -e "s#<$ USE_PYTHON3 $>#$USE_PYTHON3#g" \ -e "s#<$ FLAGS_communicator_is_sgd_optimizer $>#$FLAGS_communicator_is_sgd_optimizer#g" \ -e "s#<$ FLAGS_communicator_send_queue_size $>#$FLAGS_communicator_send_queue_size#g" \ -e "s#<$ FLAGS_communicator_thread_pool_size $>#$FLAGS_communicator_thread_pool_size#g" \ @@ -71,6 +77,7 @@ function _gen_k8s_config() { -e "s#<$ AFS_REMOTE_MOUNT_POINT $>#$AFS_REMOTE_MOUNT_POINT#g" \ -e "s#<$ OUTPUT_PATH $>#$OUTPUT_PATH#g" \ -e "s#<$ CPU_NUM $>#$max_thread_num#g" \ + -e "s#<$ USE_PYTHON3 $>#$USE_PYTHON3#g" \ -e "s#<$ FLAGS_communicator_is_sgd_optimizer $>#$FLAGS_communicator_is_sgd_optimizer#g" \ -e "s#<$ FLAGS_communicator_send_queue_size $>#$FLAGS_communicator_send_queue_size#g" \ -e "s#<$ FLAGS_communicator_thread_pool_size $>#$FLAGS_communicator_thread_pool_size#g" \ @@ -101,6 +108,7 @@ function _gen_end_hook() { function _gen_mpi_job() { echo "gen mpi_job.sh" sed -e "s#<$ GROUP_NAME $>#$GROUP_NAME#g" \ + -e "s#<$ JOB_NAME $>#$OLD_JOB_NAME#g" \ -e "s#<$ AK $>#$AK#g" \ -e "s#<$ SK $>#$SK#g" \ -e "s#<$ 
MPI_PRIORITY $>#$PRIORITY#g" \ @@ -109,18 +117,34 @@ function _gen_mpi_job() { ${abs_dir}/cloud/mpi_job.sh.template >${PWD}/job.sh } -function _gen_k8s_job() { +function _gen_k8s_gpu_job() { echo "gen k8s_job.sh" sed -e "s#<$ GROUP_NAME $>#$GROUP_NAME#g" \ + -e "s#<$ JOB_NAME $>#$OLD_JOB_NAME#g" \ -e "s#<$ AK $>#$AK#g" \ -e "s#<$ SK $>#$SK#g" \ -e "s#<$ K8S_PRIORITY $>#$PRIORITY#g" \ -e "s#<$ K8S_TRAINERS $>#$K8S_TRAINERS#g" \ + -e "s#<$ K8S_CPU_CORES $>#$K8S_CPU_CORES#g" \ -e "s#<$ K8S_GPU_CARD $>#$K8S_GPU_CARD#g" \ -e "s#<$ START_CMD $>#$START_CMD#g" \ ${abs_dir}/cloud/k8s_job.sh.template >${PWD}/job.sh } +function _gen_k8s_cpu_job() { + echo "gen k8s_job.sh" + sed -e "s#<$ GROUP_NAME $>#$GROUP_NAME#g" \ + -e "s#<$ JOB_NAME $>#$OLD_JOB_NAME#g" \ + -e "s#<$ AK $>#$AK#g" \ + -e "s#<$ SK $>#$SK#g" \ + -e "s#<$ K8S_PRIORITY $>#$PRIORITY#g" \ + -e "s#<$ K8S_TRAINERS $>#$K8S_TRAINERS#g" \ + -e "s#<$ K8S_PS_NUM $>#$K8S_PS_NUM#g" \ + -e "s#<$ K8S_PS_CORES $>#$K8S_PS_CORES#g" \ + -e "s#<$ K8S_CPU_CORES $>#$K8S_CPU_CORES#g" \ + -e "s#<$ START_CMD $>#$START_CMD#g" \ + ${abs_dir}/cloud/k8s_cpu_job.sh.template >${PWD}/job.sh +} #----------------------------------------------------------------------------------------------------------------- @@ -145,6 +169,7 @@ function _submit() { function package_hook() { cur_time=`date +"%Y%m%d%H%M"` new_job_name="${JOB_NAME}_${cur_time}" + export OLD_JOB_NAME=${JOB_NAME} export JOB_NAME=${new_job_name} export job_file_path="${PWD}/${new_job_name}" mkdir ${job_file_path} diff --git a/core/engine/cluster/cloud/k8s_config.ini.template b/core/engine/cluster/cloud/k8s_config.ini.template index 904bfbc5e1453f90ec1163d1681d554b52dae45f..471bd1a0dd2931591b0d6eda7f87cc25458b3f80 100644 --- a/core/engine/cluster/cloud/k8s_config.ini.template +++ b/core/engine/cluster/cloud/k8s_config.ini.template @@ -19,6 +19,8 @@ afs_local_mount_point="/root/paddlejob/workspace/env_run/afs/" # 新k8s afs挂载帮助文档: 
http://wiki.baidu.com/pages/viewpage.action?pageId=906443193 PADDLE_PADDLEREC_ROLE=WORKER +PADDLEREC_CLUSTER_TYPE=K8S +use_python3=<$ USE_PYTHON3 $> CPU_NUM=<$ CPU_NUM $> GLOG_v=0 diff --git a/core/engine/cluster/cloud/k8s_cpu_job.sh.template b/core/engine/cluster/cloud/k8s_cpu_job.sh.template new file mode 100644 index 0000000000000000000000000000000000000000..2889cd1d55008f22b7e9fb854019f996a4746f8c --- /dev/null +++ b/core/engine/cluster/cloud/k8s_cpu_job.sh.template @@ -0,0 +1,40 @@ +#!/bin/bash +############################################################### +## 注意-- 注意--注意 ## +## K8S PS-CPU多机作业作业示例 ## +############################################################### +job_name=<$ JOB_NAME $> + +# 作业参数 +group_name="<$ GROUP_NAME $>" +job_version="paddle-fluid-v1.7.1" +start_cmd="<$ START_CMD $>" +wall_time="2000:00:00" + +k8s_priority=<$ K8S_PRIORITY $> +k8s_trainers=<$ K8S_TRAINERS $> +k8s_cpu_cores=<$ K8S_CPU_CORES $> +k8s_ps_num=<$ K8S_PS_NUM $> +k8s_ps_cores=<$ K8S_PS_CORES $> + +# 你的ak/sk(可在paddlecloud web页面【个人中心】处获取) +ak=<$ AK $> +sk=<$ SK $> + +paddlecloud job --ak ${ak} --sk ${sk} \ + train --job-name ${job_name} \ + --group-name ${group_name} \ + --job-conf config.ini \ + --start-cmd "${start_cmd}" \ + --files ./* \ + --job-version ${job_version} \ + --k8s-priority ${k8s_priority} \ + --wall-time ${wall_time} \ + --k8s-trainers ${k8s_trainers} \ + --k8s-cpu-cores ${k8s_cpu_cores} \ + --k8s-ps-num ${k8s_ps_num} \ + --k8s-ps-cores ${k8s_ps_cores} \ + --is-standalone 0 \ + --distribute-job-type "PSERVER" \ + --json + \ No newline at end of file diff --git a/core/engine/cluster/cloud/k8s_job.sh.template b/core/engine/cluster/cloud/k8s_job.sh.template index 5c2ebdcd62ef4ca46dafc57db95ede9fcfd13ab3..8314e9efd0ec349bb00e28605386e34dfc601102 100644 --- a/core/engine/cluster/cloud/k8s_job.sh.template +++ b/core/engine/cluster/cloud/k8s_job.sh.template @@ -3,18 +3,30 @@ ## 注意-- 注意--注意 ## ## K8S NCCL2多机作业作业示例 ## 
###############################################################
-job_name=${JOB_NAME}
+job_name=<$ JOB_NAME $>
 
 # 作业参数
 group_name="<$ GROUP_NAME $>"
 job_version="paddle-fluid-v1.7.1"
 start_cmd="<$ START_CMD $>"
-wall_time="10:00:00"
+wall_time="2000:00:00"
 
 k8s_priority=<$ K8S_PRIORITY $>
 k8s_trainers=<$ K8S_TRAINERS $>
+k8s_cpu_cores=<$ K8S_CPU_CORES $>
 k8s_gpu_cards=<$ K8S_GPU_CARD $>
 
+is_stand_alone=0
+nccl="--distribute-job-type "NCCL2""
+if [ ${k8s_trainers} == 1 ];then
+    is_stand_alone=1
+    nccl="--job-remark single-trainer"
+    if [ ${k8s_gpu_cards} == 1 ];then
+        nccl="--job-remark single-gpu"
+        echo "Attention: Use single GPU card for PaddleRec distributed training, please set runner class from 'cluster_train' to 'train' in config.yaml."
+    fi
+fi
+
 # 你的ak/sk(可在paddlecloud web页面【个人中心】处获取)
 ak=<$ AK $>
 sk=<$ SK $>
@@ -27,9 +39,11 @@ paddlecloud job --ak ${ak} --sk ${sk} \
     --files ./* \
     --job-version ${job_version} \
     --k8s-trainers ${k8s_trainers} \
+    --k8s-cpu-cores ${k8s_cpu_cores} \
     --k8s-gpu-cards ${k8s_gpu_cards} \
     --k8s-priority ${k8s_priority} \
     --wall-time ${wall_time} \
-    --is-standalone 0 \
-    --distribute-job-type "NCCL2" \
-    --json
\ No newline at end of file
+    --is-standalone ${is_stand_alone} \
+    --json \
+    ${nccl}
+
\ No newline at end of file
diff --git a/core/engine/cluster/cloud/mpi_config.ini.template b/core/engine/cluster/cloud/mpi_config.ini.template
index 8312d46a01449b3d6eac322b098d5b029bb67f86..a3ac22f0c7fc09e9b6eda44306972dd296d19ab7 100644
--- a/core/engine/cluster/cloud/mpi_config.ini.template
+++ b/core/engine/cluster/cloud/mpi_config.ini.template
@@ -17,6 +17,8 @@ output_path=<$ OUTPUT_PATH $>
 thirdparty_path=<$ THIRDPARTY_PATH $>
 
 PADDLE_PADDLEREC_ROLE=WORKER
+PADDLEREC_CLUSTER_TYPE=MPI
+use_python3=<$ USE_PYTHON3 $>
 CPU_NUM=<$ CPU_NUM $>
 GLOG_v=0
 
diff --git a/core/engine/cluster/cloud/mpi_job.sh.template b/core/engine/cluster/cloud/mpi_job.sh.template
index 84fafaffaa9f6ccc06578d673144c0d63069e13b..b3a3c20a02094cca68c96f527bf29d3150996228 
100644
--- a/core/engine/cluster/cloud/mpi_job.sh.template
+++ b/core/engine/cluster/cloud/mpi_job.sh.template
@@ -3,13 +3,13 @@
 ## 注意--注意--注意 ##
 ## MPI 类型作业演示 ##
 ###############################################################
-job_name=${JOB_NAME}
+job_name=<$ JOB_NAME $>
 
 # 作业参数
 group_name=<$ GROUP_NAME $>
 job_version="paddle-fluid-v1.7.1"
 start_cmd="<$ START_CMD $>"
-wall_time="2:00:00"
+wall_time="2000:00:00"
 
 # 你的ak/sk(可在paddlecloud web页面【个人中心】处获取)
 ak=<$ AK $>
diff --git a/core/engine/cluster/cluster.py b/core/engine/cluster/cluster.py
index 4fe7529f9664a4e9a78c63dbe6c5c18dfe59f141..a64e99e38b2df3033e480706bedd02eadea1dc90 100644
--- a/core/engine/cluster/cluster.py
+++ b/core/engine/cluster/cluster.py
@@ -67,10 +67,10 @@ class ClusterEngine(Engine):
 
     @staticmethod
     def workspace_replace():
-        workspace = envs.get_runtime_environ("workspace")
+        remote_workspace = envs.get_runtime_environ("remote_workspace")
 
         for k, v in os.environ.items():
-            v = v.replace("{workspace}", workspace)
+            v = v.replace("{workspace}", remote_workspace)
             os.environ[k] = str(v)
 
     def run(self):
@@ -98,14 +98,12 @@ class ClusterEngine(Engine):
                     cluster_env_check_tool = PaddleCloudMpiEnv()
                 else:
                     raise ValueError(
-                        "Paddlecloud with Mpi don't support GPU training, check your config"
+                        "Paddlecloud with MPI doesn't support GPU training, check your config.yaml & backend.yaml"
                     )
             elif cluster_type.upper() == "K8S":
                 if fleet_mode == "PS":
                     if device == "CPU":
-                        raise ValueError(
-                            "PS-CPU on paddlecloud is not supported at this time, comming soon"
-                        )
+                        cluster_env_check_tool = CloudPsCpuEnv()
                     elif device == "GPU":
                         raise ValueError(
                             "PS-GPU on paddlecloud is not supported at this time, coming soon"
@@ -115,7 +113,7 @@ class ClusterEngine(Engine):
                     cluster_env_check_tool = CloudCollectiveEnv()
                 elif device == "CPU":
                     raise ValueError(
-                        "Unexpected config -> device: CPU with fleet_mode: Collective, check your config"
+                        "Unexpected config -> device: CPU with fleet_mode: Collective, check your config.yaml"
                    )
 
else: raise ValueError("cluster_type {} error, must in MPI/K8S".format( @@ -161,23 +159,30 @@ class ClusterEnvBase(object): self.cluster_env["PADDLE_VERSION"] = self.backend_env.get( "config.paddle_version", "1.7.2") + # python_version + self.cluster_env["USE_PYTHON3"] = self.backend_env.get( + "config.use_python3", "0") + # communicator + max_thread_num = int(envs.get_runtime_environ("max_thread_num")) self.cluster_env[ "FLAGS_communicator_is_sgd_optimizer"] = self.backend_env.get( "config.communicator.FLAGS_communicator_is_sgd_optimizer", 0) self.cluster_env[ "FLAGS_communicator_send_queue_size"] = self.backend_env.get( - "config.communicator.FLAGS_communicator_send_queue_size", 5) + "config.communicator.FLAGS_communicator_send_queue_size", + max_thread_num) self.cluster_env[ "FLAGS_communicator_thread_pool_size"] = self.backend_env.get( "config.communicator.FLAGS_communicator_thread_pool_size", 32) self.cluster_env[ "FLAGS_communicator_max_merge_var_num"] = self.backend_env.get( - "config.communicator.FLAGS_communicator_max_merge_var_num", 5) + "config.communicator.FLAGS_communicator_max_merge_var_num", + max_thread_num) self.cluster_env[ "FLAGS_communicator_max_send_grad_num_before_recv"] = self.backend_env.get( "config.communicator.FLAGS_communicator_max_send_grad_num_before_recv", - 5) + max_thread_num) self.cluster_env["FLAGS_communicator_fake_rpc"] = self.backend_env.get( "config.communicator.FLAGS_communicator_fake_rpc", 0) self.cluster_env["FLAGS_rpc_retry_times"] = self.backend_env.get( @@ -234,7 +239,7 @@ class PaddleCloudMpiEnv(ClusterEnvBase): "config.train_data_path", "") if self.cluster_env["TRAIN_DATA_PATH"] == "": raise ValueError( - "No -- TRAIN_DATA_PATH -- found in your backend.yaml, please check." + "No -- TRAIN_DATA_PATH -- found in your backend.yaml, please add train_data_path in your backend yaml." 
) # test_data_path self.cluster_env["TEST_DATA_PATH"] = self.backend_env.get( @@ -274,7 +279,7 @@ class PaddleCloudK8sEnv(ClusterEnvBase): category=UserWarning, stacklevel=2) warnings.warn( - "The remote mount point will be mounted to the ./afs/", + "The remote afs path will be mounted to the ./afs/", category=UserWarning, stacklevel=2) @@ -293,3 +298,21 @@ class CloudCollectiveEnv(PaddleCloudK8sEnv): "submit.k8s_gpu_card", 1) self.cluster_env["K8S_CPU_CORES"] = self.backend_env.get( "submit.k8s_cpu_cores", 1) + + +class CloudPsCpuEnv(PaddleCloudK8sEnv): + def __init__(self): + super(CloudPsCpuEnv, self).__init__() + + def env_check(self): + super(CloudPsCpuEnv, self).env_check() + + self.cluster_env["DISTRIBUTE_MODE"] = "PS_CPU_K8S" + self.cluster_env["K8S_TRAINERS"] = self.backend_env.get( + "submit.k8s_trainers", 1) + self.cluster_env["K8S_CPU_CORES"] = self.backend_env.get( + "submit.k8s_cpu_cores", 2) + self.cluster_env["K8S_PS_NUM"] = self.backend_env.get( + "submit.k8s_ps_num", 1) + self.cluster_env["K8S_PS_CORES"] = self.backend_env.get( + "submit.k8s_ps_cores", 2) diff --git a/core/factory.py b/core/factory.py index 9430c88283800e69db7043aa141b6f735212c79f..95e0e7778141ad76d1166205213bccdaae67aff7 100755 --- a/core/factory.py +++ b/core/factory.py @@ -22,6 +22,19 @@ trainers = {} def trainer_registry(): + trainers["SingleTrainer"] = os.path.join(trainer_abs, "single_trainer.py") + trainers["ClusterTrainer"] = os.path.join(trainer_abs, + "cluster_trainer.py") + trainers["CtrCodingTrainer"] = os.path.join(trainer_abs, + "ctr_coding_trainer.py") + trainers["CtrModulTrainer"] = os.path.join(trainer_abs, + "ctr_modul_trainer.py") + trainers["TDMSingleTrainer"] = os.path.join(trainer_abs, + "tdm_single_trainer.py") + trainers["TDMClusterTrainer"] = os.path.join(trainer_abs, + "tdm_cluster_trainer.py") + trainers["OnlineLearningTrainer"] = os.path.join( + trainer_abs, "online_learning_trainer.py") # Definition of procedure execution process 
trainers["CtrCodingTrainer"] = os.path.join(trainer_abs, "ctr_coding_trainer.py") diff --git a/core/metric.py b/core/metric.py index d9968fa40167b6ca728b0c1046fca5e70ef427a7..12a9ddf79d5a0821f0e6c6d9195bf51a63ebd6fb 100755 --- a/core/metric.py +++ b/core/metric.py @@ -23,34 +23,58 @@ class Metric(object): __metaclass__ = abc.ABCMeta def __init__(self, config): - """ """ + """R + """ pass - def clear(self, scope=None, **kwargs): - """ - clear current value - Args: - scope: value container - params: extend varilable for clear + def clear(self, scope=None): + """R """ if scope is None: scope = fluid.global_scope() place = fluid.CPUPlace() - for (varname, dtype) in self._need_clear_list: - if scope.find_var(varname) is None: + for key in self._global_metric_state_vars: + varname, dtype = self._global_metric_state_vars[key] + var = scope.find_var(varname) + if not var: continue - var = scope.var(varname).get_tensor() + var = var.get_tensor() data_array = np.zeros(var._get_dims()).astype(dtype) var.set(data_array, place) - def calculate(self, scope, params): + def _get_global_metric_state(self, fleet, scope, metric_name, mode="sum"): + """R """ - calculate result - Args: - scope: value container - params: extend varilable for clear + var = scope.find_var(metric_name) + if not var: + return None + input = np.array(var.get_tensor()) + if fleet is None: + return input + fleet._role_maker._barrier_worker() + old_shape = np.array(input.shape) + input = input.reshape(-1) + output = np.copy(input) * 0 + fleet._role_maker._all_reduce(input, output, mode=mode) + output = output.reshape(old_shape) + return output + + def calc_global_metrics(self, fleet, scope=None): + """R """ + if scope is None: + scope = fluid.global_scope() + + global_metrics = dict() + for key in self._global_metric_state_vars: + varname, dtype = self._global_metric_state_vars[key] + global_metrics[key] = self._get_global_metric_state(fleet, scope, + varname) + + return self._calculate(global_metrics) + + def 
_calculate(self, global_metrics): pass @abc.abstractmethod diff --git a/core/metrics/__init__.py b/core/metrics/__init__.py index 23fd64e7ac281f4521ce9b6ea3cb7d6d465e5a17..2820518c02ebffd1c0c4e847bb30b14cf0a689f9 100755 --- a/core/metrics/__init__.py +++ b/core/metrics/__init__.py @@ -12,6 +12,9 @@ # See the License for the specific language governing permissions and # limitations under the License. -from precision import Precision +from .recall_k import RecallK +from .pairwise_pn import PosNegRatio +from .precision_recall import PrecisionRecall +from .auc import AUC -__all__ = ['Precision'] +__all__ = ['RecallK', 'PosNegRatio', 'AUC', 'PrecisionRecall'] diff --git a/core/metrics/auc_metrics.py b/core/metrics/auc.py similarity index 50% rename from core/metrics/auc_metrics.py rename to core/metrics/auc.py index 431411f343d2b7d15d7f6620ebbcd0ecec6a32d4..672a1ffa84291782963d32bd58875170253e41d1 100755 --- a/core/metrics/auc_metrics.py +++ b/core/metrics/auc.py @@ -18,102 +18,60 @@ import numpy as np import paddle.fluid as fluid from paddlerec.core.metric import Metric +from paddle.fluid.layers.tensor import Variable -class AUCMetric(Metric): +class AUC(Metric): """ Metric For Fluid Model """ - def __init__(self, config, fleet): + def __init__(self, + input, + label, + curve='ROC', + num_thresholds=2**12 - 1, + topk=1, + slide_steps=1): """ """ - self.config = config - self.fleet = fleet - - def clear(self, scope, params): - """ - Clear current metric value, usually set to zero - Args: - scope : paddle runtime var container - params(dict) : - label : a group name for metric - metric_dict : current metric_items in group - Return: - None - """ - self._label = params['label'] - self._metric_dict = params['metric_dict'] - self._result = {} - place = fluid.CPUPlace() - for metric_name in self._metric_dict: - metric_config = self._metric_dict[metric_name] - if scope.find_var(metric_config['var'].name) is None: - continue - metric_var = 
scope.var(metric_config['var'].name).get_tensor() - data_type = 'float32' - if 'data_type' in metric_config: - data_type = metric_config['data_type'] - data_array = np.zeros(metric_var._get_dims()).astype(data_type) - metric_var.set(data_array, place) - - def get_metric(self, scope, metric_name): - """ - reduce metric named metric_name from all worker - Return: - metric reduce result - """ - metric = np.array(scope.find_var(metric_name).get_tensor()) - old_metric_shape = np.array(metric.shape) - metric = metric.reshape(-1) - global_metric = np.copy(metric) * 0 - self.fleet._role_maker.all_reduce_worker(metric, global_metric) - global_metric = global_metric.reshape(old_metric_shape) - return global_metric[0] - - def get_global_metrics(self, scope, metric_dict): - """ - reduce all metric in metric_dict from all worker - Return: - dict : {matric_name : metric_result} - """ - self.fleet._role_maker._barrier_worker() - result = {} - for metric_name in metric_dict: - metric_item = metric_dict[metric_name] - if scope.find_var(metric_item['var'].name) is None: - result[metric_name] = None - continue - result[metric_name] = self.get_metric(scope, - metric_item['var'].name) - return result - - def calculate_auc(self, global_pos, global_neg): - """R - """ - num_bucket = len(global_pos) - area = 0.0 - pos = 0.0 - neg = 0.0 - new_pos = 0.0 - new_neg = 0.0 - total_ins_num = 0 - for i in range(num_bucket): - index = num_bucket - 1 - i - new_pos = pos + global_pos[index] - total_ins_num += global_pos[index] - new_neg = neg + global_neg[index] - total_ins_num += global_neg[index] - area += (new_neg - neg) * (pos + new_pos) / 2 - pos = new_pos - neg = new_neg - auc_value = None - if pos * neg == 0 or total_ins_num == 0: - auc_value = 0.5 - else: - auc_value = area / (pos * neg) - return auc_value - - def calculate_bucket_error(self, global_pos, global_neg): + if not isinstance(input, Variable): + raise ValueError("input must be Variable, but received %s" % + type(input)) + if not 
isinstance(label, Variable): + raise ValueError("label must be Variable, but received %s" % + type(label)) + + auc_out, batch_auc_out, [ + batch_stat_pos, batch_stat_neg, stat_pos, stat_neg + ] = fluid.layers.auc(input, + label, + curve=curve, + num_thresholds=num_thresholds, + topk=topk, + slide_steps=slide_steps) + + prob = fluid.layers.slice(input, axes=[1], starts=[1], ends=[2]) + label_cast = fluid.layers.cast(label, dtype="float32") + label_cast.stop_gradient = True + sqrerr, abserr, prob, q, pos, total = \ + fluid.contrib.layers.ctr_metric_bundle(prob, label_cast) + + self._global_metric_state_vars = dict() + self._global_metric_state_vars['stat_pos'] = (stat_pos.name, "float32") + self._global_metric_state_vars['stat_neg'] = (stat_neg.name, "float32") + self._global_metric_state_vars['total_ins_num'] = (total.name, + "float32") + self._global_metric_state_vars['pos_ins_num'] = (pos.name, "float32") + self._global_metric_state_vars['q'] = (q.name, "float32") + self._global_metric_state_vars['prob'] = (prob.name, "float32") + self._global_metric_state_vars['abserr'] = (abserr.name, "float32") + self._global_metric_state_vars['sqrerr'] = (sqrerr.name, "float32") + + self.metrics = dict() + self.metrics["AUC"] = auc_out + self.metrics["BATCH_AUC"] = batch_auc_out + + def _calculate_bucket_error(self, global_pos, global_neg): """R """ num_bucket = len(global_pos) @@ -161,56 +119,69 @@ class AUCMetric(Metric): bucket_error = error_sum / error_count if error_count > 0 else 0.0 return bucket_error - def calculate(self, scope, params): - """ """ - self._label = params['label'] - self._metric_dict = params['metric_dict'] - self.fleet._role_maker._barrier_worker() - result = self.get_global_metrics(scope, self._metric_dict) + def _calculate_auc(self, global_pos, global_neg): + """R + """ + num_bucket = len(global_pos) + area = 0.0 + pos = 0.0 + neg = 0.0 + new_pos = 0.0 + new_neg = 0.0 + total_ins_num = 0 + for i in range(num_bucket): + index = num_bucket - 1 - i + 
new_pos = pos + global_pos[index] + total_ins_num += global_pos[index] + new_neg = neg + global_neg[index] + total_ins_num += global_neg[index] + area += (new_neg - neg) * (pos + new_pos) / 2 + pos = new_pos + neg = new_neg + auc_value = None + if pos * neg == 0 or total_ins_num == 0: + auc_value = 0.5 + else: + auc_value = area / (pos * neg) + return auc_value + + def _calculate(self, global_metrics): + result = dict() + for key in self._global_metric_state_vars: + if key not in global_metrics: + raise ValueError("%s not existed" % key) + result[key] = global_metrics[key][0] + if result['total_ins_num'] == 0: - self._result = result - self._result['auc'] = 0 - self._result['bucket_error'] = 0 - self._result['actual_ctr'] = 0 - self._result['predict_ctr'] = 0 - self._result['mae'] = 0 - self._result['rmse'] = 0 - self._result['copc'] = 0 - self._result['mean_q'] = 0 - return self._result - if 'stat_pos' in result and 'stat_neg' in result: - result['auc'] = self.calculate_auc(result['stat_pos'], - result['stat_neg']) - result['bucket_error'] = self.calculate_auc(result['stat_pos'], - result['stat_neg']) - if 'pos_ins_num' in result: + result['auc'] = 0 + result['bucket_error'] = 0 + result['actual_ctr'] = 0 + result['predict_ctr'] = 0 + result['mae'] = 0 + result['rmse'] = 0 + result['copc'] = 0 + result['mean_q'] = 0 + else: + result['auc'] = self._calculate_auc(result['stat_pos'], + result['stat_neg']) + result['bucket_error'] = self._calculate_bucket_error( + result['stat_pos'], result['stat_neg']) result['actual_ctr'] = result['pos_ins_num'] / result[ 'total_ins_num'] - if 'abserr' in result: result['mae'] = result['abserr'] / result['total_ins_num'] - if 'sqrerr' in result: result['rmse'] = math.sqrt(result['sqrerr'] / result['total_ins_num']) - if 'prob' in result: result['predict_ctr'] = result['prob'] / result['total_ins_num'] if abs(result['predict_ctr']) > 1e-6: result['copc'] = result['actual_ctr'] / result['predict_ctr'] - - if 'q' in result: 
result['mean_q'] = result['q'] / result['total_ins_num'] - self._result = result - return result - - def get_result(self): - """ """ - return self._result - def __str__(self): - """ """ - result = self.get_result() - result_str = "%s AUC=%.6f BUCKET_ERROR=%.6f MAE=%.6f RMSE=%.6f " \ + result_str = "AUC=%.6f BUCKET_ERROR=%.6f MAE=%.6f RMSE=%.6f " \ "Actural_CTR=%.6f Predicted_CTR=%.6f COPC=%.6f MEAN Q_VALUE=%.6f Ins number=%s" % \ - (self._label, result['auc'], result['bucket_error'], result['mae'], result['rmse'], + (result['auc'], result['bucket_error'], result['mae'], result['rmse'], result['actual_ctr'], result['predict_ctr'], result['copc'], result['mean_q'], result['total_ins_num']) return result_str + + def get_result(self): + return self.metrics diff --git a/core/metrics/pairwise_pn.py b/core/metrics/pairwise_pn.py new file mode 100755 index 0000000000000000000000000000000000000000..fb10e1fc349d1120255f421cd510c40842eca557 --- /dev/null +++ b/core/metrics/pairwise_pn.py @@ -0,0 +1,101 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
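The bucketed AUC in `_calculate_auc` above integrates the ROC curve with the trapezoidal rule over per-threshold positive/negative instance counts. A minimal plain-Python sketch of the same arithmetic (the function name is illustrative, not part of this patch):

```python
def bucketed_auc(global_pos, global_neg):
    """AUC from per-bucket positive/negative counts, walking the
    thresholds from high to low and summing ROC trapezoids
    (same arithmetic as AUC._calculate_auc)."""
    area = pos = neg = 0.0
    total_ins_num = 0
    for index in range(len(global_pos) - 1, -1, -1):
        new_pos = pos + global_pos[index]
        new_neg = neg + global_neg[index]
        total_ins_num += global_pos[index] + global_neg[index]
        # trapezoid between the previous and the new ROC point
        area += (new_neg - neg) * (pos + new_pos) / 2
        pos, neg = new_pos, new_neg
    if pos * neg == 0 or total_ins_num == 0:
        return 0.5  # degenerate: only one class observed
    return area / (pos * neg)
```

Perfectly separated buckets (all positives scored above all negatives) yield 1.0; identical positive and negative distributions yield 0.5.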
+ +import math + +import numpy as np +import paddle.fluid as fluid + +from paddlerec.core.metric import Metric +from paddle.fluid.initializer import Constant +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.layers.tensor import Variable + + +class PosNegRatio(Metric): + """ + Metric For Fluid Model + """ + + def __init__(self, pos_score, neg_score): + """ """ + kwargs = locals() + del kwargs['self'] + + helper = LayerHelper("PaddleRec_PosNegRatio", **kwargs) + if "pos_score" not in kwargs or "neg_score" not in kwargs: + raise ValueError( + "PosNegRatio expect pos_score and neg_score as inputs.") + pos_score = kwargs.get('pos_score') + neg_score = kwargs.get('neg_score') + + if not isinstance(pos_score, Variable): + raise ValueError("pos_score must be Variable, but received %s" % + type(pos_score)) + if not isinstance(neg_score, Variable): + raise ValueError("neg_score must be Variable, but received %s" % + type(neg_score)) + + wrong = fluid.layers.cast( + fluid.layers.less_equal(pos_score, neg_score), dtype='float32') + wrong_cnt = fluid.layers.reduce_sum(wrong) + right = fluid.layers.cast( + fluid.layers.less_than(neg_score, pos_score), dtype='float32') + right_cnt = fluid.layers.reduce_sum(right) + + global_right_cnt, _ = helper.create_or_get_global_variable( + name="right_cnt", persistable=True, dtype='float32', shape=[1]) + global_wrong_cnt, _ = helper.create_or_get_global_variable( + name="wrong_cnt", persistable=True, dtype='float32', shape=[1]) + + for var in [global_right_cnt, global_wrong_cnt]: + helper.set_variable_initializer( + var, Constant( + value=0.0, force_cpu=True)) + + helper.append_op( + type="elementwise_add", + inputs={"X": [global_right_cnt], + "Y": [right_cnt]}, + outputs={"Out": [global_right_cnt]}) + helper.append_op( + type="elementwise_add", + inputs={"X": [global_wrong_cnt], + "Y": [wrong_cnt]}, + outputs={"Out": [global_wrong_cnt]}) + self.pn = (global_right_cnt + 1.0) / (global_wrong_cnt + 1.0) + + 
self._global_metric_state_vars = dict() + self._global_metric_state_vars['right_cnt'] = (global_right_cnt.name, + "float32") + self._global_metric_state_vars['wrong_cnt'] = (global_wrong_cnt.name, + "float32") + + self.metrics = dict() + self.metrics['WrongCnt'] = global_wrong_cnt + self.metrics['RightCnt'] = global_right_cnt + self.metrics['PN'] = self.pn + + def _calculate(self, global_metrics): + for key in self._global_metric_state_vars: + if key not in global_metrics: + raise ValueError("%s not existed" % key) + pn = (global_metrics['right_cnt'][0] + 1.0) / ( + global_metrics['wrong_cnt'][0] + 1.0) + return "RightCnt=%s WrongCnt=%s PN=%s" % ( + str(global_metrics['right_cnt'][0]), + str(global_metrics['wrong_cnt'][0]), str(pn)) + + def get_result(self): + return self.metrics diff --git a/core/metrics/precision.py b/core/metrics/precision.py deleted file mode 100755 index 4b9b4bd3101854f70308455cabc67bb64249b5dc..0000000000000000000000000000000000000000 --- a/core/metrics/precision.py +++ /dev/null @@ -1,109 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
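The `PosNegRatio` metric above counts, pair by pair, how often the positive example outscores the negative one, and smooths both counters by +1 so the ratio stays finite. A sketch of the same computation (hypothetical helper, not from this patch):

```python
def pos_neg_ratio(pos_scores, neg_scores):
    """Smoothed ratio of correctly ordered pairs (pos > neg) to
    wrongly ordered pairs (pos <= neg), as PosNegRatio computes
    with fluid ops on global right_cnt/wrong_cnt counters."""
    right_cnt = sum(1.0 for p, n in zip(pos_scores, neg_scores) if p > n)
    wrong_cnt = sum(1.0 for p, n in zip(pos_scores, neg_scores) if p <= n)
    # add-one smoothing keeps the ratio finite when wrong_cnt == 0
    return (right_cnt + 1.0) / (wrong_cnt + 1.0)
```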
- -import math - -import numpy as np -import paddle.fluid as fluid - -from paddlerec.core.metric import Metric -from paddle.fluid.layers import nn, accuracy -from paddle.fluid.initializer import Constant -from paddle.fluid.layer_helper import LayerHelper - - -class Precision(Metric): - """ - Metric For Fluid Model - """ - - def __init__(self, **kwargs): - """ """ - helper = LayerHelper("PaddleRec_Precision", **kwargs) - self.batch_accuracy = accuracy( - kwargs.get("input"), kwargs.get("label"), kwargs.get("k")) - local_ins_num, _ = helper.create_or_get_global_variable( - name="local_ins_num", persistable=True, dtype='float32', - shape=[1]) - local_pos_num, _ = helper.create_or_get_global_variable( - name="local_pos_num", persistable=True, dtype='float32', - shape=[1]) - - batch_pos_num, _ = helper.create_or_get_global_variable( - name="batch_pos_num", - persistable=False, - dtype='float32', - shape=[1]) - batch_ins_num, _ = helper.create_or_get_global_variable( - name="batch_ins_num", - persistable=False, - dtype='float32', - shape=[1]) - - tmp_ones = helper.create_global_variable( - name="batch_size_like_ones", - persistable=False, - dtype='float32', - shape=[-1]) - - for var in [ - batch_pos_num, batch_ins_num, local_pos_num, local_ins_num - ]: - print(var, type(var)) - helper.set_variable_initializer( - var, Constant( - value=0.0, force_cpu=True)) - - helper.append_op( - type='fill_constant_batch_size_like', - inputs={"Input": kwargs.get("label")}, - outputs={'Out': [tmp_ones]}, - attrs={ - 'shape': [-1, 1], - 'dtype': tmp_ones.dtype, - 'value': float(1.0), - }) - helper.append_op( - type="reduce_sum", - inputs={"X": [tmp_ones]}, - outputs={"Out": [batch_ins_num]}) - - helper.append_op( - type="elementwise_mul", - inputs={"X": [batch_ins_num], - "Y": [self.batch_accuracy]}, - outputs={"Out": [batch_pos_num]}) - - helper.append_op( - type="elementwise_add", - inputs={"X": [local_pos_num], - "Y": [batch_pos_num]}, - outputs={"Out": [local_pos_num]}) - - 
helper.append_op( - type="elementwise_add", - inputs={"X": [local_ins_num], - "Y": [batch_ins_num]}, - outputs={"Out": [local_ins_num]}) - - self.accuracy = local_pos_num / local_ins_num - - self._need_clear_list = [("local_ins_num", "float32"), - ("local_pos_num", "float32")] - self.metrics = dict() - metric_varname = "P@%d" % kwargs.get("k") - self.metrics[metric_varname] = self.accuracy - - def get_result(self): - return self.metrics diff --git a/core/metrics/precision_recall.py b/core/metrics/precision_recall.py new file mode 100755 index 0000000000000000000000000000000000000000..f7f25ca808642c4a8543bdd464b4748c421653e8 --- /dev/null +++ b/core/metrics/precision_recall.py @@ -0,0 +1,156 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
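The deleted `Precision` metric and its `RecallK` replacement (introduced later in this patch) share one pattern: accumulate batch size and batch_size * batch_accuracy into persistable counters, then divide at report time. In plain Python (illustrative class, not part of the PR):

```python
class StreamingAccuracy:
    """Accumulate per-batch top-k accuracy into global counters,
    mirroring the persistable ins_cnt/pos_cnt variables that the
    removed Precision metric and the new RecallK maintain."""

    def __init__(self):
        self.ins_cnt = 0.0  # total instances seen
        self.pos_cnt = 0.0  # instances whose label fell in the top-k

    def update(self, batch_size, batch_accuracy):
        self.ins_cnt += batch_size
        self.pos_cnt += batch_size * batch_accuracy

    def result(self):
        return self.pos_cnt / self.ins_cnt if self.ins_cnt > 0 else 0.0
```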
+ +import math + +import numpy as np +import paddle.fluid as fluid + +from paddlerec.core.metric import Metric +from paddle.fluid.initializer import Constant +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.layers.tensor import Variable + + +class PrecisionRecall(Metric): + """ + Metric For Fluid Model + """ + + def __init__(self, input, label, class_num): + """R + """ + kwargs = locals() + del kwargs['self'] + + self.num_cls = class_num + + if not isinstance(input, Variable): + raise ValueError("input must be Variable, but received %s" % + type(input)) + if not isinstance(label, Variable): + raise ValueError("label must be Variable, but received %s" % + type(label)) + + helper = LayerHelper("PaddleRec_PrecisionRecall", **kwargs) + label = fluid.layers.cast(label, dtype="int32") + label.stop_gradient = True + max_probs, indices = fluid.layers.nn.topk(input, k=1) + indices = fluid.layers.cast(indices, dtype="int32") + indices.stop_gradient = True + + states_info, _ = helper.create_or_get_global_variable( + name="states_info", + persistable=True, + dtype='float32', + shape=[self.num_cls, 4]) + states_info.stop_gradient = True + + helper.set_variable_initializer( + states_info, Constant( + value=0.0, force_cpu=True)) + + batch_metrics, _ = helper.create_or_get_global_variable( + name="batch_metrics", + persistable=False, + dtype='float32', + shape=[6]) + accum_metrics, _ = helper.create_or_get_global_variable( + name="global_metrics", + persistable=False, + dtype='float32', + shape=[6]) + + batch_states = fluid.layers.fill_constant( + shape=[self.num_cls, 4], value=0.0, dtype="float32") + batch_states.stop_gradient = True + + helper.append_op( + type="precision_recall", + attrs={'class_number': self.num_cls}, + inputs={ + 'MaxProbs': [max_probs], + 'Indices': [indices], + 'Labels': [label], + 'StatesInfo': [states_info] + }, + outputs={ + 'BatchMetrics': [batch_metrics], + 'AccumMetrics': [accum_metrics], + 'AccumStatesInfo': [batch_states] + }) + 
helper.append_op( + type="assign", + inputs={'X': [batch_states]}, + outputs={'Out': [states_info]}) + + batch_states.stop_gradient = True + states_info.stop_gradient = True + + self._global_metric_state_vars = dict() + self._global_metric_state_vars['states_info'] = (states_info.name, + "float32") + + self.metrics = dict() + self.metrics["precision_recall_f1"] = accum_metrics + self.metrics["[TP FP TN FN]"] = states_info + + def _calculate(self, global_metrics): + for key in self._global_metric_state_vars: + if key not in global_metrics: + raise ValueError("%s not existed" % key) + + def calc_precision(tp_count, fp_count): + if tp_count > 0.0 or fp_count > 0.0: + return tp_count / (tp_count + fp_count) + return 1.0 + + def calc_recall(tp_count, fn_count): + if tp_count > 0.0 or fn_count > 0.0: + return tp_count / (tp_count + fn_count) + return 1.0 + + def calc_f1_score(precision, recall): + if precision > 0.0 or recall > 0.0: + return 2 * precision * recall / (precision + recall) + return 0.0 + + states = global_metrics["states_info"] + total_tp_count = 0.0 + total_fp_count = 0.0 + total_fn_count = 0.0 + macro_avg_precision = 0.0 + macro_avg_recall = 0.0 + for i in range(self.num_cls): + total_tp_count += states[i][0] + total_fp_count += states[i][1] + total_fn_count += states[i][3] + macro_avg_precision += calc_precision(states[i][0], states[i][1]) + macro_avg_recall += calc_recall(states[i][0], states[i][3]) + metrics = [] + macro_avg_precision /= self.num_cls + macro_avg_recall /= self.num_cls + metrics.append(macro_avg_precision) + metrics.append(macro_avg_recall) + metrics.append(calc_f1_score(macro_avg_precision, macro_avg_recall)) + micro_avg_precision = calc_precision(total_tp_count, total_fp_count) + metrics.append(micro_avg_precision) + micro_avg_recall = calc_recall(total_tp_count, total_fn_count) + metrics.append(micro_avg_recall) + metrics.append(calc_f1_score(micro_avg_precision, micro_avg_recall)) + return "total metrics: [TP, FP, TN, FN]=%s; 
precision_recall_f1=%s" % ( + str(states), str(np.array(metrics).astype('float32'))) + + def get_result(self): + return self.metrics diff --git a/core/metrics/recall_k.py b/core/metrics/recall_k.py new file mode 100755 index 0000000000000000000000000000000000000000..f727c25e97bf1486886310c30e2304cba568c8b8 --- /dev/null +++ b/core/metrics/recall_k.py @@ -0,0 +1,103 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math + +import numpy as np +import paddle.fluid as fluid + +from paddlerec.core.metric import Metric +from paddle.fluid.layers import accuracy +from paddle.fluid.initializer import Constant +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.layers.tensor import Variable + + +class RecallK(Metric): + """ + Metric For Fluid Model + """ + + def __init__(self, input, label, k=20): + """ """ + kwargs = locals() + del kwargs['self'] + self.k = k + + if not isinstance(input, Variable): + raise ValueError("input must be Variable, but received %s" % + type(input)) + if not isinstance(label, Variable): + raise ValueError("label must be Variable, but received %s" % + type(label)) + + helper = LayerHelper("PaddleRec_RecallK", **kwargs) + batch_accuracy = accuracy(input, label, self.k) + global_ins_cnt, _ = helper.create_or_get_global_variable( + name="ins_cnt", persistable=True, dtype='float32', shape=[1]) + global_pos_cnt, _ = helper.create_or_get_global_variable( + 
name="pos_cnt", persistable=True, dtype='float32', shape=[1]) + + for var in [global_ins_cnt, global_pos_cnt]: + helper.set_variable_initializer( + var, Constant( + value=0.0, force_cpu=True)) + + tmp_ones = fluid.layers.fill_constant( + shape=fluid.layers.shape(label), dtype="float32", value=1.0) + batch_ins = fluid.layers.reduce_sum(tmp_ones) + batch_pos = batch_ins * batch_accuracy + + helper.append_op( + type="elementwise_add", + inputs={"X": [global_ins_cnt], + "Y": [batch_ins]}, + outputs={"Out": [global_ins_cnt]}) + + helper.append_op( + type="elementwise_add", + inputs={"X": [global_pos_cnt], + "Y": [batch_pos]}, + outputs={"Out": [global_pos_cnt]}) + + self.acc = global_pos_cnt / global_ins_cnt + + self._global_metric_state_vars = dict() + self._global_metric_state_vars['ins_cnt'] = (global_ins_cnt.name, + "float32") + self._global_metric_state_vars['pos_cnt'] = (global_pos_cnt.name, + "float32") + + metric_name = "Acc(Recall@%d)" % self.k + self.metrics = dict() + self.metrics["InsCnt"] = global_ins_cnt + self.metrics["RecallCnt"] = global_pos_cnt + self.metrics[metric_name] = self.acc + + # self.metrics["batch_metrics"] = batch_metrics + def _calculate(self, global_metrics): + for key in self._global_metric_state_vars: + if key not in global_metrics: + raise ValueError("%s not existed" % key) + ins_cnt = global_metrics['ins_cnt'][0] + pos_cnt = global_metrics['pos_cnt'][0] + if ins_cnt == 0: + acc = 0 + else: + acc = float(pos_cnt) / ins_cnt + return "InsCnt=%s RecallCnt=%s Acc(Recall@%d)=%s" % ( + str(ins_cnt), str(pos_cnt), self.k, str(acc)) + + def get_result(self): + return self.metrics diff --git a/core/trainer.py b/core/trainer.py index 8b1afd449a70265d5bcae9996d42795a1235197a..bbba6250529283d24389e2719b7110f8aa321973 100755 --- a/core/trainer.py +++ b/core/trainer.py @@ -107,6 +107,7 @@ class Trainer(object): self.device = Device.GPU gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) self._place = fluid.CUDAPlace(gpu_id) + print("PaddleRec run 
on device GPU: {}".format(gpu_id)) self._exe = fluid.Executor(self._place) elif device == "CPU": self.device = Device.CPU @@ -146,6 +147,7 @@ class Trainer(object): elif engine.upper() == "CLUSTER": self.engine = EngineMode.CLUSTER self.is_fleet = True + self.which_cluster_type() else: raise ValueError("Not Support Engine {}".format(engine)) self._context["is_fleet"] = self.is_fleet @@ -165,6 +167,14 @@ class Trainer(object): self._context["is_pslib"] = (fleet_mode.upper() == "PSLIB") self._context["fleet_mode"] = fleet_mode + def which_cluster_type(self): + cluster_type = os.getenv("PADDLEREC_CLUSTER_TYPE", "MPI") + print("PADDLEREC_CLUSTER_TYPE: {}".format(cluster_type)) + if cluster_type and cluster_type.upper() == "K8S": + self._context["cluster_type"] = "K8S" + else: + self._context["cluster_type"] = "MPI" + def which_executor_mode(self): executor_mode = envs.get_runtime_environ("train.trainer.executor_mode") if executor_mode.upper() not in ["TRAIN", "INFER"]: diff --git a/core/trainers/finetuning_trainer.py b/core/trainers/finetuning_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..4525a18867ff232121256c876c185c502427c130 --- /dev/null +++ b/core/trainers/finetuning_trainer.py @@ -0,0 +1,140 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
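The new `Trainer.which_cluster_type` hook above picks the cluster flavor from the `PADDLEREC_CLUSTER_TYPE` environment variable, defaulting to MPI. A standalone sketch of that selection logic:

```python
import os

def which_cluster_type(context):
    """Resolve the cluster type from the environment, as
    Trainer.which_cluster_type does: anything other than
    K8S (case-insensitive) falls back to MPI."""
    cluster_type = os.getenv("PADDLEREC_CLUSTER_TYPE", "MPI")
    if cluster_type and cluster_type.upper() == "K8S":
        context["cluster_type"] = "K8S"
    else:
        context["cluster_type"] = "MPI"
    return context["cluster_type"]
```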
+""" +FineTuning Trainer; currently only single-machine (Single) training is supported +""" +from __future__ import print_function + +import os + +from paddlerec.core.utils import envs +from paddlerec.core.trainer import Trainer, EngineMode, FleetMode + + +class FineTuningTrainer(Trainer): + """ + Trainer for fine-tuning a pretrained model + """ + + def __init__(self, config=None): + Trainer.__init__(self, config) + self.processor_register() + self.abs_dir = os.path.dirname(os.path.abspath(__file__)) + self.runner_env_name = "runner." + self._context["runner_name"] + + def processor_register(self): + print("processor_register begin") + self.regist_context_processor('uninit', self.instance) + self.regist_context_processor('network_pass', self.network) + self.regist_context_processor('startup_pass', self.startup) + self.regist_context_processor('train_pass', self.runner) + self.regist_context_processor('terminal_pass', self.terminal) + + def instance(self, context): + instance_class_path = envs.get_global_env( + self.runner_env_name + ".instance_class_path", default_value=None) + if instance_class_path: + instance_class = envs.lazy_instance_by_fliename( + instance_class_path, "Instance")(context) + else: + if self.engine == EngineMode.SINGLE: + instance_class_name = "SingleInstance" + else: + raise ValueError( + "FineTuningTrainer can only support SingleTraining.") + + instance_path = os.path.join(self.abs_dir, "framework", + "instance.py") + + instance_class = envs.lazy_instance_by_fliename( + instance_path, instance_class_name)(context) + + instance_class.instance(context) + + def network(self, context): + network_class_path = envs.get_global_env( + self.runner_env_name + ".network_class_path", default_value=None) + if network_class_path: + network_class = envs.lazy_instance_by_fliename(network_class_path, + "Network")(context) + else: + if self.engine == EngineMode.SINGLE: + network_class_name = "FineTuningNetwork" + else: + raise ValueError( + "FineTuningTrainer 
can only support SingleTraining.") + + network_path = os.path.join(self.abs_dir, "framework", + "network.py") + network_class = envs.lazy_instance_by_fliename( + network_path, network_class_name)(context) + + network_class.build_network(context) + + def startup(self, context): + startup_class_path = envs.get_global_env( + self.runner_env_name + ".startup_class_path", default_value=None) + if startup_class_path: + startup_class = envs.lazy_instance_by_fliename(startup_class_path, + "Startup")(context) + else: + if self.engine == EngineMode.SINGLE and not context["is_infer"]: + startup_class_name = "FineTuningStartup" + else: + raise ValueError( + "FineTuningTrainer can only support SingleTraining.") + + startup_path = os.path.join(self.abs_dir, "framework", + "startup.py") + + startup_class = envs.lazy_instance_by_fliename( + startup_path, startup_class_name)(context) + startup_class.startup(context) + + def runner(self, context): + runner_class_path = envs.get_global_env( + self.runner_env_name + ".runner_class_path", default_value=None) + if runner_class_path: + runner_class = envs.lazy_instance_by_fliename(runner_class_path, + "Runner")(context) + else: + if self.engine == EngineMode.SINGLE and not context["is_infer"]: + runner_class_name = "SingleRunner" + else: + raise ValueError( + "FineTuningTrainer can only support SingleTraining.") + + runner_path = os.path.join(self.abs_dir, "framework", "runner.py") + runner_class = envs.lazy_instance_by_fliename( + runner_path, runner_class_name)(context) + runner_class.run(context) + + def terminal(self, context): + terminal_class_path = envs.get_global_env( + self.runner_env_name + ".terminal_class_path", default_value=None) + if terminal_class_path: + terminal_class = envs.lazy_instance_by_fliename( + terminal_class_path, "Terminal")(context) + terminal_class.terminal(context) + else: + terminal_class_name = "TerminalBase" + if self.engine != EngineMode.SINGLE and self.fleet_mode != FleetMode.COLLECTIVE: + 
terminal_class_name = "PSTerminal" + + terminal_path = os.path.join(self.abs_dir, "framework", + "terminal.py") + terminal_class = envs.lazy_instance_by_fliename( + terminal_path, terminal_class_name)(context) + terminal_class.terminal(context) + context['is_exit'] = True diff --git a/core/trainers/framework/dataset.py b/core/trainers/framework/dataset.py index 8059eeb09a482671b8329fb88f5b52cfd64f163b..5c5a2357ff4a07d54d4e0c56e692b4d79fcb2095 100644 --- a/core/trainers/framework/dataset.py +++ b/core/trainers/framework/dataset.py @@ -123,10 +123,21 @@ class QueueDataset(DatasetBase): os.path.join(train_data_path, x) for x in os.listdir(train_data_path) ] + file_list.sort() + need_split_files = False if context["engine"] == EngineMode.LOCAL_CLUSTER: + # for local cluster: split files for multi process + need_split_files = True + elif context["engine"] == EngineMode.CLUSTER and context[ + "cluster_type"] == "K8S": + # for k8s mount afs, split files for every node + need_split_files = True + + if need_split_files: file_list = split_files(file_list, context["fleet"].worker_index(), context["fleet"].worker_num()) print("File_list: {}".format(file_list)) + dataset.set_filelist(file_list) for model_dict in context["phases"]: if model_dict["dataset_name"] == dataset_name: diff --git a/core/trainers/framework/network.py b/core/trainers/framework/network.py index 74d2c97540419b15e6a5d0f87b3c5af368a7e9b3..7d7a8273b6a402bd163f653a7beb3900de899ae3 100644 --- a/core/trainers/framework/network.py +++ b/core/trainers/framework/network.py @@ -23,7 +23,7 @@ from paddlerec.core.trainers.framework.dataset import DataLoader, QueueDataset __all__ = [ "NetworkBase", "SingleNetwork", "PSNetwork", "PslibNetwork", - "CollectiveNetwork" + "CollectiveNetwork", "FineTuningNetwork" ] @@ -99,7 +99,90 @@ class SingleNetwork(NetworkBase): context["dataset"] = {} for dataset in context["env"]["dataset"]: type = envs.get_global_env("dataset." 
+ dataset["name"] + ".type") - if type != "DataLoader": + + if type == "QueueDataset": + dataset_class = QueueDataset(context) + context["dataset"][dataset[ + "name"]] = dataset_class.create_dataset(dataset["name"], + context) + + context["status"] = "startup_pass" + + +class FineTuningNetwork(NetworkBase): + """R + """ + + def __init__(self, context): + print("Running FineTuningNetwork.") + + def build_network(self, context): + context["model"] = {} + for model_dict in context["phases"]: + context["model"][model_dict["name"]] = {} + train_program = fluid.Program() + startup_program = fluid.Program() + scope = fluid.Scope() + dataset_name = model_dict["dataset_name"] + + with fluid.program_guard(train_program, startup_program): + with fluid.unique_name.guard(): + with fluid.scope_guard(scope): + model_path = envs.os_path_adapter( + envs.workspace_adapter(model_dict["model"])) + model = envs.lazy_instance_by_fliename( + model_path, "Model")(context["env"]) + + model._data_var = model.input_data( + dataset_name=model_dict["dataset_name"]) + + if envs.get_global_env("dataset." + dataset_name + + ".type") == "DataLoader": + model._init_dataloader( + is_infer=context["is_infer"]) + data_loader = DataLoader(context) + data_loader.get_dataloader(context, dataset_name, + model._data_loader) + + model.net(model._data_var, context["is_infer"]) + + finetuning_varnames = envs.get_global_env( + "runner." 
+ context["runner_name"] + + ".finetuning_aspect_varnames", + default_value=[]) + + if len(finetuning_varnames) == 0: + raise ValueError( + "nothing needs to be fine-tuned, you may use another training mode" + ) + + if len(finetuning_varnames) != 1: + raise ValueError( + "fine-tuning mode can only accept one varname for now" + ) + + varname = finetuning_varnames[0] + finetuning_vars = train_program.global_block().vars[ + varname] + finetuning_vars.stop_gradient = True + optimizer = model.optimizer() + optimizer.minimize(model._cost) + + context["model"][model_dict["name"]][ + "main_program"] = train_program + context["model"][model_dict["name"]][ + "startup_program"] = startup_program + context["model"][model_dict["name"]]["scope"] = scope + context["model"][model_dict["name"]]["model"] = model + context["model"][model_dict["name"]][ + "default_main_program"] = train_program.clone() + context["model"][model_dict["name"]]["compiled_program"] = None + + context["dataset"] = {} + for dataset in context["env"]["dataset"]: + type = envs.get_global_env("dataset." + dataset["name"] + ".type") + + if type == "QueueDataset": dataset_class = QueueDataset(context) context["dataset"][dataset[ "name"]] = dataset_class.create_dataset(dataset["name"], @@ -133,9 +216,7 @@ class PSNetwork(NetworkBase): if envs.get_global_env("dataset." + dataset_name + ".type") == "DataLoader": model._init_dataloader(is_infer=False) - data_loader = DataLoader(context) - data_loader.get_dataloader(context, dataset_name, - model._data_loader) + model.net(model._data_var, False) optimizer = model.optimizer() strategy = self._build_strategy(context) @@ -160,7 +241,11 @@ class PSNetwork(NetworkBase): for dataset in context["env"]["dataset"]: type = envs.get_global_env("dataset." 
+ dataset["name"] + ".type") - if type != "DataLoader": + if type == "DataLoader": + data_loader = DataLoader(context) + data_loader.get_dataloader(context, dataset_name, + model._data_loader) + elif type == "QueueDataset": dataset_class = QueueDataset(context) context["dataset"][dataset[ "name"]] = dataset_class.create_dataset( @@ -229,9 +314,6 @@ class PslibNetwork(NetworkBase): if envs.get_global_env("dataset." + dataset_name + ".type") == "DataLoader": model._init_dataloader(is_infer=False) - data_loader = DataLoader(context) - data_loader.get_dataloader(context, dataset_name, - model._data_loader) model.net(model._data_var, False) optimizer = model.optimizer() @@ -257,7 +339,11 @@ class PslibNetwork(NetworkBase): for dataset in context["env"]["dataset"]: type = envs.get_global_env("dataset." + dataset["name"] + ".type") - if type != "DataLoader": + if type == "DataLoader": + data_loader = DataLoader(context) + data_loader.get_dataloader(context, dataset_name, context[ + "model"][model_dict["name"]]["model"]._data_loader) + elif type == "QueueDataset": dataset_class = QueueDataset(context) context["dataset"][dataset[ "name"]] = dataset_class.create_dataset( @@ -323,7 +409,10 @@ class CollectiveNetwork(NetworkBase): context["dataset"] = {} for dataset in context["env"]["dataset"]: type = envs.get_global_env("dataset." + dataset["name"] + ".type") - if type != "DataLoader": + if type == "QueueDataset": + raise ValueError( + "Collective mode does not support QueueDataset training, please use DataLoader."
+ ) dataset_class = QueueDataset(context) context["dataset"][dataset[ "name"]] = dataset_class.create_dataset(dataset["name"], diff --git a/core/trainers/framework/runner.py b/core/trainers/framework/runner.py index d5fced11ffd546b36ee7db3e596f061bf8a58328..79d7be66e58d0c4244980cf4bf871f42984d186e 100644 --- a/core/trainers/framework/runner.py +++ b/core/trainers/framework/runner.py @@ -16,10 +16,12 @@ from __future__ import print_function import os import time +import warnings import numpy as np import paddle.fluid as fluid from paddlerec.core.utils import envs +from paddlerec.core.metric import Metric __all__ = [ "RunnerBase", "SingleRunner", "PSRunner", "CollectiveRunner", "PslibRunner" @@ -77,9 +79,10 @@ class RunnerBase(object): name = "dataset." + reader_name + "." if envs.get_global_env(name + "type") == "DataLoader": - self._executor_dataloader_train(model_dict, context) + return self._executor_dataloader_train(model_dict, context) else: self._executor_dataset_train(model_dict, context) + return None def _executor_dataset_train(self, model_dict, context): reader_name = model_dict["dataset_name"] @@ -137,8 +140,10 @@ class RunnerBase(object): metrics_varnames = [] metrics_format = [] + metrics_names = ["total_batch"] metrics_format.append("{}: {{}}".format("batch")) for name, var in metrics.items(): + metrics_names.append(name) metrics_varnames.append(var.name) metrics_format.append("{}: {{}}".format(name)) metrics_format = ", ".join(metrics_format) @@ -147,6 +152,7 @@ class RunnerBase(object): reader.start() batch_id = 0 scope = context["model"][model_name]["scope"] + result = None with fluid.scope_guard(scope): try: while True: @@ -168,6 +174,10 @@ class RunnerBase(object): except fluid.core.EOFException: reader.reset() + if batch_id > 0: + result = dict(zip(metrics_names, metrics)) + return result + def _get_dataloader_program(self, model_dict, context): model_name = model_dict["name"] if context["model"][model_name]["compiled_program"] == None: @@ -275,6 
+285,7 @@ class RunnerBase(object): return (epoch_id + 1) % epoch_interval == 0 def save_inference_model(): + # get global env name = "runner." + context["runner_name"] + "." save_interval = int( envs.get_global_env(name + "save_inference_interval", -1)) @@ -287,18 +298,44 @@ class RunnerBase(object): if feed_varnames is None or fetch_varnames is None or feed_varnames == "" or fetch_varnames == "" or \ len(feed_varnames) == 0 or len(fetch_varnames) == 0: return - fetch_vars = [ - fluid.default_main_program().global_block().vars[varname] - for varname in fetch_varnames - ] + + # check that every feed var exists + for var_name in feed_varnames: + if var_name not in fluid.default_main_program().global_block( + ).vars: + raise ValueError( + "Feed variable: {} not in default_main_program, the global block has the following vars: {}". + format(var_name, + fluid.default_main_program().global_block() + .vars.keys())) + + # check that every fetch var exists + fetch_vars = [] + for var_name in fetch_varnames: + if var_name not in fluid.default_main_program().global_block( + ).vars: + raise ValueError( + "Fetch variable: {} not in default_main_program, the global block has the following vars: {}". + format(var_name, + fluid.default_main_program().global_block() + .vars.keys())) + else: + fetch_vars.append(fluid.default_main_program() + .global_block().vars[var_name]) + dirname = envs.get_global_env(name + "save_inference_path", None) assert dirname is not None dirname = os.path.join(dirname, str(epoch_id)) if is_fleet: + warnings.warn( + "Saving an inference model in cluster training is not recommended!
Use save checkpoint instead.", + category=UserWarning, + stacklevel=2) + if context["fleet"].worker_index() == 0: + context["fleet"].save_inference_model( + context["exe"], dirname, feed_varnames, fetch_vars) else: fluid.io.save_inference_model(dirname, feed_varnames, fetch_vars, context["exe"]) @@ -314,7 +351,8 @@ class RunnerBase(object): return dirname = os.path.join(dirname, str(epoch_id)) if is_fleet: - context["fleet"].save_persistables(context["exe"], dirname) + if context["fleet"].worker_index() == 0: + context["fleet"].save_persistables(context["exe"], dirname) else: fluid.io.save_persistables(context["exe"], dirname) @@ -336,11 +374,28 @@ class SingleRunner(RunnerBase): ".epochs")) for epoch in range(epochs): for model_dict in context["phases"]: + model_class = context["model"][model_dict["name"]]["model"] + metrics = model_class._metrics + begin_time = time.time() - self._run(context, model_dict) + result = self._run(context, model_dict) end_time = time.time() seconds = end_time - begin_time - print("epoch {} done, use time: {}".format(epoch, seconds)) + message = "epoch {} done, use time: {}".format(epoch, seconds) + metrics_result = [] + for key in metrics: + if isinstance(metrics[key], Metric): + _str = metrics[key].calc_global_metrics( + None, + context["model"][model_dict["name"]]["scope"]) + metrics_result.append(_str) + elif result is not None: + _str = "{}={}".format(key, result[key]) + metrics_result.append(_str) + if len(metrics_result) > 0: + message += ", global metrics: " + ", ".join(metrics_result) + print(message) + with fluid.scope_guard(context["model"][model_dict["name"]][ "scope"]): train_prog = context["model"][model_dict["name"]][ @@ -362,12 +417,32 @@ class PSRunner(RunnerBase): envs.get_global_env("runner."
+ context["runner_name"] + ".epochs")) model_dict = context["env"]["phase"][0] + model_class = context["model"][model_dict["name"]]["model"] + metrics = model_class._metrics for epoch in range(epochs): begin_time = time.time() - self._run(context, model_dict) + result = self._run(context, model_dict) end_time = time.time() seconds = end_time - begin_time - print("epoch {} done, use time: {}".format(epoch, seconds)) + message = "epoch {} done, use time: {}".format(epoch, seconds) + + # TODO: waiting for PaddleCloudRoleMaker to support gloo + from paddle.fluid.incubate.fleet.base.role_maker import GeneralRoleMaker + if context["fleet"] is not None and isinstance(context["fleet"], + GeneralRoleMaker): + metrics_result = [] + for key in metrics: + if isinstance(metrics[key], Metric): + _str = metrics[key].calc_global_metrics( + context["fleet"], + context["model"][model_dict["name"]]["scope"]) + metrics_result.append(_str) + elif result is not None: + _str = "{}={}".format(key, result[key]) + metrics_result.append(_str) + if len(metrics_result) > 0: + message += ", global metrics: " + ", ".join(metrics_result) + print(message) with fluid.scope_guard(context["model"][model_dict["name"]][ "scope"]): train_prog = context["model"][model_dict["name"]][ @@ -476,14 +551,30 @@ class SingleInferRunner(RunnerBase): for index, epoch_name in enumerate(self.epoch_model_name_list): for model_dict in context["phases"]: + model_class = context["model"][model_dict["name"]]["model"] + metrics = model_class._infer_results self._load(context, model_dict, self.epoch_model_path_list[index]) begin_time = time.time() - self._run(context, model_dict) + result = self._run(context, model_dict) end_time = time.time() seconds = end_time - begin_time - print("Infer {} of {} done, use time: {}".format(model_dict[ - "name"], epoch_name, seconds)) + message = "Infer {} of epoch {} done, use time: {}".format( + model_dict["name"], epoch_name, seconds) + metrics_result = [] + for key in metrics: + if
isinstance(metrics[key], Metric): + _str = metrics[key].calc_global_metrics( + None, + context["model"][model_dict["name"]]["scope"]) + metrics_result.append(_str) + elif result is not None: + _str = "{}={}".format(key, result[key]) + metrics_result.append(_str) + if len(metrics_result) > 0: + message += ", global metrics: " + ", ".join(metrics_result) + print(message) + context["status"] = "terminal_pass" def _load(self, context, model_dict, model_path): diff --git a/core/trainers/framework/startup.py b/core/trainers/framework/startup.py index 362592e6de64a4bbfecb6868726b4a733edf4e14..a38dbd5bb3c2cea268fc5551e10e488f2fbdabd6 100644 --- a/core/trainers/framework/startup.py +++ b/core/trainers/framework/startup.py @@ -17,9 +17,13 @@ from __future__ import print_function import warnings import paddle.fluid as fluid +import paddle.fluid.core as core from paddlerec.core.utils import envs -__all__ = ["StartupBase", "SingleStartup", "PSStartup", "CollectiveStartup"] +__all__ = [ + "StartupBase", "SingleStartup", "PSStartup", "CollectiveStartup", + "FineTuningStartup" +] class StartupBase(object): @@ -65,6 +69,122 @@ class SingleStartup(StartupBase): context["status"] = "train_pass" + +class FineTuningStartup(StartupBase): + """R + """ + + def __init__(self, context): + self.op_name_scope = "op_namescope" + self.clip_op_name_scope = "@CLIP" + self.op_role_var_attr_name = core.op_proto_and_checker_maker.kOpRoleVarAttrName( + ) + + print("Running FineTuningStartup.") + + def _is_opt_role_op(self, op): + # NOTE: depends on the op role to determine whether this op + # belongs to the optimizer + op_maker = core.op_proto_and_checker_maker + optimize_role = core.op_proto_and_checker_maker.OpRole.Optimize + if op_maker.kOpRoleAttrName() in op.attr_names and \ + int(op.all_attrs()[op_maker.kOpRoleAttrName()]) == int(optimize_role): + return True + return False + + def _get_params_grads(self, program): + """ + Get optimizer operators, parameters and gradients from origin_program + Returns:
opt_ops (list): optimize operators. + params_grads (list): [parameter, gradient] pairs. + """ + block = program.global_block() + params_grads = [] + # tmp set to dedup + optimize_params = set() + origin_var_dict = program.global_block().vars + for op in block.ops: + if self._is_opt_role_op(op): + # Todo(chengmo): discuss whether clip-related ops belong inside the Optimize guard + # delete clip op from opt_ops when run in Parameter Server mode + if self.op_name_scope in op.all_attrs( + ) and self.clip_op_name_scope in op.attr(self.op_name_scope): + op._set_attr( + "op_role", + int(core.op_proto_and_checker_maker.OpRole.Backward)) + continue + + if op.attr(self.op_role_var_attr_name): + param_name = op.attr(self.op_role_var_attr_name)[0] + grad_name = op.attr(self.op_role_var_attr_name)[1] + if param_name not in optimize_params: + optimize_params.add(param_name) + params_grads.append([ + origin_var_dict[param_name], + origin_var_dict[grad_name] + ]) + return params_grads + + @staticmethod + def is_persistable(var): + """ + Check whether the given variable is persistable. + + Args: + var(Variable): The variable to be checked. + + Returns: + bool: True if the given `var` is persistable + False if not. + + Examples: + .. code-block:: python + + import paddle.fluid as fluid + param = fluid.default_main_program().global_block().var('fc.b') + res = fluid.io.is_persistable(param) + """ + if var.desc.type() == core.VarDesc.VarType.FEED_MINIBATCH or \ + var.desc.type() == core.VarDesc.VarType.FETCH_LIST or \ + var.desc.type() == core.VarDesc.VarType.READER: + return False + return var.persistable + + def load(self, context, is_fleet=False, main_program=None): + dirname = envs.get_global_env( + "runner."
+ context["runner_name"] + ".init_model_path", None) + if dirname is None or dirname == "": + return + print("going to load ", dirname) + + params_grads = self._get_params_grads(main_program) + update_params = [p for p, _ in params_grads] + need_load_vars = [] + parameters = list( + filter(FineTuningStartup.is_persistable, main_program.list_vars())) + + for param in parameters: + if param not in update_params: + need_load_vars.append(param) + + fluid.io.load_vars(context["exe"], dirname, main_program, + need_load_vars) + print("load from {} success".format(dirname)) + + def startup(self, context): + for model_dict in context["phases"]: + with fluid.scope_guard(context["model"][model_dict["name"]][ + "scope"]): + train_prog = context["model"][model_dict["name"]][ + "main_program"] + startup_prog = context["model"][model_dict["name"]][ + "startup_program"] + with fluid.program_guard(train_prog, startup_prog): + context["exe"].run(startup_prog) + self.load(context, main_program=train_prog) + context["status"] = "train_pass" + + class PSStartup(StartupBase): def __init__(self, context): print("Running PSStartup.") diff --git a/core/utils/dataloader_instance.py b/core/utils/dataloader_instance.py index 2461473aa79a51133db8aa319f4ee7d45981d815..d878f08415c7b0405bc593f06ab4541801aa5501 100755 --- a/core/utils/dataloader_instance.py +++ b/core/utils/dataloader_instance.py @@ -39,9 +39,21 @@ def dataloader_by_name(readerclass, data_path = os.path.join(package_base, data_path.split("::")[1]) files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)] + files.sort() + + need_split_files = False if context["engine"] == EngineMode.LOCAL_CLUSTER: + # for local cluster: split files for multi process + need_split_files = True + elif context["engine"] == EngineMode.CLUSTER and context[ + "cluster_type"] == "K8S": + # for k8s mount mode, split files for every node + need_split_files = True + print("need_split_files: {}".format(need_split_files)) + if need_split_files: files 
= split_files(files, context["fleet"].worker_index(), context["fleet"].worker_num()) + print("file_list : {}".format(files)) reader = reader_class(yaml_file) @@ -81,10 +93,20 @@ def slotdataloader_by_name(readerclass, dataset_name, yaml_file, context): data_path = os.path.join(package_base, data_path.split("::")[1]) files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)] + files.sort() + + need_split_files = False if context["engine"] == EngineMode.LOCAL_CLUSTER: + # for local cluster: split files for multi process + need_split_files = True + elif context["engine"] == EngineMode.CLUSTER and context[ + "cluster_type"] == "K8S": + # for k8s mount mode, split files for every node + need_split_files = True + + if need_split_files: files = split_files(files, context["fleet"].worker_index(), context["fleet"].worker_num()) - print("file_list: {}".format(files)) sparse = get_global_env(name + "sparse_slots", "#") if sparse == "": @@ -135,10 +157,20 @@ def slotdataloader(readerclass, train, yaml_file, context): data_path = os.path.join(package_base, data_path.split("::")[1]) files = [str(data_path) + "/%s" % x for x in os.listdir(data_path)] + files.sort() + + need_split_files = False if context["engine"] == EngineMode.LOCAL_CLUSTER: + # for local cluster: split files for multi process + need_split_files = True + elif context["engine"] == EngineMode.CLUSTER and context[ + "cluster_type"] == "K8S": + # for k8s mount mode, split files for every node + need_split_files = True + + if need_split_files: files = split_files(files, context["fleet"].worker_index(), context["fleet"].worker_num()) - print("file_list: {}".format(files)) sparse = get_global_env("sparse_slots", "#", namespace) if sparse == "": diff --git a/doc/custom_reader.md b/doc/custom_reader.md deleted file mode 100644 index c9079b5397057f35191bd376d22e978806e6c646..0000000000000000000000000000000000000000 --- a/doc/custom_reader.md +++ /dev/null @@ -1,362 +0,0 @@ -# PaddleRec 自定义数据集及Reader - 
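Editor's note: the guide deleted below documented PaddleRec's custom reader, whose core is a `generate_sample(line)` method that parses one text line and yields `(name, value)` pairs matching the network's declared inputs. For reference, here is a minimal, framework-free sketch of that generator pattern in English; the record layout (`A \t B \t C` with comma-separated values) is taken from the guide's example, and the real implementation subclasses `paddlerec.core.reader.ReaderBase` rather than standing alone:

```python
def generate_sample(line):
    """Parse one "A \t B \t C" record into (name, value) pairs.

    Standalone sketch of the reader pattern from the removed guide;
    the real reader subclasses paddlerec.core.reader.ReaderBase.
    """

    def reader():
        # Fields are tab-separated; values inside a field are comma-separated.
        var_a, var_b, var_c = line.strip("\n").split("\t")
        var_a = [float(x) for x in var_a.split(",")]  # dense float features
        var_b = [int(x) for x in var_b.split(",")]    # sparse integer ids
        var_c = [int(var_c)]                          # label, wrapped in a list
        # Output order must match the input placeholders declared in the network.
        yield list(zip(["A", "B", "C"], [var_a, var_b, var_c]))

    return reader


sample = next(generate_sample("0.5,1.5\t101,102\t1\n")())
# sample == [("A", [0.5, 1.5]), ("B", [101, 102]), ("C", [1])]
```

The key invariant, as the guide stresses, is that the yielded names and their order correspond exactly to the `fluid.data` inputs appended to `self._data_var`.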
-用户自定义数据集及配置异步Reader,需要关注以下几个步骤: - -* [数据集整理](#数据集整理) -* [在模型组网中加入输入占位符](#在模型组网中加入输入占位符) -* [Reader实现](#Reader的实现) -* [在yaml文件中配置Reader](#在yaml文件中配置reader) - -我们以CTR-DNN模型为例,给出了从数据整理,变量定义,Reader写法,调试的完整历程。 - -* [数据及Reader示例-DNN](#数据及Reader示例-DNN) - - -## 数据集整理 - -PaddleRec支持模型自定义数据集。 - -关于数据的tips: -1. 数据量: - - PaddleRec面向大规模数据设计,可以轻松支持亿级的数据读取,工业级的数据读写api:`dataset`在搜索、推荐、信息流等业务得到了充分打磨。 -2. 文件类型: - - 支持任意直接可读的文本数据,`dataset`同时支持`.gz`格式的文本压缩数据,无需额外代码,可直接读取。数据样本应以`\n`为标志,按行组织。 - -3. 文件存放位置: - - 文件通常存放在训练节点本地,但同时,`dataset`支持使用`hadoop`远程读取数据,数据无需下载到本地,为dataset配置hadoop相关账户及地址即可。 -4. 数据类型 - - Reader处理的是以行为单位的`string`数据,喂入网络的数据需要转为`int`,`float`的数值数据,不支持`string`喂入网络,不建议明文保存及处理训练数据。 -5. Tips - - Dataset模式下,训练线程与数据读取线程的关系强相关,为了多线程充分利用,`强烈建议将文件合理的拆为多个小文件`,尤其是在分布式训练场景下,可以均衡各个节点的数据量,同时加快数据的下载速度。 - -## 在模型组网中加入输入占位符 - -Reader读取文件后,产出的数据喂入网络,需要有占位符进行接收。占位符在Paddle中使用`fluid.data`或`fluid.layers.data`进行定义。`data`的定义可以参考[fluid.data](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/fluid_cn/data_cn.html#data)以及[fluid.layers.data](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/layers_cn/data_cn.html#data)。 - -假如您希望输入三个数据,分别是维度32的数据A,维度变长的稀疏数据B,以及一个一维的标签数据C,并希望梯度可以经过该变量向前传递,则示例如下: - -数据A的定义: -```python -var_a = fluid.data(name='A', shape= [-1, 32], dtype='float32') -``` - -数据B的定义,变长数据的使用可以参考[LoDTensor](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/basic_concept/lod_tensor.html#cn-user-guide-lod-tensor): -```python -var_b = fluid.data(name='B', shape=[-1, 1], lod_level=1, dtype='int64') -``` - -数据C的定义: -```python -var_c = fluid.data(name='C', shape=[-1, 1], dtype='int32') -var_c.stop_gradient = False -``` - -当我们完成以上三个数据的定义后,在PaddleRec的模型定义中,还需将其加入model基类成员变量`self._data_var` - -```python -self._data_var.append(var_a) -self._data_var.append(var_b) -self._data_var.append(var_c) -``` -至此,我们完成了在组网中定义输入数据的工作。 - -## Reader的实现 - -### Reader的实现范式 - -Reader的逻辑需要一个单独的python文件进行描述。我们试写一个`test_reader.py`,实现的具体流程如下: -1. 
首先我们需要引入Reader基类 - - ```python - from paddlerec.core.reader import ReaderBase - ``` -2. 创建一个子类,继承Reader的基类,训练所需Reader命名为`TrainerReader` - ```python - class TrainerReader(ReaderBase): - def init(self): - pass - - def generator_sample(self, line): - pass - ``` - -3. 在`init(self)`函数中声明一些在数据读取中会用到的变量,必要时可以在`config.yaml`文件中配置变量,利用`env.get_global_env()`拿到。 - - 比如,我们希望从yaml文件中读取一个数据预处理变量`avg=10`,目的是将数据A的数据缩小10倍,可以这样实现: - - 首先更改yaml文件,在某个space下加入该变量 - - ```yaml - ... - train: - reader: - avg: 10 - ... - ``` - - - 再更改Reader的init函数 - - ```python - from paddlerec.core.utils import envs - class TrainerReader(Reader): - def init(self): - self.avg = envs.get_global_env("avg", None, "train.reader") - - def generator_sample(self, line): - pass - ``` - -4. 继承并实现基类中的`generate_sample(self, line)`函数,逐行读取数据。 - - 该函数应返回一个可以迭代的reader方法(带有yield的函数不再是一个普通的函数,而是一个生成器generator,成为了可以迭代的对象,等价于一个数组、链表、文件、字符串etc.) - - 在这个可以迭代的函数中,如示例代码中的`def reader()`,我们定义数据读取的逻辑。以行为单位的数据进行截取,转换及预处理。 - - 最后,我们需要将数据整理为特定的格式,才能够被PaddleRec的Reader正确读取,并灌入的训练的网络中。简单来说,数据的输出顺序与我们在网络中创建的`inputs`必须是严格一一对应的,并转换为类似字典的形式。 - - 示例: 假设数据ABC在文本数据中,每行以这样的形式存储: - ```shell - 0.1,0.2,0.3...3.0,3.1,3.2 \t 99999,99998,99997 \t 1 \n - ``` - - 则示例代码如下: - ```python - from paddlerec.core.utils import envs - class TrainerReader(Reader): - def init(self): - self.avg = envs.get_global_env("avg", None, "train.reader") - - def generator_sample(self, line): - - def reader(self, line): - # 先分割 '\n', 再以 '\t'为标志分割为list - variables = (line.strip('\n')).split('\t') - - # A是第一个元素,并且每个数据之间使用','分割 - var_a = variables[0].split(',') # list - var_a = [float(i) / self.avg for i in var_a] # 将str数据转换为float - - - # B是第二个元素,同样以 ',' 分割 - var_b = variables[1].split(',') # list - var_b = [int(i) for i in var_b] # 将str数据转换为int - - # C是第三个元素, 只有一个元素,没有分割符 - var_c = variables[2] - var_c = int(var_c) # 将str数据转换为int - var_c = [var_c] # 将单独的数据元素置入list中 - - # 将数据与数据名结合,组织为dict的形式 - # 如下,output形式为{ A: var_a, B: var_b, C: var_c} - variable_name = ['A', 'B', 'C'] - 
output = zip(variable_name, [var_a] + [var_b] + [var_c]) - - # 将数据输出,使用yield方法,将该函数变为了一个可迭代的对象 - yield output - - ``` - - 至此,我们完成了Reader的实现。 - - -### 在yaml文件中配置Reader - -在模型的yaml配置文件中,主要的修改是三个,如下 - -```yaml -reader: - batch_size: 2 - class: "{workspace}/reader.py" - train_data_path: "{workspace}/data/train_data" - reader_debug_mode: False -``` - -batch_size: 顾名思义,是小批量训练时的样本大小 -class: 运行改模型所需reader的路径 -train_data_path: 训练数据所在文件夹 -reader_debug_mode: 测试reader语法,及输出是否符合预期的debug模式的开关 - - -## 数据及Reader示例-DNN - -Reader代码来源于[criteo_reader.py](../models/rank/criteo_reader.py), 组网代码来源于[model.py](../models/rank/dnn/model.py) - -### Criteo数据集格式 - -CTR-DNN训练及测试数据集选用[Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/)所用的Criteo数据集。该数据集包括两部分:训练集和测试集。训练集包含一段时间内Criteo的部分流量,测试集则对应训练数据后一天的广告点击流量。 -每一行数据格式如下所示: -```bash -