"Let us begin the tutorial with a classical problem called Linear Regression \\[[1](#References)\\]. In this chapter, we will train a model from a realistic dataset to predict home prices. Some important concepts in Machine Learning will be covered through this example.\n",
"\n",
"The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/fit_a_line). For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](http://www.paddlepaddle.org/doc_cn/build_and_install/index.html).\n",
"The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/fit_a_line). For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst).\n",
"\n",
"## Problem Setup\n",
"Suppose we have a dataset of $n$ real estate properties. These real estate properties will be referred to as *homes* in this chapter for clarity.\n",
...
...
@@ -384,7 +384,7 @@
"4. Bishop C M. Pattern recognition[J]. Machine Learning, 2006, 128.\n",
"\n",
"\u003cbr/\u003e\n",
"\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003e\u003cimg alt=\"Common Creative License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png\" /\u003e\u003c/a\u003e This tutorial was created and published with [Creative Common License 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/).\n"
"This tutorial is contributed by \u003ca xmlns:cc=\"http://creativecommons.org/ns#\" href=\"http://book.paddlepaddle.org\" property=\"cc:attributionName\" rel=\"cc:attributionURL\"\u003ePaddlePaddle\u003c/a\u003e, and licensed under a \u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003eCreative Commons Attribution-NonCommercial-ShareAlike 4.0 International License\u003c/a\u003e.\n"
"我们使用从[UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing)获得的波士顿房价数据集进行模型的训练和预测。下面的散点图展示了使用模型对部分房屋价格进行的预测。其中,每个点的横坐标表示同一类房屋真实价格的中位数,纵坐标表示线性回归模型根据特征预测的结果,当二者值完全相等的时候就会落在虚线上。所以模型预测得越准确,则点离虚线越近。\n",
"The source code of this tutorial is in [book/recommender_system](https://github.com/PaddlePaddle/book/tree/develop/recommender_system).\n",
"\n",
"For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst).\n",
"\n",
"\n",
"## Background\n",
"\n",
"With the fast growth of e-commerce, online videos, and online reading business, users have to rely on recommender systems to avoid manually browsing tremendous volume of choices. Recommender systems understand users' interest by mining user behavior and other properties of users and products.\n",
...
...
@@ -82,22 +85,617 @@
"\n",
"## Dataset\n",
"\n",
"We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) to train our model. This dataset includes 10,000 ratings of 4,000 movies from 6,000 users to 4,000 movies. Each rate is in the range of 1~5. Thanks to GroupLens Research for collecting, processing and publishing the dataset. \n",
"We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) to train our model. This dataset includes 10,000 ratings of 4,000 movies from 6,000 users to 4,000 movies. Each rate is in the range of 1~5. Thanks to GroupLens Research for collecting, processing and publishing the dataset.\n",
"\n",
"`paddle.v2.datasets` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `moivelens` and `wmt14`, etc. There's no need for us to manually download and preprocess `MovieLens` dataset.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"editable": true
},
"source": [
"# Run this block to show dataset's documentation\n",
"help(paddle.v2.dataset.movielens)\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"execution_count": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The raw `MoiveLens` contains movie ratings, relevant features from both movies and users.\n",
"print \"User %s rates Movie %s with Score %s\"%(user_info[uid], movie_info[mov_id], train_sample[-1])\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"execution_count": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"```text\n",
"User \u003cUserInfo id(1), gender(F), age(1), job(10)\u003e rates Movie \u003cMovieInfo id(1193), title(One Flew Over the Cuckoo's Nest), categories(['Drama'])\u003e with Score [5.0]\n",
"```\n",
"\n",
"The output shows that user 1 gave movie `1193` a rating of 5.\n",
"\n",
"After issuing a command `python train.py`, training will start immediately. The details will be unpacked by the following sessions to see how it works.\n",
"\n",
"## Model Architecture\n",
"\n",
"### Initialize PaddlePaddle\n",
"\n",
"First, we must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).\n",
"As shown in the above code, the input is four dimension integers for each user, that is, `user_id`,`gender_id`, `age_id` and `job_id`. In order to deal with these features conveniently, we use the language model in NLP to transform these discrete values into embedding vaules `usr_emb`, `usr_gender_emb`, `usr_age_emb` and `usr_job_emb`.\n",
"Movie title, a sequence of words represented by an integer word index sequence, will be feed into a `sequence_conv_pool` layer, which will apply convolution and pooling on time dimension. Because pooling is done on time dimension, the output will be a fixed-length vector regardless the length of the input sequence.\n",
"\n",
"## Model Specification\n",
"Finally, we can use cosine similarity to calculate the similarity between user characteristics and movie features.\n",
"First, we define the model parameters according to the previous model configuration `cost`.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"editable": true
},
"source": [
"# Create parameters\n",
"parameters = paddle.parameters.create(cost)\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"execution_count": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Create Trainer\n",
"\n",
"Before jumping into creating a training module, algorithm setting is also necessary. Here we specified Adam optimization algorithm via `paddle.optimizer`.\n",
"`feeding` is devoted to specifying the correspondence between each yield record and `paddle.layer.data`. For instance, the first column of data generated by `movielens.train` corresponds to `user_id` feature.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"editable": true
},
"source": [
"feeding = {\n",
" 'user_id': 0,\n",
" 'gender_id': 1,\n",
" 'age_id': 2,\n",
" 'job_id': 3,\n",
" 'movie_id': 4,\n",
" 'category_id': 5,\n",
" 'movie_title': 6,\n",
" 'score': 7\n",
"}\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"execution_count": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Callback function `event_handler` will be called during training when a pre-defined event happens.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"editable": true
},
"source": [
"step=0\n",
"\n",
"train_costs=[],[]\n",
"test_costs=[],[]\n",
"\n",
"def event_handler(event):\n",
" global step\n",
" global train_costs\n",
" global test_costs\n",
" if isinstance(event, paddle.event.EndIteration):\n",
" need_plot = False\n",
" if step % 10 == 0: # every 10 batches, record a train cost\n",
" train_costs[0].append(step)\n",
" train_costs[1].append(event.cost)\n",
"\n",
" if step % 1000 == 0: # every 1000 batches, record a test cost\n",
"Finally, we can invoke `trainer.train` to start training:\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"editable": true
},
"source": [
"trainer.train(\n",
" reader=reader,\n",
" event_handler=event_handler,\n",
" feeding=feeding,\n",
" num_passes=200)\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"execution_count": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Conclusion\n",
"\n",
...
...
@@ -105,16 +703,16 @@
"\n",
"## Reference\n",
"\n",
"1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.\n",
"2. Robin Burke ,[Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.\n",
"1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.\n",
"2. Robin Burke ,[Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.\n",
"3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.\n",
"4. Sarwar, Badrul, et al. \"[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)\"*Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001.\n",
"4. Sarwar, Badrul, et al. \"[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)\"*Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001.\n",
"5. Kautz, Henry, Bart Selman, and Mehul Shah. \"[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)\" Communications of the ACM 40.3 (1997): 63-65. APA\n",
"6. Yuan, Jianbo, et al. [\"Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach.\"](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).\n",
"6. Yuan, Jianbo, et al. [\"Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach.\"](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).\n",
"7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.\n",
"\n",
"\u003cbr/\u003e\n",
"\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003e\u003cimg alt=\"Creative Commons\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png\" /\u003e\u003c/a\u003e\u003cbr /\u003e\u003cspan xmlns:dct=\"http://purl.org/dc/terms/\" href=\"http://purl.org/dc/dcmitype/Text\" property=\"dct:title\" rel=\"dct:type\"\u003eThis tutorial\u003c/span\u003e was created by \u003ca xmlns:cc=\"http://creativecommons.org/ns#\" href=\"http://book.paddlepaddle.org\" property=\"cc:attributionName\" rel=\"cc:attributionURL\"\u003ethe PaddlePaddle community\u003c/a\u003e and published under \u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003eCommon Creative 4.0 License\u003c/a\u003e。\n"
"This tutorial is contributed by \u003ca xmlns:cc=\"http://creativecommons.org/ns#\" href=\"http://book.paddlepaddle.org\" property=\"cc:attributionName\" rel=\"cc:attributionURL\"\u003ePaddlePaddle\u003c/a\u003e, and licensed under a \u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003eCreative Commons Attribution-NonCommercial-ShareAlike 4.0 International License\u003c/a\u003e.\n"