diff --git a/recommender_system/index.en.html b/recommender_system/index.en.html
index af9ecc0bc94a46e780d961787afefa542e233e1b..ea02f50e6b2e490b78eaa9fe4e4e76dcaef81c17 100644
--- a/recommender_system/index.en.html
+++ b/recommender_system/index.en.html
@@ -118,22 +118,287 @@ Figure 3. A hybrid recommendation model.
 ## Dataset
-We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) to train our model. This dataset includes 10,000 ratings of 4,000 movies from 6,000 users to 4,000 movies. Each rate is in the range of 1~5. Thanks to GroupLens Research for collecting, processing and publishing the dataset.
+We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) dataset to train our model. This dataset includes about 1,000,000 ratings of 4,000 movies from 6,000 users. Each rating is in the range of 1~5. Thanks to GroupLens Research for collecting, processing and publishing the dataset.
+
+The `paddle.v2.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens` and `wmt14`. There's no need for us to manually download and preprocess the `MovieLens` dataset.
+
+```python
+import paddle.v2 as paddle
+
+# Run this block to show the dataset's documentation
+help(paddle.dataset.movielens)
+```
+
+The raw `MovieLens` data contains movie ratings, along with features of both movies and users.
+For instance, one movie's features could be:
+
+```python
+movie_info = paddle.dataset.movielens.movie_info()
+print movie_info.values()[0]
+```
+
+```text
+
+```
+
+One user's features could be:
+
+```python
+user_info = paddle.dataset.movielens.user_info()
+print user_info.values()[0]
+```
+
+```text
+
+```
+
+In this dataset, age is encoded as one of the following groups:
+
+```text
+1: "Under 18"
+18: "18-24"
+25: "25-34"
+35: "35-44"
+45: "45-49"
+50: "50-55"
+56: "56+"
+```
+
+The user's occupation is selected from the following options:
+
+```text
+0: "other" or not specified
+1: "academic/educator"
+2: "artist"
+3: "clerical/admin"
+4: "college/grad student"
+5: "customer service"
+6: "doctor/health care"
+7: "executive/managerial"
+8: "farmer"
+9: "homemaker"
+10: "K-12 student"
+11: "lawyer"
+12: "programmer"
+13: "retired"
+14: "sales/marketing"
+15: "scientist"
+16: "self-employed"
+17: "technician/engineer"
+18: "tradesman/craftsman"
+19: "unemployed"
+20: "writer"
+```
+
+Each record consists of three main components: user features, movie features and the movie rating.
+As a simple example, consider the following:
+
+```python
+train_set_creator = paddle.dataset.movielens.train()
+train_sample = next(train_set_creator())
+uid = train_sample[0]
+mov_id = train_sample[len(user_info[uid].value())]
+print "User %s rates Movie %s with Score %s"%(user_info[uid], movie_info[mov_id], train_sample[-1])
+```
+
+```text
+User rates Movie with Score [5.0]
+```
+
+The output shows that user 1 gave movie `1193` a rating of 5.
+
+After issuing the command `python train.py`, training starts immediately. The following sections explain in detail how it works.
+
+## Model Architecture
+
+### Initialize PaddlePaddle
+
+First, we must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).
+
+```python
+%matplotlib inline
+
+import matplotlib.pyplot as plt
+from IPython import display
+import cPickle
+
+import paddle.v2 as paddle
+
+paddle.init(use_gpu=False)
+```
+
+### Model Configuration
+
+```python
+uid = paddle.layer.data(
+    name='user_id',
+    type=paddle.data_type.integer_value(
+        paddle.dataset.movielens.max_user_id() + 1))
+usr_emb = paddle.layer.embedding(input=uid, size=32)
+
+usr_gender_id = paddle.layer.data(
+    name='gender_id', type=paddle.data_type.integer_value(2))
+usr_gender_emb = paddle.layer.embedding(input=usr_gender_id, size=16)
+
+usr_age_id = paddle.layer.data(
+    name='age_id',
+    type=paddle.data_type.integer_value(
+        len(paddle.dataset.movielens.age_table)))
+usr_age_emb = paddle.layer.embedding(input=usr_age_id, size=16)
+
+usr_job_id = paddle.layer.data(
+    name='job_id',
+    type=paddle.data_type.integer_value(
+        paddle.dataset.movielens.max_job_id() + 1))
+usr_job_emb = paddle.layer.embedding(input=usr_job_id, size=16)
+```
+
+As shown in the above code, the input for each user consists of four integer features: `user_id`, `gender_id`, `age_id` and `job_id`. To handle these discrete features conveniently, we borrow the embedding technique used by language models in NLP and transform them into the embedding vectors `usr_emb`, `usr_gender_emb`, `usr_age_emb` and `usr_job_emb`.
+
+```python
+usr_combined_features = paddle.layer.fc(
+    input=[usr_emb, usr_gender_emb, usr_age_emb, usr_job_emb],
+    size=200,
+    act=paddle.activation.Tanh())
+```
+
+Then, the user feature embeddings are fed into a fully-connected layer, which combines them into a single 200-dimensional vector.
+
+Furthermore, we do a similar transformation for each movie feature.
The model configuration is:
+
+```python
+mov_id = paddle.layer.data(
+    name='movie_id',
+    type=paddle.data_type.integer_value(
+        paddle.dataset.movielens.max_movie_id() + 1))
+mov_emb = paddle.layer.embedding(input=mov_id, size=32)
+
+mov_categories = paddle.layer.data(
+    name='category_id',
+    type=paddle.data_type.sparse_binary_vector(
+        len(paddle.dataset.movielens.movie_categories())))
+
+mov_categories_hidden = paddle.layer.fc(input=mov_categories, size=32)
+
-We don't have to download and preprocess the data. Instead, we can use PaddlePaddle's dataset module `paddle.v2.dataset.movielens`.
+movie_title_dict = paddle.dataset.movielens.get_movie_title_dict()
+mov_title_id = paddle.layer.data(
+    name='movie_title',
+    type=paddle.data_type.integer_value_sequence(len(movie_title_dict)))
+mov_title_emb = paddle.layer.embedding(input=mov_title_id, size=32)
+mov_title_conv = paddle.networks.sequence_conv_pool(
+    input=mov_title_emb, hidden_size=32, context_len=3)
+mov_combined_features = paddle.layer.fc(
+    input=[mov_emb, mov_categories_hidden, mov_title_conv],
+    size=200,
+    act=paddle.activation.Tanh())
+```
-## Model Specification
+The movie title, a sequence of words represented by a sequence of integer word indices, is fed into a `sequence_conv_pool` layer, which applies convolution and pooling along the time dimension. Because pooling is done along the time dimension, the output is a fixed-length vector regardless of the length of the input sequence.
+Finally, we use cosine similarity to measure the similarity between the user features and the movie features.
+```python
+inference = paddle.layer.cos_sim(a=usr_combined_features, b=mov_combined_features, size=1, scale=5)
+cost = paddle.layer.regression_cost(
+    input=inference,
+    label=paddle.layer.data(
+        name='score', type=paddle.data_type.dense_vector(1)))
+```
-## Training
+## Model Training
+### Define Parameters
+First, we define the model parameters according to the previous model configuration `cost`.
-## Inference
+```python
+# Create parameters
+parameters = paddle.parameters.create(cost)
+```
+### Create Trainer
+
+Before creating the trainer, we need to choose an optimization algorithm. Here we specify the Adam optimizer via `paddle.optimizer`.
+```python
+trainer = paddle.trainer.SGD(cost=cost, parameters=parameters,
+                             update_equation=paddle.optimizer.Adam(learning_rate=1e-4))
+```
+
+```text
+[INFO 2017-03-06 17:12:13,378 networks.py:1472] The input order is [user_id, gender_id, age_id, job_id, movie_id, category_id, movie_title, score]
+[INFO 2017-03-06 17:12:13,379 networks.py:1478] The output order is [__regression_cost_0__]
+```
+
+### Training
+
+`paddle.dataset.movielens.train` yields records during each pass. After shuffling, the records are grouped into batches for training.
+
+```python
+reader=paddle.reader.batch(
+    paddle.reader.shuffle(
+        paddle.dataset.movielens.train(), buf_size=8192),
+    batch_size=256)
+```
+
+`feeding` specifies the correspondence between the columns of each yielded record and the `paddle.layer.data` layers. For instance, the first column of data generated by `movielens.train` corresponds to the `user_id` feature.
+
+```python
+feeding = {
+    'user_id': 0,
+    'gender_id': 1,
+    'age_id': 2,
+    'job_id': 3,
+    'movie_id': 4,
+    'category_id': 5,
+    'movie_title': 6,
+    'score': 7
+}
+```
+
+The callback function `event_handler` is called during training whenever a pre-defined event happens.
+
+```python
+step=0
+
+train_costs=[],[]
+test_costs=[],[]
+
+def event_handler(event):
+    global step
+    global train_costs
+    global test_costs
+    if isinstance(event, paddle.event.EndIteration):
+        need_plot = False
+        if step % 10 == 0:  # every 10 batches, record a train cost
+            train_costs[0].append(step)
+            train_costs[1].append(event.cost)
+
+        if step % 1000 == 0:  # every 1000 batches, record a test cost
+            result = trainer.test(reader=paddle.batch(
+                paddle.dataset.movielens.test(), batch_size=256))
+            test_costs[0].append(step)
+            test_costs[1].append(result.cost)
+
+        if step % 100 == 0:  # every 100 batches, update cost plot
+            plt.plot(*train_costs)
+            plt.plot(*test_costs)
+            plt.legend(['Train Cost', 'Test Cost'], loc='upper left')
+            display.clear_output(wait=True)
+            display.display(plt.gcf())
+            plt.gcf().clear()
+        step += 1
+```
+
+Finally, we can invoke `trainer.train` to start training:
+
+```python
+trainer.train(
+    reader=reader,
+    event_handler=event_handler,
+    feeding=feeding,
+    num_passes=200)
+```
 ## Conclusion
@@ -141,12 +406,12 @@ This tutorial goes over traditional approaches in recommender system and a deep
 ## Reference
-1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
-2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
+1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
+2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp.
377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
 3. P. Resnick, N. Iacovou, etc. “[GroupLens: An Open Architecture for Collaborative Filtering of Netnews](http://ccs.mit.edu/papers/CCSWP165.html)”, Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW 1994. pp.175-186.
-4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001.
+4. Sarwar, Badrul, et al. "[Item-based collaborative filtering recommendation algorithms.](http://files.grouplens.org/papers/www10_sarwar.pdf)" *Proceedings of the 10th International Conference on World Wide Web*. ACM, 2001.
 5. Kautz, Henry, Bart Selman, and Mehul Shah. "[Referral Web: Combining Social networks and collaborative filtering.](http://www.cs.cornell.edu/selman/papers/pdf/97.cacm.refweb.pdf)" Communications of the ACM 40.3 (1997): 63-65. APA
-6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
+6. Yuan, Jianbo, et al. ["Solving Cold-Start Problem in Large-scale Recommendation Engines: A Deep Learning Approach."](https://arxiv.org/pdf/1611.05480v1.pdf) *arXiv preprint arXiv:1611.05480* (2016).
 7. Covington P, Adams J, Sargin E. [Deep neural networks for youtube recommendations](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)[C]//Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016: 191-198.