Unverified commit e82c7d45, authored by Nicky Chan, committed by GitHub

Merge pull request #526 from nickyfantasy/recommend_system_book_rewrite

[High-Level-API] Rewrite Chapter 5 Personalized Recommendation in Book to use new Flui…
@@ -98,13 +98,13 @@ Figure 4. A hybrid recommendation model.
We use the [MovieLens ml-1m](http://files.grouplens.org/datasets/movielens/ml-1m.zip) dataset to train our model. It contains about one million ratings given by 6,000 users to 4,000 movies, and each rating is an integer between 1 and 5. Thanks to GroupLens Research for collecting, processing and publishing the dataset.
The `paddle.dataset` package encapsulates multiple public datasets, including `cifar`, `imdb`, `mnist`, `movielens`, `wmt14`, and so on, so there is no need to manually download and preprocess the `MovieLens` dataset.

The raw `MovieLens` data contains movie ratings along with relevant features of both movies and users.

For instance, one movie's features could be:
```python
import paddle
movie_info = paddle.dataset.movielens.movie_info()
print movie_info.values()[0]
```
@@ -181,197 +181,307 @@ The output shows that user 1 gave movie `1193` a rating of 5.
After issuing the command `python train.py`, training will start immediately. The following sections unpack the details of how it works.
## Model Configuration

Our program starts with importing necessary packages and initializing some global variables:
```python
import math
import sys
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.nets as nets

IS_SPARSE = True
USE_GPU = False
BATCH_SIZE = 256
```
Then we define the model configuration for user combined features:
```python
def get_usr_combined_features():

    USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1

    uid = layers.data(name='user_id', shape=[1], dtype='int64')

    usr_emb = layers.embedding(
        input=uid,
        dtype='float32',
        size=[USR_DICT_SIZE, 32],
        param_attr='user_table',
        is_sparse=IS_SPARSE)

    usr_fc = layers.fc(input=usr_emb, size=32)

    USR_GENDER_DICT_SIZE = 2

    usr_gender_id = layers.data(name='gender_id', shape=[1], dtype='int64')

    usr_gender_emb = layers.embedding(
        input=usr_gender_id,
        size=[USR_GENDER_DICT_SIZE, 16],
        param_attr='gender_table',
        is_sparse=IS_SPARSE)

    usr_gender_fc = layers.fc(input=usr_gender_emb, size=16)

    USR_AGE_DICT_SIZE = len(paddle.dataset.movielens.age_table)
    usr_age_id = layers.data(name='age_id', shape=[1], dtype="int64")

    usr_age_emb = layers.embedding(
        input=usr_age_id,
        size=[USR_AGE_DICT_SIZE, 16],
        is_sparse=IS_SPARSE,
        param_attr='age_table')

    usr_age_fc = layers.fc(input=usr_age_emb, size=16)

    USR_JOB_DICT_SIZE = paddle.dataset.movielens.max_job_id() + 1
    usr_job_id = layers.data(name='job_id', shape=[1], dtype="int64")

    usr_job_emb = layers.embedding(
        input=usr_job_id,
        size=[USR_JOB_DICT_SIZE, 16],
        param_attr='job_table',
        is_sparse=IS_SPARSE)

    usr_job_fc = layers.fc(input=usr_job_emb, size=16)

    concat_embed = layers.concat(
        input=[usr_fc, usr_gender_fc, usr_age_fc, usr_job_fc], axis=1)

    usr_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")

    return usr_combined_features
```
As shown in the code above, the input for each user consists of four integer features: `user_id`, `gender_id`, `age_id` and `job_id`. To handle these discrete features conveniently, we follow the approach of NLP language models and transform them into embedding values `usr_emb`, `usr_gender_emb`, `usr_age_emb` and `usr_job_emb`.
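To make the embedding step concrete, here is a tiny standalone NumPy sketch of what an embedding lookup does; the table size and dimension below are illustrative values only, while the real table is shaped `[max_user_id() + 1, 32]` and its `user_table` weights are learned during training.

```python
import numpy as np

# Hypothetical sizes for illustration; the learned table lives inside Paddle.
USR_DICT_SIZE, EMB_DIM = 6041, 32
user_table = np.random.randn(USR_DICT_SIZE, EMB_DIM).astype('float32')

user_id = 1193
usr_emb = user_table[user_id]   # an embedding lookup is just row selection by id
print(usr_emb.shape)            # (32,)
```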
Then we concatenate the user features and connect them to a fully-connected layer, which reduces the dimension to 200.
Furthermore, we do a similar transformation for each movie feature. The model configuration is:
```python
def get_mov_combined_features():

    MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1

    mov_id = layers.data(name='movie_id', shape=[1], dtype='int64')

    mov_emb = layers.embedding(
        input=mov_id,
        dtype='float32',
        size=[MOV_DICT_SIZE, 32],
        param_attr='movie_table',
        is_sparse=IS_SPARSE)

    mov_fc = layers.fc(input=mov_emb, size=32)

    CATEGORY_DICT_SIZE = len(paddle.dataset.movielens.movie_categories())

    category_id = layers.data(
        name='category_id', shape=[1], dtype='int64', lod_level=1)

    mov_categories_emb = layers.embedding(
        input=category_id, size=[CATEGORY_DICT_SIZE, 32], is_sparse=IS_SPARSE)

    mov_categories_hidden = layers.sequence_pool(
        input=mov_categories_emb, pool_type="sum")

    MOV_TITLE_DICT_SIZE = len(paddle.dataset.movielens.get_movie_title_dict())

    mov_title_id = layers.data(
        name='movie_title', shape=[1], dtype='int64', lod_level=1)

    mov_title_emb = layers.embedding(
        input=mov_title_id, size=[MOV_TITLE_DICT_SIZE, 32], is_sparse=IS_SPARSE)

    mov_title_conv = nets.sequence_conv_pool(
        input=mov_title_emb,
        num_filters=32,
        filter_size=3,
        act="tanh",
        pool_type="sum")

    concat_embed = layers.concat(
        input=[mov_fc, mov_categories_hidden, mov_title_conv], axis=1)

    mov_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")

    return mov_combined_features
```
The movie title, a sequence of words represented by an integer word-index sequence, is fed into a `sequence_conv_pool` layer, which applies convolution and pooling over the time dimension. Because pooling is done over the time dimension, the output is a fixed-length vector regardless of the length of the input sequence.
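To see why this produces a fixed-length output, here is a small NumPy-only sketch of a windowed convolution followed by sum pooling over time; the window size, embedding dimension, and random weights are illustrative and not taken from Paddle's actual `sequence_conv_pool` implementation.

```python
import numpy as np

def conv_sum_pool(seq_emb, filters, window=3):
    """seq_emb: (seq_len, emb_dim); filters: (window * emb_dim, num_filters)."""
    seq_len, emb_dim = seq_emb.shape
    # left-pad so every position sees a full context window
    padded = np.vstack([np.zeros((window - 1, emb_dim)), seq_emb])
    windows = np.stack(
        [padded[t:t + window].ravel() for t in range(seq_len)])
    conv_out = np.tanh(windows @ filters)    # (seq_len, num_filters)
    return conv_out.sum(axis=0)              # sum-pool over time -> (num_filters,)

rng = np.random.RandomState(0)
filters = rng.randn(3 * 8, 32)                          # window=3, emb_dim=8, 32 filters
short_title = conv_sum_pool(rng.randn(2, 8), filters)   # a 2-word title
long_title = conv_sum_pool(rng.randn(7, 8), filters)    # a 7-word title
print(short_title.shape, long_title.shape)              # both (32,): length-independent
```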
Finally, we can define an `inference_program` that uses cosine similarity to calculate the similarity between the user characteristics and the movie features.
```python
def inference_program():
    usr_combined_features = get_usr_combined_features()
    mov_combined_features = get_mov_combined_features()

    inference = layers.cos_sim(X=usr_combined_features, Y=mov_combined_features)
    scale_infer = layers.scale(x=inference, scale=5.0)

    return scale_infer
```
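The predicted score is therefore just the cosine similarity of the two 200-dimensional feature vectors scaled by 5, so it falls in the range [-5, 5]. A quick NumPy sanity check with made-up vectors (illustration only, not part of the tutorial code):

```python
import numpy as np

usr_vec = np.random.randn(200)
mov_vec = np.random.randn(200)

cos = usr_vec.dot(mov_vec) / (np.linalg.norm(usr_vec) * np.linalg.norm(mov_vec))
predicted_score = 5.0 * cos   # mirrors layers.scale(x=inference, scale=5.0)
print(predicted_score)
```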
Then we define a `train_program` that uses the result from `inference_program` to compute the cost against the label data. We also define `optimizer_func` to specify the optimizer:
```python
def train_program():

    scale_infer = inference_program()

    label = layers.data(name='score', shape=[1], dtype='float32')
    square_cost = layers.square_error_cost(input=scale_infer, label=label)
    avg_cost = layers.mean(square_cost)

    return [avg_cost, scale_infer]


def optimizer_func():
    return fluid.optimizer.SGD(learning_rate=0.2)
```
## Model Training
### Specify training environment

Specify your training environment: you should state whether the training runs on CPU or GPU.
```python
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
```
### Datafeeder Configuration

Next we define data feeders for training and testing. The feeder reads `buf_size` records at a time and feeds them to the training/testing process.
`paddle.dataset.movielens.train` will yield records during each pass; after shuffling, batches of `BATCH_SIZE` records are generated for training.
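Conceptually, `paddle.reader.shuffle` buffers up to `buf_size` records and emits them in random order, and `paddle.batch` groups records into lists of `batch_size`. The following is only a pure-Python sketch of that reader-decorator idea, not Paddle's actual implementation:

```python
import random

def shuffle(reader, buf_size):
    # wraps a reader creator: buffers records, then yields them in random order
    def shuffled_reader():
        buf = []
        for record in reader():
            buf.append(record)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for r in buf:
                    yield r
                buf = []
        random.shuffle(buf)
        for r in buf:
            yield r
    return shuffled_reader

def batch(reader, batch_size):
    # wraps a reader creator: groups records into mini-batches
    def batched_reader():
        group = []
        for record in reader():
            group.append(record)
            if len(group) == batch_size:
                yield group
                group = []
        if group:
            yield group
    return batched_reader
```

The real `train_reader` and `test_reader` below are built the same way, just using Paddle's own implementations.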
```python
train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.movielens.train(), buf_size=8192),
    batch_size=BATCH_SIZE)

test_reader = paddle.batch(
    paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
```
### Create Trainer

Create a trainer that takes `train_program` as input and specifies the optimizer function.
```python
trainer = fluid.Trainer(
    train_func=train_program, place=place, optimizer_func=optimizer_func)
```
### Feeding Data

`feed_order` specifies the correspondence between the columns of each yielded record and the input data layers defined above. For instance, the first column of a record generated by `movielens.train` corresponds to the `user_id` feature.
```python
feed_order = [
    'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
    'movie_title', 'score'
]
```
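To check that this ordering matches the raw records, one can peek at a single training sample. This assumes the reader-creator convention used above, where `paddle.dataset.movielens.train()` returns a callable that yields one record (a tuple of the eight columns) at a time:

```python
reader = paddle.dataset.movielens.train()
sample = next(iter(reader()))          # one raw record
for name, column in zip(feed_order, sample):
    print(name, column)
```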
### Event Handler

The callback function `event_handler` will be called during training whenever a pre-defined event happens.
For example, we can check the cost with `trainer.test` whenever an `EndStepEvent` occurs:
```python
# Specify the directory path to save the parameters
params_dirname = "recommender_system.inference.model"

def event_handler(event):
    if isinstance(event, fluid.EndStepEvent):
        avg_cost_set = trainer.test(
            reader=test_reader, feed_order=feed_order)

        # get avg cost
        avg_cost = np.array(avg_cost_set).mean()
        print("avg_cost: %s" % avg_cost)
        print('BatchID {0}, Test Loss {1:0.2}'.format(event.epoch + 1,
                                                      float(avg_cost)))

        if float(avg_cost) < 4:
            trainer.save_params(params_dirname)
            trainer.stop()
```
### Training
Finally, we invoke `trainer.train` to start training with `num_epochs` and other parameters.

```python
trainer.train(
    num_epochs=1,
    event_handler=event_handler,
    reader=train_reader,
    feed_order=feed_order)
```

## Inference
### Create Inferencer

Initialize the Inferencer with `inference_program` and `params_dirname`, which is where we saved the parameters during training.

```python
inferencer = fluid.Inferencer(
    inference_program, param_path=params_dirname, place=place)
```
### Generate input data for testing

Use the `fluid.create_lod_tensor(data, lod, place)` API to generate a LoD tensor, where `data` is a list of sequences of index numbers and `lod` is the level-of-detail (LoD) information associated with `data`.
For example, `data = [[10, 2, 3], [2, 3]]` contains two sequences of indices, of length 3 and 2 respectively.
Correspondingly, `lod = [[3, 2]]` contains one level of detail, indicating that `data` consists of two sequences of length 3 and 2.
```python
user_id = fluid.create_lod_tensor([[1]], [[1]], place)
gender_id = fluid.create_lod_tensor([[1]], [[1]], place)
age_id = fluid.create_lod_tensor([[0]], [[1]], place)
job_id = fluid.create_lod_tensor([[10]], [[1]], place)
movie_id = fluid.create_lod_tensor([[783]], [[1]], place)
category_id = fluid.create_lod_tensor([[10, 8, 9]], [[3]], place)
movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]],
                                      place)
```
### Infer

Now we can run inference, providing the inputs in the same `feed_order` we used during training.
```python
results = inferencer.infer(
    {
        'user_id': user_id,
        'gender_id': gender_id,
        'age_id': age_id,
        'job_id': job_id,
        'movie_id': movie_id,
        'category_id': category_id,
        'movie_title': movie_title
    },
    return_numpy=False)

print("infer results: ", np.array(results[0]))
```
## Conclusion
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
import sys
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.nets as nets

IS_SPARSE = True
USE_GPU = False
BATCH_SIZE = 256


def get_usr_combined_features():

    USR_DICT_SIZE = paddle.dataset.movielens.max_user_id() + 1

    uid = layers.data(name='user_id', shape=[1], dtype='int64')

    usr_emb = layers.embedding(
        input=uid,
        dtype='float32',
        size=[USR_DICT_SIZE, 32],
        param_attr='user_table',
        is_sparse=IS_SPARSE)

    usr_fc = layers.fc(input=usr_emb, size=32)

    USR_GENDER_DICT_SIZE = 2

    usr_gender_id = layers.data(name='gender_id', shape=[1], dtype='int64')

    usr_gender_emb = layers.embedding(
        input=usr_gender_id,
        size=[USR_GENDER_DICT_SIZE, 16],
        param_attr='gender_table',
        is_sparse=IS_SPARSE)

    usr_gender_fc = layers.fc(input=usr_gender_emb, size=16)

    USR_AGE_DICT_SIZE = len(paddle.dataset.movielens.age_table)
    usr_age_id = layers.data(name='age_id', shape=[1], dtype="int64")

    usr_age_emb = layers.embedding(
        input=usr_age_id,
        size=[USR_AGE_DICT_SIZE, 16],
        is_sparse=IS_SPARSE,
        param_attr='age_table')

    usr_age_fc = layers.fc(input=usr_age_emb, size=16)

    USR_JOB_DICT_SIZE = paddle.dataset.movielens.max_job_id() + 1
    usr_job_id = layers.data(name='job_id', shape=[1], dtype="int64")

    usr_job_emb = layers.embedding(
        input=usr_job_id,
        size=[USR_JOB_DICT_SIZE, 16],
        param_attr='job_table',
        is_sparse=IS_SPARSE)

    usr_job_fc = layers.fc(input=usr_job_emb, size=16)

    concat_embed = layers.concat(
        input=[usr_fc, usr_gender_fc, usr_age_fc, usr_job_fc], axis=1)

    usr_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")

    return usr_combined_features


def get_mov_combined_features():

    MOV_DICT_SIZE = paddle.dataset.movielens.max_movie_id() + 1

    mov_id = layers.data(name='movie_id', shape=[1], dtype='int64')

    mov_emb = layers.embedding(
        input=mov_id,
        dtype='float32',
        size=[MOV_DICT_SIZE, 32],
        param_attr='movie_table',
        is_sparse=IS_SPARSE)

    mov_fc = layers.fc(input=mov_emb, size=32)

    CATEGORY_DICT_SIZE = len(paddle.dataset.movielens.movie_categories())

    category_id = layers.data(
        name='category_id', shape=[1], dtype='int64', lod_level=1)

    mov_categories_emb = layers.embedding(
        input=category_id, size=[CATEGORY_DICT_SIZE, 32], is_sparse=IS_SPARSE)

    mov_categories_hidden = layers.sequence_pool(
        input=mov_categories_emb, pool_type="sum")

    MOV_TITLE_DICT_SIZE = len(paddle.dataset.movielens.get_movie_title_dict())

    mov_title_id = layers.data(
        name='movie_title', shape=[1], dtype='int64', lod_level=1)

    mov_title_emb = layers.embedding(
        input=mov_title_id, size=[MOV_TITLE_DICT_SIZE, 32], is_sparse=IS_SPARSE)

    mov_title_conv = nets.sequence_conv_pool(
        input=mov_title_emb,
        num_filters=32,
        filter_size=3,
        act="tanh",
        pool_type="sum")

    concat_embed = layers.concat(
        input=[mov_fc, mov_categories_hidden, mov_title_conv], axis=1)

    mov_combined_features = layers.fc(input=concat_embed, size=200, act="tanh")

    return mov_combined_features


def inference_program():

    usr_combined_features = get_usr_combined_features()
    mov_combined_features = get_mov_combined_features()

    inference = layers.cos_sim(X=usr_combined_features, Y=mov_combined_features)
    scale_infer = layers.scale(x=inference, scale=5.0)

    return scale_infer


def train_program():

    scale_infer = inference_program()

    label = layers.data(name='score', shape=[1], dtype='float32')
    square_cost = layers.square_error_cost(input=scale_infer, label=label)
    avg_cost = layers.mean(square_cost)

    return [avg_cost, scale_infer]


def optimizer_func():
    return fluid.optimizer.SGD(learning_rate=0.2)


def train(use_cuda, train_program, params_dirname):
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    trainer = fluid.Trainer(
        train_func=train_program, place=place, optimizer_func=optimizer_func)

    feed_order = [
        'user_id', 'gender_id', 'age_id', 'job_id', 'movie_id', 'category_id',
        'movie_title', 'score'
    ]

    def event_handler(event):
        if isinstance(event, fluid.EndStepEvent):
            test_reader = paddle.batch(
                paddle.dataset.movielens.test(), batch_size=BATCH_SIZE)
            avg_cost_set = trainer.test(
                reader=test_reader, feed_order=feed_order)

            # get avg cost
            avg_cost = np.array(avg_cost_set).mean()

            print("avg_cost: %s" % avg_cost)

            if float(avg_cost) < 4:  # Change this number to adjust accuracy
                trainer.save_params(params_dirname)
                trainer.stop()
            else:
                print('BatchID {0}, Test Loss {1:0.2}'.format(event.epoch + 1,
                                                              float(avg_cost)))
                if math.isnan(float(avg_cost)):
                    sys.exit("got NaN loss, training failed.")

    train_reader = paddle.batch(
        paddle.reader.shuffle(paddle.dataset.movielens.train(), buf_size=8192),
        batch_size=BATCH_SIZE)

    trainer.train(
        num_epochs=1,
        event_handler=event_handler,
        reader=train_reader,
        feed_order=feed_order)


def infer(use_cuda, inference_program, params_dirname):
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    inferencer = fluid.Inferencer(
        inference_program, param_path=params_dirname, place=place)

    # Use the first data from paddle.dataset.movielens.test() as input.
    # Use create_lod_tensor(data, lod, place) API to generate LoD Tensor,
    # where `data` is a list of sequences of index numbers, `lod` is
    # the level of detail (lod) info associated with `data`.
    # For example, data = [[10, 2, 3], [2, 3]] means that it contains
    # two sequences of indexes, of length 3 and 2, respectively.
    # Correspondingly, lod = [[3, 2]] contains one level of detail info,
    # indicating that `data` consists of two sequences of length 3 and 2.
    user_id = fluid.create_lod_tensor([[1]], [[1]], place)
    gender_id = fluid.create_lod_tensor([[1]], [[1]], place)
    age_id = fluid.create_lod_tensor([[0]], [[1]], place)
    job_id = fluid.create_lod_tensor([[10]], [[1]], place)
    movie_id = fluid.create_lod_tensor([[783]], [[1]], place)
    category_id = fluid.create_lod_tensor([[10, 8, 9]], [[3]], place)
    movie_title = fluid.create_lod_tensor([[1069, 4140, 2923, 710, 988]], [[5]],
                                          place)

    results = inferencer.infer(
        {
            'user_id': user_id,
            'gender_id': gender_id,
            'age_id': age_id,
            'job_id': job_id,
            'movie_id': movie_id,
            'category_id': category_id,
            'movie_title': movie_title
        },
        return_numpy=False)

    print("infer results: ", np.array(results[0]))


def main(use_cuda):
    if use_cuda and not fluid.core.is_compiled_with_cuda():
        return
    params_dirname = "recommender_system.inference.model"
    train(
        use_cuda=use_cuda,
        train_program=train_program,
        params_dirname=params_dirname)
    infer(
        use_cuda=use_cuda,
        inference_program=inference_program,
        params_dirname=params_dirname)


if __name__ == '__main__':
    main(USE_GPU)