MovieLens Rating Regression
===========================

Here we demonstrate a **Cosine Similarity Regression** job on the MovieLens dataset.
This demo shows how Paddle performs (word) embedding,
handles similarity regression,
applies character-level convolutional networks to text, and how Paddle handles
multiple types of input.
Note that the model structure is not fine-tuned; it is just a demo to show how Paddle works.


You are welcome to build a better demo
with PaddlePaddle, and to let us know how to make this demo better.

Data Preparation
````````````````
Download and extract dataset
''''''''''''''''''''''''''''
We use the `MovieLens 1M dataset <ml_dataset.html>`_ here.
To download and extract the dataset, simply run the following commands.

..  code-block:: bash

    cd demo/recommendation/data 
    ./ml_data.sh

And the directory structure of :code:`demo/recommendation/data/ml-1m` is:

..  code-block:: text

    +--ml-1m
         +--- movies.dat    # movie features
         +--- ratings.dat   # ratings
         +--- users.dat     # user features
         +--- README        # dataset description

Field config file
'''''''''''''''''
The **field config file** is used to specify the dataset fields and file format,
i.e., **what** type each field in the feature files is.

The field config file of ml-1m is located at :code:`demo/recommendation/data/config.json`.
It specifies the field types and file names: 1) there are four types of fields in the user file\: id, gender, age and occupation;
2) the file name is "users.dat", and its delimiter is "::".

..  include:: ../../../demo/recommendation/data/config.json
    :code: json
    :literal:
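As a hedged illustration of the format (the fragment below is made up for this sketch, not copied from the real :code:`config.json`; only the field names and delimiter come from the description above), such a field config can be loaded with the standard :code:`json` module:

```python
import json

# Illustrative field-config fragment (not the real config.json); the
# field names follow the user-file description above: id, gender, age,
# occupation, with "::" as the delimiter.
example_text = """
{
    "user": {
        "file": {"name": "users.dat", "delimiter": "::"},
        "fields": ["id", "gender", "age", "occupation"]
    }
}
"""

config = json.loads(example_text)
delimiter = config["user"]["file"]["delimiter"]    # "::"
fields = config["user"]["fields"]
```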

Preprocess Data
```````````````
You need to install some third-party Python libraries.
It is highly recommended to use virtualenv to make a clean Python environment.

..  code-block:: bash

    pip install -r requirements.txt

The general command for preprocessing the dataset is:

..  code-block:: bash

    cd demo/recommendation
    ./preprocess.sh
    
The detailed steps are introduced as follows.

Extract Movie/User features to Python object
'''''''''''''''''''''''''''''''''''''''''''''

There are many features for movies and users in the MovieLens 1M dataset.
Each line of the rating file only provides a movie/user id to refer to each movie or user.
We process the movie/user feature files first, and pickle the feature (**Meta**) object as a file.

Meta config file
................

The **meta config file** is used to specify **how** to parse each field in the dataset.
It can be translated from the field config file, or written by hand.
Its file format can be either JSON or YAML. The parser automatically chooses the format by the file extension.
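A minimal sketch of that extension-based dispatch (the helper name :code:`load_config` is ours for illustration, not Paddle's):

```python
import json
import os

def load_config(path, text):
    """Parse config text, choosing the parser by the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".json":
        return json.loads(text)
    if ext in (".yaml", ".yml"):
        import yaml                    # requires PyYAML
        return yaml.safe_load(text)
    raise ValueError("unsupported config format: %s" % ext)

cfg = load_config("meta_config.json", '{"version": 1}')
```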

To convert the field config file to a meta config file, just run:

..  code-block:: bash

    cd demo/recommendation/data
    python config_generator.py config.json > meta_config.json

The meta config file is shown below:

..  include:: ../../../demo/recommendation/data/meta_config.json
    :code: json
    :literal:

There are two kinds of features in meta\: movie and user.

* in the movie file, movies.dat
   * we split each line by "::"
   * pos 0 is the id.
   * pos 1 feature:
      * name is title.
      * a regex is used to parse this feature.
      * it is a character-based word embedding feature.
      * it is a sequence.
   * pos 2 feature:
      * name is genres.
      * type is one-hot dense vector.
      * the dictionary is auto-generated during parsing; keys are split by '|'
* in the user file, users.dat
   * we split each line by "::"
   * pos 0 is the id.
   * pos 1 feature:
      * name is gender
      * just a simple character-based embedding.
   * pos 2 feature:
      * name is age
      * just a whole-word embedding.
      * embedding ids are sorted by word.
   * pos 3 feature:
      * name is occupation.
      * just a simple whole-word embedding.
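The parsing rules above can be sketched in plain Python (the sample lines are simplified, illustrative records, not quoted from the dataset):

```python
import re

# users.dat style record: id::gender::age::occupation
user_line = "1::F::1::10"
uid, gender, age, occupation = user_line.split("::")

# movies.dat style record: id::title (year)::genres
movie_line = "1::Toy Story (1995)::Animation|Children's|Comedy"
mid, raw_title, raw_genres = movie_line.split("::")

# the title feature uses a regex to separate the name from the year,
# then treats the name as a character-based sequence
match = re.match(r"(.*)\((\d{4})\)\s*$", raw_title)
title, year = match.group(1).strip(), match.group(2)

# the genres dictionary is built by splitting each entry on '|'
genres = raw_genres.split("|")
```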


Meta file
'''''''''

After having the meta config file, we can generate the **Meta file**, a Python pickle object which stores movie/user information.
Run the following command to generate it.

..  code-block:: bash

    python meta_generator.py ml-1m meta.bin --config=meta_config.json

And the structure of the meta file :code:`meta.bin` is:

..  code-block:: text

    +--+ movie
    |      +--+ __meta__
    |      |       +--+ raw_meta  # each feature meta config. list
    |      |       |       +
    |      |       |       |     # ID Field, we use id as key
    |      |       |       +--+ {'count': 3883, 'max': 3952, 'is_key': True, 'type': 'id', 'min': 1}
    |      |       |       |
    |      |       |       |     # Title field, the dictionary list of embedding.
    |      |       |       +--+ {'dict': [ ... ], 'type': 'embedding', 'name': 'title', 'seq': 'sequence'}
    |      |       |       |
    |      |       |       |     # Genres field, the genres dictionary
    |      |       |       +--+ {'dict': [ ... ], 'type': 'one_hot_dense', 'name': 'genres'}
    |      |       |
    |      |       +--+ feature_map [1, 2] # a list for raw_meta index for feature field.
    |      |                               # it means there are 2 features for each key.
    |      |                               #    * 0 offset of feature is raw_meta[1], Title.
    |      |                               #    * 1 offset of feature is raw_meta[2], Genres.
    |      |
    |      +--+ 1 # movie 1 features
    |      |    +
    |      |    +---+ [[...], [...]] # title ids, genres dense vector
    |      |
    |      +--+ 2
    |      |
    |      +--+ ...
    |
    +--- user
           +--+ __meta__
           |       +
           |       +--+ raw_meta
           |       |       +
           |       |       +--+ id field as user
           |       |       |
           |       |       +--+ {'dict': ['F', 'M'], 'type': 'embedding', 'name': 'gender', 'seq': 'no_sequence'}
           |       |       |
           |       |       +--+ {'dict': ['1', '18', '25', '35', '45', '50', '56'], 'type': 'embedding', 'name': 'age', 'seq': 'no_sequence'}
           |       |       |
           |       |       +--+ {'dict': [...], 'type': 'embedding', 'name': 'occupation', 'seq': 'no_sequence'}
           |       |
           |       +--+ feature_map [1, 2, 3]
           |
           +--+ 1 # user 1 features
           |
           +--+ 2
           +--+ ...


Split Training/Testing files
''''''''''''''''''''''''''''

We split :code:`ml-1m/ratings.dat` into training and testing files. The split is done per user: each user's
ratings are divided into two parts, so every user in the testing file also has some rating information in the training file.
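The per-user split can be sketched as follows (function and variable names here are illustrative; the real logic lives in :code:`split.py`):

```python
import random
from collections import defaultdict

def split_ratings(lines, test_ratio=0.1, seed=0):
    """Split rating lines per user, so every test user also appears in training."""
    by_user = defaultdict(list)
    for line in lines:
        by_user[line.split("::")[0]].append(line)
    rng = random.Random(seed)
    train, test = [], []
    for ratings in by_user.values():
        rng.shuffle(ratings)
        n_test = int(len(ratings) * test_ratio)
        test.extend(ratings[:n_test])
        train.extend(ratings[n_test:])
    return train, test

# toy data: user 1 has 20 ratings, user 2 has 10
lines = ["1::%d::5::0" % i for i in range(20)] + ["2::%d::3::0" % i for i in range(10)]
train, test = split_ratings(lines)    # 27 train, 3 test
```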

Use :code:`split.py` to separate the training and testing files.

..  code-block:: bash

    python split.py ml-1m/ratings.dat --delimiter="::" --test_ratio=0.1

Two files will then be generated\: :code:`ml-1m/ratings.dat.train` and :code:`ml-1m/ratings.dat.test`.
Move them to the workspace :code:`data` directory, shuffle the training file, and prepare the file lists for Paddle training.

..  code-block:: bash

    shuf ml-1m/ratings.dat.train > ratings.dat.train
    cp ml-1m/ratings.dat.test .
    echo "./data/ratings.dat.train" > train.list
    echo "./data/ratings.dat.test" > test.list


Neural Network Configuration
````````````````````````````

Trainer Config File
'''''''''''''''''''

The network structure is shown below.

..  image:: rec_regression_network.png
    :align: center
    :alt: rec_regression_network

The demo's neural network config file :code:`trainer_config.py` is shown below.

..  literalinclude:: ../../../demo/recommendation/trainer_config.py
    :language: python
    :lines: 15-

In :code:`trainer_config.py`, we simply map each feature type to
a feature vector. The following shows how each feature is mapped to a vector.

* :code:`id`\: Just a simple embedding, and then a fully connected layer.
* :code:`embedding`\:
    - if it is a sequence, get the embedding, apply a text convolution operation,
      and take the average pooling result.
    - if it is not a sequence, get the embedding and add a fully connected layer.
* :code:`one_hot_dense`\:
    - just two fully connected layers.

Then we combine all movie features into one movie feature through a
:code:`fc_layer` with multiple inputs, and do the same for the user features to
get one user feature. Finally, we calculate the cosine similarity of these two
features.
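The final similarity is ordinary cosine similarity; here is a pure-Python sketch of what the :code:`cos_sim` layer computes (toy vectors for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

movie_vec = [1.0, 0.0, 2.0]
user_vec = [0.5, 0.0, 1.0]           # parallel to movie_vec
sim = cosine_similarity(movie_vec, user_vec)   # -> 1.0
```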

In this network, we use several APIs from `trainer_config_helpers
<../../ui/api/trainer_config_helpers/index.html>`_. They are

*  Data Layer, `data_layer 
   <../../ui/api/trainer_config_helpers/layers.html#id1>`_
*  Fully Connected Layer, `fc_layer
   <../../ui/api/trainer_config_helpers/layers.html#fc-layer>`_
*  Embedding Layer, `embedding_layer
   <../../ui/api/trainer_config_helpers/layers.html#embedding-layer>`_
*  Context Projection Layer, `context_projection
   <../../ui/api/trainer_config_helpers/layers.html#context-projection>`_
*  Pooling Layer, `pooling_layer
   <../../ui/api/trainer_config_helpers/layers.html#pooling-layer>`_
*  Cosine Similarity Layer, `cos_sim
   <../../ui/api/trainer_config_helpers/layers.html#cos-sim>`_
*  Text Convolution Pooling Layer, `text_conv_pool
   <../../ui/api/trainer_config_helpers/networks.html
   #trainer_config_helpers.networks.text_conv_pool>`_
*  Declare Python Data Sources, `define_py_data_sources2
   <../../ui/api/trainer_config_helpers/data_sources.html>`_

Data Provider
'''''''''''''

..  literalinclude:: ../../../demo/recommendation/dataprovider.py
    :language: python
    :lines: 15-

The data provider just reads :code:`meta.bin` and the rating file, and yields each sample for training.
In this :code:`dataprovider.py`, we should set\:

* obj.slots\: The feature types and dimensions.
* use_seq\: Whether this :code:`dataprovider.py` is in sequence mode or not.
* process\: Return each sample of data to :code:`paddle`.
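Stripped of the Paddle decorators, the core of :code:`process` can be sketched like this (the meta values and field order below are illustrative toy data, not the real meta object):

```python
def process(meta, rating_lines):
    """Yield one training sample per rating line: movie slots,
    user slots, then the score."""
    for line in rating_lines:
        user_id, movie_id, score, _timestamp = line.strip().split("::")
        movie_features = meta["movie"][int(movie_id)]
        user_features = meta["user"][int(user_id)]
        yield movie_features + user_features + [float(score)]

# toy meta: one movie (title ids + genres vector), one user
meta = {
    "movie": {9: [[1, 2], [0.0, 1.0]]},
    "user": {4: [[0], [3], [7]]},
}
samples = list(process(meta, ["4::9::5::978300760"]))
```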

For details about the data provider, please refer to the `PyDataProvider2 document <../../ui/data_provider/pydataprovider2.html>`_.

Train
`````

After preparing the data, configuring the network, and writing the data provider, we can now run Paddle training.

The :code:`run.sh` script is shown as follows:

..  literalinclude:: ../../../demo/recommendation/run.sh
    :language: bash
    :lines: 16-

It just starts a Paddle training process, writes the log to `log.txt`,
then prints it on screen.

For each command line argument in :code:`run.sh`, please refer to the `command line
arguments <../../ui/index.html#command-line-argument>`_ page. A short description of these arguments is given below.

*  config\: Tell Paddle which file contains the neural network configuration.
*  save_dir\: Tell Paddle to save the model into :code:`./output`.
*  use_gpu\: Whether to use the GPU. Default is false.
*  trainer_count\: The number of compute threads on one machine.
*  test_all_data_in_one_period\: Test all data in one test period. Otherwise,
   only :code:`batch_size` samples are tested in each test period.
*  log_period\: Print a log line after training :code:`log_period` batches.
*  dot_period\: Print a :code:`.` after training :code:`dot_period` batches.
*  num_passes\: Train for at most :code:`num_passes` passes.

If the training process starts successfully, the output looks like the following:

..  code-block:: text

    I0601 08:07:22.832059 10549 TrainerInternal.cpp:157]  Batch=100 samples=160000 AvgCost=4.13494 CurrentCost=4.13494 Eval:  CurrentEval:

    I0601 08:07:50.672627 10549 TrainerInternal.cpp:157]  Batch=200 samples=320000 AvgCost=3.80957 CurrentCost=3.48421 Eval:  CurrentEval:

    I0601 08:08:18.877369 10549 TrainerInternal.cpp:157]  Batch=300 samples=480000 AvgCost=3.68145 CurrentCost=3.42519 Eval:  CurrentEval:

    I0601 08:08:46.863963 10549 TrainerInternal.cpp:157]  Batch=400 samples=640000 AvgCost=3.6007 CurrentCost=3.35847 Eval:  CurrentEval:

    I0601 08:09:15.413025 10549 TrainerInternal.cpp:157]  Batch=500 samples=800000 AvgCost=3.54811 CurrentCost=3.33773 Eval:  CurrentEval:
    I0601 08:09:36.058670 10549 TrainerInternal.cpp:181]  Pass=0 Batch=565 samples=902826 AvgCost=3.52368 Eval:
    I0601 08:09:46.215489 10549 Tester.cpp:101]  Test samples=97383 cost=3.32155 Eval:
    I0601 08:09:46.215966 10549 GradientMachine.cpp:132] Saving parameters to ./output/model/pass-00000
    I0601 08:09:46.233397 10549 ParamUtil.cpp:99] save dir ./output/model/pass-00000
    I0601 08:09:46.233438 10549 Util.cpp:209] copy trainer_config.py to ./output/model/pass-00000
    I0601 08:09:46.233541 10549 ParamUtil.cpp:147] fileName trainer_config.py

The model is saved in the :code:`output/` directory. You can press :code:`Ctrl-C` to stop training whenever you want.

Evaluate and Predict
````````````````````

After training several passes, you can evaluate them and pick the best pass. Just run

.. code-block:: bash

    ./evaluate.sh 

You will see messages like this:

.. code-block:: text

    Best pass is 00009,  error is 3.06949, which means predict get error as 0.875998002281
    evaluating from pass output/pass-00009

Then, you can predict how any user would rate a movie. Just run

..  code-block:: bash

    python prediction.py 'output/pass-00009/'

The predictor reads user input and predicts scores. It has a command-line user interface as follows:

..  code-block:: text

    Input movie_id: 9
    Input user_id: 4
    Prediction Score is 2.56
    Input movie_id: 8
    Input user_id: 2
    Prediction Score is 3.13