# Wide-and-Deep-PyTorch
PyTorch implementation of TensorFlow's Wide and Deep algorithm

This is a PyTorch implementation of TensorFlow's Wide and Deep algorithm, with
a few add-ons so that the algorithm can also take text and images. Details of
the original algorithm can be found
[here](https://www.tensorflow.org/tutorials/wide_and_deep) and the very nice
research paper can be found [here](https://arxiv.org/abs/1606.07792). A
(quick and relatively dirty) `Keras` implementation of the algorithm can be
found [here](https://github.com/jrzaurin/Wide-and-Deep-Keras).

The figure below is my attempt to illustrate the different components of the algorithm:

![Figure 1. Wide and Deeper](widedeeper.png)

## Requirements:

The algorithm was built using `python 3.6.5`, and the packages and versions I
have used are:

```
pandas==0.24.2
numpy==1.16.2
scipy==1.2.1
sklearn==0.20.3
gensim==3.5.0
cv2==3.4.2
imutils==0.5.2
pytorch==1.1.0
fastai==1.0.52
tqdm==4.31.1
```
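
All of these modules expose a `__version__` attribute, so a quick way to check
that your environment matches is something like this (just a sanity check):

```
import pandas, numpy, scipy, sklearn, gensim, cv2, torch

# Print the installed version of each core dependency
for mod in (pandas, numpy, scipy, sklearn, gensim, cv2, torch):
    print(mod.__name__, mod.__version__)
```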

## Datasets.

I have used two datasets, the well known
[adult](https://www.kaggle.com/wenruliu/adult-income-dataset/downloads/adult.csv/2)
dataset and the latest
[airbnb](http://data.insideairbnb.com/united-kingdom/england/london/2019-03-07/data/listings.csv.gz)
listings dataset.

My working directory structure looks like this (`tree -d`):

```
.
├── data
│   ├── adult
│   │   └── wide_deep_data
│   ├── airbnb
│   │   ├── host_picture
│   │   ├── property_picture
│   │   └── wide_deep_data
│   ├── fasttext.cc
│   ├── glove.6B
│   └── models
└── widedeep
    ├── models
    └── utils
```

Once you have downloaded the two files (`adult.csv` and `listings.csv`), place
them in their corresponding directories (`data/adult/` and `data/airbnb/`).
Note that there are also directories corresponding to the
[GloVe](http://nlp.stanford.edu/data/glove.6B.zip) and
[FastText](https://fasttext.cc/docs/en/english-vectors.html) word vectors. This
is because we will be dealing with text, and these vectors will be used to
build the pretrained embedding matrix.
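
For reference, building a pretrained embedding matrix from the GloVe file comes
down to something like the following (a minimal sketch; the vocabulary handling
in `prepare_data.py` may differ, and `vocab_itos` is just an assumed
index-to-word list):

```
import numpy as np

def build_embedding_matrix(vocab_itos, glove_path, embed_dim=300):
    # Parse the GloVe text file into a {word: vector} lookup
    word_vecs = {}
    with open(glove_path, encoding='utf8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word_vecs[parts[0]] = np.array(parts[1:], dtype='float32')
    # One row per vocabulary word; words without a pretrained vector
    # keep a small random initialization
    matrix = np.random.uniform(-0.25, 0.25,
                               (len(vocab_itos), embed_dim)).astype('float32')
    for idx, word in enumerate(vocab_itos):
        if word in word_vecs:
            matrix[idx] = word_vecs[word]
    return matrix
```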


The Airbnb listings dataset requires a bit of preprocessing, which is done by
simply running `python airbnb_data_preprocessing.py`. Details of what happens
within that script can be found in the companion notebook
`airbnb_data_preprocessing.ipynb`. The resulting file, `listings_processed.csv`,
is also stored in `data/airbnb/`. You can then run `python download_images.py`
to download the images of the hosts and their properties (warning: this will
take a while).

Once you have:

1. downloaded the adult and Airbnb listings datasets,
2. run `python airbnb_data_preprocessing.py`, and
3. run `python download_images.py`,

your data directory will look like this (`tree data -L 2`):

```
data
├── adult
│   └── adult.csv
├── airbnb
│   ├── host_picture
│   ├── listings.csv
│   ├── listings_processed.csv
│   ├── property_picture
│   └── wide_deep_data
└── models
```

And now we can move on to using the model.


## How to use it.

I have included 3 demos that explain how the data needs to be prepared and how
the algorithm is built (the wide and deep parts separately). In our case the
deep part is comprised of what I call Deep Dense, Deep Text and Deep Image. If
you are familiar with the algorithm and just want to give it a go, you can go
directly to demo3 or have a look at `main.py` (which can be run as `python
main.py` and has a few more details; note that the parameters there are not
necessarily optimized for the best result, they are simply the last set I
used). Using it is as simple as this:

### 1. Prepare the (airbnb) data

You could simply run:

```
python prepare_data.py --dataset airbnb --wordvectors fasttext --imtype property
```

and this will store a pickled `wd_dataset.p` object at `data/airbnb/wide_deep_data/`. Alternatively, you can do it manually:

```
import numpy as np
import pandas as pd
from pathlib import Path

# I assume you have run airbnb_data_preprocessing.py and the resulting file is at
# `data/airbnb/listings_processed.csv`
DATA_PATH = Path('data')
DF_airbnb = pd.read_csv(DATA_PATH/'airbnb/listings_processed.csv')
DF_airbnb = DF_airbnb[DF_airbnb.description.apply(lambda x: len(x.split(' '))>=10)]
out_dir = DATA_PATH/'airbnb/wide_deep_data/'

# WIDE
crossed_cols = (['property_type', 'room_type'],)
already_dummies = [c for c in DF_airbnb.columns if 'amenity' in c] + ['has_house_rules']
wide_cols = ['is_location_exact', 'property_type', 'room_type', 'host_gender'] +\
    already_dummies
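
# Aside (illustration only, not part of the original script): a crossed column
# is simply the concatenation of its constituent categorical columns, one-hot
# encoded later together with the rest of the wide features. Conceptually:
#   DF_airbnb['property_type-room_type'] = (
#       DF_airbnb['property_type'].astype(str) + '-' +
#       DF_airbnb['room_type'].astype(str))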

# DEEP_DENSE
embeddings_cols = [(c, 16) for c in DF_airbnb.columns if 'catg' in c] + [('neighbourhood_cleansed', 64)]
continuous_cols = ['latitude', 'longitude', 'security_deposit', 'extra_people']
standardize_cols = ['security_deposit', 'extra_people']

# DEEP_TEXT
text_col = 'description'
word_vectors_path = 'data/glove.6B/glove.6B.300d.txt'

# DEEP_IMAGE
img_id = 'id'
img_path = DATA_PATH/'airbnb/property_picture'

# TARGET
target = 'yield'

# PREPARE DATA
from prepare_data import prepare_data_airbnb
wd_dataset_airbnb = prepare_data_airbnb(
    df = DF_airbnb,
    img_id = img_id,
    img_path = img_path,
    text_col = text_col,
    max_vocab = 20000,
    min_freq = 2,
    maxlen = 170,
    word_vectors_path = word_vectors_path,
    embeddings_cols = embeddings_cols,
    continuous_cols = continuous_cols,
    standardize_cols = standardize_cols,
    target = target,
    wide_cols = wide_cols,
    crossed_cols = crossed_cols,
    already_dummies = already_dummies,
    out_dir = out_dir,
    scale=True,
    seed=1
    )

```
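
If you want to sanity-check what came out, the result is a plain Python
dictionary (also pickled to `out_dir`), so you can inspect it directly. A quick
example, assuming the keys used in the next section:

```
import pickle

# Load the pickled dataset written by prepare_data_airbnb
with open('data/airbnb/wide_deep_data/wd_dataset.p', 'rb') as f:
    wd_dataset_airbnb = pickle.load(f)

print(wd_dataset_airbnb.keys())
print(wd_dataset_airbnb['train']['wide'].shape)           # (n_train, n_wide_features)
print(wd_dataset_airbnb['word_embeddings_matrix'].shape)  # (vocab_size, embedding_dim)
```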

### 2. Build the model
The model is built with the `WideDeep` class.

```
# Network set up
params = dict()
params['wide'] = dict(
    wide_dim = wd_dataset_airbnb['train']['wide'].shape[1]
    )
params['deep_dense'] = dict(
    embeddings_input = wd_dataset_airbnb['cat_embeddings_input'],
    embeddings_encoding_dict = wd_dataset_airbnb['cat_embeddings_encoding_dict'],
    continuous_cols = wd_dataset_airbnb['continuous_cols'],
    deep_column_idx = wd_dataset_airbnb['deep_column_idx'],
    hidden_layers = [64,32],
    dropout = [0.5]
    )
params['deep_text'] = dict(
    vocab_size = len(wd_dataset_airbnb['vocab'].itos),
    embedding_dim = wd_dataset_airbnb['word_embeddings_matrix'].shape[1],
    hidden_dim = 64,
    n_layers = 3,
    rnn_dropout = 0.5,
    spatial_dropout = 0.1,
    padding_idx = 1,
    attention = False,
    bidirectional = False,
    embedding_matrix = wd_dataset_airbnb['word_embeddings_matrix']
    )
params['deep_img'] = dict(
    pretrained = True,
    freeze=6,
    )

# Build the model
import torch

from widedeep.models.wide_deep import WideDeepLoader, WideDeep
model = WideDeep(output_dim=1, **params)

# Compile and run with, for example, the following set up
optimizer=dict(
    wide=['Adam', 0.1],
    deep_dense=['Adam', 0.01],
    deep_text=['RMSprop', 0.01,0.1],
    deep_img= ['Adam', 0.01]
    )
lr_scheduler=dict(
    wide=['StepLR', 3, 0.1],
    deep_dense=['StepLR', 3, 0.1],
    deep_text=['MultiStepLR', [3,5,7], 0.1],
    deep_img=['MultiStepLR', [3,5,7], 0.1]
    )
model.compile(method='regression', optimizer=optimizer, lr_scheduler=lr_scheduler)
use_cuda = torch.cuda.is_available()
if use_cuda:
    model = model.cuda()
```
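
In case the compact optimizer and scheduler syntax is not obvious: each value
is a list with the `torch.optim` (or `torch.optim.lr_scheduler`) class name
followed by its positional arguments, so `['StepLR', 3, 0.1]` means a step size
of 3 and a gamma of 0.1. A rough sketch of how such a spec could map onto
PyTorch (my reading of the convention, not the library code itself):

```
import torch

def build_optimizer(spec, parameters):
    # spec is e.g. ['Adam', 0.01]: class name first, then the learning rate
    name, *args = spec
    return getattr(torch.optim, name)(parameters, *args)

def build_scheduler(spec, optimizer):
    # spec is e.g. ['StepLR', 3, 0.1] (step_size, gamma) or
    # ['MultiStepLR', [3, 5, 7], 0.1] (milestones, gamma)
    name, *args = spec
    return getattr(torch.optim.lr_scheduler, name)(optimizer, *args)
```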

### 3. Fit and predict

```
# Define the data loaders
from torchvision import transforms

mean = [0.485, 0.456, 0.406]  # ImageNet RGB means
std = [0.229, 0.224, 0.225]   # ImageNet RGB stds
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=mean, std=std)
])
train_set = WideDeepLoader(wd_dataset_airbnb['train'], transform, mode='train')
valid_set = WideDeepLoader(wd_dataset_airbnb['valid'], transform, mode='train')
test_set = WideDeepLoader(wd_dataset_airbnb['test'], transform, mode='test')
train_loader = torch.utils.data.DataLoader(dataset=train_set,
    batch_size=128,shuffle=True)
valid_loader = torch.utils.data.DataLoader(dataset=valid_set,
    batch_size=128,shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_set,
    batch_size=32,shuffle=False)

# Fit
model.fit(n_epochs=5, train_loader=train_loader, eval_loader=valid_loader)

# Predict
preds = model.predict(test_loader)

# Save the model weights
torch.save(model.state_dict(), 'data/models/widedeep.pkl')
```
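
To reuse the saved weights later, rebuild the model with the same parameters
and load the state dict back (standard PyTorch usage, with the path used
above):

```
model = WideDeep(output_dim=1, **params)
model.load_state_dict(torch.load('data/models/widedeep.pkl'))
model.eval()  # disable dropout before predicting
```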

And that's it. I have also included two python files:
`lightgbm_adult_benchmark.py` and `lightgbm_airbnb_benchmark.py`. If you run
them you will see that `lightgbm` produces significantly better results than
Wide and Deep (although I have not put much effort into optimizing the
parameters for Wide and Deep). With this I simply wanted to illustrate that on
many occasions a simpler solution is faster and better. However, there are a
number of problems where wide and deep algorithms come in very handy (e.g.
feature representation learning).