<p align="center">
  <img width="300" src="docs/figures/widedeep_logo.png">
</p>

[![Build Status](https://travis-ci.org/jrzaurin/pytorch-widedeep.svg?branch=master)](https://travis-ci.org/jrzaurin/pytorch-widedeep)
[![Documentation Status](https://readthedocs.org/projects/pytorch-widedeep/badge/?version=latest)](https://pytorch-widedeep.readthedocs.io/en/latest/?badge=latest)
[![PyPI version](https://badge.fury.io/py/pytorch-widedeep.svg)](https://badge.fury.io/py/pytorch-widedeep)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/jrzaurin/pytorch-widedeep/graphs/commit-activity)
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/jrzaurin/pytorch-widedeep/issues)
[![codecov](https://codecov.io/gh/jrzaurin/pytorch-widedeep/branch/master/graph/badge.svg)](https://codecov.io/gh/jrzaurin/pytorch-widedeep)
[![Python 3.6 3.7 3.8](https://img.shields.io/badge/python-3.6%20%7C%203.7%20%7C%203.8-blue.svg)](https://www.python.org/)

# pytorch-widedeep

A flexible package for using deep learning with tabular data, text and
images via wide and deep models.

**Documentation:** [https://pytorch-widedeep.readthedocs.io](https://pytorch-widedeep.readthedocs.io/en/latest/index.html)

**Companion posts:** [infinitoml](https://jrzaurin.github.io/infinitoml/)

### Introduction

`pytorch-widedeep` is based on Google's Wide and Deep algorithm, described
in [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792).

In general terms, `pytorch-widedeep` is a package to use deep learning with
tabular data. In particular, it is intended to facilitate the combination of
text and images with corresponding tabular data using wide and deep models.
With that in mind, there are a number of architectures that can be
implemented with just a few lines of code. The main components of those
architectures are shown in the figure below:

<p align="center">
  <img width="750" src="docs/figures/widedeep_arch.png">
</p>

The dashed boxes in the figure represent optional, overall components, and
the dashed lines/arrows indicate the corresponding connections, depending on
whether or not certain components are present. For example, the dashed blue
lines indicate that the ``deeptabular``, ``deeptext`` and ``deepimage``
components are connected directly to the output neuron or neurons (depending
on whether we are performing a binary classification or regression, or a
multi-class classification) if the optional ``deephead`` is not present.
Finally, the components within the faded-pink rectangle are concatenated.

Note that it is not possible to illustrate the number of possible
architectures and components available in ``pytorch-widedeep`` in one figure.
Therefore, for more details on possible architectures (and more) please see
the
[documentation](https://pytorch-widedeep.readthedocs.io/en/latest/index.html),
or the Examples folder and the notebooks there.

In math terms, and following the notation in the
[paper](https://arxiv.org/abs/1606.07792), the expression for the
architecture without a ``deephead`` component can be formulated as:

<p align="center">
  <img width="500" src="docs/figures/architecture_1_math.png">
</p>
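
For readers viewing the raw markdown (or if the image above does not render),
the expression is, following Eq. (3) in the paper (the image's notation may
differ slightly):

```latex
P(Y=1|\mathbf{x}) = \sigma\left(\mathbf{w}_{wide}^{T}[\mathbf{x}, \phi(\mathbf{x})]
  + \mathbf{w}_{deep}^{T} a^{(l_f)} + b\right)
```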

Where *'W'* are the weight matrices applied to the wide model and to the
final activations of the deep models, *'a'* are these final activations, and
&phi;(x) are the cross product transformations of the original features
*'x'*. In case you are wondering what *"cross product transformations"* are,
here is a quote taken directly from the paper: *"For binary features, a
cross-product transformation (e.g., “AND(gender=female, language=en)”) is 1
if and only if the constituent features (“gender=female” and “language=en”)
are all 1, and 0 otherwise".*

If, on the other hand, there is a ``deephead`` component, the previous
expression turns into:

<p align="center">
  <img width="300" src="docs/figures/architecture_2_math.png">
</p>
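
As a hedged sketch (the exact arguments may vary across versions), a custom
``deephead`` is just a ``pytorch`` model receiving the output of the deep
component(s); here, a hypothetical one-layer head:

```python
import torch.nn as nn

from pytorch_widedeep.models import WideDeep

# hypothetical custom head; the input size (32) assumes the TabMlp defined
# in the Quick start example below, whose last MLP layer has 32 units
deephead = nn.Sequential(nn.Linear(32, 16), nn.ReLU())

# `wide` and `deeptabular` as built in the Quick start example below
model = WideDeep(wide=wide, deeptabular=deeptabular, deephead=deephead)
```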

It is important to emphasize that **each individual component, `wide`,
`deeptabular`, `deeptext` and `deepimage`, can be used independently** and in
isolation. For example, one could use only `wide`, which is simply a linear
model. In fact, one of the most interesting functionalities in
``pytorch-widedeep`` is the ``deeptabular`` component. Currently,
``pytorch-widedeep`` offers 3 models for that component:

1. ``TabMlp``: this is almost identical to the [tabular
model](https://docs.fast.ai/tutorial.tabular.html) in the fantastic
[fastai](https://docs.fast.ai/) library, and consists simply of embeddings
representing the categorical features, concatenated with the continuous
features, and then passed through an MLP.

2. ``TabResnet``: this is similar to the previous model, but the embeddings
are passed through a series of ResNet blocks built with dense layers.

3. ``TabTransformer``: details on the TabTransformer can be found in
[TabTransformer: Tabular Data Modeling Using Contextual
Embeddings](https://arxiv.org/pdf/2012.06678.pdf).


For details on these 3 models and their options, please see the examples in
the Examples folder and the documentation.
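
As a hedged sketch (argument names follow the Quick start example below and
may differ between versions), switching between the 3 models only changes the
``deeptabular`` component, while the rest of the pipeline stays the same:

```python
from pytorch_widedeep.models import TabMlp

# `tab_preprocessor` and `cont_cols` as defined in the Quick start example
# below; the remaining arguments are illustrative
deeptabular = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    embed_input=tab_preprocessor.embeddings_input,
    continuous_cols=cont_cols,
    mlp_hidden_dims=[64, 32],
)
# deeptabular = TabResnet(...)       # same inputs, ResNet blocks instead of an MLP
# deeptabular = TabTransformer(...)  # same inputs, Transformer blocks
```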

Finally, while I recommend using the ``wide`` and ``deeptabular`` models in
``pytorch-widedeep``, it is very likely that users will want to use their own
models for the ``deeptext`` and ``deepimage`` components. That is perfectly
possible as long as the custom models have an attribute called
``output_dim`` with the size of the last layer of activations, so that
``WideDeep`` can be constructed. Again, examples on how to use custom
components can be found in the Examples folder. Just in case,
``pytorch-widedeep`` includes standard text (stack of LSTMs) and image
(pre-trained ResNets or stack of CNNs) models.
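
For example, here is a minimal, hypothetical custom ``deeptext`` component;
the only hard requirement, as stated above, is the ``output_dim`` attribute:

```python
import torch
from torch import nn


class MyDeepText(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # required so that WideDeep knows the size of this component's output
        self.output_dim = hidden_dim

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        embeddings = self.embed(X.long())
        _, (h, _) = self.lstm(embeddings)
        return h[-1]  # last hidden state: (batch_size, output_dim)
```

Such a module could then be passed as, e.g., ``WideDeep(wide=wide,
deeptabular=deeptabular, deeptext=MyDeepText(vocab_size=2000))``.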

### Installation

Install using pip:

```bash
pip install pytorch-widedeep
```

Or install directly from GitHub:

```bash
pip install git+https://github.com/jrzaurin/pytorch-widedeep.git
```

#### Developer Install

```bash
# Clone the repository
git clone https://github.com/jrzaurin/pytorch-widedeep
cd pytorch-widedeep

# Install in dev mode
pip install -e .
```

**Important note for Mac users**: at the time of writing (Feb-2021) the
latest `torch` release is `1.7.1`. This release has some
[issues](https://stackoverflow.com/questions/64772335/pytorch-w-parallelnative-cpp206)
when running on Mac and the data-loaders will not run in parallel. In
addition, since `python 3.8`, [the `multiprocessing` library start method
changed from `'fork'` to
`'spawn'`](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
This also affects the data-loaders (for any `torch` version) and they will not
run in parallel. Therefore, for Mac users I recommend using `python 3.6` or
`3.7` and `torch <= 1.6` (with the corresponding, consistent version of
`torchvision`, e.g. `0.7.0` for `torch 1.6`). I do not want to force this
versioning in the `setup.py` file since I expect that all these issues will
be fixed in the future. Therefore, after installing `pytorch-widedeep` via
pip or directly from GitHub, downgrade `torch` and `torchvision` manually:

```bash
pip install pytorch-widedeep
pip install torch==1.6.0 torchvision==0.7.0
```

None of these issues affect Linux users.

### Quick start

Binary classification with the [adult
dataset](https://www.kaggle.com/wenruliu/adult-income-dataset)
using `Wide` and `TabMlp` and default settings.

Building a wide (linear) and deep model with ``pytorch-widedeep``:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy

# the following 4 lines are not directly related to ``pytorch-widedeep``. I
# assume you have downloaded the dataset and placed it in a dir called
# data/adult/
df = pd.read_csv("data/adult/adult.csv.zip")
df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop("income", axis=1, inplace=True)
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.income_label)

# prepare wide, crossed, embedding and continuous columns
wide_cols = [
    "education",
    "relationship",
    "workclass",
    "occupation",
    "native-country",
    "gender",
]
cross_cols = [("education", "occupation"), ("native-country", "occupation")]
embed_cols = [
    ("education", 16),
    ("workclass", 16),
    ("occupation", 16),
    ("native-country", 32),
]
cont_cols = ["age", "hours-per-week"]
target_col = "income_label"

# target
target = df_train[target_col].values

# wide
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = wide_preprocessor.fit_transform(df_train)
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)

# deeptabular
tab_preprocessor = TabPreprocessor(embed_cols=embed_cols, continuous_cols=cont_cols)
X_tab = tab_preprocessor.fit_transform(df_train)
deeptabular = TabMlp(
    mlp_hidden_dims=[64, 32],
    column_idx=tab_preprocessor.column_idx,
    embed_input=tab_preprocessor.embeddings_input,
    continuous_cols=cont_cols,
)

# wide and deep
model = WideDeep(wide=wide, deeptabular=deeptabular)

# train the model
trainer = Trainer(model, objective="binary", metrics=[Accuracy])
trainer.fit(
    X_wide=X_wide,
    X_tab=X_tab,
    target=target,
    n_epochs=5,
    batch_size=256,
    val_split=0.1,
)

# predict
X_wide_te = wide_preprocessor.transform(df_test)
X_tab_te = tab_preprocessor.transform(df_test)
preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)

# save
trainer.save_model("model_weights/model.t")
```

Of course, one can do **much more**. See the Examples folder, the
documentation or the companion posts for a better understanding of the content
of the package and its functionalities.

### Testing

```bash
pytest tests
```

### Acknowledgments

This library borrows from a series of other libraries, so I think it is only
fair to mention them here in the README (specific mentions are also included
in the code).

The `Callbacks` and `Initializers` structure and code are inspired by the
[`torchsample`](https://github.com/ncullen93/torchsample) library, which is
itself partially inspired by [`Keras`](https://keras.io/).

The `TextProcessor` class in this library uses the
[`fastai`](https://docs.fast.ai/text.transform.html#BaseTokenizer.tokenizer)'s
`Tokenizer` and `Vocab`. The code at `utils.fastai_transforms` is a minor
adaptation of their code so that it functions within this library. In my
experience, their `Tokenizer` is best in class.

The `ImageProcessor` class in this library uses code from the fantastic [Deep
Learning for Computer
Vision](https://www.pyimagesearch.com/deep-learning-computer-vision-python-book/)
(DL4CV) book by Adrian Rosebrock.