Commit 108cebca authored by J jrzaurin

updated README and docs

Parent 23b93317
......@@ -15,14 +15,14 @@
# pytorch-widedeep
A flexible package for multimodal deep learning to combine tabular data with
text and images using Wide and Deep models in PyTorch.
**Documentation:** [https://pytorch-widedeep.readthedocs.io](https://pytorch-widedeep.readthedocs.io/en/latest/index.html)
**Companion posts and tutorials:** [infinitoml](https://jrzaurin.github.io/infinitoml/)
**Experiments and comparison with `LightGBM`**: [TabularDL vs LightGBM](https://github.com/jrzaurin/tabulardl-benchmark)
The content of this document is organized as follows:
......@@ -33,7 +33,8 @@ The content of this document is organized as follows:
### Introduction
``pytorch-widedeep`` is based on Google's [Wide and Deep Algorithm](https://arxiv.org/abs/1606.07792),
adjusted for multi-modal datasets.
In general terms, `pytorch-widedeep` is a package to use deep learning with
tabular data. In particular, it is intended to facilitate the combination of text
......@@ -89,15 +90,11 @@ into:
<img width="300" src="docs/figures/architecture_2_math.png">
</p>
It is perfectly possible to use custom models (and not necessarily those in
the library) as long as the custom models have an attribute called
``output_dim`` with the size of the last layer of activations, so that
``WideDeep`` can be constructed. Examples on how to use custom components can
be found in the Examples folder.
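
For instance, here is a minimal sketch of what a custom ``deeptext`` component
could look like. Note that `MyDeepText` and all of its parameters are
illustrative and not part of the library; the only actual requirement is the
`output_dim` attribute:

```python
import torch
from torch import nn


class MyDeepText(nn.Module):
    # A hypothetical custom 'deeptext' component: any nn.Module qualifies
    # as long as it exposes an 'output_dim' attribute with the size of the
    # last layer of activations
    def __init__(self, vocab_size: int, embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output_dim = hidden_dim  # required so WideDeep can be constructed

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.rnn(self.embed(X))
        return h[-1]  # (batch_size, output_dim)


# the custom component then plugs into WideDeep like any built-in model, e.g.
# WideDeep(wide=wide, deeptabular=tab_mlp, deeptext=MyDeepText(vocab_size=2000)),
# assuming wide and tab_mlp are defined as in the Quick start below
```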
### The ``deeptabular`` component
......@@ -110,15 +107,17 @@ its own, i.e. what one might normally refer to as Deep Learning for Tabular
Data. Currently, ``pytorch-widedeep`` offers the following different models
for that component:
0. **Wide**: a simple linear model where the nonlinearities are captured via
cross-product transformations, as explained before.
1. **TabMlp**: a simple MLP that receives embeddings representing the
categorical features, concatenated with the continuous features, which can
also be embedded.
2. **TabResnet**: similar to the previous model but the embeddings are
passed through a series of ResNet blocks built with dense layers.
3. **TabNet**: details on TabNet can be found in
[TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/abs/1908.07442)
The ``Tabformer`` family, i.e. Transformers for Tabular data:
4. **TabTransformer**: details on the TabTransformer can be found in
[TabTransformer: Tabular Data Modeling Using Contextual Embeddings](https://arxiv.org/pdf/2012.06678.pdf).
......@@ -133,12 +132,19 @@ on the FastFormer can be found in
the Perceiver can be found in
[Perceiver: General Perception with Iterative Attention](https://arxiv.org/abs/2103.03206)
And probabilistic DL models for tabular data based on
[Weight Uncertainty in Neural Networks](https://arxiv.org/abs/1505.05424):
9. **BayesianWide**: Probabilistic adaptation of the `Wide` model.
10. **BayesianTabMlp**: Probabilistic adaptation of the `TabMlp` model.
Note that while there are scientific publications for the TabTransformer,
SAINT and FT-Transformer, the TabFastFormer and TabPerceiver are our own
adaptations of those algorithms for tabular data.
For details on these models (and all the other models in the library for the
different data modalities) and their corresponding options, please see the
examples in the Examples folder and the documentation.
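
As an illustrative sketch of these options, swapping one `deeptabular` model
for another only changes the component passed to `WideDeep`. The snippet below
assumes a `tab_preprocessor` and `continuous_cols` already defined as in the
Quick start further down:

```python
from pytorch_widedeep.models import TabResnet, WideDeep

# TabResnet drops in wherever TabMlp is used in the Quick start below;
# tab_preprocessor and continuous_cols are assumed to be defined already
tab_resnet = TabResnet(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
)
model = WideDeep(deeptabular=tab_resnet)  # the wide component is optional
```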
### Installation
......@@ -165,13 +171,6 @@ cd pytorch-widedeep
pip install -e .
```
### Quick start
Binary classification with the [adult
......@@ -181,7 +180,6 @@ using `Wide` and `DeepDense` and default settings.
Building a wide (linear) and deep model with ``pytorch-widedeep``:
```python
import numpy as np
import torch
......@@ -191,16 +189,15 @@ from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy
from pytorch_widedeep.datasets import load_adult
df = load_adult(as_frame=True)
df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop("income", axis=1, inplace=True)
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.income_label)
# Define the column setup
wide_cols = [
"education",
"relationship",
......@@ -209,38 +206,43 @@ wide_cols = [
"native-country",
"gender",
]
crossed_cols = [("education", "occupation"), ("native-country", "occupation")]
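# each tuple in crossed_cols is combined by the WidePreprocessor into a
# single crossed column (e.g. education with occupation), implementing the
# cross-product transformations used by the wide component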
cat_embed_cols = [
"workclass",
"education",
"marital-status",
"occupation",
"relationship",
"race",
"gender",
"capital-gain",
"capital-loss",
"native-country",
]
continuous_cols = ["age", "hours-per-week"]
target = "income_label"
target = df_train[target].values
# prepare the data
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df_train)
tab_preprocessor = TabPreprocessor(
cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols # type: ignore[arg-type]
)
X_tab = tab_preprocessor.fit_transform(df_train)
# build the model
wide = Wide(input_dim=np.unique(X_wide).shape[0], pred_dim=1)
tab_mlp = TabMlp(
column_idx=tab_preprocessor.column_idx,
cat_embed_input=tab_preprocessor.cat_embed_input,
continuous_cols=continuous_cols,
)
model = WideDeep(wide=wide, deeptabular=tab_mlp)
# train and validate
trainer = Trainer(model, objective="binary", metrics=[Accuracy])
trainer.fit(
X_wide=X_wide,
......@@ -248,10 +250,9 @@ trainer.fit(
target=target,
n_epochs=5,
batch_size=256,
val_split=0.1,
)
# predict on test
X_wide_te = wide_preprocessor.transform(df_test)
X_tab_te = tab_preprocessor.transform(df_test)
preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)
......@@ -268,14 +269,11 @@ torch.save(model.state_dict(), "model_weights/wd_model.pt")
# From here in advance, Option 1 or 2 are the same. I assume the user has
# prepared the data and defined the new model components:
# 1. Build the model
model_new = WideDeep(wide=wide, deeptabular=tab_mlp)
model_new.load_state_dict(torch.load("model_weights/wd_model.pt"))
# 2. Instantiate the trainer
trainer_new = Trainer(model_new, objective="binary")
# 3. Either start the fit or directly predict
preds = trainer_new.predict(X_wide=X_wide, X_tab=X_tab)
......
......@@ -31,7 +31,8 @@ Documentation
Introduction
------------
``pytorch-widedeep`` is based on Google's `Wide and Deep Algorithm
<https://arxiv.org/abs/1606.07792>`_.
<https://arxiv.org/abs/1606.07792>`_, adjusted for multi-modal datasets
In general terms, ``pytorch-widedeep`` is a package to use deep learning with
tabular and multimodal data. In particular, it is intended to facilitate the
......@@ -97,9 +98,12 @@ own, i.e. what one might normally refer to as Deep Learning for Tabular Data.
Currently, ``pytorch-widedeep`` offers the following different models for
that component:
0. **Wide**: a simple linear model where the nonlinearities are captured via
cross-product transformations, as explained before.
1. **TabMlp**: a simple MLP that receives embeddings representing the
categorical features, concatenated with the continuous features, which can
also be embedded.
2. **TabResnet**: similar to the previous model but the embeddings are
passed through a series of ResNet blocks built with dense layers.
......@@ -107,7 +111,7 @@ passed through a series of ResNet blocks built with dense layers.
3. **TabNet**: details on TabNet can be found in `TabNet: Attentive
Interpretable Tabular Learning <https://arxiv.org/abs/1908.07442>`_
The ``Tabformer`` family, i.e. Transformers for Tabular data:
4. **TabTransformer**: details on the TabTransformer can be found in
`TabTransformer: Tabular Data Modeling Using Contextual Embeddings
......@@ -130,22 +134,24 @@ Models for Natural Language Understanding
the Perceiver can be found in `Perceiver: General Perception with Iterative
Attention <https://arxiv.org/abs/2103.03206>`_
And probabilistic DL models for tabular data based on
`Weight Uncertainty in Neural Networks <https://arxiv.org/abs/1505.05424>`_:
9. **BayesianWide**: Probabilistic adaptation of the ``Wide`` model.
10. **BayesianTabMlp**: Probabilistic adaptation of the ``TabMlp`` model.
Note that while there are scientific publications for the TabTransformer,
SAINT and FT-Transformer, the TabFastFormer and TabPerceiver are our own
adaptations of those algorithms for tabular data. For details on these models
and their options please see the examples in the Examples folder and the
documentation.
Finally, it is perfectly possible to use custom models as long as the
custom models have an attribute called ``output_dim`` with the size of the
last layer of activations, so that ``WideDeep`` can be constructed. Again,
examples on how to use custom components can be found in the Examples
folder.
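
For instance, a custom ``deepimage`` component could be as simple as the
sketch below. The class and the layer sizes are illustrative only; the one
actual requirement is the ``output_dim`` attribute:

.. code-block:: python

    import torch
    from torch import nn

    class MyDeepImage(nn.Module):
        # a hypothetical custom image component; any nn.Module works as
        # long as it exposes an ``output_dim`` attribute
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.output_dim = 16  # size of the last layer of activations

        def forward(self, X: torch.Tensor) -> torch.Tensor:
            # X: a batch of images of shape (batch_size, 3, H, W)
            return self.features(X).flatten(1)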
Indices and tables
==================
......
......@@ -15,8 +15,9 @@ Read and split the dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pytorch_widedeep.datasets import load_adult
df = load_adult(as_frame=True)
df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop("income", axis=1, inplace=True)
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.income_label)
......@@ -28,13 +29,12 @@ Prepare the wide and deep columns
.. code-block:: python
import torch
from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy
# Define the column setup
wide_cols = [
"education",
"relationship",
......@@ -43,41 +43,45 @@ Prepare the wide and deep columns
"native-country",
"gender",
]
crossed_cols = [("education", "occupation"), ("native-country", "occupation")]
cat_embed_cols = [
"workclass",
"education",
"marital-status",
"occupation",
"relationship",
"race",
"gender",
"capital-gain",
"capital-loss",
"native-country",
]
continuous_cols = ["age", "hours-per-week"]
target = "income_label"
target = df_train[target].values
Preprocessing and model components definition
---------------------------------------------
.. code-block:: python
# wide
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df_train)
# deeptabular
tab_preprocessor = TabPreprocessor(
cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols # type: ignore[arg-type]
)
X_tab = tab_preprocessor.fit_transform(df_train)
# build the model
wide = Wide(input_dim=np.unique(X_wide).shape[0], pred_dim=1)
tab_mlp = TabMlp(
column_idx=tab_preprocessor.column_idx,
cat_embed_input=tab_preprocessor.cat_embed_input,
continuous_cols=continuous_cols,
)
model = WideDeep(wide=wide, deeptabular=tab_mlp)
Fit and predict
......@@ -85,7 +89,7 @@ Fit and predict
.. code-block:: python
# train and validate
trainer = Trainer(model, objective="binary", metrics=[Accuracy])
trainer.fit(
X_wide=X_wide,
......@@ -93,10 +97,9 @@ Fit and predict
target=target,
n_epochs=5,
batch_size=256,
val_split=0.1,
)
# predict on test
X_wide_te = wide_preprocessor.transform(df_test)
X_tab_te = tab_preprocessor.transform(df_test)
preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)
......@@ -109,34 +112,23 @@ Save and load
# Option 1: this will also save training history and lr history if the
# LRHistory callback is used
trainer.save(path="model_weights", save_state_dict=True)
# Option 2: save as any other torch model
torch.save(model.state_dict(), "model_weights/wd_model.pt")
# From here in advance, Option 1 or 2 are the same. I assume the user has
# prepared the data and defined the new model components:
# 1. Build the model
model_new = WideDeep(wide=wide, deeptabular=tab_mlp)
model_new.load_state_dict(torch.load("model_weights/wd_model.pt"))
# 2. Instantiate the trainer
trainer_new = Trainer(model_new, objective="binary")
# 3. Either start the fit or directly predict
preds = trainer_new.predict(X_wide=X_wide, X_tab=X_tab)
Of course, one can do **much more**. See the Examples folder in the repo, this
documentation or the companion posts for a better understanding of the content
of the package and its functionalities.
......@@ -11,8 +11,8 @@
# pytorch-widedeep
A flexible package for multimodal deep learning to combine tabular data with
text and images using Wide and Deep models in PyTorch.
**Documentation:** [https://pytorch-widedeep.readthedocs.io](https://pytorch-widedeep.readthedocs.io/en/latest/index.html)
......@@ -24,7 +24,8 @@ using wide and deep models.
### Introduction
``pytorch-widedeep`` is based on Google's [Wide and Deep Algorithm](https://arxiv.org/abs/1606.07792),
adjusted for multi-modal datasets.
In general terms, `pytorch-widedeep` is a package to use deep learning with
tabular data. In particular, it is intended to facilitate the combination of text
......@@ -35,7 +36,7 @@ architectures please visit the
[repo](https://github.com/jrzaurin/pytorch-widedeep).
### Installation
Install using pip:
......@@ -60,20 +61,6 @@ cd pytorch-widedeep
pip install -e .
```
### Quick start
Binary classification with the [adult
......@@ -83,7 +70,6 @@ using `Wide` and `DeepDense` and default settings.
Building a wide (linear) and deep model with ``pytorch-widedeep``:
```python
import numpy as np
import torch
......@@ -93,16 +79,15 @@ from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy
from pytorch_widedeep.datasets import load_adult
df = load_adult(as_frame=True)
df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop("income", axis=1, inplace=True)
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.income_label)
# Define the column setup
wide_cols = [
"education",
"relationship",
......@@ -111,38 +96,43 @@ wide_cols = [
"native-country",
"gender",
]
crossed_cols = [("education", "occupation"), ("native-country", "occupation")]
cat_embed_cols = [
"workclass",
"education",
"marital-status",
"occupation",
"relationship",
"race",
"gender",
"capital-gain",
"capital-loss",
"native-country",
]
continuous_cols = ["age", "hours-per-week"]
target = "income_label"
target = df_train[target].values
# prepare the data
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df_train)
tab_preprocessor = TabPreprocessor(
cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols # type: ignore[arg-type]
)
X_tab = tab_preprocessor.fit_transform(df_train)
# build the model
wide = Wide(input_dim=np.unique(X_wide).shape[0], pred_dim=1)
tab_mlp = TabMlp(
column_idx=tab_preprocessor.column_idx,
cat_embed_input=tab_preprocessor.cat_embed_input,
continuous_cols=continuous_cols,
)
model = WideDeep(wide=wide, deeptabular=tab_mlp)
# train and validate
trainer = Trainer(model, objective="binary", metrics=[Accuracy])
trainer.fit(
X_wide=X_wide,
......@@ -150,10 +140,9 @@ trainer.fit(
target=target,
n_epochs=5,
batch_size=256,
val_split=0.1,
)
# predict on test
X_wide_te = wide_preprocessor.transform(df_test)
X_tab_te = tab_preprocessor.transform(df_test)
preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)
......@@ -170,14 +159,11 @@ torch.save(model.state_dict(), "model_weights/wd_model.pt")
# From here in advance, Option 1 or 2 are the same. I assume the user has
# prepared the data and defined the new model components:
# 1. Build the model
model_new = WideDeep(wide=wide, deeptabular=tab_mlp)
model_new.load_state_dict(torch.load("model_weights/wd_model.pt"))
# 2. Instantiate the trainer
trainer_new = Trainer(model_new, objective="binary")
# 3. Either start the fit or directly predict
preds = trainer_new.predict(X_wide=X_wide, X_tab=X_tab)
......