...
 
Commits (18)

- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/0fdbfdcecba0e61cc083faa7ad2d00d9c88ebce4 (2023-07-24T17:01:50+02:00, Javier <jrzaurin@gmail.com>): first step towards adding an example on how to reproduce a Kaggle notebook (details in the code) with this library
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/21759eefe4fe56513ffec70f500511fe3cca165b (2023-07-27T10:50:04+02:00, Javier <jrzaurin@gmail.com>): Added scripts on how to use the library for recsys in response to issue #133. Also added a simple/basic transformer model for the text component before integrating with HF. Also added the option of specifying the dimension of the feed-forward network
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/d09446e53cb5ccb041b897d16e71290f44cc96c1 (2023-07-27T12:55:45+02:00, Javier <jrzaurin@gmail.com>): Added unit tests. Need to write a notebook. Test is on GPU and ready to merge
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/5b5e680871de6f151022bafca2447de5480e8890 (2023-07-28T16:53:38+02:00, Javier <jrzaurin@gmail.com>): Added the 1st of two notebooks to illustrate the use of the library in the context of recommendation systems
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/04e9d38b500c37dc03dce13de0431d20ff39da5a (2023-07-28T22:25:31+02:00, Javier <jrzaurin@gmail.com>): Added the notebooks to illustrate how to use the library to build recommendation algos. A couple of bugs to fix and ready to merge and publish
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/8813ceeabaf4a9b37a5b8824a761c9fe12719715 (2023-07-30T21:24:14+02:00, Pavol Mulinka <mulinka.pavol@gmail.com>): added movielens dataset and tests
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/b6a1033639089c12352d64a2879d04921a03eabc (2023-07-30T21:32:29+01:00, Javier <jrzaurin@gmail.com>): Merge pull request #182 from jrzaurin/recsys_movielens_dataset (added movielens dataset and tests)
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/d30203a0a02a75b5bf2a421ff0aac35c8b1c93b5 (2023-07-31T13:11:32+01:00, Javier <jrzaurin@gmail.com>): Fixed a bug related to the padding idx and the fastai transforms. Also adjusted the scripts to show how one can use the 'load_movielens100k' function in the library
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/801c597ae6d5d9224af43d9d495d59e81062f3b1 (2023-07-31T13:59:17+01:00, Javier <jrzaurin@gmail.com>): Adjusted the notebooks to show how one can use the 'load_movielens100k' funct...
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/e75c119073312e6bee047e52ad89a900125f524b (2023-07-31T17:36:48+01:00, Javier <jrzaurin@gmail.com>): bump version to 1.3.1
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/cd1ff79ae327ac025fe65e6dafe76f305bfd2cbb (2023-07-31T18:01:57+01:00, Javier <jrzaurin@gmail.com>): Merge pull request #183 from jrzaurin/wide_deep_recsys (Wide deep recsys)
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/52ae96b5829b60ce1f119db047574c96c950a571 (2023-08-02T10:42:06+01:00, Javier <jrzaurin@gmail.com>): Merge remote-tracking branch 'origin/master' into flash_attention
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/eb02f25f07908c0285fb3a0be7cfdb5186781576 (2023-08-02T13:12:53+01:00, Javier <jrzaurin@gmail.com>): Added linear attention from the paper 'Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention'. This now needs to be turned into an encoder and offered as an optional model
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/7a80f3e7068032ee1b57ef07900bcbdd6f188ed7 (2023-08-02T15:20:57+01:00, Javier <jrzaurin@gmail.com>): implemented linear and 'standard' attention in a functional way so they are available via parameters passed to the main multi-head attention class
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/2a74d34f6855db80c32a37ddaa346b1da4721aa1 (2023-08-02T18:18:36+01:00, Javier <jrzaurin@gmail.com>): tests passed. Need to increase test coverage a bit for the tabtransformer and attention_layers, and review the docs
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/b6362d1d31e6ac71557539421c3875b4286d57dd (2023-08-03T18:20:53+01:00, Javier <jrzaurin@gmail.com>): Added some docs. Only thing left is to test the new attention mechanism on GPU
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/d3657c32d9a352940eab1d4436794b57a1432751 (2023-08-04T13:12:37+01:00, Javier <jrzaurin@gmail.com>): Added an example of flash and linear attention. Fixed some small bugs in one example. Adjusted all new functionality to GPU usage
- https://gitcode.net/greenplum/pytorch-widedeep/-/commit/67439c4220fa6fdf979b32f7c0c9ea4d9f748437 (2023-08-04T15:17:32+01:00, Javier <jrzaurin@gmail.com>): Bumped to version 1.3.2
......@@ -21,6 +21,7 @@ tmp_dir/
weights/
pretrained_weights/
model_weights/
prepared_data/
# Unit Tests/Coverage
.coverage
......
cff-version: "1.2.0"
authors:
- family-names: Zaurin
  given-names: Javier Rodriguez
  orcid: "https://orcid.org/0000-0002-1082-1107"
- family-names: Mulinka
  given-names: Pavol
  orcid: "https://orcid.org/0000-0002-9394-8794"
doi: 10.5281/zenodo.7908172
message: If you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Zaurin
    given-names: Javier Rodriguez
    orcid: "https://orcid.org/0000-0002-1082-1107"
  - family-names: Mulinka
    given-names: Pavol
    orcid: "https://orcid.org/0000-0002-9394-8794"
  date-published: 2023-06-24
  doi: 10.21105/joss.05027
  issn: 2475-9066
  issue: 86
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 5027
  title: "pytorch-widedeep: A flexible package for multimodal deep
    learning"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.05027"
  volume: 8
title: "pytorch-widedeep: A flexible package for multimodal deep
  learning"
\ No newline at end of file
......@@ -12,6 +12,7 @@
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/jrzaurin/pytorch-widedeep/graphs/commit-activity)
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/jrzaurin/pytorch-widedeep/issues)
[![Slack](https://img.shields.io/badge/slack-chat-green.svg?logo=slack)](https://join.slack.com/t/pytorch-widedeep/shared_invite/zt-soss7stf-iXpVuLeKZz8lGTnxxtHtTw)
[![DOI](https://joss.theoj.org/papers/10.21105/joss.05027/status.svg)](https://doi.org/10.21105/joss.05027)
# pytorch-widedeep
......@@ -38,6 +39,9 @@ The content of this document is organized as follows:
- [How to Contribute](#how-to-contribute)
- [Acknowledgments](#acknowledgments)
- [License](#license)
- [Cite](#cite)
- [BibTex](#bibtex)
- [APA](#apa)
### Introduction
......@@ -82,7 +86,7 @@ without a ``deephead`` component can be formulated as:
Where σ is the sigmoid function, *'W'* are the weight matrices applied to the wide model and to the final
activations of the deep models, *'a'* are these final activations,
φ(x) are the cross product transformations of the original features *'x'*, and *'b'* is the bias term.
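In equation form (a reconstruction assembled from the definitions above, where $a_{deep}$ denotes the final activations of each deep component):

$$
\text{pred} = \sigma\left(W_{wide}^{T}\,[x,\ \phi(x)] \;+\; \sum W_{deep}^{T}\, a_{deep} \;+\; b\right)
$$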
In case you are wondering what *"cross product transformations"* are: for binary features, a cross product transformation is 1 if and only if all the constituent features are 1, and 0 otherwise. A minimal sketch (with hypothetical feature names):
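```python
# Cross product transformation of binary features, e.g.
# AND(gender=female, language=en): 1 only when every feature is 1.
def cross_product(*binary_feats: int) -> int:
    return int(all(f == 1 for f in binary_feats))

gender_female, language_en = 1, 1
print(cross_product(gender_female, language_en))  # 1
print(cross_product(gender_female, 0))  # 0
```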
......@@ -126,26 +130,33 @@ passed through a series of ResNet blocks built with dense layers.
3. **TabNet**: details on TabNet can be found in
[TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/abs/1908.07442)
Two simpler attention based models that we call:
4. **ContextAttentionMLP**: MLP with an attention mechanism "on top" that is based on
[Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf)
5. **SelfAttentionMLP**: MLP with an attention mechanism that is a simplified
version of a transformer block that we refer to as "query-key self-attention".
The ``Tabformer`` family, i.e. Transformers for Tabular data:
4. **TabTransformer**: details on the TabTransformer can be found in
6. **TabTransformer**: details on the TabTransformer can be found in
[TabTransformer: Tabular Data Modeling Using Contextual Embeddings](https://arxiv.org/pdf/2012.06678.pdf).
5. **SAINT**: Details on SAINT can be found in
7. **SAINT**: Details on SAINT can be found in
[SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training](https://arxiv.org/abs/2106.01342).
6. **FT-Transformer**: details on the FT-Transformer can be found in
8. **FT-Transformer**: details on the FT-Transformer can be found in
[Revisiting Deep Learning Models for Tabular Data](https://arxiv.org/abs/2106.11959).
7. **TabFastFormer**: adaptation of the FastFormer for tabular data. Details
9. **TabFastFormer**: adaptation of the FastFormer for tabular data. Details
on the FastFormer can be found in
[FastFormers: Highly Efficient Transformer Models for Natural Language Understanding](https://arxiv.org/abs/2010.13382)
8. **TabPerceiver**: adaptation of the Perceiver for tabular data. Details on
10. **TabPerceiver**: adaptation of the Perceiver for tabular data. Details on
the Perceiver can be found in
[Perceiver: General Perception with Iterative Attention](https://arxiv.org/abs/2103.03206)
And probabilistic DL models for tabular data based on
[Weight Uncertainty in Neural Networks](https://arxiv.org/abs/1505.05424):
9. **BayesianWide**: Probabilistic adaptation of the `Wide` model.
10. **BayesianTabMlp**: Probabilistic adaptation of the `TabMlp` model
11. **BayesianWide**: Probabilistic adaptation of the `Wide` model.
12. **BayesianTabMlp**: Probabilistic adaptation of the `TabMlp` model
Note that while there are scientific publications for the TabTransformer,
SAINT and FT-Transformer, the TabFastFormer and TabPerceiver are our own
adaptations of those algorithms for tabular data.
......@@ -192,7 +203,6 @@ using `Wide` and `DeepDense` and defaults settings.
Building a wide (linear) and deep model with ``pytorch-widedeep``:
```python
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
......@@ -331,4 +341,31 @@ Vision](https://www.pyimagesearch.com/deep-learning-computer-vision-python-book/
This work is dual-licensed under Apache 2.0 and MIT (or any later version).
You can choose between one of them if you use this work.
`SPDX-License-Identifier: Apache-2.0 AND MIT`
\ No newline at end of file
`SPDX-License-Identifier: Apache-2.0 AND MIT`
### Cite
#### BibTex
```
@article{Zaurin_pytorch-widedeep_A_flexible_2023,
author = {Zaurin, Javier Rodriguez and Mulinka, Pavol},
doi = {10.21105/joss.05027},
journal = {Journal of Open Source Software},
month = jun,
number = {86},
pages = {5027},
title = {{pytorch-widedeep: A flexible package for multimodal deep learning}},
url = {https://joss.theoj.org/papers/10.21105/joss.05027},
volume = {8},
year = {2023}
}
```
#### APA
```
Zaurin, J. R., & Mulinka, P. (2023). pytorch-widedeep: A flexible package for
multimodal deep learning. Journal of Open Source Software, 8(86), 5027.
https://doi.org/10.21105/joss.05027
```
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the second of the two notebooks where we aim to illustrate how one could use this library to build recommendation algorithms using the example in this [Kaggle notebook](https://www.kaggle.com/code/matanivanov/wide-deep-learning-for-recsys-with-pytorch) as guidance. In the previous notebook we used `pytorch-widedeep` to build a model that replicated almost exactly that in the notebook. In this, shorter notebook we will show how one could use the library to explore other models, following the same problem formulation, this is: given a state of a user at a certain point in time having watched a series of movies, our goal is to predict which movie the user will watch next. \n",
"\n",
"Assuming that one has read (and run) the previous notebook, the required data will be stored in a local dir called `prepared_data`, so let's read it:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import torch\n",
"import pandas as pd\n",
"from torch import nn\n",
"\n",
"from pytorch_widedeep import Trainer\n",
"from pytorch_widedeep.utils import pad_sequences\n",
"from pytorch_widedeep.models import TabMlp, WideDeep, Transformer\n",
"from pytorch_widedeep.preprocessing import TabPreprocessor"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"save_path = Path(\"prepared_data\")\n",
"\n",
"PAD_IDX = 0\n",
"\n",
"id_cols = [\"user_id\", \"movie_id\"]\n",
"\n",
"df_train = pd.read_pickle(save_path / \"df_train.pkl\")\n",
"df_valid = pd.read_pickle(save_path / \"df_valid.pkl\")\n",
"df_test = pd.read_pickle(save_path / \"df_test.pkl\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...remember that in the previous notebook we explained that we are not going to use a validation set here (in a real-world example, or simply a more realistic example, one should always use it).\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df_test = pd.concat([df_valid, df_test], ignore_index=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also remember that, in the previous notebook we discussed that the `'maxlen'` and `'max_movie_index'` parameters should be computed using only the train set. In particular, to properly do the tokenization, one would have to use ONLY train tokens and add a token for new 'unknown'/'unseen' movies in the test set. This can also be done with this library or manually, so I will leave it to the reader to implement that tokenzation appraoch."
]
},
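{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what that could look like (hypothetical names, not library API, and not used in the rest of this notebook): build the vocabulary from the training sequences only and map movies that never appear in training to a reserved `UNK` index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: train-only tokenization with an explicit\n",
"# index for movies unseen during training\n",
"UNK_IDX = 1  # assuming PAD_IDX = 0, reserve 1 for unknown movies\n",
"\n",
"train_vocab = {m for seq in df_train.prev_movies for m in seq}\n",
"token2idx = {m: i + 2 for i, m in enumerate(sorted(train_vocab))}\n",
"\n",
"def encode(seq):\n",
"    return [token2idx.get(m, UNK_IDX) for m in seq]\n",
"\n",
"encoded_test_sequences = df_test.prev_movies.apply(encode)"
]
},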
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"maxlen = max(\n",
" df_train.prev_movies.apply(lambda x: len(x)).max(),\n",
" df_test.prev_movies.apply(lambda x: len(x)).max(),\n",
")\n",
"\n",
"max_movie_index = max(df_train.movie_id.max(), df_test.movie_id.max())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From now one things are pretty simple, moreover bearing in mind that in this example we are not going to use a wide component since, in pple, one would believe that the information in that component is also 'carried' by the movie sequences (However in the previous notebook, if one performs ablation studies, these suggest that most of the prediction power comes from the linear, wide model).\n",
"\n",
"In the example here we are going to explore one (of many) possibilities. We are simply going to encode the triplet `(user, item, rating)` and use it as a `deeptabular` component and the sequences of previously watched movies as the `deeptext` component. For the `deeptext` component we are going to use a basic encoder-only transformer model.\n",
"\n",
"Let's start with the tabular data preparation\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df_train_user_item = df_train[[\"user_id\", \"movie_id\", \"rating\"]]\n",
"train_movies_sequences = df_train.prev_movies.apply(\n",
" lambda x: [int(el) for el in x]\n",
").to_list()\n",
"y_train = df_train.target.values.astype(int)\n",
"\n",
"df_test_user_item = df_train[[\"user_id\", \"movie_id\", \"rating\"]]\n",
"test_movies_sequences = df_test.prev_movies.apply(\n",
" lambda x: [int(el) for el in x]\n",
").to_list()\n",
"y_test = df_test.target.values.astype(int)\n",
"\n",
"tab_preprocessor = tab_preprocessor = TabPreprocessor(\n",
" cat_embed_cols=[\"user_id\", \"movie_id\", \"rating\"],\n",
")\n",
"X_train_tab = tab_preprocessor.fit_transform(df_train_user_item)\n",
"X_test_tab = tab_preprocessor.transform(df_test_user_item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And not the text component, simply padding the sequences:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"X_train_text = np.array(\n",
" [\n",
" pad_sequences(\n",
" s,\n",
" maxlen=maxlen,\n",
" pad_first=False,\n",
" pad_idx=PAD_IDX,\n",
" )\n",
" for s in train_movies_sequences\n",
" ]\n",
")\n",
"X_test_text = np.array(\n",
" [\n",
" pad_sequences(\n",
" s,\n",
" maxlen=maxlen,\n",
" pad_first=False,\n",
" pad_idx=0,\n",
" )\n",
" for s in test_movies_sequences\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now define the model components and the wide and deep model."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"tab_mlp = TabMlp(\n",
" column_idx=tab_preprocessor.column_idx,\n",
" cat_embed_input=tab_preprocessor.cat_embed_input,\n",
" mlp_hidden_dims=[1024, 512, 256],\n",
" mlp_activation=\"relu\",\n",
")\n",
"\n",
"# plenty of options here, see the docs\n",
"transformer = Transformer(\n",
" vocab_size=max_movie_index + 1,\n",
" embed_dim=32,\n",
" n_heads=2,\n",
" n_blocks=2,\n",
" seq_length=maxlen,\n",
")\n",
"\n",
"wide_deep_model = WideDeep(\n",
" deeptabular=tab_mlp, deeptext=transformer, pred_dim=max_movie_index + 1\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"WideDeep(\n",
" (deeptabular): Sequential(\n",
" (0): TabMlp(\n",
" (cat_and_cont_embed): DiffSizeCatAndContEmbeddings(\n",
" (cat_embed): DiffSizeCatEmbeddings(\n",
" (embed_layers): ModuleDict(\n",
" (emb_layer_user_id): Embedding(749, 65, padding_idx=0)\n",
" (emb_layer_movie_id): Embedding(1612, 100, padding_idx=0)\n",
" (emb_layer_rating): Embedding(6, 4, padding_idx=0)\n",
" )\n",
" (embedding_dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" )\n",
" (encoder): MLP(\n",
" (mlp): Sequential(\n",
" (dense_layer_0): Sequential(\n",
" (0): Dropout(p=0.1, inplace=False)\n",
" (1): Linear(in_features=169, out_features=1024, bias=True)\n",
" (2): ReLU(inplace=True)\n",
" )\n",
" (dense_layer_1): Sequential(\n",
" (0): Dropout(p=0.1, inplace=False)\n",
" (1): Linear(in_features=1024, out_features=512, bias=True)\n",
" (2): ReLU(inplace=True)\n",
" )\n",
" (dense_layer_2): Sequential(\n",
" (0): Dropout(p=0.1, inplace=False)\n",
" (1): Linear(in_features=512, out_features=256, bias=True)\n",
" (2): ReLU(inplace=True)\n",
" )\n",
" )\n",
" )\n",
" )\n",
" (1): Linear(in_features=256, out_features=1683, bias=True)\n",
" )\n",
" (deeptext): Sequential(\n",
" (0): Transformer(\n",
" (embedding): Embedding(1683, 32)\n",
" (pos_encoder): PositionalEncoding(\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" (encoder): Sequential(\n",
" (transformer_block0): TransformerEncoder(\n",
" (attn): MultiHeadedAttention(\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (q_proj): Linear(in_features=32, out_features=32, bias=False)\n",
" (kv_proj): Linear(in_features=32, out_features=64, bias=False)\n",
" (out_proj): Linear(in_features=32, out_features=32, bias=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (w_1): Linear(in_features=32, out_features=128, bias=True)\n",
" (w_2): Linear(in_features=128, out_features=32, bias=True)\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (activation): GELU(approximate='none')\n",
" )\n",
" (attn_addnorm): AddNorm(\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)\n",
" )\n",
" (ff_addnorm): AddNorm(\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)\n",
" )\n",
" )\n",
" (transformer_block1): TransformerEncoder(\n",
" (attn): MultiHeadedAttention(\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (q_proj): Linear(in_features=32, out_features=32, bias=False)\n",
" (kv_proj): Linear(in_features=32, out_features=64, bias=False)\n",
" (out_proj): Linear(in_features=32, out_features=32, bias=False)\n",
" )\n",
" (ff): FeedForward(\n",
" (w_1): Linear(in_features=32, out_features=128, bias=True)\n",
" (w_2): Linear(in_features=128, out_features=32, bias=True)\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (activation): GELU(approximate='none')\n",
" )\n",
" (attn_addnorm): AddNorm(\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)\n",
" )\n",
" (ff_addnorm): AddNorm(\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)\n",
" )\n",
" )\n",
" )\n",
" )\n",
" (1): Linear(in_features=23552, out_features=1683, bias=True)\n",
" )\n",
")"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wide_deep_model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And as in the previous notebook, let's train (you will need a GPU for this)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"trainer = Trainer(\n",
" model=wide_deep_model,\n",
" objective=\"multiclass\",\n",
" custom_loss_function=nn.CrossEntropyLoss(ignore_index=PAD_IDX),\n",
" optimizers=torch.optim.Adam(wide_deep_model.parameters(), lr=1e-3),\n",
")\n",
"\n",
"trainer.fit(\n",
" X_train={\n",
" \"X_tab\": X_train_tab,\n",
" \"X_text\": X_train_text,\n",
" \"target\": y_train,\n",
" },\n",
" X_val={\n",
" \"X_tab\": X_test_tab,\n",
" \"X_text\": X_test_text,\n",
" \"target\": y_test,\n",
" },\n",
" n_epochs=10,\n",
" batch_size=521,\n",
" shuffle=False,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
from time import time
from sklearn.model_selection import train_test_split
from pytorch_widedeep import Trainer
from pytorch_widedeep.models import WideDeep, TabTransformer
from pytorch_widedeep.metrics import Accuracy
from pytorch_widedeep.datasets import load_adult
from pytorch_widedeep.preprocessing import TabPreprocessor
# use_cuda = torch.cuda.is_available()
df = load_adult(as_frame=True)
df.columns = [c.replace("-", "_") for c in df.columns]
df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop("income", axis=1, inplace=True)
target_colname = "income_label"
cat_embed_cols = []
for col in df.columns:
    if (df[col].dtype == "O" or df[col].nunique() < 200) and col != target_colname:
        cat_embed_cols.append(col)

train, test = train_test_split(
    df, test_size=0.1, random_state=1, stratify=df[[target_colname]]
)
with_cls_token = True
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=cat_embed_cols, with_attention=True, with_cls_token=with_cls_token
)
X_tab_train = tab_preprocessor.fit_transform(train)
X_tab_test = tab_preprocessor.transform(test)
target = train[target_colname].values
tab_transformer = TabTransformer(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    input_dim=16,
    n_heads=2,
    n_blocks=2,
)
linear_tab_transformer = TabTransformer(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    input_dim=16,
    n_heads=2,
    n_blocks=2,
    use_linear_attention=True,
)
flash_tab_transformer = TabTransformer(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    input_dim=16,
    n_heads=2,
    n_blocks=2,
    use_flash_attention=True,
)
s_model = WideDeep(deeptabular=tab_transformer)
l_model = WideDeep(deeptabular=linear_tab_transformer)
f_model = WideDeep(deeptabular=flash_tab_transformer)
for name, model in [("standard", s_model), ("linear", l_model), ("flash", f_model)]:
    trainer = Trainer(
        model,
        objective="binary",
        metrics=[Accuracy],
    )
    s = time()
    trainer.fit(
        X_tab=X_tab_train,
        target=target,
        n_epochs=1,
        batch_size=64,
        val_split=0.2,
    )
    e = time() - s
    print(f"{name} attention time: {round(e, 3)} secs")
# This script is mostly a copy/paste from the Kaggle notebook
# https://www.kaggle.com/code/matanivanov/wide-deep-learning-for-recsys-with-pytorch.
# It is a response to the issue:
# https://github.com/jrzaurin/pytorch-widedeep/issues/133.
# In this script we run the exact same model used in that Kaggle notebook
from pathlib import Path
import numpy as np
import torch
import pandas as pd
from torch import nn, cat, mean
from scipy.sparse import coo_matrix
device = "cuda" if torch.cuda.is_available() else "cpu"
save_path = Path("prepared_data")
def get_coo_indexes(lil):
    rows = []
    cols = []
    for i, el in enumerate(lil):
        if type(el) != list:
            el = [el]
        for j in el:
            rows.append(i)
            cols.append(j)
    return rows, cols


def get_sparse_features(series, shape):
    coo_indexes = get_coo_indexes(series.tolist())
    sparse_df = coo_matrix(
        (np.ones(len(coo_indexes[0])), (coo_indexes[0], coo_indexes[1])), shape=shape
    )
    return sparse_df


def sparse_to_idx(data, pad_idx=-1):
    indexes = data.nonzero()
    indexes_df = pd.DataFrame()
    indexes_df["rows"] = indexes[0]
    indexes_df["cols"] = indexes[1]
    mdf = indexes_df.groupby("rows").apply(lambda x: x["cols"].tolist())
    max_len = mdf.apply(lambda x: len(x)).max()
    return mdf.apply(lambda x: pd.Series(x + [pad_idx] * (max_len - len(x)))).values
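# Worked example (hypothetical input): for a 2 x 4 sparse matrix with ones at
# (0, 1), (0, 3) and (1, 2), sparse_to_idx returns [[1, 3], [2, pad_idx]]:
# each row becomes the list of its non-zero column indices, right-padded to
# the length of the longest row.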
def idx_to_sparse(idx, sparse_dim):
    sparse = np.zeros(sparse_dim)
    sparse[int(idx)] = 1
    return pd.Series(sparse, dtype=int)
def process_cats_as_kaggle_notebook(df):
    df["gender"] = (df["gender"] == "M").astype(int)
    df = pd.concat(
        [
            df.drop("occupation", axis=1),
            pd.get_dummies(df["occupation"]).astype(int),
        ],
        axis=1,
    )
    df.drop("other", axis=1, inplace=True)
    df.drop("zip_code", axis=1, inplace=True)
    return df
id_cols = ["user_id", "movie_id"]
df_train = pd.read_pickle(save_path / "df_train.pkl")
df_valid = pd.read_pickle(save_path / "df_valid.pkl")
df_test = pd.read_pickle(save_path / "df_test.pkl")
df_test = pd.concat([df_valid, df_test], ignore_index=True)
df_train = process_cats_as_kaggle_notebook(df_train)
df_test = process_cats_as_kaggle_notebook(df_test)
# here is another caveat, using all dataset to build 'train_movies_watched'
# when in reality one should use only the training
max_movie_index = max(df_train.movie_id.max(), df_test.movie_id.max())
X_train = df_train.drop(id_cols + ["prev_movies", "target"], axis=1)
y_train = df_train.target.values
train_movies_watched = get_sparse_features(
    df_train["prev_movies"], (len(df_train), max_movie_index + 1)
)
X_test = df_test.drop(id_cols + ["prev_movies", "target"], axis=1)
y_test = df_test.target.values
test_movies_watched = get_sparse_features(
    df_test["prev_movies"], (len(df_test), max_movie_index + 1)
)
PAD_IDX = 0
X_train_tensor = torch.Tensor(X_train.fillna(0).values).to(device)
train_movies_watched_tensor = (
    torch.sparse_coo_tensor(
        indices=train_movies_watched.nonzero(),
        values=[1] * len(train_movies_watched.nonzero()[0]),
        size=train_movies_watched.shape,
    )
    .to_dense()
    .to(device)
)
movies_train_sequences = (
    torch.Tensor(
        sparse_to_idx(train_movies_watched, pad_idx=PAD_IDX),
    )
    .long()
    .to(device)
)
target_train = torch.Tensor(y_train).long().to(device)
X_test_tensor = torch.Tensor(X_test.fillna(0).values).to(device)
test_movies_watched_tensor = (
    torch.sparse_coo_tensor(
        indices=test_movies_watched.nonzero(),
        values=[1] * len(test_movies_watched.nonzero()[0]),
        size=test_movies_watched.shape,
    )
    .to_dense()
    .to(device)
)
movies_test_sequences = (
    torch.Tensor(
        sparse_to_idx(test_movies_watched, pad_idx=PAD_IDX),
    )
    .long()
    .to(device)
)
target_test = torch.Tensor(y_test).long().to(device)
class WideAndDeep(nn.Module):
    def __init__(
        self,
        continuous_feature_shape,  # number of continuous features
        embed_size,  # size of embedding for binary features
        embed_dict_len,  # number of unique binary features
        pad_idx,  # padding index
    ):
        super(WideAndDeep, self).__init__()
        self.embed = nn.Embedding(embed_dict_len, embed_size, padding_idx=pad_idx)
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(embed_size + continuous_feature_shape, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(embed_dict_len + 256, embed_dict_len),
        )

    def forward(self, continuous, binary, binary_idx):
        # get embeddings for the sequence of indexes
        binary_embed = self.embed(binary_idx)
        binary_embed_mean = mean(binary_embed, dim=1)
        # get logits for the "deep" part: continuous features + binary embeddings
        deep_logits = self.linear_relu_stack(
            cat((continuous, binary_embed_mean), dim=1)
        )
        # get final softmax logits for the "deep" part and raw binary features
        total_logits = self.head(cat((deep_logits, binary), dim=1))
        return total_logits
model = WideAndDeep(X_train.shape[1], 16, max_movie_index + 1, PAD_IDX).to(device)
print(model)
EPOCHS = 10
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for t in range(EPOCHS):
    model.train()
    pred_train = model(
        X_train_tensor, train_movies_watched_tensor, movies_train_sequences
    )
    loss_train = loss_fn(pred_train, target_train)

    # Backpropagation
    optimizer.zero_grad()
    loss_train.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        pred_test = model(
            X_test_tensor, test_movies_watched_tensor, movies_test_sequences
        )
        loss_test = loss_fn(pred_test, target_test)

    print(f"Epoch {t}")
    print(f"Train loss: {loss_train:>7f}")
    print(f"Test loss: {loss_test:>7f}")
# This script is mostly a copy/paste from the Kaggle notebook
# https://www.kaggle.com/code/matanivanov/wide-deep-learning-for-recsys-with-pytorch.
# It is a response to the issue:
# https://github.com/jrzaurin/pytorch-widedeep/issues/133. In this script we
# simply prepare the data that will later be used for a custom Wide and Deep
# model and for Wide and Deep models created using this library
from pathlib import Path
from sklearn.model_selection import train_test_split
from pytorch_widedeep.datasets import load_movielens100k
data, users, items = load_movielens100k(as_frame=True)
# Alternatively, as specified in the docs: 'The last 19 fields are the genres' so:
# list_of_genres = items.columns.tolist()[-19:]
list_of_genres = [
    "unknown",
    "Action",
    "Adventure",
    "Animation",
    "Children's",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
]
# adding a column with the number of movies watched per users
dataset = data.sort_values(["user_id", "timestamp"]).reset_index(drop=True)
dataset["one"] = 1
dataset["num_watched"] = dataset.groupby("user_id")["one"].cumsum()
dataset.drop("one", axis=1, inplace=True)
# adding a column with the mean rating at a point in time per user
dataset["mean_rate"] = (
dataset.groupby("user_id")["rating"].cumsum() / dataset["num_watched"]
)
# In this particular exercise the problem is formulated as predicting the
# next movie that will be watched (in consequence the last interactions will be discarded)
dataset["target"] = dataset.groupby("user_id")["movie_id"].shift(-1)
# Here the author builds the sequences
dataset["prev_movies"] = dataset["movie_id"].apply(lambda x: str(x))
dataset["prev_movies"] = (
dataset.groupby("user_id")["prev_movies"]
.apply(lambda x: (x + " ").cumsum().str.strip())
.reset_index(drop=True)
)
dataset["prev_movies"] = dataset["prev_movies"].apply(lambda x: x.split())
# Adding user feats
dataset = dataset.merge(users, on="user_id", how="left")
# Adding a genre_rate as the mean of all movies rated for a given genre per
# user
dataset = dataset.merge(items[["movie_id"] + list_of_genres], on="movie_id", how="left")
for genre in list_of_genres:
    dataset[f"{genre}_rate"] = dataset[genre] * dataset["rating"]
    dataset[genre] = dataset.groupby("user_id")[genre].cumsum()
    dataset[f"{genre}_rate"] = (
        dataset.groupby("user_id")[f"{genre}_rate"].cumsum() / dataset[genre]
    )
dataset[list_of_genres] = dataset[list_of_genres].apply(
    lambda x: x / dataset["num_watched"]
)
# Again, we use the same settings as those in the Kaggle notebook,
# but 'COLD_START_THRESH' is pretty aggressive
COLD_START_THRESH = 5
filtered_data = dataset[
    (dataset["num_watched"] >= COLD_START_THRESH) & ~(dataset["target"].isna())
].sort_values("timestamp")
train_data, _test_data = train_test_split(filtered_data, test_size=0.2, shuffle=False)
valid_data, test_data = train_test_split(_test_data, test_size=0.5, shuffle=False)
cols_to_drop = [
    # "rating",
    "timestamp",
    "num_watched",
]
df_train = train_data.drop(cols_to_drop, axis=1)
df_valid = valid_data.drop(cols_to_drop, axis=1)
df_test = test_data.drop(cols_to_drop, axis=1)
save_path = Path("prepared_data")
if not save_path.exists():
    save_path.mkdir(parents=True, exist_ok=True)
df_train.to_pickle(save_path / "df_train.pkl")
df_valid.to_pickle(save_path / "df_valid.pkl")
df_test.to_pickle(save_path / "df_test.pkl")
# In this script I illustrate how one could use our library to reproduce
# almost exactly the same model used in the Kaggle notebook
from pathlib import Path
import numpy as np
import torch
import pandas as pd
from torch import nn
from scipy.sparse import coo_matrix
from pytorch_widedeep import Trainer
from pytorch_widedeep.models import TabMlp, BasicRNN, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor
device = "cuda" if torch.cuda.is_available() else "cpu"
save_path = Path("prepared_data")
PAD_IDX = 0
def get_coo_indexes(lil):
    rows = []
    cols = []
    for i, el in enumerate(lil):
        if type(el) != list:
            el = [el]
        for j in el:
            rows.append(i)
            cols.append(j)
    return rows, cols


def get_sparse_features(series, shape):
    coo_indexes = get_coo_indexes(series.tolist())
    sparse_df = coo_matrix(
        (np.ones(len(coo_indexes[0])), (coo_indexes[0], coo_indexes[1])), shape=shape
    )
    return sparse_df


def sparse_to_idx(data, pad_idx=-1):
    indexes = data.nonzero()
    indexes_df = pd.DataFrame()
    indexes_df["rows"] = indexes[0]
    indexes_df["cols"] = indexes[1]
    mdf = indexes_df.groupby("rows").apply(lambda x: x["cols"].tolist())
    max_len = mdf.apply(lambda x: len(x)).max()
    return mdf.apply(lambda x: pd.Series(x + [pad_idx] * (max_len - len(x)))).values
id_cols = ["user_id", "movie_id"]
df_train = pd.read_pickle(save_path / "df_train.pkl")
df_valid = pd.read_pickle(save_path / "df_valid.pkl")
df_test = pd.read_pickle(save_path / "df_test.pkl")
df_test = pd.concat([df_valid, df_test], ignore_index=True)
# here is another caveat, using all dataset to build 'train_movies_watched'
# when in reality one should use only the training
max_movie_index = max(df_train.movie_id.max(), df_test.movie_id.max())
X_train = df_train.drop(id_cols + ["rating", "prev_movies", "target"], axis=1)
y_train = np.array(df_train.target.values, dtype="int64")
train_movies_watched = get_sparse_features(
    df_train["prev_movies"], (len(df_train), max_movie_index + 1)
)
X_test = df_test.drop(id_cols + ["rating", "prev_movies", "target"], axis=1)
y_test = np.array(df_test.target.values, dtype="int64")
test_movies_watched = get_sparse_features(
    df_test["prev_movies"], (len(df_test), max_movie_index + 1)
)
cat_cols = ["gender", "occupation", "zip_code"]
cont_cols = [c for c in X_train if c not in cat_cols]
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=cat_cols,
    continuous_cols=cont_cols,
)
# The sparse matrices need to be turned into dense, whether at the array or
# tensor stage. This is one of the reasons why the wide component in our
# library is implemented as Embeddings. However, our implementation is still
# not suitable for the type of pre-processing that the author of the Kaggle
# notebook did to come up with what would be the wide component (a sparse
# matrix with 1s at those locations corresponding to the movies that a user
# has seen at a point in time). Therefore, we will have to code a Wide model
# (fairly simple since it is a linear layer)
X_train_wide = np.array(train_movies_watched.todense())
X_test_wide = np.array(test_movies_watched.todense())
# Here our tabular component is a bit more elaborate than the one in the
# notebook, just a bit...
X_train_tab = tab_preprocessor.fit_transform(X_train.fillna(0))
X_test_tab = tab_preprocessor.transform(X_test.fillna(0))
# The text component is the sequences of movies watched. There is an element
# of information redundancy here in my opinion. This is because the wide and
# text components implicitly carry the same information, but in different
# form. Anyway, we want to reproduce the Kaggle notebook as closely as
# possible.
X_train_text = sparse_to_idx(train_movies_watched, pad_idx=PAD_IDX)
X_test_text = sparse_to_idx(test_movies_watched, pad_idx=PAD_IDX)
class Wide(nn.Module):
    def __init__(self, input_dim: int, pred_dim: int):
        super().__init__()
        self.input_dim = input_dim
        self.pred_dim = pred_dim
        # The way I coded the library I never thought that someone would ever
        # want to code their own wide component. However, if you do, the
        # wide component must have a 'wide_linear' attribute. In other words,
        # the linear layer must be called 'wide_linear'
        self.wide_linear = nn.Linear(input_dim, pred_dim)

    def forward(self, X):
        out = self.wide_linear(X.type(torch.float32))
        return out
wide = Wide(X_train_wide.shape[1], max_movie_index + 1)
class SimpleEmbed(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, pad_idx: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.pad_idx = pad_idx
        # The sequences of movies watched are simply embedded in the Kaggle
        # notebook. No RNN, Transformer or any other model is used
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)

    def forward(self, X):
        embed = self.embed(X)
        embed_mean = torch.mean(embed, dim=1)
        return embed_mean

    @property
    def output_dim(self) -> int:
        return self.embed_dim
# In the notebook the author simply uses embeddings
simple_embed = SimpleEmbed(max_movie_index + 1, 16, 0)
# but maybe one would like to use an RNN to account for the sequence nature of
# the problem formulation
basic_rnn = BasicRNN(
    vocab_size=max_movie_index + 1,
    embed_dim=16,
    hidden_dim=32,
    n_layers=2,
    rnn_type="gru",
)
tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=tab_preprocessor.continuous_cols,
    cont_norm_layer=None,
    mlp_hidden_dims=[1024, 512, 256],
    mlp_activation="relu",
)
# The main difference between this wide and deep model and the Wide and Deep
# model in the Kaggle notebook is that in that notebook, the author
# concatenates the embeddings and the tabular features (which he refers to
# as 'continuous'), then passes this concatenation through a stack of
# linear + ReLU layers. He then concatenates this output with the binary
# features and connects this concatenation with the final linear layer. Our
# implementation follows the notation of the original paper and, instead of
# concatenating the tabular, text and wide components, we first compute their
# outputs, and then add them (see here: https://arxiv.org/pdf/1606.07792.pdf,
# their Eq 3). Note that this is effectively the same, with the caveat that
# while in one case we initialise one big weight matrix at once, in our
# implementation we initialise different matrices for different components.
# Anyway, let's give it a go.
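# In symbols (a hedged paraphrase of the comment above, not library code):
# the Kaggle notebook computes
#     logits = W [deep_out ; wide_in] + b
# while pytorch-widedeep computes
#     logits = W_wide wide_in + W_deep deep_out + b
# i.e. the same affine map, with the single weight matrix split into one
# matrix per component.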
wide_deep_model = WideDeep(
    wide=wide, deeptabular=tab_mlp, deeptext=simple_embed, pred_dim=max_movie_index + 1
)
# # To use an RNN, simply
# wide_deep_model = WideDeep(
# wide=wide, deeptabular=tab_mlp, deeptext=basic_rnn, pred_dim=max_movie_index + 1
# )
trainer = Trainer(
    model=wide_deep_model,
    objective="multiclass",
    custom_loss_function=nn.CrossEntropyLoss(ignore_index=PAD_IDX),
    optimizers=torch.optim.Adam(wide_deep_model.parameters(), lr=1e-3),
)

trainer.fit(
    X_train={
        "X_wide": X_train_wide,
        "X_tab": X_train_tab,
        "X_text": X_train_text,
        "target": y_train,
    },
    X_val={
        "X_wide": X_test_wide,
        "X_tab": X_test_tab,
        "X_text": X_test_text,
        "target": y_test,
    },
    n_epochs=10,
    batch_size=512,
    shuffle=False,
)
from pathlib import Path
import numpy as np
import torch
import pandas as pd
from torch import nn
from pytorch_widedeep import Trainer
from pytorch_widedeep.utils import pad_sequences
from pytorch_widedeep.models import TabMlp, WideDeep, Transformer
from pytorch_widedeep.preprocessing import TabPreprocessor
save_path = Path("prepared_data")
PAD_IDX = 0
id_cols = ["user_id", "movie_id"]
df_train = pd.read_pickle(save_path / "df_train.pkl")
df_valid = pd.read_pickle(save_path / "df_valid.pkl")
df_test = pd.read_pickle(save_path / "df_test.pkl")
df_test = pd.concat([df_valid, df_test], ignore_index=True)
# sequence length. Shorter sequences will be padded to this length. This is
# identical to the Kaggle notebook's implementation
maxlen = max(
    df_train.prev_movies.apply(lambda x: len(x)).max(),
    df_test.prev_movies.apply(lambda x: len(x)).max(),
)
# Here there is a caveat. In principle, we are using (as in the Kaggle
# notebook) all indexes to compute the number of tokens in the dataset. To do
# this properly, one would have to use ONLY train tokens and add a token for
# new unknown/unseen movies in the test set. This can also be done with this
# library or manually, so I will leave it to the reader to implement that
# tokenization approach
max_movie_index = max(df_train.movie_id.max(), df_test.movie_id.max())
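# A hedged sketch of that train-only tokenization (illustrative names, not
# library API; the result is not used below): build the vocabulary from the
# training sequences only and map movies unseen in training to a reserved
# UNK index.
UNK_IDX = 1  # assuming PAD_IDX = 0 is kept for padding

train_vocab = {m for seq in df_train.prev_movies for m in seq}
token2idx = {m: i + 2 for i, m in enumerate(sorted(train_vocab))}


def encode_with_unk(seq):
    return [token2idx.get(m, UNK_IDX) for m in seq]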
# From now on things are pretty simple, moreover bearing in mind that in this
# example we are not going to use a wide component since, in principle, I
# believe the information in that component is also 'carried' by the movie
# sequences (also in previous scripts one can see that most of the prediction
# power comes from the linear, wide model)
df_train_user_item = df_train[["user_id", "movie_id", "rating"]]
train_movies_sequences = df_train.prev_movies.apply(
    lambda x: [int(el) for el in x]
).to_list()
y_train = df_train.target.values.astype(int)
df_test_user_item = df_test[["user_id", "movie_id", "rating"]]
test_movies_sequences = df_test.prev_movies.apply(
    lambda x: [int(el) for el in x]
).to_list()
y_test = df_test.target.values.astype(int)
# As a tabular component we are simply going to encode the triplets
# (user, item, rating)
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=["user_id", "movie_id", "rating"],
)
X_train_tab = tab_preprocessor.fit_transform(df_train_user_item)
X_test_tab = tab_preprocessor.transform(df_test_user_item)
# And here we pad the sequences and define a transformer model for the text
# component that is, in this case, the sequences of movies watched
X_train_text = np.array(
    [
        pad_sequences(
            s,
            maxlen=maxlen,
            pad_first=False,
            pad_idx=PAD_IDX,
        )
        for s in train_movies_sequences
    ]
)
X_test_text = np.array(
    [
        pad_sequences(
            s,
            maxlen=maxlen,
            pad_first=False,
            pad_idx=PAD_IDX,
        )
        for s in test_movies_sequences
    ]
)
tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    mlp_hidden_dims=[512, 256],
    mlp_activation="relu",
)
# plenty of options here, see the docs
transformer = Transformer(
    vocab_size=max_movie_index + 1,
    embed_dim=16,
    n_heads=2,
    n_blocks=2,
    seq_length=maxlen,
)

wide_deep_model = WideDeep(
    deeptabular=tab_mlp, deeptext=transformer, pred_dim=max_movie_index + 1
)
trainer = Trainer(
    model=wide_deep_model,
    objective="multiclass",
    custom_loss_function=nn.CrossEntropyLoss(ignore_index=PAD_IDX),
    optimizers=torch.optim.Adam(wide_deep_model.parameters(), lr=1e-3),
)

trainer.fit(
    X_train={
        "X_tab": X_train_tab,
        "X_text": X_train_text,
        "target": y_train,
    },
    X_val={
        "X_tab": X_test_tab,
        "X_text": X_test_text,
        "target": y_test,
    },
    n_epochs=2,
    batch_size=32,
    shuffle=False,
)
......@@ -56,6 +56,9 @@ nav:
- 16_Self-Supervised Pre-Training pt 1: examples/16_Self_Supervised_Pretraning_pt1.ipynb
- 16_Self-Supervised Pre-Training pt 2: examples/16_Self_Supervised_Pretraning_pt2.ipynb
- 17_Using_a_huggingface_model: examples/17_Usign_a_hugging_face_model.ipynb
- 18_feature_importance_via_attention_weights: examples/18_feature_importance_via_attention_weights.ipynb
- 19_wide_and_deep_for_recsys_pt1: examples/19_wide_and_deep_for_recsys_pt1.ipynb
- 19_wide_and_deep_for_recsys_pt2: examples/19_wide_and_deep_for_recsys_pt2.ipynb
- Contributing: contributing.md
theme:
......
......@@ -739,6 +739,12 @@
......@@ -1012,6 +1018,48 @@
<li class="md-nav__item">
<a href="/examples/18_feature_importance_via_attention_weights.html" class="md-nav__link">
18_feature_importance_via_attention_weights
</a>
</li>
<li class="md-nav__item">
<a href="/examples/19_wide_and_deep_for_recsys_pt1.html" class="md-nav__link">
19_wide_and_deep_for_recsys_pt1
</a>
</li>
<li class="md-nav__item">
<a href="/examples/19_wide_and_deep_for_recsys_pt2.html" class="md-nav__link">
19_wide_and_deep_for_recsys_pt2
</a>
</li>
</ul>
</nav>
</li>
......
......@@ -743,6 +743,12 @@
......@@ -1016,6 +1022,48 @@
<li class="md-nav__item">
<a href="examples/18_feature_importance_via_attention_weights.html" class="md-nav__link">
18_feature_importance_via_attention_weights
</a>
</li>
<li class="md-nav__item">
<a href="examples/19_wide_and_deep_for_recsys_pt1.html" class="md-nav__link">
19_wide_and_deep_for_recsys_pt1
</a>
</li>
<li class="md-nav__item">
<a href="examples/19_wide_and_deep_for_recsys_pt2.html" class="md-nav__link">
19_wide_and_deep_for_recsys_pt2
</a>
</li>
</ul>
</nav>
</li>
......@@ -1095,7 +1143,7 @@
<nav class="md-footer__inner md-grid" aria-label="Footer" >
<a href="examples/17_Usign_a_hugging_face_model.html" class="md-footer__link md-footer__link--prev" aria-label="Previous: 17_Using_a_huggingface_model" rel="prev">
<a href="examples/19_wide_and_deep_for_recsys_pt2.html" class="md-footer__link md-footer__link--prev" aria-label="Previous: 19_wide_and_deep_for_recsys_pt2" rel="prev">
<div class="md-footer__button md-icon">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M20 11v2H8l5.5 5.5-1.42 1.42L4.16 12l7.92-7.92L13.5 5.5 8 11h12Z"/></svg>
</div>
......@@ -1104,7 +1152,7 @@
<span class="md-footer__direction">
Previous
</span>
17_Using_a_huggingface_model
19_wide_and_deep_for_recsys_pt2
</div>
</div>
</a>
......
......@@ -789,6 +789,12 @@
......@@ -1062,6 +1068,48 @@
<li class="md-nav__item">
<a href="examples/18_feature_importance_via_attention_weights.html" class="md-nav__link">
18_feature_importance_via_attention_weights
</a>
</li>
<li class="md-nav__item">
<a href="examples/19_wide_and_deep_for_recsys_pt1.html" class="md-nav__link">
19_wide_and_deep_for_recsys_pt1
</a>
</li>
<li class="md-nav__item">
<a href="examples/19_wide_and_deep_for_recsys_pt2.html" class="md-nav__link">
19_wide_and_deep_for_recsys_pt2
</a>
</li>
</ul>
</nav>
</li>
......@@ -1121,10 +1169,10 @@ pip<span class="w"> </span>install<span class="w"> </span>-e<span class="w"> </s
</code></pre></div>
<h2 id="dependencies">Dependencies<a class="headerlink" href="#dependencies" title="Permanent link">&para;</a></h2>
<ul>
<li>pandas</li>
<li>numpy</li>
<li>scipy</li>
<li>scikit-learn</li>
<li>pandas&gt;=1.3.5</li>
<li>numpy&gt;=1.21.6</li>
<li>scipy&gt;=1.7.3</li>
<li>scikit-learn&gt;=1.0.2</li>
<li>gensim</li>
<li>spacy</li>
<li>opencv-contrib-python</li>
......@@ -1135,6 +1183,8 @@ pip<span class="w"> </span>install<span class="w"> </span>-e<span class="w"> </s
<li>einops</li>
<li>wrapt</li>
<li>torchmetrics</li>
<li>pyarrow</li>
<li>fastparquet&gt;=0.8.1</li>
</ul>
......
......@@ -27,10 +27,10 @@ pip install -e .
## Dependencies
* pandas
* numpy
* scipy
* scikit-learn
* pandas>=1.3.5
* numpy>=1.21.6
* scipy>=1.7.3
* scikit-learn>=1.0.2
* gensim
* spacy
* opencv-contrib-python
......@@ -41,3 +41,5 @@ pip install -e .
* einops
* wrapt
* torchmetrics
* pyarrow
* fastparquet>=0.8.1
\ No newline at end of file